III. Languages for Parallel Processing

Ian Foster
Argonne National Laboratory and The University of Chicago, U.S.A.

1. Motivation for Parallel Languages
   1.1 Types of Parallel Language
   1.2 Classes of Parallel Programming Languages
2. New Programming Paradigms
   2.1 Strand: A Concurrent Logic Programming Language
   2.2 Strand Language Constructs
   2.3 Programming Example
   2.4 New Paradigms Summary
3. Language Extensions
   3.1 Compositional C++: An Overview
   3.2 Concurrency in CC++
   3.3 Locality in CC++
   3.4 Communication in CC++
   3.5 Mapping in CC++
   3.6 Language Extensions Summary
4. Data-Parallel Languages
   4.1 Basic Data-Parallel Concepts
   4.2 Fortran 90
   4.3 HPF and Data Distribution
   4.4 Concurrency
   4.5 Case Study: Gaussian Elimination
   4.6 Data Parallelism Summary
5. Library-based Approaches
   5.1 The Message Passing Interface
   5.2 Global Operations
   5.3 Asynchronous Communication
   5.4 Modularity
   5.5 Library-based Approaches Summary
6. Future Directions
References


Summary. This chapter is concerned with programming languages for parallel processing. We first review some basic principles and then use a series of four case studies to illustrate the practical application of these principles. These case studies involve representative systems based on new programming paradigms, explicitly parallel extensions to existing sequential languages, data parallelism, and message-passing libraries, respectively.

J. Błażewicz et al. (eds.), Handbook on Parallel and Distributed Processing © Springer-Verlag Berlin Heidelberg 2000


1. Motivation for Parallel Languages

Programming languages play an important role in computing, serving variously to simplify the expression of complex algorithms, to increase code portability, to facilitate code reuse, and to reduce the risk of programming errors. A program written in a high-level language is executed by an interpreter or, more commonly, translated by a compiler into appropriate low-level operations; in either case, the programmer is saved the labor of managing low-level resources directly.

Parallel computation further complicates the already challenging task of developing a correct and efficient program. At the root of the additional difficulty is the increased complexity of the underlying hardware. A parallel program must manage not only those concerns that are familiar to us from sequential programming, but also the creation and destruction of multiple concurrent threads of control, the synchronization of thread activities, the communication of information among threads, the distribution of data and threads to processors, and so forth.

Given this additional complexity, we might expect programming languages to play an even more significant role on parallel systems than on sequential computers. And indeed, experience shows that parallel languages can be very effective. For example:

- Data-parallel languages can enable certain programs with sequential semantics to execute efficiently on a parallel computer; hence, a programmer need not write explicit parallel code.
- Parallel functional languages can be used to express highly concurrent algorithms while preventing nondeterministic executions; hence, a programmer need not be concerned with race conditions.
- Message-passing libraries (a very low-level form of parallel "language") can be used to write portable programs that nevertheless achieve very high performance on distributed-memory parallel computers; hence, a programmer need not write different programs for different types of computer.

However, the combination of platform complexity, demanding performance requirements, and a specialized user base has prevented any parallel language from acquiring the broad user base of mainstream sequential languages. Instead, we see a variety of languages proposed, some focused on generality and others on high performance, but none enabling high-performance execution for a broad range of applications on a variety of parallel platforms. In the remainder of this section, we categorize these approaches to the design of parallel programming languages. In subsequent sections, we discuss the basic principles of parallel programming paradigms and then use a series of case studies to illustrate the practical application of these principles.


1.1 Types of Parallel Language

Approaches to parallel programming language design can be categorized along a number of axes; we consider six such axes here.

Language extensions vs. new language. The language may be based on extensions to an existing sequential language (e.g., C, Fortran, C++, Java) or may implement a new programming paradigm that is more naturally parallel (e.g., functional [KM77, McL90, Hud89], logic [CM81, Kow79, Sha89], or object-oriented [Agh86, Yon87, Wat88, CGH94, Gri91]). Language extensions can have the advantage of preserving existing expertise, code base, and tools, while new languages can eliminate semantic barriers to the expression of parallelism.

Task parallelism vs. data parallelism. The language may focus on task parallelism, in which the number of threads of control in a program may vary over time or in which different threads may do different things at different times (e.g., occam [TW82], Concert/C [AGG+94], Compositional C++ [CK93], Linda [CG89, CG89a]), or the language may focus on data parallelism, in which parallelism derives from the fact that different processes apply the same operations to different elements of a data structure (e.g., C* [TMC90], Data Parallel C [HQ91], NESL [Ble90], High Performance Fortran [KLS+94]). Typically, task parallelism provides greater flexibility, but data parallelism is more effective for regular problems. Both approaches can also be combined in a single system (e.g., Fx [SSO+93], HPF/MPI [FKK+98]).

Explicit parallelism vs. implicit parallelism. The language may be explicitly or implicitly parallel. In the former case, parallelism is expressed directly by the programmer; in the latter, parallelism is extracted by a compiler. Implicit parallelism is desirable if it allows the programmer to ignore low-level details of parallel algorithms, but explicit parallelism tends to provide greater control and hence can permit higher-performance implementations.

Determinism vs. nondeterminism. The language may guarantee deterministic execution or may allow the programmer to specify nondeterministic executions. Determinism is desirable because it can eliminate an important class of often hard-to-detect errors. However, some algorithms require nondeterminism for efficient execution.

Programming language vs. coordination language. The language may be intended as a complete programming language in its own right or may be designed for use as a coordination language [CG89, Col89, Kel89, LS91, FT90] that provides a parallel superstructure for existing sequential code.

Architecture-specific language vs. architecture-independent language. The language may be specialized for a particular architecture (e.g., shared-memory computers or distributed-memory computers) or even a particular computer (e.g., occam and the Transputer) or may be intended as a general-purpose programming language.


In addition, the subject of "parallel processing languages" can reasonably be extended to embrace sequential languages, if compilation techniques are able to extract parallelism from sequential programs [Wol89, AL93], and libraries that implement parallel programming paradigms. In this chapter, we consider the latter but not the former. We also explicitly exclude from discussion languages that are primarily intended for the specification of distributed systems and distributed algorithms.

1.2 Classes of Parallel Programming Languages

From our initial discussion, the reader might conclude that the number of parallel languages is large, and this is indeed the case. Fortunately, a number of review articles provide good surveys of this material. The most valuable, and most recent, is certainly that by Skillicorn and Talia [ST98], which, considering its moderate length, provides an impressively comprehensive and well-structured survey of parallel programming paradigms and languages. More dated surveys include the excellent article by Bal et al. [BST89] and narrower but still useful articles by Shapiro [Sha89], Carriero and Gelernter [CG89], and Karp and Babb [KB88].

Given the wealth of existing survey material, we focus in this chapter not on details of historical development but on the nuts and bolts of parallel programming. To this end, we consider four important classes of parallel programming language and, in each case, use a specific system to illustrate the issues that arise when that parallel language is put to practice. The four classes of system and representative languages that we consider in this chapter are as follows:

- New or nontraditional paradigms, that is, notations that are designed to simplify the expression of highly parallel algorithms. We choose as our representative a concurrent logic programming language, Strand. Such languages frequently permit concise and elegant formulations of certain classes of parallel algorithm, but their unfamiliar syntax and semantics can complicate the programming task.
- Extensions to existing languages, using Compositional C++ as our example. This discussion provides useful insights into how sequential language constructs can be extended in interesting ways to address parallelism.
- Data-parallel languages, which exploit the parallelism inherent in applying the same operation to all or most elements of large data structures. We use Fortran 90 and High Performance Fortran as our representatives. Data-parallel languages have proven extremely effective for certain classes of problem, but their suitability for more general applications remains unproven.
- Library-based approaches to parallelism, in which a parallel programming paradigm is implemented via function calls rather than by language constructs and a compiler. We use the widely used Message Passing Interface


standard as our representative. While library-based approaches do not typically benefit from the advantages of automatic checking and optimization that language-based approaches may enjoy, they provide considerable flexibility.

The material borrows heavily from the book Designing and Building Parallel Programs [Fos95], which covers both parallel algorithm design techniques and tutorial presentations of three of the parallel languages considered here.

2. New Programming Paradigms

One view of parallel programming holds that sequential programming languages, being designed explicitly to represent the manipulation of von Neumann computers, are an inappropriate tool for parallel programming. Particularly if the goal is to express (or extract) large amounts of parallelism, new programming paradigms are required in which parallelism is explicitly or implicitly a first-class citizen, rather than an addition to a sequential programming model. This view has motivated explorations of numerous innovative programming models. For example:

- In functional programming, the Church-Rosser property [CR36, McL90] (which holds that the arguments to a pure function can be evaluated in any order, or in parallel, without changing the result) is exploited to extract parallelism from programs implemented as pure functions [Hen80, Hud89, HC95, THL+98]. Considerable research has been conducted in functional programming, in the context of both "impure" languages such as LISP [Hal85] and more modern functional languages such as Haskell [Hud89, JH93, THM+96]. SISAL, a functional language incorporating arrays and iterative constructs, is particularly interesting in view of the high performance it has achieved in realistic scientific applications [CFD90].
- In logic programming, the parallelism implicit in both conjunctive and disjunctive specifications ("and" and "or" parallelism, respectively) is exploited. Researchers have investigated techniques in Prolog as well as in specialized parallel languages such as Concurrent Prolog [MTS+85, Sha87, Tay89], Guarded Horn Clauses [Ued85], Parlog [CG81, Gre87, Rin88], Strand [FT90, FKT90], and Program Composition Notation [CT91, FOT92].
- In Actor systems [Agh86], parallelism is expressed in terms of message passing between entities called actors. An individual actor is activated by the arrival of a message and may then perform some computation and/or send additional messages to other actors. Concurrent Aggregates [PLC95] extends this model to avoid the serialization inherent in the sequential processing of messages by a single actor. Concurrent object systems such as


ABCL/1 [Yon87], ABCL/R [Wat88], POOL-T [Ame88], and Concurrent Smalltalk [Yok87] extend sequential object-oriented languages, allowing multiple threads of control to exist.

2.1 Strand: A Concurrent Logic Programming Language

We use as our case study for new programming paradigms the concurrent logic programming language Strand [FT90, FKT90] (see also [Fos96] for more details and a comparison of Strand with the related Program Composition Notation [CT91, FOT92]). Strand uses a high-level specification of concurrent execution to facilitate the expression of highly parallel computations and an abstract representation of communication and synchronization to simplify the representation of complex communication structures. Its unfamiliar syntax and semantics are a significant obstacle to most programmers, but (as we shall see in the next section) it is possible to map many of its more useful features into more familiar languages.

The Strand language design integrates ideas from earlier work in parallel logic programming [CG81, Gre87, Rin88, Ued85, Sha87], functional programming [KM77, McL90], dataflow computing [Ack82], and imperative programming [Dij75, Hoa78] to provide a simple task-parallel programming language based on four related ideas:

- single-assignment variables,
- a global, shared namespace,
- parallel composition as the only method of program composition, and
- a foreign-language interface.
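As a foretaste of the constructs defined in Section 2.2, the following fragment is a minimal sketch (not from the original text; the procedure names, the foreign routine print_result, and the exact form of the "@" placement annotation discussed in Section 2.2.3 are all assumptions) that touches all four ideas:

main :- compute(X)@node(1), report(X)@node(2).  % parallel composition; "@" maps
                                                % each process to a node
compute(X) :- X := 42.         % single assignment: X is set once, never changed
report(X) :- print_result(X).  % hypothetical foreign (e.g., C) routine; suspends
                               % until X is defined, then executes atomically

Because X names the same single-assignment variable on both nodes, no explicit communication operation is needed: defining X on one node makes its value available to the reader on the other.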

Single-assignment variables provide a unified mechanism for both synchronization and communication. All variables in Strand follow the single-assignment rule [Ack82]: a variable is set at most once and subsequently cannot change. Any attempt by a program component to read a variable before it has been assigned a value will cause the program component to block. All synchronization operations are implemented by reading and writing these variables. New variables can be introduced by writing recursive procedure definitions.

Strand variables also define a global namespace. A variable can refer to any object in the computation, even another variable. The location of the variable or object being referenced does not matter. Thus, Strand does not require explicit communication operations: processes can communicate simply by reading and writing shared variables.

Traditional sequential programming languages support only the sequential composition of program components: that is, program statements are assumed to be executed one after the other, in sequence. In contrast, Strand, like certain other parallel languages (notably functional languages), supports only parallel composition. A parallel composition of program components


executes as a concurrent interleaving of the components, with execution order constrained only by availability of data, as determined by the single-assignment rule. This feature allows Strand programs to provide succinct expressions of many complex parallel algorithms.

The combination of single-assignment variables, a global namespace, and parallel composition means that the behavior of a Strand program is invariant to the placement and scheduling of computations. One consequence of this invariance is that Strand programs are compositional: a program component will function correctly in any environment [CM88, CT91]. Another consequence is that the specification of the location of a computation is orthogonal to the specification of the computation. To exploit these features, Strand provides a mapping operator that allows the programmer to control the placement of a computation on a parallel computer.

By allowing modules written in sequential languages to be integrated into Strand computations, the foreign-language interface supports the use of Strand as a coordination language. Sequential modules that are to be integrated in this way must implement pure functions. The interface supports communication between foreign modules and Strand by providing routines that allow foreign-language modules to access Strand variables passed as arguments.

2.2 Strand Language Constructs

We present here a brief summary of Strand language concepts; details are provided elsewhere [FT90]. The syntax of Strand is similar to that of the logic programming language Prolog. A program consists of a set of procedures, each defined by one or more rules. A rule has the general form

H :- G1, G2, ..., Gm | B1, B2, ..., Bn.    m, n ≥ 0,

where the rule head H is a function prototype consisting of a name and zero or more arguments; the Gi are guard tests; "|" is the commit operator; and the Bj are body processes: calls to Strand, C, or Fortran procedures, or to the assignment operator ":=". If m = 0, the "|" is omitted. Procedure arguments may be variables (distinguished by an initial capital letter), strings, numbers, or lists. A list is a record structure with a head and a tail and is denoted

[head|tail].
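As a minimal sketch of this syntax (not taken from the original text), the following procedure computes the length of a list. Neither rule has guard tests, so the commit operator is omitted:

list_len([X|Xs], N) :- list_len(Xs, N1), N is N1 + 1.  % "is" suspends until the
                                                       % recursive call defines N1
list_len([], N) :- N := 0.                             % empty list: define N as 0

Note that the result N is produced by assignment in the rule body rather than by a constant in the head, since nonvariable head terms serve only as tests on the caller's arguments.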

A procedure's rules define the actions that the process executing that procedure can perform. The head and guard of the rule define the conditions under which an action can take place; the body defines the actions that are to be performed. When a procedure executes, the conditions defined by the various heads and guards are evaluated in parallel. Nonvariable terms in a rule head must match corresponding process arguments, and guard tests must succeed. If the conditions specified by a single rule hold, this rule is selected for execution, and new processes are created for the procedures in its body. If two or more rules could apply, one is selected nondeterministically. It suffices


to ensure that conditions are mutually exclusive to avoid nondeterministic execution. If no condition holds, an error is signaled. For example, the following procedure defines a consumer process that executes either action1 or action2, depending on the value of variable X.

consumer(X) :- X == "msg"  | action1(X).
consumer(X) :- X =\= "msg" | action2(X).

In this procedure, X is a variable; "msg" is a string; and == and =\= represent equality and inequality tests, respectively. Notice that this procedure is deterministic.

2.2.1 Communication and synchronization. As noted above, all Strand variables are single-assignment variables. A shared single-assignment variable can be used both to communicate values and to synchronize actions. For example, consider concurrently executing producer and consumer processes that share a variable X:

producer(X), consumer(X)

The producer may assign a value to X (e.g., "msg") and thus communicate this value to the consumer:

producer(X) :- X := "msg".

As shown above, the consumer procedure may receive the value and use it in subsequent computation. The concept of synchronization is implicit in this model. The comparisons X == "msg" and X =\= "msg" can be made only if the variable X is defined. Hence, execution of consumer is delayed until producer executes and makes the value available.

The single-assignment variable would have limited utility in parallel programming if it could be used to exchange only a single value. In fact, a single shared variable can be used to communicate a sequence or stream of values. This is achieved as follows. A recursively defined producer process incrementally constructs a list structure containing these values. A recursively defined consumer process incrementally reads this same structure. Fig. 2.1 illustrates this technique. The stream_comm procedure creates two processes, stream_producer and stream_consumer, that use the shared variable S to exchange N values. The producer incrementally defines S to be a list comprising N occurrences of the number 10:

[10, 10, 10, ..., 10]

The statement Out := [10|Out1], which defines the variable Out to be a list with head 10 and tail Out1, can be thought of as sending a message on Out. The new variable Out1 is passed to the recursive call to stream_producer, which either uses it to communicate additional values or, if N == 0, defines it to be the empty list [].


The consumer incrementally reads the list S, adding each value received to the accumulator Sum and printing the total when it reaches the end of the list. The match operation [Val|In1] in the head of the first stream_consumer rule determines whether the variable shared with stream_producer is a list and, if so, decomposes it into a head Val and tail In1. This operation can be thought of as receiving the message Val and defining a new variable In1 that can be used to receive additional messages.

stream_comm(N) :-
    stream_producer(N, S),            % N is number of messages
    stream_consumer(0, S).            % Accumulator initially 0

stream_producer(N, Out) :-
    N > 0 |                           % More to send (N > 0):
    Out := [10|Out1],                 % Send message "10";
    N1 is N - 1,                      % Decrement count;
    stream_producer(N1, Out1).        % Recurse for more.
stream_producer(0, Out) :-            % Done sending (N == 0):
    Out := [].                        % Terminate output.

stream_consumer(Sum, [Val|In1]) :-    % Receive message:
    Sum1 is Sum + Val,                % Add to accumulator;
    stream_consumer(Sum1, In1).       % Recurse for more.
stream_consumer(Sum, []) :-           % End of list (In == []):
    print(Sum).                       % Print result.

Fig. 2.1. Producer/consumer program.
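Streams can also be combined nondeterministically. The following sketch (not from the original text) merges two input streams into a single output stream; when both inputs hold messages, either of the first two rules may apply, and one is selected nondeterministically as described in Section 2.2:

stream_merge([X|Xs], Ys, Out) :-          % Message on first input:
    Out := [X|Out1],                      % forward it,
    stream_merge(Xs, Ys, Out1).           % then recurse.
stream_merge(Xs, [Y|Ys], Out) :-          % Message on second input:
    Out := [Y|Out1],                      % forward it,
    stream_merge(Xs, Ys, Out1).           % then recurse.
stream_merge([], Ys, Out) :- Out := Ys.   % First input closed: forward the rest.
stream_merge(Xs, [], Out) :- Out := Xs.   % Second input closed: forward the rest.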

2.2.2 Foreign interface. "Foreign" procedures written in C or Fortran can be called in the body of a rule. A foreign-procedure call suspends until all arguments are defined and then executes atomically, without suspension. This approach achieves a clean separation of concerns between sequential and parallel programming, provides a familiar notation for sequential concepts, and enables existing sequential code to be reused in parallel programs.

2.2.3 Mapping. The Strand compiler does not attempt to map processes to processors automatically. Instead, the Strand language provides constructs that allow the programmer to specify mapping strategies. This approach is possible because the Strand language is designed so that mapping affects only performance, not correctness. Hence, a programmer can first develop a program and then explore alternative mapping strategies by changing annotations. The technique is illustrated below; we shall see similar ideas applied in CC++ and in High Performance Fortran.
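For instance, the following sketch (the "@node(i)" annotation syntax is an assumption here; see [FT90] for the precise mapping notation) places the producer and consumer of Fig. 2.1 on different nodes of a two-node computer. Because mapping affects only performance, stream_comm computes the same result under any placement:

stream_comm(N) :-
    stream_producer(N, S)@node(1),    % run the producer on node 1
    stream_consumer(0, S)@node(2).    % run the consumer on node 2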


2.3 Programming Example

We use a genetic sequence alignment program [BBF+89] to illustrate the use of Strand. The goal is to line up RNA sequences from separate but closely related organisms, with corresponding sections directly above one another and with indels (dashes) representing areas in which characters must be inserted or deleted to achieve this alignment. For example, Fig. 2.2 shows (a) a set of four short RNA sequences and (b) an alignment of these sequences.

augcgagucuauggcuucggccauggcggacggcucauu
augcgagucuaugguuucggccauggcggacggcucauu
augcgagucuauggacuucggccauggcggacggcucagu
augcgagucaaggggcucccuugggggcaccggcgcacggcucagu

(a)

augcgagucuauggc----uucg----gccauggcggacggcucauu
augcgagucuauggu----uucg----gccauggcggacggcucauu
augcgagucuauggac---uucg----gccauggcggacggcucagu
augcgaguc-aaggggcucccuugggggcaccggcgcacggcucagu

(b)

Fig. 2.2. RNA sequence alignment.

The algorithm uses a divide-and-conquer strategy which works basically as follows. First, "critical points" (short subsequences that are unique within a sequence) are identified for each sequence. Next, "pins" (critical points that are common to several sequences) are identified. The longest pin is used to partition the problem of aligning the sequences into three smaller alignment problems, corresponding to the subsequences to the left of the pin in the pinned sequences, the subsequences to the right of the pin, and the unpinned sequences (Fig. 2.3). These three subproblems then are solved by applying the alignment algorithm in a recursive fashion. Finally, the three subalignments are combined to produce a complete alignment.

This genetic sequence alignment algorithm presents many opportunities for parallel execution. For example, critical points can be computed in parallel for each sequence, and each alignment subproblem produced during the recursive application of the algorithm can be solved concurrently. The challenge is to formulate the algorithm in a way that does not obscure the basic algorithm structure and that allows alternative parallel execution strategies to be explored without substantial changes to the program. The Strand implementation has this property. The procedures in Fig. 2.4 implement the top level of the algorithm. The align-