SPMD Programming in Java

Susan Flynn Hummel*, Ton Ngo, Harini Srinivasan
IBM T.J. Watson Research Center

* Also with Polytechnic University and Cornell Theory Center.

Abstract

We consider the suitability of the Java concurrent constructs for writing high-performance SPMD code for parallel machines. More specifically, we investigate implementing a financial application in Java on the IBM POWERparallel System SP. Despite the fact that Java was not specifically targeted to such applications and architectures per se, we conclude that efficient implementations are feasible. Finally, we propose a library of Java methods to facilitate SPMD programming.

1 Motivation

Although Java was not specifically designed as a high-performance parallel-computing language, it does include concurrent objects (threads), and its widespread acceptance makes it an attractive candidate for writing portable, computationally intensive parallel applications. In particular, Java has become a popular choice for numerical financial codes, an example of which is arbitrage: detecting when the buying and selling of securities is temporarily profitable. These applications involve sophisticated modeling techniques such as successive over-relaxation (SOR) and Monte Carlo methods [19]. In this paper, we use an SOR code for evaluating American options (see Figure 1) [19] to explore the suitability of using Java as a high-performance parallel-computing language. This work is being conducted in the context of a research effort to implement a Java run-time system (RTS) for the IBM POWERparallel System SP machine, which is designed to scale effectively to large numbers of processors. The RTS is written in C with calls to MPI (Message Passing Interface) [18] routines.

The typical programming idiom for highly parallel machines is called data-parallel or single-program multiple-data (SPMD), where the data provides the parallel dimension. Parallelism is conceptually specified as a loop whose iterates operate on elements of a, perhaps multidimensional, array. Data dependences between parallel-loop iterates lead to a producer-consumer type of sharing, wherein one iterate writes variables that are later read by another. The communication pattern between iterates is often very regular, for example a bidirectional flow of variables between consecutive iterates (as in the code in Figure 1).

This paper explores the suitability of the Java concurrency constructs for writing SPMD programs. In particular, the paper:

- identifies the differences between the parallelism supported by Java and data parallelism;

- discusses compiler optimization of Java programs, and the limits imposed on such optimizations by the memory consistency model defined at the Java language level;

- discusses the key features of data-parallel programming, and how these features can be implemented using the Java concurrent and synchronization constructs;

- identifies a set of library methods that facilitate data-parallel programming in Java.

    public void SORloop() {
        int i, k;
        double y;
        for (int t = 0; t < Psor.Convtime; t++) {
            error = 0.0;
            for (i = 0; i < Psor.N-1; i++) {
                if (i == 0)
                    y = values[i+1]/2;
                else if (i == Psor.N-1)
                    y = values[i-1]/2;
                else
                    y = (values[i-1] + values[i+1])/2;
                y = max(obstacle[i], values[i] + Psor.omega*(y - values[i]));
                values[i] = y;
            }
        }
    }

Figure 1: Java SOR code for evaluating American options

Organization of the Paper

In Section 2, we elaborate on the differences between data parallelism and the type of parallelism that Java was designed to support, which we call control parallelism. In the subsequent sections, we discuss the concurrent programming features of Java and their suitability for writing data-parallel programs. Section 3 discusses the Java parallel programming model and its impact on performance optimizations in multi-threaded Java programs. In Section 4, we discuss data-parallel programming using the concurrent and synchronization constructs supported by Java; we illustrate data-parallel programming in Java using the Successive Over Relaxation code for evaluating American options. Finally, in Section 5, we propose a library of methods that facilitate SPMD programming in Java on the SP.

2 Control versus Data Parallelism

Although there is a continuum in the types of parallelism found in systems and numeric applications, it is instructive to compare the two extremes of what we call control and data parallelism (respectively):




- Parallelism in Systems: Parallel constructs were originally incorporated into programming languages to express the inherent concurrency in uniprocessor operating systems [8]. These constructs were later effectively used to program multiple-processor and distributed systems [20, 12, 11, 6, 14]. This type of parallelism typically involves heterogeneous, "heavy weight" processes, i.e., each process executes a different code sequence and demarcates a protection domain. The spawning of processes usually does not follow a regular pattern during program execution, and processes compete for resources, for example, a printer. It is thus important to enforce mutually exclusive access to resources; however, it is often not necessary to order accesses to resources.



- Parallelism in Numerical Applications: The goal of parallel numeric applications is performance, and the type of parallelism differs significantly from that found in parallel and distributed systems. Data parallelism [5] is homogeneous, "light weight", and regular: for example, the iterates of a forAll loop operating on an array. Parallel iterates cooperate to solve a problem, and hence it is important to ensure that events, such as the writing and reading of variables, occur in the correct order. Thus, producer-consumer synchronization must be enforced. Iterates may also need to be synchronized collectively, for example to enforce a barrier between phases of execution or to reduce the elements of an array into a single value using a binary operation (such as addition or maximum). A barrier can sometimes subsume individual producer-consumer synchronization between iterates.

As discussed in the next section, the Java parallel constructs were designed for interactive and Internet (distributed) programming, and lie somewhere between the above two extremes.

3 The Java Parallel Programming Model

Based on the classification described in the previous section, the parallel programming model provided in Java fits best under control parallelism. This section presents a high-level view of the support in Java for parallelism; for further details, the reader is referred to the Java language specification [13].

Java has built-in classes and methods for writing multithreaded programs. Threads are created and managed explicitly as objects; they perform the conventional thread operations such as start, run, suspend, resume, stop, etc. The ThreadGroup class allows a user to name and manage a collection of threads. The grouping can be nested but cannot overlap; in other words, nested thread groups form a hierarchical tree. A ThreadGroup can be used to collectively suspend, resume, stop and destroy all thread members; however, this class does not provide a method to create or start all thread members simultaneously.

Java adopts a shared-memory model, allowing an object to be directly accessible by any thread. The shared-memory model is easily mapped onto uniprocessor and shared-memory multiprocessor machines. It can also be implemented on disjoint-memory multicomputers, such as the SP, using message passing to implement a virtual shared memory. For distributed machines, Java's Remote Method Invocation provides an RPC-like interface that is appropriate for coarse-grained parallelism.

Java threads synchronize through statements or methods that have a synchronized attribute. These statements and methods function as monitors: only one thread is allowed to execute the code in the statements or methods at a time. Locks are not provided explicitly in the language, but are employed implicitly in the implementation of monitors (they are embedded in the bytecode instructions monitorenter and monitorexit [17]). Conceptually, each object has an associated lock and wait queue; consequently, a synchronized block of code is associated with the objects it accesses. When two threads compete for entry to synchronized code by attempting to lock the same object, the thread that successfully acquires the lock continues execution, while the other thread is placed on the wait queue of the object. When executing synchronized code, a thread can explicitly transfer control by using the notify, notifyAll and wait methods. notify moves one randomly chosen thread from the wait queue of the associated object to the run queue, while notifyAll moves all threads from the wait queue to the run queue. If the thread owning the monitor executes a wait, it relinquishes the lock and places itself on the wait queue, allowing another thread to enter the monitor.

The Java shared-memory model is not the sequentially consistent [15] model that is commonly used for writing multithreaded programs. Instead, Java employs a form of weak consistency [4] to allow for shared-data optimization opportunities. In the sequential consistency model, any update to a shared variable must be visible to all other threads. Java allows a weaker memory model wherein an update to a shared variable is only guaranteed to be visible to other threads when they execute synchronized code. For instance, a value assigned by a thread to a variable outside a synchronized statement may not be immediately visible to other threads, because the implementation may elect to cache the variable until the next synchronization point (with some restrictions). The following subsection discusses the impact of the Java shared-variable rules on compiler optimizations.
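Before turning to those optimization issues, the following sketch illustrates how the monitor primitives just described support the producer-consumer synchronization that Section 4 relies on. The sketch is ours, not taken from the paper's application code, and the class and method names (Slot, put, get) are assumptions made for illustration:

    class Slot {
        private double value;
        private boolean full = false;

        // Producer: wait until the slot is empty, deposit a value, and wake
        // any thread waiting on this object's monitor.
        public synchronized void put(double v) throws InterruptedException {
            while (full) {
                wait();          // releases the lock and joins the wait queue
            }
            value = v;
            full = true;
            notifyAll();         // moves waiting threads back to the run queue
        }

        // Consumer: wait until the slot is full, then remove the value.
        public synchronized double get() throws InterruptedException {
            while (!full) {
                wait();
            }
            full = false;
            notifyAll();
            return value;
        }
    }

The wait calls sit inside loops so that a woken thread rechecks its condition before proceeding; notify alone would also work here, but notifyAll is the safer default when more than two threads may share the slot.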

3.1 Optimizing Memory Accesses in Multithreaded Programs

The goal of SPMD programming is high performance, and exploiting data locality is arguably the most important optimization on parallel machines. The ability to perform data-locality optimizations, such as caching and prefetching, depends on the underlying language memory model. Communication optimizations [7, 2] require an analysis of memory access patterns in an SPMD program, and therefore the ability to perform communication optimizations also depends on the language memory model.

The Java memory model [13] is illustrated in Figure 2. Every Java thread has, in addition to a local stack, a working memory in which it can temporarily keep copies of the variables that it accesses. The main memory contains the master copy of all variables. In the terminology of Java, the interactions between the stack and the working memory are assign, use, lock, and unlock; the interactions between the working memory and the main memory are store and load, and those within the main memory are write and read. The Java specification [13] states that an "... implementation is free to update main memory in an order that may be surprising." The intention is to allow a Java implementation to maintain the value of a variable in the working memory of a thread; a user must use the synchronized keyword to ensure that data are transferred between threads, i.e., that the main shared memory is made consistent. Alternatively, a variable can be declared volatile so that it will not be cached by any thread. Any update to a volatile variable by a thread is immediately visible to all other threads; however, this disables some code optimizations.¹

¹ Note that the term volatile adopted by Java designates a variable as noncacheable, whereas in parallel programming this term has traditionally meant that a variable is cacheable but may be updated by other processors [1].
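As a small illustration of this rule (the sketch and its names are ours, not from the paper), a flag that one thread sets and another thread polls must either be accessed inside synchronized code or be declared volatile; otherwise the polling thread may keep reading a stale copy from its working memory:

    class Worker implements Runnable {
        // Declared volatile so that every access goes to main memory; without
        // it (or without synchronization) the loop below could spin forever on
        // a cached copy of the flag.
        volatile boolean done = false;

        public void run() {
            while (!done) {
                // ... perform one unit of work ...
            }
        }
    }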


[Figure 2: Java memory abstraction. Within the Java Virtual Machine, a thread's stack and working memory interact via assign and use; the working memory exchanges values with the main memory via store and load; the main memory performs write and read.]

Java defines the required ordering and coupling of the interactions between the memories (the complete set of rules is complex and is not repeated here). For instance, the read/write actions to be performed by the main memory for a variable must be executed in order of arrival. Likewise, the rules call for the main memory to be updated at synchronization points: by invalidating all copies in the working memory of a thread on entry to a monitor (monitorenter), and by writing to main memory all newly generated values on exit from a monitor (monitorexit).

Although the Java shared-data model is weakly consistent, it is not weak enough to allow oblivious caching and out-of-order access of variables between synchronization points. For example, Java requires that (a) the updates by one thread be totally ordered and (b) the updates to any given variable be totally ordered. Other languages, such as Ada [1], only require that updates to shared variables be effective at points where threads (tasks in Ada) explicitly synchronize, and therefore an RTS can make better use of memory hierarchies [10]. A Java implementation does not have enough freedom to fully exploit the optimization opportunities provided by caches and local memories.

A stronger memory consistency model at the language level forces the implementation to adopt at least the same, if not a stronger, memory consistency model. A strong consistency model at the implementation level increases the complexity of code optimizations: the implementation will have to consider interactions between threads at all points where shared variables are updated, not just at synchronization points [21]. If whole-program information is available, then the implementation can determine which variables are shared, and which shared variables are not updated in synchronized code. Such information is necessary to perform many standard compiler optimizations, e.g., code motion, in the presence of threads and synchronization. With the separate compilation of packages, where whole-program information is not always available, it is difficult to distinguish between shared and private variables. In such situations, therefore, several common code optimizations are effectively disabled [22].
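To make the point concrete, consider the following sketch (ours, with assumed names). Treating the field limit as loop-invariant, and hoisting its read out of the loop or caching it in a register, is a routine optimization; but a compiler can only justify it if it knows whether limit is shared and whether another thread may update it while the loop runs, which is exactly the information that separate compilation withholds:

    class Accumulator {
        double limit;        // possibly shared with, and updated by, other threads
        double[] values;

        double sumBelowLimit() {
            double sum = 0.0;
            for (int i = 0; i < values.length; i++) {
                // Whether this read of `limit` may be hoisted out of the loop
                // depends on whole-program knowledge of who else touches it.
                if (values[i] < limit) {
                    sum += values[i];
                }
            }
            return sum;
        }
    }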

4 Data Parallelism in Java

Data parallelism arises from simultaneous operation on disjoint partitions of data by multiple processors. Conceptually, each thread executes on a different virtual processor and accesses its own partition of the data. Synchronization is needed to enforce the ordering required for correctness. Communication is necessary when a thread has to access a data partition owned by a different thread; as described earlier, the producer-consumer relationship conveniently encapsulates both of these requirements. Determining the producer (the thread that generates the data) and the consumer (the thread that accesses the data) is straightforward, since the relationship is dictated by the algorithm.

The following features are idiomatic in data-parallel programming:

- Creating and starting multiple threads simultaneously.



- Producer-consumer operation for synchronization and data communication between threads.



- Collective synchronization of a set of threads (a minimal barrier sketch follows this list).
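Since Java offers no collective operations directly, a barrier for the last item above would itself have to be built from the monitor primitives of Section 3. The following sketch is ours (the class name, fields, and the await method are assumptions), showing one minimal, reusable formulation:

    class SimpleBarrier {
        private final int parties;      // number of threads that must arrive
        private int waiting = 0;
        private int generation = 0;     // counts completed barrier episodes

        SimpleBarrier(int parties) {
            this.parties = parties;
        }

        public synchronized void await() throws InterruptedException {
            int myGeneration = generation;
            waiting++;
            if (waiting == parties) {   // last thread to arrive
                waiting = 0;
                generation++;           // release the current episode
                notifyAll();
            } else {
                while (myGeneration == generation) {
                    wait();
                }
            }
        }
    }

A reduction could be layered on the same structure by having each arriving thread deposit its partial result before the last arrival combines them.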

Although Java provides language constructs for parallelism, there are several factors that make expressing data parallelism in Java awkward.

First, as noted in the previous section, there is no mechanism for creating and starting threads in a ThreadGroup simultaneously. A forAll must be implemented as a serial loop in Java:

    SPMDThread spmdthread[] = new SPMDThread[NumT];
    for (i = 0; i < NumT; i++) {
        spmdthread[i] = new SPMDThread();