Technical Note DHPC-081

A Java-based Parallel Programming Support Environment

K. A. Hawick and H. A. James
Department of Computer Science, University of Adelaide, SA 5005, Australia
Tel: +61 8 8303 4519, Fax: +61 8 8303 4366
Email: [email protected]

December 1999

Abstract

The Java programming language and environment is stimulating new research activities in many areas of computing, not least parallel computing. Parallel techniques are themselves finding new uses in cluster computing systems. Although there are excellent software tools for scheduling, monitoring and message-based programming on parallel clusters, these systems are not yet well integrated and do not provide very high-level parallel programming paradigm support. We have prototyped a multi-paradigm parallel programming toolkit in Java, specifically targeting an integrated approach on Beowulf-style cluster computers. Our JUMP system builds on ideas from the message-passing community as well as from distributed systems technologies. We believe our system promises high performance for parallel programming as well as integration with meta-computing frameworks for sharing resources across administrative boundaries.

The ever-improving Java development environment gives us access to a number of techniques that were not available in the message-passing systems of the past. In addition to the usual object-oriented programming benefits, these include: language reflection; a rich variety of remote and networking techniques; dynamic classloading; and code portability. Using pure Java does at present still limit the performance of some applications, but we believe that, as a research software system, the benefits of scoping out the design issues of a system like JUMP now are significant. We expect Java Virtual Machine (JVM) performance to continue to improve. Our system supports the usual messaging primitives, although in a more natural style for a modern object-oriented program. We are using our JUMP model framework to research some long-sought parallel programming goals: support for parallel I/O, irregular and dynamic domain decomposition and, in particular, irregular mesh support.

Keywords: JUMP; message passing; multi-paradigm support; metacomputing; programming environments.

1 Introduction

The Java programming language and the various environments that support it have found many uses in modern computing, not least in the area of parallel and high performance programming. The Java Grande Forum [15] has been a strong voice in the investigation of, and call for improvements in, the performance of Java Virtual Machines. Parallel computing is an area that Java promises to aid considerably with its integrated threads package and various language features such as reflection, dynamic loading, and its networking and remote method packages. We have investigated Java as a vehicle for distributed computing systems infrastructure in some depth [13] and are now considering its capabilities for more tightly coupled parallel systems.

In this paper we describe our Java Ubiquitous Message Passing (JUMP) software system for message passing. We have taken a fresh look at many of the difficulties and requirements of message-passing programming and at how Java programmers would use it differently from the C or Fortran programmers who have been the mainstay users driving message-passing systems. This is a necessary step towards our main goal, which is to consider how higher-level layers of libraries can be used to ease the burden of parallel programming using various decomposition paradigms.

Parallel programming has come a long way in some respects since the early 1980s. It is widely accepted and used for many performance-intensive applications. New generations of programmers are accepting parallel programming methods as mainstream and not some obscure part of computer science, as was once the case. However, parallel computing at the message-passing level remains cumbersome and error prone. We do not claim to have found a revolutionary new approach that solves this problem, but we do believe there is merit in considering higher-level library packages that embody past experience in a way that can be more readily used. In this paper we describe the general approach we are taking with our JUMP system to support parallel decomposition strategies in a way that does not require a lot of message-passing code to be custom written for a specific application.

Some of the classic decomposition strategies for parallel applications include: geometric or regular domain decomposition in an appropriate number of dimensions; scattered spatial or semi-random decomposition; and unordered task decomposition. It is our thesis that many applications do not require just one of these strategies for an optimal parallelisation; rather, many realistic codes require several. Our JUMP system is therefore designed to allow several parallel decomposition harnesses to coexist in the same running program, reusing communications infrastructure but controlled by different threads of execution. Multi-threaded programs may therefore use different decompositions for different data structures across the same set of participating processors. This idea is illustrated in figure 1: a common set of processors is employed by several decomposition frames running as part of the same program. Each frame is supported by a decomposition thread, and locality information is shared between the running threads/frames.

An interesting example application is a weather simulation on the globe. The atmospheric fluid flow is often modeled as a finite difference system of equations on a regular mesh, or sometimes on a partially regular mesh, whereas measured data is assimilated onto this mesh from arbitrarily scattered points. Time-stepping the fluid mesh and assimilating data may be computationally similar in terms of performance requirements or number of raw calculations to be performed, so no single decomposition is ideal. The same processors need to be used for both parts of the calculation, but different decompositions are optimal for load balancing the two parts of the computation. Our JUMP architecture is designed to support this idea.

In this paper we describe our JUMP architecture and discuss some performance insights gained. We also consider the impact of emerging technologies such as Jini and JavaSpaces services as a vehicle for implementing JUMP.


[Figure 1 appears here: several decomposition frames are projected onto a common processor set, with a separate thread controlling each frame; a frame could be a regular decomposition in some dimensionality, a scattered spatial decomposition, a simple unordered task-farm decomposition, or a semi-irregular (but with some locality) mesh decomposition.]

Figure 1: Multiple decomposition frames operating simultaneously in a multi-threaded parallel program, each decomposition frame mapping to the same set of processors.

2 JUMP Design and Architecture

Despite the previous work done to provide MPI support for Java [2-4, 7, 16, 21], we believe that there is still work to be done in creating a pure Java open source MPI system. Our system is designed and implemented as a simple pure Java "MPI-like" system. Not all MPI bindings are sensible in an object-oriented programming environment such as Java, and we propose a rethink of the message-passing interface specification in the light of object-oriented technology becoming more popular. As well as supporting the arguments normally used in MPI, the send() and recv() methods have been overloaded to accept fewer arguments, since information about data types and array sizes is available in Java through reflection. This is an example of where the syntax of MPI is "unnatural" in Java.

The Java language seems an obvious choice for the preferred implementation language because of both its platform-independent bytecode representation and the platform's sheer ubiquity. The Java language provides a number of other benefits too: single inheritance of classes keeps programming simple for new users by not requiring them to locate and compile their programs against arcanely-named libraries; Java is strongly typed, throwing an exception when a received object is cast to an invalid type, unlike C's default behaviour; and Java's security managers are able to specify, and enforce, the permissible actions on resources.

JUMP builds on our past experience in building multi-threaded daemons in Java for software management [13]. We also draw on our experience with classloading in constructing code-server databases of dynamically invocable Java bytecode [19].
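To make the overloading concrete, the following sketch shows the general shape of such a pair of send() calls; the method names and signatures here are illustrative assumptions and are not the actual JUMP base-class API. The short form recovers the element count by reflection, so the caller need not repeat information that the array already carries, and the object returned by recv() carries its own run-time type in place of an explicit MPI datatype argument.

import java.lang.reflect.Array;

// Sketch only: the names and signatures below are illustrative and are not
// the actual JUMP base-class API.
public abstract class MessagingSketch {

    // Long form, analogous to MPI_Send: the caller supplies the count explicitly.
    public abstract void send(Object buffer, int count, int destRank, int tag);

    // Overloaded short form: the element count is recovered by reflection.
    public void send(Object buffer, int destRank, int tag) {
        int count = buffer.getClass().isArray() ? Array.getLength(buffer) : 1;
        send(buffer, count, destRank, tag);
    }

    // The returned object's run-time type replaces an explicit MPI datatype argument.
    public abstract Object recv(int sourceRank, int tag);
}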

JUMP programs can be run in a similar fashion to conventional parallel programs written in MPI or PVM. We are still experimenting with the configuration of such programs [10] and are trying to determine where configuration information is best located. At present we use a flat hostfile of participating node names, in a fashion similar to PVM. As in both the MPI and PVM environments, the JUMP architecture is composed of three levels: user code; a user-level API; and underlying message-passing libraries (or in this case, a base class). The similarity between JUMP and other message-passing environments is shown in figure 2.

We have measured various aspects of the behaviour of the Java 2 platform on different platforms. Object creation for simple objects takes on the order of 150 clock cycles on a modern machine, whereas thread and socket creation overheads are on the order of millions of clock cycles [17]. This suggests that using object-oriented technologies is entirely feasible for constructing message-passing software systems. However, the high cost of thread and socket creation has driven our design objective of reusing these wherever possible [12].
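As a rough illustration of that design objective (not the actual JUMP code; the class and method names are hypothetical), a cache of open sockets keyed by remote node might look like this:

import java.io.IOException;
import java.net.Socket;
import java.util.HashMap;
import java.util.Map;

// Sketch of socket reuse: a connection to a given node is created at most
// once and handed back on every later request. Names are hypothetical and
// not taken from the JUMP implementation.
public class ConnectionCache {
    private final Map<String, Socket> connections = new HashMap<String, Socket>();

    public synchronized Socket connectionTo(String host, int port) throws IOException {
        String key = host + ":" + port;
        Socket s = connections.get(key);
        if (s == null || s.isClosed()) {
            s = new Socket(host, port);   // the expensive step, done once per node
            connections.put(key, s);
        }
        return s;
    }
}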


Figure 2: MPI and JUMP are both three-layer message-passing architectures. In MPI the user program imports the MPI API library, which in turn imports the MPI communications substrate; the JUMP user class extends the base class, which provides the API and initialisation routines. The JUMP communications substrate is accessed via an implementation-independent interface.

Figure 3 gives an overview of the JUMP architecture. The architecture is divided into two parts: the JUMP base class and the JUMP daemon. An abstract JUMP base class contains all the "standard" methods that the user can use for communications, initialisation and finalisation of the computation. The JUMP daemon (JUMPd) runs on each machine in the host pool, waiting to receive requests to start slave instances of user JUMP code. The daemon exists only as a bootstrap mechanism to initialise slave instances of user code. Figure 3 shows a user class extending the JUMP base class, which initiates communications through its local JUMP daemon. A network of socket connections is established between participating daemons on an as-needed basis. These can be reused to minimise socket instantiation overheads. Once a socket is established for that program between two nodes, no other will be created in our present implementation.

[Figure 3 appears here: the launching host and a slave host each run a JUMP daemon, with its node table and comms layer, in its own JVM; the user class, extending the JUMP base class, communicates through these daemons. Dotted outlines denote separate hosts, dashed outlines separate JVMs, and solid outlines one or more instantiated, running objects.]

Figure 3: Relationships between the user class extending the JUMP base class and the JUMPd daemon, running inside separate Java Virtual Machines.

The user writes JUMP code by extending the JUMP base class; extending the base class allows the user's code to be used by the JUMP environment. The user is only required to supply two methods in order to extend the base class: a public void jumpRun(String[] args) method and a public static void main(String[] args) method. The former is the main body of the user's code, run after initialisation, while the latter is used to initialise the computation. The main method usually consists of a call to create a new instance of the user class and a call to jumpRun. The user chooses where in their main method to call jumpRun, so that the master instance is able to load any data it may require to distribute to the slaves. The current version of the JUMP environment communicates via TCP/IP and Internet-domain sockets.

The steps involved in the initialisation of the system are shown in figure 4. Figure 4 i) shows the state of the system when the user executes the main() method of their program. An initialisation routine in the JUMP base class contacts the local JUMP daemon (figure 4 ii), which runs in its own JVM and therefore may not share the same classpath as the user program. As shown in figure 4 iii), the local JUMP daemon sends initialisation requests to an appropriate number of remote machines (the addresses of which are selected either by the user or by the daemon). When each remote JUMP daemon has successfully instantiated a copy of the user program, the daemons send details of the port on which the program is listening to the master program, which distributes a complete host table (figure 4 iv).
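A minimal user program in this style might look like the sketch below. The overall jumpRun()/main() structure follows the description above, but JumpBase and the rank(), size(), send() and recv() calls are hypothetical names standing in for the actual JUMP base-class API.

// Sketch of a JUMP-style user program. JumpBase, rank(), size(), send()
// and recv() are hypothetical stand-ins for the actual base-class API.
public class HelloJump extends JumpBase {

    // Main body of the user's code, run on every participating node
    // after the environment has been initialised.
    public void jumpRun(String[] args) {
        if (rank() == 0) {
            for (int node = 1; node < size(); node++) {
                send("hello from the master", node, 0);   // message, destination, tag
            }
        } else {
            String greeting = (String) recv(0, 0);        // source, tag
            System.out.println("node " + rank() + " got: " + greeting);
        }
    }

    // Entry point: create an instance and start the computation.
    public static void main(String[] args) {
        new HelloJump().jumpRun(args);
    }
}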


Java encapsulation and overloading allow us to provide default communicator tags to partition the message space appropriately, so that library layers do not collide with each other's messages. We are still considering the best syntactic form for the user to provide these explicitly when needed.

In order for JUMP to function on a truly federated cluster of compute nodes, we cannot assume the presence of a shared file system. For this reason, we use a custom classloader [19] to transfer the necessary code to the remote machine. The mechanism for remote classloading is shown in figure 5. When the user program is instantiated (through the main() method), a copy of the program's bytecode is serialized (figure 5 i). The serialized code is sent with the run-time arguments in an initialisation request to the local JUMP daemon (figure 5 ii). The message is then distributed to each of the remote JUMP daemons that have been chosen to participate in the computation (figure 5 iii). If the class representing the user's program references any other (non-core-Java) classes not in the remote daemon's classpath, a ClassNotFoundException will be raised when the daemon tries to instantiate a new instance of the user program. Our classloader traps this exception (figure 5 iv) and sends a request for the class bytecode to the master user program (figure 5 v). The master user program replies to the request with a serialized copy of any requested class (figure 5 vi). Steps v) and vi) are repeated as many times as necessary to resolve all objects on which the user program depends. Once all objects are resolved, the slave copy of the user program is instantiated and the port on which the slave is listening is returned through the local JUMP daemon to the master program.

To allow for the possibility of multiple, independent user programs running on the same physical machine, messages between user classes and the daemons, and between daemons, are tagged with the source and destination host IP and the port on which the program is listening. With the exception of the JVM crashing or the machine running out of memory, user programs are protected from interference by other programs running on the same machine.
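A much-simplified sketch of such a classloader is shown below. The requestBytecodeFromMaster() method is a hypothetical placeholder for the socket exchange with the master user program described above; the rest follows the standard java.lang.ClassLoader extension pattern.

import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the remote classloading idea: when a class cannot be
// found locally, its bytecode is requested from the master instance.
// requestBytecodeFromMaster() is a hypothetical placeholder.
public abstract class RemoteClassLoaderSketch extends ClassLoader {
    private final Map<String, Class<?>> cache = new HashMap<String, Class<?>>();

    // Invoked by the JVM only after the local classpath search has failed,
    // i.e. where a ClassNotFoundException would otherwise propagate.
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        Class<?> cached = cache.get(name);
        if (cached != null) {
            return cached;
        }
        byte[] bytecode = requestBytecodeFromMaster(name);    // steps v) and vi)
        if (bytecode == null) {
            throw new ClassNotFoundException(name);
        }
        Class<?> cls = defineClass(name, bytecode, 0, bytecode.length);
        cache.put(name, cls);                                 // cache for later requests
        return cls;
    }

    // In JUMP this would send a request to the master user program and read
    // back the serialized class bytecode.
    protected abstract byte[] requestBytecodeFromMaster(String className);
}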

3 Future Directions for JUMP

JUMP uses an approach to message communicator tags and grouping mechanisms similar to that specified by MPI, which we believe was pioneered in the Edinburgh CHIMP system [6]. This is important in allowing library layers of more sophisticated communications to operate with internal messaging that does not impact on the explicit messages sent by user program code.

We anticipate that JUMP will provide a platform with which we can investigate the challenging area of data layout specification. For example, HPF [14] provides the ability to distribute data in block and cyclic patterns, depending on the way the data is to be used. We plan to introduce an extra layer between the user's code and the JUMP base class which will allow communications to be automatically routed to the correct user JUMP instance in the current data layout. For example, consider a weather simulation that combines previous predictions with observed data (from ships, aircraft and weather balloons). The Earth is likely to be modeled as a regular grid; therefore the code that operates on the previous predictions, over the parts of the Earth's surface for which there is no observed data, is likely to use a regular grid decomposition. However, observational data is not spread uniformly across the surface of the Earth: it is tightly clustered around certain points (most corresponding with shipping lanes and flight paths). This observational data may be best processed using a task-farming model with the ability to update the cells of the Earth grid that are affected.


Figure 4: Initialisation of the JUMP environment. i) The user executes their program. ii) A socket is created that contacts the local daemon. iii) The local JUMP daemon selects the appropriate number of remote hosts and sends their JUMP daemons an initialisation request; they respond with the host IP/port on which the slave program listens. iv) When all the remote daemons have responded favourably to the initialisation request, the complete host table is distributed from the master JUMP daemon.



Figure 5: JUMP distributed classloading mechanism. i) The user program loads a serialized copy of its class bytecode in preparation for distribution. ii) The serialized bytecode is sent to the local JUMP daemon. iii) The bytecode is sent with an initialisation request and run-time parameters to the remote JUMP daemons. iv) The remote JUMP daemon tries to instantiate an instance of the user program on the remote machine. v) If the user program uses other non-core-Java classes, a request is sent to the master instance of the user program. vi) The requested classes are sent to the remote daemon, where they are cached. Steps v) and vi) are repeated as required. vii) The slave instance is instantiated and the jumpRun() method is invoked with the run-time arguments.

When this extra layer of functionality is added to the JUMP environment, we expect the architecture to look like that shown in figure 6, which shows the proposed three-level architecture incorporating support for data layout specification.
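To make the intended layout specification concrete, the sketch below shows block and cyclic index-to-node mappings in the HPF sense; the DataLayout interface and class names are hypothetical and not part of the current JUMP API. The proposed layer would consult such an object to decide to which user JUMP instance a communication for a given element should be routed.

// Hypothetical sketch of data layout objects: each layout maps a global
// index onto the node that owns it. Names are illustrative only.
interface DataLayout {
    int ownerOf(int globalIndex);
}

// BLOCK layout: contiguous chunks of the index space per node.
class BlockLayout implements DataLayout {
    private final int globalSize, nodes;
    BlockLayout(int globalSize, int nodes) {
        this.globalSize = globalSize;
        this.nodes = nodes;
    }
    public int ownerOf(int globalIndex) {
        int blockSize = (globalSize + nodes - 1) / nodes;   // ceiling division
        return globalIndex / blockSize;
    }
}

// CYCLIC layout: indices are dealt out to nodes in round-robin fashion.
class CyclicLayout implements DataLayout {
    private final int nodes;
    CyclicLayout(int nodes) {
        this.nodes = nodes;
    }
    public int ownerOf(int globalIndex) {
        return globalIndex % nodes;
    }
}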


Figure 6: Future versions of JUMP will include an extra layer of functionality to allow different data layout specifications to be defined and used by users.

While the current version of JUMP communicates via sockets, it is not reliant on any particular communications technology. The current version is written using TCP/IP and Internet-domain sockets; the architecture does not preclude alternative versions that use, for example, RMI or JavaSpaces [8]. We are experimenting with the tuple space approach offered by JavaSpaces. The tuple space model, in which messages are tagged with complex predicate objects, is an attractive mechanism for building higher-level systems. However, a limitation of that system is its lack of support for "spaces of spaces": a single node would have to host the entire communications space, which we believe is not suitable for a high-performance communications system. We are investigating ways to enable scalable tuple spaces. We have compared this with the performance that would arise if all communications in our system were brokered through a single host, which would typically present too much message congestion for any practical application. We are waiting for the JavaSpaces technology to mature further before we use it to implement JUMP.

JavaSpaces is a Linda-based distributed computing technology developed by Sun, and has been implemented on top of Jini [1]. Whilst Linda-based paradigms can potentially simplify the process of distributed programming, the performance of such systems has been considered to be poor, and our performance measurements confirm this. Performance was measured using a ping-pong program running on our ATM-connected Alpha cluster. With the communicating processes and the JavaSpaces service executing on the same machine, the ping-pong latency was measured to be 484 ms; if the JavaSpaces service executes on a different machine, the measured latency drops to 325 ms. These measurements show that JavaSpaces introduces very significant communications overheads that currently limit its usefulness. We have also found that JavaSpaces has significant hardware requirements, in terms of both CPU power and memory. Current versions of JavaSpaces exhibit some instability, and we are having difficulty running JavaSpaces on our Solaris and Windows NT systems. Based on our results and experiences, we believe that JavaSpaces is currently unsuitable for high performance distributed computing. As its performance and stability improve in future versions of the product, it is likely to become a more viable alternative to MPI systems.
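For reference, the kind of ping-pong measurement quoted above can be expressed against the JavaSpaces API roughly as follows. This is a sketch only: obtaining the JavaSpace proxy via Jini lookup is omitted, and the PingEntry class and timing methods are illustrative rather than our actual benchmark code.

import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

// Sketch of a JavaSpaces ping-pong latency measurement. Entries must be
// public classes with public fields and a no-arg constructor.
public class PingPongSketch {

    public static class PingEntry implements Entry {
        public String direction;     // "ping" or "pong"
        public PingEntry() { }
        public PingEntry(String direction) { this.direction = direction; }
    }

    // The "ping" side: write a ping into the space, block until the pong returns.
    public static long measureRoundTrip(JavaSpace space) throws Exception {
        long start = System.currentTimeMillis();
        space.write(new PingEntry("ping"), null, Lease.FOREVER);
        space.take(new PingEntry("pong"), null, Long.MAX_VALUE);   // blocking template match
        return System.currentTimeMillis() - start;
    }

    // The "pong" side, run on the partner node: take a ping, answer with a pong.
    public static void echo(JavaSpace space) throws Exception {
        space.take(new PingEntry("ping"), null, Long.MAX_VALUE);
        space.write(new PingEntry("pong"), null, Lease.FOREVER);
    }
}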

4 Summary and Conclusions

We have described the design and architecture of our JUMP Java-based message-passing system, and have discussed the performance and object-oriented design goals that have led us in this direction. Although we envisage a need for an MPI-compliant Java messaging system, we have been driven by research interests in higher-level layers of parallelism to develop JUMP. We have found that the performance restrictions on advanced Java technologies such as Jini and JavaSpaces, and also Java RMI, are too limiting for the applications we hope to support. This tradeoff may change as JVMs improve in performance; at present, however, we believe there is interesting work to be done in developing parallel support for various data decomposition strategies.

References

[1] Ken Arnold, Bryan O'Sullivan, Robert W. Scheifler, Jim Waldo and Ann Wollrath. The Jini Specification. Addison Wesley Longman, June 1999. ISBN 0-201-61634-3.

[2] Mark Baker, Bryan Carpenter, Geoffrey Fox, Sung Hoon Ko and Xinying Li. mpiJava: A Java MPI Interface. http://www.npac.syr.edu/projects/pcrc/papers/mpiJava/

[3] Bryan Carpenter, Vladimir Getov, Glenn Judd, Tony Skjellum and Geoffrey Fox. MPI for Java: Position Document and Draft API Specification. Java Grande Forum Technical Report JGF-TR-03, November 1998.

[4] Kivanc Dincer. Ubiquitous Message Passing Interface Implementation in Java: JMPI. Proc. 13th Int. Parallel Processing Symp. and 10th Symp. on Parallel and Distributed Processing, IEEE, 1998.

[5] Dynamic Object Group. DOGMA Home Page. http://ccc.cs.byu.edu/-

[6] Edinburgh Parallel Computing Centre. CHIMP Concepts. June 1991.

[7] Adam J. Ferrari. JPVM: Network Parallel Computing in Java, 1998.

[8] Eric Freeman, Susanne Hupfer and Ken Arnold. JavaSpaces Principles, Patterns, and Practice. Addison Wesley Longman, June 1999. ISBN 0-201-30955-6.

[9] W. Gropp, E. Lusk, N. Doss and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Argonne National Laboratory, 1996.

[10] K. A. Hawick. The Configuration Problem in Parallel and Distributed Systems. Technical Report DHPC-076, Department of Computer Science, The University of Adelaide, November 1999.

[11] K. A. Hawick, R. S. Bell, A. Dickinson, P. D. Surry and B. J. N. Wylie. Parallelisation of the Unified Model Data Assimilation Scheme. Proc. Fifth ECMWF Workshop on Use of Parallel Processors in Meteorology, Reading, November 1992. European Centre for Medium Range Weather Forecasting (ECMWF).

[12] K. A. Hawick, H. A. James, J. A. Mathew and P. D. Coddington. Java Technologies for Cluster Computing. Technical Report DHPC-077, Department of Computer Science, The University of Adelaide, November 1999.

[13] K. A. Hawick, H. A. James, A. J. Silis, D. A. Grove, K. E. Kerry, J. A. Mathew, P. D. Coddington, C. J. Patten, J. F. Hercus and F. A. Vaughan. DISCWorld: An Environment for Service-Based Metacomputing. Future Generation Computer Systems (FGCS), 15:623-635, 1999.

[14] High Performance Fortran Forum. High Performance Fortran Language Specification, version 2.0. January 1997.

[15] Java Grande Forum. Java Grande Forum Home Page.

[16] P. Martins, L. M. Silva and J. Silva. A Java Interface for WMPI. LNCS 1497, pp. 121-, 1998.

[17] J. A. Mathew, P. D. Coddington and K. A. Hawick. Analysis and Development of Java Grande Benchmarks. Proc. ACM 1999 Java Grande Conference, April 1999.

[18] J. A. Mathew, H. A. James and K. A. Hawick. Development Routes for Message Passing Parallelism in Java. Technical Report DHPC-082, Department of Computer Science, The University of Adelaide, January 2000.

[19] J. A. Mathew, A. J. Silis and K. A. Hawick. Inter Server Transport of Java Byte Code in a Metacomputing Environment. Proc. TOOLS Pacific (Tools 28), 1998.

[20] Message Passing Interface Forum. MPI: A Message Passing Interface Standard.

[21] MPI Software Technology web page. January 2000.
