Java as a Front-End to High-Performance Computing Resources

Andrew J. Silis and K. A. Hawick
Department of Computer Science, University of Adelaide, SA 5005, Australia
Email: [email protected]
Fax: +61 8 8303 4366, Tel: +61 8 8303 4519

Abstract

The Java programming language is finding uses in many areas of computing, and although Java Virtual Machines are becoming faster and more efficient, their performance is still poor compared to native code. In particular, numerical operations are still better suited to languages such as Fortran or C on high-performance hardware. We describe our efforts in using Java as a front-end or client program for high-performance systems, and in particular for parallel supercomputers running remote native code. We review various approaches to providing powerful back-end compute services to Java client programs, report some performance figures, and discuss their implications for performance-accelerated Java programs in the context of a simple performance model. We also consider other non-technical, operational issues in setting up such a system to make use of legacy supercomputers.

Keywords: Distributed systems, Java, parallel computing, remote access, supercomputers.

1 Introduction

As part of our ongoing work to develop a high-level, service-based distributed computing environment, we identified a need to access legacy software modules which are optimised to run efficiently on supercomputers. Our DISCWorld system [8] is being developed largely in Java, and it was therefore necessary to consider how Java client programs could effectively invoke supercomputer programs remotely. The more general problem of invoking native and remote code [?] is of increasing importance as Java becomes widely used on distributed systems, particularly for embedded client programs invokable from a World Wide Web (WWW) environment.

Java programs allow a number of possible mechanisms for remote computing, which we explore in this paper. Several choices exist, including: TCP/IP and direct socket interfacing; Java Remote Method Invocation (RMI); interfacing CORBA to JVMs; use of Remote Procedure Calls (RPC); and use of remote shells. We discuss these in section 2. These enable a Java client to connect to and control remote resources in various ways, each best suited to different types of applications or services.

There are a number of factors which make Java an attractive environment for user programs. Java provides a convenient interface-building mechanism which allows application domain-specific interfaces to be constructed, providing different viewpoints of the same services and resources to users with different expertise. For example, a set of numerical algebra services might be made available on supercomputers for scientists to use, or some higher-level services might provide yet another layer on top of basic services. One of the strengths of a Java client as part of a

WWW service is that quite complex processing can be carried out at the WWW-browser client end prior to invoking a service on a back-end server. This might involve parsing complex queries to ensure they make sense before transmission, thus reducing the latency overhead of communicating multiple times with the server. The Java environment can also restrict how the operating system is accessed, by imposing security restrictions on Java Applets such as Java client programs embedded in a web page.

Various mechanisms are provided by a Java Virtual Machine (JVM) to access other computers, but a particular limitation is the availability of a JVM on every participating platform. We faced the particular problem of integrating super-computational programs running on a Connection Machine (CM5) as services for a Java client program. The CM5 is one of the few platforms for which there is no JVM available.

It is interesting to consider the usage model for supercomputers and how this has changed recently. The batch model is no longer the only common one for supercomputer facilities. It is strongly desirable to be able to use supercomputers interactively, and for this to be effective they must be interfaced with suitable graphical programming environments. Supercomputers are expensive, and although personal workstations and PCs are constantly improving in performance, it is still valuable to extend the lifetime of legacy supercomputer facilities by making them more accessible to users. It is wasteful if users must implement numerical computations on lower-performance platforms in languages such as Java when these operations can be run more effectively on existing supercomputer facilities. The difficulty is providing the integration glue to allow client programs to interact with supercomputer servers. We believe Java itself provides good mechanisms for interfacing to supercomputer technology. For preference, these mechanisms will integrate within the Java language itself, maximising ease of use and portability. Not all the software mechanisms we discuss are avail-

able on all machines, unfortunately. We outline several mechanisms for accessing remote computers and their suitability in various circumstances.

We describe our efforts to extend the lifetime of two supercomputer resources: a TMC Connection Machine CM5, and an SGI Power Challenge. The CM5 has specialist capabilities that give it a performance edge over personal workstations, but it is harder to write general-purpose programs for than using Java on a workstation or PC. The Power Challenge can run user Java programs as well as multi-processor native code. The relative cost of these supercomputers is high compared to general-purpose workstations, but as specialist accelerators on which high throughput and utilisation can be maintained, the relative cost is more favourable.

Many of the present generation of general-purpose multi-processor computers can run Java Virtual Machines. It is not yet clear how effectively future JVMs will make use of multi-processor capabilities, but genuine multithreading is likely to be feasible soon. Other parallel programming mechanisms for Java are now being considered but, at the present time, the only way to make use of supercomputers from a Java program is to adopt the client-server model and invoke native code running on the supercomputer from a client Java program.

2 Remote Access Mechanisms

We review the alternative mechanisms that can potentially be used to invoke remote resources from a Java client program. In some circumstances one solution may be the only possible one, due to technical constraints such as required software not being available on a particular platform.

One widely used mechanism is that of running a remote shell between Unix platforms. This is a heavy-weight approach, as it requires a process on the client machine to invoke a process on the remote machine to carry out the request. Data transfer can be effected through a remote copy mechanism, or possibly through a shared file system if a suitable relation of trust [7] exists between the client and server platforms. This approach has significant startup overheads, as well as using up memory resources on both the client and server. It also presupposes Unix on both platforms and user account access to both. Its advantage is that it is relatively simple to build with system-provided software.

Remote procedure calls (RPC) provide a somewhat lighter-weight approach, whereby a single server process is always available on the server platform to handle all incoming requests. This uses less memory than remote shells, and the mechanism is now available on non-Unix platforms such as Windows NT. A disadvantage is the relatively complex and error-prone software needed to set up remote services; this is not easily undertaken by a non-systems-aware programmer.

The Java Remote Method Invocation (RMI) package provides a convenient mechanism for Java programs running within different JVMs to interact and to invoke computations on each other. This is well integrated within existing Java environments, but presupposes that a JVM exists for the server platform (a minimal sketch is given at the end of this section). Independent of RMI, network access mechanisms are provided in the Java environment for making connection-less and connection-oriented communications between Java programs running on different JVMs, as well as with other programming systems that can communicate at a socket and port level. This is convenient for platforms with no JVM, but requires care in establishing protocols and communications software, which must be implemented in full by the user.

Recent work provides software for interfacing Java to CORBA [1] services. These are generally not light-weight, and although this is convenient for interfacing to legacy CORBA code, CORBA is not widely used on supercomputers. Indeed there is scope for work to successfully integrate super-computational services within a CORBA environment.
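To make the RMI mechanism concrete, the following is a minimal sketch of how a numerical service might be exposed over RMI. The names (MatrixService, MatrixServiceImpl, server.example.org) are ours for illustration only and are not part of any system described in this paper.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

// Hypothetical remote interface for a matrix inversion service.
interface MatrixService extends Remote {
    double[][] invert(double[][] matrix) throws RemoteException;
}

// Server-side implementation; the numerical kernel itself is omitted.
class MatrixServiceImpl extends UnicastRemoteObject implements MatrixService {
    MatrixServiceImpl() throws RemoteException { super(); }

    public double[][] invert(double[][] matrix) throws RemoteException {
        // ... call an optimised Java or native kernel here ...
        return matrix; // placeholder
    }

    public static void main(String[] args) throws Exception {
        // Assumes an rmiregistry process is already running on this host.
        Naming.rebind("MatrixService", new MatrixServiceImpl());
    }
}

// Client-side invocation from any JVM that can reach the server.
class MatrixClient {
    public static void main(String[] args) throws Exception {
        MatrixService svc =
            (MatrixService) Naming.lookup("rmi://server.example.org/MatrixService");
        double[][] a = { { 2.0, 0.0 }, { 0.0, 4.0 } };
        double[][] inv = svc.invert(a);
        System.out.println("inv[0][0] = " + inv[0][0]);
    }
}
```

Under JDK 1.1 the stubs would be generated with rmic, and an rmiregistry process must run on the server host. The essential constraint for our purposes is that the server side must itself be a JVM, which rules out platforms such as the CM5.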

There appears to be a need for an additional layer of software to meet the requirements of distributed high-performance computing, particularly between Java client programs and a server structure that includes supercomputer resources. This idea is not a new one; the concept of high-level simulation-on-demand services running on remote supercomputers has been considered before [4]. Only recently, however, with the widespread uptake of a network-friendly programming environment such as Java, has an infrastructure for this approach become viable.

A number of projects are working towards this goal. The SNIPE [9] project is developing an alternative to the highly successful PVM [6] parallel programming system that will integrate with Java. The Nexus [5] communications infrastructure also has support for Java clients. The Netsolve [3] project provides a mechanism for interfacing to linear algebra packages through WWW clients in a similar manner to the experiments we describe in section 4, as does the Ninf [10] project. All these systems operate at a relatively low operation granularity, however. In this paper, we discuss the effect of granularity on the trade-off point at which the client/server approach becomes worthwhile, in the context of typical network and systems performance.

Our own DISCWorld [8] system is designed to operate with higher-level services such as full application components which, we believe, better amortise latency and data transfer overheads. Our system also aims to provide multiple versions of a given service, including light-weight portable Java code that can run on inexpensive servers as well as selected highly optimised native code that can run on supercomputers. It is very important to understand the present and likely trade-offs for remote service computations such as these, in order to determine which version to invoke.

3 A Time Trade-off Model

In this section we develop a simple descriptive model of the trade-offs for remote computation. The model has many similarities to the models used for parallel computation; however, the granularity of operations, and consequently also the balance points, are different. Consider the following example of a user running a program P1 on a remote supercomputer, which requires input data $D_1^{\mathrm{in}}$ and produces output data $D_1^{\mathrm{out}}$, of sizes $N^{\mathrm{in}}$ and $N^{\mathrm{out}}$ bytes respectively. Suppose there is an effective latency of $L$ seconds and a network bandwidth of $B$ bytes per second between the user's computer (his workstation, for example) and the supercomputer.

[Figure 1 shows the event sequence: the client user/application (e.g. a workstation or PC) issues an invoke request; the compute request and initial input travel over the network to the supercomputer server (e.g. CM5 or T3D); the server receives the input data; the request waits in a queue; the compute job runs; results are written; and the output data is transferred back for the client to read.]

Figure 1: Event schedule in invoking a Connection Machine service.

The sequence of events in the model is illustrated in Figure 1. The total time to completion for the user's program is:

\[
T^{\mathrm{total}} = T_1^{\mathrm{send}} + T_1^{\mathrm{wait}} + T_1^{\mathrm{supercompute}} + T_1^{\mathrm{recv}} \qquad (1)
\]

where $T_1^{\mathrm{send}} = L + N^{\mathrm{in}}/B$ and $T_1^{\mathrm{recv}} = L + N^{\mathrm{out}}/B$. For simplicity, assume the user does not have to queue his job for the supercomputer, so that $T^{\mathrm{wait}} \approx 0$. In general, assume that $T_i^{\mathrm{supercompute}} \ll T_i^{\mathrm{compute}}$ for any compute component $i$. However, the total time may be dominated by the data transfer costs unless the operations involved are computationally intensive ones. We discuss this effect in section 4.
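As a concrete reading of equation (1), the following sketch (our own illustrative code, with example numbers of the order measured in section 4) estimates the remote completion time and tests whether remote execution beats a local estimate:

```java
// Minimal sketch of the time trade-off model of equation (1).
// All names and the example figures are illustrative assumptions.
public class TradeoffModel {
    double latency;    // L, effective latency in seconds
    double bandwidth;  // B, network bandwidth in bytes per second

    TradeoffModel(double latency, double bandwidth) {
        this.latency = latency;
        this.bandwidth = bandwidth;
    }

    // Equation (1): T_total = T_send + T_wait + T_supercompute + T_recv,
    // with T_send = L + N_in/B and T_recv = L + N_out/B.
    double remoteTime(long nIn, long nOut, double tWait, double tSupercompute) {
        double tSend = latency + nIn / bandwidth;
        double tRecv = latency + nOut / bandwidth;
        return tSend + tWait + tSupercompute + tRecv;
    }

    // Remote execution is worthwhile only if it beats the local estimate.
    boolean worthRunningRemotely(double tLocal, long nIn, long nOut,
                                 double tWait, double tSupercompute) {
        return remoteTime(nIn, nOut, tWait, tSupercompute) < tLocal;
    }

    public static void main(String[] args) {
        // Figures of the order reported in section 4: ~1.3 s latency,
        // ~0.3 MB/s over lightly loaded 10Base-T ethernet.
        TradeoffModel m = new TradeoffModel(1.3, 0.3e6);
        long n = 512;                 // matrix edge size
        long bytes = n * n * 8;       // 64-bit matrix, sent in and out
        System.out.println("Remote estimate: "
            + m.remoteTime(bytes, bytes, 0.0, 2.0) + " s");
    }
}
```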

The likely utility of deploying the supercomputer instead of the user's PC or workstation improves if we look at running whole applications: the traditional supercomputer batch model, where it is not feasible to use workstations or PCs because the compute times are so large. There is an increasing demand for near-interactive-time applications which are still too compute-intensive for workstations. This is particularly so for decision-support-related applications. Fortunately, for these applications it is common to be able to break up the desired remote computation into a series of computational components with clearly defined interaction stages. These computational components might communicate with one another using potentially large data sets, possibly larger than the original input data sent by the user when initiating the request. It is inefficient to transmit this data to and from the user between operations; it is better to retain it on the supercomputer server's own storage system, under the management of the remote user.

Consider a program P2 which consists of $m$ component computations $C_j$, $j = 1, \ldots, m$. Computational stage $j$ takes time $T_j$ to complete, including any time to load data from the previous stage and write out intermediate results to be used by the next stage. Suppose that the user interaction between stages involves transmitting some reduced quantity from the output of a computational stage and waiting on some decision response from the user before progressing to the next computational stage. To a good approximation, this interaction is likely to be similar in length regardless of whether the program is run on the user's workstation or on the remote supercomputer, at least compared to the other time components. The total time (ignoring interaction delays) is therefore:

\[
T^{\mathrm{total}} = T^{\mathrm{send}} + \sum_{j=1}^{m} T_j^{\mathrm{supercompute}} + T^{\mathrm{recv}} \qquad (2)
\]

where only the initial bulk data send and final bulk data receive times are included, and these are costed using the latency and bandwidth as before. The result is that most of the work is retained on the supercomputer, and as little bulk data as possible is transmitted back and forth.
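Equation (2) changes the earlier sketch only in replacing the single compute term by a sum over stage times; a minimal, illustrative extension might be:

```java
// Equation (2): a staged pipeline where only the initial send and final
// receive cross the network; stageTimes[j] includes intermediate I/O on
// the supercomputer's own storage. Names are illustrative assumptions.
class StagedTradeoff {
    static double remoteStagedTime(double latency, double bandwidth,
                                   long nIn, long nOut, double[] stageTimes) {
        double total = (latency + nIn / bandwidth)    // T_send
                     + (latency + nOut / bandwidth);  // T_recv
        for (double tj : stageTimes) {
            total += tj;                              // sum over T_j
        }
        return total;
    }
}
```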

The computational components might be operators in a suite or library provided on the supercomputer. These might be any collection of operations with a common set of well-defined data types: for example, numerical operators that act on vectors and matrices of floating-point numbers, or image operations that act on pre-determined image file formats. The model described here can be used to explore how useful it is to construct libraries of operator "accelerators" for supercomputers that can provide fast services to client applications on remote user workstations and PCs. The idea can also be extended to specialist service providers in a peer-based, server-less architecture for distributed computing [8].

The remote computation idea can be made more attractive still for client operations which do not need to transmit bulk data at all, but can succinctly specify the data they want to operate upon in the expectation that it already exists within the supercomputer's proximity. This is quite likely for bulk data at a supercomputer facility, which may also have a bulk data store closely integrated with the supercomputer. It is still necessary for the client user or application to be able to specify which large data entity it wishes the supercomputer to access. This may be viable with prearranged codes, with a suitable cross-mounted file system, or via some intermediate database that can provide the supercomputer programs with the necessary file handles. This approach might not be the most attractive reason for acquiring a new supercomputer, but it does provide a framework for deciding to retain a legacy system and configure it to provide services to client applications, thus prolonging its useful lifetime. A socket-level sketch of this handle-passing style of request is given below.
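To illustrate the handle-passing idea at the socket level, the following sketch shows a client that names a pre-staged dataset rather than transmitting it. The one-line request protocol, host name, and port are invented for illustration and are not the protocol used in our experiments.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Hypothetical socket client: instead of shipping bulk data, the request
// names a dataset already resident at the supercomputer site by a
// prearranged handle. Protocol, port and host are illustrative only.
public class HandleClient {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket("cm5-frontend.example.org", 5000);
        PrintWriter out = new PrintWriter(s.getOutputStream(), true);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(s.getInputStream()));

        // Ask the server to invert a matrix it already holds.
        out.println("INVERT dataset=matrix-19980212 format=float64");

        // Read back a small reply, e.g. a handle to the result.
        String reply = in.readLine();
        System.out.println("Server replied: " + reply);
        s.close();
    }
}
```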

4 Performance Examples

We considered a number of common computationally intensive components as useful services for a supercomputer back-end to run on behalf of a Java client program. We describe three examples here: one using a matrix inversion kernel; a fast Fourier transform kernel; and another using a cellular automata simulation as a full application component. Linear algebra libraries are easily encapsulated, well understood, and well implemented on supercomputers. We compared implementations of the Gauss-Jordan dense matrix inversion routine, using the best-coded implementations we could find for platforms running pure Java code and some Fortran implementations, against the Connection Machine Scientific Software Library (CMSSL) implementation.
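For reference, a pure-Java Gauss-Jordan kernel of the general kind benchmarked here might look as follows. This is an illustrative implementation with partial pivoting, not the authors' benchmark code:

```java
// Gauss-Jordan inversion with partial pivoting; inverts a in place.
// Illustrative kernel only; O(N^3) operations, as seen in Figure 2.
public class GaussJordan {
    public static void invert(double[][] a) {
        int n = a.length;
        int[] perm = new int[n]; // records the row swap made at each step

        for (int col = 0; col < n; col++) {
            // Partial pivoting: pick the largest entry in this column.
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            double[] tmp = a[pivot]; a[pivot] = a[col]; a[col] = tmp;
            perm[col] = pivot;

            double p = a[col][col];
            if (p == 0.0) throw new ArithmeticException("singular matrix");

            // Scale the pivot row; the pivot slot ends up holding 1/p so
            // that the inverse accumulates in place of the original matrix.
            a[col][col] = 1.0;
            for (int c = 0; c < n; c++) a[col][c] /= p;

            // Eliminate this column from every other row.
            for (int r = 0; r < n; r++) {
                if (r == col) continue;
                double f = a[r][col];
                a[r][col] = 0.0;
                for (int c = 0; c < n; c++) a[r][c] -= f * a[col][c];
            }
        }
        // Undo the recorded row swaps as column swaps of the inverse.
        for (int col = n - 1; col >= 0; col--) {
            if (perm[col] != col) {
                for (int r = 0; r < n; r++) {
                    double t = a[r][col];
                    a[r][col] = a[r][perm[col]];
                    a[r][perm[col]] = t;
                }
            }
        }
    }
}
```

The triple loop gives the O(N^3) work visible in the large-N behaviour of Figure 2.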


[Figure 2 is a log-scale plot of matrix inversion time (Gauss-Jordan, 64-bit) in elapsed seconds (better than +/- 10 ms) against matrix edge size N, from 0 to 500, with curves for 'Ultra1-JDK-1.1', 'Ultra2-JDK-1.1', 'NT-JDK-1.1', 'Alpha-JDK-1.1', 'PwrChall-JDK-1.1', 'PwrChall-F90', 'Alpha-F95' and 'CM5-64'.]

Figure 2: Timing for Java and Fortran implementations of dense matrix inversion.

Figure 2 shows the elapsed times for various matrix sizes run on: a Sun Ultra (Solaris); a 166 MHz Pentium PC (NT); a 166 MHz DEC Alpha workstation (Digital Unix); a 20-processor SGI Power Challenge (IRIX); and a Connection Machine CM5 with 64 processors (SunOS on the host processor). The graph shows a number of significant features. All implementations show the expected O(N^3) time dependence of the Gauss-Jordan algorithm for large matrix sizes N. Not surprisingly, for a dedicated workstation or PC running a JVM or compiled optimised Fortran code, there is a smooth increase in the time curves for small N. There is a significant overhead in starting up a computation on the CM5, whose curve only crosses below the others for matrices above circa 128^2. Generally, the compiled Fortran codes are always faster than the Java implementations, the exception being the anomalous behaviour of the Power Challenge multi-processor Fortran code. Unlike all the other machines, the Power Challenge was not unloaded of other users, and this effect is probably attributable to multi-processor effects. These interference effects from other users become significant above matrix sizes of circa 300^2. This is an important caveat to the discussion: supercomputers are only useful as accelerators if a useful timeslot can be obtained on them.

Java Development Kit (JDK) 1.1.2 was used throughout, on the different platforms. Although it is only a single-processor version, the SGI Java version performs very well, outperforming the multi-processor Fortran code for N above circa 320, since it does not incur multi-processor/multi-user conflicts. Nevertheless, for matrix sizes above 128^2, the optimised library on the CM5 outperforms all other platforms. Furthermore, it is capable of significantly larger problem sizes than the Java platforms, simply due to its greater memory. We timed its performance out to matrix sizes of 4096^2 with smoothly scalable performance.

Consider, however, the cost of transferring a matrix of data to and from a networked supercomputer. A matrix of 128^2 64-bit numbers occupies at least 128 kB, and possibly more depending upon whether a portable format needs to be used for transfer. Consider some simple estimates of transfer performance across typical networks and file stores. For lightly loaded 10Base-T ethernet we typically obtain an effective transfer bandwidth to a remote file system of 0.3 MB/s, with a latency of approximately 1.3 seconds. This can be improved if we use our local ATM connectivity between machines. The latency is the greatest concern, however, and although we can arrange to pipe data into a running program rather than store it on the CM5's file system, and so reduce it somewhat, the latency in invoking a supercomputer program and effectively transferring data to it remains significant, at around 1 second. For the matrix inversion example, this pushes the break-even point for using the CM5 up to matrix sizes circa 300^2, where the bandwidth transfer times (with 10Base-T ethernet) are also circa 1 second. Only at sizes in excess of 512^2 does the computational complexity clearly justify transferring data back and forth.

There are various ways to tackle the latency overhead. One is to attempt to carry out more complex operations at the server side, avoiding the transfer of intermediate results back and forth between client and server during a series of calculations. This poses problems of data management at the server side, which must now retain state between user connections.

[Figure 3 is a log-scale plot of 2D complex Fourier transform time in elapsed seconds (better than +/- 10 ms) against array edge size N, from 0 to 1000, with curves for 'Ultra1-JDK-1.1', 'Alpha-JDK-1.1', 'PwrChall-JDK-1.1' and 'CM5-64'.]

Figure 3: Timing for Java and CM5 implementations of complex-complex fast Fourier transforms.

Another common example of a computationally intensive kernel operation is a fast Fourier transform. We implemented this on various JVMs and again using the Connection Machine CMSSL. The results are shown in Figure 3. They show the crossover point, at which the CM5 outperforms the other platforms, at a 2D array edge size of around 150. Furthermore, the arithmetic performance of the Sun UltraSparc 1 was again poorer than that of the AlphaStation. Apart from overhead/startup effects, all the timing curves show the expected O(N log N) complexity time dependence.

Consider another example where a whole simulation might be run on the supercomputer server, but where the computational complexity is no longer so biased in favour of the supercomputer. We investigated the times for running a cellular automaton simulation of a bush-fire model. The model involves a regular mesh automaton with nearest-neighbour update rules; a number of iterations are performed on the configuration to evolve it in time. Such models are widely used in computational physics and are finding uses in decision-support simulation systems. We investigated a configuration size of 128^2 which, although relatively small for operational purposes, allowed us to fully investigate the timing behaviour of a JVM as well as the CM5. We timed two scenarios: running the simulation in pure Java and displaying the result from the Java application; and running the simulation remotely on the CM5, transferring the results back to the Java application for display. We varied the number of iterations of the simulation to provide a variable computational load. For our simple model, the best optimised code we could write in Java yielded 31 iterations per second on a DEC Alpha running JDK 1.1.2, while in data-parallel Fortran on the CM5 with 64 processors, 654 iterations per second were possible: a factor of approximately 20 improvement. This ratio becomes higher as larger model configuration sizes are used and caching effects slow down the workstation, whereas the real memory of the CM5 does not affect scaling.

However, while there is effectively no overhead in transferring data within a pure Java implementation on the client, the overhead of transferring data to and from the CM5, invoking the program, and running it, even with no other users and no queueing, is approximately 8 seconds. This is just acceptable for near-interactive simulation in a decision support system. The break-even point for using the CM5 is therefore circa 270 iterations of our small system, at which point it is equally fast to run locally or remotely. Remote execution would be less useful on such a small simulated configuration as we have described here, although for larger systems, such as 1024^2, the CM5 once again significantly outperforms the capabilities of a Java client.
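For concreteness, a nearest-neighbour cellular automaton step of the general kind timed above might look like the following in pure Java. The two-state "ignition" rule is an invented stand-in, since the paper does not specify the actual bush-fire rules:

```java
// Illustrative nearest-neighbour cellular automaton step on a 128x128
// mesh, of the general kind benchmarked above. The update rule (a cell
// ignites if any von Neumann neighbour is burning) is an invented
// stand-in for the unspecified bush-fire rules.
public class CellularAutomaton {
    static final int N = 128;

    static int[][] step(int[][] grid) {
        int[][] next = new int[N][N];
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                // Periodic boundary conditions on the regular mesh.
                int up    = grid[(i + N - 1) % N][j];
                int down  = grid[(i + 1) % N][j];
                int left  = grid[i][(j + N - 1) % N];
                int right = grid[i][(j + 1) % N];
                int burningNeighbours = up + down + left + right;
                next[i][j] = (grid[i][j] == 1 || burningNeighbours > 0) ? 1 : 0;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        int[][] grid = new int[N][N];
        grid[N / 2][N / 2] = 1; // a single ignition point
        for (int iter = 0; iter < 270; iter++) { // the break-even count above
            grid = step(grid);
        }
    }
}
```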

5 Discussion and Conclusions

Recent performance figures for Just-in-Time (JIT) compilers for Java, such as MS Jview and Cafe Java [11], indicate Gauss-Jordan matrix inversion performance approximately a factor of 5 better than we report for Sun's JDK 1.1. Inevitably, Java implementations' performance ratings are improving, and with hardware support in the future they are likely to compare very favourably with native code on workstations and PCs. It may also be that high-performance mechanisms will be found to run Java code on supercomputers; this is an exciting possibility. Nevertheless, in the foreseeable future it is likely that there will continue to exist significant and measurable trade-off points beyond which it is worthwhile to invoke remote supercomputers from client programs.

We believe the encapsulation of supercomputer resources as remote services available to client programs is an important paradigm for present-day computing, irrespective of what language the client programs are written in. The Java language and environment have a significant advantage over other systems in providing many user-accessible mechanisms for building client/server programs. Java client programs therefore appear a particularly important target to support. The range of mechanisms available for this perform best in different granularity regimes, as we have discussed. The client/server approach provides a convenient mechanism to allow legacy supercomput-

ers to remain useful, as well as being an exciting new way to link together heterogeneous applications on new supercomputers to obtain good performance on the most efficient platform for a given task.

In this paper we have largely ignored I/O effects. However, there appears to be a need for work to quantify the I/O performance of Java Virtual Machines on various platforms, and in particular to investigate the efficiency of various file buffering techniques in the context of remote computation. Some work has already been done to investigate how parallel computing techniques can be applied within Java bytecode [2]. We hope to experiment with tools such as Gannon's javab to improve the performance of kernel modules such as those described in this paper.

We have developed a simple model for understanding the trade-off points for using local and remote computation within DISCWorld. We are incorporating this model into the DISCWorld software itself to provide a mechanism for dynamic scheduling.

Acknowledgements

This work was carried out under the DHPCI project of the Research Data Network and Advanced Computational Systems Cooperative Research Centres (CRC), established under the Australian Government's CRC Program.

References

[1] R. Ben-Natan, "CORBA - A Guide to Common Object Request Broker Architecture", McGraw-Hill, 1995.

[2] A. J. C. Bik and D. B. Gannon, "javab - a Prototype Bytecode Parallelization Tool", Computer Science Department, Indiana University, 1997.

[3] H. Casanova and J. Dongarra, "Netsolve: A Network Server for Solving Computational Science Problems", Proc. Supercomputing '96.

[4] G. Cheng, G. C. Fox and K. A. Hawick, "A Scalable Paradigm for Effectively-Dense Matrix Formulated Applications", Proc. HPCN 1994, Munich, Germany, April 1994, Volume 2, p. 202.

[5] I. Foster, C. Kesselman and S. Tuecke, "The Nexus Approach to Integrating Multithreading and Communication", MCS Division, Argonne National Laboratory.

[6] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam, "PVM: Parallel Virtual Machine - A User's Guide and Tutorial for Networked Parallel Computing", MIT Press, 1994.

[7] D. A. Grove, A. J. Silis, J. A. Mathew and K. A. Hawick, "Authenticated Transmission of Discoverable Portable Code Objects in a Distributed Computing Environment", to appear in Proc. Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA) 1998.

[8] K. A. Hawick, A. L. Brown, P. D. Coddington, J. F. Hercus, H. A. James, K. E. Kerry, K. J. Maciunas, J. A. Mathew, C. J. Patten, A. J. Silis and F. A. Vaughan, "DISCWorld: An Integrated Data Environment for Distributed High-Performance Computing", Proc. 5th IDEA Workshop, Fremantle, February 1998.

[9] K. Moore, G. E. Fagg, A. Geist and J. Dongarra, "Scalable Networked Information Processing Environment (SNIPE)", Proc. Supercomputing SC'97, San Jose, November 15-21, 1997.

[10] S. Sekiguchi, M. Sato, H. Nakada, S. Matsuoka and U. Nagashima, "Ninf: Network-based Information Library for Globally High Performance Computing", Proc. Parallel Object-Oriented Methods and Applications, Santa Fe, February 1996.

[11] "Cafe Java JIT Speed Tests", see http://www.intergalact.com/java/speedtest.