First International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-1)
June 26th, 2004, Saint-Malo, France

Ville de Saint-Malo – Service Communication – Photos : Manuel CLAUZIER

Held in conjunction with 2004 ACM International Conference on Supercomputing (ICS ’04)

WORKSHOP PROCEEDINGS

First International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters COSET-1

Clusters are not only the most widely used general-purpose high-performance computing platform for scientific computing; according to recent results on the top500.org site, they have also become the dominant platform for high-performance computing today. While the cluster architecture is attractive with respect to price/performance, there still exists great potential for efficiency improvements at the software level. System software requires improvements to better exploit the cluster hardware resources. Programming environments need to be developed with both cluster efficiency and programmer efficiency in mind. Administrative processes need refinement, both for efficiency and for effectiveness, when dealing with numerous cluster nodes. The goal of this one-day workshop is to bring together a diverse community of researchers and developers from industry and academia to facilitate the exchange of ideas, to discuss the difficulties and successes in this area, and to discuss recent innovative results in the development of cluster-based operating systems and programming environments, as well as management tools for the administration of high-performance computing clusters.

COSET-1 Workshop Co-chairs

Stephen L. Scott
Oak Ridge National Laboratory
P. O. Box 2008, Bldg. 5600, MS-6016
Oak Ridge, TN 37831-6016
email: [email protected]

Christine A. Morin
IRISA/INRIA
Campus universitaire de Beaulieu
35042 Rennes cedex, France
email: [email protected]

Program Committee

Ramamurthy Badrinath, HP, India
Amnon Barak, Hebrew University, Israël
Jean-Yves Berthou, EDF R&D, France
Brett Bode, Ames Lab, USA
Ron Brightwell, SNL, USA
Emmanuel Cecchet, INRIA, France
Toni Cortès, UPC, Spain
Narayan Desai, ANL, USA
Christian Engleman, ORNL, USA
Graham Fagg, University of Tennessee, USA
Paul Farrell, Kent State University, USA
Andrzej Goscinski, Deakin University, Australia
Liviu Iftode, Rutgers University, USA
Chokchai Leangsuksun, Louisiana Tech University, USA
Laurent Lefèvre, INRIA, France
John Mugler, ORNL, USA
Raymond Namyst, Université de Bordeaux 1, France
Thomas Naughton, ORNL, USA
Hong Ong, University of Portsmouth, UK
Rolf Riesen, SNL, USA
Michael Schoettner, University of Ulm, Germany
Assaf Schuster, Technion, Israël

COSET-1 Program

9:00-9:05 Opening

9:05-10:05 Session 1: Cluster Operating System Services. Session chair: Christine Morin, INRIA
Parallel File System for Networks of Windows Workstations. Jose Maria Perez, Jesus Carretero, Felix Garcia, Jose Daniel Garcia, Alejandro Calderon, Universidad Carlos III de Madrid, Spain
An Application-Oriented Communication System for Clusters of Workstations. Thiago Robert C. Santos and Antonio Augusto Frohlich, LISHA, Federal University of Santa Catarina (UFSC), Brazil

10:05-10:35 Session 2: Application Management. Session chair: Christine Morin, INRIA
A First Step toward Autonomous Clustered J2EE Applications Management. Slim Ben Atallah, Daniel Hagimont, Sébastien Jean and Noël de Palma, INRIA Rhône-Alpes, France

10:35-11:00 Coffee break

11:00-12:30 Session 3: Highly Available Systems for Clusters. Session chair: Stephen Scott, ORNL
Highly Configurable Operating Systems for Ultrascale Systems. Arthur B. Maccabe and Patrick G. Bridges, The University of New Mexico, USA; Ron Brightwell and Rolf Riesen, Sandia National Laboratories, USA; Trammell Hudson, Operating Systems Research, Inc., USA
Cluster Operating System Support for Parallel Autonomic Computing. A. Goscinski, J. Silcock, M. Hobbs, Deakin University, Australia
Type-Safe Object Exchange Between Applications and a DSM Kernel. R. Goeckelmann, M. Schoettner, S. Frenz and P. Schulthess, University of Ulm, Germany

12:30-14:30 Lunch

14:30-16:00 Session 4: Cluster Single System Image Operating Systems. Session chair: Christine Morin, INRIA
SGI's Altix 3700, a 512p SSI System: Architecture and Software Environment. Jean-Pierre Panziera, SGI
OpenSSI. Speaker TBA
SSI-OSCAR. Geoffroy Vallée, INRIA

16:00-16:20 Coffee break

16:20-17:20 Session 4 (continued): Cluster Single System Image Operating Systems. Session chair: Geoffroy Vallée, INRIA
Millipede Virtual Parallel Machine for NT/PC Clusters. Assaf Schuster, Technion
Genesis Cluster Operating System. Andrzej Goscinski, Deakin University

17:20-18:00 Panel: SSI: Software versus Hardware Approaches. Moderator: Stephen Scott, ORNL

A Parallel File System for Networks of Windows Workstations

José María Pérez, Jesús Carretero, José Daniel García
Computer Science Department, Universidad Carlos III de Madrid

José María Pérez: Av. de la Universidad, 30, Leganés 28911, Madrid, Spain, +34 91 624 91 04, [email protected]
Jesús Carretero: Av. de la Universidad, 30, Leganés 28911, Madrid, Spain, +34 91 624 94 58, [email protected]
José Daniel García: Av. de la Universidad Carlos III, 22, Colmenarejo 28270, Madrid, Spain, +34 91 856 13 16, [email protected]

ABSTRACT

The use of parallelism in file systems makes high-performance I/O achievable in clusters and networks of workstations. Traditionally, this kind of solution was only available for UNIX systems and required special servers and special APIs, which led to the modification and/or recompilation of existing applications. This paper presents the first prototype of a parallel file system, called WinPFS, for the Windows platform. It is implemented as a new Windows file system integrated within the Windows kernel components, which means that no modification or recompilation of applications is needed to take advantage of parallel I/O. WinPFS uses shared folders (through the CIFS/SMB protocol) to access remote data in parallel. The proposed prototype has been developed under the Windows XP platform, and has been tested with a cluster of Windows XP nodes and a Windows 2003 Server node.

Categories and Subject Descriptors D.4.3 [Operating Systems]: File Systems Management – distributed file systems. C.2.4 [Computer-Communications Networks]: Distributed Systems – network operating systems.

Keywords Parallel I/O, Cluster, Windows.

1. INTRODUCTION

In recent years, the need for high-performance data storage has grown together with disk capacities and application needs [1][2][3]. One approach to overcome the bottleneck that characterizes typical I/O systems is the use of parallel I/O [1]. This technique allows the creation of large storage systems, by joining several storage resources, to increase the scalability and performance of the I/O system and to provide load balancing. The usage of parallelism in file systems relies on the fact that a distributed and parallel system consists of several nodes with storage devices. Performance and bandwidth can be increased if data accesses are exploited in parallel. Parallelism in file systems is obtained by using several independent server nodes, each one supporting one or more secondary storage devices. Data are striped among those nodes and devices to allow parallel accesses to different files, as well as parallel accesses to the same file. Initially, this idea was used in RAID [4] (Redundant Array of Inexpensive Disks). However, when a RAID is used in a traditional file server, the I/O bandwidth is limited by the server memory bandwidth. If several servers are used in parallel, performance can be increased in two ways: (1) allowing parallel access to different files by using several disks and servers; (2) striping data over distributed partitions, allowing parallel access to the data of the same file. However, current parallel file systems and parallel I/O libraries lack generality and flexibility for general-purpose distributed environments. Furthermore, parallel file systems generally do not use standard servers,

which makes it very difficult to integrate those systems in existing networks of workstations, due to the need to install new servers that are not easy to use and that are only available for specific platforms (usually some UNIX flavour). Moreover, those systems are implemented outside the operating system, so a new I/O API is needed to take advantage of parallel I/O, which requires the modification of existing applications. Most of the software related to high-performance I/O is only available for UNIX environments, or has been created as UNIX middleware. The work presented in this paper tries to fill this gap in the Windows environment, presenting a way to achieve parallel I/O on Windows platforms. In this paper, we present a parallel file system for Windows clusters and/or networks of workstations called WinPFS. This system integrates the existing servers in an organization, using protocols like NFS, CIFS or WebDAV, in order to obtain parallel I/O without complex installations, providing support for existing applications, high performance and low overhead thanks to its integration with the Windows kernel. Section 2 presents some work related to parallel I/O. Section 3 presents the design of WinPFS. Section 4 describes some evaluations of the first WinPFS prototype. Finally, Section 5 presents our conclusions and future work.

2. RELATED WORK

Three different parallel I/O software architectures can be distinguished:

• Application libraries basically consist of a set of highly specialized I/O functions. Those functions provide a powerful development environment for experts with specific knowledge of the problem to be modeled. A representative example is MPI-IO [5], an I/O extension of the standardized message passing interface MPI.

• Parallel file systems operate independently from applications, thus allowing more flexibility and generality. Examples of parallel file systems are Vesta [6], PIOUS [7], Galley [8], and ParFiSys [9].

• Intelligent I/O systems hide the physical disk access from the application developer by providing a transparent logical I/O environment. The user describes what she wants and the system tries to optimize the I/O requests by applying optimization techniques. This approach is used in ViPIOS [10].

The main problem with parallel I/O software architectures and parallel I/O techniques is that they often lack generality and flexibility, because they are tailor-made software for specific problems. On the other hand, parallel file systems are specially conceived for multiprocessors and multicomputers, and do not integrate appropriately in general-purpose distributed environments such as clusters of workstations. In recent years some file systems have emerged, such as PVFS [11], which can be used in Linux clusters, but they need the installation of special servers. Other solutions, such as Expand [12], can use existing standard NFS servers to accomplish parallel I/O, which implies that no new servers are needed in a cluster beyond the standard Linux NFS. Usually the client side is implemented as a user-level library and designed with UNIX in mind. Another important way to accomplish high-performance I/O is the usage of MPI-IO [5]. Some implementations of MPI have been adapted to Windows: MPICH, MPI-PRO, and WMPI. But usually the Windows I/O part is not optimized using parallel I/O techniques.

3. WINPFS DESIGN

The main motivation for the WinPFS design is to build a parallel file system for networks of Windows workstations using standard data servers. To satisfy this goal, the authors designed and implemented a parallel file system using CIFS/SMB servers. This paper describes the first prototype of WinPFS. The goals of the proposed architecture are:

• To integrate existing storage resources using shared folders (CIFS, WebDAV, NFS, etc.) rather than installing new servers. This is accomplished by using Windows redirectors.
• To simplify setup. Only a Windows driver is needed to make use of the system.
• To be easy to use. Existing applications must work without modification and without recompilation.
• To enhance performance, scalability and capacity of the I/O system, through the usage of parallel and distributed file system mechanisms: request splitting, balanced data allocation, load balancing, etc.

Next sections show the remote data access and file striping techniques used in WinPFS, the I/O request management, and the usage of WinPFS.

Figure 1. WinPFS installed in an Intranet

To accomplish most of the former goals, the proposed design is based on a new Windows kernel component that implements the basis of the file system, isolating users from the parallel file system, and on the use of protocols to connect to different network file systems. Figure 1 shows how a client application, using any Windows interface (for instance, Win32), can access distributed partitions in a cluster or network of workstations using WinPFS. Communication with the data servers can be performed with any available protocol through kernel components, called redirectors, which redirect requests to remote servers using a specific protocol. In our first prototype, we have only considered CIFS/SMB servers and the issues related to coordinating several of them. The position of WinPFS in the Windows kernel can be observed in Figure 2.

Figure 2. WinPFS in the Windows I/O subsystem

A user application uses an available interface (Win32, POSIX, DOS, etc.), whose calls are converted into system calls (Windows system services). The I/O Manager, which is in charge of identifying the driver that is going to deal with each request, receives the requests in kernel mode. WinPFS registers itself as a virtual remote file system, which allows the driver to receive the requests.

3.1 Remote Data Access and File Striping

From the user point of view, the Windows operating system provides access to remote data storage nodes through shared folders (local folders exported to remote computers). WinPFS creates a new shared folder, called \\PFS, on the client side. Therefore, users can access parallel files through the shared folder mechanism. From the kernel point of view, accesses to remote data are performed based on several mechanisms: CIFS (Common Internet File System), also known as the SMB protocol; UNC (Universal Naming Convention); and a special class of drivers, called redirectors, which receive I/O requests from users and send them to the appropriate servers. The parallel file system identifies all the requests trying to access a virtual path (\\PFS) and processes them. For example, to create a file in WinPFS, we must write CreateFile(\\PFS\file.txt) instead of CreateFile(C:\tmp\file.txt) or CreateFile(\\server1\tmp\file.txt). In order to achieve high performance, load balancing and higher storage capacity, a file is striped across several nodes. Our file system must coordinate the access to several of those remote folders in order to achieve load balancing and data distribution. Striping leads to the creation of one or more requests to access data, based on the buffer size and the current offset. The requests are then sent to one or several redirectors in order to access the remote storage nodes (see Figure 3).
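As an illustration of this transparency, the following minimal Win32 sketch (our example, not taken from the paper; the file name and flags are illustrative) shows that an application reaches WinPFS simply by using the \\PFS prefix, with no special API:

```cpp
// Hypothetical usage sketch: an ordinary Win32 application creating and
// writing a file on WinPFS. Nothing WinPFS-specific is linked in; the
// \\PFS prefix alone routes the requests to the parallel file system.
#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE h = CreateFileW(L"\\\\PFS\\file.txt",
                           GENERIC_READ | GENERIC_WRITE,
                           0, nullptr, CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) {
        std::printf("CreateFileW failed: %lu\n", GetLastError());
        return 1;
    }
    DWORD written = 0;
    WriteFile(h, "hello", 5, &written, nullptr);   // striping happens in the kernel
    CloseHandle(h);
    return 0;
}
```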

3.2 I/O Request Management

The Windows NT family has a layered I/O model, which allows the definition of several layers to process a request in the I/O subsystem. Each of those layers is a driver that can receive a request and pass it (or additional requests) to lower layers (drivers) in the I/O stack. This model allows the insertion of new layers (drivers) in the path of an I/O request, for example for encryption or compression. WinPFS takes advantage of this mechanism in order to provide parallel I/O functionalities.
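To make the layering concrete, here is a hedged WDK-style sketch (our illustration, not the WinPFS source) of the basic pass-down operation available to any driver in the I/O stack:

```cpp
// Minimal sketch of the layered I/O model: a driver forwards an IRP to the
// next driver below it, the mechanism WinPFS builds on to sit between
// applications and the redirectors. Assumes a kernel-mode build environment.
#include <ntddk.h>

NTSTATUS ForwardIrp(PDEVICE_OBJECT lowerDevice, PIRP irp)
{
    // Let the lower driver reuse the current stack location unchanged.
    IoSkipCurrentIrpStackLocation(irp);
    return IoCallDriver(lowerDevice, irp);
}
```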

Figure 3. Serving a request to several remote servers

To support this model, the Windows I/O subsystem presents two major features: the I/O Manager and I/O request packets (IRPs). The I/O Manager is in charge of receiving requests from the user in the form of NT services (system calls), creating an IRP to describe each request and delivering it to the appropriate device, which has an associated driver (in our case WinPFS). All I/O requests are delivered to drivers as IRPs. That way, the I/O subsystem presents a consistent interface to all kernel-mode drivers. This interface includes the typical I/O operations: open, close, read, write, etc. [13]. Apart from creating the IRP, the I/O Manager must identify the device and kernel component that is going to complete a request. In the case of remote storage, this work is supported by the MUP (Multiple UNC Provider), which identifies the kernel component (network redirector, or WinPFS) in charge of a specific network name (Figure 4, steps 2-3). Once the driver is identified (in our case WinPFS), the IRP is passed to it (Figure 4, step 4). Then WinPFS creates one or several subrequests to be sent to remote servers and/or to access local information (cache, cached metadata, etc.). The way in which the subrequests are created depends on the kind of request, which can be classified in the following categories:

• Create: The requests (IRPs) are replicated and sent to each server on which the file is going to be distributed.

• Read, write: The main request is split into smaller subrequests that are sent to the corresponding servers. For example, if we want to read 256 KB and the striping unit is 64 KB, four subrequests of 64 KB are created, one for each of four shared folders (see the sketch after this list).

• Create directory: The requests are replicated in all the shared folders, in order to keep the directory tree consistent across all the shared folders. This means that if we want to create a directory \\PFS\tmp, this directory is created in every shared folder: \\server1\share1\tmp, \\server1\share2\tmp, \\server2\share1\tmp, …

• Metadata management, control, security: This kind of request needs a different approach. Some of them do not require splitting requests and/or accessing the remote servers.
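The arithmetic behind the read/write splitting can be modeled in a few lines. The following runnable sketch is our illustration (not WinPFS source); the round-robin placement follows the policy described later in the text, and the server count is an assumption:

```cpp
// Model of splitting one request at 64 KB stripe boundaries and mapping
// each piece round-robin onto the shared folders.
#include <algorithm>
#include <cstddef>
#include <cstdio>

constexpr std::size_t STRIPE = 64 * 1024;   // striping unit from the paper

void split_request(std::size_t offset, std::size_t length, int servers)
{
    for (std::size_t pos = offset; pos < offset + length; ) {
        std::size_t stripe_idx = pos / STRIPE;
        std::size_t stripe_end = (stripe_idx + 1) * STRIPE;
        std::size_t chunk      = std::min(stripe_end, offset + length) - pos;
        int         server     = static_cast<int>(stripe_idx % servers);  // round-robin
        std::size_t srv_offset = (stripe_idx / servers) * STRIPE + (pos % STRIPE);
        std::printf("subrequest: server %d, offset %zu, %zu bytes\n",
                    server, srv_offset, chunk);
        pos += chunk;
    }
}

int main()
{
    split_request(0, 256 * 1024, 4);  // the 256 KB example: four 64 KB subrequests
    return 0;
}
```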

As an example, the basic steps to create/open a file are presented below (Figure 4):

0. The I/O Manager receives a request and creates an IRP. This request contains the file name in the form \\PFS\...
1. The I/O Manager sends the IRP to the MUP.
2. The MUP module has to look for a redirector (network file system) that recognizes the \\PFS string as a network name. The MUP asks all the redirectors until one answers affirmatively.
3. The MUP module indicates to the I/O Manager that the request must go to the redirector that recognized the network name. For this reason, our driver is created as a redirector that recognizes all the requests with the prefix \\PFS.
4. The I/O Manager sends the request to WinPFS.
5. The request packet received is split into several parallel subrequests, also in the form of IRPs, which are sent to the redirectors (CIFS, NFS, etc.). In order to create the subrequests, the driver has to know where the data of the parallel file are stored, what servers are available, and which protocols can be used for each request.

Figure 4. WinPFS steps to serve a request

6. The redirectors send the requests to the remote servers, sending/receiving the data stripes for each server. In the example, data is striped using a round-robin policy. However, several policies can be used to allocate parallel file data on the remote servers.
7. Once the requests are served, the driver must join the results, waiting on a kernel event for the completion of all the subrequests.

Steps 0-3 are only needed when a file is created or opened. Once those requests have finished, the user receives a file handle (associated with a File Object in kernel space) that allows direct access to WinPFS. Two other important requests in a file system are reading and writing. Those requests come from the user with a buffer where the data are sent/received. That buffer must be split so that each server receives its part. In order to optimize the file system, the implementation must avoid copying any buffer. The Windows kernel provides the mechanisms needed to perform those operations without copies. The buffers that come from users are received in kernel structures called MDLs (Memory Descriptor Lists). WinPFS splits the request into subrequests by creating a new MDL,

called a Partial MDL; the buffers from the original MDL are mapped into the Partial MDLs, so no copy is needed. Then this new buffer (MDL) is sent to the appropriate redirector, which sends/receives data to/from the remote servers.
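As a concrete illustration of this zero-copy splitting, here is a hedged WDK-style sketch (our sketch, not the WinPFS source; the hand-off to a redirector is left as a placeholder comment):

```cpp
// Split a source MDL into 64 KB partial MDLs that alias the same physical
// pages, so no data is copied. Assumes a kernel-mode build environment.
#include <ntddk.h>

static const ULONG STRIPE = 64 * 1024;   // striping unit used by WinPFS

NTSTATUS SplitBuffer(PMDL sourceMdl)
{
    PUCHAR base   = (PUCHAR)MmGetMdlVirtualAddress(sourceMdl);
    ULONG  length = MmGetMdlByteCount(sourceMdl);

    for (ULONG off = 0; off < length; off += STRIPE) {
        ULONG chunk = (length - off < STRIPE) ? (length - off) : STRIPE;

        // Allocate only an MDL header; the pages it will describe are
        // already locked down for sourceMdl.
        PMDL partial = IoAllocateMdl(base + off, chunk, FALSE, FALSE, NULL);
        if (partial == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        // Map the relevant portion of the source MDL into the partial MDL.
        IoBuildPartialMdl(sourceMdl, partial, base + off, chunk);

        // ... build a subrequest IRP around 'partial' and send it to the
        // redirector owning this stripe; free the MDL on completion.
        IoFreeMdl(partial);   // freed immediately only in this sketch
    }
    return STATUS_SUCCESS;
}
```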

3.3 Using WinPFS

From the administration point of view, the installation of WinPFS only requires three steps:

• To install a driver on the client nodes.
• To share folders on the servers; this can be accomplished through the Explorer or the Windows administrative tools.
• To indicate the shared folders to be used by WinPFS in the registry of the client nodes.

From the user point of view, using the parallel file system only requires that all paths are prefixed with \\PFS, though we plan to implement the mechanisms needed to map the remote name (\\PFS) to common drive letters such as D:. Apart from using the naming convention \\PFS\file, nothing more is needed. WinPFS can be used with the Win32 API, POSIX, cygwin, or whatever I/O API ultimately uses the Windows services. Other important features to take into account are:







• Caching: WinPFS caching relies on the caching mechanisms of the redirectors. Therefore, caching is currently limited to the Windows caching model, but in the future more advanced caching algorithms could be implemented. In addition, the caching mechanisms can be disabled through the Win32 API.

• Security and authentication: Security and authentication issues are solved by the operating system. If we work in a Windows domain there are no authentication problems, because the domain users can access all the resources. Of course, access to shared folders (including WinPFS parallel partitions) can be controlled by indicating which users can access each resource; this is accomplished with the Windows security model. If we want to use servers across several untrusted domains, or in workgroups, some changes must be made to our prototype to incorporate the authentication mechanisms.

• Data consistency between clients: Another important feature is the consistency between several clients accessing a parallel/distributed file. At the moment, this is handled by the default mechanism used by the CIFS redirector. The CIFS protocol uses a mechanism called oplocks (opportunistic locks) that enables the protocol to maintain consistency between clients [14].

4. EVALUATION

In order to measure the performance of the first prototype of WinPFS, we have run some evaluation tests. The test creates a file of 100 MB that is written sequentially and then read sequentially using a fixed buffer size (several executions are performed with different buffer sizes). We have disabled the client cache with the option FILE_FLAG_NO_BUFFERING in the creation of the file (CreateFile); we do not need special features such as IOCTLs of any kind, since the Windows API provides this feature and others. With this test, we want to measure the performance of accessing remote disks through the network (write) and of accessing the servers' caches (read). We write the file with WinPFS, so the file is striped and sent to several servers that write it to disk, and then we read it back from the servers, where the data are kept in caches. We have joined two clusters of four nodes (see Figure 5). Each node is a biprocessor Pentium III at 1 GHz with 1 GB of main memory, 200 GB disks and a Gigabit Ethernet network, with two 3Com Gigabit Ethernet switches and four nodes connected to each of them.
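The test loop can be sketched as follows (our reconstruction from the description above, not the authors' code; the file name and the particular buffer size are illustrative):

```cpp
// Sequential write then sequential read of a 100 MB file on WinPFS with
// client caching disabled. FILE_FLAG_NO_BUFFERING requires sector-aligned
// buffers, hence the page-aligned VirtualAlloc allocation.
#include <windows.h>

int main()
{
    const DWORD total   = 100 * 1024 * 1024;   // 100 MB file
    const DWORD bufSize = 64 * 1024;           // one of the tested sizes (1 KB .. 1 MB)
    void* buf = VirtualAlloc(nullptr, bufSize, MEM_COMMIT, PAGE_READWRITE);

    HANDLE h = CreateFileW(L"\\\\PFS\\bench.dat", GENERIC_READ | GENERIC_WRITE, 0,
                           nullptr, CREATE_ALWAYS, FILE_FLAG_NO_BUFFERING, nullptr);

    DWORD n = 0;
    for (DWORD done = 0; done < total; done += bufSize)   // write phase
        WriteFile(h, buf, bufSize, &n, nullptr);

    SetFilePointer(h, 0, nullptr, FILE_BEGIN);
    for (DWORD done = 0; done < total; done += bufSize)   // read phase
        ReadFile(h, buf, bufSize, &n, nullptr);

    CloseHandle(h);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
```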

Figure 5. Evaluation Infrastructure

From the operating system point of view, a Windows domain was created with one computer running Windows 2003 Server and seven other computers running Windows XP Professional. The test was executed with 8 clients running the application simultaneously and with different configurations of the I/O system. First, the evaluation was done with the simple shared-folder mechanism available by default in Windows (CIFS). This allowed us to evaluate the performance provided by one server with different numbers of clients. Then, we tested WinPFS with different numbers of servers in parallel: PFS88 (8 servers used in parallel), PFS44 (4 servers used in parallel), and PFS84 (4 servers used in parallel, selected from a set of 8 servers). Figure 6 shows the results obtained in the write part of the test with 8 clients (one application running on each node). As can be seen, the performance obtained by WinPFS is higher than that of a single server attending 8 clients. With a single CIFS server, 40 Mbits/s are achieved with 64 KB and 128 KB application buffers. WinPFS obtains 150 Mbits/s with four servers (PFS44), 200 Mbits/s with four servers selected from eight (PFS84), and almost 250 Mbits/s using eight servers (PFS88). The reason is that the I/O requests are distributed among 8 servers instead of one. Therefore, in the single-server case, the I/O throughput is the one provided by a single disk. Figure 7 shows the performance obtained in the read part of the test. In this case, the performance is much higher with WinPFS (1200 Mbits/s). The CIFS server obtains good results (600 Mbits/s)

because all the data are in main memory, so no disk access is required.

Figure 6. Write Results for 8 clients (Write to remote disks)

Figure 8. Write Speedup (CIFS vs. WinPFS 88)

Figure 7. Read Results for 8 clients (Read from Servers' Cache)

Figure 9. Read Speedup (CIFS vs. WinPFS 88)

One thing that must be remarked is that the network connecting the two clusters imposed a limit of 1 Gbit/s (120 MB/s) on the system. If we used one server in one subcluster and the client in the other subcluster, we could never exceed this limit. However, as can be seen in Figure 7, WinPFS can exceed this limit thanks to the use of several servers in parallel (some from one cluster and others from the second one). To clarify the results obtained, Figure 8 and Figure 9 show the speedup obtained with the PFS88 configuration (8 servers in parallel) with respect to the simple CIFS server. As can be seen, the speedup in the read part is smaller (less than a 100% improvement). As commented before, this is because data were served from the server caches without disk accesses.

The write test does involve disk accesses, and the results show that under those circumstances a speedup factor of 5 (a 500% improvement) is achievable, reaching almost a 700% improvement with large buffers. We think that reads could achieve the same level of improvement if the data were flushed to disk. As can be seen, with one client the speedup is about a factor of one for writes and disappears for reads; WinPFS can even perform worse with larger application buffers. The latter happens because WinPFS splits the buffer into 64 KB stripes, so we are limited to the performance obtained with 64 KB buffers. This could be solved with a larger striping unit, but that may limit the parallelism of the system.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we have described the design of a parallel file system for clusters of Windows workstations. This system provides parallel I/O features that allow the integration of existing storage resources by sharing folders and using a driver (WinPFS). Our approach complements the Windows kernel by routing all the requests for the network name \\PFS to a driver that splits the requests and uses several data servers in parallel. The integration of the file system into the kernel provides higher performance than other solutions that use libraries, and it makes no difference from the user point of view, so users can execute their applications without rewriting or recompiling them. WinPFS achieves high scalability, availability and performance by using several servers in parallel. WinPFS also allows us to obtain a high-capacity storage node from a set of workstations: in the tests, a 1.6 TB system was built using 200 GB disks. In the tests, the performance limits of the system were two: the disks in write operations, and the network bandwidth in read operations. Through the use of redirectors, a client can stripe files over CIFS, NFS and WebDAV servers, regardless of whether those servers reside on Windows or UNIX servers, NAS, or any other storage system supported through redirectors. Future work addresses dynamically adding and removing storage nodes in the cluster, data allocation and load balancing for heterogeneous distributed systems, and the parallel usage of heterogeneous resources and protocols (CIFS, NFS, WebDAV, etc.) in a network of workstations, considering the implications for performance, management and security. In addition, we will use the Active Directory service provided by Windows to create a metadata repository, so that all clients can obtain a consistent image of the parallel files.

6. ACKNOWLEDGMENTS

This work has been partially supported by Microsoft Research Europe, by the Community of Madrid under contract 07T/0020/2003, and by the Spanish Ministry of Science and Technology under contract TIC2003-01730.

7. ADDITIONAL AUTHORS

Additional authors: Felix Garcia ([email protected]) and Alejandro Calderón ([email protected]).

8. REFERENCES

[1] P. M. Chen and D. A. Patterson. Maximizing Performance in a Striped Disk Array. In Proceedings of the 17th Annual International Symposium on Computer Architecture, ACM SIGARCH Computer Architecture News, 1990.

[2] J. Gray. Data Management: Past, Present, and Future. IEEE Computer, Vol. 29, No. 10, 1996, pp. 38-46.

[3] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke. The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23:187-200, 2001.

[4] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of ACM SIGMOD, pages 109-116, Chicago, IL, June 1988.

[5] MPI Forum. MPI-2: Extensions to the Message-Passing Interface. 1997. http://www.mpi-forum.org

[6] P. Corbett, S. Johnson, and D. Feitelson. Overview of the Vesta Parallel File System. ACM Computer Architecture News, vol. 21, no. 5, pp. 7-15, Dec. 1993.

[7] S. A. Moyer and V. S. Sunderam. PIOUS: A Scalable Parallel I/O System for Distributed Computing Environments. In Proceedings of the Scalable High-Performance Computing Conference, 1994, pp. 71-78.

[8] N. Nieuwejaar and D. Kotz. The Galley Parallel File System. In Proceedings of the 10th ACM International Conference on Supercomputing, May 1996.

[9] J. Carretero, F. Perez, P. de Miguel, F. Garcia, and L. Alonso. Performance Increase Mechanisms for Parallel and Distributed File Systems. Parallel Computing: Special Issue on Parallel I/O Systems, Elsevier, no. 3, pp. 525-542, Apr. 1997.

[10] T. Fuerle, E. Schikuta, and H. Wanek. Meta-ViPIOS: Harness Distributed I/O Resources with ViPIOS. Journal of Research Computing and Systems, 4(2):124-142, 1999.

[11] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A Parallel File System for Linux Clusters. Tech. Rep. ANL/MCS-P804-0400, 2000.

[12] F. Garcia, A. Calderon, J. Carretero, J. M. Perez, and J. Fernandez. The Design of the Expand Parallel File System. International Journal of High Performance Computing Applications, 2003.

[13] R. Nagar. Windows NT File System Internals: A Developer's Guide. O'Reilly, 1997, p. 158.

[14] SNIA (Storage Networking Industry Association). Common Internet File System (CIFS) Technical Reference, Revision 1.0, 2002, pp. 6-10.

An Application-Oriented Communication System for Clusters of Workstations

Thiago Robert C. Santos and Antonio Augusto Frohlich
Laboratory for Software/Hardware Integration (LISHA)
Federal University of Santa Catarina (UFSC)
88049-900 Florianopolis - SC - Brazil, PO Box 476
Phone: +55 48 331-7552, Fax: +55 48 331-9770
{robert | guto}@lisha.ufsc.br
http://www.lisha.ufsc.br/ {robert | guto}

Abstract

The present paper proposes an application-oriented communication sub-system to be used in SNOW, a high-performance, application-oriented parallel-programming environment for dedicated clusters. The proposed communication sub-system is composed of a baseline architecture and a family of lightweight network interface protocols. Each of these protocols is built on top of the baseline architecture and can be tailored to satisfy the needs of specific classes of parallel applications. The family of lightweight protocols, along with the baseline architecture that supports it, constitutes a customizable component in EPOS, an application-oriented, component-based operating system that is the core of SNOW. The idea behind providing a set of low-level protocol implementations instead of a single monolithic protocol is that parallel applications running on clusters can improve their communication performance by using the most appropriate protocol for their needs.

Keywords: lightweight communication protocols, application-oriented operating systems, user-level communication.

1 Introduction

Clusters of commodity workstations are now commonplace in high-performance computing. In fact, commercial off-the-shelf processors and high-speed networks have evolved so much in recent years that most of the hardware features once used to characterize massively parallel processors (MPPs) are now available in clusters as well. Nonetheless, the majority of the clusters in use today rely on commodity run-time support systems (run-time libraries, operating systems, compilers, etc.) that have usually been designed in disregard of both parallel applications and hardware. In such systems, delivering a parallel API like MPI is usually achieved through a series of patches or middleware layers that invariably add overhead for applications. Therefore, it seems logical to suppose that a run-time support system specially designed to support parallel applications on clusters of workstations could considerably improve performance and also other software quality metrics (e.g. usability, correctness, adaptability). Our supposition that ordinary run-time support systems are inadequate to support high-performance computing is sustained by a number of research projects in the field focusing on the implementation of message-passing [8, 9] and shared-memory [10, 11, 12] middleware and on user-level communication [4, 5, 6]. If ordinary operating systems could match parallel applications' needs, delivering adequate run-time support for the most traditional programming paradigms with minimum overhead, much of this research would be hard to justify outside the realm of operating systems. Indeed,

the way ordinary operating systems handle I/O is largely based on multitasking concepts such as domain protection and resource sharing. This impacts the way recurring operations like system calls, CPU scheduling, and applications' data management are implemented, leaving little room for novel technological features [2]. Not surprisingly, system designers often have to push the operating system out of the way in order to implement efficient schedulers and communication systems for clusters. In addition to that, commodity operating systems usually target reconfigurability at standards conformance and device support, failing to comply with applications' requirements. Clusters have been quenching the industry's thirst for low-end supercomputers for years, as HPC service providers deploy cost-effective solutions based on cluster systems. There are all kinds of applications running on clusters today, ranging from communication-intensive distributed databases to CPU-hungry scientific applications. Having the chance to customize a cluster's run-time support system to satisfy particular applications' needs could improve the system's overall performance. Indeed, systems such as PEACE [13] and CHOICES [14] had already confirmed this hypothesis in the 90s. In this paper, we discuss the use of dedicated run-time support systems, or, more specifically, of dedicated communication systems, as effective alternatives to support communication-intensive parallel applications in clusters of workstations. The research that supports this discussion was carried out in the scope of the SNOW project [19], which aims at developing a high-performance, application-oriented parallel-programming environment for dedicated clusters. Actually, SNOW's run-time system comes from another project, namely EPOS, which comprises a repository of software components, an adaptive component framework, and a set of tools to build application-oriented operating systems on demand [16]. The remainder of this paper is structured as follows. Section 2 gives an overview of EPOS. Section 3 presents a redesign of EPOS communication system aimed at enhancing support for network interface protocols and describes the baseline architecture that supports these protocols. Section 4 elaborates on related work. Conclusions are presented in Section 5, along with directions for future work.

2 An Overview of EPOS

EPOS, the Embedded Parallel Operating System, is a highly customizable operating system developed using state-of-the-art software engineering techniques. EPOS consists of a collection of reusable and adaptable software components and a set of tools that support parallel application developers in "plugging" these components into an adaptive framework in order to produce a variety of run-time systems, including complete operating systems. Being a fruit of Application-Oriented System Design [15], a method that covers the development of application-oriented operating systems from domain analysis to implementation, EPOS can be customized to match the requirements of particular parallel applications. EPOS components, or scenario-independent system abstractions as they are called, are grouped in families and kept independent of the execution scenario by deploying aspect separation and other factorization techniques during the domain engineering process, illustrated in Figure 1. EPOS components can be adapted to be reused in a variety of execution scenarios. Usability is largely improved by hiding the details of a family of abstractions behind a hypothetical interface, called the family's inflated interface, and delegating the selection of proper family members to automatic configuration tools. An application written against inflated interfaces can be submitted to a tool that scans it searching for references to the interfaces, thus determining the features of each family that are necessary to support the application at run-time. This tool, the analyzer, outputs a specification of requirements in the form of partial component interface declarations, including the methods, types and constants that were used by the application. The primary specification produced by the analyzer is subsequently fed into a second tool, the configurator, which consults a build-up database to further refine the specification. This database holds information about each component in the repository, as well as dependencies and composition rules that are used by the configurator to build a dependency tree.

Figure 1: An overview of Application-Oriented System Design.

Additionally, each component in the repository is tagged with a "cost" estimation, so that the configurator will choose the "cheapest" option whenever two or more components satisfy a dependency. The output of the configurator consists of a set of keys that define the binding of inflated interfaces to abstractions and activate the scenario aspects and configurable features identified as necessary to satisfy the constraints dictated by the target application or by the configured execution scenario.
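The compile-time binding of an inflated interface to a selected family member can be illustrated with a small sketch (our illustration, not EPOS source code; all names are hypothetical):

```cpp
// A configuration key (the Traits specialization) binds the family's
// inflated interface to the member chosen by the configurator.
#include <iostream>

struct MemberA { static const char* name() { return "member A"; } };
struct MemberB { static const char* name() { return "member B"; } };

struct Network;                        // the family's inflated-interface tag

template<class Family> struct Traits;  // keys produced by the configurator
template<> struct Traits<Network> { using Member = MemberA; };

// Applications program against the inflated interface only; the framework
// resolves it to the selected member at compile time.
struct Network : Traits<Network>::Member {};

int main() { std::cout << Network::name() << '\n'; }  // prints "member A"
```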

Figure 2: An overview of EPOS generation tools.

The last step in the run-time system generation process is accomplished by the generator. This tool translates the keys produced by the configurator into parameters for a statically metaprogrammed component framework and triggers the compilation of a tailored system instance. Figure 2 gives an overview of the whole procedure.

3 EPOS Communication System

EPOS communication system is designed around three major families of abstractions: communicator, channel, and network. The communicator family encompasses communication end-points such as link, port, and mailbox, thus acting as the main interface between the communication system and application programs (the component nature of EPOS enables individual elements of the communication system to be reused in isolation, even directly by applications, so the communicator is not the only visible interface in the communication system). The second family, channel, features communication protocols, so that application data fed into the communication system via a communicator gets delivered at the destination communicator accordingly. A channel implements a communication protocol that would be classified at level four (transport) according to the ISO/OSI reference model. The third family in EPOS communication system, network, is responsible for abstracting distinct network technologies through a common interface, which each member of the family is allowed to extend to account for advanced features, thus keeping the communication system itself architecture-independent and allowing for flexible combinations of protocols and network architectures.

Figure 3: An overview of EPOS communication system.

Previous partial implementations of EPOS communication system for the Myrinet high-speed network architecture confirmed the lightness of its core, delivering unprecedented bandwidth and latency to parallel applications running on SNOW [17]. Nonetheless, the original design of EPOS communication system makes it hard to split the implementation of network interface protocols [1] between the processors in the host machine and in the network adapter. Besides, it is very difficult to specify a single network interface protocol that is optimal for all parallel applications, since different applications impose different traffic patterns on the underlying network. Instead of developing a single, highly complex, all-encompassing protocol, it appears more feasible to construct an architecture that permits fine-grain selection and dynamic configuration of precisely specified low-level lightweight protocol mechanisms. In an application-oriented environment, this set of low-level protocols can be used to customize communication according to applications' needs. EPOS design allows the several network interface protocols that arise from design decisions related to network features to be grouped into a software component with a defined interface, a family, that can be easily accessed by the communication system of the OS. The EPOS framework implements mechanisms for fine-grain selection of modules according to applications' needs. These same mechanisms can be used to select the low-level lightweight protocols that best satisfy the applications' communication requirements. Besides, an important step towards an efficient, application-oriented communication system for clusters is to better understand the relation between the design decisions in low-level communication software and the performance of high-level applications. Grouping the different low-level implementations of communication protocols in a component coupled with the communication system has the additional advantage of allowing an application to experiment with different communication schemes, collecting metrics in order to identify the best scheme for its needs. In addition, structuring communication in such a modular fashion enhances maintainability and extensibility.

3.1 Myrinet baseline architecture

The baseline communication architecture that supports the low-level lightweight protocols for the Myrinet networking technology must be simple and flexible enough not to hinder the design and implementation


of specific protocols. The highest bandwidth and lowest latency possible are desired, since complex protocol implementations will definitely affect both. User-level communication was academia's best answer to the lack of efficient communication protocols for modern, high-performance networks. The baseline architecture described in this section follows the general concepts behind successful user-level communication systems for Myrinet. Figure 4 exhibits this architecture, highlighting the data flow during communication as well as the host and NIC memory layout.

Figure 4: The Myrinet Family baseline architecture.

The NIC memory holds the six buffers that are used during communication. The Send Ring and Receive Ring are circular buffers that hold frames before they are accessed by the Network-DMA engine, which is responsible for sending/receiving frames to/from the Myrinet network. Rx DMA Requests and Tx DMA Requests are circular chains of DMA control blocks, used by the Host-DMA engine for transferring frames between host and NIC memory. The Rx FIFO Queue and Tx FIFO Queue are circular FIFO queues used by the host processor and the LANai, Myrinet's network interface processor, to signal to each other the arrival of a new frame. The size of these buffers affects communication performance and reliability, and the choice of sizes is influenced by the host's run-time system, memory space considerations, and hardware restrictions.

Much of the overhead observed in traditional protocol implementations is due to memory copies during communication. Some network technologies provide direct host-to-network and network-to-host data transfers, but Myrinet requires that all network traffic go through NIC memory. Therefore, at least three copies are required for each message: from host memory to NIC memory on the sending side, from NIC to NIC, and from NIC memory to host memory on the receiving side. Write-combining, the start-up overhead of DMA transfers, and the fact that a DMA control block has to be written in NIC memory for each Host-DMA transaction make write PIO more efficient than DMA for small frames. The baseline architecture therefore copies data from host to NIC using programmed I/O for small frames (less than 512 bytes) and host-NIC DMA for large frames. Since reads over the I/O bus are much slower than writes, the baseline architecture uses DMA for all NIC-to-host transfers.

During communication, messages are split into frames of fixed size that are pushed into the communication pipeline. The frame size that minimizes the transmission time of the entire message in the pipelined data transfer is calculated [18], and the baseline architecture uses this value to fragment messages. Besides, the maximum frame size (MTU) is dynamically configurable.
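The fragmentation and PIO/DMA policy just described can be modeled in a few lines (our sketch, not the actual implementation; the 512-byte threshold comes from the text, while the MTU value is illustrative):

```cpp
// Send-side policy model: cut a message into frames of at most the MTU,
// then move each frame to NIC memory via write PIO (small frames) or via
// a host-NIC DMA request (large frames).
#include <algorithm>
#include <cstddef>
#include <cstdio>

constexpr std::size_t PIO_THRESHOLD = 512;  // write PIO below this size
std::size_t mtu = 4096;                     // dynamically configurable MTU

void send_message(std::size_t bytes)
{
    for (std::size_t off = 0; off < bytes; off += mtu) {
        std::size_t frame = std::min(mtu, bytes - off);
        if (frame < PIO_THRESHOLD)
            std::printf("frame @%zu: %zu B copied to the Send Ring via write PIO\n",
                        off, frame);
        else
            std::printf("frame @%zu: %zu B posted as a Tx DMA request\n",
                        off, frame);
    }
}

int main()
{
    send_message(8 * 1024 + 300);  // two DMA frames plus one small PIO frame
    return 0;
}
```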

For each frame, the sender host processor uses write PIO to fill up an entry in the Tx DMA Requests chain (for large frames) or to copy the frame directly to the Send Ring in NIC memory (for small frames) (1). It then triggers a doorbell, creating a new entry in the Tx FIFO Queue and signaling to the LANai processor that a new frame must be sent. For large frames, the transmission of frames between host and NIC memory is carried out asynchronously by the Host/NIC DMA engine (1) and the frame is sent as soon as possible by the LANai after the corresponding DMA finishes (2). Small frames are sent as soon as the doorbell is rung, since at that point the frame is already in NIC memory. A similar operation occurs on the receiving side: when a frame arrives from the network, the LANai receives it and fills up an entry in the Rx DMA Requests chain. The message is assembled asynchronously in the Unsolicited Ring circular buffer in host memory (3). The receiving side is responsible for copying the whole message from the Unsolicited Ring before it is overwritten by other messages (4). Note that specific protocol implementations can avoid this last copy using rendezvous-style communication, where the receiver posts a receive request and provides a buffer before the message is sent; a credit scheme, where the sender is required to hold credits for the receiver before it sends a packet; or some other technique, achieving the optimal three copies.

The host memory layout is defined by the operating system being used. Besides, the Myrinet NIC imposes some constraints on the usage of its resources that must be addressed by the OS. The most critical one relates to the Host/NIC DMAs: the Host-DMA engine can only access contiguous pages pinned in physical memory. Most communication system implementations for Myrinet address this issue by letting applications pin/unpin the pages that contain their buffers on-the-fly during communication or by using a pinned copy block. The problem with these approaches is that they add extra overhead, since pinning/unpinning memory pages requires system calls, which imply context saving and switching, and using a pinned copy block adds extra data copies in host memory. In EPOS, where swapping can be left out of the run-time system by mapping logical address spaces contiguously in physical memory, this issue does not affect the overall communication.
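The doorbell signaling through the circular FIFO queues can be illustrated with a toy model (our sketch; the queue size and field names are hypothetical, and the real queues live in NIC memory and are polled by the Myrinet control program):

```cpp
// Single-producer/single-consumer ring modeling the Tx FIFO Queue: the host
// pushes a doorbell per ready frame; the LANai pops and transmits.
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

template<typename T, std::size_t N>
class Ring {
    std::array<T, N> slots{};
    std::size_t head = 0, tail = 0;   // consumer reads head, producer writes tail
public:
    bool push(const T& v) {           // false when full: the backpressure point
        if ((tail + 1) % N == head) return false;
        slots[tail] = v;
        tail = (tail + 1) % N;
        return true;
    }
    std::optional<T> pop() {          // empty when no doorbell is pending
        if (head == tail) return std::nullopt;
        T v = slots[head];
        head = (head + 1) % N;
        return v;
    }
};

struct Doorbell { std::uint32_t frame_slot; };  // which Send Ring entry is ready

int main()
{
    Ring<Doorbell, 64> tx_fifo;       // host -> LANai "Tx FIFO Queue"
    tx_fifo.push({7});                // host rings the doorbell for slot 7
    return tx_fifo.pop() ? 0 : 1;     // LANai drains it and sends the frame
}
```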

Figure 5: The Myrinet Family baseline architecture in a GNU/Linux host.

Figure 5 shows the memory layout and dynamic data flow of an implementation of the baseline architecture in a Myrinet GNU/Linux cluster. Issues such as address translation, kernel memory allocation and memory pinning had to be addressed in this implementation. Besides, a pinned copy block in kernel

memory is used to interface the Host/NIC DMA transfers, which adds one extra copy for each message on the sending side. Figure 6 exhibits a performance comparison between the GNU/Linux baseline architecture and the low-level driver GM (version 1.6.5) provided by the manufacturer of Myrinet. A round-trip time test was performed in order to compare the two systems' latency.

Figure 6: Comparison between the baseline architecture's and GM's latency (in microseconds) for different frame sizes (in bytes).

Many Myrinet protocols assume that the Myrinet network is reliable and, for that reason, provide no retransmission or time-out mechanism. Indeed, the risk of a packet being lost or corrupted in a Myrinet network is so small that reliability mechanisms can be safely left out of the baseline architecture. Alternative implementations that assume unreliable network hardware and recover from lost, corrupted, and dropped frames by means of time-outs, retransmissions, and hardware-supported CRC checks are addressed by specific protocol implementations, since different application domains may need different trade-offs between reliability and performance. The presented architecture may drop frames because of insufficient buffer space. The baseline architecture rests on a NIC hardware mechanism to partially solve this problem: backpressure, Myrinet's hardware link-level flow-control mechanism, is used to prevent overflow of the network interface buffers, stalling the sender until the receiver is able to drain frames from the network. More sophisticated flow-control mechanisms must be provided by specific protocol implementations, since specialized applications may require only limited flow control from the network, performing some kind of control on their own. Besides, the architecture supports only point-to-point messages. Multicast and broadcast are desirable, since they are fundamental components of collective communication operations. Lightweight protocols that provide these features could be implemented on top of point-to-point messages or using more efficient techniques [3]. Finally, the proposed baseline architecture provides no protection, since it targets parallel applications running on dedicated environments.

3.2 Myrinet low-level lightweight protocols

While the baseline architecture is closely related to the underlying networking technology, low-level lightweight protocols are designed according to the communication requirements of specific classes of parallel applications. The lightweight protocols in the Myrinet family are divided into two categories: infrastructure and high-performance protocols. Infrastructure protocols provide communication services that were left out of the baseline architecture: transparent multicasting, QoS, connection management, protection schemes, reliable delivery and flow-control mechanisms, among others. In order to keep latencies low, it would be desirable to efficiently execute the entire protocol stack, up to the transport layer, in hardware. Programmable network interfaces can be used to achieve that goal. Infrastructure protocols exploit the network processor to the maximum, using more elaborate Myrinet control programs (MCPs) in order to offload a broader range of communication tasks to the LANai. Communication performance is affected due to the trade-off between performance and MCP complexity, but for some specific classes of applications this is a small price to pay for the communication services provided. High-performance protocols deliver minimum latency and maximum bandwidth to the run-time system. They usually consist of minimal modifications to the baseline architecture required by applications, or of protocols that monitor the traffic patterns and dynamically adjust the baseline architecture's customization points in order to address dynamic changes in application requirements.

4 Related Works

There are several communication system implementations for Myrinet, such as AM, BIP, PM, and VMMC, to name a few. Although these communication systems share some common goals, performance being one of them, they have made very different decisions in both the communication model and the implementation, consequently offering different levels of functionality and performance. From the several published comparisons between these implementations, one can conclude that there is no single best low-level communication protocol, since the communication patterns of the whole run-time system (application and run-time support) influence the impact that the low-level implementation decisions of a given communication system have on applications' performance. Besides, run-time system specifics greatly influence the functionality of communication system implementations. While the Myrinet communication systems mentioned before try to deliver a generic, all-purpose solution for low-level communication, the main goal of the presented research is the customization of low-level communication software. The architecture we propose should be flexible enough to allow a broad range of the implementation decisions behind each of the several Myrinet communication systems to be supported as lightweight protocols. Although our work has focused on Myrinet, there are other networks to which the same concepts can be applied. Cluster interconnection technologies that are also implemented with a programmable NIC that can execute a variety of protocol tasks include DEC's Memory Channel, the interconnection network in the IBM SP series, and Quadrics' QsNet.

5 Conclusions

The widespread adoption of cluster systems brings up the necessity of improvements in the software environment used in cluster computing. Cluster system software must be redesigned to better exploit clusters' hardware resources and to keep up with applications' requirements. Parallel-programming environments need to be developed with both cluster and application efficiency in mind.

In this paper we outlined the design of a communication sub-system based on low-level lightweight protocols, along with the design decisions related to this sub-system's baseline architecture for the Myrinet networking technology. Experiments are being carried out to determine the best values for the architecture's customization points under different traffic-pattern conditions. We intend to create an efficient, application-oriented communication system for clusters, and the redesign of the EPOS communication system was one more step towards that goal. We believe that it is necessary to better understand the relation between the design decisions in low-level communication software and the performance of high-level applications. The proposed lightweight communication protocols, along with the application-oriented run-time system provided by EPOS, will be used to evaluate how different low-level communication schemes impact parallel applications' performance.

References

[1] Raoul A. F. Bhoedjang, Tim Ruhl, and Henri E. Bal. User-level Network Interface Protocols. IEEE Computer, 31(11):53–60, November 1998.

[2] IEEE Task Force on Cluster Computing. Cluster Computing White Paper, Mark Baker, editor, online edition, December 2000. [http://www.dcs.port.ac.uk/~mab/tfcc/WhitePaper].

[3] M. Gerla, P. Palnati, and S. Walton. Multicasting Protocols for High-Speed, Wormhole-Routing Local Area Networks. In Proceedings of SIGCOMM, pages 184–193, 1996.

[4] Loic Prylli and Bernard Tourancheau. BIP: A New Protocol Designed for High Performance Networking on Myrinet. In Proceedings of the International Workshop on Personal Computer based Networks of Workstations, Orlando, USA, April 1998.

[5] Steven S. Lumetta, Alan M. Mainwaring, and David E. Culler. Multi-Protocol Active Messages on a Cluster of SMPs. In Proceedings of Supercomputing '97, San Jose, USA, November 1997.

[6] Hiroshi Tezuka, Atsushi Hori, Yutaka Ishikawa, and Mitsuhisa Sato. PM: An Operating System Coordinated High Performance Communication Library. In High-Performance Computing and Networking, volume 1225 of Lecture Notes in Computer Science, pages 708–717. Springer, April 1997.

[7] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the 15th ACM SOSP, pages 40–53, Copper Mountain, Colorado, December 1995.

[8] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6):789–828, September 1996.

[9] Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster Environment for MPI. In Proceedings of the Supercomputing Symposium, pages 379–386, 1994.

[10] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. User-level Interprocess Communication for Shared Memory Multiprocessors. ACM Transactions on Computer Systems, 9(2):175–198, May 1991.

[11] Jorg Cordsen. Virtuell Gemeinsamer Speicher. PhD thesis, Technical University of Berlin, Berlin, Germany, 1996.

[12] H. Hellwagner, W. Karl, M. Leberecht, and H. Richter. SCI-Based Local-Area Shared-Memory Multiprocessor. In Proceedings of the International Workshop on Advanced Parallel Processing Technologies (APPT'95), Beijing, China, September 1995.

[13] Wolfgang Schroder-Preikschat. The Logical Design of Parallel Operating Systems. Prentice-Hall, Englewood Cliffs, U.S.A., 1994.

[14] Roy H. Campbell, Nayeem Islam, and Peter Madany. Choices, Frameworks and Refinement. Computing Systems, 5(3):217–257, 1992.

[15] Antonio Augusto Frohlich. Application-Oriented Operating Systems. Number 17 in GMD Research Series. GMD - Forschungszentrum Informationstechnik, Sankt Augustin, August 2001.

[16] Antonio Augusto Frohlich and Wolfgang Schroder-Preikschat. High Performance Application-Oriented Operating Systems – the EPOS Approach. In Proceedings of the 11th Symposium on Computer Architecture and High Performance Computing, pages 3–9, Natal, Brazil, September 1999.

[17] Antonio Augusto Frohlich and Wolfgang Schroder-Preikschat. On Component-Based Communication Systems for Clusters of Workstations. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2001), pages 640–645, Brisbane, Australia, May 2001.

[18] Antonio Augusto Frohlich, Gilles Pokam Tientcheu, and Wolfgang Schroder-Preikschat. EPOS and Myrinet: Effective Communication Support for Parallel Applications Running on Clusters of Commodity Workstations. In Proceedings of the 8th International Conference on High Performance Computing and Networking, pages 417–426, Amsterdam, The Netherlands, May 2000.

[19] Antonio Augusto Frohlich, Philippe Olivier Alexander Navaux, Sergio Takeo Kofuji, and Wolfgang Schroder-Preikschat. SNOW: A Parallel Programming Environment for Clusters of Workstations. In Proceedings of the 7th German-Brazilian Workshop on Information Technology, Maria Farinha, Brazil, September 2000.


A First Step towards Autonomous Clustered J2EE Applications Management

Slim Ben Atallah (1), Daniel Hagimont (2), Sébastien Jean (1), Noël de Palma (1)
(1) Assistant professor   (2) Senior researcher

INRIA Rhône-Alpes – Sardes project
655 avenue de l'Europe, Montbonnot Saint Martin
38334 Saint Ismier Cedex, France
Tel: 33 4 76 61 52 00, Fax: 33 4 76 61 52 52

[email protected]

ABSTRACT

A J2EE application server is composed of four tiers: a web front-end, a servlet engine, an EJB server and a database. Clusters allow for the replication of each tier instance, thus providing an appropriate infrastructure for high availability and scalability. Clustered J2EE application servers are built from clusters of each tier and provide J2EE applications with the transparent view of a single server. However, such applications are complex to administrate and often lack deployment and reconfiguration tools. This paper presents JADE, a Java-based environment for the deployment of clustered J2EE applications. JADE is a first attempt at providing a global environment that allows deploying J2EE applications on clusters. Beyond JADE, we aim to define an infrastructure that allows managing, as autonomously as possible, a wide range of clustered systems at different levels (from the operating system to applications).

General Terms
Management, Experimentation.

Keywords
Clustered J2EE Applications, Deployment, Configuration.

1. INTRODUCTION

J2EE-driven architectures are now an increasingly convenient way to build efficient web-based e-commerce applications. Although this multi-tier model, as is, suffers from a lack of scalability, it benefits from clustering techniques that, by means of replication and consistency mechanisms, increase application bandwidth and availability. However, J2EE applications are not easy to manage: their deployment process (installation and configuration) is complex and error-prone, no real execution monitoring mechanism exists, and dynamic reconfiguration remains a goal to achieve. This lack of manageability makes it very difficult to take full advantage of clustering capabilities, e.g. expanding or collapsing replica sets as needed.

This paper presents the first results of an ongoing project that aims to provide system administrators with a management environment that is as automated as possible. Managing a system means being able to deploy, monitor and dynamically reconfigure it. Our first experiments target the deployment (i.e. installation and configuration) of a clustered J2EE application. The contribution in this field is JADE, a Java-based application deployment environment that eases the administrator's job. We show how JADE allows deploying a real benchmark application called RUBiS.

The outline of the rest of this paper is as follows. Section 2 recalls the architecture and life cycle of clustered J2EE applications and shows the limits of existing deployment and configuration tools. Section 3 then presents JADE, a contribution to easing the management of such applications by providing automatic scripting-based deployment and configuration tools. Section 4 situates this work within a wider project that consists of defining a component-based framework for autonomous systems management. Finally, Section 5 concludes and presents future work.

2. ADMINISTRATION OF J2EE CLUSTERS: STATE-OF-THE-ART AND CHALLENGES

This introductory section recalls the architecture and life cycle of clustered J2EE applications before showing the limits of the associated management tools.

2.1 Clustered J2EE Applications and their Lifecycle

[Figure 1. J2EE Applications Architecture — end-user requests flow over HTTP to the web tier (Apache with the mod_jk plugin), over AJP13 to the presentation tier (Tomcat), over RMI to the business logic tier, and over SQL/JDBC to the database tier.]

J2EE application servers [1], as depicted in Figure 1, are usually composed of four different tiers, either running on a single machine or on up to four:
- A web tier, i.e. a web server (e.g. Apache [2]), that manages incoming client requests and, depending on whether they relate to static or dynamic content, either serves them directly or routes them to the presentation tier using the appropriate protocol (e.g. AJP13 for Tomcat).
- A presentation tier, i.e. a web container (e.g. Tomcat [3]), that receives forwarded requests from the web tier, interacts with the business logic tier (using the RMI protocol) to get the related data, and finally generates a web document presenting the results to the end-user.
- A business logic tier, i.e. an Enterprise JavaBeans server (e.g. JoNAS [4]), that embodies the application logic components (providing them with non-functional properties), which mainly interact with the database storing the application data by sending SQL requests through the JDBC framework.
- A database tier, i.e. a database management system (e.g. a MySQL server [5]), that manages the application data.

The main motivations for clustering are scalability and fault-tolerance. Scalability is a key issue for web applications that must serve billions of requests a day. Fault-tolerance does not necessarily apply to popular sites, even if it is also required in this case, but to applications where information delivery is critical (such as commercial web sites). Both scalability and fault-tolerance are obtained through replication (with consistency management for the latter). In the case of J2EE applications, database replication provides the application with service availability when machine failures occur, as well as efficiency, by load balancing incoming requests between replicas.

The global architecture of clustered J2EE applications is depicted in Figure 2 and detailed below in the case of an {Apache, Tomcat, JoNAS, MySQL} cluster. Apache clustering is managed through HTTP load-balancing mechanisms that can involve hardware and/or software helpers. We cite below some well-known general-purpose techniques [6] that apply to any kind of web server:
• Level-4 switching, where a high-cost dedicated router can distribute up to 700,000 simultaneous TCP connections over the different servers.
• RR-DNS (Round-Robin DNS), where a DNS server periodically changes the IP address associated with the web site hostname.
• Microsoft's Network Load Balancing or Linux Virtual Server, which use modified TCP/IP stacks to allow a set of hosts to share the same IP address and cooperatively serve requests.
• TCP handoff, where a front-end server establishes TCP connections and lets a chosen host directly handle the related communication.

Tomcat clustering is achieved by using the load-balancing feature of Apache's mod_jk plugin. Each mod_jk can be configured to balance requests over all or a subset of the Tomcat instances, according to a weighted round-robin policy (a configuration sketch is given below).

No common mechanism exists to manage business logic tier replicas, but ad hoc techniques have been defined. For example, JoNAS clustering can be achieved by using a dedicated "cluster" stub in Tomcat, instead of the standard RMI stub, in order to interact with the EJBs. This stub can be seen as a collection stub that manages load balancing, assuming that, whatever the JoNAS instance where a bean has been created, its reference is bound in all JNDI registries.

Database clustering solutions often remain commercial, like Oracle RAC (Real Application Cluster) or DB2 cluster, and require using a set of homogeneous full replicas. We can however cite C-JDBC [7], an open-source JDBC clustering middleware that allows using heterogeneous partial replicas and provides consistency, caching and load balancing.
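Returning to the mod_jk mechanism mentioned above, such a weighted round-robin setup is typically expressed in a workers.properties file along the lines of the following sketch. Host names, ports and weights are taken from the Appendix configuration; the exact property names depend on the mod_jk version, so this is an illustration rather than a verbatim configuration:

# Two Tomcat instances reachable over AJP13
worker.list=loadbalancer
worker.tomcat1.type=ajp13
worker.tomcat1.host=sci41
worker.tomcat1.port=8009
worker.tomcat1.lbfactor=100
worker.tomcat2.type=ajp13
worker.tomcat2.host=sci42
worker.tomcat2.port=8009
worker.tomcat2.lbfactor=100
# Weighted round-robin load balancing across both workers
worker.loadbalancer.type=lb
worker.loadbalancer.balanced_workers=tomcat1,tomcat2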

The J2EE application life cycle consists of three main steps, detailed below: deployment, monitoring and reconfiguration.

[Figure 2. Clustered J2EE Applications Architecture — end-user requests reach an HTTP load balancer in front of replicated web tiers (Apache with mod_jk load balancing), which forward them over AJP13 to replicated presentation tiers (Tomcat); cluster stubs and replicated JNDI registries connect the presentation tiers over RMI to the business logic tiers, which access the database tiers through a JDBC clustering middleware.]

Deployment
At the deployment step, tiers must first be installed on hosts and configured so that they are correctly bound to each other. Then, the application logic and data can be initialized. Application tiers are often delivered as installable packages (e.g. rpms), and the configuration is statically expressed in configuration files that statically map components to resources.

Monitoring
Once the application has been deployed on the J2EE cluster, one needs to know both the system and the application states to be aware of problems that may arise. The most common issues are due either to hardware faults, such as a node or network link failure, or to inappropriate resource usage, when a node or a tier of the application server becomes a bottleneck.

Reconfiguration
Once a decision has been taken (e.g., extension of a J2EE tier on new nodes to handle increased load), one must be able to perform the appropriate reconfiguration, avoiding as much as possible stopping the associated components.

2.2 Deployment, Monitoring and Reconfiguration Challenges

Currently, no integrated deployment environment exists for clustered J2EE applications. Each tier must be installed manually and independently. Similarly, the whole assembly, including the clustering middleware, must be configured manually, mainly through static configuration files (and there is also no mechanism to verify configuration consistency). Consequently, the deployment and configuration process is a complex task to perform.

J2EE cluster monitoring is also weakly supported. It is obviously possible to observe host load or to use SNMP to track failures, but this is not enough to get pertinent information about application components. There is no way to monitor an Apache web server, and even though JoNAS offers JMX interfaces to see what applications are running, the cluster administrator cannot gather load evaluations at the application level (but only the amount of memory used by the JVM). Finally, database servers usually do not offer monitoring features, except for a few commercial products.

In terms of reconfiguration, no really dynamic mechanism is offered. Only the Apache server is able to dynamically take configuration file changes into account; the other tiers need to be stopped and restarted in order to apply low-level modifications.

In this context, in order to alleviate the burden of the application administrator, to take advantage of clustering, and thus to be able to optimize performance and resource consumption, there is a crucial need for a set of tools:
- an automated deployment and configuration tool that allows easily deploying and configuring an entire J2EE application,
- an efficient application monitoring service that automatically gathers, filters, and notifies events that are pertinent to the administrator,
- a framework for dynamic reconfiguration.

Research work that is directly related to this topic is provided by the Software Dock [8]. The Software Dock is a distributed, agent-based framework supporting the entire software deployment life cycle. One major aspect of the Software Dock research is the creation of a standard schema for the deployment process. The current prototype of the Software Dock system includes an evolving software description schema definition. Abstractly, the Software Dock provides infrastructure for housing software releases and their semantic descriptions at release sites, and infrastructure to deploy or "dock" software releases at consumer sites. Mobile agents interpret the semantic descriptions provided by the release site in order to perform the various software deployment life cycle processes. Initial implementations of generic agents for performing configurable content install, update, adapt, reconfigure, and remove operations have been created. However, the Software Dock deals with neither J2EE deployment nor clustered environments.

3. JADE: J2EE APPLICATIONS DEPLOYMENT ENVIRONMENT

In this section, we present JADE, a deployment environment for clustered J2EE applications. We first give an overview of the architecture and then follow with the deployment example of a benchmark application called RUBiS.

3.1 Architecture Overview

JADE is a component-based infrastructure which allows the deployment of J2EE applications on cluster environments. As depicted in Figure 3, JADE is mainly composed of three levels, defined as follows:

Figure 3. JADE Architecture Overview.

Konsole level
In order to deploy software components, JADE provides a configuration shell language. The language introduces a set of deployment commands, described as follows:
• "start daemon": starts a JADE daemon on a node
• "create": creates a new component manager
• "set": sets a component property
• "install": installs component directories
• "installApp": installs application code and data
• "start": starts a component
• "stop": stops a component
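For instance, deploying and starting a single Tomcat instance reduces to a handful of commands (node and path names taken from the RUBiS script in the Appendix):

start daemon sci41
create tomcat tomcat1 sci41
set tomcat1 DIR_INSTALL /users/hagimont/tomcat_install
install tomcat1 {conf, doc, logs, webapps}
start tomcat1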

The use of the configuration commands is illustrated in the RUBiS deployment use case in the Appendix. The shell commands are interpreted by the command invoker, which builds deployment requests and submits them to the deployment engine. JADE provides a GUI konsole which allows deploying software components on cluster nodes. As shown in Figure 3,

each started component is managed through its own GUI konsole. The GUI konsole also allows managing existing configuration shells.

Deployment level
This level comprises the component repository, the deployment engine and the component managers:
- The repository provides access to several software releases (Apache, Tomcat, …) and to the associated component managers. It provides a set of interfaces for instantiating the deployment engine and the component managers.
- The deployment engine is the software responsible for performing the specific tasks of the deployment process on the cluster nodes. The deployment process is driven by the deployment tools using interfaces provided by the deployment engine.
- The component manager allows setting component properties at launch time as well as at run time.

Cluster level
The cluster level comprises the components deployed and started on the cluster nodes. At this stage, the deployed components are ready to be managed.

The JADE deployment engine is a component-based infrastructure. It provides the interfaces required to deploy the application on the required nodes. It is composed of a component factory and a component deployer on each node involved. When a deployment shell runs a script, it begins with the installation of component factories on the required nodes and then interacts with the factories to create the component deployers. The shell can then execute the script by invoking the component deployers. The component factory exposes an interface to remotely create and destroy component managers. Component deployers are wrappers that encapsulate legacy code and expose an interface that allows installing a tier from the repository onto the local node, configuring the local installation, loading the application from the repository onto the tier, configuring the application, and starting/stopping the tier and the application.

The JADE command invoker submits deployment and configuration requests to the deployment engine. Even though the requests are currently implemented as synchronous RMI calls to the deployment engine interface, other connectors (such as a MOM) could easily be plugged in in the future. A standard deployment script performs the following actions: install the tiers, configure each tier instance, load the application on the tiers, configure the application, and start the tiers. An example of a deployment script is given in the Appendix. A standard undeployment script should stop the application and the tiers and uninstall all the artefacts previously installed.
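As an illustration, the factory and deployer roles described above could be captured by RMI interfaces along the following lines. This is a sketch only: the interface and method names are hypothetical, since the actual JADE API is not given here, but each method mirrors one of the shell commands of Section 3.1:

import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical per-node factory started by a JADE daemon
// ("start daemon" in the configuration shell).
interface ComponentFactory extends Remote {
    // e.g. create("tomcat", "tomcat1") -- the "create" command
    ComponentDeployer create(String type, String name) throws RemoteException;
    void destroy(String name) throws RemoteException;
}

// Hypothetical wrapper encapsulating a legacy tier (Apache, Tomcat, MySQL, ...)
// and exposing its management operations to the deployment engine.
interface ComponentDeployer extends Remote {
    void setProperty(String key, String value) throws RemoteException;       // "set"
    void install(String[] directories) throws RemoteException;               // "install"
    void installApp(String path, String application) throws RemoteException; // "installApp"
    void start() throws RemoteException;                                     // "start"
    void stop() throws RemoteException;                                      // "stop"
}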

3.2 RUBiS deployment scenario

RUBiS [9] provides a real-world example of the need for improved deployment support. This example is used to design a first basic deployment infrastructure. RUBiS is an auction site prototype, modelled after eBay.com, that is used to evaluate application design patterns and application server performance and scalability. RUBiS offers different application logic implementations. It may take various forms, including scripting languages such as PHP that execute as a module in a web server such as Apache, Microsoft Active Server Pages that are integrated with Microsoft's IIS server, Java servlets that execute in a separate Java virtual machine, and full application servers such as an Enterprise JavaBeans (EJB) server [22]. This study focuses on the Java servlets implementation. Since we consider the use case of RUBiS in a cluster environment, we depict a load-balancing scenario. The Appendix presents a configuration involving two Tomcat servers and two MySQL servers. In this configuration, the Apache server is deployed on a node called sci40, the Tomcat servers are on nodes called sci41 and sci42, and the two MySQL servers are on nodes called sci43 and sci44.

4. TOWARDS A COMPONENT-BASED INFRASTRUCTURE FOR AUTONOMOUS SYSTEMS MANAGEMENT

The environment presented in the previous section is targeted at J2EE application deployment but, more generally, it can easily be adapted to apply to system management.

4.1 Overview of System Management

Managing a computer system can be understood in terms of the construction of system control loops, as stated in control theory [10]. These loops are responsible for the regulation and optimization of the behavior of the managed system. They are typically closed loops, in that the behavior of the managed system is influenced both by operational inputs (provided by clients of the system) and by control inputs (provided by the management system in reaction to observations of the system behavior). Figure 4 depicts a general view of such control loops, which can be divided into multi-tier structures including: sensors, actuators, notification transport, analysis, decision, and command transport subsystems.

[Figure 4. Overview of a Supervision Loop — sensors and actuators are attached to the managed system, with notification transport feeding the analysis and decision tiers, and command transport driving the actuators.]

Sensors locally observe relevant state changes and event occurrences. These observations are then gathered and transported by the notification transport subsystem to the appropriate observers, i.e., analyzers. The analysis assesses and diagnoses the current state of the system. The diagnosis information is then exploited by the decision subsystem, and appropriate command plans are built, if necessary, to bring the managed system behavior within the required regime. Finally, the command transport subsystem orchestrates the execution of the commands required by the command plan, while actuators are the implementation of local commands.

Therefore, building an infrastructure for system management can be understood as providing support for implementing the lowest tiers of a system control loop, namely the sensors/actuators and notification/command transport subsystems. We consider that such an infrastructure should not be sensitive to how the loop is closed at the top (analysis and decision tiers), be it by a human being or by a machine (in the case of autonomous systems). In practice, a control loop structure can merge different tiers, or have trivial implementations for some of them (e.g., a reflex arc to respond in a predefined way to the occurrence of an event).

Also, in complex distributed systems, multiple control loops are bound to coexist. For instance, we need to consider horizontal coupling, whereby different control loops at the same level in a system hierarchy cooperate to achieve correlated regulation and optimization of the overall system behavior by controlling separate

but interacting subsystems [11]. We also need to consider vertical coupling, whereby several loops participate, at different time granularities and system levels, in the control of a system (e.g., multi-level scheduling).
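As a minimal illustration of this vocabulary, the skeleton below (in Java, with entirely hypothetical types) shows one iteration of such a loop, with a trivial reflex-arc analysis and decision stage:

import java.util.List;
import java.util.Optional;

interface Sensor { Optional<String> poll(); }      // observes the managed system
interface Actuator { void apply(String command); } // executes local commands

class SupervisionLoop {
    private final List<Sensor> sensors;
    private final List<Actuator> actuators;

    SupervisionLoop(List<Sensor> sensors, List<Actuator> actuators) {
        this.sensors = sensors;
        this.actuators = actuators;
    }

    // One iteration: gather notifications (notification transport),
    // analyze and decide (here a trivial reflex arc), then dispatch
    // the resulting commands (command transport) to the actuators.
    void step() {
        for (Sensor s : sensors) {
            s.poll().ifPresent(event -> {
                String command = decide(analyze(event));
                for (Actuator a : actuators) {
                    a.apply(command);
                }
            });
        }
    }

    private String analyze(String event) { return event; }                   // diagnosis
    private String decide(String diagnosis) { return "react:" + diagnosis; } // command plan
}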

4.2 Beyond JADE

JADE is a first tool that has to be complemented by others providing the administration process with monitoring and reconfiguration. Before this, a prerequisite is a cartography service that builds a comprehensive system model, encompassing all hardware and software resources available in the system. Instead of relying on a "manual" selection of the resources eligible for hosting tiers, the deployment process should dynamically map application components onto available resources by querying the cartography service, which maintains a coherent view of the system state.

A component-based model is well suited to represent such a system model. Each resource (node, software, …) can be represented by a component. Composition manifests hierarchical and containment dependencies. With such an infrastructure, a deployment description no longer needs to bind static resources but only needs to define the set of required resources. The architecture description might include an exact set of resources or just define a minimal set of constraints to satisfy. The cartography service can then inspect the system representation to find the components that correspond to the resources needed by the application. The deployment process itself consists in inserting the application components into the node components that contain the required resources. Finally, the application components are bound to the resources via bindings, to reflect the resource usage in the cartography. The effective deployment of a component is then performed as a consequence, since the component model allows some processing to be associated with the insertion or removal of a sub-component.

There is also a need for a monitoring service reporting the current cluster state, so that the appropriate actions can be taken. Such a monitoring infrastructure requires sensors and a notification transport subsystem. Sensors can be implemented as components and component controllers that are dynamically deployed to reify the state of a particular resource (hardware or software). Some sensors can be generic and interact with resources through common protocols such as SNMP or JMX/RMI, but other probes are specific to a resource (e.g., a processor sensor). Deploying sensors optimally for a given set of observations is an issue: sensors monitoring physical resources may have to be deployed where the resource is located (e.g., to monitor resource usage) or on remote nodes (e.g., to detect node failures). Another direct concern about sensors is their intrusiveness on the system; for instance, the frequency of probing must not significantly alter the system behavior. In the case of a J2EE cluster, we have to deal with different legacy software for each tier. Some software, such as web or database servers, does not provide monitoring interfaces, in which case we have to rely on wrapping and on indirect observations using operating system or physical resource

sensors. However, J2EE containers usually provide JMX interfaces that offer a way to instrument the application server. Additionally, the application programmer can provide user-level sensors (e.g., in the form of JMX MBeans).

Notification transport is in charge of event and reaction dispatching. Once the appropriate sensors are deployed, they generate notifications to report the state of the resources they monitor. The notifications must be collected and transported to the observers and analyzers that have expressed interest in them. An observer can, for instance, be a monitoring console that displays the state of the system in a human-readable form. Different observers and analyzers may require different properties from the channel used to transport the notifications: an observer in charge of detecting a node failure may require a reliable channel providing a given QoS, while these properties are not required by a simple observer of the CPU load of a node. Therefore, the channels used to transport the notifications should be configured according to the requirements of the concerned observers and analyzers. Typically, it should be possible to dynamically add, remove, or configure a channel between sensors and observers/analyzers. To this effect, we have implemented DREAM (Dynamic REflective Asynchronous Middleware) [12], a Fractal-based framework to build configurable and adaptable communication subsystems, in particular asynchronous ones. DREAM components can be changed at runtime to accommodate new needs such as reconfiguring communication paths, adding reliability or ordering, inserting new filters, and so on. We are currently integrating various mechanisms and protocols in the DREAM framework to implement scalable and adaptable notification channels, drawing from recent results on publish-subscribe routing and epidemic protocols.
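As an illustration of a generic JMX-based sensor, the following simplified Java sketch periodically reifies the value of a JMX attribute and publishes it on a notification channel. The class is hypothetical, and the channel is represented here as a plain Consumer; DREAM's actual component API is not shown:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import java.util.function.Consumer;

// Hypothetical generic sensor polling a JMX attribute of a managed tier
// and forwarding each observation to a notification channel.
class JmxSensor implements Runnable {
    private final MBeanServerConnection server;
    private final ObjectName bean;
    private final String attribute;
    private final long periodMillis;       // probing frequency: must remain non-intrusive
    private final Consumer<Object> channel;

    JmxSensor(MBeanServerConnection server, ObjectName bean, String attribute,
              long periodMillis, Consumer<Object> channel) {
        this.server = server;
        this.bean = bean;
        this.attribute = attribute;
        this.periodMillis = periodMillis;
        this.channel = channel;
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Reify the resource state and publish it as a notification.
                channel.accept(server.getAttribute(bean, attribute));
                Thread.sleep(periodMillis);
            }
        } catch (Exception e) {
            channel.accept(e); // report sensor failures as notifications too
        }
    }
}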

5. CONCLUSION AND FUTURE WORK

As the popularity of dynamic-content web sites increases rapidly, there is a need for maintainable, reliable and, above all, scalable platforms to host these sites. Clustered J2EE servers are a common solution used to provide reliability and performance. J2EE clusters may consist of several thousands of nodes; they are large and complex distributed systems, and they are challenging to administer and to deploy. Hence there is a crucial need for tools that ease the administration and deployment of these distributed systems. Our ultimate goal is to provide a reactive management system.

We propose the JADE tool, a framework that eases J2EE application deployment. JADE provides automatic scripting-based deployment and configuration tools for clustered J2EE applications. We experimented with a simple configuration scenario based on a servlet version of an auction site (RUBiS). This experiment provided us with the necessary feedback and a basic component with which to develop a reactive management system, and it shows the feasibility of the approach. JADE is a first tool that provides a deployment facility, but it has to be completed to support a full administration process with monitoring and reconfiguration.

We are currently working on several open issues for the implementation of our architecture: the system model and instrumentation for resource deployment; scalability and coordination in the presence of failures in the transport subsystem; and automating the analysis and decision processes for our J2EE use cases. We plan to experiment with JADE in other J2EE scenarios, including EJB ones (the EJB version of RUBiS). Our deployment service is a basic building block for an administration system; it will be integrated in the future system management service.

6. References

[1] S. Allamaraju et al. Professional Java Server Programming, J2EE Edition. Wrox Press, ISBN 1-861004-65-6, 2000.

[2] http://www.apache.org

[3] http://jakarta.apache.org/tomcat/index.html

[4] http://jonas.objectweb.org/

[5] http://www.mysql.com/

[6] http://www.onjava.com/pub/a/onjava/2001/09/26/load.html

[7] Emmanuel Cecchet and Julie Marguerite. C-JDBC: Scalability and High Availability of the Database Tier in J2EE Environments. In the 4th ACM/IFIP/USENIX International Middleware Conference (Middleware), Poster session, Rio de Janeiro, Brazil, June 2003.

[8] R.S. Hall et al. An Architecture for Post-Development Configuration Management in a Wide-Area Network. In the 1997 International Conference on Distributed Computing Systems.

[9] Emmanuel Cecchet, Anupam Chanda, Sameh Elnikety, Julie Marguerite and Willy Zwaenepoel. Performance Comparison of Middleware Architectures for Generating Dynamic Web Content. In Proceedings of the 4th ACM/USENIX International Middleware Conference (Middleware), Rio de Janeiro, Brazil, June 16-20, 2003.

[10] K. Ogata. Modern Control Engineering, 3rd ed. Prentice-Hall, 1997.

[11] Y. Fu et al. SHARP: An Architecture for Secure Resource Peering. In Proceedings of SOSP'03.

[12] Vivien Quéma, Roland Balter, Luc Bellissard, David Féliot, André Freyssinet and Serge Lacourte. Asynchronous, Hierarchical and Scalable Deployment of Component-Based Applications. In Proceedings of the 2nd International Working Conference on Component Deployment (CD'2004), Edinburgh, Scotland, May 2004.

7. APPENDIX

// start the daemons (i.e. the factories)
start daemon sci40
start daemon sci41
start daemon sci44
start daemon sci45

// create the managed components:
// type name host
create apache apache1 sci40
create tomcat tomcat1 sci41
create tomcat tomcat2 sci42
create mysql mysql1 sci43
create mysql mysql2 sci44

// Configure the apache part
set apache1 DIR_INSTALL /users/hagimont/apache_install
set apache1 DIR_LOCAL /tmp/hagimont_apache_local
set apache1 USER hagimont
set apache1 GROUP sardes
set apache1 SERVER_ADMIN [email protected]
set apache1 PORT 8081
set apache1 HOST_NAME sci40
// bind to tomcat1
set apache1 WORKER tomcat1 8009 sci41 100
// bind to tomcat2
set apache1 WORKER tomcat2 8009 sci42 100
set apache1 JKMOUNT servlet

// Configure the two tomcat
set tomcat1 JAVA_HOME /cluster/java/j2sdk1.4.2_01
set tomcat1 DIR_INSTALL /users/hagimont/tomcat_install
set tomcat1 DIR_LOCAL /tmp/hagimont_tomcat_local
// provides worker port
set tomcat1 WORKER tomcat1 8009 sci41 100
set tomcat1 AJP13_PORT 8009
set tomcat2 JAVA_HOME /cluster/java/j2sdk1.4.2_01
set tomcat2 DIR_INSTALL /users/hagimont/tomcat_install
set tomcat2 DIR_LOCAL /tmp/hagimont_tomcat_local
// provides worker port
set tomcat2 WORKER tomcat2 8009 sci42 100
set tomcat2 AJP13_PORT 8009
set tomcat2 DataSource mysql2

// Configure the two mysql
set mysql1 DIR_INSTALL /users/hagimont/mysql_install
set mysql1 DIR_LOCAL /tmp/hagimont_mysql_local
set mysql1 USER root
set mysql1 DIR_INSTALL_DATABASE /tmp/hagimont_database
set mysql2 DIR_INSTALL /users/hagimont/mysql_install
set mysql2 DIR_LOCAL /tmp/hagimont_mysql_local
set mysql2 USER root
set mysql2 DIR_INSTALL_DATABASE /tmp/hagimont_database

// Install the components
install tomcat1 {conf, doc, logs, webapps}
install tomcat2 {conf, doc, logs, webapps}
install apache1 {icons, bin, htdocs, cgi-bin, conf, logs}
install mysql1 {}
install mysql2 {}

// Load the application part in the middleware
installApp mysql1 /tmp/hagimont_mysql_local ""
installApp mysql2 /tmp/hagimont_mysql_local ""
installApp tomcat1 /users/hagimont/appli/tomcat rubis
installApp tomcat2 /users/hagimont/appli/tomcat rubis
installApp apache1 /users/hagimont/appli/apache Servlet_HTML

// Start all the components
start mysql1
start mysql2
start tomcat1
start tomcat2
start apache1

Highly Configurable Operating Systems for Ultrascale Systems∗

Arthur B. Maccabe and Patrick G. Bridges
Department of Computer Science, MSC01-1130
1 University of New Mexico
Albuquerque, NM 87131-0001

[email protected] [email protected]

Ron Brightwell and Rolf Riesen
Sandia National Laboratories
PO Box 5800; MS 1110
Albuquerque, NM 87185-1110
[email protected] [email protected]

Trammell Hudson
Operating Systems Research, Inc.
1729 Wells Drive NE
Albuquerque, NM 87112
[email protected]

∗This work was supported in part by Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

ABSTRACT

Modern ultrascale machines have a diverse range of usage models, programming models, architectures, and shared services that place a wide range of demands on operating and runtime systems. Full-featured operating systems can support a broad range of these requirements, but sacrifice optimal solutions for general ones. Lightweight operating systems, in contrast, can provide optimal solutions at specific design points, but only for a limited set of requirements. In this paper, we present preliminary numbers quantifying the penalty paid by general-purpose operating systems and propose an approach to overcome the limitations of previous designs. The proposed approach focuses on the implementation and composition of fine-grained composable micro-services, portions of operating and runtime system functionality that can be combined based on the needs of the hardware and software. We also motivate our approach by presenting concrete examples of the changing demands placed on operating systems and runtimes in ultrascale environments.

1. INTRODUCTION

Due largely to the ASCI program within the United States Department of Energy, we have recently seen the deployment of several production-level terascale computing systems. These systems, for example ASCI Red, ASCI Blue Mountain, and ASCI White, include a variety of hardware architectures and node configurations. In addition to differing hardware approaches, a range of usage models (e.g., dedicated vs. space-shared vs. time-shared) and programming models (e.g., message-passing vs. shared-memory vs. global shared address space) have also been used for programming these systems. In spite of these differences and other evolving demands, operating and runtime systems are expected to keep pace. Full-featured operating systems can support a broad range of these requirements, but sacrifice optimal solutions for general ones. Lightweight operating systems, in contrast, can provide optimal solutions at specific design points, but only for a limited set of requirements.

In this paper, we present an approach that overcomes the limitations of previous approaches by providing a framework for configuring operating and runtime systems tailored to the specific needs of the application and environment. Our approach focuses on the implementation and composition of micro-services, portions of operating and runtime system functionality that can be composed together in a variety of ways. By choosing appropriate micro-services, runtime and operating system functionality can be customized at build time or runtime to the specific needs of the hardware, system usage model, programming model, and application.

The rest of this paper is organized as follows: section 2 describes the motivation for our proposed system, including the hardware and software architectures of current terascale computing systems and the challenges faced by operating systems on these machines, and presents preliminary numbers and experiences to outline the scale of this problem. It also presents several motivating examples that are driving our design efforts. Section 3 describes the specific challenges faced by operating systems in ultrascale environments, and section 4 presents our approach to addressing these challenges. Section 5 describes various related operating system work, and section 6 concludes.

2. MOTIVATION

2.1 Current and Future System Demands

Modern ultrascale systems, for example the various ASCI machines and the Earth Simulator, have widely varying system-level and node-level hardware architectures. The first terascale system, ASCI Red, is a traditional distributed-memory massively parallel processing machine – thousands of nodes, each with a small number of processors (2). In contrast, the ASCI Blue Mountain machine was composed of 128-processor nodes, while ASCI White employs 16-way SMP nodes. We also expect additional hardware advances such as multi-core chips and processor-in-memory chips to be available in similar systems in the near future.

In addition to hardware, the approach from a programming model standpoint has varied as well. The lightweight compute node operating system on ASCI Red does not support a shared-memory programming model on individual compute nodes, while the other platforms support a variety of shared-memory programming constructs, such as threads and semaphores. This has led to the development of mixed-mode applications that combine MPI and OpenMP (or pthreads) to fully utilize the capabilities of systems with large numbers of processors per node. Applications have also been developed for these platforms that extend the boundaries of a traditional programming model. The distributed implementation of the Python scripting language is one such example [14]. Advanced programming models, such as the Global Address Space model, are also gaining support within the parallel computing community.

Even within the context of a specific programming model such as MPI, applications can have wide variations in the number and type of system services they require and can also have varying requirements for the environment in which they run. For example, the Common Component Architecture assists in the development of MPI applications, but it requires dynamic library services to be available to the individual processes within the parallel job. Environmental services, such as system-level checkpoint/restart, are also becoming an expected part of the standard parallel application development environment.

The usage model of these large machines has also expanded. The utility of capacity computing, largely driven by the ubiquity of commodity clusters, has led to changes in the way in which large machines are partitioned and scheduled. Machines that were originally intended to run a single, large parallel simulation are being used more frequently for parameter studies that require thousands of small jobs.

2.2 Problems with Current Approaches

General-purpose operating systems such as Linux provide a wide range of services. These services and their associated kernel structures enable sophisticated applications with capabilities for visualization and inter-networking. This generality unfortunately comes at the cost of performance for all applications that use the operating system, because of the overheads of unnecessary services. In an initial attempt to measure this performance difference, we compared the performance of the mg and cg NAS class B benchmarks on ASCI Red hardware [21] when running two different operating systems. We use Cougar, the productized version of the Puma operating system [26], as the specialized operating system, and Linux as the general-purpose operating system. To make the comparison as fair to Linux as possible, we have ported the Cplant version of the Portals high-performance messaging layer [1] to the ASCI Red hardware. Cougar already utilizes Portals for message transmission.

[Figure 1: CG performance (millions of operations per second) on Linux/Portals and Cougar on ASCI Red hardware, for 4 to 128 processors.]

[Figure 2: MG performance (millions of operations per second) on Linux/Portals and Cougar on ASCI Red hardware, for 4 to 128 processors.]

Figures 1 and 2 show the performance of these benchmarks when running on the two operating systems. Linux outperforms Cougar on the cg benchmark with small numbers of nodes because Cougar uses older, less optimized compilers and libraries, but as the number of nodes increases, application performance on Linux falls off. Similar effects happen on the mg benchmark, though mg on Cougar outperforms mg on Linux even on small numbers of nodes, despite the older compilers and libraries. A variety of different overheads cause Linux's performance problems on larger-scale systems, including the lack of contiguous memory layout and the associated TLB overheads, and suboptimal node allocations due to limitations of Linux job-launch on ASCI Red. Such operating system problems have also been seen in other systems. Researchers at Los Alamos, for example, have shown that excess services can cause dramatic performance degradations [17]. Similarly, researchers at Lawrence Livermore National Laboratory have shown that operating system scheduling problems can have a large impact on application performance in large machines [13].

2.3 Motivating Examples

The changing nature of demands on large-scale systems presents some of the largest challenges to operating system design in this environment. We consider changing demands in several areas, along with specific examples from each area, to motivate our work.

2.3.1 Changing Usage Models

As large-scale systems age, they frequently transition from specialized capability-oriented usage for a handful of applications to capacity usage for a wide range of applications. Operating systems for capability-oriented systems often provide a restricted usage model (dedicated or space-shared mode) and need to provide only minimal services, allowing more operating system optimizations. Operating systems for capacity-oriented systems, in contrast, generally support much more flexible usage models, such as timesharing, and must provide additional services including TCP/IP internetworking and dynamic process creation.

2.3.2 Changing Application Demands

Applications have varying demands for similar operating system services depending on their needs. Correctly customizing these services can have a large impact on application performance. As a concrete example, consider four different ways for a signal to be delivered to an application indicating the receipt of a network packet:
• Immediate delivery using interrupts (e.g., UNIX signals) for real-time or event-driven applications
• Coalescing of multiple signals, waiting until some other activity (e.g., an explicit poll or quantum expiration) causes an entry into the kernel, thereby minimizing signal handling overhead
• Extending the kernel with application-specific handler code for performance-critical signals
• Forking a new process to handle each new signal/packet (e.g., inetd in UNIX)

2.3.3 Changing Hardware Architecture

Operating system structure can present barriers to hardware innovation for ultrascale systems. Operating systems must be customized to present novel architectural features to applications and to make effective use of new hardware features themselves. Existing operating systems such as Linux assume that each machine is similar to a standard architecture, the Intel x86 architecture in the case of Linux, and in doing so limit their ability to expose innovative architectural features to the application or to use such features to optimize operating system performance. This inability presents a significant impediment to hardware innovation.

Consider, for example, operating system support for parcel-based processor-in-memory (PIM) systems [22]. Operating systems for such architectures must be flexible enough to perform scheduling and resource allocation on these architectures and make effective use of this hardware for their own purposes. We specifically consider the use of a PIM as a dedicated OS file cache that makes its own prefetching, replacement, and I/O coalescing decisions. Processes that access files would send parcels to this PIM, which could immediately satisfy them from a local cache, coalesce small writes together before sending the request on to the main

I/O system, or aggressively prefetch data based on observed access patterns. Doing such work in a dedicated PIM built for handling latency-sensitive operations would free the system’s heavyweight (e.g. vector) processors from having to perform the latency-oriented services common in operating systems.

2.3.4 Changing Environmental Services

Finally, consider the variety of shared environmental services that operating systems must support, such as file systems and checkpointing functionality. New implementations of these services are continually being developed, and these implementations require changing operating system support. As just one example, the Lustre file system [2] is currently being developed to replace NFS in ultrascale systems. Lustre requires a specific message-passing layer from the operating system (i.e., Portals), in contrast to the general networking necessary to support NFS, but in return provides much better performance and scalability. Similarly, checkpointing services require a means to determine operating system state and network quiescence. Finally, these services are often implemented at user level in lightweight operating systems; in these cases, the operating system must provide a way to authenticate trusted shared services to applications and other system nodes.

3. CHALLENGES

The processing resources of ultrascale systems will likely be partitioned based on functional needs [7]. These partitions will most likely include: a service partition providing general services, including application launch and compilation; an I/O partition providing shared file systems; a network partition providing communication with other systems; and a compute partition providing the primary computational resources for an application. In this work, we are primarily interested in the operating system used in the compute partition, the compute node operating system.

Like any operating system, the compute node operating system provides a bridge between the application and the architecture used to run the application. That is, the operating system presents the application with abstractions of the resources provided by the computing system. The form of these abstractions will depend on the nature of the physical resources and the way in which these resources are used by the application. The compute node operating system will also arbitrate access to shared resources on the compute nodes, resolving conflicts in the use of these resources as needed. The need for this mediation will depend on the way in which the compute nodes are used, i.e., the system usage model. It may also need to provide special services (e.g., authentication) to support access to shared services (e.g., file systems or network services) that reside in other partitions.

Figure 3 presents a graphical interpretation of the five primary factors that influence the design of compute node operating systems for ultrascale computing systems. We include history in addition to the four factors identified in the preceding paragraphs: application needs, system usage models, architectural models (both system-level and node-level architectures), and shared services.

[Figure 3: Factors Influencing the Design of Operating Systems — the operating system at the center, surrounded by application needs, shared services, history, system usage, and architecture.]

3.1 History

Every operating system has a history, and this history may impact the feasibility of using the OS in new contexts. For example, as a Unix-like operating system, Linux assumes that all OS requests come from processes running on the local system. As the network has become a source of requests, Unix systems have adopted a daemon approach to handle them: a daemon listens for incoming requests and passes them to the operating system. In this context, inetd is a particularly interesting example. Inetd listens for connection requests. When it receives a new connection request, inetd examines it and, based on its contents, creates a new process to handle it. That is, the request is passed through the operating system to inetd, which calls the operating system to create a process to handle the request. While it might make more sense to modify Unix to handle network requests directly, this would represent a substantial overhaul of the basic Unix request model.

3.2 Application Needs

Applications present challenges at two levels. First, applications are developed in the context of a particular programming model. Programming models typically require a basic set of services. For example, in the explicit message passing model, it is necessary to allow data to be moved efficiently between local memory and the network. Second, applications themselves may require extended functionality beyond the minimal set needed to support the programming model. For example, an application developed using a component architecture may require system services to enable the use of dynamic libraries.

While lightweight operating systems have been shown to support the development of scalable applications, this approach places an undue burden on the application developer. Given any feature typically associated with modern operating systems (e.g., Unix sockets), there is at least one application that could benefit from having the feature readily available. In the lightweight operating system approach, the application developer is required to either implement the feature or do without. In fact, this is the reason that many of the terascale operating systems today are based on full-featured operating systems. The real challenge is to provide features needed by a majority of applications without adversely affecting the performance and scalability of other applications that do not use these features.

Advanced programming models strive to provide a high-level abstraction of the resources provided by the computing system. Describing computations in terms of abstract resources enhances portability and can reduce the amount of effort needed to develop an application. While high-level abstractions offer significant benefits, application developers frequently need to bypass the implementations of these abstractions for the small parts of the code that are time critical. For example, while the vast majority of the code in an application may be written in a high-level language (e.g., FORTRAN or C), it is not uncommon for application developers to write a core calculation, such as a BLAS routine, in assembly language to ensure an optimal implementation. The crucial point is that the abstractions implemented to support advanced programming methodologies must allow application developers to drop through the layers of abstraction as needed to ensure adequate performance. Because we are interested in supporting resource-constrained applications, providing variable layers of abstraction is especially important. Finally, because the development of new programming models is an ongoing activity, the operating and runtime system must be designed so that it is relatively easy to develop high-performance implementations of the features needed to support a variety of existing programming models as well as new models that may be developed.

3.3 System Usage Models

The system usage model defines the places where the principal computational resources can be shared by different users. Example usage models include: dedicated systems, in which these resources are not shared; batch dedicated systems, in which the resources are not shared while the system is being used, but may be used by different users at different times; space-shared systems, in which parts of the system (e.g., compute nodes) are not shared, but multiple users may be using different parts of the system at the same time; and time-shared systems, in which the resources are being used by multiple users at the same time.

Sharing requires that the operating system take on the role of arbiter, ensuring that all parties are given the appropriate degree of access to the shared resources – in terms of time, space, and privilege. The example usage models presented earlier are listed in roughly the order of the operating system complexity needed to arbitrate the sharing: dedicated systems require almost no support for arbitrating access to resources, batch dedicated systems require that the usage schedule be enforced, space sharing requires that applications running on different parts of the system not be able to interfere with one another, and timesharing requires constant arbitration of the resources. In considering system usage models, the challenge is to provide mechanisms that can support a wide variety of sharing policies, while ensuring that these mechanisms do not have any adverse impact on performance when they are not needed.

3.4 Architectures

Architectural models present challenges at two levels: the node level and the overall system level. An individual compute node may exhibit a wide variety of architectural features, including multiple processors, support for PIM, multiple network interfaces, programmable network interfaces, access to local storage, etc. The key challenge presented by different architectures is the need to build abstractions of the physical resources that match the resource abstractions defined by the programming model. If this is not done efficiently, it could easily inhibit application scaling.

Beyond the need to provide abstractions of physical resources, variations in system-level architectures may require different levels of operating system functionality on the compute nodes. In most cases, specialized hardware (e.g., PIMs) will require specialized OS functionality. However, hardware features may also simplify OS functionality. As an example, Blue Gene/L supports the partitioning of the high-speed communication network: compute nodes in different partitions cannot communicate with one another using the high-speed network. If the partitions correspond to different applications in a space-shared usage model, there is no need for the OS to arbitrate access to the high-speed network. As this example illustrates, the interactions between architecture and usage models may not be trivial.

3.5 Shared Services

Finally, applications will need access to shared services, e.g., file systems. Unlike the resource sharing that is arbitrated by the node operating system, access to the shared resources (e.g., disk drive) provided by a shared server is arbitrated by the server. In addition to arbitration, these servers may also require support for authentication and, using this authentication, provide access control to the logical resources (e.g., files) that they provide.

Here, the challenge is to provide the support required by the shared service. In some cases, this support may be negligible. In other cases, the server may require that it be able to reliably determine the source of a message. In still other cases, the operating system may need to maintain user credentials in a secure fashion while an application is running, so that these credentials can be trusted by a shared server such as a shared file system.

4. APPROACH

In the context of the challenges described in the previous section, a "lightweight operating system" reflects a minimal set of services that meets the requirements presented by a small set of applications, a single usage model, a single architecture, and a single set of shared services. The Puma operating system [27], for example, represents a lightweight operating system with the following bindings: application needs are limited to MPI and access to a shared file system, the system usage model is space sharing, the system architecture consists of thousands of simple compute nodes connected by a high performance network, and the shared services include a parallel file system which relies on Cougar to protect user identification. Our goal is to develop a framework for building operating and runtime systems that are tailored to the specific requirements presented by an application, the system usage model, the system architecture, and the shared services. Our approach is to build a collection of micro-services and tools that support the automatic construction of a lightweight operating system for a specific set of circumstances.

4.1 Micro-Services

At a minimum, each application will need micro-services for managing the primary resources: memory, processor, communication, and file system. We can imagine several implementations for each of these micro-services. One memory allocation service might perform simple contiguous allocation; another might map physical page frames to arbitrary locations in the logical address space of a process; another might provide demand page replacement; yet another might provide predictive page replacement. A processor management service might simply run a single process whenever a processor is available, while another might include thread scheduling. There may be dependencies and incompatibilities among the micro-services. As an example, a communication micro-service that assumes that logically contiguous addresses are physically contiguous (thus reducing the size of a memory descriptor) would depend on a memory allocation service that provides this type of address mapping. There will also be dependencies between micro-services and system usage models. For example, a communication service that provides direct access to a network interface would not be compatible with a usage model that supports time sharing on a node. In addition to micro-services that provide access to primary resources, there will be higher-level services layered on top of the basic micro-services. As an example, one micro-service might provide garbage-collected dynamic allocation, while another might provide first-fit, explicit allocation and de-allocation (malloc and free) for dynamic memory management. Other examples include an RDMA service or a two-sided message service layered on top of a basic communication service.

Finally, we will need "glue" services: micro-services that enable combinations of other services. As an example, consider a usage model that supports general time-sharing among the applications on a node. Further, suppose that one of the applications to be run on the node requires a memory allocator that supports demand page replacement and another application requires a simple contiguous memory allocator. A memory compactor service would make it possible to run both applications on the same node.
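As a rough illustration of how micro-service dependencies and incompatibilities might be declared and checked at composition time, consider the following C sketch. The catalog entries, capability strings, and field names are hypothetical; a real constructor would be considerably richer.

#include <stdio.h>
#include <string.h>

/* A minimal sketch of how micro-service dependencies and
 * incompatibilities might be declared; names are hypothetical. */
struct microservice {
    const char *name;
    const char *provides;   /* capability this service offers      */
    const char *requires;   /* capability it depends on (or NULL)  */
    const char *conflicts;  /* usage model it cannot support, etc. */
};

static const struct microservice catalog[] = {
    { "mem_contig",   "mem:contiguous",  NULL,             NULL },
    { "comm_direct",  "comm:basic",      "mem:contiguous", "time_shared" },
    { "mem_paged",    "mem:demand_page", NULL,             NULL },
    { "sched_single", "cpu:run_one",     NULL,             NULL },
};

/* Check that every 'requires' is satisfied by some selected service. */
static int composition_ok(const struct microservice *sel[], int n)
{
    for (int i = 0; i < n; i++) {
        if (!sel[i]->requires) continue;
        int found = 0;
        for (int j = 0; j < n; j++)
            if (strcmp(sel[j]->provides, sel[i]->requires) == 0)
                found = 1;
        if (!found) {
            printf("unsatisfied: %s needs %s\n",
                   sel[i]->name, sel[i]->requires);
            return 0;
        }
    }
    return 1;
}

int main(void)
{
    /* comm_direct depends on contiguous memory, which is selected,
     * so this composition holds together. */
    const struct microservice *sel[] = { &catalog[0], &catalog[1] };
    printf("composition %s\n", composition_ok(sel, 2) ? "ok" : "broken");
    return 0;
}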

4.2 Signal Delivery Example

To illustrate how our micro-services approach can be used to address the challenges presented by ultrascale systems, we consider the signal delivery example presented toward the end of Section 2. Because signal delivery may not be needed by all applications, micro-services associated with signal delivery would be optional and, as such, would not have any performance impact on applications that did not need signal delivery. For applications that do require signal delivery, we would need a collection of “signal detector” micro-services that are capable of observing the events of interest to the application (e.g., the reception of a message). These micro-services would most likely run as part of the operating system kernel. To ensure that they are run with sufficient frequency, the signal detector micro-services may place requirements on the micro-service used to schedule the processor. The signal detector micro-services would then be tied to one of several specialized “signal delivery” micro-services.

The specific signal delivery micro-service will depend on the needs of the application. An immediate delivery service would modify the control block for the target process so that the signal handler for the process is run the next time the process is scheduled for execution. A coalescing signal delivery service would simply record the relevant information and make this information available to another micro-service that would respond to explicit polling operations in the application. A user defined signal delivery service could take a user defined action whenever an event is detected. Finally, a message delivery service could convert the signal information into data and pass it to the micro-service that is responsible for delivering messages to application processes. The runtime level could then include a micro-service that would read these messages and fork the appropriate process.
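The detector/delivery split described above can be sketched as follows; the event type, the two delivery services, and the composition-time binding via a function pointer are all illustrative assumptions, not the actual design.

#include <stdio.h>

/* Sketch of the detector/delivery split: a detector observes an
 * event and hands it to whichever delivery micro-service the OS was
 * composed with. All types and names are hypothetical. */
struct event { int kind; int pid; };

typedef void (*deliver_fn)(const struct event *);

/* Immediate delivery: mark the target so its handler runs at the
 * next scheduling point (represented here by a flag). */
static int run_handler_next_schedule[64];
static void deliver_immediate(const struct event *e)
{
    run_handler_next_schedule[e->pid] = 1;
}

/* Coalescing delivery: record events for later explicit polling. */
static struct event pending[128];
static int npending;
static void deliver_coalesce(const struct event *e)
{
    if (npending < 128) pending[npending++] = *e;
}

/* The detector micro-service, e.g. run on message reception. */
static void detect_message_arrival(int pid, deliver_fn deliver)
{
    struct event e = { /* kind = */ 1, pid };
    deliver(&e);   /* bound at composition time, not hard-wired */
}

int main(void)
{
    detect_message_arrival(7, deliver_immediate);
    detect_message_arrival(7, deliver_coalesce);
    printf("flag=%d pending=%d\n", run_handler_next_schedule[7], npending);
    return 0;
}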

4.3 Tools

We cannot burden application programmers with all of the micro-services that provide the runtime environment for their applications. Application programmers should only be concerned with the highest-level services that they need (e.g., MPI) and the general goals for lower-level services. We envision the need to develop a small collection of tools to analyze application needs and to combine and analyze combinations of micro-services. Figure 4 presents our current thinking regarding the required tools.

Figure 4: Building an Application Specific OS/Runtime (the Application Analysis Tool and the OS/Runtime Constructor combine the application requirements, micro-services, available shared resources, shared resource requirements, system usage model, and architecture to produce the OS/Runtime)

As shown in Figure 4, the tool chain takes several inputs and produces an application specific OS/Runtime system. If the system usage model is timesharing, this OS/Runtime will be merged with the OS/Runtime needed by other applications that share the computing resources (this merging will most likely be done on a node-by-node basis). For other usage models, the resulting OS/Runtime will be loaded with the application when the application is launched. The application analysis tool extracts the application specific requirements from an application. This tool will need to be cognizant of potential programming models and the optional features of these programming models. In addition, this tool will need to match the application needs for shared services to the shared services that are available. The application analysis tool will produce two intermediate outputs: the application requirements and the requirements associated with the shared resources that are used by the application.

In a second step, these intermediate outputs will be combined with a specification of the system usage model, a specification of the underlying architecture, and the collection of micro-services to build an OS/Runtime that is tailored to the specific needs of the application. Here, we envision a tool that will take as input a set of the top-level services used by an application and produce a directed graph of the permissible lower-level services for the required runtime environment. Nodes of this graph will be weighted by the degree to which the micro-service represented by the node meets the goals of the application developer. We plan to base some of our work on tools for composing micro-services on existing tools, such as the Knit composition tool developed at the University of Utah in the context of the Flux project [19]. Other tools will be needed to select particular services in the context of a system usage model. These tools will also need to ensure that the services selected meet the sharing requirements of the system.
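A toy version of this selection step, assuming a flat candidate list rather than a full directed graph, might look like the following C sketch; the candidate names, weights, and the time-sharing compatibility flag are invented for illustration.

#include <stdio.h>
#include <string.h>

/* Sketch of the selection step: alternative micro-services providing
 * the same capability are weighted by how well they meet the
 * developer's goals and filtered by the usage model. Hypothetical. */
struct candidate {
    const char *name;
    const char *capability;
    int weight;          /* higher = better fit for the app's goals */
    int ok_time_shared;  /* sharing-requirement compatibility       */
};

static const struct candidate c[] = {
    { "comm_direct_nic", "comm", 10, 0 },
    { "comm_kernel_mux", "comm",  6, 1 },
    { "mem_contig",      "mem",   9, 1 },
};

static const struct candidate *select_best(const char *cap, int time_shared)
{
    const struct candidate *best = NULL;
    for (size_t i = 0; i < sizeof c / sizeof c[0]; i++) {
        if (strcmp(c[i].capability, cap)) continue;
        if (time_shared && !c[i].ok_time_shared) continue; /* sharing reqs */
        if (!best || c[i].weight > best->weight) best = &c[i];
    }
    return best;
}

int main(void)
{
    /* Direct NIC access wins on a dedicated node, but is filtered
     * out when the usage model requires time sharing. */
    printf("dedicated:   %s\n", select_best("comm", 0)->name);
    printf("time-shared: %s\n", select_best("comm", 1)->name);
    return 0;
}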

5. RELATED WORK

A number of other configurable operating systems have been designed, including microkernel systems, library operating systems, extensible operating systems, and component-based operating systems. In addition, configurability has been designed into a variety of different runtime systems and system software subsystems, including middleware for distributed computing, network protocol stacks, and file systems.

5.1 Configurable Operating Systems

Most standard operating systems such as Linux include a limited amount of configuration that can be used to add or remove subsystems and device drivers from the kernel. However, this configurability does not generally extend to core operating system functions, such as the scheduler or virtual memory system. In addition, the configuration available in many subsystems such as the network stack and the file system is coarse-grained and limited; entire networking stacks and file systems can be added or removed, but these subsystems cannot generally be composed and configured at a much finer granularity. In Linux, for example, the entire TCP/IP or Bluetooth stack can be optionally included in the kernel, but more fine-grained information about exactly which protocols will be used cannot easily be used to customize system configuration. Other operating systems allow more fine-grained configuration. Component-based operating systems such as the Flux OSKit [5], Scout [15], Think [4], eCos [18], and TinyOS [10] allow kernels to be built from a set of composable modules. Scout, for example, is built from a set of routers that can be composed together into custom kernels. The THINK framework is very similar to the framework we propose here. The primary differences are that we expect to build operating systems that are far more tailored to the needs of specific applications, and we do not expect to do much in the way of dynamic binding of services. eCos and TinyOS provide similar functionality in the context of embedded systems and sensor networks, respectively. The Flux OSKit provided a foundation for component-based OS development based on code from the Linux and BSD kernels, focusing particularly on allowing device drivers from these systems to be used in developing new kernels. Unlike our proposal, however, none of these systems has concentrated on customizing system functionality at the fine granularity necessary to take full advantage of new hardware environments or to optimize for the different usage models of ultrascale systems. Microkernel and library operating systems such as L4 [8], exokernels [3], and Pebble [6] allow operating system semantics to be customized at compile time, boot time, or run time by changing the server or library that provides a service, though this composability is even more coarse-grained than in the systems described above. Such flexibility generally comes at a price, however; these operating systems may have to use more system calls and up-calls to implement a given service than a monolithic operating system, resulting in higher overheads. It can also result in a loss of cross-subsystem optimization opportunities. In contrast, our approach seeks to decompose functionality using more fine-grained structures and to preserve cross-subsystem optimization opportunities through tools designed explicitly for composing system functionality.

5.2 Configurable Runtimes and Subsystems

A variety of different systems have also been built that enable fine-grained configuration of system services, generally in the realm of protocol stacks and file systems. In contrast to our approach, none of these systems seeks to use configuration pervasively across an entire operating system. Coarse-grained configuration of network protocol stacks has been explored in System V STREAMS [20], the x-kernel [12], and CORDS [23]. Composition in these systems is layer-based, with each component defining one protocol layer. Similar approaches have been used for building stackable file systems [9, 28]. More fine-grained composition of protocol semantics has been explored in the context of Cactus [11], Horus [24], Ensemble [25], and Rwanda [16]. Cactus's event-based composition model, in particular, has influenced our approach; in fact, we are using portions of the Cactus event framework to implement our system. To date the Cactus project has focused primarily on using event-based composition in network protocols, not on the more general operating system structures described in this paper.

6. CONCLUSIONS

In this paper, we have presented an argument for a framework for customizing an operating system and runtime environment for parallel computing. Based on the results of preliminary experiments, we conclude that the demands of current and future ultrascale systems cannot be addressed by a general-purpose operating system if high levels of performance and scalability are to be achieved and maintained. The current methods of using specialized lightweight approaches and generalized heavyweight approaches will not be sufficient given the challenges presented by current and future hardware platforms, programming models, usage models, and application requirements. To address this problem, we presented a design for a framework that uses micro-services and supporting tools to construct an operating system and associated runtime environment for a specific set of requirements. This approach minimizes the overhead of unneeded features, allows for carefully tailored implementations of required features, and enables the construction of new operating and runtime systems to adapt to evolving demands and requirements.

7. REFERENCES

[1] R. Brightwell, T. Hudson, R. Riesen, and A. B. Maccabe. The Portals 3.0 message passing interface. Technical report SAND99-2959, Sandia National Laboratories, December 1999.
[2] Cluster File Systems, Inc. Lustre: A Scalable, High-Performance File System, November 2002. http://www.lustre.org/docs/whitepaper.pdf.
[3] D. Engler, M. Kaashoek, and J. O'Toole. Exokernel: An operating system architecture for application-level resource management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pages 251–266, Copper Mountain Resort, CO, 1995.
[4] J.-P. Fassino, J.-B. Stefani, J. Lawall, and G. Muller. THINK: A software framework for component-based operating system kernels. In Proceedings of the 2002 USENIX Annual Technical Conference, June 2002.
[5] B. Ford, G. Back, G. Benson, J. Lepreau, A. Lin, and O. Shivers. The Flux OSKit: A substrate for kernel and language research. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 38–51, Saint-Malo, France, 1997.
[6] E. Gabber, C. Small, J. Bruno, J. Brustoloni, and A. Silberschatz. The Pebble component-based operating system. In Proceedings of the 1999 USENIX Annual Technical Conference, pages 267–282, Monterey, CA, USA, 1999.
[7] D. S. Greenberg, R. Brightwell, L. A. Fisk, A. B. Maccabe, and R. Riesen. A system software architecture for high-end computing. In Proceedings of SC'97: High Performance Networking and Computing, San Jose, CA, November 1997. ACM Press and IEEE Computer Society Press.
[8] H. Härtig, M. Hohmuth, J. Liedtke, S. Schönberg, and J. Wolter. The performance of µ-kernel-based systems. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, 1997.
[9] J. Heidemann and G. Popek. File-system development with stackable layers. ACM Transactions on Computer Systems, 12(1):58–89, 1994.
[10] J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. E. Culler, and K. S. J. Pister. System architecture directions for networked sensors. In Architectural Support for Programming Languages and Operating Systems, pages 93–104, 2000.

[11] M. A. Hiltunen, R. D. Schlichting, X. Han, M. Cardozo, and R. Das. Real-time dependable channels: Customizing QoS attributes for distributed systems. IEEE Transactions on Parallel and Distributed Systems, 10(6):600–612, 1999.
[12] N. Hutchinson and L. L. Peterson. The x-kernel: An architecture for implementing network protocols. IEEE Transactions on Software Engineering, 17(1):64–76, 1991.
[13] T. Jones, W. Tuel, L. Brenner, J. Fier, P. Caffrey, S. Dawson, R. Neely, R. Blackmore, B. Maskell, P. Tomlinson, and M. Roberts. Improving the scalability of parallel jobs by adding parallel awareness to the operating system. In Proceedings of SC'03, 2003.
[14] P. Miller. Parallel, distributed scripting with Python. In Third Linux Clusters Institute Conference, October 2002.
[15] D. Mosberger and L. L. Peterson. Making paths explicit in the Scout operating system. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 153–168, 1996.
[16] G. Parr and K. Curran. A paradigm shift in the distribution of multimedia. Communications of the ACM, 43(6):103–109, 2000.
[17] F. Petrini, D. Kerbyson, and S. Pakin. The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of SC'03, 2003.
[18] Redhat. eCos. http://sources.redhat.com/ecos/.
[19] A. Reid, M. Flatt, L. Stoller, J. Lepreau, and E. Eide. Knit: Component composition for system software. In Proceedings of the 4th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 347–360, 2000.
[20] D. M. Ritchie. A stream input-output system. AT&T Bell Laboratories Technical Journal, 63(8):311–324, 1984.
[21] Sandia National Laboratories. ASCI Red, 1996. http://www.sandia.gov/ASCI/TFLOP.
[22] T. L. Sterling and H. P. Zima. The Gilgamesh MIND processor-in-memory architecture for petaflops-scale computing. In International Symposium on High Performance Computing (ISHPC 2002), volume 2327 of Lecture Notes in Computer Science, pages 1–5. Springer, 2002.
[23] F. Travostino, E. Menze III, and F. Reynolds. Paths: Programming with system resources in support of real-time distributed applications. In Proceedings of the IEEE Workshop on Object-Oriented Real-Time Dependable Systems, 1996.
[24] R. van Renesse, K. P. Birman, R. Friedman, M. Hayden, and D. A. Karr. A framework for protocol composition in Horus. In Proceedings of the 14th ACM Principles of Distributed Computing Conference, pages 80–89, 1995.

[25] R. van Renesse, K. P. Birman, M. Hayden, A. Vaysburd, and D. A. Karr. Building adaptive systems using Ensemble. Software Practice and Experience, 28(9):963–979, 1998.
[26] S. R. Wheat, A. B. Maccabe, R. Riesen, D. W. van Dresser, and T. M. Stallcup. PUMA: An operating system for massively parallel systems. In Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, pages 56–65. IEEE Computer Society Press, 1994.
[27] S. R. Wheat, A. B. Maccabe, R. Riesen, D. W. van Dresser, and T. M. Stallcup. PUMA: An operating system for massively parallel systems. Scientific Programming, 3:275–288, 1994.
[28] E. Zadok and I. Badulescu. A stackable file system interface for Linux. In Proceedings of the 5th Annual Linux Expo, pages 141–151, Raleigh, North Carolina, 1999.

Cluster Operating System Support for Parallel Autonomic Computing

A. Goscinski, J. Silcock, M. Hobbs
School of Information Technology, Deakin University, Geelong, Victoria 3217, Australia
+61 3 5227 2088 / +61 3 5227 1378 / +61 3 5227 3342
[email protected]

ABSTRACT
The aim of this paper is to show a general design of autonomic elements and an initial implementation of a cluster operating system that moves parallel processing on clusters into the computing mainstream using the autonomic computing vision. The significance of this solution is as follows. Autonomic Computing was identified by IBM as one of computing's Grand Challenges. The human body was used to illustrate an Autonomic Computing system that possesses the properties of self-knowledge, self-configuration, self-optimization, self-healing, self-protection, knowledge of its environment, and user friendliness. One of the areas that could benefit from the comprehensive approach created by the autonomic computing vision is parallel processing on non-dedicated clusters. Many researchers and research groups have responded positively to the challenge by initiating research around one or two of the characteristics identified by IBM as the requirements for autonomic computing. We demonstrate here that it is possible to satisfy all of the Autonomic Computing characteristics.

Categories and Subject Descriptors D.4.7 [Operating Systems]: Organization and Design – Distributed systems.

General Terms Management, Design and Reliability.

Keywords Cluster Operating Systems, Parallel Processing, Autonomic Computing.

1. INTRODUCTION
There is a strong trend in parallel computing to move to cheaper, general-purpose distributed systems, called non-dedicated clusters, that consist of commodity off-the-shelf components such as PCs connected by fast networks. Many companies, businesses and research organizations already have such "ready made parallel computers", which are often idle and/or lightly loaded not only during nights and weekends but also during working hours. A review by Goscinski [9] shows that none of the research performed thus far has looked at the problem of developing a technology that goes beyond high performance execution and allows clusters and grids to be built that can support unpredictable changes, provide services reliably to all users, and offer ease of use and ease of programming.

Computer clusters, including non-dedicated clusters that allow the execution of both parallel and sequential applications concurrently, are seen as user unfriendly due to their complexity. Parallel processing on clusters is not broadly accessible and is not used on a daily basis – parallel processing on clusters has not yet become a part of the computing mainstream. Many activities, e.g., the selection of computers, the allocation of computations to computers, and dealing with faults and with changes caused by adding and removing computers to/from clusters, must be handled (programmed) manually by programmers. Ordinary engineers, managers, etc., do not have, and should not need, the specialized knowledge required to program operating system oriented activities. The deficiencies of current research in parallel processing, in particular on clusters, have also been identified in [19,6,33,2,31]. A similar situation exists in the area of Distributed Shared Memory (DSM). A need for an integrated approach to building DSM systems was advocated in [16].

We decided to demonstrate the possibility of addressing not only high performance but also ease of programming/use, reliability, and availability through proper reaction to unpredictable changes and through transparency, and developed the GENESIS cluster operating system, which provides a SSI and offers services that satisfy these requirements [12]. However, up to the end of 2001 there was no wider response to satisfy these requirements. A comprehensive program to re-examine the "obsession with faster, smaller, and more powerful" and "to look at the evolution of computing from a more holistic perspective" was launched by IBM in 2001 [16,15]. Autonomic computing is seen by IBM [16] as "the development of intelligent, open systems capable of running themselves, adapting to varying circumstances in accordance with business policies and objectives, and preparing their resources to most efficiently handle the workloads we put upon them".

As stated above, we have been carrying out research in the area of building new generation non-dedicated clusters through the study of cluster operating systems supporting parallel processing. However, in order to achieve a truly effective solution we decided to synthesize and develop an operating system for clusters rather than to exploit a middleware approach. Our experience with GENESIS [12], which is a predecessor of Holos, demonstrated that incorporating many services (currently provided by middleware) into a single comprehensive operating system that exploits the concept of a microkernel has made using the system easy and improved the overall performance of application execution. We are strongly convinced that the client-server and microkernel approaches lead to a better design of operating systems, which are not bloated, can be easily tailored to applications, and improve security and reliability. A similar line of thought has been presented recently in [23]. As a natural progression of our work we have decided to move toward autonomic computing on non-dedicated clusters.

The aim of this paper is to present the outcome of our work in the form of the designed services underlying autonomic non-dedicated clusters, and to show the Holos ('whole' in Greek) cluster operating system (the implementation of these services) that is built to offer autonomic parallel computing on non-dedicated clusters. The problem we faced was whether to present this new cluster operating system by showing its architectural vision or to introduce the system from the perspective of how it matches the characteristics of autonomic computing systems. We decided to use the latter because it could "say" more to a wider audience. This approach also allows us to better convey the novelty and contribution of the proposed system through individual elements of the grid and clustering technologies.

This paper is organized as follows. Section 2 discusses related work, and in particular demonstrates that there is no project/system that addresses all characteristics of autonomic computing. Section 3 presents the logical design of the autonomic elements and their services that must be created to provide parallel autonomic computing on non-dedicated clusters. Section 4 introduces the autonomic elements, presented in the previous section, implemented or being implemented as cooperating servers of the Holos cluster operating system. Section 5 concludes the paper and outlines future work.

2. RELATED WORK
IBM's Grand Challenge identifying Autonomic Computing as a priority research area has brought research carried out for many years on self-regulating computers into focus. We long ago identified lack of user friendliness as a major obstacle to the widespread use of parallel processing in distributed systems [10]. In 1993 Joseph Barrera discussed a framework for the design of self-tuning systems [3]. While IBM is advocating a "holistic" approach to the design of computer systems, much of the focus of researchers is upon failure recovery rather than uninterrupted, continuous, adaptable execution. The latter includes execution under varying loads as well as recovery from hardware and software failure. A number of projects related to Autonomic Computing are mentioned by IBM in [16]. OceanStore (University of California, Berkeley) [29] is a persistent data store which has been designed to provide continuous access to persistent information for an enormous number of users. The infrastructure is made up of untrusted servers, hence the data is protected using redundancy and cryptography. Any computer can join the infrastructure by subscribing to one OceanStore service provider. Data can be cached anywhere, anytime, to improve the performance of the system. Information gained and analysed by internal event monitors allows OceanStore to adapt to changes in its environment such as regional outages and denial of service attacks. The Recovery-Oriented Computing (ROC) project [30] is a joint Berkeley/Stanford research project that is investigating novel techniques for building highly dependable Internet services. ROC focuses on the recovery of the system from failures rather than their avoidance. Anthill (University of Bologna, Italy) [1] is a framework to support the design, implementation and evaluation of peer-to-peer applications. Anthill exploits the analogy between Complex Adaptive Systems (CAS), such as biological systems, and the decentralized control and large-scale dynamism of P2P systems. An Anthill system consists of a dynamic network of peer nodes; societies of adaptive agents (ants) travel through this network, interacting with nodes and cooperating with other agents in order to solve complex problems. The types of P2P services constructed using Anthill show the properties of resilience, adaptation and self-organization. Neuromation [25], Edinburgh University's information structuring project, involves the structuring of information based on human memory. The structure used, which is simple, homogeneous and self-referential, would be suited to organizing information in an autonomic architecture. The University of Freiburg's Multiagent Systems Project [24] revolves around the self-organized coordination of multiagent systems. This topic has some connections with Grid computing, especially economic coordination issues as in Darwin, Radar or Globus. The Immunocomputing project [18] (International Solvay Institutes for Physics and Chemistry, Belgium) aims to use the principles of information processing by proteins and immune networks to solve complex problems while at the same time being protected from viruses, noise, errors and intrusions. A Grid scheduling system developed at Monash University, called Nimrod-G [28], has been built to provide tools and services for solving coarse-grain task farming problems. Its resource broker/Grid scheduler has the ability to lease resources at runtime depending on their capability, cost, and availability. In the Bio-inspired Approaches to Autonomous Configuration of Distributed Systems project [4] at University College London, bio-inspired approaches to the autonomous configuration of distributed systems (including a bacteria-inspired approach) are being explored. While many of these systems engage in some aspects of Autonomic Computing, none engages in research to develop a system which has all eight of the required characteristics. Furthermore, none of the projects addresses parallel processing, in particular parallel processing on non-dedicated clusters.

3. THE LOGICAL DESIGN OF AUTONOMIC ELEMENTS PROVIDING AUTONOMIC COMPUTING ON NON-DEDICATED CLUSTERS
According to Horn [15], an autonomic computing system could be described as one that possesses at least the following characteristics: knows itself; configures and reconfigures itself under varying and unpredictable conditions; optimizes its working; performs something akin to healing; provides self-protection; knows its surrounding environment; exists in an open (non-hermetic) environment; and anticipates the optimized resources needed while keeping its complexity hidden. An autonomic computing system is a collection of autonomic elements, which can function at many levels: computing components and services, clusters within companies, and grids within entire enterprises. Each autonomic element is responsible for its own state, behavior and management, which satisfy the user objectives. These elements interact among themselves and with surrounding environments. The objectives of individual components must be consistent with the objective of the whole set of cooperating elements [19]. We have proposed and designed a set of autonomic elements that must be provided to develop an autonomic computing system supporting parallel processing on non-dedicated clusters. These elements are described in the following subsections.

3.1 Cluster knows itself
To allow a system to know itself there is a need for resource discovery. This autonomic element (service) is designed to run on each computer of the cluster and:
• Identifies its components, in particular computers, and their state;
• Acquires knowledge of static parameters of the whole cluster, in particular computers, such as processor type, memory size, and available software; and
• Acquires knowledge of dynamic parameters of cluster components, e.g., data about computers' load, available memory, and communication pattern and volume.
Figure 1 shows an illustration of the outcome of the general design of this autonomic element (service). It depicts the Resource Discovery Service on each computer of the cluster obtaining information from the various local resources, such as processor loads, memory usage, and communication statistics between both local and remote Computational Elements (CEs, or processes).

Figure 1. Resource Discovery Service Design (CE – computation element)

3.2 Cluster configures and reconfigures itself
In a non-dedicated cluster computers could become heavily loaded. On the other hand, there are time periods when some computers of a cluster are lightly loaded or even idle. Some computers cannot be used to support parallel processing of other users' applications because the owners have removed them from the shared pool of resources. To allow a system to offer high availability, i.e., to configure and reconfigure itself under the varying and unpredictable conditions of adding and removing computers, the system was designed to:
• Adaptively and dynamically form virtual clusters according to load and changing resources; and
• Offer high availability of resources, in particular computers.
An illustration of the outcome of the general design of this autonomic element is given in Figure 2. It shows how virtual clusters can change over time, with the virtual cluster expanding at t0, t1 and t2, and contracting at t3 (where t0 < t1 < t2 < t3).

Figure 2. Availability Service Design (RD – resource discovery element)

3.3 Cluster should optimize its working
Computation elements of a newly created parallel (or sequential) application should be placed in an optimal manner on the computers of the virtual cluster formed for the application. Furthermore, if a new computer is added to the virtual cluster or the load of some of the computers in the cluster changes dramatically, load balancing should be employed to improve the overall execution performance. When improving performance, not only computation load and available memory should be taken into consideration, but also communication costs, which in non-dedicated clusters are high. Thus, to optimize the cluster's working:
• Static allocation and load balancing are employed;
• Changing scheduling from static to dynamic, and from dynamic to static, is provided;
• Changing performance indices, which reflect user objectives, between computation-oriented and communication-oriented applications is provided;
• Computation element migration, creation and duplication is exploited; and
• Dynamic setting of the computation priorities of parallel applications, which reflect user objectives, is provided.
The outcome of the general design of this autonomic element is shown in Figure 3. In this example the static allocation component instantiates computational elements on selected computers, and the load balancing component migrates computational elements between computers. The decisions of when, which and where are made by higher-level services, such as a Global Scheduler.

Figure 3. High Performance Service Design

3.4 Cluster should perform something akin to healing
Despite the fact that PCs and networks are becoming more reliable, hardware and software faults can occur in non-dedicated clusters. Failures in the system currently lead to the termination of computations. Many hours or even days of work can be lost if these computations have to be restarted from scratch. Thus, the system should be able to provide something akin to healing:
• Faults and their occurrence are identified and reported;
• Checkpointing of parallel applications is provided;
• Recovery from failures is employed;
• Migration of application computation elements from faulty computers to other, healthy computers that are located automatically is carried out; and
• Redundant/replicated autonomic elements are provided.
An illustration of the outcome of the general design of this autonomic element is given in Figure 4. (Fault detection is not shown in this figure.) Checkpoints are stored in the main memories of other computers of the virtual cluster for performance, and on disk for high reliability. A process is recovered after a fault by using one of the checkpoint copies on a selected computer or from disk.

Figure 4. Self-Healing Service Design

3.5 Cluster should provide self-protection
Computation elements of parallel and distributed applications run on a number of computers of a cluster. They communicate using messages. As such, they are subject to passive and active attacks. Thus, resources must be protected, applications/users authenticated and authorized, in particular when computation element migration is used, and communication security countermeasures must be applied. The design of an autonomic element providing self-protection includes:
• Virus detection and recovery;
• Resource protection based on access control lists and/or capabilities;
• Encryption, as a countermeasure against passive attacks; and
• Authentication, as a countermeasure against active attacks.
This autonomic element is the subject of our current design and will be addressed in another report.

3.6 To allow a system to know and work with its surrounding environment
There are applications that require more computation power, specialized software, unique peripheral devices, etc. Many owners of clusters cannot afford such resources. On the other hand, owners of other clusters and systems would be happy to offer their services and resources to appropriate users. Thus, to allow a system to know its surrounding environment, to prevent a system from existing in a hermetic environment, and to benefit from existing unique resources and services:
• Resource discovery of other similar clusters is provided;
• Advertising services to make a user's own services available to others is in place;
• The system is able to communicate/cooperate with other systems;
• Negotiation with service providers is provided;
• Brokerage of resources and services is exploited; and
• Resources should be made available/shared in a distributed/grid-like manner.
An example of a set of cooperating brokerage autonomic elements running on different clusters that illustrates some aspects of the designed autonomic element is shown in Figure 5.

3.7 A cluster should anticipate the optimized resources needed while keeping its complexity hidden
Until now the single factor limiting the harnessing of the computing power of non-dedicated clusters for parallel computing has been the scarcity of software to assist non-expert programmers. This implies a need for at least the following:
• A Single System Image, in particular where transparency is offered;
• A programming environment that is simple to use and does not require the user to see distributed resources is provided; and

• Message passing and DSM programming are supported transparently.
When these features are provided, the complexity of a cluster is greatly reduced from the perspective of both the programmer and the user, hiding the complexities of managing the resources of a non-dedicated cluster and relieving the programmer of many of the system related functions.

Figure 5. Grid-like Service Design

4. THE HOLOS AUTONOMIC ELEMENTS FOR AUTONOMIC COMPUTING CLUSTERS
To demonstrate that it is possible to develop an easy-to-use autonomic non-dedicated cluster, we decided to implement the autonomic elements presented in Section 3 and build a new autonomic cluster operating system, called Holos. We decided to implement the autonomic elements as servers. Each computer of a cluster is a multi-process system with its objectives set up by its owner, and the whole cluster is a set of multi-process systems with its objectives set up by a super-user.

4.1 Holos architecture
Holos is being built as an extension of the GENESIS system [12]. Holos exploits the P2P paradigm and an object-based approach (where each entity has a name) supported by a microkernel [8]. The general architecture is shown in Figure 6. Holos uses a three-level hierarchy for naming: user names, system names, and physical locations. The system name is a data structure which allows objects in the cluster to be identified uniquely and serves as a capability for object protection [11]. The microkernel provides services such as local inter-process communication (IPC), basic paging operations, interrupt handling and context switching. Other operating system services are provided by a set of cooperating processes. There are three groups of processes: kernel managers, system servers, and application processes. Whereas the kernel managers and system servers are stationary, application processes are mobile. All processes communicate using messages. Kernel managers are responsible for managing the resources of the operating system. The Process Manager, Space (Memory) Manager, and IPC Manager manage the Process Control Blocks (PCBs), memory regions, and IPC of processes, respectively. The Network Manager provides access to the

underlying network and supports communication among remote processes. All the kernel managers support the system servers.

Figure 6. The Holos operating system

The servers, which form the basis of an autonomic operating system for non-dedicated clusters, are as follows:
• Resource Discovery Server – collects data about computation and communication load, and supports the establishment of a virtual cluster;
• Availability Server – dynamically and adaptively forms a virtual cluster for the application;
• Global Scheduling Server – maps application processes onto the computers that make up the virtual cluster for the application;
• Execution Server – coordinates the single, multiple and group creation and duplication of application processes on both local and remote computers;
• Migration Server – coordinates moving an application process (or set of application processes) from one computer to another computer or a set of computers, respectively;
• DSM Server – hides the distributed nature of the cluster's memory and allows programmers to write their code as though using physically shared memory;
• Checkpoint Server – coordinates the creation of checkpoints for an executing application;
• Inter-Process Communication (IPC) Manager – supports remote inter-process communication and group communication within sets of application processes;
• File Server – supports both system and user level processes in accessing secondary storage, particularly the Execution Server in the creation of processes, the Checkpoint Server in the storage of checkpoint data, and the Space Manager in the provision of paging; and

• Brokerage Server – supports resource advertising and sharing through the services of exporting, importing and revoking.

Table 1. Servers working together to carry out services of autonomic computing (Autonomic Computing Requirement – Cooperating Holos Servers and Relationships Among Autonomic Elements)
• To allow a system to know itself – Resource Discovery Server
• A system must configure and reconfigure itself under varying and unpredictable conditions – Resource Discovery, Global Scheduling, Migration, Execution, and Availability Servers
• A system must optimize its working – Global Scheduling, Migration, and Execution Servers
• A system must perform something akin to healing – Checkpoint, Migration, and Global Scheduling Servers
• A system must provide self-protection – Capabilities in the form of System Names
• A system must know its surrounding environment – Resource Discovery and Brokerage Servers
• A system cannot exist in a hermetic environment – Inter-Process Communication Manager and Brokerage Server
• A system must anticipate the optimized resources needed while keeping its complexity hidden (most critical for the user) – DSM and Execution Servers, DSM Programming Environment, Message Passing Programming Environment, PVM/MPI Programming Environment

4.2 Holos possesses the autonomic computing characteristics
The sets of Holos servers that individually and in cooperation provide services satisfying IBM's Autonomic Computing requirements are specified in Table 1. The following subsections present the servers that provide these services, allowing the operating system to behave as an autonomic operating system and to support autonomic parallel computing on non-dedicated clusters. As inter-process communication in Holos is the basis of all services, and in particular of transparency, it is also presented.

4.3 Communication among parallel processes
To hide distribution and make remote inter-process communication look identical to communication between local application processes, we decided to build all of the operating system services of Holos around the inter-process communication facility. To programmers of standard and parallel applications, local and remote communication is indistinguishable, which forms a basis for complete transparency. The IPC Manager is also responsible for both local and remote address resolution for group communication. Messages that are sent to a group require the IPC Manager to resolve the destination process locations and provide the mechanism for the transport of the message to the requested group members. To support programmers, the Holos group communication facility allows processes to create, join, leave and kill a group, and supports different message delivery, response and message ordering semantics [31].
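As a sketch of what such a group communication facility might look like to a programmer, consider the stubbed C interface below; the function names, enums, and semantics flags are hypothetical and are not the actual Holos API.

#include <stdio.h>

/* Illustrative sketch of a group communication interface with the
 * operations described above; stubs stand in for the real IPC
 * Manager, which would resolve member locations transparently. */
typedef int group_t;

enum delivery { DELIVER_ALL, DELIVER_K, DELIVER_QUORUM };
enum ordering { ORDER_NONE, ORDER_FIFO, ORDER_TOTAL };

static group_t grp_create(const char *name, enum delivery d, enum ordering o)
{
    (void)d; (void)o;
    printf("create group %s\n", name);
    return 1; /* stub handle */
}
static int grp_join(group_t g)  { (void)g; return 0; }
static int grp_leave(group_t g) { (void)g; return 0; }
static int grp_kill(group_t g)  { (void)g; return 0; }

/* Local and remote members are indistinguishable to the caller. */
static int grp_send(group_t g, const void *msg, size_t len)
{
    (void)msg;
    printf("send %zu bytes to group %d\n", len, g);
    return 0;
}

int main(void)
{
    group_t g = grp_create("workers", DELIVER_K, ORDER_FIFO);
    grp_join(g);
    grp_send(g, "hello", 5);
    grp_leave(g);
    grp_kill(g);
    return 0;
}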

4.4 Establishment of a virtual cluster for cluster self-awareness
The Resource Discovery Server [12,26] and the Availability Server play a key role in the establishment of virtual clusters upon a cluster. The Resource Discovery Server identifies idle and/or lightly loaded computers and their resources (processor model, memory size, etc.); collects both computational load and communication patterns for each process executing on a given computer; and provides this information to the Availability Server, which uses it to establish a virtual cluster. The virtual cluster changes dynamically in time as some computers are removed or become overloaded and cannot be used as a part of the execution environment for a given parallel application, and some computers are added or become idle/lightly loaded and can become a component of the virtual cluster. The dynamic nature of the virtual cluster creates an environment which can address the requirements of an application that expands or shrinks as it executes. The current Resource Discovery Server collects (using specially designed hooks installed in the microkernel and the Process, Space and IPC Managers) static parameters such as processor type and memory size, and dynamic parameters such as computation load (the number of processes in the ready and blocked states), available memory, and communication pattern and volume. We are enhancing this server through a study of how this data is collected and processed. We are also concentrating our efforts on the Availability Server. We are studying the identification of events that report computer and software faults; the addition and removal of computers to/from the cluster by an administrator and/or user; changes in computation load (the completion of a process, the creation of a new process) and communication load (processes/computers communicating intensively, computers/processes completing); and new requests to allocate/release computers for applications. This information is used in the development of adaptive algorithms for forming and reconfiguring virtual clusters.
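A minimal sketch, in C, of the kind of per-computer record such a resource discovery service might maintain, and of how an availability service could use it; the field names and the eligibility thresholds are invented for illustration.

#include <stdio.h>

/* Sketch of a per-computer record following the static and dynamic
 * parameters listed above; hypothetical layout. */
struct node_static {
    char cpu_model[32];
    long mem_total_kb;
};

struct node_dynamic {
    int  ready_procs;    /* computation load: ready processes  */
    int  blocked_procs;  /* ... and blocked processes          */
    long mem_avail_kb;   /* available memory                   */
    long local_msgs;     /* communication volume, local        */
    long remote_msgs;    /* communication volume, remote       */
};

/* An availability service could apply a rule like this to decide
 * whether a computer may join the virtual cluster; the thresholds
 * here are arbitrary illustrations. */
static int eligible(const struct node_dynamic *d)
{
    return d->ready_procs + d->blocked_procs < 4
        && d->mem_avail_kb > 65536;
}

int main(void)
{
    struct node_dynamic d = { 1, 0, 131072, 1200, 300 };
    printf("node %s join the virtual cluster\n",
           eligible(&d) ? "may" : "may not");
    return 0;
}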

4.5 Mapping parallel processes to computers for cluster self-optimization
Mapping parallel processes to the computers of a virtual cluster is performed by the Global Scheduling Server. This server combines static allocation and dynamic load balancing components, which allow the system to provide mapping by finding the best locations for the parallel processes of an application to be created remotely and to react to large fluctuations in system load. The decision to switch between the static allocation and dynamic load balancing policies is dictated by the scheduling policy, which uses the information gathered by the Resource Discovery Server. Currently, the global scheduler is a centralized server. Our initial performance study of MPI communication- and computation-bound parallel applications on a 16-computer cluster shows that their concurrent execution with sequential applications (computation-bound, I/O-bound, and in between) leads to improved execution performance and better utilization of the whole cluster. Sequential applications have demonstrated a very small slow-down, which in some cases could even be unnoticed and in other cases could be neglected [37,11]. In both cases of parallel applications, static allocation was used to initially place parallel processes on cluster computers, and dynamic load balancing was then employed.
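The static allocation/dynamic load balancing split can be illustrated with a small C sketch; the load metric, the hysteresis threshold, and the function names are assumptions, not the Holos policy.

#include <stdio.h>

/* Sketch of the policy split described above: static allocation
 * places new processes; dynamic load balancing reacts only to large
 * load fluctuations. Thresholds and names are hypothetical. */
struct node { int id; int load; };

static int least_loaded(const struct node *n, int count)
{
    int best = 0;
    for (int i = 1; i < count; i++)
        if (n[i].load < n[best].load) best = i;
    return best;
}

/* Static allocation: map a new process to the least loaded computer. */
static int place_new_process(const struct node *n, int count)
{
    return n[least_loaded(n, count)].id;
}

/* Dynamic load balancing: migrate only when imbalance is large;
 * the hysteresis avoids migration thrashing. */
static int should_migrate(const struct node *from, const struct node *to)
{
    return from->load - to->load > 2;
}

int main(void)
{
    struct node n[] = { {1, 5}, {2, 1}, {3, 3} };
    printf("place on computer %d\n", place_new_process(n, 3));
    printf("migrate 1->2? %s\n",
           should_migrate(&n[0], &n[1]) ? "yes" : "no");
    return 0;
}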

4.6 Process creation
In Holos, each computer is provided with an EXecution (EX) Server, which is responsible for local process creation [13]. A local EX Server is capable of contacting a remote EX Server to create a remote process on its behalf. Currently, the remote process creation service employs multiple process creation, which concurrently creates n parallel processes on a single computer, and group process creation, which is able to concurrently create processes on m selected computers. These mechanisms are of great importance, for instance, for master-slave based applications, where a number of identical child processes are mapped to remote computers. When a new process is to be created, the Global Scheduler instructs the EX Server to create the process locally or on a remote computer. In both instances, i.e., group remote process creation and individual creation, a list of destination computers is provided by the Global Scheduler and forwarded to the EX Servers on the respective destination computers. A process is created from an image stored in a file. This implies a need for employing the File Server to support this operation. To achieve high performance of the group process creation operation, a copy of the file that contains the child image is distributed to the selected computers using the group communication facility.
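A stubbed C sketch of the multiple and group creation operations described above; the signatures are illustrative, and the stub merely prints what a real EX Server would do.

#include <stdio.h>

/* Stub: a real EX Server would create the process from the file
 * image on the named computer. */
static int create_local(const char *image, int computer)
{
    printf("create %s on computer %d\n", image, computer);
    return 0;
}

/* Multiple process creation: n parallel processes on one computer. */
static int create_multiple(const char *image, int computer, int n)
{
    for (int i = 0; i < n; i++)
        if (create_local(image, computer)) return -1;
    return 0;
}

/* Group process creation: one process on each of m computers. A real
 * implementation would first distribute the image file to all the
 * targets with group communication, then create concurrently. */
static int create_group(const char *image, const int *computers, int m)
{
    for (int i = 0; i < m; i++)
        if (create_local(image, computers[i])) return -1;
    return 0;
}

int main(void)
{
    int targets[] = { 2, 5, 7 };
    create_multiple("slave.img", 2, 3);
    create_group("slave.img", targets, 3);
    return 0;
}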

4.7 Process duplication and migration
Parallel processes of an application can also be instantiated on selected computers of the virtual cluster by duplicating a process locally by the EX Server and, if necessary, migrating it to the selected computer(s) [14]. Migrating an application process involves moving the process state, address space, communication state, and any other associated resources. This implies that a number of kernel managers, such as the Process, Space, and IPC Managers, are involved in process migration. The Migration Server only plays a coordinating role [8]. Group process migration is performed, i.e., a process can be concurrently migrated to n computers selected by the Global Scheduling Server.

4.8 Computation co-ordination for cluster self-optimization
When a parallel application is processed on a virtual cluster, where parallel processes are executing remotely, application semantics require an operating system to transparently maintain: input and output to/from the user, the parent/child relationship, and any communication with remote processes. As all communication in Holos is transparent, input and output to/from a user and communication with remotely executing processes are transparent.

In Holos, the parent’s origin computer manages all process “exits” and “waits” issued from the parent and its children. Furthermore, child processes in a parallel section of the program must co-ordinate their execution by waiting for both data allocation at the beginning of their execution and the completion of the slowest process in the group in order to preserve the correctness of the application, implied by a data consistency requirement. In the Holos system barriers are employed for this purpose.

4.9 Checkpointing for cluster self-healing
Checkpointing and fault recovery have been selected to provide fault tolerance. Holos uses coordinated checkpointing, which requires that non-deterministic events, such as processes interacting with each other, the operating system or the end user, be prevented during the creation of checkpoints. Under a microkernel-based architecture, operating system services are accessed by sending requests to operating system servers, rather than directly through system calls. This makes it possible to prevent non-deterministic events by stopping processes from communicating with each other or with operating system servers during the creation of checkpoints. These messages are then included in the checkpoints of the sending processes to maintain the consistency of the checkpoints. Messages are dispatched to their destinations after all checkpoints are created. To improve the performance of checkpointing, an approach that employs the main memory of other computers of the cluster, rather than a centralized disk, is used. However, as in this case back-up computers can also fail, a checkpoint is stored on at least k computers, and k-delivery of group communication is used to support this operation. Disk-based checkpointing is also used, but the frequency of storing checkpoints on disk is much lower. To control the creation of checkpoints, another Holos process, the Checkpoint Server, is employed. This process is placed on each computer and invokes the kernel managers to create a checkpoint of the processes on the same computer [32]. The coordinating Checkpoint Server (on the computer where the application was originally created) directs the creation of checkpoints for a parallel application by sending requests to the remote Checkpoint Servers to perform the operations that are relevant to the current stage of checkpointing. To create a checkpoint of a process, each of the kernel managers must be invoked to copy the resources under its control. Currently, fault detection and fault recovery are the subject of our research. The bases of this research are liveness checking and process migration, which moves a selected checkpoint to a specified computer. We are also developing and studying methods of recording and using information about the location of checkpoints within the relevant virtual cluster.
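The checkpoint placement policy (k in-memory copies plus less frequent disk copies) can be sketched as follows; K_COPIES, DISK_EVERY, and the stub storage functions are hypothetical choices for illustration.

#include <stdio.h>

/* Sketch of the placement policy described above: each checkpoint
 * is pushed to the main memory of at least k other computers (via
 * k-delivery group communication in the real system), and to disk
 * at a lower frequency. Constants are hypothetical. */
enum { K_COPIES = 2, DISK_EVERY = 10 };

static void store_in_memory(int pid, int computer)
{
    printf("ckpt of %d -> memory of computer %d\n", pid, computer);
}
static void store_on_disk(int pid)
{
    printf("ckpt of %d -> disk\n", pid);
}

static void checkpoint(int pid, int epoch, const int *backups, int nbackups)
{
    /* memory copies on k back-up computers: fast, but they can fail */
    for (int i = 0; i < K_COPIES && i < nbackups; i++)
        store_in_memory(pid, backups[i]);
    /* disk copy only every DISK_EVERY epochs: slow, but reliable */
    if (epoch % DISK_EVERY == 0)
        store_on_disk(pid);
}

int main(void)
{
    int backups[] = { 4, 9, 11 };
    for (int epoch = 1; epoch <= 10; epoch++)
        checkpoint(42, epoch, backups, 3);
    return 0;
}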

4.10 Brokerage (toward grids) for cluster self and surroundings' awareness
Brokerage and resource discovery have been studied to build basic autonomic elements allowing Holos services and applications to be offered both to other users working with Holos and to users of other systems [26]. A copy of a brokerage process runs on each computer of the cluster. Each Holos broker preserves user autonomy as in a centralized environment, and supports sharing by advertising services to make a user's own services available to other users, by allowing objects to be exported to other clusters or withdrawn from service, and by allowing objects exported by users from other clusters to be imported. The Holos broker supports object sharing among homogeneous clusters [26] and grids [27]; this implies that resources can be made available and shared in a distributed, grid-like manner. The test version of the broker was developed based on attribute names, in order to allow users to access objects without knowing their precise names.

[Figure 7. Easy Programming Service Design - comprising the Programming Environment (Message Passing or PVM/MPI; Shared Memory), DSM Based and Message Passing Based Communication Primitives, System Services of an Operating System, and Kernel Services of an Operating System.]

4.11 Programming interface for user friendliness
Holos provides transparent communication services of standard message passing (MP) and DSM as its integral components. In this sub-section we present details of these common parallel programming communication mechanisms and how they are integrated transparently into the Holos system. The logical design of the communication services and their interface to applications using MP and DSM is shown in Figure 7; the figure also shows the hierarchical relationship between the communication services and the system and kernel services.

4.11.1 Holos message passing
The standard MP service within the Holos parallel execution environment is provided by the Local IPC component of the microkernel and the IPC Manager, which is supported by the Network Manager. These combine to provide a transparent local and remote MP service that supports both various qualities of service and group communication mechanisms. Programmers are provided with the standard MP primitives send and receive, and the RPC primitives call, receive and reply.
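The primitives named above suggest interfaces of roughly the following shape; the signatures are hypothetical, as the paper does not give the exact Holos API.

class ProcessId {}
class Request {}

interface MessagePassing {
    void send(ProcessId destination, byte[] message);  // also usable for groups
    byte[] receive(ProcessId from);
}

interface RemoteProcedureCall {
    byte[] call(ProcessId server, byte[] request);     // client side
    Request receive();                                 // server side
    void reply(Request request, byte[] response);
}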

4.11.2 Holos PVM and MPI
PVM and MPI have been ported to Holos, as they allow advanced message-passing based parallel environments to be exploited [30,22]. Three modifications to PVM applications running on UNIX have been identified to improve performance: avoiding the use of XDR encoding where possible, using the direct IPC model instead of the default model, and balancing the load. PVM communication is provided transparently by a service that simply maps the standard PVM services onto the Holos communication services, and it benefits from additional services (for example group process creation and process migration) which are not provided by operating systems such as Unix or Windows. The move of MPI from UNIX to Holos has been achieved by replacing the two lower layers of MPICH with the services that Holos provides: group communication, group process creation, process migration, and global scheduling, including static allocation and dynamic load balancing. Incorporating these services into MPI has shown promising results and provides a better basis for implementing parallel programming tools.

4.11.3 Distributed Shared Memory
Holos DSM supports the conventional memory-sharing approach (writing shared-memory code using concurrent programming skills) by using the basic concepts and mechanisms of memory management to provide DSM support [35]. One of the unique features of Holos DSM is that it is integrated into the memory management of the operating system. We decided to embody the DSM within the operating system in order to create a transparent, easy to use and easy to program environment, and to achieve high execution performance of parallel applications.

The options for placing the DSM system within the operating system were either to build it as a separate server or to incorporate it into one of the existing servers. The first option was rejected because of a possible conflict between two servers (the Space Manager and the DSM system) both managing the same object type, i.e., memory; synchronised access to the memory to maintain its consistency would become a serious issue. Since DSM is essentially a memory management function, the Space Manager is the server into which we decided the DSM system should be integrated. This implies that programmers are able to use the shared memory as though it were physically shared; hence the transparency requirement is met. Furthermore, because the DSM system is in the operating system itself and is able to use the low-level operating system functions, the efficiency requirement can also be met.

The granularity of a shared memory object is a critical issue in the design of a DSM system. As the DSM system is placed within the Space Manager and the memory unit of the Holos Space is a page, the most appropriate unit of sharing for the DSM system is a page. The Holos DSM system employs the release consistency model (the memory is made consistent only when a critical region is exited), which is implemented using the write-update model [34]. In Holos DSM, synchronisation of processes that share memory takes the form of semaphore-based mutual exclusion. The semaphore is owned by the Space Manager on a particular computer; because semaphore ownership is controlled by the Space Manager on each computer, gaining ownership of the semaphore remains mutually exclusive even when more than one DSM process exists on the same computer. Barriers are used to co-ordinate executing processes.

One of the most serious problems of current DSM systems is that they have to be initialised manually by programmers [5], [19]; transparency of this operation is not provided. In Holos, DSM is initialised automatically and transparently: machines are selected, processes created and data distributed automatically.
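A minimal sketch of the resulting programming model, with invented names (DsmSemaphore, SharedCounter): shared data is written like ordinary memory, and under release consistency the updates propagate (write-update) when the critical region is released.

class SharedCounter { int value; }          // lives on a DSM page

interface DsmSemaphore {                    // owned by the Space Manager
    void acquire();
    void release();                         // consistency point: updates sent
}

class DsmExample {
    void increment(DsmSemaphore mutex, SharedCounter counter) {
        mutex.acquire();                    // mutual exclusion across nodes
        counter.value++;                    // plain write to shared memory
        mutex.release();                    // replicas are updated here
    }
}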


5. CONCLUSION
In this paper, autonomic computing has been shown to be feasible and able to move parallel computing on non-dedicated clusters into the computing mainstream. The autonomic elements have been designed and implemented as dedicated servers or as parts of other system servers. Together, the cooperating processes that employ these mechanisms offer self and surroundings discovery, the ability to reconfigure, self-protection, self-healing, sharing, and ease of programming. The Holos autonomic operating system has been built as an enhancement of the Genesis system to offer an autonomic non-dedicated cluster. The system relieves developers from programming operating-system oriented activities, and provides developers of parallel applications with both message passing and DSM.

In summary, the development of the Holos cluster operating system demonstrates that it is possible to build an autonomic non-dedicated cluster. This paper contributes to the area of autonomic computing, in particular parallel autonomic computing on non-dedicated clusters, by harnessing many technologies developed by the authors. It also contributes to the area of cluster operating systems through the development of a comprehensive cluster operating system that supports parallel computing and solves the problem of building virtual clusters that change dynamically and adaptively according to load and changing resources, in particular the addition and removal of computers to/from the cluster.

6. ACKNOWLEDGEMENTS
We would like to thank the anonymous COSET reviewers for the valuable feedback and comments they provided. These suggestions helped us greatly to improve this paper.

7. REFERENCES
[1] Anthill (University of Bologna, Italy). http://www.cs.unibo.it/projects/anthill/ (accessed 26 May 2004).
[2] Auban, J.M.B. and Khalidi, Y.A. (1997): Operating System Support for General Purpose Single System Image Cluster. Proc. Int'l Conf. on Parallel and Distributed Processing Techniques and Applications, PDPTA'97, Las Vegas.
[3] Barrera, J. (1993): Self-tuning Systems Software. Proc. Fourth Workshop on Workstation Operating Systems, 194-197.
[4] Bio-inspired Approaches to Autonomous Configuration of Distributed Systems (University College London). http://www.btexact.com (accessed 6 May 2003).
[5] Carter, J. (1993): Efficient Distributed Shared Memory Based on Multi-Protocol Release Consistency. Ph.D. Thesis, Rice University.
[6] Baker, M. (Ed.) (2000): Cluster Computing White Paper, Version 2.0.
[7] De Paoli, D. and Goscinski, A. (1998): The RHODOS Migration Facility. Journal of Systems and Software, 40: 51-65.
[8] De Paoli, D. et al. (1995): The RHODOS Microkernel, Kernel Servers and Their Cooperation. Proc. First IEEE Int'l Conf. on Algorithms and Architectures for Parallel Processing, ICA3PP'95.
[9] Goscinski, A. (2000): Towards an Operating System Managing Parallelism of Computing on Clusters of Workstations. Future Generation Computer Systems: 293-314.
[10] Goscinski, A. and Haddock, A. (1994): A Naming and Trading Facility for a Distributed System. The Australian Computer Journal, No. 1.
[11] Goscinski, A. and Wong, A. (2004): The Performance of a Parallel Communication-Bound Application Executing Concurrently with Sequential Applications on a Cluster - Case Study. (To be submitted to) The 2nd Intl. Symposium on Parallel and Distributed Processing and Applications (ISPA2004), Dec. 2004, Hong Kong, China.
[12] Goscinski, A., Hobbs, M. and Silcock, J. (2002): GENESIS: An Efficient, Transparent and Easy to Use Cluster Operating System. Parallel Computing.
[13] Hobbs, M. and Goscinski, A. (1999a): A Concurrent Process Creation Service to Support SPMD Based Parallel Processing on COWs. Concurrency: Practice and Experience, 11(13).
[14] Hobbs, M. and Goscinski, A. (1999b): Remote and Concurrent Process Duplication for SPMD Based Parallel Processing on COWs. Proc. Int'l Conf. on High Performance Computing and Networking, HPCN Europe'99, Amsterdam.
[15] Horn, P. (2001): Autonomic Computing: IBM's Perspective on the State of Information Technology.
[16] IBM (2001): IBM Corporation. http://www.research.ibm.com/autonomic/research (accessed 26 May 2004).
[17] Iftode, L. and Singh, J.P. (1997): Shared Virtual Memory: Progress and Challenges. Technical Report TR-552-97, Department of Computer Science, Princeton University, October.
[18] Immunocomputing (International Solvay Institutes for Physics and Chemistry, Belgium). http://solvayins.ulb.ac.be/fixed/ProjImmune.html (accessed 6 May 2003).
[19] Keleher, P. (1994): Lazy Release Consistency for Distributed Shared Memory. Ph.D. Thesis, Rice University.
[20] Kephart, J. and Chess, D. (2003): The Vision of Autonomic Computing. Computer, Jan.
[21] Lottiaux, R. and Morin, C. (2001): Containers: A Sound Basis for a True Single System Image. Proc. First IEEE/ACM Int'l Symp. on Cluster Computing and the Grid, Brisbane.
[22] Maloney, A., Goscinski, A. and Hobbs, M.: An MPI Implementation Supported by Process Migration and Load Balancing. Recent Advances in Parallel Virtual Machine and Message Passing Interface: Proc. of the 10th European PVM/MPI User's Group Meeting, pp. 414-423, Springer-Verlag.
[23] McGraw, G. and Hoglund, G. (2004): Dire Straits: The Evolution of Software Opens New Vistas for Business and the Bad Guys. http://infosecuritymag.techtarget.com/ss/0,295796,sid6_iss366_art684,00.html (accessed 26 May 2004).
[24] Multiagent Systems (Freiburg University). http://www.iig.uni-freiburg.de/~eymann/publications/ (accessed 26 May 2004).
[25] Neuromation (Edinburgh University). http://www.neuromation.com/ (accessed 26 May 2004).
[26] Ni, Y. and Goscinski, A. (1994): Trader Cooperation to Enable Object Sharing Among Users of Homogeneous Distributed Systems. Computer Communications, 17(3): 218-229.
[27] Ni, Y. and Goscinski, A. (1993): Resource and Service Trading in a Heterogeneous Distributed System. Proc. IEE Workshop on Advances in Parallel and Distributed Systems, Princeton.
[28] Nimrod-G (Monash University). http://www.gridbus.org/ (accessed 26 May 2004).
[29] OceanStore (University of California, Berkeley). http://oceanstore.cs.berkeley.edu/ (accessed 26 May 2004).
[30] Recovery-Oriented Computing (Berkeley/Stanford). http://roc.cs.berkeley.edu/ (accessed 26 May 2004).
[31] Rough, J. and Goscinski, A. (1999): Comparison Between PVM on RHODOS and Unix. Proc. Fourth Int. Symp. on Parallel Architectures, Algorithms and Networks (ISPAN'99), A. Zomaya et al. (Eds), Fremantle.
[32] Rough, J. and Goscinski, A. (2004): The Development of an Efficient Checkpointing Facility of the GENESIS Cluster Operating System. Future Generation Computer Systems, 20(4): 523-538.
[33] Shirriff, K. et al. (1997): Single-System Image: The Solaris MC Approach. Proc. Int'l Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'97, Las Vegas.
[34] Silcock, J. and Goscinski, A. (1997): Update-Based Distributed Shared Memory Integrated into RHODOS' Memory Management. Proc. Third Intl. Conference on Algorithms and Architecture for Parallel Processing, ICA3PP'97, Melbourne, Dec. 1997, 239-252.
[35] Silcock, J. and Goscinski, A. (1999): A Comprehensive DSM System That Provides Ease of Programming and Parallelism Management. Distributed Systems Engineering, 6: 121-128.
[36] Walker, B. (1999): Implementing a Full Single System Image UnixWare Cluster: Middleware vs. Underware. Proc. Int'l Conf. on Parallel and Distributed Processing Techniques and Applications, PDPTA'99.
[37] Wong, A. and Goscinski, A. (2004): Scheduling of a Parallel Computation-Bound and Sequential Applications Executing Concurrently on a Cluster - Case Study. (Submitted to) IEEE Intl. Conference on Cluster Computing, Sept. 2004, San Diego, California.

Type-Safe Object Exchange Between Applications and a DSM Kernel

R. Goeckelmann, M. Schoettner, S. Frenz and P. Schulthess
Department of Distributed Systems, University of Ulm, 89075 Ulm, Germany
[email protected]
Phone: ++49 731 50 24238; Fax: ++49 731 50 24142

Abstract: The Plurix project implements an object-oriented Operating System (OS) for PC clusters. Communication is achieved via shared objects in a Distributed Shared Memory (DSM), using restartable transactions and an optimistic synchronization scheme to guarantee memory consistency. We contend that coupling object orientation with the DSM property allows type-consistent system bootstrapping, quick system startup and simplified development of distributed applications. It also facilitates checkpointing of the system state. The OS (including kernel and drivers) is written in Java using our proprietary Plurix Java Compiler (PJC), which translates Java source code directly into Intel machine instructions. PJC is an integral part of the language-based OS and tailor-made for compiling in our persistent DSM environment. In this paper we briefly illustrate the architecture of our OS kernel, which runs entirely in the DSM, and the resulting opportunities for checkpointing and communication between applications and OS. We also discuss memory management issues related to the DSM kernel and strategies to avoid false sharing.

Keywords: Distributed Shared Memory, Object-Orientation, Reliability, Single System Image

1 Introduction
Typical cluster systems are built on top of traditional operating systems (OS) such as Linux or Microsoft Windows, and data is exchanged using message passing (e.g. MPI) or remote invocation (e.g. RPC, RMI) strategies. As each node in a cluster runs its own OS with its own configuration, the migration of processes is difficult, because it is unknown which libraries and resources will be available on the next node. Additionally, if a process is migrated to another node, the entire context, including relevant parts of the kernel state, must be saved and transferred. Because these OSs are not designed for cluster operation it is difficult to migrate kernel contexts [Smile], and as a consequence cluster systems typically redirect calls of migrated processes back to the home node, e.g. Mosix [Mosix]. Plurix is an OS specifically tailored for cluster operation and avoids these difficulties.

The Distributed Shared Memory (DSM) offers an elegant solution for distributing and sharing data among loosely coupled nodes [Keedy], [Li]. Applications running on top of the Plurix DSM are unaware of the physical location of objects: a reference can point either to a local or to a remote memory block. During program execution the OS detects a remote memory access and automatically fetches the desired memory block. Plurix extends the DSM to a distributed heap storage, with the benefit that not only data but also the code segments of programs are available on each node, as they are shared in the DSM.

One of our major research goals is to simplify the development of distributed applications. Typically, DSM systems use weak consistency models to guarantee the integrity of shared data. This makes the development of applications hard, as each programmer must explicitly manage the consistency of the data using the offered synchronization mechanisms [TreadMarks]. Plurix uses a strong consistency model, called transactional consistency [Wende02], relieving the programmer from explicit consistency management.


Single-System-Image (SSI) computing architectures have been a mainstay of high-performance computing for many years. In a system implementing the SSI concept, each user gains a global and uniform view of available resources and programs. The system provides the same libraries and services on each node in the cluster, which is very important for load balancing and for the migration of processes. We extend the SSI concept by storing the OS, kernel, and all drivers in the DSM. As a consequence we can implement a type-safe kernel interface and at the same time simplify checkpointing and recovery.

In 1990 Fuchs introduced checkpointing and recovery for DSM systems [Fuchs90]. Numerous subsequent papers discuss the adaptation of checkpointing strategies designed for message-passing systems, ranging from globally coordinated solutions to independent checkpointing with and without logging [Morin97]. However, the more sophisticated solutions have not been evaluated in real implementations, because checkpointing is difficult to achieve in PC clusters even under global coordination: it is not sufficient to save the DSM context; the local kernel context must be saved as well, which is not trivial. Plurix avoids these drawbacks by storing the OS and applications in the DSM.

The remainder of the paper is organized as follows. The design of Plurix is briefly presented in section 2. Section 3 describes the advantages of a type-safe kernel interface, and section 4 the benefits of running the kernel within the DSM. Extending the SSI concept provides additional advantages for checkpointing, which are described in section 5. Finally, we present measurements and give an outlook on future work.

2 Design of Plurix
Plurix implements SSI properties at the operating system level, using a page-based distributed shared memory. According to the SSI concept all programs and libraries must be available on all nodes in the cluster. Therefore Plurix uses a global address space, shared by all nodes and organized as a distributed heap storage (DHS) containing both data and code. Sharing the programs in the DHS reduces redundancy concerning code segments and makes the administration of the system easier.

2.1 Java-based Kernel and Operating System
Plurix is entirely written in Java and works in a fully object-oriented fashion. The development of an operating system requires access to device registers, which is not possible in standard Java. For this reason we have developed our own Plurix Java Compiler (PJC) with language extensions to support hardware-level programming. The compiler directly generates Intel machine instructions and initializes runtime structures and code segments in the heap. Traditional object-, symbol-, library- and exe-files are avoided. Each new program is compiled directly into the DHS and is thereby immediately available at each node.

Plurix is designed as a lean and high-speed OS and is therefore able to start quickly. The start time of the primary node, which creates a new heap (installation of Plurix) or restarts a preexisting heap from the PageServer (see section 5.1), is less than one second. Additional nodes, which only have to join the existing heap, can be started in approximately 250 ms. This quick-boot function guarantees fast node and cluster start-up, which helps avoid long downtimes in case of critical errors.

2.2 Distributed Shared Memory
The transfer of DHS objects from one cluster node to another is managed within the page-based distributed shared memory (DSM) and takes advantage of the Memory Management Unit (MMU) hardware. The MMU detects page faults, which are raised when a node requests an object on a page that is not locally present. Each page fault results in a separate network packet containing the address of the missing page (PageRequest). This packet is broadcast to all nodes in the cluster (Fast Ethernet LAN) and only the current owner of the page sends it to the requesting node.

An important topic in distributed systems is the consistency of shared and replicated objects. In Plurix this is synonymous with the consistency of the DSM. Plurix offers a new consistency model, called transactional consistency, which is described in the following section.
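The page-fault path can be summarized in code as follows; the names (DsmFaultHandler, PageRequest, Network) are our own, and the sketch omits ownership bookkeeping and the restart of the faulting instruction.

class PageRequest {
    final int address;
    PageRequest(int address) { this.address = address; }
}
class Page {}

interface Network {
    void broadcast(PageRequest request);   // one packet per page fault
    Page awaitPage(int address);           // answered by the owner only
}

class DsmFaultHandler {
    private final Network net;
    DsmFaultHandler(Network net) { this.net = net; }

    void onPageFault(int pageAddress) {
        net.broadcast(new PageRequest(pageAddress));
        Page page = net.awaitPage(pageAddress);
        mapLocally(pageAddress, page);     // update the local page tables
    }
    private void mapLocally(int address, Page page) { /* MMU update */ }
}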


2.3 Consistency and Restartability
Unlike traditional systems, Plurix does not burden the programmer with the consistency of the DHS objects. All actions in Plurix are encapsulated in transactions. At the start of a transaction, write access to pages is prohibited. If a page is written, the system creates a shadow image of it and then enables write access. Additionally, the system logs the pages for which shadow images were created. At the end of a transaction (commit phase) the addresses of all modified pages are broadcast and all partner nodes in the cluster invalidate these pages. If there is a collision with a running transaction on another node, that transaction is aborted and eventually restarted. In case of an abort all modified pages are discarded; since there is a shadow image for each modified page, the system can reconstruct the state of the node just before the transaction was started. A token mechanism ensures that only one node is in the commit phase at a time. The token is passed using a first-wins strategy; further commit strategies will be developed to improve fairness.
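A condensed sketch of this transactional cycle, under invented names; the real Plurix mechanism works at page-table level rather than through such an interface.

class AbortException extends RuntimeException {}

interface TransactionSupport {
    void writeProtectAllPages();    // first write traps and shadows the page
    void acquireCommitToken();      // only one committer at a time
    void broadcastInvalidations();  // partners drop the modified pages
    void dropShadowImages();        // commit succeeded
    void restoreShadowImages();     // abort: state as before the transaction
}

class TransactionRunner {
    void run(Runnable transaction, TransactionSupport node) {
        while (true) {
            node.writeProtectAllPages();
            try {
                transaction.run();
                node.acquireCommitToken();
                node.broadcastInvalidations();
                node.dropShadowImages();
                return;
            } catch (AbortException collision) {
                node.restoreShadowImages();   // then retry (restartable)
            }
        }
    }
}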

2.4 False Sharing and Backchain
All page-based DSM systems suffer from the notorious false-sharing syndrome. False sharing occurs if two or more nodes access separate objects which nevertheless reside on the same page. If a node writes to such an object, all other nodes are forced to abort their current transaction and restart it later. As these objects are not actually shared, such an abort is semantically unjustified and unnecessarily slows down the entire cluster. To handle this problem, relocation of DSM objects from one physical page to another is required. When an object is relocated, all pointers to this object must be adjusted. Due to the substantial network latency in the cluster environment, it is not feasible to inspect each object to determine whether it contains a pointer to the relocated object. To adjust the affected references, Plurix uses the Backchain [Traub]. This concept links together all references to an object by recording the addresses of these pointers (see fig. 1). All references to a relocated DSM object are found in its Backchain. To reduce invalidations of remote objects when a new Backchain entry is inserted, references on the stack are not tracked in the Backchain.

[Figure 1. The Backchain Concept - the backchain of a DSM object records the addresses of all pointers that reference it.]
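In code, the Backchain idea amounts to keeping, per object, the list of pointer locations that reference it, as the following illustrative sketch shows (names are ours):

import java.util.ArrayList;
import java.util.List;

interface PointerSlot { void write(int newTarget); }

class DsmObject {
    int address;                                       // current DSM address
    final List<PointerSlot> backchain = new ArrayList<>();

    void relocate(int newAddress) {
        for (PointerSlot slot : backchain) {
            slot.write(newAddress);                    // adjust each reference
        }
        address = newAddress;
    }

    boolean isGarbage() {
        return backchain.isEmpty();                    // see section 2.5
    }
}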

2.5 Garbage Collection
The previously described Backchain concept can also be used to simplify distributed garbage collection (GC). A Mark-and-Sweep algorithm should not be used in a DHS environment, because it is either very difficult to implement (incremental Mark-and-Sweep) or it would stop the entire cluster while collecting garbage. Copying GC algorithms would unduly reduce the available address space, so only reference-counting algorithms appear feasible. The Backchain can be used as a reference counter: if an object has an empty Backchain, no references to this object remain. This is equivalent to a reference counter of 0, so in this case the object is garbage and can be deleted. Because stack references are not included in the Backchain, the GC may only run when the stack is empty. Between two transactions this condition always holds, so the GC task can be run as a regular Plurix transaction.


3 A Type-Safe Interface for a DSM-Kernel
The SSI concept requires that all nodes in the cluster have the same programs installed. In a distributed environment the easiest way to achieve this goal is to share not only data but also the code of the programs; for this reason Plurix extends the DSM to the DHS. In this case it is mandatory to protect the code segments from unwanted modification, either by corrupted pointers or by malicious attacks. This can be achieved by using a type-safe language like Java; language-based OS development has been successfully demonstrated by the Oberon system [Wirth]. The requirement for type safety in the DSM also affects the interface to the OS. As data in the DSM is represented by objects and this data must be passed to the kernel, either the objects must be serialized before they are used as parameters or the kernel must be able to accept objects.

3.1 Traditional Kernel Interfaces
Traditionally, distributed systems are implemented as a middleware layer on top of a local OS, such as Linux or Mach, which is mostly written in C and therefore does not provide objects. The communication between the distributed system and the local OS takes place using primitive data types or structures. If the kernel cannot handle objects as such, they are serialized (and data items are copied) before being passed to the kernel. This kind of raw communication provides no type checks of parameters and signatures by the runtime environment. Hence no type-safe calls of kernel methods are possible, and each kernel method has to explicitly check its parameters to avoid runtime errors.

3.2 Benefits of a Type-Safe Kernel Interface
To reduce programming complexity and to increase system performance we recommend passing typed objects to the kernel. This was part of the motivation to create Plurix as a stand-alone OS rather than as a middleware layer. Since the kernel of Plurix is written in Java and easily handles objects, type-safe communication between the DSM applications and the OS is natural. All Java types and objects can be handed to the kernel methods. The programmer need not pay attention to the type of the passed object, because this is checked by the compiler and, in some cases, by the runtime environment. Furthermore, there is no need to serialize objects which are used as parameters for kernel methods, so the performance of the entire system increases. Another benefit of using objects as parameters is that neither the object nor the data it contains needs to be copied: the kernel method obtains a reference and accesses the object directly, which improves system performance further.
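The contrast can be made concrete with two hypothetical interfaces: a traditional untyped system call taking serialized bytes, and a Plurix-style kernel method taking a typed object by reference. The class names below are invented for illustration.

class FileRequest { String name; byte[] data; }

interface RawKernel {
    // Traditional style: arguments serialized to bytes and copied; each
    // kernel method must validate them itself at run time.
    int syscall(int number, byte[] serializedArguments);
}

interface TypedKernelService {
    // Plurix style: the compiler checks the parameter type, and the kernel
    // receives a reference - nothing is serialized or copied.
    void write(FileRequest request);
}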

3.3 Inter Address Space Pointers
In traditional systems there are at least two different address spaces: one for the kernel and at least one for user applications. As the kernel methods are always needed on each node, the straightforward way of implementing the system would be to place the kernel in the local address space. These local addresses are not shared with other nodes, and each node in the cluster can use them in different ways. In such an environment, a separation between the kernel and user address spaces would mean differentiating between the local (Non-DSM) and the DSM address space. If objects are then used as parameters, some references will point from the Non-DSM into the DSM address space.

References which point from the Non-DSM into the DSM reduce the performance of the cluster, as they inhibit the relocation of objects, so that false sharing and memory fragmentation can no longer be remedied. The reason is that Backchain entries are no longer unambiguous when an object migrates to another node and is then relocated from one DSM address to another. If an object is referenced by a Non-DSM object, its Backchain leads into the local memory of a node. As addresses in the local memory are not unique, the pointer cannot be adjusted, because it is not possible to detect which local memory area is specified by this Backchain entry. The correct reference to the object cannot be found, and an adjustment of the memory location specified by the Backchain would lead to invalid pointers or even destroyed code segments (see figure 2).

4

[Figure 2. Migration and subsequent relocation of a DSM object - three stages: (a) a DSM object referenced by a Non-DSM-object pointer; (b) the DSM object migrated to another node; (c) the DSM object relocated to another address, where the backchain entry into local memory can no longer be adjusted.]

As long as DSM objects are relocatable, references from the Non-DSM into the DSM address space are not permissible, as they could lead to dangling pointers or destroyed code segments. One could instead prevent the relocation of DSM objects which are referenced by Non-DSM objects, but since it is not possible to determine which objects will be used as parameters for kernel methods, nearly all objects in the DSM would become non-relocatable, and the performance of the cluster would suffer because false sharing and fragmentation of the memory could no longer be handled. Therefore direct pointers from the Non-DSM into the DSM address space must be avoided.

Another interesting question is how kernel methods can be called from DSM applications. Two alternative methods are conceivable:

1. Software interrupt: As in most traditional systems, kernel methods may be called using kernel or system calls, i.e. software interrupts which request a specific function from the kernel. If kernel calls are used to communicate between the DSM applications and the operating system, there are no "address space spanning" pointers, but the question arises how to pass parameters from the DSM to the kernel, as the software interrupt itself cannot accept parameters. One possible solution is to pass data to the kernel through a fixed address. If an object is used as a parameter, this address would contain the pointer to the object. As each kernel method requires different parameters, this object must be of a generic type, so that any object can be passed. Each kernel method then has to check whether the given object is type-compatible with the expected one, as this cannot be handled by the runtime environment. This raises the complexity for system programmers and makes the system vulnerable to faults, while simultaneously reducing performance and restricting parameter passing.

2. Object-oriented invocation: Kernel methods are invoked in an object-oriented fashion via direct pointers to the requested kernel class. This implies that all kernel classes and their methods have to reside at the same addresses on each node in the cluster, because each application can only hold one pointer to a kernel class. Should they reside at different addresses, these references would point to invalid addresses and the corresponding kernel methods could not be called correctly on some nodes (see figure 3). If direct pointers are used, each node in the cluster must run the same kernel, and such a kernel can never be changed at runtime; otherwise all pointers in the applications which reference kernel methods would require adjustment. To achieve this, all kernel methods would need a Backchain pointing from the Non-DSM into the DSM, and thereby the problems described above would reoccur.


[Figure 3. Invalid reference to kernel methods - a direct pointer valid in the local memory of one node points to an invalid address on another.]

Both techniques give rise to an additional problem. The compiler runs in the DSM and any new program is automatically created in the DSM. If the new program is a device driver (which typically resides in kernel space), the code segments must be transferred from the DSM into the Non-DSM address space, and this must occur simultaneously on each node. Our implemented solution, which solves all the challenges above, is to remove the kernel from the local memory address space and move it into the DSM. Further benefits of this approach are described in the following section.

4 Extending the Single System Image Concept
We elaborate the SSI concept by moving the OS and the kernel into the DSM. The local memory is used only for a few state variables of the network device drivers and the network protocol, and for the so-called SmartBuffers, which help to bridge the gap between non-restartable interrupts and transactions [Bindhammer].

4.1 Benefits of a kernel running in the DSM
If the kernel runs in the DSM, parameter passing between applications and kernel is elegant and all objects can be used as parameters. Kernel methods are called directly, as described in section 3.3, and there are no references pointing from one address space to the other. Since all device drivers now reside in the DSM, even the problem of transferring newly compiled drivers from the DSM into the kernel space vanishes. Because the code segments of the kernel methods are in the DSM, redundancy is avoided. Further benefits of this concept, especially for system checkpointing, are described in section 5. Some interesting questions surfaced when moving the kernel into the DSM; before we describe these topics and our corresponding solutions, we present the memory management of Plurix and the allocation mechanism for new objects, as these are important for our solution.

4.2 Distributed Heap Management
A basic design decision of the Plurix system is the page-based DSM, which raises the false-sharing problem. The allocation strategy of the memory management must therefore avoid false sharing wherever possible. Furthermore, collisions during the allocation of objects in the DHS must be avoided, as such a collision aborts other transactions and thereby serializes all allocations in the cluster. To achieve these goals, Plurix uses a two-stage allocation concept consisting of allocator objects and a central memory manager. The latter is needed because the memory has to be portioned out to the different nodes in the cluster. This division must not be static, as that would limit the maximum size of objects. The memory manager is used to create allocators and large objects. Since an allocator must be at least as large as the object to be created, using allocators for large objects would lead to large allocators and thereby to a static fragmentation of the heap; the alternative, limiting the size of allocators and thereby the maximum size of DHS objects, is unacceptable. Allocator objects represent a portion of empty memory. The size of an allocator is reduced for each allocated object; when it is exhausted, the allocator is discarded and a new one is requested from the central memory manager.


4.2.1 Allocation of Objects

If a new object is requested, the memory management first decides whether the object is created by the corresponding allocator or by the memory manager. This decision depends on the size of the object. Each object larger than 4 KB is directly allocated by the memory manager. To avoid false sharing on these objects, their size is increased to a multiple of 4 KB (the page granularity of the 32-bit Intel architecture). As all objects allocated by the memory manager have a size of a multiple of 4 KB, each such object starts at a page border and consumes N pages; therefore these objects do not co-reside with other objects on the same page. Objects smaller than 4 KB are created by an allocator. As each node has its own allocator, collisions can only occur when a large object is allocated or when an allocator is exhausted and a new one must be created. The measurements in section 6 show that most objects in Plurix are smaller than 4 KB, so large objects are rarely allocated, and the collisions which occur during these allocations are tolerable most of the time.

[Figure 4. Allocation of objects - objects smaller than 4 KB are created by the per-node allocators; objects of 4 KB and above are allocated directly by the central memory manager.]

The benefit of this two-level allocation of objects is that small objects from one node are clustered in memory. As a consequence, collisions do not occur during the allocation of small objects and are rare when large objects are allocated. As large objects are not allocated within an allocator, the allocator size can be limited without limiting the maximum size of objects. No static division of the memory is needed and therefore no static fragmentation is created.
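The allocation decision can be sketched as follows; all names are our own, and the code only illustrates the size-based split and the rounding to whole pages.

interface Allocator {               // per-node, carves small objects
    boolean fits(int size);
    int carve(int size);
}
interface MemoryManager {           // central, hands out pages and allocators
    int allocatePages(int pageCount);
    Allocator newAllocator();
}

class HeapAllocation {
    static final int PAGE = 4096;   // page granularity of 32-bit Intel
    private Allocator local;
    private final MemoryManager manager;

    HeapAllocation(MemoryManager manager, Allocator initial) {
        this.manager = manager;
        this.local = initial;
    }

    int allocate(int size) {
        if (size >= PAGE) {
            int pages = (size + PAGE - 1) / PAGE;   // round up: no co-residents
            return manager.allocatePages(pages);    // rare, may collide
        }
        if (!local.fits(size)) {
            local = manager.newAllocator();         // rare collision point
        }
        return local.carve(size);                   // collision-free fast path
    }
}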

4.2.2 Reduction of False Sharing
Generally speaking, objects can be divided into two categories: read-only (RO) and read-write (RW) objects. False sharing on RW objects is reduced by the mechanism described above. To further reduce false sharing it is reasonable to ensure that RO objects, such as code segments and class descriptors without static variables, do not co-reside with RW objects on the same page, as this would lead to unnecessary invalidations of the RO objects due to false sharing. Code segments are only written during compilation, by the compiler. If these objects were indiscriminately allocated, they could reside on the same pages as the RW objects of the node currently running the compiler. To avoid this, Plurix provides additional allocators for RO objects.

4.3 Protection of SysObjects
If the entire system runs in the DSM, some code segments and instances of classes must be protected against invalidation, as these objects are vital for the system. The objects which must always be present on a node are called SysObjects: notably all classes and instances concerning the page-fault handler, the DSM protocol and the network device drivers. As these objects reside in the DSM they might be affected by the transaction mechanism; in case of a collision on such a page, the page would be discarded and the node would hang, as it would no longer be able to request missing pages.

The protection of SysObjects against invalidation is easy to achieve by defining two additional allocators. SysObjects are either code segments or instances of SysClasses. As described above, code segments are written only during compilation and are otherwise read-only. Additionally, code segments should not co-reside on the same pages as RW objects, as this would lead to false sharing; therefore a special allocator is used. The compiler creates new kernel classes in a separate memory area; afterwards update messages are sent to all nodes in the cluster to replace old classes and instances by new ones. It is sufficient to ensure that such an allocator is used only by the current compilation, and that afterwards the remaining part of the last used page is consumed by a Dummy-SysObject.

RW-SysObjects are instances of SysClasses which are meaningless for all nodes except the one that created the instance. For this reason RW-SysObjects are not published through the global name service, so no other node can access a RW-SysObject. The only case in which a RW-SysObject could be invalidated is as a result of false sharing. To prevent this, each node acquires a SysRW-Allocator during the boot phase. All instances of SysClasses are allocated in this private allocator, so that only SysObjects from one node reside on the same page. These two additional allocators, and the described techniques for using them, are sufficient to protect all SysObjects against invalidation at run time.

4.4 Local memory for State Variables
State variables of the DSM protocol and the network device drivers must survive the abort mechanism, as these variables are needed to handle aborts. If they were reset, the current state of the protocol and of the network adapter would be lost: the network device driver would never be able to receive the next packet, as the receive-buffer pointer would also be reset. The protocol also contains a sequence number for messages, to ensure that no vitally important message is lost. If the state variables were reset, the protocol would receive messages "from the future", and it could not decide whether a sequence number is invalid due to an abort or because the node has missed important network packets. As the protocol is not a device driver, its current state variables cannot be read back from hardware registers, as is possible for normal (non-network) device drivers. Hence these variables must be stored outside the DSM address space. For device drivers and the protocol, the kernel provides special areas in the local memory in which state variables are stored. To access these areas, Plurix provides "structs", which allow raw memory to be addressed much like the variables of an object. Structs are also used to access the memory-mapped registers of devices. As structs may not contain pointers and are not referenced by pointers, no problems with address-space-spanning pointers arise.
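PJC's struct extension is proprietary, so the following plain-Java analogy only illustrates the idea: a fixed window of raw local memory addressed through accessor methods. The offsets and register names are invented for illustration.

interface RawMemory {               // a window onto local (non-DSM) memory
    int readInt(int offset);
    void writeInt(int offset, int value);
    long readLong(int offset);
    void writeLong(int offset, long value);
}

class NicState {                    // state that must survive an abort
    private final RawMemory mem;
    NicState(RawMemory mem) { this.mem = mem; }

    int receiveBufferPtr()           { return mem.readInt(0x00); }
    void receiveBufferPtr(int value) { mem.writeInt(0x00, value); }
    long sequenceNumber()            { return mem.readLong(0x08); }
    void sequenceNumber(long value)  { mem.writeLong(0x08, value); }
    // No reference fields: structs may neither contain nor be targets of
    // pointers, so no address-space-spanning references can arise.
}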

4.5 Restart of device drivers
In case of an abort, the state of the entire node is reset to the state just before the current transaction was started. Devices cannot be reset automatically, so the device-driver programmer must implement an Undo method, which is called by the system in case of an abort. This method has to ensure that both the state of the hardware and the state variables in the device-driver object are reset. To make this possible, the state of all devices before the transaction must be conserved.

Consider an Undo method for a graphics controller. Here the current on-screen and off-screen memory areas on the display adapter must be reset. Since between two transactions the on-screen and off-screen areas contain the same data, it is sufficient to reset the off-screen memory and afterwards copy its contents to the on-screen area. This is easy to implement, as most graphics controllers contain substantial amounts of memory for textures and vertices; a small part of this memory can be used to save the committed state of the graphics controller. After the commit phase, the current on-screen area is copied into this separate memory area and can be restored if necessary.

The serial-line controller is more difficult to handle: it sends data as soon as it receives it from the system, and in case of an abort it is not possible to "undo" the sent data. For this problem there are two possible solutions. Either the affected application is able to handle duplicated data, or the driver has to use SmartBuffers. Data in this special buffer type is invisible to the device until the commit phase, so the device can only access committed and therefore valid data.
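The Undo contract can be sketched as follows; the interface and the graphics-driver example use invented names and merely mirror the save/restore scheme described above.

interface RestartableDriver {
    void undo();    // called by the system when the current transaction aborts
}

class GraphicsDriver implements RestartableDriver {
    void afterCommit() {
        saveOnScreenToSpareMemory();     // conserve the committed state
    }

    @Override
    public void undo() {
        restoreSpareMemoryToOffScreen(); // reset the off-screen area ...
        copyOffScreenToOnScreen();       // ... and republish it on screen
    }

    private void saveOnScreenToSpareMemory() { /* blit on the adapter */ }
    private void restoreSpareMemoryToOffScreen() { /* blit back */ }
    private void copyOffScreenToOnScreen() { /* flip or copy */ }
}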

5 Checkpointing and Recovery
State-of-the-art PC clusters are built using Linux or Microsoft Windows, but implementing checkpointing and recovery on these operating systems is difficult: it is not sufficient to save the process context; the local kernel context must be captured as well. The latter includes internal data structures, open files, used sockets, pending page requests, ... which can be read only at kernel level. Resetting the kernel and process context in case of a rollback is also challenging because of the complex OS architectures. As a consequence, taking a checkpoint is time-consuming and checkpointing intervals are quite large, e.g. 15-120 min. for the IBM LoadLeveler. By extending the Single System Image concept we avoid these drawbacks. Storing the kernel and its contexts in the DSM makes it easy to save this data, and rollback in case of an error is no problem in Plurix, because the OS and all applications are designed to be restartable anyway.

5.1 Current Implementation
A central fault-tolerant PageServer stores consistent heap images in an incremental fashion on disk. Between two checkpoints the PageServer uses a bus-snooping protocol to intercept transmitted and invalidated memory pages, reducing the amount of data to be retrieved from the cluster at the next checkpoint. When a checkpoint must be saved, the cluster is stopped and the PageServer collects the invalidated pages that have not been transmitted since the last checkpoint. All memory pages are written to disk synchronously; we have implemented a highly optimized disc driver that is able to write about 45 MB/s. An early performance evaluation of our PageServer can be found in section 6.

Because the kernel and its context reside in the DSM, we need not save any node-local data. Furthermore, we have no long-running processes or threads with preemptive multitasking that need to be checkpointed. Currently, we use a cooperative multitasking scheme executing short transactions: a transaction is executed by a command or called periodically from the scheduler, and long-running computations have to be divided into sub-transactions manually. In case of an error, a node can reboot and fetch the required memory pages of the last checkpoint from the DSM again.

5.2 Fault-Tolerance
We support clusters running within a single Fast Ethernet LAN and assume fail-stop behavior of nodes. Most DSM systems use a reliable multicast or broadcast facility to avoid inconsistencies caused by lost network packets. Because of the low error probability of a LAN, we are not willing to impose the overhead of reliable communication during normal operation. Instead we rely on fast error detection, fast recovery, and the quick-boot option of our cluster OS.

As described in section 2.3, our DSM implements transactional consistency, and committing transactions are serialized using a token. We introduce a logical global time (a 64-bit value) incremented each time a transaction commits. On commit, the new time is broadcast to the cluster and each node updates its time variable, so a node can immediately detect that it missed a commit and ask for recovery. If the commit message cannot be sent within one Ethernet frame, the commit number is incremented for each commit packet; thus we avoid inconsistencies if a node misses one packet of a multiple-packet commit. Furthermore, every page or token request includes the global time value of the requesting node. If such a request contains an out-of-date commit number it is not processed; instead recovery is started. Thus a node that missed a commit is not able to commit a transaction, because it is not granted the token.

If a single node fails temporarily it can reboot and join the DSM again. If the PageServer detects, during the next checkpoint, pages that are missing because of a node failure, the cluster is reset to the last checkpoint. If multiple nodes fail temporarily or permanently, the same error detection scheme works, too. The network might be partitioned temporarily into two or more segments; the single token and the single PageServer are then each available in only one of these segments. Nodes within the segments send page and token requests; if such a request cannot be satisfied, the segment tries to recover by contacting the PageServer. Only the segment containing the PageServer can recover; the others have to wait until the PageServer becomes available again.

We plan to implement a distributed version of our PageServer to avoid a bottleneck, and to replicate the data stored on the PageServers so as to tolerate failures of PageServers, too. We also plan to introduce an asynchronous checkpointing scheme to avoid stopping the cluster during checkpointing. Dependency tracking will also be investigated, to restart only the affected nodes in case of a failure.
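The detection scheme can be sketched as follows, with invented names; in Plurix the commit number travels in every commit, page and token packet.

class CommitClock {
    private long globalTime;   // 64-bit logical time, one tick per commit packet

    synchronized void onCommitPacket(long senderTime) {
        if (senderTime > globalTime + 1) {
            recover();                       // we missed at least one commit
        }
        globalTime = Math.max(globalTime, senderTime);
    }

    synchronized boolean acceptRequest(long requesterTime) {
        if (requesterTime < globalTime) {
            // stale node: refuse the page or token request so that it can
            // never commit with out-of-date data; it must recover first
            return false;
        }
        return true;
    }

    private void recover() { /* re-fetch state from the last checkpoint */ }
}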

6 Measurements
The performance evaluation was carried out on three PCs interconnected by a Fast Ethernet hub. Each node is equipped with an RTL8139 network card and an ATI Radeon graphics adapter. Only the first machine (with the Celeron CPU) is equipped with a hard disk (IBM 120 GB, max. disc write throughput without network 45 MB/s) and acts as PageServer.

Table 1. Node configuration

Node  CPU                        RAM
1     Celeron 1.8 GHz            256 MB DDR RAM at 266 MHz
2     Athlon XP2.2+ at 1.8 GHz   256 MB DDR RAM at 333 MHz
3     Athlon XP2.0+ at 1.66 GHz  256 MB DDR RAM at 333 MHz

6.1 General System Measurements
We have measured the startup time of the cluster nodes described above. The results are split into the time the kernel needs and the time needed to detect and start hardware such as hard disk, mouse and keyboard. The nodes have been started with and without a hard disc; the time difference is about 540 ms, during which we have to wait for the hard disc to answer.

Table 2. Startup times (in ms)

Node  Startup as Master  Kernel time  Startup as Slave  Kernel time
1     791                254          240               234
2     780                248          238               233
3     792                254          239               234

The kernel allocates 2787 objects if running as master and 518 objects if running as slave. It takes approximately 3 microseconds to allocate an object and an additional 0.5 microseconds to assign a pointer to an object. To get the kernel from the DHS, a slave node must request 284 pages.

To show the correlation between changed heap size, heap spreading and the time to save a checkpoint, ten measurements were made. Comparing several measurements is needed for predictions about the speed of the hard disk, the performance of the implemented software and the latency caused by the network. In the following table, for each measurement the configuration (single station or cluster) and the heap spreading is given. The PageServer creates consistent images of the complete heap containing both user data (node 1 - node 3) and the operating system; the latter is contained in "saved data".

Table 3. Measurements

#   nodes    Node 1  Node 2  Node 3  Saved data  Time to save to disc  Throughput (resulting disc write bandwidth)
1   1        20 MB   -       -       21,4 MB     1639 ms               13,7 MB/s
2   1        40 MB   -       -       42,5 MB     2491 ms               17,5 MB/s
3   1        60 MB   -       -       63,0 MB     3371 ms               19,1 MB/s
4   1        80 MB   -       -       83,4 MB     4321 ms               19,7 MB/s
5   1, 2, 3  60 MB   0 MB    0 MB    63,1 MB     3422 ms               18,9 MB/s
6   1, 2, 3  20 MB   20 MB   20 MB   63,1 MB     4476 ms               14,4 MB/s
7   1, 2, 3  0 MB    28 MB   32 MB   63,1 MB     4971 ms               13,0 MB/s
8   1, 2, 3  40 MB   40 MB   40 MB   124,6 MB    8049 ms               15,8 MB/s
9   1, 2, 3  48 MB   48 MB   48 MB   149,1 MB    9540 ms               16,0 MB/s
10  1, 2, 3  60 MB   60 MB   60 MB   186,0 MB    11707 ms              16,3 MB/s

Comparing measurements 1-4, we see an increase in throughput as a consequence of increased data size. Measurements 3 and 5-7 have the same size of saved data, so the decreased throughput is due to network latency. Comparing measurements 6 and 8-10 shows nearly constant throughput; the slight improvement for increased data size is due to the faster saving of local data.

7 Experiences and Future Work
Moving the kernel into the DHS, and thereby extending the SSI concept, made it possible to create a type-safe kernel interface and to solve the problem of address-space-spanning pointers. Additionally, checkpointing became much easier, and the question of how kernel methods should be called was answered. The current version of Plurix runs stably in the cluster environment, without collisions during allocation. The allocator strategy inhibits false sharing as long as no applications share objects across nodes. As soon as objects created by an application are shared with other nodes, the allocation mechanism cannot prevent false sharing, but we are working on a monitoring tool to detect it. Relocation of objects to dissolve false sharing is already available. Plurix uses a distributed garbage-collection algorithm which is able to detect and collect garbage (including cyclic garbage) without stopping the cluster. The detection algorithm for cyclic garbage works error-free, but currently there is no information about which objects could be cyclic garbage, so each object in the DHS must be checked. The consistency of the DHS is ensured by the PageServer, which uses a linear segment technique to save all changed pages; this includes data and code objects of user applications as well as of the OS. In the current implementation, the speed of saving the complete heap is limited by network throughput, not by the OS or the hard disc. It would therefore be feasible to save the state of the cluster continuously, which could be achieved by some minor changes to the mechanism for detecting missing pages.

8 References
[Mosix] Barak, A. and La'adan, O.: The MOSIX Multicomputer Operating System for High Performance Cluster Computing. Journal of Future Generation Computer Systems, Vol. 13, No. 4-5, pp. 361-372, March 1998.
[Wirth] Wirth, N. and Gutknecht, J.: Project Oberon. Addison-Wesley, 1992.
[Traub] Traub, S.: Speicherverwaltung und Kollisionsbehandlung in transaktionsbasierten verteilten Betriebssystemen. PhD thesis, University of Ulm, 1996.
[TreadMarks] Amza, C., Cox, A.L., Dwarkadas, S. and Keleher, P.: TreadMarks: Shared Memory Computing on Networks of Workstations. Proceedings of the Winter 94 Usenix Conference, pp. 115-131, January 1994.
[Fuchs90] Wu, K.-L. and Fuchs, W.K.: Recoverable Distributed Shared Virtual Memory. IEEE Transactions on Computers, 39(4): 460-469, April 1990.
[Morin97] Morin, C. and Puaut, I.: A Survey of Recoverable Distributed Shared Virtual Memory Systems. IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 9, September 1997.
[Keedy] Keedy, J.L. and Abramson, D.A.: Implementing a Large Virtual Memory in a Distributed Computing System. Proc. of the 18th Annual Hawaii International Conference on System Sciences, 1985.


[Li] Li, K.: IVY: A Shared Virtual Memory System for Parallel Computing. Proceedings of the International Conference on Parallel Processing, 1988.
[Wende02] Wende, M., Schoettner, M., Goeckelmann, R., Bindhammer, T. and Schulthess, P.: Optimistic Synchronization and Transactional Consistency. Proceedings of the 4th International Workshop on Software Distributed Shared Memory, Berlin, Germany, 2002.
[Bindhammer] Bindhammer, T., Göckelmann, R., Marquardt, O., Schöttner, M., Wende, M. and Schulthess, P.: Device Programming in a Transactional DSM Operating System. Proceedings of the Asia-Pacific Computer Systems Architecture Conference, Melbourne, Australia, 2002.
[Smile] SMiLE project. http://os.inf.tu-dresden.de/SMiLE/
