Optimizing System Services in Virtualized Machines for Improved Execution of Parallel Applications

William Magato, Ryan Miller, and Philip A. Wilsey
Experimental Computing Laboratory
School of Electronics and Computing Systems
PO Box 210030, Cincinnati, OH 45221–0030
[email protected], [email protected], [email protected]

Abstract

Hardware virtualization in the x86 architecture has enabled the use of virtual machine monitors in Beowulf clusters. In turn, each application user can then run a virtual guest operating system of his/her choice. Of course, virtualizing and sharing the hardware resources in a Beowulf cluster will add additional overheads that may negatively affect the performance of the application. However, the use of an application specific guest operating system also permits the introduction of application specific optimizations into that guest operating system. Potentially then, even with virtualization, the overall application performance may be improved. In this paper we consider the optimization of the communication subsystem in a Linux-based Beowulf cluster. In particular, we examine and compare the communication costs between a native TCP/IP Linux host and a virtual Linux guest with the TCP/IP drivers replaced with active message drivers. Using the KVM hypervisor, we observe that TCP/IP latency is nearly doubled in the virtual guest. However, after replacing the TCP/IP drivers with GAMMA active message drivers in the virtual guest, we observe substantial reductions in message latency over the native TCP/IP host. We believe that this result provides us with evidence to continue exploring the development of a streamlined virtual host for our research with Time Warp synchronized parallel simulation.

Keywords: Virtualization, KVM, Active Messages, GAMMA, MPI, Time Warp Mechanism.

1 Introduction

Today most Beowulf cluster hardware platforms run general purpose, server (or desktop) based operating systems to provide high-performance general purpose (parallel and sequential) computational support.

While this adequately provides shared access to general purpose parallel computing for a large community of users, it does so using operating system services that may actually be a poor match to the needs of the parallel applications for which the cluster is being used. Fortunately, the integration of hardware virtualization capabilities into the x86 instruction set opens an opportunity for building and using fully or semi-custom application specific execution environments on low-cost commodity hardware platforms. Thus, the nodes in a moderately priced Beowulf cluster can run a virtual machine host with application software running in virtual (guest) environments. In this configuration, a Beowulf cluster can be available to service multiple users and also provide a framework in which application specific optimizations can be used in the kernel services of the virtual guest environment. Of course, this is beneficial only if the overhead costs of the virtual machine host software can be contained so that virtual customized execution environments (guests) can still achieve performance gains.

In this paper, we are investigating the feasibility of modifying a virtual guest to optimize our application environment for parallel simulation [1]. In particular, we are investigating Time Warp synchronized discrete event simulation [2]. There are many facets of Time Warp synchronized parallel simulation that would benefit from customized O/S services. For example, most parallel discrete event simulation models present fine grained parallelism, and the high cost of TCP/IP based communication makes effective parallel execution on distributed hardware quite difficult. It is well known that message latency has a dramatic impact on the performance of parallel simulators [3], and we have already seen that active messages [4] provide a substantial performance boost to our Time Warp synchronized parallel simulator [5]. Unfortunately, the large Beowulf clusters at our disposal are shared with many others and we cannot simply replace the standard TCP/IP protocol stack in those machines with an active message layer. Fortunately, we now have a Beowulf system that supports virtualization. Thus, we are exploring the feasibility of replacing the TCP/IP device drivers in a virtual guest with a lightweight "active message" device driver (in this case, the GAMMA active message drivers [6]). Essentially, we are exploring whether the expense of virtualizing (and paravirtualizing) the networking hardware is low enough that we can expect to see substantial reductions in node-to-node communication latencies between virtual guests.

The remainder of this paper is organized as follows: Section 2 presents some background information on virtualization technology and implementations. Section 3 presents some of the related work and configuration options for networking in virtualized environments. Section 4 presents alternative communication technologies that can be utilized by virtualized environments. Section 5 describes the hardware and software systems and configurations that we use in our experimental analysis. Section 6 presents the empirical results of various configurations of communication latencies using native and virtualized systems. Finally, Section 7 presents a summary of our results and contains some concluding remarks.

2 Background

Virtualization technologies have found their place in a number of today's computing environments. Probably the most commercially viable application is server consolidation: organizations can use virtualization technologies to consolidate a number of individual servers into one physical machine running multiple virtualized environments. The benefits of this practice are numerous (reduced power consumption, higher availability, better security, easier software maintenance, etc.). Computer hardware engineers can utilize virtualization technologies to delay the need for prototype hardware during development cycles. Software engineers can execute specialized software within their comfortable desktop environments while developing embedded systems. Today, there are many potential uses for virtualization, and for most of them the benefits of using virtualization far outweigh any performance impact. The performance overhead of these technologies has, however, prohibited their full use for many high performance computing applications such as parallel simulation. Recent software and hardware developments have begun to significantly reduce the overhead of virtualization. The following sections describe these advances and the available technologies in use today.

2.1 Virtualization Architectures

There are several basic types of virtualization architectures in existence today, each based on a hypervisor or virtual machine monitor. The three types of most interest to our studies are: Emulation, Full Virtualization, and Paravirtualization [7]. These three types are described in detail in the sections below.

2.1.1 Emulation

Emulated environments simulate all (or part) of a computer system, possibly including the microprocessor itself. The architecture has the ability to host an emulated environment of a platform that may, or may not, be identical to the actual running system. This is a convenient and powerful tool for computer hardware design engineers since portions of development can be tested within the virtual environment without access to target platform hardware. The primary disadvantage of emulation is the considerable performance overhead.

2.1.2 Full Virtualization

Full virtualization, on the other hand, requires the virtual environment to be platform compatible with the actual running system. Performance overhead can be greatly reduced by directly utilizing the hardware of the actual running system. Modern microprocessors provide an extended feature set to support virtualization environments more effectively (AMD-V and Intel's VT-x [8]). These features allow more of the virtual machine monitor's difficult tasks to be executed in hardware [9]. This development has enabled a new class of virtualization environments to emerge in recent years. Operating systems can be executed within a full virtualization environment unmodified and completely unaware of the virtualization, and yet take near full advantage of the host system's hardware. This concept in particular has led to the use of virtualization for non-scientific applications, such as server consolidation.

2.1.3 Paravirtualization

Paravirtualization also requires the virtual environment to be platform compatible with the host system [10]. However, paravirtualized environments provide additional software system facilities that specifically support virtualization. As with full virtualization, paravirtualization can take near full advantage of the host system's processor hardware. The additional software system facilities provide an added performance enhancement by reducing the need for emulated I/O and system devices. Unlike full virtualization, the guest environment is aware of its virtualization platform and utilizes the services provided by the host.

2.2 Virtualization Technologies

Numerous implementations of the architectures described above are available today. The following sections describe some of the most common ones that are of particular interest to our studies.

2.2.1 QEMU — open source processor emulator

QEMU is an open source project which supports both the emulation and full virtualization architectures [11]. QEMU software packages allow users to emulate a number of different platforms and computing architectures on a variety of host platforms. The software is also versatile enough to take advantage of microprocessor extensions and host a full virtualization environment. Many other virtualization projects use portions of this project to support their own architectures, especially in the area of system peripheral and device emulation.

2.2.2 KVM — Kernel-based Virtual Machine

KVM is an open source project adding virtualization capabilities to the standard Linux kernel [12, 13]. The virtual machine monitor portion amounts to a normal Linux process, which is very unintrusive to the host system compared to competing virtualization technologies; for instance, Xen [14] requires a specialized host subsystem to provide the virtual machine monitor. KVM is an example of a full virtualization architecture capable of running unmodified software in virtual environments. Recent developments concerning I/O interfaces also enable KVM to host paravirtualized environments. The virtio package [15] provides lightweight drivers shared by both the host and virtual environments. These interfaces eliminate some of the overhead of emulating hardware solely for the benefit of using standard guest drivers. Most modern Linux distributions include package level support for KVM.

2.2.3 Direct Device Access

An alternative to paravirtualization that also provides fast I/O interfaces to the guest environment is known as direct device access (otherwise known as PCI passthrough). This technique was adopted early on by Xen [16] and later added to KVM upon the development of IOMMU [17] technologies (Intel's VT-d [18]). This feature allows the host to give the guest environment exclusive and direct access to system hardware (such as network interfaces, frame buffers, disk drives, etc.) to improve performance. The guest operates at near native speeds for services related to the hardware involved. The obvious drawbacks are scalability and the fact that the host system (and other guest environments, if multiple are resident) loses access to the hardware. Despite the logistical drawbacks of this configuration, the technology has been shown to perform extremely well [19].

3 Virtualization Performance

Due to the recent microprocessor innovations concerning virtualization support (AMD-V and VT-d), significant virtualization performance overhead has been eliminated. Others [20] have shown that CPU intensive tasks with light I/O demands perform well within virtual environments. It has also been shown that I/O intensive tasks still perform poorly within virtual environments, particularly under KVM [21]. Optimizing network performance, specifically latency, is of particular value to distributed applications that tend to rapidly exchange small messages. Studies have shown [22] that a significant percentage of the end-to-end message latency is caused by the inefficiency of the standard TCP/IP protocol stack.

3.1 Networking Performance

KVM offers several interesting networking options for virtual machines. In the ideal case, a network configuration would not require any root privileges, would add no additional latency overhead, and could be shared by multiple users; in other words, it would be no different from running natively. Two solutions that attempt to reach this ideal case are explored here: (i) public bridging and (ii) PCI passthrough. Neither of these alone meets our ideal requirements; however, each has a useful purpose.

Public bridging ties the host's network card to a software based tap device. A tap device is a virtual network interface, implemented by a kernel driver, that is accessible to the guest. Using KVM, the virtual machine can be assigned its own virtual network card with a MAC address and a specific driver. KVM also allows for paravirtualization of the network card by specifying virtio as the net model option in place of an actual driver. Our tests were inconclusive with respect to the performance gains of virtio versus the e1000 driver in the latest release of KVM (v0.11.1). One problem with this public bridging approach is the additional overhead of going through a network bridge and then a tap device. An additional drawback is the need for root privileges (or at least group privileges) to create the tap device and then connect the bridge to it.

As the name PCI passthrough suggests, the PCI device, in our case the network card, is passed directly to the virtual machine. This gives the virtual machine complete control of the network card, thereby removing the card from the host machine's control. The problem of root access is obviously still present, since this approach requires the device to be detached from the host machine, and it also prohibits anyone else from using it. For this approach to be viable on a remote machine, at least one network card is required per passthrough connection in addition to the card used to remotely control the host. However, what makes this option interesting is the huge latency drop that is achieved compared to public bridging. While not a viable solution for multi-user systems, because each user would need their own network card, it is an interesting solution for someone wanting to run multiple operating systems on a server with marginal network latency costs to the virtual machine. Other requirements for passthrough to work include that all PCI devices behind the same bridge must be given to the virtual machine together (this restriction does not apply to PCIe devices), and that a PCI device that does not support MSI and shares an IRQ cannot be used [13].
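
To make the tap device mechanism concrete, the sketch below opens the standard Linux /dev/net/tun interface and creates a tap device in C. The interface name is illustrative only, and running the program requires root (or CAP_NET_ADMIN), which is exactly the privilege issue noted above.

```c
/* Minimal sketch: create (or attach to) a tap device via /dev/net/tun.
 * This is the kind of virtual interface a KVM guest NIC is connected to
 * under public bridging.  The name "tap0" is illustrative; creating the
 * device normally requires root or CAP_NET_ADMIN. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

static int tap_alloc(char *name)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0) {
        perror("open /dev/net/tun");
        return -1;
    }
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;           /* raw Ethernet frames, no extra header */
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    if (ioctl(fd, TUNSETIFF, (void *)&ifr) < 0) {  /* create/attach the device */
        perror("ioctl(TUNSETIFF)");
        close(fd);
        return -1;
    }
    strncpy(name, ifr.ifr_name, IFNAMSIZ - 1);     /* kernel may have adjusted the name */
    return fd;                                     /* read()/write() on fd move frames */
}

int main(void)
{
    char name[IFNAMSIZ] = "tap0";
    int fd = tap_alloc(name);
    if (fd < 0)
        return 1;
    printf("created tap device %s; the host would now add it to a bridge\n", name);
    close(fd);
    return 0;
}
```

Public bridging then amounts to adding this tap device and the physical network card to the same software bridge, so every frame the guest sends traverses the bridge and the tap driver before reaching the wire; this extra path is the overhead discussed above.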

4 High Performance Protocol Alternatives

Regardless of the virtualization architecture used, some form of software interaction is required to transport messages between the hardware and the virtualization environment software. In fact, general purpose operating systems incur some level of performance overhead when dealing with communication services. Current technologies are available to optimize this interaction specifically for high performance computing applications that are sensitive to high message latencies. Alternatives to the standard family of protocols (TCP/IP stack) are introduced in the sections below.

4.1 Active Message API

Active Messages [4, 23] is one such technology available to improve communication performance and reduce overhead. It provides a specification for low latency message passing facilities required for high performance parallel distributed applications. This technology replaces the standard TCP/IP software stack along with the application networking logic and provides a more direct flow of data from the application level to the communication hardware.
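
The sketch below illustrates the active message programming model: the receiver registers a handler, and each incoming message names the handler to run on its payload, with no socket read or protocol stack traversal in between. The am_* functions are hypothetical stand-ins for a real active message layer (they are not the GAMMA or Berkeley AM API), and the "network" is simulated by a direct local call so the example is self-contained.

```c
/* Conceptual sketch of the active message model.  The am_* names are
 * hypothetical, not a real API, and the network is simulated by a direct
 * local call so the control flow can be seen end to end. */
#include <stdio.h>
#include <stddef.h>

typedef void (*am_handler_t)(int src_node, const void *payload, size_t len);

static am_handler_t handler_table[16];     /* handlers registered on the receiver */

/* Hypothetical: register a handler under a small integer id that senders
 * can name in every message. */
static int am_register_handler(int id, am_handler_t h)
{
    handler_table[id] = h;
    return id;
}

/* Hypothetical: send `len` bytes to node `dest` and ask it to run handler
 * `id`.  A real layer would copy the payload straight into a pre-posted
 * NIC buffer; here we simply dispatch locally to show the flow. */
static void am_request(int dest, int id, const void *payload, size_t len)
{
    (void)dest;
    handler_table[id](/* src_node = */ 0, payload, len);
}

/* Application handler: runs on the receiving node as soon as the message
 * arrives, with the payload as its argument. */
static void on_event(int src_node, const void *payload, size_t len)
{
    (void)payload;
    printf("received %zu-byte event from node %d\n", len, src_node);
}

int main(void)
{
    int ev_id = am_register_handler(0, on_event);
    long timestamp = 42;                   /* e.g., a simulation event time */
    am_request(/* dest = */ 1, ev_id, &timestamp, sizeof(timestamp));
    return 0;
}
```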

4.2 GAMMA: The Genoa Active Message MAchine

GAMMA is an open source project implementing the Active Message API [24]. Like KVM, this software adds capabilities to the standard Linux kernel by implementing a kernel module interface. When GAMMA loads, the software takes full control of the communication hardware. Application data passed to GAMMA system calls is mapped directly to low level protocol buffers and sent immediately to the communication hardware for transmission. This provides an extremely low-latency message passing mechanism ideal for parallel distributed applications.

The GAMMA project suffers from two significant shortcomings that inhibit its widespread use. The foremost is that GAMMA supports only a limited set of communication hardware devices; namely, only Ethernet devices with the Intel PRO/1000 and Broadcom Tigon3 chipsets are supported. Secondly, each communicating node requires two network adapters (one for normal TCP/IP traffic and one used by the GAMMA module). In theory, a customized virtual environment should be able to overcome these shortcomings and leverage the GAMMA software modules to support efforts in enhancing communication performance. In practice though, there are a few pitfalls and compatibility issues present when attempting to combine the two technologies. Full virtualization architectures are compatible provided they can emulate a communication device supported by GAMMA. With the inclusion of QEMU device emulation software for the Intel 8254x adapters, KVM has the capability of realizing this configuration. Providing multiple adapters to the guest environment to satisfy GAMMA's requirement is easily done. To support a paravirtualized architecture, a specialized driver interface needs to be developed for GAMMA to enable communication. Such an interface to virtio has been implemented for the purpose of this study. The following sections describe the experiments conducted to test each configuration's ability to utilize GAMMA within a virtualized environment and provide high performance, low-latency communication services comparable to native environments.

5 Experimental Setup

This section describes the experiments conducted with the intent of measuring and comparing the communication latencies of various virtual environments to those of platform-equivalent native environments. The following subsections specifically define the hardware and software configurations used.

5.1 Hardware

Processor: AMD Athlon II X2 250
Mother Board: Gigabyte GA-MA78GM-S2HP
Chipset: AMD 780G/SB700
Memory: Corsair DDR2 800 MHz, 4.0 GB
PCI LAN: Realtek RTL-8169
PCI LAN: Intel 82541PI Gigabit (e1000)

The equipment was chosen to replicate inexpensive commodity computing hardware potentially used to construct Beowulf clusters. At the time of this writing, all of the hardware components would, based on cost, be characterized as low to mid-range (at best), with a very high performance/cost ratio. This has the favorable side effect of demonstrating that expensive high-performance equipment is not necessarily required to take advantage of virtual computing environments. Two identical systems comprised of the hardware specified above are used to carry out the experiments. The PCI Realtek LAN is connected to an Ethernet switch and is used to transmit TCP/IP communications during the experiments (a GAMMA requirement). The PCI Intel LAN is connected via a crossover cable and is used to transmit the TCP and GAMMA messages being measured for performance. The direct cable link was chosen to isolate the experimental environment and reduce the variance of the measurements due to external hardware events not related to the virtualization environments.

5.2 Software

Operating System: Debian GNU/Linux x86_64 (5.0/Lenny)
Kernel: 2.6.26
Communication API: GAMMA (09-04-14)
Virtualization Technology: KVM and QEMU (KVM-85, QEMU 0.10)

The same operating system configuration is used for both the native and the virtualized guest environments when performing the experiments. Although the experiments are conducted using the Debian GNU/Linux distribution, they should work for practically any modern Linux distribution compatible with KVM and GAMMA. This is particularly true if the Linux kernel is 2.6.24 or higher.

5.3 Measurement Methodology

Standard communication performance benchmark programs could not be utilized to conduct these experiments because of the protocols and system configurations used. A simple application was developed to measure communication latency using the TCP/IP, UDP/IP, and GAMMA protocols at varying message lengths. The application is a simple master/slave ping-pong design in which one node communicates a message and the other immediately sends a response. The standard Linux system call gettimeofday() is used to capture the time stamps. Typically, it is convenient to access the x86 microprocessor's Time Stamp Counter (TSC) through an assembly instruction to accurately measure high precision timing events. However, each core of the processor being used has an individual counter, which causes an issue if the measuring process moves from one core to another. Bovet and Cesati [25] do an excellent job of describing the internal workings of the Linux kernel pertaining to timing measurements. The gettimeofday() system call actually identifies and uses the most accurate timing mechanism available on the running platform. As a set of side experiments, timing data was collected both using gettimeofday() and by accessing the TSC hardware. The results were identical even when the process was purposely isolated to a single core.

The experiments were first conducted with the test programs running within the native environment on both systems to establish a base-line performance measurement. Each experiment sends 100,000 messages and was conducted 3-5 times for each of the message lengths [1, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350] bytes to establish a reliable statistical mean and variance of the results. The top of Figure 1 shows a scatter plot of a 500,000 message sample of GAMMA messages transmitted by the test software running within a virtual environment. The bottom of Figure 1 is a histogram of the exact same data set represented in the scatter plot.

Figure 1: Virtual environment GAMMA latencies. Scatter plot (top) and histogram (bottom) of sample data.

A series of experiments was also conducted using UDP to communicate the messages. The latency results were slightly better than those of the tests conducted using TCP. This result would be expected given the lower protocol overhead of UDP compared to TCP. However, the performance difference is insignificant in the context of this discussion. Therefore, the UDP results are left out of the analysis of experimental results for the sake of simplicity. As a final application note, the TCP_NODELAY socket option is turned "ON" before performing the tests using TCP. This option directs the TCP/IP protocol stack to transmit messages immediately, as opposed to delaying the transmission in an effort to optimize bandwidth (the Nagle algorithm) [26].
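
To make the procedure concrete, the following is a minimal sketch of the master side of such a ping-pong latency test over TCP, assuming the slave simply echoes each message back. The peer address, port, message length, and iteration count are illustrative and do not reproduce the exact benchmark program used in the experiments.

```c
/* Sketch of a TCP ping-pong round-trip latency measurement in the spirit
 * of the master/slave test described above.  The peer at PEER_IP is
 * assumed to echo every message back unchanged; address, port, message
 * length, and round count are illustrative only. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>

#define PEER_IP "192.168.1.2"   /* echoing slave node (illustrative) */
#define PORT    5000
#define MSG_LEN 150
#define ROUNDS  100000

int main(void)
{
    char buf[MSG_LEN];
    struct sockaddr_in peer;
    struct timeval t0, t1;
    int one = 1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    /* Disable the Nagle algorithm so small messages are sent immediately. */
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(PORT);
    inet_pton(AF_INET, PEER_IP, &peer.sin_addr);
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }

    memset(buf, 0xAB, sizeof(buf));
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ROUNDS; i++) {
        if (write(fd, buf, MSG_LEN) != MSG_LEN) {          /* ping */
            perror("write");
            return 1;
        }
        ssize_t got = 0;
        while (got < MSG_LEN) {                            /* wait for the pong */
            ssize_t n = read(fd, buf + got, MSG_LEN - got);
            if (n <= 0) {
                perror("read");
                return 1;
            }
            got += n;
        }
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    /* One-way latency is approximated as half of the average round trip. */
    printf("average one-way latency: %.2f us\n", usec / ROUNDS / 2.0);
    close(fd);
    return 0;
}
```

The UDP and GAMMA variants of the test presumably differ only in the send and receive calls used; the timing logic around the ping-pong loop stays the same.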

6 Experimental Results

Figure 2 summarizes the latency results from running the test software within the three different environments using the TCP/IP protocol. The "native" columns of each grouping are the results of running the experiments without virtualization. The "virtio" columns are the results from running the test software within a paravirtual KVM environment using the virtio device driver. Lastly, the "e1000" columns show the same test using the fully virtualized e1000 driver emulated by the QEMU project code. As expected, the paravirtualization environment provides better performance than the full virtualization environment. This was not, however, initially the case with the standard virtio software. Although the driver performs very well with regard to raw throughput, it is less desirable with respect to low-latency response times. Part of the virtio logic delays the transmission of packets in an effort to queue up additional packets and service them together. This reduces overhead and optimizes raw throughput, much like the Nagle algorithm that the TCP_NODELAY option disables. Just as we enabled TCP_NODELAY to reduce latencies in our test application, we also had to bypass the virtio transmission delay timers. The effect of this change was significant: latency times fell from over a millisecond to roughly one half of a millisecond on average, as indicated in Figure 2.

Figure 3 summarizes the latency results from running the test software using the GAMMA protocol to exchange messages. The individual bar columns of each group have the same meaning as in Figure 2. Unfortunately, there is an anomaly concerning the performance of the virtio driver with GAMMA. With GAMMA, the emulated Intel PRO/1000 driver outperforms the paravirtualized driver by a large margin, nearly 50%. This happens to be a software engineering problem as opposed to a system shortcoming or configuration mistake. The newly created driver designed to bridge the GAMMA communication paths onto the virtio interface is not performing optimally at this time. Better memory management and manipulation of the message buffers are required to increase the performance to anticipated levels. The most interesting aspect of Figure 3 is not the disappointing performance of the paravirtual environment, but the performance of the full virtualization environment using the emulated e1000 driver, which is impressive because it is comparable to the native performance of the host system.

Figure 4 directly compares the native TCP performance to the fully virtualized GAMMA based performance. As indicated by the horizontal lines, which represent the average latency over all message lengths, the full virtualization environment equipped with the GAMMA communication service is comparable to the performance of the non-virtualized host system. It is actually slightly better than the native performance of the host system. Thus the flexibility of using a virtualized guest can ultimately lead us to similar, and possibly better, network communication performance than a traditional system. Furthermore, as further improvements in the paravirtualized GAMMA driver are achieved, the paravirtualized communication latencies may well be substantially below those of native TCP.

Figure 2: Comparison of TCP/IP message latencies.

Figure 3: Comparison of GAMMA message latencies.

Figure 4: Comparison of native TCP/IP vs KVM GAMMA message latencies.

The next set of tests was aimed at measuring the performance of the direct device access virtualization features mentioned earlier (PCI passthrough). These tests required a change to our test system to include components compatible with Intel's VT-d technology, so the AMD CPU/motherboard was swapped for newer Intel i7 products. In addition, the Linux kernels compatible with GAMMA that we had been using do not support the direct device access features, so we had to upgrade the kernel of the host system while leaving the guest environment unchanged. In summary, we swapped the CPU hardware and installed a newer virtual machine monitor to conduct these tests.

Processor: Intel Core i7 920
Mother Board: ASRock X58 Extreme Core i7 Motherboard
Operating System: Debian GNU/Linux x86_64 (5.0/Lenny)
Kernel: 2.6.32

Figure 5 summarizes the latency results from running the test software while using the TCP/IP protocol. The columns labeled "native" are the results of running the experiments without virtualization. The columns labeled "VT-d" summarize the results of running the test software in a virtual environment while having direct access to the communication hardware. Unlike others [19] who have demonstrated near native performance from VT-d enabled virtual environments, our results were disappointing. This is possibly more evidence that the designers of virtualized communication services are focusing their efforts on maximizing throughput at the expense of low-latency message response.

Figure 5: Comparison of TCP/IP w/direct device access.

Figure 6 summarizes the latency results from running the test software within the same two configurations using the GAMMA protocol. The performance results here were exactly as expected: the virtual environment delivers communication services unaffected by virtualization overhead. Figure 7 again directly compares the virtualization performance of GAMMA using PCI passthrough against the native (non-virtualized) TCP performance. In this case, the virtualized GAMMA system outperforms the native system by a wide margin, roughly 50%.

Figure 6: Comparison of GAMMA w/direct device access.

Figure 7: Comparison of native TCP/IP vs GAMMA w/direct device access.

7 Conclusion

Using KVM, we were able to show that while the virtual machine monitor added a substantial amount of overhead to access the shared networking hardware, the added overhead was low enough that a virtual guest optimized with a low latency active message communication driver (GAMMA) could still gain speedup over a native host. In fact, in the virtual guest with the lightweight GAMMA messaging layer, we saw substantially lower communication latencies than with the native (non-virtual) TCP/IP host operating system. Thus we plan to continue to optimize a virtual guest for our Time Warp simulation kernel. In addition to communication optimizations, we expect to make changes to task scheduling, memory management, and other services to better match our application. However, we have yet to decide which initial O/S we will begin from.

Considerable work has gone into building high-performance operating systems for parallel computing. Some of these efforts have been focused on general purpose parallel computing [27], some have argued for minimalist operating system services [28], and still others have targeted specific application environments [29]. Unfortunately, the cost of parallel hardware generally drives such systems toward general purpose use that is best serviced by a general purpose operating system supporting a broad range of cross-disciplinary users. This makes it quite difficult to deploy and use non-standard operating systems on parallel hardware. The widespread emergence of hardware virtualization technologies is, however, changing this fact. Using virtualization, these full custom execution environments/operating systems can peacefully co-exist with general purpose computation nodes that can service the more standard users of parallel hardware.

References

[1] R. M. Fujimoto, Parallel and Distributed Simulation Systems. New York, NY, USA: John Wiley & Sons, Inc., 2000.

[2] D. R. Jefferson, "Virtual time," ACM Trans. Program. Lang. Syst., vol. 7, no. 3, pp. 404–425, 1985.

[3] C. D. Carothers, R. M. Fujimoto, and P. England, "Effect of communication overheads on time warp performance: an experimental study," SIGSIM Simul. Dig., vol. 24, no. 1, pp. 118–125, 1994.

[4] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, "Active messages: a mechanism for integrated communication and computation," tech. rep., University of California, Berkeley, Berkeley, CA, USA, 1992.

[5] U. K. V. Rajasekaran, M. Chetlur, G. D. Sharma, R. Radhakrishnan, and P. A. Wilsey, "Addressing communication latency issues on clusters for fine grained asynchronous applications — a case study," in International Workshop on Personal Computer Based Network of Workstations, PC-NOW'99, April 1999.

[6] G. Ciaccio, "GAMMA project homepage," 2009.

[7] J. N. Matthews, E. M. Dow, T. Deshane, W. Hu, J. Bongio, P. F. Wilbur, and B. Johnson, Running Xen: A Hands-On Guide to the Art of Virtualization. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2008.

[8] Leung, "Intel virtualization technology: Hardware support for efficient processor virtualization," Intel Technology Journal, 2008.

[9] D. Chisnall, The Definitive Guide to the Xen Hypervisor (Prentice Hall Open Source Software Development Series). Upper Saddle River, NJ, USA: Prentice Hall PTR, 2007.

[10] "Paravirtualization," Dec. 2009 (last updated).

[11] F. Bellard, "QEMU project homepage," 2009. http://www.qemu.org/.

[12] A. Kivity, "kvm: the Linux virtual machine monitor," in OLS '07: The 2007 Ottawa Linux Symposium, pp. 225–230, July 2007.

[13] Qumranet, "KVM project homepage," 2009. http://www.linux-kvm.org/.

[14] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, (New York, NY, USA), pp. 164–177, ACM, 2003.

[15] R. Russell, "virtio: towards a de-facto standard for virtual I/O devices," SIGOPS Oper. Syst. Rev., vol. 42, no. 5, pp. 95–103, 2008.

[16] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson, "Safe hardware access with the Xen virtual machine monitor," in 1st Workshop on Operating System and Architectural Support for the On-Demand IT Infrastructure (OASIS), 2004.

[17] AMD, "AMD I/O virtualization technology (IOMMU) specification," tech. rep., Advanced Micro Devices, Inc., 2009.

[18] D. Abramson, J. Jackson, S. Muthrasanallur, G. Neiger, G. Regnier, R. Sankaran, I. Schoinas, R. Uhlig, B. Vembu, and J. Wiegert, "Intel virtualization technology for directed I/O," Intel Technology Journal, vol. 10, pp. 179–192, August 2006.

[19] B.-A. Yassour, M. Ben-Yehuda, and O. Wasserman, "Direct device assignment for untrusted fully-virtualized virtual machines," tech. rep., IBM Research, 2008.

[20] M. Fenn, M. A. Murphy, and S. Goasguen, "A study of a KVM-based cluster for grid computing," in ACM-SE 47: Proceedings of the 47th Annual Southeast Regional Conference, (New York, NY, USA), pp. 1–6, ACM, 2009.

[21] L. Nussbaum, F. Anhalt, O. Mornard, and J.-P. Gelas, "Linux-based virtualization for HPC clusters," in Linux Symposium 2009, July 2009.

[22] S. Larsen, P. Sarangam, and R. Huggahalli, "Architectural breakdown of end-to-end latency in a TCP/IP network," in SBAC-PAD, pp. 195–202, 2007.

[23] T. H. Von Eicken, Active messages: an efficient communication architecture for multiprocessors. PhD thesis, University of California, Berkeley, 1993. Co-chairs: David E. Culler and John Wawrzynek.

[24] G. Ciaccio, D. Tavangarjan, and R. D. Russel, "A communication system for efficient parallel processing on clusters of personal computers," 1999.

[25] D. Bovet and M. Cesati, Understanding the Linux Kernel, Third Edition. Sebastopol, CA, USA: O'Reilly & Associates, Inc., 2006.

[26] W. R. Stevens, TCP/IP Illustrated (Vol. 1): The Protocols. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1993.

[27] S. R. Wheat, A. B. MacCabe, R. Riesen, D. W. van Dresser, and T. M. Stallcup, "Puma: an operating system for massively parallel systems," Sci. Program., vol. 3, no. 4, pp. 275–288, 1994.

[28] D. R. Engler and M. F. Kaashoek, "Exterminate all operating system abstractions," in HOTOS '95: Proceedings of the Fifth Workshop on Hot Topics in Operating Systems (HotOS-V), (Washington, DC, USA), p. 78, IEEE Computer Society, 1995.

[29] D. Jefferson, B. Beckman, F. Wieland, L. Blume, M. Di Loreto, P. Hontalas, P. Laroche, K. Sturdevant, J. Tupman, V. Warren, J. Wedel, H. Younger, and S. Bellenot, "Distributed simulation and the Time Warp operating system," in Proceedings of the 12th SIGOPS Symposium on Operating Systems Principles, pp. 77–93, 1987.
