Enabling Instantaneous Relocation of Virtual Machines with a Lightweight VMM Extension

Takahiro Hirofuchi, Hidemoto Nakada, Satoshi Itoh, and Satoshi Sekiguchi
National Institute of Advanced Industrial Science and Technology (AIST)
Central 2, Umezono 1-1-1, Tsukuba, Japan 305-8568
Email: t.hirofuchi at aist.go.jp

Abstract—We are developing an efficient resource management system with aggressive virtual machine (VM) relocation among physical nodes in a datacenter. Existing live migration technology, however, requires a long time to change the execution host of a VM; it is difficult to dynamically optimize VM packing on physical nodes in response to ever-changing resource usage. In this paper, we propose an advanced live migration mechanism enabling instantaneous relocation of VMs. To minimize the time needed for switching the execution host, memory pages are transferred after a VM resumes at the destination host. A special character device driver allows transparent retrieval of memory pages from the source host for the VM running at the destination. In comparison with related work, the proposed mechanism supports guest operating systems without any modifications to them (i.e., no special device drivers or programs are needed inside VMs). It is implemented as a lightweight extension to KVM (the Kernel-based Virtual Machine); no modifications to critical parts of the VMM code are required. Experiments were conducted using the SPECweb2005 benchmark. A running VM hosting a heavily loaded web server was successfully relocated to a destination host within one second. Temporary performance degradation after relocation was resolved by means of a precaching mechanism for memory pages. In addition, for memory-intensive workloads, our migration mechanism moved the complete state of a VM faster than existing migration technology.

I. INTRODUCTION

In cloud computing, virtual machine technology plays an important role in server consolidation. Physical computing resources are efficiently managed in the enormous datacenters of service providers, and virtualized computing resources are offered to remote customers in a pay-per-use manner. IaaS (Infrastructure-as-a-Service) providers need to run as many customers' VMs as possible to fully utilize their datacenter capacity; increasing datacenter utilization is the key to success for a datacenter business. To achieve higher resource utilization, we consider that next-generation datacenters should exploit live migration technology to dynamically optimize VM deployment on physical machines. The number of VMs on a physical node should be overcommitted to consolidate VMs efficiently; the host nodes of VMs are changed dynamically in response to their resource usage, thereby maintaining optimum utilization of physical machines. For example, by aggregating idle VMs onto one physical node, service providers can offer more VMs to customers. They can also turn off unused physical machines to reduce power consumption.

To the best of our knowledge, however, commercial IaaS providers do not utilize live migration technology to overcommit VM deployment. Once management systems assign VMs to physical machines, they never change the locations of running VMs. The number of VMs on one physical node is statically defined in advance, calculated from the resource allotments per VM. Even if all VMs on a physical node are idle, consuming only small portions of their assigned resources, providers do not launch more VMs on the node. The fundamental reason why overcommitted deployment has not yet been realized is that available live migration technology needs a long time to change the location of a VM; VM deployment on physical machines cannot be quickly optimized in response to the ever-changing resource usage of VMs. Although most commercial providers present VM performance criteria to customers, it is difficult for management systems to guarantee VMs their maximum assigned resources. If the state of a VM suddenly changes from idle to active, the locations of VMs cannot be re-optimized quickly enough to meet the change. For instance, migrating a VM with 1.7 GB of RAM (i.e., the memory size of the default Amazon EC2 instance [1]) takes more than 20 seconds over a GbE network. Overcommitted VMs cannot be rebalanced promptly, which results in performance degradation of customers' VMs. In this paper, we propose an advanced live migration mechanism that allows instantaneous relocation of VMs. The execution host of a VM can be switched within one second. This mechanism is designed as a lightweight extension to KVM (the Kernel-based Virtual Machine) [2]. In comparison with related work, our proposed mechanism does not need to modify critical parts of the VMM, such as memory management code. In addition, all guest operating systems are supported without any modifications to them. Because our implementation is quite simple and stable, we believe that it is already of production-level quality and ready to be used in real-world environments. Section II clarifies our contention that available live migration technology is not sufficient for virtualized datacenters. Section III summarizes related work. Section IV and Section V describe the design and implementation of our instantaneous live migration mechanism. Section VI presents performance evaluations, and Section VII concludes this paper.
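As a rough sanity check on the 20-second figure, a back-of-the-envelope lower bound for a single full pass over memory, assuming roughly 125 MB/s of usable GbE bandwidth (an assumption for illustration, not a measurement from this paper), is

t_{\mathrm{one\ pass}} \approx \frac{1.7\ \mathrm{GB}}{125\ \mathrm{MB/s}} \approx 14\ \mathrm{s},

and the iterative retransfer of pages dirtied during the copy, plus protocol and downtime overheads, pushes the total beyond 20 seconds.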

II. BACKGROUND

We envision more efficient IaaS service platforms that can drastically reduce the hosting cost of virtualized datacenters. In our previous work [3], we experimentally developed a virtualization toolkit that aggressively optimizes VM locations. A genetic algorithm (GA) engine periodically decides optimal locations for VMs and then dynamically relocates VMs with live migration in order to reduce the number of powered-on physical nodes. However, it is difficult to extend this work to real-world settings where performance criteria are presented to customers. For example, Amazon EC2 states that its default VM instance has a CPU capacity equivalent to a 1.0-1.2 GHz 2007 Opteron processor. Because existing live migration mechanisms take a long time to relocate VMs on physical machines, the GA engine cannot aggressively repack idle VMs with overcommitted resource allocation. If one of the VMs suddenly becomes active, VMs must be rebalanced as soon as possible to meet performance criteria. Dozens of seconds are, however, required to change the execution hosts of VMs; during this long period, the performance criteria promised to customers cannot be guaranteed.

As stated in [4], [5], [6], existing live migration technology reconstructs a VM's memory image at a destination host before switching its execution node. In this paper, we call this approach pre-copy live migration. After live migration is initiated, it basically works as follows (a sketch of the iterative copy phase appears after this list):
1) Reserve virtual CPU and memory resources at a destination host.
2) Start dirty page logging at the source host. This mechanism detects updates of memory pages during the following memory copy steps.
3) Copy all memory pages to the destination. Since the VM is running at the source host, memory pages are being updated during this period.
4) Copy dirtied memory pages to the destination again. Repeat this step until the number of remaining memory pages is small enough.
5) Stop the VM at the source. Copy the content of the virtual CPU registers, the states of devices, and the rest of the memory pages.
6) Resume the VM at the destination host.
At the third step, all memory pages are transferred to the destination, which means that migration time basically increases in proportion to the memory size of the VM. Moreover, at the fourth step, dirtied pages must be copied to the destination repeatedly. If the VM is intensively accessing large amounts of memory, numerous dirty pages are created and transferred. In the worst case, as noted in Section VI-C, live migration never completes; a workload dirties VM memory faster than the network bandwidth can accommodate. Because it is hard to estimate when migration will complete, existing migration technology is not suitable for dynamic resource provisioning in virtualized datacenters.
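The convergence behavior of steps 3)-5) can be illustrated with a small, self-contained toy model of the iterative copy loop, written in C. All parameters (RAM size, effective copy rate, dirtying rate, downtime bound) are hypothetical placeholders rather than values from KVM or from this paper's testbed; the point is only that the loop terminates when pages are dirtied more slowly than they can be sent, and never terminates otherwise.

/* Toy model of the iterative pre-copy loop: each round re-sends the pages
 * dirtied while the previous round was on the wire.  All parameters are
 * hypothetical; this is not KVM's migration code. */
#include <stdio.h>

int main(void)
{
    const double mem_bytes    = 1.0e9;        /* guest RAM                  */
    const double link_Bps     = 30.0e6;       /* effective copy rate        */
    const double page_bytes   = 4096.0;
    const double dirty_pps    = 5120.0;       /* pages dirtied per second   */
    const double max_downtime = 3.0;          /* seconds of stop-and-copy   */

    double to_send = mem_bytes, total = 0.0;
    for (int round = 1; round <= 30; round++) {
        double t = to_send / link_Bps;        /* time for this copy round   */
        total += t;
        printf("round %2d: %7.1f MB in %5.1f s\n", round, to_send / 1e6, t);
        to_send = dirty_pps * page_bytes * t; /* dirtied during the round   */
        if (to_send / link_Bps < max_downtime)
            break;                            /* small enough: stop the VM  */
    }
    printf("copy phase before stop-and-copy: %.1f s\n", total);
    return 0;
}

With the dirtying rate below the copy rate, the remaining set shrinks every round until the stop-and-copy threshold is reached; raising the dirtying rate above the copy rate makes the remaining set grow each round instead.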

Recently, many memory usage optimization mechanisms have been proposed, and some of them are now (or soon will be) available in production. The memory size of a VM is dynamically extended or shrunk corresponding to the actual memory usage of its guest operating system. On a physical node, identical (and similar) memory pages are shared among multiple VMs ([7], [8], [9], [10]). We consider that the memory usage of VMs will not be a primary concern for consolidating VMs efficiently, whereas other resource types such as CPU and I/O bandwidth will be potential barriers for our project; to allow overcommitment and rebalancing of these resources, switching the execution hosts of VMs should be completed instantaneously. Pre-copy live migration technology, however, does not meet this requirement.

III. RELATED WORK

In academia, post-copy live migration has been proposed to reduce the relocation time of VMs. In contrast with pre-copy migration, memory pages are transferred after a VM is resumed at the destination host. The key to post-copy migration is an on-demand memory transfer mechanism, which traps the first access to a memory page at the destination and copies its content from the source host. Post-copy migration basically works as follows:
1) Stop the VM at the source host. Copy the content of the virtual CPU registers and the states of devices to the destination.
2) Resume the VM at the destination without any memory pages.
3) If the VM touches a not-yet-transferred memory page, pause the VM temporarily. Copy the content of the memory page from the source. Then, resume the VM.
The third step is repeated until all memory pages have been transferred to the destination. In prior studies, switching the execution host is performed much more quickly than with pre-copy migration, because copying the largest part of the VM state (i.e., the memory pages) is postponed until after the VM is resumed at the destination.

SnowFlock [11] can dynamically replicate a running VM on other physical machines. A fork()-like API was implemented to control VM cloning, so that developers can easily program distributed systems composed of multiple nodes. This mechanism is similar to post-copy live migration, except that VM cloning needs to keep the original VM running at the source host. When VM cloning is initiated, the original VM is stopped momentarily, and the content of the virtual CPU registers and page table information are copied to the destination hosts. Then, the original VM and the replicated VMs are continued. At the source host, all memory updates are performed on newly-allocated pages to preserve the memory image at the moment of cloning. When a replicated VM tries to read a not-yet-transferred memory page, its content is retrieved from the source. SnowFlock was implemented by modifying both Xen [12] and its paravirtualized Linux system. The memory allocation of Linux is extended for VM cloning, thereby reducing memory page retrievals over the network; memory pages allocated after cloning are never transferred.

The memory management of the hypervisor is also modified to trap memory accesses by VMs.

A study [13] developed a post-copy live migration mechanism for the paravirtualization mode of Xen, which exploits the swap-in/out code of the Linux kernel for on-demand memory transfer. When live migration is initiated, most memory pages of a VM are swapped out to a special swap device backed by physical memory; the VM is slimmed down, and then the content of the virtual CPU registers and the rest of the memory pages are quickly transferred to the destination host. After the VM is resumed at the destination, the guest operating system performs swap-in operations to load memory pages. A special swap device at the destination retrieves the content of memory pages from the source. By extending the paging support of the guest kernel, this mechanism reduces modifications to the hypervisor. In addition, Xen's memory paravirtualization is exploited to implement this special swap-in/out operation with no memory copy; it simply exchanges the Pseudo Physical Address and the Machine Frame Number (the real RAM address) in the mapping table of the hypervisor.

Both of the above implementations are deeply dependent on the memory abstraction of the Xen hypervisor, which is the most critical part of the virtualization code. This kind of extension needs enormous effort to become sufficiently stable for various environments. Although the authors of SnowFlock published its experimental source code, it appears to need more careful testing and improvement before it can be used in production. It will require the continuous efforts of hypervisor experts to merge the extensions into production-level code maintained by an open source community. In addition, both mechanisms need to modify the kernel of the guest operating system. Further development is required to support other operating systems and different versions of Linux. Because IaaS providers allow users to customize VMs flexibly, it is hard to enforce the rule that a guest VM must be properly configured for post-copy migration.

IV. POST-COPY LIVE MIGRATION FOR TOMORROW'S CLOUD
We believe that post-copy live migration is an essential technology for dynamic VM balancing in virtualized datacenters. As discussed in Section III, however, existing mechanisms are not suitable for the real world because of their complexity; there is still a lack of working code. We consider that a feasible post-copy live migration mechanism must meet the following requirements:
• Be carefully designed to reach production with minimum effort. An extension to a hypervisor should preferably be small and easily acceptable to an open source community. Our proposed mechanism should not remain only in an academic paper, but should appear in commercial cloud services with a short lead time.
• Be independent of the internals of a guest VM. An extension should allow service providers and customers to have completely isolated administrative domains. No special drivers or programs should be required to run on a guest operating system; post-copy live migration should be performed transparently for VMs.

Fig. 1. Overview of Our Post-copy Live Migration Mechanism

Fig. 2. Overview of Our VMEM Device

We propose a novel post-copy live migration mechanism for the instantaneous relocation of VMs, implemented as a trivial extension to a widely-used hypervisor, KVM. The mechanism is completely independent of guest operating systems. On-demand memory transfer for migrated VMs, which is the heart of post-copy live migration, is implemented outside of the hypervisor code; a small special program on the host operating system transparently handles page faults and immediately copies remote memory pages. In this section, we describe its design and implementation in detail.

A. KVM (Kernel-based Virtual Machine)

KVM is one of the most widely-used VMMs, developed by an open source community. By loading its device drivers, the Linux kernel works as the hypervisor for KVM. It fully exploits recent hardware support for virtualization, such as Intel VT [14] and AMD-V [15], so that unmodified operating systems run inside VMs with little performance overhead. KVM is an extension to a userland hardware emulator (i.e., QEMU [16]); on a host operating system, running VMs are normal processes, which are easy to handle and extend. It supports pre-copy live migration.

B. Proposed Mechanism

Figure 1 illustrates an overview of our post-copy live migration mechanism.

Fig. 3. VM Relocation Steps in Post-copy Live Migration

The memory area of a VM is mapped onto a special device file, which we developed to implement on-demand memory retrieval outside of the KVM code. This is a trivial change to KVM; only the way the VM's memory is allocated is changed. At the destination host, the first access to a memory page is trapped by our device driver, and its content is copied from the source host by a helper process in userland. The memory pages of the VM can be manipulated from the helper process. In this paper, we call this mechanism a VMEM device. It is composed of only a small device driver (the VMEM driver) and a userland process (the VMEM process) on the host operating system.

1) VMEM Device: As shown in Figure 2, the VMEM device provides memory sharing between a normal process (e.g., QEMU) and a VMEM process by exploiting the Direct I/O feature of Linux and mmap() on a VMEM device file (/dev/vmem0). After the VMEM device driver (vmem.ko) is loaded, VMEM device files (/dev/vmem0, etc.) are created on the host operating system. A VMEM process allocates memory pages in userland, which are to be shared with another process later on. The allocated memory pages are reported to the VMEM driver via ioctl() on /dev/vmem0. Then, the VMEM driver remaps the memory pages into kernel address space by calling the kernel functions for Direct I/O. In Linux, it is difficult (and not recommended) to allocate large amounts of memory in the kernel, because the number of structures reserved for kernel mappings is limited at boot time. Therefore, for a VM's memory area, which may exceed 1 GB, the memory pages are allocated in userland and passed to the kernel via Direct I/O. When a userland process performs mmap() on /dev/vmem0, the allocated memory pages are provided. This means that the mapping process and the VMEM process share the same memory pages; memory manipulation by one process is immediately visible to the other.
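A minimal sketch of the VMEM-process side of this sharing is shown below. The ioctl command and its argument structure are hypothetical placeholders (the paper does not publish the driver's exact interface); only the flow, i.e., allocate pages in userland, register them with /dev/vmem0, and let another process mmap() the same pages, follows the description above.

/* Userland (VMEM process) side of the memory sharing described above.
 * The ioctl command and struct are hypothetical placeholders; the real
 * driver interface is not published in the paper. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/ioctl.h>
#include <unistd.h>

struct vmem_register {                 /* hypothetical ioctl argument      */
    void   *addr;                      /* userland pages backing VM memory */
    size_t  length;
};
#define VMEM_IOC_REGISTER _IOW('V', 1, struct vmem_register)  /* hypothetical */

int main(void)
{
    size_t len = 1UL << 30;            /* 1 GB of guest RAM                */

    /* Pages owned by the VMEM process; they will back the VM's memory.   */
    void *pages = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pages == MAP_FAILED) { perror("mmap"); return 1; }

    int fd = open("/dev/vmem0", O_RDWR);
    if (fd < 0) { perror("open /dev/vmem0"); return 1; }

    /* Report the pages to the driver, which remaps them into kernel space
     * via Direct I/O; a later mmap() of /dev/vmem0 by QEMU then returns
     * these same pages. */
    struct vmem_register reg = { .addr = pages, .length = len };
    if (ioctl(fd, VMEM_IOC_REGISTER, &reg) < 0) { perror("ioctl"); return 1; }

    /* Any data written here (e.g., a page fetched from the source host)
     * is immediately visible to the process that mapped /dev/vmem0.      */
    pause();                           /* wait for fault notifications     */
    return 0;
}

QEMU then plays the role of the mapping process: as described in the next subsection, it simply mmap()s /dev/vmem0 instead of anonymous memory.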

The first page access by the mapping process is trapped by the VMEM driver. The mmap() implementation of Linux prepares the content of mapped memory pages when they are accessed for the first time. If the mapping process accesses a not-yet-prepared page, the page fault handler of the VMEM driver is called by the kernel with the page number. The page fault handler notifies the VMEM process of the target page number via ioctl(), so that the content of the faulted page can be set up in userland. It should be noted that VMEM is a general mechanism for trapping memory accesses and manipulating page data. It is applicable, for example, to software debugging and security inspection, as well as to post-copy live migration of KVM.

2) Extension to KVM: The original KVM allocates VM memory in userland by performing mmap() on /dev/zero. At a destination host, we modify this to use a VMEM device file (e.g., /dev/vmem0) instead. At a source host, we create a normal file on a memory file system (e.g., /dev/shm/kvm/mem0) for VM memory, and let KVM perform mmap() on it. If a VM migrates more than once, VM memory is also backed by a VMEM device file at the source. After the execution host of a VM is switched, the mmap()'ed file at the source is the target of retrievals over the network.
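The KVM-side change can be pictured as a small difference in where guest RAM is mmap()'ed from. The sketch below is a simplified stand-in for the roughly 200-line qemu-kvm modification mentioned in Section V, not the actual patch; the device and file paths follow the examples above.

/* Sketch of the KVM-side change described above: guest RAM is still
 * obtained with a single mmap(), only the backing object differs.
 * Simplified stand-in, not the actual qemu-kvm patch. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *alloc_guest_ram(size_t len, int postcopy_dst)
{
    const char *path = postcopy_dst
        ? "/dev/vmem0"            /* destination: faults trapped by VMEM   */
        : "/dev/shm/kvm/mem0";    /* source: plain file on a memory file   */
                                  /* system (directory assumed to exist),  */
                                  /* later served to the destination       */
    int fd = open(path, O_RDWR | (postcopy_dst ? 0 : O_CREAT), 0600);
    if (fd < 0) { perror(path); return MAP_FAILED; }
    if (!postcopy_dst && ftruncate(fd, len) < 0) { perror("ftruncate"); return MAP_FAILED; }

    void *ram = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                    /* the mapping keeps the pages alive     */
    return ram;
}

int main(void)
{
    void *ram = alloc_guest_ram(1UL << 30, 0);   /* 1 GB, source-side case */
    printf("guest RAM at %p\n", ram);
    return ram == MAP_FAILED;
}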

3) VM Relocation: In post-copy live migration, VM relocation works as illustrated in Figure 3. Steps 2-5 below are repeated until all memory pages have been relocated (a sketch of the userland fault-service loop appears after this list).
0) At the source host, a QEMU process pauses its VM. The VM's memory pages, CPU registers, and device states are retained in the address space of the QEMU process.
1) The QEMU process extracts the content of the CPU registers and the device states, then transfers them to the destination host. At the destination, another QEMU process resumes the VM with these states. The QEMU process at the source is terminated.
2) At the destination, the VM accesses a memory page.
3) If the memory page has not yet been accessed at the destination, the kernel calls the page fault handler of the VMEM driver. The VM is temporarily paused by the kernel, and VMEM performs the following steps.
4) The VMEM driver requests the VMEM process to get the content of the faulted page.
5) The VMEM process fetches the content of the page from the VMEM process at the source, then writes it to the shared VM memory. The VMEM driver completes the page fault handling. Finally, the VM is continued.
After all memory pages have been transferred to the destination by this on-demand copy and by the background copy described later, the VM has no dependency on the source host; it is safe to shut down the VMEM device and process at the source. The network connections used for memory retrievals are terminated, and the allocated memory pages at the source are released.

Moreover, two additional mechanisms work together with the faulted-page retrievals to alleviate temporary performance degradation after relocation. They also help complete the entire memory copy as soon as possible. First, because the memory accesses of a VM tend to be sequential, the on-demand page retrieval mechanism transfers a range of memory pages at once. In our current implementation, the following 128 pages are copied in the same request as a faulted page. This mechanism precaches memory pages that may be accessed in the next few instructions, thereby reducing the remote memory retrievals that can incur performance losses. Second, in parallel with the on-demand page retrievals, a background copy mechanism makes bulk copies of not-yet-transferred pages. Because on-demand page copy may not cover all ranges of VM memory in a short period of time, the background copy mechanism removes the dependency on the source host as soon as possible. The background copy mechanism analyzes important memory areas using page fault statistics, and starts with hot-spot memory pages for the current VM workload. On-demand memory page retrievals over the network are also reduced by this mechanism.
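The following sketch shows how steps 3)-5) and the 128-page precache might look from the VMEM process's side. The ioctl commands, the fetch_from_source() helper, and the presence bitmap are hypothetical placeholders; only the overall flow is taken from the description above.

/* Userland fault-service loop (steps 3-5 plus the 128-page precache).
 * Hypothetical interface; only the flow follows the paper. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define PAGE_SIZE      4096UL
#define PRECACHE_PAGES 128UL                           /* pages per fault   */
#define VMEM_IOC_WAIT_FAULT _IOR('V', 2, uint64_t)     /* hypothetical      */
#define VMEM_IOC_FAULT_DONE _IOW('V', 3, uint64_t)     /* hypothetical      */

/* Fetches `npages` pages starting at page frame `pfn` from the source host
 * into `dst`; assumed to exist elsewhere (e.g., over an NBD connection).  */
extern int fetch_from_source(uint64_t pfn, uint64_t npages, void *dst);

void serve_faults(int vmem_fd, uint8_t *vm_ram, uint8_t *present,
                  uint64_t total_pages)
{
    uint64_t pfn;
    while (ioctl(vmem_fd, VMEM_IOC_WAIT_FAULT, &pfn) == 0) {
        /* On-demand copy: the faulted page plus the following pages,
         * skipping pages the background copy has already brought over.   */
        for (uint64_t i = 0; i < PRECACHE_PAGES; i++) {
            uint64_t p = pfn + i;
            if (p >= total_pages)
                break;
            if (present[p / 8] & (1u << (p % 8)))
                continue;
            fetch_from_source(p, 1, vm_ram + p * PAGE_SIZE);
            present[p / 8] |= 1u << (p % 8);
        }
        ioctl(vmem_fd, VMEM_IOC_FAULT_DONE, &pfn);     /* resume the VM    */
    }
}

A background-copy thread could walk the same bitmap in bulk over a separate connection, in line with the prototype's use of distinct NBD connections for the two copy paths (Section V).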

V. PROTOTYPE IMPLEMENTATION

We developed a prototype implementation of the proposed system for the recent stable releases of KVM (KVM-84/88 and qemu-kvm-0.11). We added approximately 200 lines to the userland code of the original KVM; if post-copy live migration is enabled, the target of mmap() for VM memory allocation is changed as noted in Section IV-B2, and the memory page copy is skipped in the live migration code. These modifications just add simple conditional branches, which do not affect most of the existing code. The VMEM driver is a Linux kernel device driver of approximately 500 lines. The major part of the driver is the page fault handler, which notifies userland of page fetch requests. The userland part of VMEM is a daemon program that initializes memory pages and sends/receives page data via TCP/IP. For memory page retrievals over the network, we exploit the NBD (Network Block Device) protocol [17], which is a simple network storage protocol, like iSCSI [18]. In our previous work, we proposed a wide-area live storage migration mechanism [19], and its implementation [20] supports the NBD protocol. We reuse parts of the storage migration code for our post-copy live migration. We carefully designed both the on-demand and background copy mechanisms; the POSIX threads for the two mechanisms have no lock contention, so the on-demand copy always runs with a higher priority than the background copy. Ideally, the on-demand copy is not affected by the background copy and always completes as soon as possible. Separate NBD connections are used to retrieve memory pages for the on-demand copy and for the background copy. This also makes it possible to use network QoS mechanisms, for example, to give a higher priority to Ethernet frames related to the on-demand copy. Additionally, we experimentally implemented data compression for memory page copies over the network, which can help finish all memory transfers quickly and reduce network traffic. The LZO algorithm [21] is used to compress page data at the source host. Although LZO is a fast compression algorithm, its impact on host processing power is not negligible, so this feature is optional and disabled in the default settings. We confirmed that our prototype implementation worked successfully for Linux and Windows 7 (RC Build 7100) on a GbE network. For instance, a running Windows 7 VM in which YouTube movies were being played was instantaneously relocated to another physical host in several hundred milliseconds. Although many memory page retrievals were initiated immediately after relocation, the movies kept playing smoothly without any visible pauses. Before switching the execution host, approximately 8 Mbytes of data was transferred to the destination, most of which was the state of the virtual VGA device. When the VGA device was disabled, only 256 Kbytes were transferred at relocation.
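Because VM memory is served much like a network block device, a page-range fetch is essentially a read at byte offset pfn * 4096. The request layout below is a simplified stand-in rather than the actual NBD wire format (which additionally carries magic numbers, request handles, and network byte order); it only illustrates how the on-demand and background copies map page numbers onto block-style reads over their separate connections.

/* How a memory-page fetch maps onto a block-device-style read, as in the
 * NBD-based retrieval described above.  The request layout is a simplified
 * stand-in, not the actual NBD wire format. */
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

#define PAGE_SIZE 4096ULL

struct page_read_req {            /* stand-in for a block READ request */
    uint32_t type;                /* 0 = read                          */
    uint64_t offset;              /* byte offset into the VM's memory  */
    uint32_t length;              /* bytes to return                   */
} __attribute__((packed));

/* Request `npages` pages starting at page frame `pfn`, then read the page
 * data back into `buf`.  One such connection is used for the on-demand
 * copy and a separate one for the background copy. */
int fetch_pages(int sock, uint64_t pfn, uint32_t npages, void *buf)
{
    struct page_read_req req = {
        .type   = 0,
        .offset = pfn * PAGE_SIZE,
        .length = (uint32_t)(npages * PAGE_SIZE),
    };
    if (send(sock, &req, sizeof(req), 0) != sizeof(req))
        return -1;

    size_t got = 0;                     /* the reply is just the raw pages */
    while (got < req.length) {
        ssize_t n = recv(sock, (char *)buf + got, req.length - got, 0);
        if (n <= 0)
            return -1;
        got += (size_t)n;
    }
    return 0;
}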

Fig. 4. Experiment Environment

Fig. 5. Response Time (Post-Copy)

Fig. 6. Response QoS (Post-Copy)

Fig. 7. Web Server Throughput (Post-Copy)

Fig. 8. Remote Page Retrievals (Post-Copy)

Fig. 9. Page Offsets (Post-Copy)

VI. EVALUATION

Experiments were conducted to evaluate the performance of the proposed system. We used the SPECweb2005 benchmark [22], a commonly used benchmark suite for measuring web server performance. Figure 4 shows the experiment environment, which models a virtualized datacenter. The source and destination physical machines each have a 2-core processor and 4 GB of memory (Intel Core2 Duo E6305, 4 GB DDR2 RAM). They are connected to two GbE network segments. A VM with one virtual CPU core and 1 GB of RAM is launched at the source machine, and then migrated to the destination. The VM is connected to the physical network segments via network bridges. Inside the VM, Linux (Debian Lenny) and an Apache web server are configured to accept SPECweb client connections from the public network. In the private network, there is a shared storage server that provides virtual disk images accessible from both the source and the destination. A database simulator node of SPECweb is also placed in the private network. In addition to SPECweb, we also performed experiments with our own workload program, which focuses on page updates.

A. Post-Copy Live Migration (Background Copy Disabled)

We ran an Internet banking benchmark (SPECweb Banking) on the VM. The number of concurrent client connections was set to 200; the VM was heavily loaded and its CPU usage stayed at approximately 100%. At 150 seconds, post-copy live migration was initiated to switch the execution host of the VM. The results of this benchmark are shown in Figures 5, 6, and 7. In this experiment, the background copy mechanism was disabled in order to focus on on-demand memory retrievals. The switching of the execution host was completed within one second.

Fig. 10. Response Time (Post-Copy, Background Copy Enabled)

Fig. 11. Web Server Throughput (Post-Copy, Background Copy Enabled)

Fig. 12. Response QoS (Post-Copy, Background Copy Enabled)

Immediately after the switching, however, request response times temporarily became worse. As indicated in Figure 8, many remote page retrievals were performed. Approximately 10 seconds later, the page retrieval rate gradually slowed down. Although not clearly visible within the 300-second window, the request response time recovered later. Figure 6 shows the user experience estimated by the benchmark; if a state is "Failed," the user is assumed to leave the online banking site without completing a transaction, because the web site appears to hang. After the switching, there were Failed states, which were gradually resolved as the number of retrieved pages increased. In this experiment, the banking web site kept working for most customers. With a smaller number of concurrent client connections, the decline in user experience is milder. Figure 9 shows the transferred page offsets. The offsets of the retrieved pages fell almost entirely within a particular area of VM memory; not all pages are required immediately after relocation. This means that the background copy mechanism should be designed to copy important areas of memory before others, so that performance degradation can be better alleviated.

B. Post-Copy Live Migration (Background Copy Enabled)

Next, we repeated the same experiment with the background copy enabled. Five seconds after switching the execution host, the background copy was started with a transfer speed of 800 Mbps. All memory pages were transferred in approximately 10 seconds.
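This is consistent with a simple estimate, assuming the 800 Mbps rate was sustained for the VM's 1 GB of RAM:

t \approx \frac{1\ \mathrm{GB} \times 8\ \mathrm{bit/byte}}{800\ \mathrm{Mbit/s}} = 10\ \mathrm{s}.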

Fig. 13. Response QoS (Pre-Copy)

As shown in Figures 10, 11, and 12, after all memory pages had been transferred, no performance degradation was observed. The response time recovered, returning to the same value as before the relocation. There was, however, a spike in response time at 150 seconds, which caused Failed QoS states just after the relocation. During this benchmark, the VM was heavily loaded, consuming nearly 100% of the CPU; this result should be close to the worst case for live migration. When we ran the same experiments with fewer concurrent client sessions, the spike was lower and the decline in QoS states was negligible. In future work, we will address dynamic VM rebalancing with instantaneous live migration, and we currently consider it reasonable to give idle VMs a higher relocation priority, because the performance impact on such VMs will probably be slight or negligible.

C. Pre-Copy Live Migration

For comparison with our proposed mechanism, we conducted the same experiment with KVM's pre-copy live migration. Figures 13 and 14 show the results of this experiment. We started live migration at 150 seconds. As shown by the private network traffic (Figure 14), KVM could not finish the iterative memory page transfer of pre-copy live migration (as explained in Section II); during this experiment, the execution host of the VM was never switched.

Fig. 14. Private Network Traffic (Pre-Copy)

Fig. 16. Page Update Workload: Private Network Throughput, and CPU Usage at Source & Destination (Pre-Copy)

After live migration is initiated, KVM tries to replicate the entire range of memory pages of the running VM, repeatedly copying updated memory pages during this replication. KVM does not stop the VM until the amount of remaining memory is below a small threshold, which is calculated from an estimated transfer speed so that the relocation downtime stays within 3 seconds.

Fig. 15. Migration Time with Different Page Update Speeds (Pre-Copy)

In this experiment, the running VM at the source host updated memory pages faster than they could be transferred over the network; the percentage of remaining memory pages never fell below the calculated threshold. Although implementation details differ among VMMs, actively running VMs that access memory pages intensively pose inherent problems: for example, the relocation downtime grows as noted above, and in another case a VMM forcibly slows down a VM in order to migrate it in finite time [5].
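The stop condition described above can be written as a simple inequality, where R is the number of remaining dirty pages and B is the estimated transfer rate (the 3-second bound and the 4-KiB page size are taken from the text; the exact formula used by KVM may differ):

\text{expected downtime} \approx \frac{R \times 4\,\mathrm{KiB}}{B} \le 3\ \mathrm{s} \quad\Longrightarrow\quad R \le \frac{3\ \mathrm{s} \cdot B}{4\,\mathrm{KiB}}.

If pages are dirtied faster than B, R never drops below this bound and the iterative copy does not terminate.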

Fig. 17. Page Update Workload: Private Network Throughput, and CPU Usage at Source & Destination (Post-Copy)

D. Memory Page Update Workload

To focus on the impact of the migration mechanisms themselves, we developed a workload program that updates memory pages at a specified speed (a sketch of such a workload appears below). It first allocates a large number of memory pages (90% of the VM memory size), and then repeatedly updates all of them by writing one byte to each page. As illustrated in Figure 15, the relocation time of pre-copy live migration increases with the page update speed. When we ran the workload program with a page update speed of 10240 pages/s, the memory transfer of the migration continued endlessly, as discussed in Section VI-C. Figure 16 and Figure 17 show the results of round-trip migration; vertical lines indicate the times of execution host changes. We ran the workload program with a page update speed of 5120 pages/s.
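A minimal stand-in for such a page-update workload is sketched below; the memory size and rate are command-line parameters (e.g., 921 MB and 5120 pages/s approximate the setting above for a 1 GB VM), and the pacing is deliberately crude. This is not the authors' actual program.

/* Sketch of the page-update workload described above: allocate ~90% of the
 * VM's memory, then touch one byte in each page at a fixed rate. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE_SIZE 4096UL

int main(int argc, char **argv)
{
    /* ./workload <memory_MB> <pages_per_second>, e.g. 921 5120            */
    size_t mem_bytes  = (argc > 1 ? strtoul(argv[1], NULL, 10) : 921) << 20;
    unsigned long pps = argc > 2 ? strtoul(argv[2], NULL, 10) : 5120;

    size_t npages = mem_bytes / PAGE_SIZE;
    volatile uint8_t *buf = malloc(npages * PAGE_SIZE);
    if (!buf) { perror("malloc"); return 1; }

    struct timespec tick = { 0, 1000000000L / pps };   /* pacing per page   */
    for (size_t i = 0;; i = (i + 1) % npages) {
        buf[i * PAGE_SIZE]++;          /* dirty exactly one byte per page   */
        nanosleep(&tick, NULL);        /* crude rate limiting               */
    }
}

At 5120 pages/s the dirtying rate is roughly 20 MB/s, below the ~30 Mbytes/s effective pre-copy transfer rate observed in this experiment (see below), so pre-copy still converges; at 10240 pages/s it exceeds that rate and never terminates, matching the observation in Section VI-C.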

First, we launched the VM at the source host and started the workload program on the guest operating system. At 50 seconds, we moved the VM to the destination host by pre-/post-copy live migration. After all migration steps were completed, we continued running the VM at the destination for 60 seconds. Finally, we moved the VM back to the source host. Pre-copy live migration took approximately 65 seconds to switch the execution host and complete the migration. Post-copy live migration, on the other hand, promptly switched the execution host and completed all memory transfers in approximately 10 seconds. In post-copy live migration, the memory image at the source is never updated during the memory copy; the background copy mechanism sends a large number of memory pages at once, thereby efficiently utilizing the available bandwidth of the private network. In pre-copy live migration, however, the memory image at the source is intensively updated by the running VM. The pre-copy mechanism repeatedly scans all of the VM's memory pages for updates and transfers the dirtied pages at each iteration of the memory copy. As a result, the memory transfer speed over the network is limited to approximately 30 Mbytes/s, which results in a long migration time.

Post-copy migration involved higher CPU utilization than pre-copy. In Figure 17, the first spike in CPU usage at the source host was caused by copying the CPU and device states. The first CPU usage spike at the destination host was triggered by VMEM device initialization. Next, at both hosts, the on-demand and background copy mechanisms used approximately 20% of the CPU power. After all memory transfers were completed, approximately 5% CPU usage remained, caused by page faults on memory pages that had already been transferred to the new host but had not yet been accessed by the VM there. We consider that further optimization of our implementation is possible in order to reduce CPU usage. For rapid prototyping, memory copy requests are currently sent for each 4-Kbyte page; these should be merged into a single request. Unnecessary page faults could also be skipped by modifying the page table directly. We will make such optimizations before releasing our proposed mechanism for production use.

VII. CONCLUSION

In this paper, we proposed an advanced live migration mechanism that enables instantaneous VM relocation on physical machines. A VM is promptly migrated to another machine in several hundred milliseconds. To minimize the time required for switching the execution host, memory pages are transferred after the VM resumes at the destination host. In comparison with related work, our proposed mechanism has two major advantages. First, it does not require any special drivers or programs running inside VMs; it supports guest operating systems without any modifications to them. This is suitable for the virtualized datacenters of the IaaS cloud, because customers can flexibly customize their VMs and service providers can dynamically optimize VM locations. Second, the mechanism is implemented as a lightweight extension to KVM. It does not require modifying critical parts of the VMM, such as memory management. A special device driver is added to the host operating system, which transparently performs on-demand memory retrievals after relocation. Our prototype implementation is remarkably simple and already sufficiently stable for daily use in our laboratory. We believe this mechanism should not remain only in academic papers, but should also become available in commercial cloud services in the very near future. Experiments showed that a heavily loaded VM was successfully migrated to another physical machine within one second. The background copy mechanism, which precaches important memory pages, contributed significantly to minimizing temporary performance degradation after relocation. In addition, our proposed mechanism moved the complete state of a running VM (including memory pages) faster than pre-copy live migration; it reduced the total number of transferred memory pages and copied them efficiently by utilizing the available network bandwidth.

We are now designing an advanced resource management system for virtualized datacenters based on our instantaneous live migration mechanism. VMs are placed on physical machines with overcommitted resource allocation. Corresponding to actual resource usage, VMs are quickly rebalanced among physical machines, thereby meeting performance criteria and reducing power consumption. This challenge will be reported in our upcoming papers.

ACKNOWLEDGMENTS

This work was partially supported by KAKENHI (20700038) and JST/CREST ULP.

REFERENCES

[1] Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2.
[2] A. Kivity, Y. Kamay, D. Laor, and A. Liguori, "kvm: the Linux virtual machine monitor," in Proceedings of the Linux Symposium. The Linux Symposium, 2007, pp. 225-230.
[3] H. Nakada, T. Hirofuchi, H. Ogawa, and S. Itoh, "Toward virtual machine packing optimization based on genetic algorithm," in Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living (Proceedings of International Symposium on Distributed Computing and Artificial Intelligence 2009), ser. Lecture Notes in Computer Science, vol. 5518. Springer, Jun 2009, pp. 651-654.
[4] M. Nelson, B.-H. Lim, and G. Hutchins, "Fast transparent migration for virtual machines," in Proceedings of the USENIX Annual Technical Conference. USENIX Association, 2005, pp. 25-25.
[5] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live migration of virtual machines," in Proceedings of the 2nd Symposium on Networked Systems Design and Implementation. USENIX Association, 2005, pp. 273-286.
[6] A. Mirkin, A. Kuznetsov, and K. Kolyshkin, "Containers checkpointing and live migration," in Proceedings of the Linux Symposium. The Linux Symposium, Jul 2008, pp. 85-92.
[7] D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat, "Difference engine: Harnessing memory redundancy in virtual machines," in Proceedings of the Eighth USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 2008, pp. 309-322.
[8] D. Magenheimer, C. Mason, D. McCracken, and K. Hackel, "Paravirtualized paging," in Proceedings of the USENIX First Workshop on I/O Virtualization. USENIX Association, 2008.
[9] C. A. Waldspurger, "Memory resource management in VMware ESX server," in Proceedings of the 5th Symposium on Operating Systems Design and Implementation. ACM Press, 2002, pp. 181-194.
[10] A. Arcangeli, I. Eidus, and C. Wright, "Increasing memory density by using KSM," in Proceedings of the Linux Symposium. The Linux Symposium, Jul 2009, pp. 19-28.
[11] H. A. Lagar-Cavilla, J. A. Whitney, A. Scannell, P. Patchin, S. M. Rumble, E. de Lara, M. Brudno, and M. Satyanarayanan, "SnowFlock: Rapid virtual machine cloning for cloud computing," in Proceedings of the Fourth ACM European Conference on Computer Systems. ACM Press, 2009, pp. 1-12.
[12] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. ACM Press, 2003.
[13] M. R. Hines and K. Gopalan, "Post-copy based live virtual machine migration using adaptive pre-paging and dynamic self-ballooning," in Proceedings of the 5th International Conference on Virtual Execution Environments. ACM Press, 2009, pp. 51-60.
[14] G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig, "Intel virtualization technology: Hardware support for efficient processor virtualization," Intel Technology Journal, vol. 10, no. 13, pp. 167-178, Aug 2006.
[15] Advanced Micro Devices, AMD64 Architecture Programmer's Manual Volume 2: System Programming, Revision 3.14, Sep 2007.
[16] F. Bellard, "QEMU, a fast and portable dynamic translator," in Proceedings of the USENIX Annual Technical Conference. USENIX Association, 2005, pp. 41-41.
[17] P. T. Breuer, A. M. Lopez, and A. G. Ares, "The network block device," 1999.
[18] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner, "Internet small computer systems interface (iSCSI)," RFC 3720, Apr. 2004. [Online]. Available: http://www.ietf.org/rfc/rfc3720.txt
[19] T. Hirofuchi, H. Nakada, H. Ogawa, S. Itoh, and S. Sekiguchi, "A live storage migration mechanism over WAN and its performance evaluation," in Proceedings of the 3rd International Workshop on Virtualization Technologies in Distributed Computing. ACM Press, Jun 2009, pp. 67-74.
[20] T. Hirofuchi, "xNBD," http://bitbucket.org/hirofuchi/xnbd/.
[21] M. F. X. J. Oberhumer, "LZO – a real-time data compression library," http://www.oberhumer.com/opensource/lzo/.
[22] Standard Performance Evaluation Corporation, "SPECweb2005," http://www.spec.org/web2005/.