2011 International Conference on Parallel Processing

GPU Resource Sharing and Virtualization on High Performance Computing Systems
Teng Li, Vikram K. Narayana, Esam El-Araby, Tarek El-Ghazawi
Department of Electrical and Computer Engineering, The George Washington University, Washington, DC, USA
{tengli, vikram, esam, tarek}@gwu.edu

Abstract—Modern Graphics Processing Units (GPUs) are widely used as application accelerators in the High Performance Computing (HPC) field due to their massive floating-point computational capabilities and highly data-parallel computing architecture. Contemporary high performance computers equipped with co-processors such as GPUs primarily execute parallel applications using the Single Program Multiple Data (SPMD) model, which requires a balance between the microprocessor and co-processor computing resources to ensure full system utilization. While the inclusion of GPUs in HPC systems provides more computing resources and significant performance improvements, the asymmetrical distribution of the number of GPUs relative to the number of microprocessors can result in underutilization of the overall system computing resources. In this paper, we propose a GPU resource virtualization approach that allows underutilized microprocessors to share the GPUs. We analyze the factors affecting parallel execution performance on GPUs and conduct a theoretical performance estimation based on the most recent GPU architectures as well as the SPMD model. We then present the implementation details of the virtualization infrastructure, followed by an experimental verification of the proposed concepts using an NVIDIA Fermi GPU computing node. The results demonstrate a considerable performance gain over traditional SPMD execution without virtualization. Furthermore, the proposed solution enables full utilization of the asymmetrical system resources through the sharing of the GPUs among microprocessors, while incurring low overheads due to the virtualization layer.

Keywords-GPU; virtualization; resource sharing; SPMD

I. INTRODUCTION

Rapid advancements in graphics processing technology over the past few years, coupled with the introduction of programmable processors in Graphics Processing Units (GPUs), have led to the advent of GPGPU, or General-Purpose computation on Graphics Processing Units [1]. Due to their powerful floating-point computational capabilities and massively parallel processor architecture, GPUs are increasingly being used as application accelerators in the high-performance computing (HPC) arena. Thus, a wide range of HPC systems now incorporate GPUs as hardware accelerators, ranging from clusters of compute nodes to parallel supercomputer systems. Several examples of GPU-based computer clusters may be found in academia, such as [2]. The latest offerings from supercomputer vendors have begun to include GPUs in the compute blades of their parallel machines; examples include the SGI Altix UV [3] and the recently released Cray XK6 supercomputer incorporating NVIDIA GPUs [4]. Yet another example is the Tianhe-1A supercomputer, which uses NVIDIA Tesla M2050 GPUs and currently ranks 2nd in the Top 500 list of supercomputers, with a sustained Linpack performance of 2.57 PFlop/s [5].
Development of applications for any of these HPC systems requires the use of parallel programming techniques. Among the different parallel programming approaches, the most commonly followed programming style is the Single Program Multiple Data (SPMD) model [6]. Under the SPMD scenario, multiple processes execute the same program on different CPU cores, simultaneously operating on different data sets in parallel. By allowing autonomous execution of processes at independent points of the same program, SPMD serves as a convenient yet powerful approach for efficiently making use of the available hardware parallelism.
With the introduction of hardware accelerators, such as GPUs, as co-processors, HPC systems are exhibiting architectural heterogeneity that has given rise to programming challenges not present in traditional homogeneous parallel computing platforms. With the SPMD approach used for programming most homogeneous parallel architectures, directly offloading the program instances onto GPUs is not feasible due to the different Instruction Set Architectures (ISAs) of CPUs and GPUs. Moreover, GPUs are primarily suited for the compute-intensive portions of the program, serving as co-processors to the CPUs in order to accelerate these sections of the parallel program. The "single program" requirement of SPMD therefore means that every program instance running on the CPUs must have access to a GPU accelerator. In other words, it is necessary to maintain a one-to-one correspondence between CPUs and GPUs; that is, the number of CPU cores must equal the number of GPUs. However, due to the proliferation of many-core microprocessors in HPC systems, the number of CPU cores generally exceeds the number of GPUs, resulting in the underutilization of the system computing resources within the SPMD approach.


Although the number of GPUs may not match the number of CPU cores found in contemporary HPC systems, modern high-end GPUs internally contain several hundred processing cores. For example, the most recent NVIDIA Fermi architecture, in the Tesla 20 series GPUs, features 448 Streaming Processor (SP) cores and allows the simultaneous execution of up to 16 GPU kernels [7]. The increasing parallel computation capabilities of modern GPUs enable the possibility of sharing a single GPU among different applications or multiple instances of the same application, especially when the application problem size and parallelism are significantly smaller than the inherent parallelism capacity of the GPU. Based on the resource underutilization under SPMD execution as well as the increasing parallelism available in modern GPUs, we propose the idea of sharing the GPU resources among the microprocessor cores in heterogeneous HPC systems by providing a virtualized unity ratio of GPUs to microprocessors. We develop a GPU resource virtualization infrastructure that provides the required symmetry between GPU and CPU resources by virtually increasing the number of GPU resources, thereby enabling efficient SPMD execution. We also analyze the GPU program execution within a single compute node and provide an analytical model for our resource virtualization scenario. Furthermore, we conduct experiments using our virtualization infrastructure on an NVIDIA Fermi GPU cluster node as a verification of the proposed infrastructure and execution model.
The rest of this paper is organized as follows. Section II provides an overview of related work on resource sharing in heterogeneous HPC systems and GPU resource virtualization under different contexts. A background of GPU computing, architecture and execution is given in Section III, followed by a formal analysis of the GPU execution model for our proposed solution in Section IV. Section V then discusses the implementation details of the GPU resource sharing and virtualization infrastructure. The experimental results are presented and discussed in Section VI, followed by our conclusions in Section VII.

II. RELATED WORK

With the continued proliferation of GPGPU, virtualization of GPUs as compute devices is gaining considerable attention in the research community. Recent research has focused on providing access to GPU accelerators within virtual machines [8][9][10]. These studies focus on providing native CUDA support within virtual machines on computers that use NVIDIA GPUs. Gupta et al. [8] presented the GViM software architecture, which allows CUDA applications to execute within virtual machines under the Xen virtual machine monitor. The GPU is controlled by the management OS, called ‘dom0’ in Xen parlance. The actual GPU device driver therefore runs in dom0, while applications execute on the guest OS, or virtual machine. Using a split-driver approach, the CUDA runtime API function calls from the application are captured by an interposer library within the virtual machine, followed by the transfer of the function parameters to dom0 for the actual interaction with the GPU. A similar approach is adopted in vCUDA by Shi et al. [9], also on Xen. They additionally provide a suspend-and-resume facility by maintaining a record of the GPU state within the virtualization infrastructure in the guest OS as well as the management OS. One of the drawbacks of using Xen is that NVIDIA drivers do not officially support Xen, so the aforementioned approaches are not portable. Giunta et al. [10] circumvent this problem by using the Kernel-based Virtual Machine (KVM) available in Linux distributions, while following a similar split-driver approach coupled with API call interception. Their software infrastructure, termed gVirtuS, focuses on providing GPU access to virtual machines within virtual clusters as part of a cloud computing environment. For cases in which the host machine does not have a local GPU, they envision that the virtual machines will use TCP/IP to communicate with other hosts within the virtual cluster in order to gain access to remote GPUs.
Although native support for CUDA within virtual machines (VMs) is an attractive solution, the use of multiple VMs within the same compute node can result in overheads for HPC applications. To elaborate, efficient SPMD execution requires all CPU cores (processes) to have a virtual view of a GPU; with the VM approach, this would require a virtual machine to be launched for every CPU core within the compute node. Since the number of CPU cores per node is rapidly on the rise, the VM-based approach for GPU sharing can incur significant overheads. Further, the available VM-based solutions use the GPU in a time-shared fashion, and the ability to simultaneously execute GPU functions from multiple processes is not available.
Other types of solutions have also been proposed for GPU virtualization in HPC systems. For example, Duato et al. [11] propose the use of remote GPU access, similar to [10], for cases in which high-performance clusters do not have a GPU within every compute node. Instead of using a VM, they propose a GPU middleware solution consisting of a daemon running on GPU-capable nodes that serves requests from non-GPU client nodes. The client nodes incorporate a CUDA wrapper library to capture and transfer API function calls to the server using TCP/IP. Although the VM overheads are removed, their proposed solution can result in communication overheads when accessing GPUs on remote compute nodes. Moreover, simultaneous execution of multiple GPU kernels is not discussed.
In order to share GPU resources among multiple requesting processes, Guevara et al. [12] propose an approach that involves run-time interception of GPU invocations from one or more processes, followed by merging them into one GPU execution context. Currently their solution is demonstrated for two kernels, with the merged kernels predefined manually. Although kernel merging can be useful for SPMD execution, it would need compiler support for generating the combined kernels a priori, which can actually be avoided by using the concurrent kernel execution support of the latest NVIDIA GPUs. Moreover, the kernel merging approach incurs data transfer overheads, since the combined kernels are launched immediately and do not exploit the concurrent copy-and-execute feature, which allows the data transfer for the second compute kernel to proceed while the execution of the first kernel is in progress. Another solution for sharing GPU resources among multiple processes is proposed by the S_GPU project [13]. S_GPU is a software layer that resides between the application and the GPU, typically used for GPU time-sharing between MPI processes in a parallel program. Each MPI process is provided the view of a private GPU through a custom stream-based API. Each process inserts the GPU commands, such as memory copies, kernel launches, etc., in the required sequence into a stream object, irrespective of the number of GPUs available. When the process initiates the execution of a stream, all the enqueued GPU commands are executed in the required sequence. The S_GPU software stack takes care of sharing the available GPUs among the streams from multiple processes. The approach followed by S_GPU is complementary to ours and may be combined with our proposed approach by simultaneously executing kernels from multiple processes for efficient GPU sharing.
Peters et al. [14] proposed another technique for sharing an NVIDIA GPU among multiple host threads on a compute node. Their technique involves the use of persistent kernels that are initialized together within a single execution context and remain executing on the GPU indefinitely. Each of these persistent kernels occupies a single block of threads on the GPU, and they execute the appropriate function based on commands written to the GPU memory by the host process running on the CPU. This ‘management thread’ accepts requests from other CPU threads for the use of GPU resources. Their approach allows multiple kernels to execute simultaneously even on devices that do not support concurrent kernel execution. However, due to the persistent nature of the kernels, the number of thread blocks is severely limited, which means that the memory latency may not be hidden effectively. As a result, highly data-parallel applications cannot take full advantage of the GPU resources. Furthermore, their approach requires significant changes to the application code to fit within a thread block. Also, the communication mechanism between the management thread and the other CPU threads is not clear. Nevertheless, the use of a management process to control the GPU and manage the GPU memory and resources is a feature shared with our proposed solution.
Virtualization of co-processors has been studied for other technologies as well, such as Field-Programmable Gate Arrays (FPGAs). For example, Huang and Hsiung [15] provide an OS-based infrastructure that allows multiple hardware functions from different processes to be configured in the FPGA by using the partial run-time reconfiguration feature. Their virtual hardware mechanism allows a given hardware function on the FPGA to be used by multiple applications, as well as enabling one application to access multiple hardware functions simultaneously residing in the FPGA. Our previous work [16] also uses the partial run-time reconfiguration feature of the FPGA, albeit for enabling efficient SPMD execution in HPC systems. This work partitions the FPGA into multiple regions and allocates each region to a CPU core within the compute node. By thus providing a private, or virtual, FPGA to every CPU core, the required 1:1 ratio of the number of CPU cores to the number of virtual FPGAs is achieved. This work forms the basis of our proposed virtualization solution for GPU-based HPC systems. Our proposed approach makes use of features of the latest GPUs, such as concurrent kernel execution and concurrent data transfer and execution, to improve the execution performance of the system. Overheads are kept small by maintaining a simple communication mechanism between the CPU processes and our proposed virtualization infrastructure.

III. BACKGROUND OF GPU COMPUTING

We provide a brief overview of the GPU programming model and computing architecture. Our analytical execution model is based on the Compute Unified Device Architecture (CUDA) [7] and Open Computing Language (OpenCL) [17] programming models and the NVIDIA Fermi architecture [18].

A. GPU Programming Model
CUDA and OpenCL are two similar GPU programming models provided by NVIDIA and the Khronos Working Group, respectively. Both models follow SPMD by executing data-parallel kernel functions within the GPU, and both provide abstractions of a thread group hierarchy and a shared memory hierarchy. In terms of the thread group hierarchy, both models (CUDA/OpenCL) provide three hierarchy levels: Grid/NDRange, Block/Workgroup and Thread/Workitem. For convenience, we use the CUDA terminology in the rest of this paper, namely grids, blocks and threads. GPU kernels are launched per grid, and a grid is composed of a number of blocks, which have access to the global device memory. Each block consists of a group of threads, which are executed concurrently and share access to the on-chip shared memory. Each thread is a very lightweight execution of the kernel function. From the programming perspective, the programmer writes the kernel program for one thread and decides the total number of threads to be executed on the GPU device, dividing the threads into blocks based on the data-sharing pattern, memory sharing and architectural considerations.

B. GPU Architectural Model
From the hardware perspective, the most recent NVIDIA Fermi GPU architecture [18], shown in Figure 1, is composed of 16 streaming multiprocessors (SMs), each of which consists of 32 Streaming Processor (SP) cores. Each thread is executed on an SP core, and each block runs on an SM at a given time. Inside each SM, the threads of a block are scheduled in batches of 32 threads, called warps. In other words, while blocks are running on SMs, the warp schedulers (two per SM in the case of Fermi) schedule the warps of those blocks to execute on the SPs. Moreover, more than one block can be assigned to an SM, provided that there are enough resources, such as registers and shared memory, available on the SM.
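As a concrete illustration of this decomposition (a minimal sketch of ours, not code from the paper; the kernel name and sizes are illustrative), the programmer writes one thread's worth of work as a kernel and then chooses the grid and block dimensions at launch time:

#include <cuda_runtime.h>

// Each thread handles one element; the programmer only writes the per-thread logic.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index within the grid
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);
    // ... device buffers would normally be filled from host data here ...
    int threadsPerBlock = 256;                                  // a multiple of the 32-thread warp
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);     // one grid, made of blocks of threads
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

The block size is typically chosen as a multiple of the warp size so that the warp schedulers described above stay fully occupied.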


[Figure 1. An overview of the Fermi architecture: streaming multiprocessors (SMs) containing streaming processors (SPs), warp schedulers and shared memory, with thread blocks executed on the SMs and the global GPU device memory shared among them.]
[Figure 2. A virtual SPMD view of asymmetric heterogeneous HPC systems with GPUs: within each node, every processor is given its own virtual GPU view of the single physical GPU, and the nodes are connected by the system interconnection network.]

Programming and executing a grid of GPU kernel threads for a single kernel involves the creation of a GPU context for the kernel. If more than one kernel grid is running on a single GPU from different processes, a separate kernel context is created for each kernel. Each kernel can thus share the GPU sequentially, at the cost of GPU context switching. Under the Fermi architecture, multiple kernels can be executed simultaneously only if they belong to the same GPU context and no single kernel is large enough to occupy the full GPU. We utilize the concurrent kernel execution as well as the concurrent I/O and kernel execution capabilities of Fermi in our virtualization infrastructure, and we discuss them further in the following sections.
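The following sketch (ours; the kernel, sizes and stream count are illustrative assumptions) shows the pattern this relies on: several small kernels launched from the same context into different CUDA streams, which Fermi-class devices may execute concurrently when no single kernel occupies the whole GPU.

#include <cuda_runtime.h>

__global__ void smallKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.000001f + 0.5f;  // placeholder work
        data[i] = v;
    }
}

int main(void) {
    const int nStreams = 4, n = 1 << 14;
    cudaStream_t streams[nStreams];
    float *buf[nStreams];
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
    }
    // Each launch uses only a few blocks, so the kernels can run side by side
    // within the single context that owns all of these streams.
    for (int s = 0; s < nStreams; ++s)
        smallKernel<<<8, 256, 0, streams[s]>>>(buf[s], n);
    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buf[s]);
    }
    return 0;
}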

[Figure 3. A GPU execution cycle for a single process: GPU Initialization (Tinit), Send Data (Tdata_in), Compute (Tcomp), Retrieve Data (Tdata_out).]

IV. ANALYSIS OF THE GPU EXECUTION MODELS

In this section, we study the factors that affect the overall performance of GPU kernel executions. This study allows us to investigate the performance potential of our concept of resource sharing based on a centralized virtualization manager. We provide an execution model depicting several different GPU kernel execution scenarios, derive a set of formulas for each scenario, and analyze the amount of performance gain that can theoretically be achieved. Our implementation is based on the analysis described in this section. In describing this model, we follow an approach similar to the one presented in [16]. We first describe an SPMD architectural model consisting of GPUs, to which our virtualization concept is applied. We then describe an execution and timing model for GPU execution requests at the process level, based on the SPMD virtualization architecture model. Finally, we derive performance formulas and make comparisons.

A. SPMD Architectural Model with GPU Virtualization
Figure 2 shows a representative HPC system architecture with several nodes connected by an interconnection network. Within each node there is a heterogeneous asymmetry between the microprocessors and the GPU. Under the SPMD scenario, each processor runs the same application on different data, while the application has a few computationally intensive functions to be executed on the GPU. Our approach exposes a virtual view of the GPU to each microprocessor by creating a virtualization layer, so that each microprocessor "sees" its own Virtual GPU (VGPU) and executes its own GPU functions within each node.

B. Process-level GPU Request and Execution Model
To study the potential GPU performance gain (speedup) from virtualization, we focus our discussion on the GPU tasks and exclude the software (CPU) tasks. This is because we are mostly interested in enhancing the performance of the GPU tasks through virtualization, and we avoid the unnecessary complications of modeling the co-scheduling of both CPU and GPU tasks. We set the comparison baseline to be the scenario in which multiple SPMD processes have direct access to the physical GPU without the virtualization layer. Under the asymmetrical architecture described previously, multiple processes running on the microprocessors within each node conventionally execute their individual GPU tasks on the single shared physical GPU sequentially. To analyze the execution pattern of conventional GPU sharing without virtualization, we model the execution cycle for a given process performing a computing task on the GPU, as shown in Figure 3. The execution cycle is composed of the following four stages: the process first initializes the GPU device, creates its own GPU context and allocates the GPU device memory; it then sends the data, computes the task, and finally retrieves the data back to the process. The time spent on each stage is also specified in Figure 3.
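For reference, the four stages of this cycle correspond directly to the usual CUDA host-side calls; the sketch below is ours (names and sizes are illustrative) and simply labels where Tinit, Tdata_in, Tcomp and Tdata_out are spent.

#include <cuda_runtime.h>

__global__ void computeTask(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                       // placeholder computation
}

void runOneTask(const float *hostIn, float *hostOut, int n) {
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));           // first CUDA call also creates the context: Tinit
    cudaMemcpy(dev, hostIn, n * sizeof(float),
               cudaMemcpyHostToDevice);            // send data: Tdata_in
    computeTask<<<(n + 255) / 256, 256>>>(dev, n); // compute: Tcomp
    cudaMemcpy(hostOut, dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);            // retrieve data: Tdata_out
    cudaFree(dev);
}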


[Figure 4. GPU execution cycles from multiple processes without virtualization: after GPU initialization (Tinit), each task's send data (Tdata_in), compute (Tcomp) and retrieve data (Tdata_out) stages are preceded by a context switch (Tcontext_switching).]
[Figure 5. GPU execution of compute-intensive applications with virtualization, when (a) the previous “retrieve data” finishes before the current “compute”, and (b) the current “compute” finishes before the previous “retrieve data”.]
[Figure 6. GPU execution of I/O-intensive applications with virtualization, when (a) the previous “retrieve data” finishes before the current “compute”, and (b) the current “compute” finishes before the previous “retrieve data”.]

In addition, we define the necessary model parameters in TABLE I and will use them in the mathematical analysis that follows.

TABLE I. PARAMETERS DEFINED IN THE EXECUTION MODEL
Ntask          The number of parallel tasks (SPMD processes) in each node, which should not exceed Nprocessor
NVGPU          The number of virtual GPU views exposed to the processes by the virtualization layer, which equals Nprocessor
Nprocessor     The total number of processors in each node
Tinit          The total time for all processes to initialize the GPU device and the corresponding GPU contexts
Tctx_switch    The average time for each process to switch to its own GPU context
Tdata_in       The average time for each process to transfer its data into the GPU device memory
Tdata_out      The average time for each process to retrieve its data back from the GPU device memory
Tcomp          The average time for the GPU to compute the task
Ttotal_no_vt   The total time to execute all the tasks from all processes on the HPC system without the virtualization approach
Ttotal_vt      The total time to execute all the tasks from all processes on the HPC system with the GPU virtualization approach
S              The theoretical speedup achieved with virtualization, S = Ttotal_no_vt / Ttotal_vt
Smax           The theoretical maximum value of S

1) The Execution Model of GPU Sharing Among SPMD Processes Without Virtualization on Each Node
Under the conventional execution scheme within each node, multiple processes treat the GPU as a non-virtualized resource. The latest NVIDIA GPU architecture allows multiple host processes to execute under the shared compute mode by creating a separate GPU context for each process. The GPU kernels from the multiple processes are executed serially, in the order queued, with the respective GPU context-switching overheads. In other words, the execution model for conventional GPU sharing among multiple processes is sequential among tasks. Moreover, since each process needs to initialize the GPU device and create its own context, we assume a total GPU initialization overhead Tinit for all processes at the beginning, and each of the following processes incurs an average context-switching overhead Tctx_switch, as shown in Figure 4. Therefore, the total execution time for the conventional scheme without virtualization is given by equation (1):

Ttotal_no_vt = (Ntask - 1)(Tctx_switch + Tdata_in + Tcomp + Tdata_out) + Tinit + Tdata_in + Tcomp + Tdata_out    (1)
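For readability, equation (1) can be regrouped into the equivalent form (this rearrangement is ours):

Ttotal_no_vt = Tinit + (Ntask - 1) Tctx_switch + Ntask (Tdata_in + Tcomp + Tdata_out)

In this form it is evident that, for large Ntask, the conventional scheme grows as Ntask (Tctx_switch + Tdata_in + Tcomp + Tdata_out), which is what drives the limit taken in equation (6) below.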


2) The Execution Model of Efficient GPU Sharing Among SPMD Processes With Virtualization on Each Node
Using the proposed virtualization technique, the processes treat the physical GPU as their individual VGPUs through the run-time virtualization layer. The GPU requests from all processes are handled by the virtualization layer, and the execution results are transferred back to the processes by the layer. Because it is proposed as a centralized virtualization manager approach, the execution model with virtualization avoids the context-switching overhead and hides the initialization overhead as well. This is because the virtualization layer has full access to the physical GPU as a single run-time process that owns the only required GPU context, which has already been initialized. Furthermore, by utilizing the concurrency support of the Fermi architecture, which is limited to a single GPU context, the execution model with virtualization provides three possible overlaps: the overlap of multiple kernel executions, the overlap of I/O and kernel execution, and the overlap of bi-directional I/O. Since the execution model is intended to provide a performance upper bound, we assume that the GPU resources are large enough to accommodate Ntask GPU kernels, and that single-directional data transfers always take the full I/O bandwidth and therefore cannot overlap with one another. We also assume that the process sequence is maintained in order, since the finishing order of the processes does not affect the total execution time. Figure 5 and Figure 6 show four execution scenarios under this model: Figure 5 shows the execution of compute-intensive applications, while Figure 6 shows the execution of I/O-intensive applications. Each figure provides two possible scenarios, depending on the relative rates of computation and I/O. Comparing the two scenarios in Figure 5 and Figure 6, respectively: if the current process finishes computing before the previous process finishes retrieving its data, the current process must wait for the previous data retrieval; if the current process finishes computing after the previous process finishes retrieving its data, the current process can continue retrieving its data without waiting.
In our execution model, SPMD execution assumes and requires that Ntask is not greater than Nprocessor. By incorporating the SPMD execution conditions into the model, we can derive the total execution time, as shown in the following equations. In the scenarios shown in Figure 5(a) and Figure 6(a), under the initial SPMD condition Ntask ≤ NVGPU, if Tdata_in ≥ Tdata_out the execution follows:

Ttotal_vt = Ntask Tdata_in + Tcomp + Tdata_out    (2)

In the cases shown in Figure 5(b) and Figure 6(b), if Tdata_in < Tdata_out the execution follows:

Ttotal_vt = Tdata_in + Tcomp + Ntask Tdata_out    (3)

Combining equations (2) and (3), we obtain equation (4):

Ttotal_vt = Ntask MAX(Tdata_in, Tdata_out) + Tcomp + MIN(Tdata_in, Tdata_out)    (4)

Comparing equation (4) with equation (1) yields equation (5), the speedup of the total execution time with virtualization over that without virtualization:

S = Ttotal_no_vt / Ttotal_vt
  = [(Ntask - 1)(Tctx_switch + Tdata_in + Tcomp + Tdata_out) + Tinit + Tdata_in + Tcomp + Tdata_out]
    / [Ntask MAX(Tdata_in, Tdata_out) + Tcomp + MIN(Tdata_in, Tdata_out)]    (5)

To estimate the theoretical maximum speedup under our virtualization scenario, we consider equation (5) and let Ntask range from 1 to +∞. Taking the limit of equation (5) as Ntask approaches +∞, we derive equation (6):

Smax = lim (Ntask → +∞) S = (Tctx_switch + Tdata_in + Tcomp + Tdata_out) / MAX(Tdata_in, Tdata_out)    (6)

Equation (6) provides a performance upper bound for the proposed virtualization approach. It clearly shows that the theoretical performance improvement increases with the task computation time Tcomp and the context-switching overhead Tctx_switch, but is limited by the I/O time. This is because the virtualization technique provides a certain amount of execution overlap for both compute-intensive and I/O-intensive applications while eliminating the context-switching overhead. If the GPU resources allow, compute-intensive applications can benefit more from our proposed approach.

V. THE GPU VIRTUALIZATION INFRASTRUCTURE

Using our concept of providing process-level SPMD execution parallelism and overlapping under the GPU virtualization scenario, we lay out the virtualization infrastructure, which is composed of two layers that create virtual GPU views for the processes. Figure 7 shows a hierarchical view of the virtualization infrastructure and the data flows between the layers. The base of the infrastructure is a run-time virtualization layer, which manages the underlying GPU computing and memory resources.


[Figure 7. A hierarchical view of the GPU virtualization infrastructure and the data flow: the user processes access the API layer, which communicates through per-process virtual shared memory and request/response queues with the base layer containing the GPU Virtualization Manager (GVM), the system host pinned memory, and, below them, the CUDA API, GPU driver, GPU and GPU device memory.]

On top of the base layer lies the user-process API layer. The base layer consists of the GPU Virtualization Manager (GVM), virtual shared memory spaces for each processor, and a pair of request/response message queues. The API layer, which provides a virtual GPU abstraction to the user processes, physically handles the inter-layer communication and synchronization as well as the data transfers. In the base layer, the GVM is a run-time process responsible for initializing all virtualization resources, handling requests from the processes, and processing the requests on the GPU device. The initialization of the GVM creates the virtualization resources, including the virtual shared memory spaces and the request/response queues. The virtual shared memory space is implemented as a POSIX shared memory segment for each process, so that each process has its own virtual memory space; it handles the data exchanges between the processes and the GVM. Furthermore, the shared memory size is user-customizable to ensure that the total size does not exceed the GPU memory size. The request/response queues are implemented as two POSIX message queues that stream the process requests into the GVM and provide handshaking synchronization responses. By using streaming queues, resource contention problems are prevented. The GVM also sets request barriers to ensure that SPMD tasks from different processes can be executed in parallel. On the GPU side, when initialized, the GVM creates the necessary GPU resources, including the only required GPU context and a CUDA stream for each process. It also creates separate memory objects for each process to ensure that data from different processes can safely co-exist in the GPU memory. These memory objects include both GPU device memory and host pinned memory. While host pinned memory provides better I/O bandwidth, it is also required for achieving concurrent I/O and kernel execution using asynchronous streams.
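A minimal sketch of how such a base layer can be wired together is shown below; this is our illustration, not the paper's implementation, and the names (/gvm_req, /gvm_rsp, /gvm_shm_1) and sizes are invented. On Linux it compiles with gcc and links against the POSIX real-time library (-lrt).

#include <fcntl.h>
#include <sys/mman.h>
#include <mqueue.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

#define SHM_BYTES (64 * 1024 * 1024)   /* per-process data window, user-customizable */

int main(void) {
    /* Request/response message queues shared by the user processes and the GVM. */
    struct mq_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.mq_maxmsg = 8;
    attr.mq_msgsize = 128;
    mqd_t req = mq_open("/gvm_req", O_CREAT | O_RDWR, 0600, &attr);
    mqd_t rsp = mq_open("/gvm_rsp", O_CREAT | O_RDWR, 0600, &attr);
    if (req == (mqd_t)-1 || rsp == (mqd_t)-1) { perror("mq_open"); return 1; }

    /* One shared-memory window per process; shown here for process 1 only. */
    int fd = shm_open("/gvm_shm_1", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SHM_BYTES) < 0) { perror("shm_open"); return 1; }
    void *win = mmap(NULL, SHM_BYTES, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... GVM event loop: mq_receive() requests, stage data through 'win',
       drive the GPU, and mq_send() the handshaking responses ... */

    munmap(win, SHM_BYTES);
    close(fd);
    mq_close(req); mq_close(rsp);
    return 0;
}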

[Figure 8. The detailed execution and synchronization flow of the GVM and two user processes: each process issues REQ, SND, STR, STP, RCV and RLS requests through the API layer, while the GVM initializes the virtual and GPU resources, copies data between the virtual shared memory and the host pinned memory, synchronizes the “STR” requests at a barrier, launches the per-process CUDA streams (asynchronous copy to the GPU, kernel launch, asynchronous copy back), answers status queries with “WAIT” or “ACK”, and finally releases the resources.]

Moreover, when initialized, the GVM takes the requested CUDA kernel functions and prepares the kernels for execution. The abstraction of the API layer allows the user processes to interact with the underlying run-time virtualization layer in a transparent way. While the API layer exposes a virtual GPU resource view, the programmer only needs to provide the base layer with the GPU kernel function to be executed on the VGPU and to take care of the data exchange with the virtual shared memory by following the procedure shown in Figure 8, using the provided API routines such as SND(), STR() and RCV(). Thus, very little effort is required to port existing GPU programs to the virtualization infrastructure. Figure 8 shows the detailed execution flows of the two layers as well as the interaction flows between them, that is, the interaction and synchronization between two processes and the GVM. While the GVM follows the flow of initializing resources, processing requests, taking care of memory transfers and executing multiple CUDA streams, each process follows the flow of sending the different requests and waiting for the responses. Since requests from multiple processes arrive roughly simultaneously under the SPMD scenario, we set the necessary execution barriers in the GVM to flush multiple CUDA streams simultaneously, since this is required by CUDA for the overlap of I/O and execution as well as for concurrent kernel execution.
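The per-stream work that the GVM flushes for each process follows the standard asynchronous CUDA pattern sketched below (our illustration; the kernel, buffer names and sizes are assumptions): a host-to-device copy from pinned memory, a kernel launch, and a device-to-host copy, all enqueued in that process's stream so that copies and kernels from different processes can overlap within the single GVM-owned context.

#include <cuda_runtime.h>

__global__ void processTask(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;                              // placeholder per-process kernel
}

int main(void) {
    const int nProc = 4, n = 1 << 20;
    cudaStream_t stream[nProc];
    float *pinned[nProc], *dev[nProc];
    for (int p = 0; p < nProc; ++p) {
        cudaStreamCreate(&stream[p]);
        cudaMallocHost(&pinned[p], n * sizeof(float));    // pinned host buffer, one per process
        cudaMalloc(&dev[p], n * sizeof(float));           // device buffer, one per process
    }
    // Barrier point in the GVM: enqueue every process's work, then let all streams run.
    for (int p = 0; p < nProc; ++p) {
        cudaMemcpyAsync(dev[p], pinned[p], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[p]);
        processTask<<<(n + 255) / 256, 256, 0, stream[p]>>>(dev[p], n);
        cudaMemcpyAsync(pinned[p], dev[p], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[p]);
    }
    cudaDeviceSynchronize();                              // all streams drained
    for (int p = 0; p < nProc; ++p) {
        cudaStreamDestroy(stream[p]);
        cudaFreeHost(pinned[p]);
        cudaFree(dev[p]);
    }
    return 0;
}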


VI. EXPERIMENTATION AND EVALUATION

To demonstrate that the proposed GPU virtualization approach and infrastructure can provide effective resource sharing and performance gains under SPMD execution, we conduct a series of experiments. We use simulated process-level SPMD benchmarks, which launch the same benchmark program in different processes, with the affinity of each process set to a unique CPU core. We compare against the same simulated process-level SPMD programs without virtualization, i.e., the performance when each process shares the GPU natively. We are mostly interested in comparing the process turnaround time, which is the time for all processes to finish executing the benchmarks after they start simultaneously. The experiments are conducted on our GPU computing node, which is equipped with dual Intel Xeon X5560 quad-core processors (eight cores in total) running at 2.8 GHz, 48 GB of system memory and the latest NVIDIA Tesla C2070 GPU computing card with 6 GB of device memory. The Tesla C2070 consists of 14 SMs running at 1.15 GHz and allows a maximum of 16 concurrently running kernels. Both the CUDA driver and SDK versions are 3.2, running under Ubuntu 10.10 with the 2.6.32-29 Linux kernel.
To better understand and evaluate the parallelism and overlapping provided by our virtualization layer, we first use two extreme benchmark cases (highly I/O-intensive and highly compute-intensive) to evaluate our implementation and the proposed model. The I/O-intensive application is a very large vector addition benchmark, while the compute-intensive benchmark is the GPU version [19] of EP (Embarrassingly Parallel) from the NAS Parallel Benchmarks (NPB) [20]. The EP kernel grid size is deliberately kept small merely to show the effectiveness of concurrency under virtualization; in real applications the actual grid size determines the extent of overlapping and concurrency. Initially, we perform some microbenchmarks to profile the benchmarks. The experimental profiling results for both benchmarks, corresponding to the model parameter values such as the average Tinit and Tctx_switch, are shown in TABLE II. We then evaluate the process turnaround time by simulating an SPMD program for both benchmarks, launching multiple processes with the same benchmark task simultaneously. As previously mentioned, the SPMD condition requires that Ntask is not greater than Nprocessor. Since our GPU cluster node consists of 8 microprocessor cores, the maximum number of SPMD user processes (tasks) is eight in our case. Figure 9 shows the effectiveness of the virtualization approach in terms of the turnaround time comparisons for both benchmarks. For the I/O-intensive benchmark on the left, as the number of processes increases without virtualization, the turnaround time increases sharply due to the context-switching overheads. With virtualization, the turnaround time still increases, but comparatively slowly. This is because an I/O-intensive application cannot achieve much overlapping, as explained earlier, but still eliminates the context-switching and initialization overheads.

[Figure 9. Turnaround time comparison for both I/O-intensive and compute-intensive benchmarks]

TABLE II. INITIAL BENCHMARK PROFILES AND PARAMETERS
                   Vector Addition              EP
Problem Size       Vector Size = 50M (float)    Class B (M=30)
Grid Size          50K                          4
Tinit (ms)         1519.386                     1513.555
Tdata_in (ms)      135.874                      0
Tcomp (ms)         0.038                        8951.346
Tdata_out (ms)     66.656                       0.000055
Tctx_switch (ms)   148.226                      220.599

TABLE III. SPEEDUP COMPARISONS BETWEEN THE EXPERIMENT AND THE MODEL (WHEN LAUNCHED WITH 8 PROCESSES)
                        Vector Addition   EP
Experimental Speedup    2.300             7.394
Theoretical Speedup     2.721             8.341
Theoretical Deviation   18.306%           12.810%

For the compute-intensive benchmark on the right, with virtualization the turnaround time stays flat as the number of processes increases, which clearly shows that the virtualization approach can achieve complete execution concurrency for smaller applications that use only a portion of the GPU resources. As a way to demonstrate the accuracy of the model, we apply the values shown in TABLE II to equation (5) and compare the theoretical speedup with our experimental results when launching 8 processes for both benchmarks. As shown in TABLE III, the deviation of the model from the experiment is less than 20% for both benchmarks. While this deviation shows that the proposed model is in good agreement with the experimental results, the theoretical speedup also provides an upper-bound reference. The actual experimental speedup is comparatively lower, partly due to the overheads of our virtualization layer implementation. Since the vast majority of the virtualization layer overhead comes from the data transfer and message synchronization between the API and base layers, we conduct another microbenchmark using the same I/O-intensive vector addition benchmark with multiple data sizes. We measure the overheads by launching one process and comparing the time spent purely on the GPU in the base layer with the process turnaround time. As shown in Figure 10, the overheads, which are the differences between the turnaround time and the pure GPU time, increase with the data size.
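As a concrete check on equation (5) (this substitution is ours, using the EP profile from TABLE II with Ntask = 8):

S = [7 x (220.599 + 0 + 8951.346 + 0.000055) + 1513.555 + 0 + 8951.346 + 0.000055]
    / [8 x MAX(0, 0.000055) + 8951.346 + MIN(0, 0.000055)]
  ≈ 74668.5 / 8951.3 ≈ 8.34,

which matches the theoretical speedup of 8.341 reported for EP in TABLE III.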


[Figure 10. Virtualization overheads]

Even when the data size is very large, such as 400 MB, the virtualization overhead is still less than 25%, which demonstrates that our virtualization layer incurs comparatively low overheads.
Furthermore, we conduct several additional benchmarks to demonstrate the efficiency of the proposed virtualization approach for applications with different profiles. As TABLE IV shows, MM refers to 2048x2048 single-precision floating-point matrix multiplication. MG and CG refer to the GPU versions [19] of the NPB [20] kernels MG and CG, respectively, with a problem size of Class S. BlackScholes [21] is a European option pricing benchmark used in the financial area, adapted from NVIDIA's CUDA SDK; by default we set it to price the options over 512 iterations. Electrostatics refers to a fast molecular electrostatics algorithm that is part of the molecular visualization program VMD [22]; we set the problem size to 100K atoms with 25 iterations. By evaluating the ratio of I/O to computing time, we further profile the class of each benchmark. Using the same experimental method, we simulate process-level SPMD execution of each benchmark with multiple processes and compare the process turnaround time between the virtualization and non-virtualization scenarios.

TABLE IV. DETAILS OF APPLICATION BENCHMARKS
Benchmark        Problem Size            Grid Size   Class
MM               2K x 2K matrix          4096        Intermediate
MG               S (32x32x32, Nit=4)     64          Comp-intensive
BlackScholes     1M calls, Nit=512       480         I/O-intensive
CG               S (NA=1400, Nit=15)     8           Comp-intensive
Electrostatics   100K atoms, Nit=25      288         Comp-intensive

[Figure 11. Performance: MM]
[Figure 12. Performance: MG]
[Figure 13. Performance: BlackScholes]
[Figure 14. Performance: CG]
[Figure 15. Performance: Electrostatics]
[Figure 16. A comparison of speedups achieved using GPU virtualization when each benchmark is launched with 8 processes]

Figure 11 to Figure 15 show the performance comparisons in terms of turnaround time. It is worth mentioning that the performance improvement with one process is due to the elimination of the initialization overhead by the virtualization layer, even with the added virtualization overheads. Since MM is profiled as intermediate and its grid size is large enough to occupy the whole GPU, it benefits only from the overlap of I/O and kernel computing under virtualization. Both MG and CG are compute-intensive benchmarks, and the Class S problem size makes them utilize only part of the GPU resources; thus MG and CG can achieve more overlap through concurrent execution under virtualization. With the default problem size and a grid size of 480, a single BlackScholes benchmark can utilize the full GPU resources and can hardly be executed concurrently under virtualization. Since it is also an I/O-intensive application, it can only achieve limited overlap between I/O and kernel computing, like the Vector Addition benchmark described earlier. As for the Electrostatics benchmark, it is compute-intensive but its grid size makes it occupy the whole GPU, so its overlapping potential under virtualization is small; however, it still benefits from the elimination of the context-switching and initialization overheads. Therefore, while each of the five benchmarks achieves a certain amount of performance gain through virtualization due to overlapping and the elimination of overheads, MG and CG achieve the better performance gains.



Figure 16 gives an example speedup comparison utilizing all available system processors (8 processes): all five benchmarks achieve speedups between 1.4 and 4.1 with the proposed virtualization approach. Therefore, while all applications can achieve a certain amount of performance gain with the proposed approach, the efficiency of the virtualization approach also depends on the profile of the application, including the ratio of I/O to computing time as well as the GPU resource usage. In fact, our experimental results show good agreement with the proposed analytical model and demonstrate that the proposed approach allows multiple processes to share the GPU resources efficiently under the SPMD model, while incurring comparatively low overheads.


VII. CONCLUSIONS

In this paper, we proposed a virtualization concept that enables efficient sharing of GPU resources among microprocessors in HPC systems under the SPMD execution model. To achieve the objective of making each microprocessor effectively utilize the shared resources, we investigated the concurrency and overlapping potentials that can be exploited at the GPU device level. We also analyzed the performance and overheads of direct GPU access and sharing from multiple microprocessors as a comparison baseline, and we provided an analytical execution model as a theoretical performance estimate of our proposed virtualization approach. The analytical model also gave us a better understanding of the methodology for implementing our virtualization concept. Based on these concepts and analyses, we implemented our virtualization infrastructure as a run-time layer running in the user space of the OS. The virtualization layer manages the requests from all microprocessors and provides the necessary GPU resources to them. It also exposes a VGPU view to all the microprocessors, as if each microprocessor had its own GPU resource. Inside the virtualization layer, we eliminate unnecessary overheads and achieve the possible overlapping and concurrency of executions. In the experiments, we used our GPU computing node equipped with the latest NVIDIA Fermi GPU as the test bed, with initial I/O-intensive and compute-intensive benchmarks as well as application benchmarks. Our experimental results showed that we were able to achieve considerable performance gains, in terms of speedups, with our virtualization infrastructure while incurring low overheads, and that the results agree with our theoretical analysis. Proposed as a solution to microprocessor resource underutilization by providing a virtual SPMD execution scenario, our approach proves to be effective and efficient and can be applied to any HPC system with GPU resources.


REFERENCES
[1] GPGPU Webpage, http://www.gpgpu.org, Last Accessed: 1st Oct. 2010.
[2] V. V. Kindratenko, J. J. Enos, G. Shi, M. T. Showerman, G. W. Arnold, J. E. Stone, J. C. Phillips and W.-m. Hwu, "GPU Clusters for High-Performance Computing," Proc. Workshop on Parallel Programming on Accelerator Clusters (PPAC), 2009.
[3] "SGI GPU Compute Solutions," Datasheet, http://www.sgi.com/pdfs/4235.pdf, Last Accessed: 1st Oct. 2010.
[4] Cray XK6 Brochure, http://www.cray.com/Assets/PDF/products/xk/CrayXK6Brochure.pdf, Last Accessed: 28th June 2011.
[5] Top 500 Supercomputer Sites Webpage, http://www.top500.org, Last Accessed: 28th June 2011.
[6] F. Darema, "The SPMD Model: Past, Present and Future," Proc. 8th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, Sept. 2001, Lecture Notes in Computer Science, vol. 2131/2001, pp. 1.
[7] "NVIDIA CUDA C Programming Guide," v3.2, 8th Sept. 2010.
[8] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, and P. Ranganathan, "GViM: GPU-accelerated Virtual Machines," Proc. 3rd ACM Workshop on System-level Virtualization for High Performance Computing, Mar. 2009, pp. 17-24.
[9] L. Shi, H. Chen and J. Sun, "vCUDA: GPU Accelerated High Performance Computing in Virtual Machines," Proc. IEEE International Symposium on Parallel and Distributed Processing, May 2009, pp. 1-11.
[10] G. Giunta, R. Montella, G. Agrillo and G. Coviello, "A GPGPU Transparent Virtualization Component for High Performance Computing Clouds," Proc. Euro-Par 2010, Lecture Notes in Computer Science, vol. 6271/2010, pp. 379-391.
[11] J. Duato, F. D. Igual, R. Mayo, A. J. Peña, E. S. Quintana-Ortí and F. Silla, "An Efficient Implementation of GPU Virtualization in High Performance Clusters," Proc. Euro-Par 2009, Lecture Notes in Computer Science, vol. 6043/2010, pp. 385-394.
[12] M. Guevara, C. Gregg, K. Hazelwood and K. Skadron, "Enabling Task Parallelism in the CUDA Scheduler," Proc. Workshop on Programming Models for Emerging Architectures (PMEA), Sept. 2009, pp. 69-76.
[13] S_GPU Project Home Page, http://sgpu.ligforge.imag.fr/, Last Accessed: 1st Oct. 2010.
[14] H. Peters, M. Koper and N. Luttenberger, "Efficiently Using a CUDA-enabled GPU as Shared Resource," Proc. IEEE International Conference on Computer and Information Technology, June-July 2010, pp. 1122-1127.
[15] C.-H. Huang and P.-A. Hsiung, "Hardware Resource Virtualization for Dynamically Partially Reconfigurable Systems," IEEE Embedded Systems Letters, vol. 1, no. 1, May 2009, pp. 19-23.
[16] E. El-Araby, I. Gonzalez, and T. El-Ghazawi, "Virtualizing and Sharing Reconfigurable Resources in High-Performance Reconfigurable Computing Systems," Proc. HPRCTA Workshop at SC'08, Austin, TX, Nov. 2008.
[17] Khronos OpenCL Working Group, "The OpenCL Specification," v1.0.29, http://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf.
[18] "NVIDIA's Next Generation CUDA Compute Architecture: Fermi," Whitepaper, http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiComputeArchitectureWhitepaper.pdf, Last Accessed: 1st Oct. 2010.
[19] M. Malik, T. Li, U. Sharif, R. Shahid, T. El-Ghazawi and G. Newby, "Productivity of GPUs under Different Programming Paradigms," submitted to Concurrency and Computation: Practice and Experience.
[20] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, "The NAS Parallel Benchmarks," Technical Report RNR-94-007, NASA Ames Research Center, Mar. 1994.
[21] F. Black and M. Scholes, "The Pricing of Options and Corporate Liabilities," Journal of Political Economy, vol. 81, no. 3, May-June 1973, pp. 637-654.
[22] Visual Molecular Dynamics Program Webpage, http://www.ks.uiuc.edu/Research/vmd/, Last Accessed: 14th Mar. 2011.
