Improving Application Performance by Efficiently Utilizing Heterogeneous Many-core Platforms Jie Shen Parallel and Distributed Systems Group Delft University of Technology The Netherlands

Ana Lucia Varbanescu (Supervisor)
Informatics Institute
University of Amsterdam
The Netherlands

Henk Sips (Supervisor)
Parallel and Distributed Systems Group
Delft University of Technology
The Netherlands

Email: [email protected]
Abstract—Heterogeneous platforms integrating different types of processing units (such as multi-core CPUs and GPUs) are in high demand in high performance computing. Existing studies have shown that using heterogeneous platforms can improve application performance and hardware utilization. However, systematic methods to design, implement, and map applications so that they use heterogeneous computing resources efficiently are still scarce. The goal of my PhD research is therefore to study such heterogeneous systems and to propose systematic methods that allow many (classes of) applications to use them efficiently. After 3.5 years of PhD study, my contributions are (1) a thorough evaluation of a suitable programming model for heterogeneous computing; (2) a workload partitioning framework to accelerate parallel applications on heterogeneous platforms; (3) a modeling-based prediction method to determine the optimal workload partitioning; (4) a systematic approach to decide the best mapping between the application and the platform by choosing the best-performing hardware configuration (Only-CPU, Only-GPU, or CPU+GPU with workload partitioning). In the near future, I plan to apply my approach to large-scale applications and platforms to expand its usability and applicability.

Keywords—Heterogeneous platforms; Workload partitioning; Hardware configuration; Multi-core CPUs; GPUs; Accelerators

I. MOTIVATION AND PROBLEM STATEMENT

In recent years, the development and usage of multi-core CPUs and hardware accelerators (e.g., GPUs, FPGAs, and the Intel Xeon Phi) have grown rapidly. Furthermore, heterogeneous platforms integrating multi-core CPUs and accelerators in a compute node or on a single chip (e.g., Intel Sandy Bridge and the AMD APU) have become attractive for high performance computing and distributed computing [1]. In this context, finding efficient solutions to utilize heterogeneous platforms and improve application performance is increasingly important. As heterogeneity exists in the platforms, and parallel applications exhibit various patterns and behaviors, achieving high performance on such platforms is an inherently complex problem [2]. Despite these challenges, we believe heterogeneous platforms, in different types and forms, will increase in popularity. Therefore, their efficiency and usability must be improved. In my PhD research, we follow an

application-centric approach to tackle this problem. Specifically, we aim to propose systematic methods to design, implement, and map parallel applications to efficiently use a large variety of heterogeneous platforms. My PhD research is guided by five research questions.

Q1: What is a suitable programming model for heterogeneous platforms? To answer this question, we evaluate OpenCL, the first programming model targeting heterogeneous platforms. We study how the model can be efficiently used for heterogeneous computing, and empirically validate its efficient use. More details on our approach to answering this question are given in Section II.

Q2: Is using heterogeneous platforms a good option to obtain high performance? To answer this question, we identify a specific class of applications that have imbalanced workloads and do not fit well on homogeneous platforms (i.e., a GPU or a multi-core CPU). To accelerate such applications, we use heterogeneous platforms and partition the workloads to best utilize the individual computing strength of each processing unit. We further develop a framework, using auto-tuning, to obtain the optimal partitioning between different processing units. The empirical evaluation demonstrates improved performance over homogeneous platforms with non-partitioned workloads. The details of the proposed solutions and the obtained results are given in Section III.

Q3: Is it possible to optimize the partitioning process? As auto-tuning is usually time-consuming and occupies the target heterogeneous platform to run the search, we want to investigate how to speed up the partitioning process. We propose a modeling-based method that replaces the auto-tuning with a quick and correct prediction of the partitioning.
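The idea behind such a model-based prediction can be illustrated with a small sketch. The linear cost model, the function name, and the throughput numbers below are illustrative assumptions, not Glinda's actual model: assuming per-item processing times for each device and a per-item transfer cost for the GPU, the GPU share β that makes both sides finish simultaneously follows from (1 − β)/P_c = β(1/P_g + t_d).

```python
def predict_gpu_fraction(cpu_throughput, gpu_throughput, transfer_time_per_item=0.0):
    """Predict the GPU share 'beta' of the workload so that CPU and GPU
    finish at the same time, under a simple linear cost model:
        (1 - beta) / P_c = beta * (1 / P_g + t_d)
    Throughputs are in items/second; t_d is the per-item transfer cost."""
    t_cpu = 1.0 / cpu_throughput                            # CPU time per item
    t_gpu = 1.0 / gpu_throughput + transfer_time_per_item   # GPU time per item, incl. transfer
    return t_cpu / (t_cpu + t_gpu)

# A GPU 4x faster than the CPU, with no transfer cost, gets 80% of the items:
beta = predict_gpu_fraction(cpu_throughput=1e8, gpu_throughput=4e8)  # ≈ 0.8
```

With a non-zero transfer cost the predicted GPU share shrinks, which mirrors the observation above that the data transfer overhead must be taken into account when partitioning.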
We evaluate the prediction method with both real-life applications and synthetic benchmarks, showing that the prediction leads to close-to-optimal partitioning (i.e., matching the auto-tuned partitioning) with much lower partitioning overhead. The prediction method and the experimental results are presented in Section IV.

Q4: What is an efficient, systematic way to deploy parallel applications on heterogeneous platforms? To answer this question, we generalize the prediction method to make it applicable to applications with different types (not only

the previously mentioned imbalanced ones) and datasets, and to heterogeneous platforms with different hardware mixes. Based on this generalization, we propose a systematic approach that, given a parallel application and a hardware platform, determines (1) the right hardware configuration (the execution scenario) and, when needed, (2) the optimal workload partitioning. We show both theoretically and practically how the approach is applied to various applications and platforms. More details of our systematic approach are given in Section V.

Q5: How can the proposed systematic approach be applied to large-scale applications and platforms? As our previous research has focused on single-kernel applications and intra-kernel workload partitioning, we next plan to test and adjust the proposed approach to support applications with multiple kernels and more complex kernel structures. With the same goal, we also plan to extend our approach to clusters and HPC systems with heterogeneous nodes by combining intra-node partitioning with inter-node scheduling. The research plan for this question is discussed in Section VI.

II. THE OPENCL PROGRAMMING MODEL

OpenCL (Open Computing Language) [3] is an open standard programming model designed to exploit different types of hardware processors in a unified way. Programmers parallelize the application with OpenCL, and the resulting code is portable across all the processors that support it. This cross-platform code portability makes OpenCL an interesting option for heterogeneous computing. In this context, further understanding the performance of OpenCL on different hardware architectures is critically important. We select two mainstream hardware architectures, multi-core CPUs and GPUs, and we compare OpenCL with the representative programming models used on them (i.e., OpenMP and CUDA, respectively).
As OpenCL shares many similarities with CUDA and has been proven to behave well on GPUs [4], we mainly focus on the CPU side. Specifically, we study the performance impact factors that arise when using OpenCL on CPUs and when porting OpenCL code from GPUs to CPUs. We first analyze the architectural differences between the OpenCL platform model and the CPU architecture. Guided by this architectural analysis, we use the performance of regular OpenMP code as a reasonable reference (what ordinary programmers can achieve using a coarse-grained parallelism model), and gradually tune the equivalent OpenCL code to match it. By quantifying the performance impact of each tuning step, we are able to isolate the significant issues. We further propose systematic and generic optimizations to transform OpenCL code between CPU-friendly and GPU-friendly forms. We conclude that OpenCL is a suitable programming model for heterogeneous computing: it expresses parallelism

in a common code structure, and allows for code specialization in a parameter-tuning form (a switch between GPU- and CPU-friendly optimizations) to achieve higher performance on different processor families [5], [6]. Thus, I have adopted OpenCL as the main programming model for my PhD work.

III. THE GLINDA FRAMEWORK

Observing that currently most applications are accelerated on either multi-core CPUs or GPUs, we find that there is a large class of applications that fit naturally on heterogeneous platforms: massively parallel imbalanced applications. These applications can be found in scientific simulations, numerical methods, and graph processing [7], where relatively few data points in the application parallelization space (called work-items in OpenCL) require more computation than the other data points. Such imbalanced applications can severely diminish the hardware utilization of a homogeneous platform. In this situation, using a heterogeneous platform, with each processing unit taking a partition that matches its hardware capability, leads to better performance. To maximize the performance gain, the programmer needs to tune the workload decomposition until it achieves a perfect execution overlap between the different processing units. As application workloads can be widely different, and processors can have varied capabilities for different workloads, a specific tuning for one scenario may not be the best for another. Therefore, we focus on the design and implementation of Glinda, a framework to automatically determine the workload partitioning between CPUs and GPUs.

Figure 1 shows the overview of the Glinda framework. The workload probe characterizes the application by its workload features (i.e., workload imbalance). The HW profiler detects and evaluates the available hardware processors.
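As a toy illustration of the kind of feature a workload probe could extract, consider an imbalance ratio over the per-work-item costs. The metric and its name below are illustrative assumptions, not Glinda's actual workload characterization.

```python
def workload_imbalance(work_per_item):
    """Ratio of the most expensive work-item's cost to the mean cost.
    1.0 means a perfectly balanced workload; larger values mean that
    a few work-items dominate the computation."""
    mean_cost = sum(work_per_item) / len(work_per_item)
    return max(work_per_item) / mean_cost

print(workload_imbalance([1, 1, 1, 1]))  # balanced: 1.0
print(workload_imbalance([1, 1, 1, 5]))  # one heavy item dominates: 2.5
```

A value well above 1.0 signals the imbalanced pattern described above, where running everything on one device wastes the capacity of the cores assigned to the cheap work-items.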
These results are combined in the matchmaker to determine the execution solution: run the application only on the CPU(s), only on the GPU(s), or on a mix of both. Next, we develop an auto-tuning-based workload partitioning method: the workload is decomposed into CPU and GPU tasks, which are tuned (1) in granularity, to find the optimal data parallelism in OpenCL, and (2) in size, to achieve a perfect execution overlap. This partitioning method is implemented in the auto-tuner to empirically find the optimal decomposition, eventually leading to the best-performing execution solution [8]. The Glinda framework has been updated during my subsequent PhD research; a complete summary of the current version is presented in Section V.

We use a set of real-world imbalanced workloads to evaluate the auto-tuned partitioning on heterogeneous platforms. The experimental results show that heterogeneous platforms, offering a mix of hardware capabilities, provide the potential to improve application performance, leading to

Figure 1. The overview of the Glinda framework.

as much as 12× performance speedup over homogeneous platforms. By using auto-tuning, Glinda detects the optimal workload partitioning and achieves on average a 7× speedup compared to hand-picked partitionings.

IV. PREDICTING THE PARTITIONING

Achieving the best performance on heterogeneous platforms is only possible when the workload is partitioned to best utilize all the processing units. Obtaining such an optimal workload partitioning is not trivial: the characteristics of the application workload, the capabilities of the hardware processors, and the data transfer overhead must all be taken into account. Our previous auto-tuning method is a feasible solution. However, it usually takes multiple rounds of tuning to reach the best partitioning, and it keeps the target platform occupied. Therefore, we further develop a prediction-based workload partitioning method, with the aim of reducing the partitioning overhead while still obtaining the optimal partitioning.

The prediction method [9] is built by modeling the execution of the partitioned workload on the heterogeneous platform. Given a fitting criterion (i.e., the minimum execution time), we build a partitioning model that represents the optimal partitioning. On the application side, we use a workload model to quantify the application workload, its workload characteristics, and the data transfer between processors. On the platform side, we estimate the processors' hardware capabilities by using low-cost profiling. Combining these quantities in the partitioning model, we solve for the optimal partitioning.

The evaluation on both synthetic and real-world workloads (a total of 1395 workloads) shows that the prediction method obtains accurate partitionings (versus the auto-tuned ones) for more than 90% of the total workloads. It maintains a low partitioning cost, and achieves up to 60% performance improvement over a single-processor execution. Thus, our model-based prediction method is feasible and efficient for workload partitioning on heterogeneous platforms.

V. A SYSTEMATIC APPROACH TO HETEROGENEOUS COMPUTING

The heterogeneity of the hardware platforms and the diversity of the applications and the datasets pose significant

challenges to the workload partitioning problem. Therefore, it is necessary to develop a systematic approach that effectively characterizes the application workload and the hardware heterogeneity, and efficiently maps the application onto the heterogeneous platform. Based on our previous work on imbalanced applications, we generalize the application workload model to also cover balanced applications (i.e., more generic applications). We also extend the partitioning model from platforms with one CPU and one GPU to platforms with multiple GPUs. In fact, the extended partitioning model can support platforms with any hardware mix, because the modeling of hardware capabilities is based on profiling and does not depend on hardware architectural details. We further identify two novel metrics, (1) the relative hardware capability and (2) the GPU computation to data transfer gap, which are the key factors for determining the optimal workload partitioning. From the user's point of view, we also investigate both online and offline profiling methods; they are alternative options to capture the hardware capabilities, suitable for different use cases. Finally, we propose a systematic way to map parallel applications on heterogeneous platforms (see Figure 2).
Figure 2. A systematic approach to map parallel applications on heterogeneous platforms.
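Two practical details of such a mapping can be sketched in a few lines. Both values below are illustrative assumptions (a warp size of 32 and a 5% utilization threshold), not the actual parameters used: on GPUs the smallest scheduling unit is a warp/wavefront, so a GPU partition is rounded up to a warp multiple, and a partition too small to occupy one side's cores suggests falling back to a single processing unit.

```python
def round_up_to_warp(n_items, warp_size=32):
    """Round a GPU partition up to a multiple of the warp/wavefront size,
    so that no partially filled warps are scheduled."""
    return ((n_items + warp_size - 1) // warp_size) * warp_size

def choose_configuration(gpu_fraction, min_fraction=0.05):
    """If one side's share is too small to keep its cores busy
    (threshold assumed), fall back to a single processing unit."""
    if gpu_fraction < min_fraction:
        return "Only-CPU"
    if gpu_fraction > 1.0 - min_fraction:
        return "Only-GPU"
    return "CPU+GPU"

print(round_up_to_warp(100))       # → 128
print(choose_configuration(0.8))   # → CPU+GPU
print(choose_configuration(0.01))  # → Only-CPU
```

The same two checks appear, in generalized form, in the decision step of the approach described next.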

Our approach [10] consists of three main steps. The first step is modeling the partitioning. This analytical modeling integrates three aspects (the application workload, the hardware capabilities, and the data transfer) into a partitioning model that fits the best-performance criterion. Next, predicting the optimal partitioning solves the partitioning model for a given platform, application, and dataset. By using profiling to estimate the two key partitioning metrics, and by substituting the estimates into the model, we predict the partitioning. The final step is making the decision in practice. This step determines the right hardware configuration (Only-CPU, Only-GPU, or CPU+GPU with the partitioning), taking into account the actual hardware utilization. By checking whether the obtained partitioning is able to use a certain number of hardware cores on each processing unit, we decide to use either a single processing unit or the mix of both. As the smallest scheduling unit on GPUs is the warp/wavefront, the GPU partition is further rounded up to a multiple of the warp/wavefront size to improve GPU utilization.

The Glinda framework is updated with the proposed systematic approach, and the current version (see Figure 1) is summarized as follows. (1) The user interface

receives the application parameters (e.g., problem size), and passes them to the workload probe and the HW profiler. (2) The workload probe characterizes the application and generates its workload model. (3) The HW profiler detects the available processors and uses profiling to estimate the two key partitioning metrics. (4) The matchmaker assumes that CPU+GPU is the best hardware configuration and asks the partitioner to determine the optimal partitioning. (5) Either the auto-tuner or the predictor performs the work and sends the obtained partitioning back to the matchmaker, where the practical hardware configuration (Only-CPU, Only-GPU, or CPU+GPU) is determined. (6) Finally, the execution unit (the target heterogeneous platform) executes the application with the determined hardware configuration (and workload partitioning). Implementation-wise, single-device code (for single-processor execution) and multi-device code (for partitioned execution) are implemented as code candidates in the code library. The results generated in steps (2)-(5) are stored in repositories for reuse.

We evaluate the systematic approach with 13 applications and 5 heterogeneous platforms. The experimental results show that our approach easily adapts to application and platform changes. It obtains the right hardware configuration in over 90% of the test cases, leading to efficient hardware utilization and up to 14.6× performance speedup versus an uninformed single-processor execution.

VI. THE NEXT STEP

So far, our research has focused on single-kernel parallel applications, and we have proposed an approach based on static partitioning to use heterogeneous platforms. To increase the impact of our work on truly challenging real-world applications, we plan to study how to apply and extend our approach to large-scale applications with multiple kernels and more complex kernel structures.
We therefore propose a classification of applications based on their kernel structures: (1) a single kernel, (2) a single kernel in a loop, (3) multiple kernels in a sequence, (4) multiple kernels in a loop, and (5) multiple kernels forming a DAG. To accelerate each class of applications, we plan to investigate whether our static partitioning approach works and, if not, how to extend it to process such applications. We will use dynamic partitioning as an alternative solution, as it should work for all classes of applications and allow for asynchronous execution of partitions from different kernels. In this way, we can derive a recipe for the different classes of applications, i.e., which class of applications should use which partitioning method (static or dynamic), how to use it, and how large the performance gain is. In addition, we are also interested in applying our approach to clusters and HPC systems equipped with heterogeneous nodes. Building on existing inter-node scheduling, our approach can be used, at a different level, for intra-node workload partitioning.
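The planned classification could be captured directly in code. The sketch below is only a skeleton of the plan described above, with hypothetical names; the recipe itself (which class uses which method) is precisely what remains to be derived, so the rule encoded here merely reflects what the text already states: static partitioning is validated for single-kernel applications, and dynamic partitioning is the fallback expected to work for all classes.

```python
from enum import Enum, auto

class KernelStructure(Enum):
    """The five application classes listed above."""
    SINGLE_KERNEL = auto()
    SINGLE_KERNEL_IN_LOOP = auto()
    KERNEL_SEQUENCE = auto()
    KERNEL_SEQUENCE_IN_LOOP = auto()
    KERNEL_DAG = auto()

def candidate_methods(structure):
    """Skeleton recipe: static partitioning is known to work for a single
    kernel; for the other classes, try static first and keep dynamic
    partitioning as the fallback that should work everywhere."""
    if structure is KernelStructure.SINGLE_KERNEL:
        return ["static"]
    return ["static", "dynamic"]
```

Filling in this table per class, with measured performance gains, is the research outcome the section above proposes.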

VII. SUMMARY

In this section, I summarize the contributions of my PhD research work so far:
• I have evaluated OpenCL and proven empirically that it is a suitable programming model for heterogeneous platforms.
• I have developed the Glinda framework, which uses workload partitioning to match the partitioned workload with the capabilities of the heterogeneous hardware components.
• I have proposed a prediction method, based on analytical modeling, to correctly and quickly predict the optimal workload partitioning.
• I have generalized a systematic approach to choose the best-performing hardware configuration (and workload partitioning) for parallel applications. My approach is feasible for platforms with different hardware mixes, and for applications with different types and datasets.

To complete my PhD study, I am currently working on understanding the applicability and limitations of my systematic approach for efficient heterogeneous computing.

REFERENCES

[1] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era," IEEE Computer, vol. 41, no. 7, pp. 33–38, 2008.
[2] A. Gharaibeh, L. B. Costa, E. Santos-Neto, and M. Ripeanu, "On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest," in IPDPS 2013, 2013, pp. 851–862.
[3] "The OpenCL Specification v2.0," http://www.khronos.org.
[4] J. Fang, A. L. Varbanescu, and H. J. Sips, "A Comprehensive Performance Comparison of CUDA and OpenCL," in ICPP 2011, 2011, pp. 216–225.
[5] J. Shen, J. Fang, H. Sips, and A. L. Varbanescu, "Performance Traps in OpenCL for CPUs," in PDP 2013, 2013, pp. 38–45.
[6] J. Shen, J. Fang, H. J. Sips, and A. L. Varbanescu, "An Application-Centric Evaluation of OpenCL on Multi-Core CPUs," Parallel Computing, vol. 39, no. 12, pp. 834–850, 2013.
[7] R. Vuduc, A. Chandramowlishwaran, J. Choi, M. Guney, and A. Shringarpure, "On the Limits of GPU Acceleration," in HotPar 2010, 2010, pp. 13–13.
[8] J. Shen, A. L. Varbanescu, H. J. Sips, M. Arntzen, and D. G. Simons, "Glinda: A Framework for Accelerating Imbalanced Applications on Heterogeneous Platforms," in Computing Frontiers 2013, 2013, pp. 14:1–14:10.
[9] J. Shen, A. L. Varbanescu, P. Zou, Y. Lu, and H. Sips, "Improving Performance by Matching Imbalanced Workloads with Heterogeneous Platforms," in ICS 2014, 2014, pp. 241–250.
[10] J. Shen, A. L. Varbanescu, and H. Sips, "Look Before You Leap: Using the Right Hardware Resources to Accelerate Applications," in HPCC 2014, 2014.