Predicting Performance of Applications on Multicore Platforms

Priti Ranadive
Symbiosis International University, Pune, India
[email protected]

Vinay G. Vaidya
VMAS Consulting, Pune, India
[email protected]

Abstract — Porting sequential applications to multicore platforms is time-consuming and costly. Hence, it is desirable to predict the performance benefits that could be derived if an application were ported. In this paper, we present models for the overheads of two different types of OpenMP constructs. The models are based on several executions of motivational example codes while varying the parameters that may affect the performance of an application. We validate our model by predicting performance on a real homogeneous multicore platform. The results we obtained for a few benchmark codes are 92% accurate.

Keywords — Performance prediction; Multicore platforms;

I. INTRODUCTION

Multicore processors are now commonly used for their well-known benefits. However, the challenges related to adopting multicores are many [1], starting from identifying the right kind of multicore hardware for an application to choosing the type of scheduling. Multicore hardware platforms are generally either homogeneous or heterogeneous, and both have their own advantages and disadvantages [2]; in most cases, however, homogeneous multicore systems are used. Often, in the initial phases of application development, generic homogeneous multicore desktop systems are used, and the applications are later migrated to specific embedded multicore target platforms. The next big challenge is to understand the nature of the application and the kind of strategy that would yield the best performance on a multicore. Applications that have many I/O operations or specific hardware needs may not be easy to migrate. However, many applications, such as image-processing applications, are considered easy to migrate to multicores. The primary reason is the ease with which one can parallelize such applications. These are called Single Instruction Multiple Data (SIMD) applications; they are easy to parallelize and hence can take advantage of the various cores available on multicore hardware.
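For instance, a data-parallel image-processing loop of this kind might look like the following minimal sketch (the brightness-adjustment kernel is our own illustration, not taken from the paper):

    #include <omp.h>

    /* Illustrative SIMD-style kernel: every pixel is updated
       independently, so OpenMP can split the iterations across cores. */
    void adjust_brightness(unsigned char *pixels, int n, int delta)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            int v = pixels[i] + delta;
            pixels[i] = (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
    }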

To parallelize an application, we first need to find the dependencies within the code. It is a challenging task to manually identify all dependencies in an application, especially when the application is spread across files and modules containing thousands of lines of code. In such cases, automatic dependency-analysis tools such as [3] are generally used. Another challenge in migrating to multicores is the type of scheduling algorithm used. Desktop multicore environments often run standard Linux or Windows operating systems with built-in schedulers. For environments without an OS, Earliest Deadline First (EDF) or Round Robin (RR) scheduling strategies are used. For applications that use parallelization libraries such as OpenMP and MPI, scheduling is often handled by these libraries by default. These libraries can introduce overheads through the number of threads they create or the number of memory accesses that occur during execution. Thus, application developers who want to migrate their applications to multicores need to be aware of dependency analysis, parallelization strategies, scheduling techniques, multicore hardware features, etc., to get good performance on multicore hardware. It is difficult to train application developers in all of the above. The most common question asked before investing effort in migrating to multicores is what kind of performance benefit would be achieved. Hence, it would be useful to predict the performance benefits even before migrating the application to multicore hardware. Predicting the performance of an application has benefits such as reducing migration effort and cost, identifying the right migration strategies, and identifying parallelizable code sections that may instead lead to performance degradation. In this paper, we propose an approach to predict the overheads caused by parallelization that may reduce performance. Predicting the overheads indirectly helps in predicting the overall performance of an application.

II. LITERATURE SURVEY

Various attempts have been made to predict the performance of applications on multicores, including machine-learning and cache-performance-based predictions. The authors of [4] reiterate that finding parallelism is critical. They show that asymmetric multicore chips may offer better speedup than symmetric multicore chips, and similarly that dynamic multicore chips may offer better speedup than asymmetric multicore chips.

The paper then describes how tasks and data movements add overheads, and argues that Amdahl's law may not be valid for future embedded multicores. Though the paper presents these results, it does not discuss specific ways to address the needs of future embedded multicores, nor does it present a mathematical model for predicting the performance of a multicore system.

In [5] the authors discuss the shortcomings of Amdahl's law and present a model that predicts the behavior of a distributed algorithm based on the network restrictions of the clusters used. The shortcoming discussed is that Amdahl's law assumes the problem size remains the same after parallelization; in many practical cases this assumption does not hold. The paper then discusses how Gustafson's law addresses this problem in Amdahl's law, and how their performance model covers parallel processing on multicore systems but does not include a model for distributed multicore hardware systems. The paper also shows how latency and bandwidth affect inter-processor communication, and how the number of nodes used affects the predicted performance. The proposed model is based on fixed time and is extended further to accommodate communication cost for distributed and parallel applications. Our paper differs from this approach since we do not consider distributed computing; additionally, we model the performance overheads to predict the performance.

In [6] the authors highlight that asymmetric multicore processors have different power and performance characteristics. Additionally, they note that the challenge of mapping tasks to cores and the different memory hierarchies make it difficult to predict performance. The authors propose a software-based model to estimate the performance of tasks on different core types. The approach uses static code analysis, analytical models, and statistical techniques to model the performance of cores, modeling the CPI stack on out-of-order and in-order cores. Our approach to modeling and predicting performance differs in that we do not insert performance counters in the code at different compiler passes; instead, we model the overhead to make our model and prediction more accurate.

The authors of [7] argue that sharing of resources creates performance issues in multicore systems. The paper describes a machine-learning-based approach to predict the performance of multicore systems, listing program attributes from solo runs that can be used to predict the performance of concurrently running programs that share resources. Our method differs since we use the sequential execution cycles to calculate overheads for parallel execution; moreover, our model is not machine-learning based.

In [8] the authors discuss how the sharing of last-level caches by cores negatively affects the performance of multicore systems. The paper proposes a shared-cache-aware performance model that estimates the performance degradation due to cache contention. The estimation is based on distance histograms, cache access frequency, and the relationship between throughput and cache miss rate. The paper also describes an automated method to identify processor-dependent characteristics without offline simulations or OS modifications. Our approach differs since we do not identify any

processor-dependent characteristics, and we do perform offline executions on real hardware to create our model.

The work in [9] includes different methods to predict cache contention in a unified framework and evaluates the methods based on timing performance and efficiency. It shows that cache misses, used as predictors, help to accurately predict the interference of co-scheduled applications; the accurate prediction is attributed to the high temporal locality of application memory references.

In [10] the authors bring out the challenge of under-utilized cores due to the interference of latency-sensitive applications. Performance degradation is represented as a piecewise function of the aggregate pressures on the shared resources of all cores, which is used for predicting latencies. The authors use regression analysis to build a predictor function that also characterizes the contentiousness of the application. This work shows some important results; however, our work differs since we do not model distributed multicore systems.

The authors of [11] note that performance prediction using system-based performance counters is common. They developed application-specific performance counters, and their results show that the performance effects on each application and on the performance-counter groups depend on TLB data misses, L1 and L2 cache utilization, branch instructions, and system resource stalls.

The authors of Parallel Prophet [12] predict the speedup obtainable by parallelizing using code annotations, interval profiling, dynamic fast-forward emulation, and program-synthesis-based emulation. They also present a memory performance model to estimate the memory-related speedup. The prediction targets loops, tasks, nested loops, recursive constructs, and OpenMP synchronization constructs.

III. MOTIVATION

Writing parallel applications for multicore processors is a challenging task, and parallelizing legacy sequential code to gain performance benefits is even more challenging. It is desirable to know the performance benefits before migrating an application to multicores, primarily to reduce the time and effort that must be invested. Many researchers have attempted to address this problem by creating analytical models for memory, multicore hardware, etc.; however, they have had limited success in practical scenarios. In this paper, we propose models to predict the performance overheads, which indirectly help to predict the performance benefits of an application on multicore platforms. We have applied our models and predicted performance on real homogeneous multicore hardware instead of just simulating the results.

The parallelization constructs used for parallelizing code themselves add to the overheads. Apart from this, there are other overheads due to the target hardware, depending on the memory/cache hierarchy, the architecture, the number of cores, etc. Instead of modeling the memory/cache overheads or communication overheads individually, we take the holistic view of creating a model that derives the overheads for the constructs added to the serial code during parallelization. We take advantage of the fact that one does not parallelize the entire code but only a few constructs, such as loops or a few tasks. In that case, we need to model only the overheads caused by the constructs added to the serial code. The most commonly used parallelization constructs are those of OpenMP, and hence we chose to model the overheads created by OpenMP constructs. It is also worth noting that not all OpenMP constructs are used frequently; in our work we have chosen the most commonly used OpenMP constructs to create our model.

We have executed several motivational example codes on real homogeneous multicore hardware and plotted the execution cycles versus the number of loops/code sections using a particular parallelization construct. Thus, for an entire program, we use a different model for each type of parallelization construct to predict the total cycles that construct would require. This makes our prediction more accurate.
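A minimal sketch of such a measurement harness is shown below (the loop body, sizes, and the use of omp_get_wtime() rather than hardware cycle counters are our assumptions, not the paper's actual setup):

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N], b[N];

    int main(void)
    {
        /* Repeat the same parallelized loop pattern while varying the
           number of parallel regions; the elapsed time per configuration
           gives the cycles-vs-number-of-sections curve to be fitted. */
        for (int regions = 1; regions <= 8; regions++) {
            double start = omp_get_wtime();
            for (int r = 0; r < regions; r++) {
                #pragma omp parallel for
                for (int i = 0; i < N; i++)
                    a[i] = a[i] * 0.5 + b[i];
            }
            printf("%d region(s): %f s\n", regions, omp_get_wtime() - start);
        }
        return 0;
    }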

Fig. 1 - Motivational Example 1

Our approach is also useful to see whether parallelizing individual loops/code sections creates overheads. The application developer can thus get a good grip on what he/she can parallelize to get better performance. Our method is useful not just to predict the performance but also to make informed decisions on which code sections really give a performance benefit compared to others. This helps application developers reduce the time spent on performance tweaking after the application is actually migrated.

The total number of execution cycles required by the code equals the execution cycles required for the loops in the code plus the execution cycles required for the remaining, non-loop part of the code. Not all loops will be parallelized; hence, only the execution cycles required by the loops that are parallelized need to be predicted. If we can model the OpenMP constructs used for parallelizing a loop, we can model all possible interactions of the cores, memory, etc. required for executing the loop in parallel. Thus, the predicted overheads due to the OpenMP constructs are representative of all the overheads, including memory/cache and inter-core communication.
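Stated as an equation (the notation is ours, consistent with the description above): if C_rest denotes the cycles of the unchanged non-loop code, C_i the parallel execution cycles of loop i, and O_i the overhead of the OpenMP construct applied to loop i, then the predicted total is

    C_{\mathrm{total}} = C_{\mathrm{rest}} + \sum_{i} \left( C_i + O_i \right)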

Figures 1(a) and 1(b) show the serial and parallel versions of a program. As seen, apart from the OpenMP constructs added in (b), the other code sections remain the same, so their execution cycles remain the same as their serial counterparts. Thus, we need to predict only the execution cycles for the code sections marked by the OpenMP constructs. As seen in Figure 1(b), the OpenMP 'parallel for' construct is used twice and the OpenMP 'schedule(static, n)' construct is used once. Therefore, if we have a model for predicting cycles for both of these constructs, we can predict the execution cycles for the entire parallel code.
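Figure 1 itself is not reproduced in this text, but the pattern it describes looks roughly like the following sketch (the array names, bounds, and loop bodies are our own placeholders):

    #include <omp.h>

    #define N 10000
    #define CHUNK 100        /* the 'n' in schedule(static, n) */
    static double x[N], y[N];

    /* (a) serial version */
    void compute_serial(void)
    {
        for (int i = 0; i < N; i++)
            x[i] = i * 0.5;
        for (int j = 0; j < N; j++)
            y[j] += x[j];
    }

    /* (b) parallel version: everything outside the loops is unchanged,
       so only the OpenMP-marked loops need new cycle predictions. */
    void compute_parallel(void)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            x[i] = i * 0.5;
        #pragma omp parallel for schedule(static, CHUNK)
        for (int j = 0; j < N; j++)
            y[j] += x[j];
    }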

In our experiments, we executed a motivational example code with different numbers of loops and with varying iteration counts, chunk sizes, and array sizes, and plotted graphs of the execution cycles required versus the chunk sizes. The motivational example code is shown in Figure 2.
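A sketch of that kind of sweep is shown below (the array size, chunk range, and loop body are illustrative placeholders, not the paper's actual values):

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N];

    int main(void)
    {
        /* Run the same parallel loop with increasing chunk sizes and
           record the elapsed time, mirroring the execution-cycles vs.
           chunk-size plots described above. */
        for (int chunk = 1; chunk <= 4096; chunk *= 2) {
            double start = omp_get_wtime();
            #pragma omp parallel for schedule(static, chunk)
            for (int i = 0; i < N; i++)
                a[i] = a[i] * 2.0 + 1.0;
            printf("chunk %4d: %f s\n", chunk, omp_get_wtime() - start);
        }
        return 0;
    }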

This has helped us create an accurate prediction model for the given type of OpenMP construct. We can derive mathematical models for all types of OpenMP constructs and then use them to predict the overheads that would be added, depending on the type and number of OpenMP constructs used for parallelization. Currently, we have modeled only two frequently used OpenMP constructs, viz. "#pragma omp parallel for private(..)" and "#pragma omp schedule(….)". We then use the individual models of these constructs to predict the execution cycles of four benchmark codes on a real homogeneous multicore target platform.
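To illustrate how such per-construct models combine into a whole-program prediction, consider the following sketch (the linear model forms and all coefficients are invented placeholders, not the paper's fitted values):

    /* Hypothetical fitted overhead models for the two constructs;
       the a + b*x linear form is an assumption for illustration. */
    static double model_parallel_for(double iterations)
    {
        return 5000.0 + 1.2 * iterations;
    }

    static double model_schedule_static(double iterations, double chunk)
    {
        return 2000.0 + 0.8 * (iterations / chunk);
    }

    /* Predicted total cycles = unchanged serial cycles + modeled
       cycles for each OpenMP-marked section. */
    double predict_cycles(double serial_rest_cycles,
                          double loop1_iters,
                          double loop2_iters, double loop2_chunk)
    {
        return serial_rest_cycles
             + model_parallel_for(loop1_iters)
             + model_schedule_static(loop2_iters, loop2_chunk);
    }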

We used the Valgrind [13] and KCachegrind [14] tools for simulation. The cache specifications are: L1 data cache 32K,
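For reference, a typical invocation of these tools looks like the following (the binary name is a placeholder; the output file suffix is the process ID):

    valgrind --tool=cachegrind ./app
    kcachegrind cachegrind.out.<pid>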

Fig. 2 - Motivational example code

    #include <omp.h>

    int main ( int argc, char *argv[] )
    {
        ......
        .....
        #pragma omp schedule ( static, CHUNKSIZE )
        for ( i = 2; i