J Supercomput (2011) 58:332–340 DOI 10.1007/s11227-011-0589-1

Using accurate AIC-based performance models to improve the scheduling of parallel applications

Diego R. Martínez · Julio L. Albín · Tomás F. Pena · José C. Cabaleiro · Francisco F. Rivera · Vicente Blanco

Published online: 15 March 2011 © Springer Science+Business Media, LLC 2011

Abstract Predictions based on analytical performance models can be used by efficient scheduling policies in order to select adequate resources for an optimal execution in terms of throughput and response time. However, developing accurate analytical models of parallel applications is a hard task. The TIA (Tools for Instrumenting and Analysis) modeling framework provides an easy-to-use modeling method for obtaining analytical models of MPI applications. This method is based on model selection techniques and, in particular, on Akaike's information criterion (AIC). In this paper, the AIC-based performance model of the HPL benchmark is first obtained using the TIA modeling framework. Then the use of this model for runtime estimation under different backfilling policies is analyzed with the GridSim simulator. The behavior of these simulations is compared with equivalent simulations based on the theoretical model of HPL provided by its developers.

D.R. Martínez (corresponding author) · J.L. Albín · T.F. Pena · J.C. Cabaleiro · F.F. Rivera
Dept. of Electronic and Computer Science, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
e-mail: [email protected]

J.L. Albín
e-mail: [email protected]

T.F. Pena
e-mail: [email protected]

J.C. Cabaleiro
e-mail: [email protected]

F.F. Rivera
e-mail: [email protected]

V. Blanco
Dept. of Statistics and Computer Science, La Laguna University, 38271 La Laguna, Spain
e-mail: [email protected]


Keywords Performance · Analytical models · Model selection · Backfilling schedulers

1 Introduction

Performance modeling of parallel applications is a crucial issue in High Performance Computing, and different modeling approaches have been proposed in the literature. Although analytical models are generally less accurate than other modeling methods, they have the advantage of being fast to evaluate. This feature is essential in time-constrained problems such as scheduling or dynamic load balancing. In most cases, however, the development of analytical models requires a considerable effort as well as a deep knowledge of the parallel algorithm. The TIA modeling framework [6] provides the user with an easy-to-use method, based on the Akaike information criterion (AIC) [1], for obtaining accurate analytical models of parallel applications.

While job scheduling does not affect the results of a job, it may have a significant influence on the efficiency of the system. Accurate performance models can provide valuable information for scheduling decisions made by backfilling policies, because these algorithms take advantage of job runtime estimations in order to exploit cluster resources efficiently. Backfilling algorithms are common on sites with parallel jobs because they reduce starvation without significant changes in the priorities of the jobs [9].

This paper shows how accurate AIC-based models provided by the TIA framework can be used to obtain runtime predictions that improve the efficiency and throughput of backfilling scheduling strategies. In particular, the AIC-based model of the well-known HPL benchmark is considered as a case study. Using the GridSim toolkit [8], the behavior of backfilling scheduling algorithms that use runtime estimations based on this model is simulated and compared to equivalent simulations based on the theoretical model of the benchmark developers [4].

The rest of this paper is organized as follows. Section 2 briefly revisits the theoretical basis of model selection techniques and the Akaike information criterion. Section 3 describes the TIA modeling framework, including the model selection method based on AIC. Section 4 shows the modeling process of the HPL benchmark using TIA. Section 5 describes the simulation process and shows the results. Finally, Sect. 6 summarizes the main contributions of this paper.

2 Model selection

Although a model with a very large number of parameters can provide a good fit of the data, it might not be adequate for obtaining valid inferences. Instead, model selection seeks models that are good approximations to the truth and from which valid inferences about the system or process under study can be made. This search is based on analyzing data to aid in the selection of a parsimonious model. Most selection methods are defined in terms of an appropriate information criterion, a mechanism that uses data to assign each candidate model a certain score. In [3], Burnham and Anderson propose a general strategy for modeling and data analysis using information theory and, in particular, Akaike's information criterion. Model selection based on this criterion is not the only reasonable approach, but it presents certain computational advantages.

Akaike's information criterion (An Information Criterion, AIC) provides a simple, effective, and objective means for selecting an estimated best approximating model for data analysis and inference [1]. The AIC is given by

$$\mathrm{AIC} = -2\log\bigl(L(\hat{\theta})\bigr) + 2K$$

where $\log(L(\hat{\theta}))$ is the value of the log-likelihood at its maximum point and $K$ is the number of estimable parameters. This heuristic view of the components of AIC clearly shows a balance between goodness of fit (a high value of the log-likelihood) and complexity (complex models are penalized). Therefore, the criterion can be interpreted as a trade-off between underfitted and overfitted models. AIC formulates model selection as an optimization problem over a set of candidate models: the candidates can be ranked by their AIC values, and the model with the lowest AIC value is selected as the best model.
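As an illustration of how the criterion penalizes complexity, the following Python sketch (not taken from the paper or from TIA) computes the least-squares form of AIC, which assumes independent Gaussian residuals, and compares a straight-line fit against a deliberately over-parameterized polynomial. The data, the models, and the helper name `aic_least_squares` are all hypothetical; the simpler model should usually obtain the lower score on data generated from a straight line.

```python
import numpy as np

def aic_least_squares(y, y_pred, n_params):
    """AIC for a least-squares fit with i.i.d. Gaussian residuals.

    Up to an additive constant, -2*log(L) reduces to n*log(RSS/n), so
    AIC = n*log(RSS/n) + 2*K, where K counts the fitted coefficients
    plus the estimated residual variance.
    """
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y.size
    rss = np.sum((y - y_pred) ** 2)
    k = n_params + 1  # +1 for the residual variance
    return n * np.log(rss / n) + 2 * k

# Toy comparison: a 2-parameter straight line versus an over-fitted
# 6-parameter polynomial on data generated from a straight line.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 40)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

for degree in (1, 5):
    coeffs = np.polyfit(x, y, degree)
    aic = aic_least_squares(y, np.polyval(coeffs, x), degree + 1)
    print(f"{degree + 1} parameters -> AIC = {aic:.2f}")
```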

3 The modeling framework

The TIA modeling framework [6] provides the user with a simple and powerful environment for analyzing the performance of parallel applications. It consists of two main stages, and the whole system is almost transparent to the user because of the automatic connection between both stages. The first stage (instrumentation) implements the user-driven instrumentation of the source code; the information about the performance of each execution of the application is stored in XML files. In the second stage (analysis), models are obtained by analyzing the performance information gathered from multiple executions of the instrumented code.

The model selection methodology based on AIC has been implemented in the analysis stage [7], so that the data from the instrumentation stage can be used by the modeling process. This implementation returns the model with the lowest AIC score as the best model for future inferences, but it also provides statistical information to help the user judge the suitability of the model. It performs the AIC selection over a finite set of candidate models, which are automatically generated from a weighted list of application variables and execution parameters provided by the user. Therefore, this framework provides the user with a simple tool for obtaining an accurate performance model of a parallel application.

Any parameterized metric can be used in the modeling process as an explanatory or response variable. The only limitation is that the metrics have to be obtainable or derivable from instrumentation data. For example, it is possible to consider metrics that characterize system heterogeneity or power consumption, such as the mean and variance of the computational power of the execution nodes or their CPU temperatures.

On the one hand, this method is useful when a deep knowledge of the parallel application is not available, because TIA does not search for a single realistic model; it simply selects, from the set of candidate models, the one that best fits the data for future inferences. On the other hand, this method is also useful for experienced analysts, because the obtained model reflects the experimental behavior of the application and, therefore, a deep analysis of theoretical analytical models can be performed.
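TIA's internal candidate-generation rules are not spelled out in this paper, so the following Python sketch only illustrates one plausible reading of the procedure: product terms are built by taking at most one factor from each group of a weighted list, every subset of product terms defines a linear candidate model fitted by least squares, and the candidate with the lowest AIC score is kept. The group layout, helper names, and fitting details are assumptions for illustration, not TIA's actual interface.

```python
import itertools
import numpy as np

# Hypothetical groups of explanatory terms, in the spirit of a TIA weighted
# list: at most one factor of each group may appear in a product term.
GROUPS = [
    [lambda N, NB, P, Q: N**3, lambda N, NB, P, Q: N**2],
    [lambda N, NB, P, Q: 1.0 / NB],
    [lambda N, NB, P, Q: 1.0 / Q],
    [lambda N, NB, P, Q: 1.0 / P],
]

def product_terms(groups):
    """Every product obtained by taking at most one factor from each group."""
    terms = []
    for choice in itertools.product(*[[None] + g for g in groups]):
        factors = [f for f in choice if f is not None]
        if factors:
            terms.append(lambda X, fs=tuple(factors):
                         np.prod([f(*X) for f in fs], axis=0))
    return terms

def aic_fit(X, y, term_subset):
    """Least-squares fit of y against the chosen terms; returns (AIC, coeffs)."""
    A = np.column_stack([np.ones_like(y)] + [t(X) for t in term_subset])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coeffs) ** 2)
    n, k = len(y), A.shape[1] + 1          # +1 for the residual variance
    return n * np.log(rss / n) + 2 * k, coeffs

def select_best_model(X, y, max_terms=3):
    """Exhaustive search over small term subsets, keeping the lowest AIC."""
    terms = product_terms(GROUPS)
    best = (np.inf, None, None)
    for r in range(1, max_terms + 1):
        for subset in itertools.combinations(terms, r):
            aic, coeffs = aic_fit(X, y, subset)
            if aic < best[0]:
                best = (aic, subset, coeffs)
    return best
```

Counting the constant, the groups above yield 24 possible product terms, and the 2^24 ≈ 16.7 million subsets are consistent with the "more than 16 million" candidates reported in Sect. 4, although the exact counting used by TIA is not stated in the paper; the sketch limits itself to models with at most three terms to keep the toy search cheap.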

4 Case study: HPL model assessment

High Performance Linpack (HPL) [4] is a scalable parallel implementation of the High Performance Computing Linpack Benchmark for distributed-memory computers. HPL uses LU factorization with row partial pivoting to solve dense linear systems with N unknowns, mapping the data onto a two-dimensional block-cyclic distribution for load balance and scalability. HPL exposes 18 performance-related parameters. Most analytical models of HPL are based on a subset of these parameters; otherwise, the models become intractable. Four parameters are the most important ones in terms of performance: the number of unknowns (N), the block size (NB), and the two dimensions of the process grid (P × Q). Indeed, based on an in-depth theoretical analysis of HPL, Dongarra et al. [4] conclude that, when the modified increasing-ring variant is used, the total execution time of the algorithm is given by

$$T_{\mathrm{HPL}} = \gamma_3\,\frac{2N^3}{3PQ} + \beta\,\frac{N^2(3P+Q)}{2PQ} + \alpha\,\frac{N\bigl((NB+1)\log P + P\bigr)}{NB}$$

where γ₃ is the floating point operation (FLOP) rate of the matrix-matrix operations, and α and β are the latency and the bandwidth of the network, respectively. However, this model neglects all the phenomena occurring inside the processor, such as cache misses or TLB misses, and it does not make any assumption about the amount of physical memory available on a single node [4]. Moreover, this expression corresponds to a particular HPL configuration.

Using the model selection capabilities of the TIA framework, an AIC-based performance model of the HPL benchmark was obtained. As the number of HPL parameters is so high, a specific algorithmic configuration (W003L2L1) was chosen, so that the HPL parameter set of the benchmark was reduced to only four parameters (N, NB, P and Q). Instrumentation instructions were introduced in the source code to measure the walltime of the HPL_pdgesv function, which is the function that actually solves the system of equations, as well as the above four execution parameters. The experimental testbed was a cluster of seven 2.3 GHz Intel Xeon Quad-core biprocessors connected by Gigabit Ethernet. Table 1 summarizes the experimental HPL parameter values; the instrumented HPL was executed once for each parameter configuration.

Table 1 Experimental HPL parameter values

Parameter   Values
N           6144, 12288, 18432, 24576, 30720
NB          8, 16, 32, 64, 128, 256
P           1, 2, 3, 4, 5, 6, 7
Q           1, 2, 3, 4, 5, 6, 7, 8

Based on basic knowledge about the benchmark, a user can propose an adequate weighted list of parameters and variables, which is necessary to start the modeling process according to the TIA requirements. On the one hand, each matrix has N × N elements, but the sequential algorithm has O(N³) complexity. On the other hand, the purpose of the other three parameters is to distribute the computational load across the system in a scalable way. Therefore, the following weighted list was used to start the model selection process of TIA:

$$WL = \bigl\{\{N^3, N^2\}, \{1/NB\}, \{1/Q\}, \{1/P\}\bigr\}$$

This list generates a set with more than 16 million possible candidate models. Using the experimental performance measurements and this set of candidate models, after executing TIA, the model selection method proposes the following model as the best one:

$$\begin{aligned}
T_{\mathrm{AIC}}(s) = {} & 1.8 - 2.7\times 10^{-12}\,N^3 + 2.0\times 10^{-7}\,N^2 - 29.6\,\frac{1}{NB} - 3.2\,\frac{1}{Q} + 48.5\,\frac{1}{NB\,Q} \\
& - 7.9\times 10^{-7}\,\frac{N^2}{QP} + 8.2\times 10^{-8}\,\frac{N^2}{NB} + 4.1\times 10^{-12}\,\frac{N^3}{Q} + 7.1\,\frac{1}{P} \\
& + 7.9\times 10^{-10}\,\frac{N^2}{P} - 5.8\times 10^{-7}\,\frac{N^3}{NB\,P} - 2.7\times 10^{-7}\,\frac{N^2}{NB\,P} \\
& + 1.6\times 10^{-10}\,\frac{N^3}{QP} - 1.7\times 10^{-10}\,\frac{N^3}{NB\,Q\,P}
\end{aligned}$$

The standard deviation (σ) of the fit of this model to the experimental data is 13.5%. Note that this model characterizes the real behavior of a specific configuration of the HPL benchmark (W003L2L1) for different matrix sizes, block sizes, and process grid configurations (see Table 1). Also note that this model was obtained without code inspection.

In addition, using other TIA capabilities, the theoretical model of the HPL developers [4] was fitted to the same experimental data. The resulting analytical expression is

$$T_{\mathrm{HPL}}(s) = 3.2\times 10^{-10}\,\frac{2N^3}{3PQ} + 1.0\times 10^{-7}\,\frac{N^2(3P+Q)}{2PQ} + 2.8\times 10^{-4}\,\frac{N\bigl((NB+1)\log P + P\bigr)}{NB}$$

In this case, the σ of the fit is 29.1%. Note that this model was developed under particular conditions and it does not take second-order effects into account.
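For readers who want to plug concrete configurations into the two fitted expressions above, the short Python sketch below simply evaluates them. It is not part of TIA; the function names are arbitrary, and the logarithm base in the theoretical term is assumed to be 2, since the paper does not state it.

```python
import math

def t_aic(N, NB, P, Q):
    """Evaluate the AIC-selected HPL model (seconds), coefficients as reported above."""
    return (1.8 - 2.7e-12 * N**3 + 2.0e-7 * N**2
            - 29.6 / NB - 3.2 / Q + 48.5 / (NB * Q)
            - 7.9e-7 * N**2 / (Q * P) + 8.2e-8 * N**2 / NB
            + 4.1e-12 * N**3 / Q + 7.1 / P
            + 7.9e-10 * N**2 / P - 5.8e-7 * N**3 / (NB * P)
            - 2.7e-7 * N**2 / (NB * P)
            + 1.6e-10 * N**3 / (Q * P) - 1.7e-10 * N**3 / (NB * Q * P))

def t_theo(N, NB, P, Q):
    """Evaluate the fitted theoretical HPL model (seconds); log base 2 is assumed."""
    return (3.2e-10 * 2 * N**3 / (3 * P * Q)
            + 1.0e-7 * N**2 * (3 * P + Q) / (2 * P * Q)
            + 2.8e-4 * N * ((NB + 1) * math.log2(P) + P) / NB)

# One configuration taken from the experimental ranges of Table 1.
for f in (t_aic, t_theo):
    print(f.__name__, "->", round(f(24576, 64, 4, 7), 1), "s")
```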


5 Case study: scheduling simulation

Both the AIC-based and the theoretical models can be used to provide runtime estimations of the HPL benchmark for backfilling scheduling policies. The accuracy of the fit was taken into account when evaluating the models, so that cancellations due to underestimated runtimes are avoided. Cancellations refer to jobs that are scheduled but do not finish properly because their execution times were underestimated. A reasonable approach is therefore to overestimate the model prediction by adding a multiple of the σ of the fit (see the short sketch below, after the description of the schedulers and the workload). We considered at least 4σ, since it guarantees less than 0.006% of canceled jobs under a normal distribution. In fact, in the case of the AIC-based model, 4σ is enough to ensure no cancellations due to wrong estimations. In the case of the theoretical model, this factor has to be 12 because of its lower accuracy. These scalar factors were obtained empirically from the experimental data. In order to avoid the impact of cancellations in our experiments, we applied these overestimations to both the AIC-based and the theoretical models.

The influence of these runtime estimation approaches on the behavior of different backfilling algorithms was compared through several simulations using GridSim [8]. GridSim is a Java-based discrete-event Grid simulation toolkit designed to study scheduling algorithms in a repeatable and controllable environment. This toolkit supports modeling and simulation of heterogeneous Grid resources (both time- and space-shared), users, and application models. It provides primitives for creating application tasks, for mapping tasks to resources, and for their management. We implemented in GridSim [2] some of the most commonly used parallel local schedulers, including three backfilling approaches:

– First Come-First Served (FCFS). All jobs are executed in the order in which they arrive, without other biases or preferences.
– First Fit. The first available job that fits the idle resources is executed.
– Backfilling. The arrival order is used to schedule the jobs, as in FCFS. Nevertheless, if some nodes are left idle because the next job in the queue does not fit, the queue is examined to execute another job. The selected job must not delay the execution of the rest of the queued jobs. Two ways of selecting the promoted job have been included:
  – First Fit: the first job that can be executed without delaying the FCFS order is executed first.
  – Predictive: the first job that can be executed without delaying the best possible start of the previously queued jobs is executed first.
– EASY Backfilling. An aggressive backfilling in which only the first element of the queue is considered: a job is promoted if it does not delay the execution start of the next job in the queue, while the rest of the entries in the queue are not taken into account.

The motivation of this experiment is to determine how accurate AIC-based runtime estimations can improve the performance of the scheduling process in realistic situations. To emulate a realistic situation, the workload was based on the model proposed by Jann et al. [5]. The Jann model is an empirical model, based on real workloads, that determines the distribution of inter-arrival times between job submissions as well as the distributions of job runtimes and numbers of processors. The experiment workload was generated by introducing several HPL jobs into a workload generated by the Jann model, so that the whole workload keeps the Jann inter-arrival time distribution.
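As a rough illustration of how padded estimates could feed a backfilling decision, the following Python sketch combines the kσ overestimation described above with a deliberately simplified (and conservative) EASY-style admission test. It is a conceptual sketch only: it is not GridSim code, σ is treated as relative to the prediction (one possible reading of the percentages quoted in Sect. 4), t_aic refers to the hypothetical helper from the earlier sketch, and the node counts and reservation time are arbitrary example values.

```python
def padded_estimate(model, sigma_rel, k, N, NB, P, Q):
    """Overestimate a model prediction by k times the relative sigma of its fit.

    sigma_rel is the standard deviation of the fit expressed as a fraction
    (0.135 for the AIC-based model, 0.291 for the theoretical one) and k is
    the empirically chosen factor (4 and 12, respectively).
    """
    return model(N, NB, P, Q) * (1.0 + k * sigma_rel)

def easy_backfill_ok(candidate_runtime, candidate_nodes,
                     free_nodes, head_reserved_start, now):
    """Conservative EASY-style test: promote the candidate only if it fits the
    idle nodes and is predicted to finish before the queue head is due to start."""
    return (candidate_nodes <= free_nodes
            and now + candidate_runtime <= head_reserved_start)

# Example: pad an AIC-based estimate by 4 sigma before giving it to the scheduler.
runtime = padded_estimate(t_aic, 0.135, 4, N=18432, NB=128, P=4, Q=6)
can_backfill = easy_backfill_ok(runtime, candidate_nodes=24,
                                free_nodes=28, head_reserved_start=3600.0, now=0.0)
```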


Table 2 Simulation results for different scenarios and backfilling algorithms

Scenario  Algorithm       Simulation data (hours)                               AIC benefits
                          Tsim^AIC  ΣTwait^AIC  Tsim^Theo  ΣTwait^Theo          # jobs  Twait   Tidle
LOW       B. predictive   1245.6    8490        1256.5     10259                40.1%   17.2%   0.8%
          B. first fit    1237.5    9400        1243.3     11333                50.5%   17.1%   0.4%
          EASYB.          1242.2    9070        1247.4     10806                39.7%   16.1%   0.4%
HIGH      B. predictive   76.4      3141        77.9       3888                 66.5%   19.2%   1.7%
          B. first fit    76.4      3106        76.7       3668                 61.7%   15.4%   0.4%
          EASYB.          75.5      2951        76.7       3719                 69.0%   20.7%   1.4%

In particular, we built a workload of 1,674 jobs in which 30% of the jobs are HPL jobs. The runtimes of the HPL jobs were obtained from real executions of the benchmark using different N, NB and P × Q configurations; these executions are different from those used in the model assessment process. The remaining jobs follow the runtime distribution of the Jann model and are assumed to have perfect runtime estimations. When the original Jann coefficients are used, the HPL jobs only account for 2.2% of the cluster resource utilization in the simulations, so the HPL jobs compete with a significant percentage of more time-consuming jobs. We refer to this as the LOW load scenario. In a second scenario, the synthetic values generated by the Jann model were scaled so that the HPL jobs represent 30% of the workload resource time consumption. In this new scenario, called HIGH load, the HPL jobs compete with jobs that have similar runtimes. As the original Jann coefficients were obtained on a specific cluster, the scaling of these coefficients can be interpreted as a change in the cluster parameters, while the distributions of runtimes and inter-arrival times are preserved. These two scenarios were considered in order to evaluate the scheduling behavior of the backfilling algorithms in situations where the HPL executions have a different weight in the total simulated time.

Table 2 summarizes the results of the simulations for the two workload scenarios and the three considered backfilling algorithms. This table shows the total simulated time (Tsim) and the sum of the waiting times of all HPL jobs (ΣTwait), for both the AIC-based and the theoretical estimations (AIC and Theo superscripts, respectively). The last three columns show the improvements of the AIC-based estimation over the theoretical one. The first two of them are the percentage of HPL jobs that obtain a lower waiting time (# jobs) and the percentage improvement of the total waiting time of the HPL jobs (Twait). The last column (Tidle) corresponds to the improvement of the idle-time percentage in the whole workload simulation, including non-HPL jobs.

In the LOW scenario, although the total simulated time is similar and the AIC estimation performs better than the theoretical one for no more than 50% of the HPL jobs, the total waiting time presents a significant improvement in the three simulations. For instance, with the predictive backfilling, the accumulated HPL waiting time drops from 10259 to 8490 hours, a 17.2% reduction. This means that the accuracy of the AIC-based runtime estimation allows the scheduler to handle the HPL jobs, especially those with long runtimes, more carefully. It also implies a better utilization of resources, as there is a slight reduction in the percentage of idle resources. Note that in this case the HPL jobs only represent 2.2% of the simulation resource utilization, so the margin for improvement is narrow.

In the HIGH scenario, the results also show that the AIC-based estimation produces a better behavior. From the user's point of view, the AIC estimation performs better than the theoretical estimation for more than 60% of the HPL jobs, and the total waiting time shows a significant reduction in all cases. From the system's point of view, the improvement of cluster resource utilization is even larger than in the LOW scenario for the predictive backfilling and the EASY Backfilling algorithms. However, the first fit policy produces a small improvement, because this algorithm does not evaluate the whole queue and does not optimize the scheduling process. The predictive backfilling algorithm achieves the best cluster resource utilization in both scenarios: as this algorithm evaluates the whole queue, it can take full advantage of the more accurate AIC-based runtime estimations.

6 Conclusions

AIC-based performance models can provide accurate runtime estimations that backfilling schedulers can use to improve their performance. This modeling technique is based on the model selection paradigm and allows the user to obtain, without code inspection, accurate models of parallel applications that are suitable for future inferences. In particular, the AIC-based model of the well-known HPL benchmark was obtained as a case study. Using the GridSim toolkit, scheduling simulations with different backfilling policies were performed in which the AIC-based model was used to estimate the runtime of HPL jobs in realistic situations. The profiles of the workloads used in these simulations were generated by an empirical model based on real data, and two different workloads were considered to evaluate the scheduling performance in situations where the HPL executions have a different weight in the total simulated time. The simulation behavior was compared to the one obtained when the theoretical model provided by the HPL developers is used to estimate HPL runtimes. The AIC-based HPL runtime estimations perform better in terms of waiting time and cluster resource utilization for all the considered backfilling algorithms. The simulations also show that the predictive backfilling algorithm is the policy that best exploits the accurate AIC-based runtime estimations.

Acknowledgements This work was supported by the Spanish Ministry of Education and Science through the TIN2007-67537-C03-01 and TIN2008-06570-C04-03 projects. It was developed in the framework of the European network HiPEAC-2 and the Spanish network CAPAP-H.

References

1. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
2. Albín JL, Lorenzo JA, Cabaleiro JC, Pena TF, Rivera FF (2007) Simulation of parallel applications in GridSim. In: 1st Iberian grid infrastructure conference proceedings
3. Burnham KP, Anderson DR (2002) Model selection and multimodel inference. A practical information-theoretic approach. Springer, Berlin
4. Dongarra JJ, Luszczek P, Petitet A (2003) The LINPACK benchmark: past, present and future. Concurr Comput 15(9):803–820
5. Jann J, Pattnaik P, Franke H, Wang F, Skovira J, Riodan J (1997) Modeling of workload in MPPs. In: IPPS/SPDP'97/JSSPP'97: proceedings of the job scheduling strategies for parallel processing
6. Martínez DR, Blanco V, Boullón M, Cabaleiro JC, Pena TF (2007) Analytical performance models of parallel programs in clusters. In: Parallel computing: architectures, algorithms and applications. ParCo 2007
7. Martínez DR, Pena TF, Cabaleiro JC, Rivera FF, Blanco V (2010) Performance modeling of MPI applications using model selection. In: 18th Euromicro conference on parallel, distributed and network-based processing
8. Sulistio A, Cibej U, Venugopal S, Robic B, Buyya R (2008) A toolkit for modelling and simulating data grids: an extension to GridSim. Concurr Comput 20(13):1591–1609
9. Tsafrir D, Etsion Y, Feitelson DG (2007) Backfilling using system-generated predictions rather than user runtime estimates. IEEE Trans Parallel Distrib Syst 18:789–803