An Instance-based Learning Approach for Predicting Execution Times of Parallel Applications

Luciano José Senger
Universidade Estadual de Ponta Grossa, Departamento de Informatica, Av. Carlos Cavalcanti, 4748, CEP 84030-900, Ponta Grossa, PR, Brazil, [email protected]

Marcos José Santana, Regina Helena Carlucci Santana
Universidade de São Paulo, Departamento de Computação, Instituto de Ciências Matematicas e de Computação, Av. Trabalhador Saocarlense, 400, Caixa Postal 668, CEP 13560-970, São Carlos, SP, Brazil, {mjs, rcs}@icmc.usp.br

Abstract— A new approach for predicting execution times of parallel applications is presented. The main goal is to improve decision making in parallel systems by providing the system scheduler with knowledge about parallel applications. The model is built by searching an experience base for similar applications, which allows effortless knowledge updating when new information becomes available. Workload traces from three computing centers are used to evaluate the model. The model achieves prediction errors on mean application execution times between 38% and 57%. The results are compared with previous work, and a trace-driven simulator is used to show how the model can improve scheduling decisions on parallel systems.

I. INTRODUCTION

Many researchers have demonstrated that using knowledge about parallel applications may improve scheduling decisions on multiprocessor systems [1]–[4]. Nevertheless, most of this work assumes that such knowledge is available a priori and gives no effective indication of how to obtain it. There are three main sources of knowledge about parallel applications: the description of application requirements provided by the users (or programmers) who submit the parallel application to the system; historical traces of all applications executed on a specific system over a period of time; and runtime measurements of parallel applications. Among these sources, historical traces and runtime measurements have shown great potential for classifying parallel applications and obtaining knowledge [5]–[8].

This paper presents a model for knowledge acquisition about parallel applications, aimed at improving software scheduling decisions. The model explores similarity among applications from workload traces, using an instance-based learning algorithm. Workload traces are treated as experience bases, and a newly submitted parallel application is treated as the query point. The experience bases are used by an instance-based algorithm to predict execution times of parallel applications. Results obtained by applying this model to four workload traces are presented. Compared to previous work, this model has two novel aspects. First, it allows the acquired knowledge to be updated when new information arrives. Second, it can be used to improve different scheduling algorithms. Using a trace-driven simulator, we show how knowledge obtained by the model can be used to improve the backfill scheduling algorithm on parallel computers.

The remainder of this paper is organized as follows. Section III describes the knowledge acquisition model and the instance-based learning algorithm. Section IV describes the model results using four workload traces and compares our model results with those of other researchers. Section V shows how the model can be used to improve scheduling on parallel systems. Section VI presents the concluding remarks.

II. RELATED WORK

Devarakonda and Iyer [9] present a statistical approach for predicting the CPU time, file I/O, and memory requirements of a program. For this purpose, the authors use statistical clustering (the k-means algorithm) and a Markovian model to identify high-density regions of program resource usage. Feitelson et al. [10] observe that repeated runs of the same application tend to have similar patterns of resource usage and that much information about the behavior of applications can be discovered without explicit user cooperation.

Downey [11] uses a statistical approach to predict the execution time of parallel applications. The approach is to model the applications recorded in workload traces and then use the generated model to predict execution times. The applications from two workload traces are divided into classes, a model is created for each class, and the models are used to predict execution times. Downey categorizes applications by the system scheduler queues to which they are submitted.

A historical application profiler is presented by Gibbons [5]. This profiler classifies parallel applications into categories based on static templates composed of attributes such as the user (who submits the application), the executable name, and the selected scheduler queue. These templates are used to group applications and to generate average execution times and other statistical values for each group. These derived values are then used to predict execution times of parallel applications. Smith et al. [6] revisit this profiler work, presenting a technique to derive predictions for execution times of parallel applications that uses search techniques to dynamically determine which application characteristics (i.e., adaptive templates) yield the best definition of similarity. Their evaluation showed that a genetic algorithm finds the best similarity template in the workload traces used. In similar work, Krishnaswamy et al. [12] use a rough-set algorithm to address the problem of selecting the templates that best define similarity. The rough-set algorithm uses a workload trace as input and an estimated runtime as output.
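The template-based grouping described above for the historical profiler can be illustrated with a minimal sketch. This is not the implementation from [5] or [6]; the field names (`user`, `executable`, `queue`, `runtime`), the sample jobs, and the fallback order of templates are assumptions chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def build_template_means(trace, template):
    """Group past jobs by the attributes in `template` and average their runtimes."""
    groups = defaultdict(list)
    for job in trace:
        key = tuple(job[attr] for attr in template)
        groups[key].append(job["runtime"])
    return {key: mean(times) for key, times in groups.items()}


def predict_runtime(job, templates, trace):
    """Try each template in order and return the mean runtime of the first matching group."""
    for template in templates:
        means = build_template_means(trace, template)
        key = tuple(job[attr] for attr in template)
        if key in means:
            return means[key]
    return None  # no group in the trace matches this job


# Hypothetical trace records; real workload traces carry more attributes.
trace = [
    {"user": "ana", "executable": "sim", "queue": "long", "runtime": 100.0},
    {"user": "ana", "executable": "sim", "queue": "long", "runtime": 140.0},
    {"user": "bob", "executable": "fft", "queue": "short", "runtime": 20.0},
]

# Templates ordered from most specific to least specific (an assumed policy).
templates = [("user", "executable", "queue"), ("user",), ("queue",)]

new_job = {"user": "ana", "executable": "sim", "queue": "long"}
prediction = predict_runtime(new_job, templates, trace)
```

With static templates the attribute sets are fixed in advance, as in this sketch; the adaptive approaches discussed above instead search for the attribute sets that give the best predictions on a given trace.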

III. PREDICTING EXECUTION TIMES OF PARALLEL APPLICATIONS

Our knowledge acquisition model is based on the observation that similar applications are more likely to have similar execution times than applications that have nothing in common [5], [6], [13], [14]. The model defines how to find similar applications and how to generate predictions from these applications. The model is constructed using an instance-based learning approach. Workload traces are treated as a database created from previous experiences (an experience base) about the execution of parallel applications. Depending on the requirements of the scheduling algorithm, an attribute such as execution time, CPU usage, or memory usage can be chosen as the attribute to be predicted [15]. In this paper, we describe our knowledge acquisition model and how it can be used to predict execution times of parallel applications.

Instance-based learning (IBL) is an approach that finds similar instances in an experience base in order to approximate real-valued or discrete-valued target functions [16], [17]. Learning consists of simply storing the presented training data, which are composed of instances; each instance is composed of a set of input and output attributes. Input attributes describe the conditions under which an experience was observed, and output attributes describe what happened under those conditions. IBL algorithms compute the similarity between a new query instance and the instances in the experience base, returning a set of related instances as output. Only relevant instances are used to classify the query instance. These algorithms can construct a different approximation to the target function for each distinct query instance that must be classified. This has significant advantages when the target function is very complex but can be described by a collection of less complex local approximations. Our IBL algorithm is based on k-nearest neighbor learning.
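As a rough illustration of k-nearest neighbor prediction over an experience base, the sketch below averages the output attribute of the k stored instances closest to a query point. It is only a sketch under stated assumptions: the Euclidean distance, the unweighted mean, the numeric input attributes, and the sample values are all illustrative choices, not details confirmed by this section.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(experience_base, query, k=3):
    """Predict the output attribute as the mean over the k nearest stored instances."""
    if not experience_base:
        raise ValueError("experience base is empty")
    neighbors = sorted(
        experience_base, key=lambda inst: euclidean(inst[0], query)
    )[:k]
    return sum(output for _, output in neighbors) / len(neighbors)


# Each instance: (input attributes, execution time in seconds).
# The two input attributes are hypothetical, e.g. requested processors
# and a numerically encoded queue identifier.
experience_base = [
    ((4.0, 1.0), 100.0),
    ((4.0, 2.0), 120.0),
    ((8.0, 1.0), 60.0),
    ((64.0, 3.0), 900.0),
]

query = (4.0, 1.5)  # the newly submitted application
predicted = knn_predict(experience_base, query, k=2)
```

Because learning only stores instances, updating the knowledge in this sketch amounts to appending a new `(inputs, output)` pair to `experience_base` once an application finishes; no model has to be retrained.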
The method assumes that all instances in an experience base correspond to points in the n-dimensional space