Coping with Uncertainty in Scheduling Problems

Louis-Claude Canon∗
Nancy University, LaBRI, Bordeaux, France
Email: [email protected]

∗ The research effort presented in this paper was conducted under the supervision of Emmanuel Jeannot. The PhD program covers a period of two years out of a traditional total of three years.

Abstract—Large-scale distributed systems such as Grids constitute computational environments that are essential to academia and industry. However, they exhibit uncertain behavior because of their continually increasing scale. We propose to revisit traditional scheduling problems in these environments by incorporating uncertainty into the models.

I. PROBLEM

A. Description

Scheduling, in its general form, is the operation that assigns requests to resources in some specific way. In distributed environments, we are concerned with a workload (i.e., a set of tasks) that needs to be executed on a computational platform (i.e., a set of processors). Our objective is therefore to specify how tasks are mapped onto processors. The produced schedules can be evaluated through many different metrics (e.g., processing time of the workload, resource usage, etc.), and finding a schedule that is optimal relative to some metric constitutes a challenging issue.
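As a toy illustration of evaluating one such metric, the sketch below computes the makespan of a fixed mapping of independent tasks onto processors. The task set, durations, and mapping are invented for illustration and do not come from our models.

```python
# Toy schedule evaluation: all names and values below are hypothetical.
durations = {"t1": 3.0, "t2": 5.0, "t3": 2.0, "t4": 4.0}      # task -> duration
mapping   = {"t1": "p1", "t2": "p2", "t3": "p1", "t4": "p2"}  # task -> processor

# With independent tasks and no communication costs, each processor runs
# its tasks back to back, so the makespan is the largest processor load.
loads = {}
for task, proc in mapping.items():
    loads[proc] = loads.get(proc, 0.0) + durations[task]

makespan = max(loads.values())
print(loads)     # {'p1': 5.0, 'p2': 9.0}
print(makespan)  # 9.0
```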

We adopt here a wide definition of uncertainty that encompasses both the intrinsic stochastic nature of some phenomena (e.g., processor failures that follow a Poisson process) and the imperfection of model characteristics (e.g., inaccuracy of the costs in a model due to a bias in measurements). These two types of uncertainty are more commonly known as statistical variation and systematic uncertainty [16]. We group in this latter category all uncertainties that stem from indeterminate factors, such as user behaviors, which are uncontrolled although deterministic.

Juxtaposing uncertainty and scheduling problems has one invariant consequence: problems become multi-criteria. Indeed, in addition to the initial metric considered in a given scheduling problem, new metrics can measure uncertainty-related criteria such as the robustness (stability in the presence of input variations) or the reliability of a schedule (its probability of success). For a given problem, several solutions can then be optimal. Our objective is to generate a representative subset of these optimal solutions. Consequently, we are interested in probabilistic tools and multi-objective optimization techniques.

B. Approach

As uncertainty comes from several possible sources, we study different problems. This allows us to take a global view of how to tackle uncertainty. Moreover, we are interested in proposing generic combinatorial, probabilistic, and statistical tools. The objective is to obtain tools that can be used to solve other related problems.

We have selected three representative sources of uncertainty. These sources concern either the workload, the platform, or the users. Additionally, both types of uncertainty are considered (statistical variation and systematic uncertainty). We describe the three sources of uncertainty below (a small modeling sketch follows this list):

Imperfection of the workload model: The workload is a set of tasks with given durations. Yet, each task duration depends on the input data, the operating system scheduler, the execution architecture, etc. We therefore use random variables to model task durations.

Probabilistic behavior of the platform: A set of processors composes the platform. Each processor might fail because of recoverable faults or resource crashes. Therefore, we use probability distributions to model failures.

Actions of malevolent users: In desktop grids, software such as BOINC [2] relies on the benevolence of participants. In our study, we consider that users cheat with a fixed probability.
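As a minimal sketch of how these three sources could be modeled, the snippet below uses standard distributions; the specific distributions and parameters are assumptions made for illustration, not choices prescribed by our models.

```python
import random

rng = random.Random(42)

# 1) Imperfection of the workload model: a task duration is a random
#    variable; a gamma distribution is one common positive-valued choice.
def sample_task_duration(mean=10.0, shape=4.0):
    return rng.gammavariate(shape, mean / shape)

# 2) Probabilistic behavior of the platform: if failures arrive as a
#    Poisson process with rate lam, the time to the next failure is
#    exponentially distributed.
def sample_time_to_failure(lam=0.01):
    return rng.expovariate(lam)

# 3) Actions of malevolent users: a participant cheats on a given job
#    with a fixed probability p (a Bernoulli trial).
def cheats(p=0.2):
    return rng.random() < p

print(sample_task_duration(), sample_time_to_failure(), cheats())
```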

Our three main problems are built from these three uncertainty sources. For each case, we define a problem by selecting the workload model, the platform model, the threat model, the permitted scheduling operations, and the metrics of the objectives to optimize. The three resulting problems are described below:

Robustness: Find robust and efficient schedules when task durations are arbitrary random variables. Tasks are subject to precedence constraints that are specified in a task graph, which defines the dependences between tasks. The platform is assumed to be a cluster of heterogeneous processors. The schedule maps tasks to processors and specifies the order of their executions.

Reliability: Find reliable and efficient schedules when processors are subject to failures. This problem is similar to the previous one, except that task durations are deterministic and tasks can be duplicated on several processors. Thus, the schedule contains the starting and ending times of each task on each processor.

Collusion: Schedule jobs efficiently to the participants of a desktop grid project in the presence of cheaters that collude. The first objective is to minimize resource usage and the second is to find correct results. The workload is a set of independent jobs, while the platform is formed by the participants' machines. Additionally, participants might organize themselves to return incorrect results. Scheduling is done through a pull-based mechanism, i.e., when a participant is free, it requests a job from a central server and returns the result when the computation is finished.

C. Research methodology

We applied the appropriate steps of the following general methodology to each problem. Whenever the quality of our results is unexpectedly low, we may refine and iterate some of these steps.

1) First, we expand the scope of a classic problem by accounting for uncertainties in a generic way. Although genericity may lead to harder optimization problems, it ensures the applicability of our contributions to orthogonal fields.

2) Then, new objectives related to uncertainty appear and they need to be precisely assessed. To this end, several metrics may be defined, in which case their relevance must be empirically validated.

3) We end up with a completely defined problem (model and objectives) whose complexity class may be determined. Depending on the complexity, a method, either an exact algorithm or a heuristic, is developed.

4) Finally, the method must be empirically validated. We focus on using realistic instances of the models on which our problems are based, i.e., we perform simulations. Concrete experiments could provide relevant insight into the methods' performance or might even reveal issues in the models. However, we expect that our contributions, validated through simulations, will still constitute a significant basis in similar areas.
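To illustrate both the pull-based collusion model above and the simulation-based validation of step 4, here is a toy Monte Carlo simulation. The replication factor, the majority-vote certification, and all parameters are assumptions for illustration; they are not the mechanisms studied in our work.

```python
import random

rng = random.Random(0)

N_WORKERS, N_JOBS, REPLICATION = 30, 200, 3   # hypothetical parameters
P_COLLUDER, P_CHEAT = 0.3, 0.5

# A fixed fraction of participants collude, i.e., they agree on a common
# wrong result that they return with probability P_CHEAT.
colluders = {w for w in range(N_WORKERS) if rng.random() < P_COLLUDER}

def returned_result(worker, job):
    if worker in colluders and rng.random() < P_CHEAT:
        return -job   # the colluding group's agreed-upon wrong result
    return job        # the correct result

errors = 0
for job in range(1, N_JOBS + 1):
    # Pull-based model: free workers request jobs; here each job is simply
    # replicated to REPLICATION distinct workers.
    workers = rng.sample(range(N_WORKERS), REPLICATION)
    results = [returned_result(w, job) for w in workers]
    certified = max(set(results), key=results.count)  # majority vote
    errors += certified != job

print(f"wrongly certified: {errors}/{N_JOBS}")
```

Even this toy model exhibits the tension between the two objectives: increasing REPLICATION lowers the certification error but increases resource usage.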

D. Significance

This research effort has several motivations. First, dealing with uncertainties has a broad range of applications, as there are many sources of uncertainty (imprecision, faults, sabotage). Second, this research domain is poised to play an important role as scale increases (more risks), as systems become more popular (drawing attackers' attention to them), and as model complexity increases in order to represent key aspects of computations (allowing for some imprecision or probabilistic behaviors). Lastly, we are confronted with challenging problems (the robustness and reliability problems are intractable, while collusion raises statistical issues) and most of the questions are open.

II. RELATED WORK

To the best of our knowledge, no work has globally analyzed the impact of uncertainty on distributed environments. Specific research efforts are described for each considered problem below.

When costs are considered to be random variables, as in the robustness problem, we first need to assess the robustness of any schedule. How to measure robustness is a subject that has not yet led to a widely accepted metric [1], [13], [4], and no work has been done on comparing them. Additionally, little work has been done on robust scheduling, and most existing proactive methods are based on the insertion of slack [11] and are not efficiency-oriented. Some works make use of specific theories (possibility theory [12], fuzzy logic [14]). However, we believe that these works cannot give the same insight as using random variables.

For the second problem, reliability, some work has been done on developing multi-criteria algorithms that produce reliable and efficient schedules [17], [15]. Restrictions are imposed on the models for tractability reasons. No complexity study exists on the problems obtained by relaxing these restrictions.

The last problem, collusion, which deals with cheating in desktop grids, has already been studied in [20], [19]. Various assumptions are made, such as quiz tasks (tasks used to check participants) being indistinguishable from regular tasks, or the result of a job being verifiable. Another work [18] can thwart collusive behavior without these assumptions. However, it has several drawbacks: it assumes that the detection algorithm is not known by the workers, and it has to wait until the completion of all jobs before certifying the results.

III. RESULTS

Significant progress has been made in comparing robustness metrics and in finding schedules that are both robust and efficient. More precisely, we propose different methods (heuristic and meta-heuristic), each offering a different trade-off between search cost and solution quality. Additionally, the meta-heuristic is proved to converge, namely, to eventually produce a subset of all optimal solutions. Some of these results have been published in [5], [6]. More recent results will appear in [9].

This first problem led us to develop contributions for solving two recurrent problems. The first consists in evaluating the distribution that results from applying addition and maximum operations to random variables. Although this problem has already been studied in operations research (when scheduling projects), no method offers an acceptable trade-off between efficiency and accuracy. We have proved the complexity class and developed two heuristics that estimate the final distribution (these heuristics are described in [7], [8]). The second recurrent problem concerns the multi-criteria part. We have elaborated a generic multi-criteria framework that describes a methodology for building multi-criteria greedy heuristics from a mono-criteria one.

Recall that the reliability problem deals with the evaluation and optimization of schedule reliability when duplication is permitted. We have characterized two kinds of failures and two kinds of replication operations, leading to four distinct problems. We have shown that the evaluation of the reliability is #P'-Complete for one of them [3]. Moreover, when input task graphs are restricted to chains, the reliability of an optimal schedule can be evaluated in polynomial time for one of these problems.

Finally, the collusion problem has been at the center of our attention recently, and we have separated our contributions into two parts. The goal of the first part is to detect and characterize colluding behaviors (which participants are cheating?) based on observations only. This means that no scheduling actions are taken and that our contribution can be plugged into any existing scheduling solution in order to assess the proportion of colluders, their likelihood to collude, etc. These contributions are described in [10]. Conversely, the objective of the second part is to schedule jobs in such a way that one of the returned results can always be certified as the correct one. Work on this part includes collusion avoidance and result certification mechanisms.

Interestingly enough, the problem of evaluating a distribution that results from addition and maximum operations performed on random variables [7], [8] is common to the robustness and the collusion problems (although in an easier form in the latter case).
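The heuristics of [7], [8] are not reproduced here, but a brute-force Monte Carlo baseline (an illustration under assumed distributions, not the published methods) shows what is being estimated: the distribution of a completion time obtained by additions along paths and maximums at join nodes of a small fork-join graph.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)

# Hypothetical fork-join task graph: t1 precedes t2 and t3, which both
# precede t4; all durations are assumed gamma-distributed.
def sample_completion():
    d = {t: rng.gammavariate(4.0, 2.5) for t in ("t1", "t2", "t3", "t4")}
    # Completion time of t4: additions along each path, a maximum where
    # the two paths join.
    return d["t1"] + max(d["t2"], d["t3"]) + d["t4"]

samples = sorted(sample_completion() for _ in range(100_000))
print(f"mean={mean(samples):.2f}  std={stdev(samples):.2f}")
# An empirical quantile can serve as a robustness-oriented statistic:
print(f"p95={samples[int(0.95 * len(samples))]:.2f}")
```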

Moreover, the multi-criteria framework, which is still being empirically evaluated, can be used for robustness as well as for reliability.

IV. OBJECTIVES

For each of the three studied problems, we describe the current objectives and the potential long-term perspectives.

As we feel that the study of the initial robustness problem is quite complete, our first concern is to complete and polish our contributions to the two sub-problems. Our heuristics for estimating the distribution that results from arithmetic operations on random variables need to be compared to similar mechanisms in the field of digital circuit optimization. We plan to summarize and compare significant heuristics for this problem in a detailed empirical study, together with a complete proof of the complexity class, for which existing works are incomplete. This is a major issue because there are numerous methods and existing works are hard to compare. Finally, as mentioned above, we are still evaluating a generic multi-criteria framework for building greedy heuristics. As a long-term perspective, we could add more uncertainty to the model. For instance, task graphs could have structures that are subject to runtime variations.

The reliability problem has many open questions. Although the most permissive model makes even the evaluation of the reliability of a solution intractable, it is unclear whether modified versions could admit polynomial optimization algorithms. We have finalized our main contribution, that is, the #P'-Completeness of the reliability evaluation of general schedules.

The final problem is collusion, which we still have to explore more deeply. The characterization provided in [10] will surely allow us to base our work on a solid foundation.

REFERENCES

[1] Shoukat Ali, Anthony A. Maciejewski, Howard Jay Siegel, and Jong-Kook Kim. Measuring the Robustness of a Resource Allocation. IEEE Transactions on Parallel and Distributed Systems, 15(7):630–641, July 2004.

[2] David P. Anderson. BOINC: A System for Public-Resource Computing and Storage. In Rajkumar Buyya, editor, GRID, pages 4–10. IEEE Computer Society, 2004.

[3] Anne Benoit, Louis-Claude Canon, Emmanuel Jeannot, and Yves Robert. On the Complexity of Task Graph Scheduling with Transient and Fail-Stop Failures. Technical Report 2010-01, LIP, ENS Lyon, France, January 2010. Available at graal.ens-lyon.fr/~yrobert.

[4] Ladislau Bölöni and Dan C. Marinescu. Robust Scheduling of Metaprograms. Journal of Scheduling, 5(5):395–412, September 2002.

[5] Louis-Claude Canon and Emmanuel Jeannot. A Comparison of Robustness Metrics for Scheduling DAGs on Heterogeneous Systems. In HeteroPar'07, pages 558–567, Austin, Texas, USA, September 2007.

[6] Louis-Claude Canon and Emmanuel Jeannot. Scheduling Strategies for the Bicriteria Optimization of the Robustness and Makespan. In 11th International Workshop on Nature Inspired Distributed Computing (NIDISC 2008), Miami, Florida, USA, April 2008.

[7] Louis-Claude Canon and Emmanuel Jeannot. Precise Evaluation of the Efficiency and the Robustness of Stochastic DAG Schedules. In 10ème congrès de la Société Française de Recherche Opérationnelle et d'Aide à la Décision (ROADEF), pages 13–24, Nancy, France, February 2009.

[8] Louis-Claude Canon and Emmanuel Jeannot. Precise Evaluation of the Efficiency and the Robustness of Stochastic DAG Schedules. Research Report 6895, INRIA, April 2009.

[9] Louis-Claude Canon and Emmanuel Jeannot. Evaluation and Optimization of the Robustness of DAG Schedules in Heterogeneous Environments. IEEE Transactions on Parallel and Distributed Systems, to appear.

[10] Louis-Claude Canon, Emmanuel Jeannot, and Jon Weissman. A Dynamic Approach for Characterizing Collusion in Desktop Grids. In 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Atlanta, Georgia, USA, April 2010.

[11] Andrew J. Davenport, Christophe Gefflot, and J. Christopher Beck. Slack-based Techniques for Robust Schedules. In Proceedings of the Sixth European Conference on Planning (ECP-2001), pages 7–18, Toledo, Spain, September 2001.

[12] Didier Dubois and Henri Prade. Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum Press, New York, 1988.

[13] Darin England, Jon Weissman, and Jayashree Sadagopan. A New Metric for Robustness with Application to Job Scheduling. In 14th IEEE International Symposium on High Performance Distributed Computing (HPDC-14), pages 135–143, July 2005.

[14] Hélène Fargier, Philippe Fortemps, and Didier Dubois. Fuzzy Scheduling: Modelling Flexible Constraints vs. Coping with Incomplete Knowledge. European Journal of Operational Research, 147(2):231–252, 2003.

[15] Emmanuel Jeannot, Erik Saule, and Denis Trystram. Bi-objective Approximation Scheme for Makespan and Reliability Optimization on Uniform Parallel Machines. In Emilio Luque, Tomàs Margalef, and Domingo Benitez, editors, Euro-Par, volume 5168 of Lecture Notes in Computer Science, pages 877–886. Springer, 2008.

[16] M. Granger Morgan and M. Henrion. Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge University Press, Cambridge, 1990.

[17] Erik Saule and Denis Trystram. Analyzing Scheduling with Transient Failures. Information Processing Letters, 109(11):539–542, 2009.

[18] G.C. Silaghi, Patrício Domingues, Filipe Araujo, Luís Moura Silva, and A.E. Arenas. Defeating Colluding Nodes in Desktop Grid Computing Platforms. In 22nd IEEE International Parallel & Distributed Processing Symposium (IPDPS), pages 1–8, Miami, Florida, USA, April 2008.

[19] Matthew Yurkewych, Brian Neil Levine, and Arnold L. Rosenberg. On the Cost-Ineffectiveness of Redundancy in Commercial P2P Computing. In Vijay Atluri, Catherine Meadows, and Ari Juels, editors, ACM Conference on Computer and Communications Security, pages 280–288. ACM, 2005.

[20] Shanyu Zhao, Virginia Mary Lo, and Chris GauthierDickey. Result Verification and Trust-based Scheduling in Peer-to-Peer Grids. In Germano Caronni, Nathalie Weiler, Marcel Waldvogel, and Nahid Shahmehri, editors, Peer-to-Peer Computing, pages 31–38. IEEE Computer Society, 2005.