YACS - Yet Another Computing Service

YACS - Yet Another Computing Service by Atli Thor Hannesson

A thesis submitted in fulfilment of the requirements for the Degree of Master of Science
from the Royal Institute of Technology (KTH)
Stockholm, Sweden, June 2009

VERSION 0.7

Examiner: Professor Vladimir Vlassov, School of Information and Communication Technology, Royal Institute of Technology
Supervisors: Joel Höglund and Konstantin Popov, Swedish Institute of Computer Science (SICS)

Abstract

Self-management technology is an area that has received much attention in recent years as a method of dealing with the ever-increasing complexity of computer systems. YACS is a distributed job execution service capable of self-management behaviour. This enables the service to work in the target environment of community grids, which are generally characterized by unstable membership and resources, making them complex. Functionality and self-management are achieved by building on Niche, a framework and infrastructure that facilitates and supports the creation, deployment and execution of self-managing distributed systems. YACS serves as an example of Niche's applicability for building such systems and as a source of feedback for Niche's continued development. In preparation for designing and building YACS, related literature was studied and a survey of other execution services conducted in order to acquire knowledge of required functionality, of trends in design and architecture, and of existing self-management capabilities, if any. Based upon this information YACS was designed and developed during an iterative development process. The result of this process is a robust execution service which shows self-healing of jobs, system components and services, and system self-configuration for maintaining resource availability. Finally, the service was evaluated for time overhead and robustness to churn with runs conducted on the Grid5000 cluster. From a functional perspective the results show that the service is capable of speeding up execution compared to the non-distributed case and that the cost of job management increases with job size. From a self-management perspective the results show that Niche enables self-management capabilities that react within configurable time bounds and have low time overhead. Furthermore, with Niche and self-management support the service shows tolerance to churn at a rate corresponding to a configurable timeout bound, i.e. configuration of the timeout bound affects what rate of churn the service can tolerate.


Acknowledgements

I am very thankful for the opportunity of working on this project and the experience I have gained from it. For this I would like to express gratitude to my examiner, Professor Vladimir Vlassov, for offering me this project and for guidance and helpful comments throughout its duration. I am similarly grateful to my supervisors at SICS, Joel Höglund and Konstantin Popov, for their invaluable support and guidance, and for arranging the facilities to which I have had access at SICS. I would also like to thank Ahmad Al-Shishtawy for his helpful comments.


Table of contents

Abstract ..... 2
Acknowledgements ..... 3
Table of contents ..... 4
List of figures ..... 5
List of tables ..... 6
1. Introduction ..... 7
1.1. Motivation ..... 7
1.2 Outline ..... 8
2. Methods ..... 9
2.1 Literature study ..... 9
2.1.1 Execution services ..... 9
2.1.2 Self-management ..... 9
2.1.3 Grid computing ..... 9
2.1.4 P2P ..... 9
2.1.5 Convergence of Grids and P2P ..... 9
2.1.6 Niche ..... 10
2.2 Survey of execution services ..... 10
2.3 Software development process ..... 10
2.3.1 Design ..... 10
2.3.2 Implementation ..... 11
2.3.3 Testing ..... 11
2.4 Evaluation ..... 11
3. Results ..... 12
3.1 Literature study ..... 12
3.1.1 Execution services ..... 12
3.1.2 Self-management ..... 12
3.1.3 Grid computing ..... 13
3.1.4 P2P ..... 15
3.1.5 Convergence of Grids and P2P ..... 16
3.1.6 Niche ..... 16
3.2 Survey of execution services ..... 18
3.2.1.1 Self-management scenarios ..... 18
3.2.1.2 Globus Toolkit ..... 19
3.2.1.3 Unicore ..... 24
3.2.1.4 gLite ..... 27
3.2.1.5 XtremWeb-CH ..... 31
3.2.1.6 BOINC ..... 35
3.2.1.7 P3 ..... 37
3.2.1.8 Condor ..... 40
3.2.1.9 Survey conclusion ..... 43
3.3 Design ..... 45
3.3.1 Design guidelines ..... 46
3.3.2 Design overview ..... 46
3.3.3 Job management ..... 50
3.3.4 System management ..... 56
3.3.5 Jobs, Tasks and the execution context ..... 61
3.4 Implementation ..... 64
3.5 Evaluation ..... 65
3.5.1 Step 1: Definition of an idealized job/work model ..... 67
3.5.2 Step 2: Pure job setup time ..... 67
3.5.3 Step 3: Distributed job execution ..... 67
3.5.4 Step 4: Distributed job execution with self-healing scenarios ..... 71
3.5.5 Step 5: Distributed job execution with self-configuration scenarios ..... 73
3.5.6 Step 6: Distributed job execution subjected to churn ..... 76
4. Discussion ..... 78
4.1 Niche experience ..... 78
4.1.1 Development support and self-management application model ..... 78
4.1.2 Runtime infrastructure ..... 79
4.2 Future work ..... 80
5. References ..... 81
Appendix A: Iterative system development process ..... 86
Appendix B: Programmer's manual ..... 87
Appendix C: Administrator's manual ..... 88
Appendix D: Javadoc ..... 89

List of figures

Figure 1: Schematic view of GTA4.0 components ..... 21
Figure 2: GRAM4 components ..... 22
Figure 3: UNICORE 6 architecture ..... 25
Figure 4: gLite architecture overview ..... 28
Figure 5: Workload manager internal architecture ..... 29
Figure 6: XtremWeb-CH architecture ..... 32
Figure 7: Architecture of the server ..... 32
Figure 8: BOINC server ..... 35
Figure 9: Organization of job management software and related peer groups ..... 38
Figure 10: The Condor kernel ..... 41
Figure 11: Example scenario illustrating the functional job and system management parts ..... 47
Figure 12: Self-management hierarchy for functional job and system parts ..... 49
Figure 13: Job management related components, groups and bindings ..... 53
Figure 14: Job management related event subscriptions ..... 54
Figure 15: The sequence of Worker self-healing steps ..... 55
Figure 16: System management related components, groups, bindings and subscriptions ..... 58
Figure 17: The sequence of ResourceService self-healing steps ..... 60
Figure 18: Illustration of system self-configuration steps to maintain availability ..... 61
Figure 19: Structure of responsibility, timeline and flow of information for jobs and tasks ..... 62
Figure 20: Jobs, tasks and execution context ..... 63
Figure 21: Distributed job execution with 10 workers and job breakup into 1-120 tasks ..... 69
Figure 22: Distributed job execution with 20 workers and job breakup into 1-120 tasks ..... 69
Figure 23: Distributed job execution with 30 workers and job breakup into 1-120 tasks ..... 70
Figure 24: Overhead trend as the number of tasks increase ..... 70
Figure 25: Execution overhead in worker healing scenario vs. no failure scenario ..... 72
Figure 26: Execution overhead in master healing scenario vs. no failure scenario ..... 73
Figure 27: Ratio of time speedup after self-configuration; model and actual result ..... 75
Figure 28: Comparison of 9-10 worker setups with respect to execution overhead ..... 75


List of tables

Table 1: Scenarios and self-management capabilities ..... 19
Table 2: Listing of job capabilities ..... 43
Table 3: Abbreviations used in evaluation results ..... 66
Table 4: Timing assumptions for the idealized job/work model ..... 67
Table 5: Job setup overhead ..... 67
Table 6: Run results and model for distributed job execution ..... 68
Table 7: Run results for the worker healing scenario ..... 71
Table 8: Run results for the master healing scenario ..... 72
Table 9: Results of jobless self-configuration scenario runs ..... 74
Table 10: Results of self-configuration scenario runs ..... 74
Table 11: Result of churn scenario runs ..... 77


1. Introduction

The use of distributed computer resources is a well established and mature process. A specific example of distributed resource utilization is the use of distributed computational resources for job execution. Many job execution and management systems already exist to enable such utilization, and other systems exist for utilizing other types of distributed resources. With an ever-increasing number of resources, such as cheap home computers and even mobile computing devices, as well as ubiquitous connectivity, the potential of such systems is greater than ever. However, with increased sophistication of functionality and increases in the number, distribution and heterogeneity of resources come many difficult challenges and problems (Ganek, Corbi, 2003). Examples of these challenges and problems are that distribution increases complexity and makes manual administration costly, difficult and sometimes even impossible due to organizational boundaries; that robustness can suffer due to generally less reliable resources; and that systems may not scale sufficiently to cope with the increased number of participants. To tackle some of these problems, methods of self-management have been applied (Ganek, Corbi, 2003; Ganek, 2007). This essentially means systems which can manage themselves without human intervention. Self-management behaviours are often categorized into self-configuration, self-healing, self-optimization and self-protection (Ganek, Corbi, 2003; Ganek, 2007). A notable self-management approach is proposed in the autonomic computing initiative. The proposal is to mimic the human autonomic nervous system with computer systems capable of similar autonomous behaviour. One way to implement this is to use control loops which autonomously sense the system state, analyze the state and, if needed, take action to maintain it within configured bounds (Ganek, Corbi, 2003; Ganek 2007; Al-Shishtawy et al, 2008).

The main task of this thesis project was to build a distributed job execution service that shows self-management capabilities. The service is supposed to be able to function in a community grid setting, where the members include domestic users, small organizations, small institutions and small enterprises. Compared to larger institutional and enterprise grids, membership in community grids is much less stable: members join and leave unpredictably, their contributed physical resources are of lower quality and capability, and they are often without any administrative attention. In this setting it is particularly important that the service is capable of self-healing, to deal with failures resulting from the instability, and of self-configuration, to deal with changes in load and membership. The service built to do this is called Yet Another Computing Service (YACS) and was developed using Niche, a distributed component management service. Niche is meant to facilitate the process of developing self-managing distributed applications and to provide an infrastructure in which they run (SICS, KTH, INRIA, 2009). Niche is being developed by SICS, KTH, and INRIA within the Grid4All project. The Grid4All project is a European Union funded project aimed at enabling individuals, smaller companies and institutions to connect and share resources in these community grids (Grid4All Consortium, 2009). YACS can serve as an example of Niche's applicability for building and running distributed and self-managing systems, and as a source of feedback for its continued development.

1.1. Motivation

The main motivation behind the design and implementation of a realistic execution service meant for community grids using Niche was to evaluate Niche's applicability for designing, building and providing runtime support for such types of services. In particular, the aim was to evaluate whether the support Niche offers for self-management enables the service, at runtime, to continue to function and to adapt despite failures, churn and heterogeneous resources. These abilities are critical for community grids, which are characterized by unstable and heterogeneous resources.

1.2 Outline

The report is structured according to the IMRAD format. The current chapter 1 is an introduction to the context and the purpose of the thesis project. Chapter 2 explains the methodology used during the project for the study of literature and related work, the development process and the evaluation. Chapter 3 is the longest section of the report and contains the results of the literature study and the survey of related work. Building upon the study results is the design of YACS, which comes next. The evaluation scenario setup and scenario results mark the end of chapter 3. The main content of the report ends in chapter 4 with a concluding discussion on the meaning of the results. The chapter also contains a listing of possible future work. A chapter of references, in the Harvard format, and the appendices round off the report.


2. Methods

This section describes the methodology used during the course of this project. Work progressed in a logical sequence of steps, from background reading and information gathering to design, implementation and finally evaluation, each building upon previous steps. Each of the steps is described, i.e. what was done, how and why.

2.1 Literature study

The first step of any academic work is to study relevant literature and related work. This project is no exception and work began with many weeks of studying literature and related work. This formed a solid knowledge base upon which the YACS design and implementation were based. The starting point for the study was the project description and directions from supervisors. As the study progressed, knowledge and experience were gained and the study became more focused on the most relevant information. The following sections contain the important and relevant categories of information and explain why they are relevant and were chosen. More detailed information is in chapter 3.

2.1.1 Execution services

The subject of the thesis project was to build an execution service and therefore a detailed understanding of what execution services are and do was necessary. This section covers them from a general perspective. Section 3.2, a survey of execution services, covers service examples in detail. These examples include services based in general grid computing infrastructures, in more dedicated and specially designed execution service infrastructures, as well as services built on P2P infrastructures.

2.1.2 Self-management

Autonomic computing and self-management are receiving increasing attention as systems become larger, more complex and more difficult and costly to administer. YACS was to display self-management capabilities, and as a consequence detailed information about self-management was necessary.

2.1.3 Grid computing

Grid computing is a widely used method of sharing and utilizing distributed resources, and many grid middlewares contain execution service capabilities. Studying their use of distributed resources and, in particular, their execution capabilities was necessary for YACS.

2.1.4 P2P

YACS is built upon Niche, which is in turn built upon DKS, which is a P2P overlay. Furthermore, some of the related execution services surveyed are P2P based. Understanding the strengths and weaknesses of P2P technology was important for YACS as it directly affects the design of the system.

2.1.5 Convergence of Grids and P2P

Following up on the study of grid computing and P2P infrastructures for sharing resources comes a study of what has been called the convergence of these distinct infrastructures into a common one. This study covers broader resource usage than just the computational resources that YACS enables. However, it is relevant because it gives insight into some of the advantages and problems of grid and P2P infrastructures and therefore gave input to the design of YACS.

2.1.6 Niche

As mentioned before, YACS is built upon Niche. Niche is a framework for building and running distributed self-managing applications. Due to YACS's dependence on Niche, detailed understanding of the framework is naturally required.

2.2 Survey of execution services

Following up on the literature study, a more specific survey of related execution services was performed. As one of the main subjects of the project is an execution service, this was necessary to gain a more concrete understanding of what functionality execution services provide and how they provide it. Another main subject is self-management. For the purpose of examining the services from this perspective a framework of self-management scenarios was defined. Sources of information were papers, manuals and relevant websites.

In order to gain understanding of what functionality execution services provide, each of the selected services was examined from the standpoint of which features and capabilities they offer, such as how users interact with the service and what kind of job support they offer. Another perspective was design and architectural issues. In order to gain understanding of self-management behaviours, each service was examined within a framework of self-management scenarios. This framework was composed of a set of scenarios chosen in light of YACS's prospective deployment environment, i.e. an unstable environment characterized by frequent churn. The scenarios were therefore focused on managing departures and adapting to resource availability. The framework covered the services from three standpoints and looked for the self-management capabilities, i.e. healing, optimization, configuration and protection. Since the main role of an execution service is obviously execution, the first standpoint chosen was the execution of individual jobs. Execution of jobs needs some system infrastructure and services, so self-management capabilities of system components formed the second standpoint. Finally, protection was addressed by looking for security measures. Subsequent requirement specification, design and architectural decisions made for YACS were mostly based on information gathered in this survey.

2.3 Software development process

Following the literature study and survey of related execution services, the actual software development process for YACS was started. This process began with a requirement specification and a preliminary design. These two artefacts were broken up and assigned to iterations. The reason for doing iterative design and development is that it breaks the task up into smaller and more manageable parts. Another reason is that after each step there exists a running, tested version of the system and problems can be identified earlier than if all testing were performed at the end of the project. To help keep a clear and consistent focus throughout the iterations and prevent them from branching into different directions, the preliminary design was made beforehand and used as a guideline.

2.3.1 Design

Each iteration was assigned a number of requirements. The design process was aimed at fulfilling those requirements and was done by defining appropriate system components.


2.3.2 Implementation

Implementation is aimed at realizing the design by coding and/or modifying appropriate system components. The project was required to be implemented in Java, using Niche and its built-in Fractal component model.

2.3.3 Testing

All iterations and fulfilment of requirements were meant to be validated by formal testing. This entailed designing a test scenario for every requirement that described how to test it and what the expected outcome was to be. Formal testing was performed for the first two iterations but, due to time constraints, the third iteration went straight into production and actual use during the SICS Open House demonstration day. During this day visitors were able to see YACS in actual use through a distributed transcoding application called gMovie.

2.4 Evaluation

Evaluation is meant to determine the quality of the system. For YACS the notion of quality was defined as the time overhead of using the service, most specifically the overhead involved in self-management scenarios. A further notion was the service's robustness towards churn. For this purpose an idealized model of a job was defined: a job where every single wall-clock second is spent doing productive work, i.e. without overhead for setup or administrative work. This model formed a baseline to which distributed executions of that same job in a static YACS were compared, "static" meaning that YACS did not need to show any self-management behaviour during those executions. In turn, these static distributed executions formed a baseline to which distributed executions with self-management behaviour included, e.g. self-healing and self-configuration, were compared. The final step of the evaluation was a simulation of a realistic scenario in which the system was subjected to increasing frequencies of churn to see at which point the system failed, and why. This type of evaluation is common for distributed systems.
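One plausible way to write down this baseline and the overhead measure derived from it (the symbols below are my own shorthand, not notation taken from the idealized job/work model of section 3.5.1):

    T_ideal(n) = W / n
    Overhead(n) = T_measured(n) - T_ideal(n)

where W is the total productive wall-clock work contained in the job, n is the number of workers, T_ideal(n) is the completion time under perfect division of work with no setup or management costs, and T_measured(n) is the completion time actually observed in a run. The results for the static, self-healing and self-configuration runs can then be read as values of Overhead(n) against the respective baselines.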


3. Results

This chapter forms the main part of the report and is a sequential description of the most significant steps taken during project work and the artefacts produced during those steps. First comes the study of literature and related work, which produced the knowledge base upon which the following work is based. The final design of YACS is described thereafter. Finally, an evaluation of YACS is presented.

3.1 Literature study

This section presents general information which is relevant to YACS and has influenced its design and implementation. It sets the context for a more specific study of related execution services in section 3.2. The section begins by explaining what execution services do. YACS is a self-managing execution service, so next comes a study of self-management. Then comes an overview of grid computing, which is one form of distributed resource sharing. Following is a study of P2P, which is another form of resource sharing and on which Niche is built. YACS is built upon Niche, so a study of this framework is also presented.

3.1.1 Execution services

Use of distributed computation resources for job execution is an established process. An example of mature technology in this field is Condor, the first version of which was released in 1986 (Condor Team, 2009b). Countless other systems exist. Grid computing middleware like Globus Toolkit, UNICORE and gLite offer job submission and management, as well as many other services (Foster, 2005; UNICORE Forum, 2009b; Burke, 2008). Others include less general systems like XtremWeb, BOINC and a P2P based service called P3 (XtremWebCH, 2009b; University of California, 2009; Shudo, Tanaka, Sekiguchi, 2005). The services are many and their capabilities differ. The common characteristic is a functional interface providing the ability for job submission; uploading of required executables, data and initialization parameters after job acceptance; job monitoring and management during the job's lifetime within the service; and retrieval of job results after job completion. Jobs are generally described in a job definition language and often contain requirements towards eventual execution resources, such as operating system type, minimum memory or specific CPU instruction sets. Descriptions of the executable, initialization and data input and output stages are also generally provided. Most services support jobs containing collections or bags of independent subtasks. Some support interdependent tasks and conditional workflows like directed acyclic graphs. A message passing interface (MPI) for task cross-communication is often supported. Full checkpointing of task state is supported in Condor's standard execution environment (Condor Team, 2009b). Section 3.2 studies a sample of services in detail for a closer look at functional, design and architectural trends. However, it is immediately clear that YACS needs to support the basic functionality of job submission, task execution and result retrieval.
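To make that common functional interface concrete, the sketch below shows one hypothetical shape it could take in Java. The names (ExecutionService, JobHandle and so on) are illustrative assumptions and do not correspond to the API of YACS or of any of the surveyed systems.

    // Hypothetical sketch of the common execution-service interface described above.
    // Names and signatures are illustrative only.
    import java.nio.file.Path;
    import java.util.List;

    public interface ExecutionService {

        /** Submit a job described in some job definition language; returns a handle for later interaction. */
        JobHandle submit(String jobDescription);

        /** Upload required executables, input data and initialization parameters after job acceptance. */
        void upload(JobHandle job, List<Path> executablesAndData);

        /** Monitor and manage the job during its lifetime within the service. */
        JobStatus status(JobHandle job);

        void cancel(JobHandle job);

        /** Retrieve results after job completion. */
        List<Path> retrieveResults(JobHandle job, Path targetDirectory);

        /** Opaque, service-specific job identifier. */
        interface JobHandle {}

        enum JobStatus { QUEUED, STAGING, RUNNING, COMPLETED, FAILED }
    }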

3.1.2 Self-management

Self-management is an area of technology that has received increased attention since IBM launched its autonomic computing initiative (Ganek, Corbi, 2003) to tackle the ever-increasing complexity of computing systems. Examples of where this complexity stems from are the increased sophistication of systems; increased integration among heterogeneous systems; the explosion in the number and power of computing devices following Moore's Law; and increased connectivity due to the growth in device numbers and the introduction of ubiquitous high-speed broadband and Wi-Fi. Autonomic computing is modelled after the autonomic nervous system of the human body, where critical procedures such as heart rate are controlled without conscious attention (Ganek, Corbi, 2003; Ganek, 2007). Mapping this behaviour onto computing systems envisions systems capable of controlling their own procedures and reacting to changes in their environments without requiring constant supervision and direction by administrators, i.e. showing autonomic self-managing behaviour. Administrators are at most required to set high-level goals (Parashar, Salim, 2007), which the system itself finds ways of fulfilling. This self-management behaviour can be grouped into four categories (Ganek, Corbi, 2003; Ganek, 2007):

1. Self-configuration: automatically changing system deployment or architecture to adapt to changes in the environment, for example by looking for and integrating extra resources in times of high load.
2. Self-healing: automatic detection, and even prevention, of faults and subsequent steps to fix damage done.
3. Self-optimization: automatic monitoring of resource utilization and changing of utilization patterns in order to maximize utility.
4. Self-protection: automatic detection of malicious behaviour and taking steps to remove the threat or limit its impact.

It has been suggested that control loops are an appropriate model for implementation of self-management behaviour. A control loop normally entails the following steps: monitoring system state through sensors; analyzing the sensed state for deviations from goals; deciding upon a plan to fix the deviation, if any; and actuation to realize the plan for bringing about change in the system (Ganek, Corbi, 2003; Ganek 2007; Al-Shishtawy et al, 2008). YACS needs to follow control loop patterns to show self-management behaviours appropriate for the prospective setting of unstable, heterogeneous and poorly managed community grid resources. This includes both self-healing to fix immediate problems caused by component failures, and self-configuration for longer-term adaptation to changes in component availability and load.
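As a minimal sketch of such a control loop, the Java fragment below senses the number of live workers and actuates when it drops below a configured goal. The Sensor and Actuator interfaces, the threshold policy and the polling period are assumptions made for illustration; this is not the loop used in Niche or YACS.

    // Minimal monitor-analyze-plan-actuate loop; an illustrative sketch, not Niche or YACS code.
    import java.util.concurrent.TimeUnit;

    interface Sensor   { int observedWorkerCount(); }       // monitoring: sense current system state
    interface Actuator { void deployAdditionalWorker(); }   // actuation: change the system

    class SelfConfigurationLoop implements Runnable {
        private final Sensor sensor;
        private final Actuator actuator;
        private final int minWorkers;                        // the configured goal / bound

        SelfConfigurationLoop(Sensor sensor, Actuator actuator, int minWorkers) {
            this.sensor = sensor;
            this.actuator = actuator;
            this.minWorkers = minWorkers;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                int observed = sensor.observedWorkerCount();     // monitor
                if (observed < minWorkers) {                     // analyze: deviation from the goal?
                    int missing = minWorkers - observed;         // plan: how much to add
                    for (int i = 0; i < missing; i++) {
                        actuator.deployAdditionalWorker();       // actuate
                    }
                }
                try {
                    TimeUnit.SECONDS.sleep(5);                   // re-evaluate periodically
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();          // stop the loop when interrupted
                }
            }
        }
    }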

3.1.3 Grid computing

Grid computing is a technology area concerned with large-scale sharing of resources between a dynamic set of participants (Foster, Kesselman, Tuecke, 2001; Coulouris, Dollimore, Kindberg, 2005). These participants can in theory be everything from individual users to large institutions or enterprises, and together they form so-called virtual organizations (Foster, Kesselman, Tuecke, 2001). The term virtual comes from the fact that both boundaries and hardware and software differences between participants are hidden by middleware services, making resources appear uniformly accessible. Resource sharing in the form of computational power is, in the context of YACS, the most interesting form, but other types of shared resources are, for example, storage and networking. Considerable effort has been put into infrastructure services to control who is granted access to those resources and how much, as well as into accounting, scheduling, and general quality of service.

These requirements come from large-scale users like large academic or research institutions and corporate users (Foster, Iamnitchi, 2003). Grid architecture has been described in general terms as a layered architecture of fabric, connectivity, resource and collective layers (Foster, Kesselman, Tuecke, 2001). The fabric layer interfaces between heterogeneous local resources and low-level grid operations; the connectivity layer connects those resources; the resource layer provides higher-level management of, and information regarding, individual resources; the collective layer spans multiple resources to provide the highest-level services in the service stack, such as directories, scheduling, reservation and accounting.

3.1.3.1 OGSA

Open Grid Services Architecture (OGSA) is an effort by the Open Grid Forum (OGF) for the standardization of infrastructure services to ease middleware implementation and to enable cooperation among different middleware implementations. The architecture is a service-oriented architecture which is exposed through Web service standards (Open Grid Forum, 2006). The use of the basic Web services standards like WSDL and SOAP stems from wide industry adoption of those standards. This architecture is to realize a set of capabilities generally needed for interoperability, cooperation and sharing of resources. Those service capabilities are:

• Infrastructure services: base functionality that other services can be expected to build upon. This includes implementation of web services standards, naming, basic security, notifications and more.
• Execution management services: the submission and management of jobs for execution.
• Data services: data management services for data transfer, update, access, data location and so on.
• Resource management services: services for actuation and management related to actual resources or infrastructure services, e.g. restarting or reconfiguring, and also for resource reservation and monitoring.
• Security services: for management and enforcement of security policies.
• Self-management services: for issues related to self-management of resources and services. This could entail setting high-level goals which the infrastructure would orient itself to fulfil. This is in accordance with the discussion on self-management in 3.1.2.
• Information services: for information about architecture elements such as physical resources, discovery, notifications, logging and more.

By now a number of middlewares implement many of those capabilities, for example the Globus Toolkit and gLite. See section 3.2 for a study of those systems.

3.1.3.2 BES & JSDL

The Basic Execution Service (BES) specification is a standard for job submission into execution services. It describes web service interfaces through which basic standard jobs can be submitted, monitored and managed (Open Grid Forum, 2007). An administrative interface to monitor and manage the execution service itself is also specified. The service accepts a job described in the Job Submission Description Language (JSDL). This language specifies a single job, i.e. the data and executable required as input, the requirements towards the execution environment, and the data expected as output (Open Grid Forum, 2005).

3.1.3.3 Concluding remarks

Although grid computing has a considerably wider scope than resource sharing for job execution alone, it is worth studying in detail, which is done in section 3.2. A particular lesson learnt is the strong requirement for sophisticated infrastructure, i.e. an infrastructure of controlled, monitored resource usage and services used to maintain sufficient quality of service and fulfil service level agreements. These requirements are generally met by placing management in centralized and dedicated service entities. Centralized entities may be viable in a traditional grid setting because large participants like academic and professional organizations provide resources which are in general stable and powerful (Foster, Iamnitchi, 2003), more so than those of community grid participants. Even so, finding a way to offer sophisticated services that facilitate execution yet work in an unstable environment should be a goal for YACS.

3.1.4 P2P

Peer-to-Peer (P2P) systems are distributed systems with inherent equality in that every participant can assume the role of both client and server. In other words, every participant both uses and contributes resources to the overall pool of system resources. In a setting of such a high degree of distribution, ad-hoc functionality and potentially high churn, the participants must have some way of discovering and communicating with the resources they need. In later generation systems this is generally done without centralized directories, by building self-organizing overlays on top of the network. These overlays come in two main forms, unstructured and structured. In unstructured overlays peers try to discover resources by querying their neighbours. These neighbours in turn query their neighbours, subject to some constraints. This method is very robust but costly in terms of message count and does not guarantee that queries are fulfilled. An example of an unstructured overlay is Gnutella. The other form is structured overlays where, as the name implies, there is a more defined structure of responsibility. These are generally implemented as distributed hash tables (DHT) where the location of a named resource can be efficiently looked up and routed to. To improve robustness and eventual routing success these implementations usually run overlay maintenance procedures to deal with membership churn (Schoder, Fischbach, Schmitt, 2005; El-Ansary, Haridi, 2005; Coulouris, Dollimore, Kindberg, 2005). Examples of structured overlays are Chord, Pastry, and DKS. DKS is a fully decentralized structured overlay (Alima et al, 2003) that is particularly notable for this thesis project since the current implementation of Niche is built on top of it.

P2P systems are generally more scalable than traditional client-server systems due to the higher distribution of responsibilities and higher overall resource utilization. Distribution of responsibilities, i.e. no single points of failure, and the ability to self-organize also make them, in general, more robust. These properties make P2P a viable approach for large-scale systems and systems characterized by frequent churn. YACS directly benefits from those properties. P3 is an example of an execution service, discussed in 3.2, which is implemented on top of a P2P infrastructure. Aside from being interesting in that light it is also notable for not using any system-wide centralized components (Shudo, Tanaka, Sekiguchi, 2005).
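The core DHT idea of mapping named resources to responsible nodes can be pictured with the small sketch below. It is my own simplification: the ring is kept in one ordered map, whereas real overlays such as Chord, Pastry and DKS locate the responsible node in O(log N) routing hops using per-node routing tables.

    // Simplified sketch of DHT-style key responsibility on an identifier ring.
    // Real structured overlays resolve the responsible node by routing rather than
    // by keeping the whole membership in one map; this only illustrates the mapping.
    import java.util.Map;
    import java.util.TreeMap;

    class IdentifierRing {
        private static final long RING_SIZE = 1L << 32;               // identifier space [0, 2^32)
        private final TreeMap<Long, String> nodes = new TreeMap<>();  // node id -> node address

        private long hash(String name) {
            return (name.hashCode() & 0xffffffffL) % RING_SIZE;       // map names into the id space
        }

        void join(String nodeAddress)  { nodes.put(hash(nodeAddress), nodeAddress); }
        void leave(String nodeAddress) { nodes.remove(hash(nodeAddress)); }

        /** The node responsible for a key is its clockwise successor on the ring (assumes at least one node). */
        String lookup(String resourceName) {
            long key = hash(resourceName);
            Map.Entry<Long, String> successor = nodes.ceilingEntry(key);
            return (successor != null ? successor : nodes.firstEntry()).getValue();   // wrap around the ring
        }
    }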


3.1.5 Convergence of Grids and P2P

Previous sections have covered both grid computing and P2P technology. Both fields are fundamentally concerned with sharing of resources and cooperation, but have a different focus. Foster and Iamnitchi (Foster, Iamnitchi, 2003) describe how grid infrastructures are concerned with strong and sophisticated system services to provide service level guarantees and general quality of service. Grid infrastructures have not been as concerned with robustness in unstable environments or scalability for massive numbers of users. On the other hand, P2P infrastructures do not offer as sophisticated system services but are more scalable and robust. Foster and Iamnitchi suggest that these fields will eventually converge into an infrastructure that can handle a very large number of users and frequent churn, i.e. is scalable and robust, and also provides general interoperability and sophisticated system services. Considering sophisticated system services, it would seem that leveraging P2P scalability and robustness together with higher-level self-management to build system services that can withstand failures and adapt to environment changes is a step towards convergence. The context is general resource sharing infrastructures, which are much broader than the dedicated job execution infrastructure that YACS provides. Yet, it is interesting to see if using Niche's P2P properties and support for self-management makes it viable for YACS to build those robust, scalable and sophisticated system services that might be needed, e.g. for resource discovery.

3.1.6 Niche

Niche, formerly called DCMS, is a distributed component management service and a framework for building self-managing applications, and is being built by teams from SICS, KTH and INRIA in the context of the Grid4All project. YACS is in turn built using, and run with the help of, Niche. YACS's dependence on Niche requires a thorough understanding of its workings, capabilities and constraints. The motivation for self-managing applications is well established by now. Niche facilitates designing and building such systems with a self-management application model, an API and a runtime infrastructure for running such systems (SICS, KTH, INRIA 2009).

3.1.6.1 Niche and the Fractal component model

Niche manages component-based applications and services and uses the Fractal component model. Fractal is a component model in which software logic is contained within components, exposed through well-defined interfaces and communicated with through bindings to those interfaces. Aside from user-defined interfaces, Fractal defines four types of special interfaces, called attribute, binding, content and lifecycle controllers. These controllers enable component reconfiguration, changing communication channels through bindings, composite component creation through component content change, and starting and stopping execution. These features are of use for Niche and self-management. In the context of self-management control loops, runtime reflection and introspection can play a part in sensing. After analyzing state, needed actuation can be performed with runtime deployment, starting, stopping, binding and reconfiguration.

Niche has furthermore extended the Fractal component model with a number of useful features. Most notable are component groups and group communication. Apart from being useful in general, groups are particularly useful for unstable systems as the entity communicating with the group does not need to concern itself with group membership changes. The group communication possibilities are send-to-all and send-to-any, which sends to a random component in the group. In particular, the send-to-all method is useful for replication schemes, e.g., given appropriate ordering of messages, to build distributed state machines. Send-to-any is particularly useful for scalability as this method randomly distributes message handling to the group members. Depending on the quality of the random generator, the distribution should be spread evenly over all members (SICS, KTH, INRIA 2009).

3.1.6.1.2 Niche's self-management application model

A Niche-supported application is usually split into two parts, a functional part and a self-management part. The functional part contains Fractal components with logic implementing the functional role of the application. The self-management part contains Fractal components called management elements which monitor the functional part, analyze it and take action to manage it. The management elements are either application-specific or Niche-provided, e.g. built-in failure sensors. Application-specific management elements are hierarchically categorized into three types: watchers, aggregators and managers.

• Watchers watch specific parts of the functional part through sensors.
• Aggregators collect sensory information from one or more watchers to form a wider view of the system.
• Managers sit at the top of the hierarchy and gather information from one or more aggregators to form the most global view of the system, perform end analysis and decide on whether actuation to change the system is needed.

Together these management elements display control-loop-like behaviour of sensing, analysis and actuation to realize self-management of various forms (SICS, KTH, INRIA 2009).

3.1.6.1.3 Niche's runtime infrastructure and properties

On each connected physical node a Niche process is active which provides runtime services to components deployed on that node. This includes general services like communication as well as services for self-management like sensing and actuation. In Niche all architecture elements are identified with a unique and network-transparent identifier. Network transparency is important as it relieves the application programmer of having to worry about lower-level issues like hosting and location responsibilities, routing and communication. In the case of functional components the communication is through the interfaces and bindings explained previously. In the case of management elements the communication is usually through event triggering and subscriptions to those events. In terms of scalability, Niche tries to balance load by distributing management elements evenly over connected nodes and can subsequently relocate these elements should network changes require it. However, should there be efficiency constraints, Niche also enables controlled placement of management elements with their respective functional components for more efficient communication. For robustness and reliability of the self-management part, Niche offers transparent replication of management elements and near-transparent restoration of those elements after failure. This is implemented by Niche delivering relevant communication messages to all replicas, although it does not guarantee the delivery order. Notable is that the current Niche implementation is built using the DKS overlay. Among other things, what Niche receives from there are properties of self-healing and self-organization for robustness towards churn. These are important properties for YACS due to the unstable resources in its prospective running environment (SICS, KTH, INRIA 2009).
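The watcher/aggregator/manager hierarchy can be pictured with the following sketch. The interfaces, the event type and the chaining are hypothetical placeholders of mine and do not correspond to the actual Niche API, which wires these elements together through events and subscriptions as described above.

    // Sketch of a watcher -> aggregator -> manager chain; all names are placeholders, not the Niche API.
    import java.util.ArrayList;
    import java.util.List;

    class ComponentFailureEvent {                      // a sensed event about the functional part
        final String componentId;
        ComponentFailureEvent(String componentId) { this.componentId = componentId; }
    }

    interface Watcher    { void onSensorEvent(ComponentFailureEvent event); }          // watches one part of the system
    interface Aggregator { void onWatcherReport(ComponentFailureEvent event); }        // merges reports from several watchers
    interface Manager    { void onAggregatedView(List<ComponentFailureEvent> view); }  // global analysis and actuation

    class WorkerWatcher implements Watcher {
        private final Aggregator aggregator;
        WorkerWatcher(Aggregator aggregator) { this.aggregator = aggregator; }
        public void onSensorEvent(ComponentFailureEvent event) {
            aggregator.onWatcherReport(event);         // forward the local observation upwards
        }
    }

    class FailureAggregator implements Aggregator {
        private final Manager manager;
        private final List<ComponentFailureEvent> pending = new ArrayList<>();
        FailureAggregator(Manager manager) { this.manager = manager; }
        public void onWatcherReport(ComponentFailureEvent event) {
            pending.add(event);                        // build a wider view from several watchers
            manager.onAggregatedView(new ArrayList<>(pending));
            pending.clear();                           // hand the batch over once reported
        }
    }

    class SelfHealingManager implements Manager {
        public void onAggregatedView(List<ComponentFailureEvent> view) {
            for (ComponentFailureEvent failure : view) {
                // decide and actuate, e.g. request redeployment of the failed component
                System.out.println("Restoring component " + failure.componentId);
            }
        }
    }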


3.2 Survey of execution services

In this section a representative collection of existing services, schedulers and related technologies is surveyed to determine a common set of execution service functionalities, to reflect on architectural highlights and to analyze their behaviour under certain scenarios that are believed relevant to execution services and to systems characterized by frequent changes in member composition, as many distributed systems are. These scenarios provide a framework to analyze how these services react in those circumstances. Most importantly, they show the type of management required to react in those scenarios, i.e. whether the system can self-manage or needs help from external sources, such as administrative staff. Lessons learnt from studying those services and their behaviour helped guide the design of YACS, and as such provide realistic input for further work on Niche.

3.2.1.1 Self-management scenarios

The following scenarios are chosen because they are relevant for the functionality of execution services and are relevant in the environment in which YACS has to function, i.e. an environment of network-connected computers, characterized by dynamic churn of resources, and in many cases without the benefit of a stable infrastructure to run community services. This setting is most appropriate for systems such as community grids, like those that Grid4All is targeting. It also applies to grids such as enterprise grids, despite their more stable and structured environment of reliable nodes and other infrastructure, as they are increasing in size and distribution, which makes them harder to administer, increases the likelihood of failures and increases the risk of centralized, grid-wide services becoming bottlenecks (Foster, Iamnitchi, 2003). The choice of scenarios reflects that the main functionality aimed for in YACS is job execution, hence the job scenarios. Furthermore, to be able to perform jobs a system software and service infrastructure must be in place, hence the system scenarios.

3.2.1.1.1 Job resource departure

At the micro level, how is job processing affected when job resources depart? Most importantly, how is the job or a sub-task of it restored? At the macro level, how is the entire system affected? Do measures have to be taken to optimize the total, system-wide pool of resources?

This scenario affects self-configuration, self-healing and self-optimization.

3.2.1.1.2 Job resource – high availability

At the macro level, how does the system react when there is high availability of resources? Does it scale back on resource reservations to ease demands on underlying hardware and to reduce the amount of system management that needs to be performed?

This scenario affects self-configuration and self-optimization.

3.2.1.1.3 Job resource – low availability

Similar to the high availability scenario, how does the system react when there is resource scarcity? Does it contact the underlying resource layer in search of further resource reservations? Does it attempt to control the demand from processing entities, for example by giving preference to certain types of jobs?


This scenario affects self-configuration and self-optimization.

3.2.1.1.4 System resource departure

What happens to the system if resources which have been running system services crash, or depart voluntarily? Does the system re-deploy that service autonomously and independently? Does it require manually set up rules to fix? Does it need direct and manual administration by administrative staff to restore functionality?

This scenario affects self-healing.

3.2.1.1.5 System resource high load

Can the system adapt system-wide services to high load? For instance, can it replicate these services to distribute the load? If so, what are the consistency demands between those replicas?

This scenario affects self-configuration and self-optimization.

3.2.1.1.6 System resource low load

Does the system perform any changes if activity within it is low? If there are system service replicas, is the number of those reduced to lower management costs?

This scenario affects self-configuration and self-optimization.

3.2.1.1.7 Malevolent resources

Does the job processing functionality detect and react if submitted jobs are illegal? Does the system detect and react if unauthorized nodes are running in the system? Does the system detect if job processing groups are taking up unfair quantities of resources?

This scenario affects self-protection.

3.2.1.1.8 Scenario summary

Scenario                          | Management scope | Configuration | Healing | Optimization | Protection
----------------------------------+------------------+---------------+---------+--------------+-----------
Job resource departure            | Job level        |               |    x    |              |
                                  | System level     |       x       |         |      x       |
Job resource - high availability  | System level     |       x       |         |      x       |
Job resource - low availability   | System level     |       x       |         |      x       |
System resource departure         | System level     |               |    x    |              |
System resource high load         | System level     |       x       |         |      x       |
System resource low load          | System level     |       x       |         |      x       |
Malevolent resource               | Job level        |               |         |              |     x
                                  | System level     |               |         |              |     x

Table 1: Scenarios and self-management capabilities

3.2.1.2 Globus Toolkit

3.2.1.2.1 Overview

The Globus Toolkit, made by the Globus Alliance, is currently probably the most influential and widely used grid computing toolkit (Globus Alliance, 2009a). It offers a set of components that can be used to enable standardized access to computer resources of varying sizes, types and capabilities. The latest stable version is 4.2.1.


3.2.1.2.2 Main features and capabilities

The Globus Toolkit is built as a service-oriented architecture based mostly on web services. It offers a set of infrastructure services that are commonly used by applications, such as information services, data services and security services, and it provides many of the capabilities presented in the OGSA specification (Globus Alliance, 2009b). It also offers a number of containers, in Java, C and Python, for hosting client-built services, which in turn can use the built-in infrastructure services. The built-in infrastructure services that Globus offers mainly fall into four categories: execution management, data management, monitoring and discovery, and security. These services can be used through command line utilities and through programmatic access.

The most directly relevant category for YACS is the execution management that Globus offers. It includes an execution service which can handle the entire lifecycle of jobs that clients submit to it, i.e. initiation, data stages, scheduling, execution, monitoring, management and termination. Job submissions, through either command line or client API, are by default done by supplying a job description made in a job description language (JDL). The service which provides this interface is called Grid Resource Allocation and Management (GRAM). This service has a built-in scheduler, called the fork scheduler, but more commonly it delegates to external schedulers for actual job processing, and as such supports and provides a unified interface to those schedulers. Supported schedulers are Condor, PBS, Torque and LSF (Foster, 2005).

The job description language (JDL) is used to specify the details required to perform the job. These include the name of the executable, arguments, file stages, requirements on processing resources and many more. The JDL can also be used to convey so-called multi-jobs, which are collections of jobs. Jobs within a collection can be either independent or require some synchronization and/or rendezvous mechanism (Foster, 2005). Furthermore, message-passing interface (MPI) jobs are supported (Globus Alliance, 2009c). Globus supports the OGSA-BES specification and correspondingly the Job Submission Description Language (JSDL) (Martin, Feller, 2007).

Data management

Both external applications and services and built-in services such as the execution management require handling of data, and the Globus Toolkit therefore offers services for data management: GridFTP offers high-performance data transfers; the reliable file transfer service offers reliable movement of data, including retries upon failures; a replica location service (RLS) locates replicated files; OGSA-DAI provides data access and integration of structured and semi-structured data (Foster, 2005).

Information management

Globus offers monitoring and discovery (MDS) services for maintaining information about which resources are present in a particular setting and the state of those resources, as well as enabling discovery upon need. Globus includes three aggregation mechanisms for providing access to this information: MDS-Index, MDS-Archive and MDS-Trigger. MDS-Index information can be collected into a grid-wide index. MDS-Archive offers persistent storage of state data for historical queries. MDS-Trigger services enable users to define their own information needs and have this information pushed to them once it crosses a given threshold.

Security

Globus offers various services for authentication and authorization. Different options are available through the latest web-service implementations and the prior implementations, with the WS implementations offering more possibilities. Fundamentally, they still provide similar options of authentication by X.509 certificates, secure transmission of data and access control lists, with granularity down to individual services.

3.2.1.2.3 Architectural and functional highlights

Figure 1: Schematic view of GTA4.0 components (Foster, 2005, p.20)

The various Globus components are hosted in dedicated servers. These servers also host user-built services within special containers. Most of the infrastructure services have by now been implemented with web service interfaces, but some still use pre-web-service interfaces, most notably GridFTP and RLS.


Figure 2: GRAM4 components (Globus Alliance, 2009d)

Of particular interest for this project are the components of GRAM. The GRAM service receives a job submission, delegates the processing to its own local scheduler or through scheduler adapters to external schedulers. Job state is maintained in the scheduler event generator (SEG). For data management it uses the reliable file transfer service. 3.2.1.2.4 Scenarios 3.2.1.2.4.1 Job resource departure Globus GRAM supports running the job locally through a fork like scheduler or forwarding the job to an external scheduler or resource management system such as Condor or PBS. In the latter case it is the responsibility of the corresponding system to react to resource failures. For forked jobs a Fork Starter process is started for each job which starts tasks, monitors them and returns exit codes to the Scheduler Event Generator and Job State Monitor (Globus Alliance, 2009e). No information has been found regarding options for automatic restart of jobs which have failed due to system failures and not due to logic failures within the job itself. No information has been found regarding behaviour on loss of contact to external job schedulers. It appears that GRAM leaves it up to the client to decide what action to take regarding job failure and makes no distinction whether the failure is internal to the job or due to failures within GRAM itself. Information can be found by examining job states, exit codes and error logs. 3.2.1.2.4.2 Job resource – high availability


As mentioned before, GRAM usually forwards job scheduling to external job schedulers and resource management systems, and the responsibility for handling this situation therefore lies with them. The author has not found any information regarding whether GRAM is able to monitor the load of the schedulers it is connected to, which could be used to decide which one to use. Handling of locally forked jobs is subject to the capabilities of the hardware running GRAM. This is presumably subject to host environment administrative monitoring policies.
3.2.1.2.4.3 Job resource – low availability
See the discussion above regarding high availability of job resources.
3.2.1.2.4.4 System resource departure
No information has been found regarding behaviour under internal component failures, such as when a Job State Monitor, Scheduler Event Generator or a Fork Starter fails. Job state is stored persistently on disk to ensure no loss of critical data due to failure of the service (Globus Alliance, 2009f). For data management GRAM uses the Reliable File Transfer (RFT) service, whose reliability is achieved by checkpointing of transfers and by retrying upon failures (Liming, 2008). No information has been found regarding automatic restart of the GRAM service itself upon complete failure, but this can be presumed to be handled by host environment administration tools, pending local administrative policy.
3.2.1.2.4.5 System resource high load
No specific information found. Presumably falls under host environment administrative monitoring policies.
3.2.1.2.4.6 System resource low load
No specific information found. Presumably falls under host environment administrative monitoring policies.
3.2.1.2.4.7 Malevolent resources
Globus uses certificates (Foster, 2005) to authenticate and authorize users and jobs for execution and data access. Furthermore, it tries to run jobs in as restricted an environment as possible. No information has been found regarding self-protection schemes.
3.2.1.2.5 Globus conclusion
No apparent self-management.
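To make the checkpoint-and-retry behaviour attributed to RFT above more concrete, the following minimal Java sketch resumes a file copy from a persisted offset and retries a bounded number of times. It is illustrative only and does not use the actual RFT interfaces; the file names in the main method are placeholders.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Illustrative only: resumes a byte copy from a checkpoint offset and retries on
    // failure, in the spirit of RFT's checkpoint-and-retry behaviour (not the RFT API).
    public class CheckpointedCopy {
        public static void copy(String src, String dst, long checkpoint, int maxRetries)
                throws IOException {
            long offset = checkpoint;                        // restart point recovered from persistent storage
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                try (RandomAccessFile in = new RandomAccessFile(src, "r");
                     RandomAccessFile out = new RandomAccessFile(dst, "rw")) {
                    in.seek(offset);
                    out.seek(offset);
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                        offset += n;                         // a real service would persist this checkpoint
                    }
                    return;                                  // transfer completed
                } catch (IOException e) {
                    if (attempt == maxRetries) throw e;      // give up after the configured number of retries
                }
            }
        }

        public static void main(String[] args) throws IOException {
            copy("input.dat", "output.dat", 0L, 3);          // placeholder file names
        }
    }

A real transfer service would additionally persist the offset after every successful block, so that a restart of the service itself can also resume from the last checkpoint.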


3.2.1.3 Unicore
3.2.1.3.1 Overview
UNICORE, which stands for UNiform Interface to COmputing REsources, is a grid middleware solution created by the UNICORE Forum for providing and gaining access to heterogeneous computing and data resources. The latest version is 6.
3.2.1.3.2 Main features and capabilities
Unicore offers a web service interface to services such as job submission, job management, and data management and transfer. Unicore has also developed and made available a number of client interfaces: an Eclipse client, a graphical application client, a command line client and programmable APIs. These client interfaces enable creation of both jobs and more complex workflows, with conditional statements, analysis of exit codes etc. (Schuller, 2008). These workflows are made possible both by pre-processing and parsing of flows on the client side and by a built-in BPEL workflow engine (Schuller, 2007). For information management each computing site has a registry of information. Furthermore, cross-site registries can be set up to enable data sharing and facilitate virtual organizations (Schuller, 2008). Among the options Unicore offers for security are mutual authentication, as well as policies and access control lists set up through a user database that can be shared among different Unicore sites (UNICORE Forum, 2009a).
3.2.1.3.3 Architectural and functional highlights
A typical Unicore setup is called a Usite, which is an administrative domain of resources that provide Unicore services. Users access this Usite through a single gateway. Within each Usite there is at least one Vsite, which is an interface to a set of computing resources, such as a cluster. Access to the Vsite is controlled through a network job scheduler (XNJS) which offers job management services. The XNJS itself is accessed through a set of web service interfaces called atomic services. Along with a site registry these services correspond to the capabilities that OGSA defines for grids. The OGSA-BES specification is supported. The computing resources themselves are interfaced through a target system interface, which hides the details of the underlying job and resource management system, e.g. Torque. Alternatively, i.e. when there is no resource management system, the job can be forked locally (UNICORE Forum, 2009b).


Figure 3: UNICORE 6 architecture (Schuller, 2007b)

Figure 3 shows the setup of a Usite and the path of a request through the gateway, through a web service interface and security infrastructure, down to the XNJS execution management and ultimately the target site through its target system interface.
3.2.1.3.4 Scenarios
3.2.1.3.4.1 Job resource departure
Unicore supports both simple jobs, in which an error status is reflected and error information can be retrieved and acted upon, and more complex workflows where predefined behaviours can be set up. This functionality depends on work done by the user and no information has been found regarding autonomous healing behaviour in this circumstance. However, the actual processing is performed at the target site, which is most often managed by an external management system such as Torque.
3.2.1.3.4.2 Job resource – high availability
As mentioned before, Unicore usually forwards jobs to external job and resource managers, where the responsibility for handling this situation lies. The author has not found any information regarding whether Unicore is able to monitor the load of the schedulers it is connected to.


Handling of locally forked jobs is subject to the capabilities of the local hardware resources. This is subject to host environment administrative monitoring policies. Worth noting is that Unicore supports Java Management Extensions (JMX) on Vsites, which could offer information in this situation.
3.2.1.3.4.3 Job resource – low availability
The arguments made in the discussion of high availability apply here as well.
3.2.1.3.4.4 System resource departure
Unicore components can be set up to persistently store information for use when the resource is restored. Most importantly, the XNJS can be set to persistently store job state so that jobs are not lost in case of its failure (Berghe, 2006). Regarding automatic restoration, JMX offers some possibilities but this is an area in which work remains: “A UNICORE Grid is composed of many services deployed on various machines. To monitor service health and manage the infrastructure, new tools and approaches will be needed.” (UNICORE Forum, 2009c) This situation could fall under host site administrative policies and management, with the health of machines monitored and possibly restarted through administrative tools and procedures.
3.2.1.3.4.5 System resource high load
Except for JMX support no specific information has been found.
3.2.1.3.4.6 System resource low load
Except for JMX support no specific information has been found.
3.2.1.3.4.7 Malevolent resources
Unicore uses certificates (UNICORE Forum, 2009b) to authenticate and authorize users and jobs for execution and data access. No explicit information has been found regarding self-protection schemes.
3.2.1.3.5 Unicore conclusion
No specific information found regarding self-management capabilities.
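As a concrete illustration of the kind of host information that JMX can expose on a Vsite machine, the short Java program below reads the platform management beans for processor count, load average and heap usage. This is generic JMX usage, not UNICORE code.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.OperatingSystemMXBean;

    // Generic JMX example: reading host load and memory figures that a monitoring
    // tool could poll on a Vsite machine (not part of UNICORE itself).
    public class JmxProbe {
        public static void main(String[] args) {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            System.out.println("processors     = " + os.getAvailableProcessors());
            System.out.println("load average   = " + os.getSystemLoadAverage());   // -1 if not available on this platform
            System.out.println("heap used (MB) = " + mem.getHeapMemoryUsage().getUsed() / (1024 * 1024));
        }
    }

Such values could be polled remotely over a JMX connector and fed into administrative monitoring or, eventually, self-management decisions.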


3.2.1.4 gLite
3.2.1.4.1 Overview
The gLite resource sharing middleware was produced as part of the Enabling Grids for E-sciencE (EGEE) project. It is composed of parts used and adapted from other infrastructures, like Globus and Condor (Burke et al., 2008). The current version is 3.1.
3.2.1.4.2 Main features and capabilities
The gLite middleware offers an extensive set of features and capabilities that span workload, data and information management, as well as security. For clients it offers both graphical and command line user interfaces for resource information, job submission, management and monitoring, and file management. For programmatic access, services are accessible through API packages and web service interfaces, including OGSA-BES (OMII-EU, 2009). For workload management gLite supports submission of jobs through its own job description language (JDL), as well as OGSA-JSDL. Various types of jobs are supported: single, sets, DAG, parametric, MPI and, notably, interactive jobs. Real-time output of data is also supported. The workload management has an accounting system that tracks user resource usage, which can be used for purposes such as limiting access to resources, i.e. quotas, or charging for access (EGEE, 2008). Notable security features include the Grid Security Infrastructure (GSI), including X.509 certificates and TLS/SSL encryption. For execution safety user credentials are kept as limited as possible (EGEE, 2005).
3.2.1.4.3 Architectural and functional highlights
The gLite architecture’s main parts are workload, data and information management sections, all of which rely on security features. An overview is presented in figure 4.


Figure 4: gLite architecture overview

A computing element (CE) represents a set of computing resources. The worker nodes in the set are managed by a Local Resource Management System (LRMS). The LRMS is accessed through a generic interface called Grid Gate (GE) (Burke et al., 2008). A storage element (SE) provides uniform access to a wide range of different data storage resources (Burke et al., 2008).


Information services are provided for resource discovery, resource state and other information needs. Two components are used for this purpose: the MDS system from Globus and the Relational Grid Monitoring Architecture (R-GMA). Both can be logically placed at a higher level to collect information from multiple sites within a virtual organization.
Workload management
Workload management is provided by a Workload Management System (WMS) which is hosted by a Resource Broker (RB) machine.

Figure 5: Workload manager internal architecture (EGEE, 2005, p.47)

Much like Condor, the workload manager keeps a queue of submitted jobs, or tasks, and performs matchmaking to find suitable computing elements. For this purpose the manager keeps what is called an “Information Super Market” (EGEE, 2008) with information about the computing elements. The computing element is then accessed through the GridGate interface as discussed above.
3.2.1.4.4 Scenarios
3.2.1.4.4.1 Job resource departure
The client can configure the number of retries in light of grid component failure (EGEE, 2008). The system can handle complete loss of contact with a CE by switching to another (EGEE, 2008). Handling of individual worker departures, i.e. within CEs, is ultimately the responsibility of the local resource management system.
3.2.1.4.4.2 Job resource – high availability
No information has been found regarding autonomic behaviour of the workload manager. Presented with jobs in the task queue, it will go through the process of selecting a set of computing elements (CEs) which fulfil the job requirements. From this set the most appropriate CE is chosen by taking into account how many jobs are currently running and queued at each (Burke et al., 2008). On the other hand, implementations of the CE can, when they have available resources, proactively pull jobs instead of waiting for the workload manager to push jobs to them (EGEE, 2005).
3.2.1.4.4.3 Job resource – low availability
No conclusive information found.
3.2.1.4.4.4 System resource departure
The user interface can be made aware of several WMS instances and automatically try different ones until a successful submission has been achieved (Burke et al., 2008). No conclusive information found about self-healing behaviour when the WMS fails.
3.2.1.4.4.5 System resource high load
The logging and bookkeeping server can be made to run on a separate machine from the WMS to reduce load on that machine (Burke et al., 2008). No conclusive information found regarding self-optimization in this situation.
3.2.1.4.4.6 System resource low load
No conclusive information found regarding self-optimization in this situation.
3.2.1.4.4.7 Malevolent resources
The system has built-in accounting and security features such as certificates.
3.2.1.4.5 gLite conclusion
Some self-healing mechanisms were found regarding job processing. No conclusive information found regarding the workload management system itself.
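The submission failover described for the user interface, i.e. trying several WMS instances until one accepts the job, can be sketched as follows. The WmsEndpoint interface and SubmissionException type are hypothetical stand-ins, not part of the gLite client API.

    import java.util.List;

    // Illustrative failover loop: try each known WMS endpoint in turn until one accepts
    // the job. WmsEndpoint and SubmissionException are hypothetical types.
    public class FailoverSubmitter {

        interface WmsEndpoint {
            String submit(String jobDescription) throws SubmissionException;   // returns a job identifier
        }

        static class SubmissionException extends Exception {
            SubmissionException(String msg) { super(msg); }
        }

        static String submitWithFailover(List<WmsEndpoint> endpoints, String jdl)
                throws SubmissionException {
            SubmissionException last = null;
            for (WmsEndpoint wms : endpoints) {
                try {
                    return wms.submit(jdl);            // first successful endpoint wins
                } catch (SubmissionException e) {
                    last = e;                          // remember the failure and try the next endpoint
                }
            }
            throw last != null ? last : new SubmissionException("no WMS endpoints configured");
        }
    }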


3.2.1.5 XtremWeb-CH 3.2.1.5.1 Overview XtremWeb-CH (XWCH) is an infrastructure for what the authors, a team from University of Applied Sciences in Geneva, Switzerland, call “an effective Peer-to-Peer System for CPU time consuming applications” (XtremWeb-CH, 2009). It is an extension of the XtremWeb (XW) which was originally a research project done at Université de Paris Sud in Orsay, France. 3.2.1.5.2 Main features and capabilities XWCH offers clients the possibilities of submitting jobs for processing by a collection of workers, which are managed by a centralized coordination service. The jobs can range from simple bags of tasks to more complex workflows. The client can monitor and manage the job during execution. XWCH is able to process a workflow request detailing the order of execution and dependent data movement for subsequent execution steps. Implementing this kind of execution was before the responsibility of the client (Abdennadher, Boesch, 2005). XWCH is also able to perform some limited optimization of task allocation which takes into account the workload of requesting workers (Abdennadher, Boesch, 2007). For security parties can be authenticated by certificates and communication encrypted using standard SSL/TLS techniques (Cappello et al., 2005). Some work seems to have been performed on sandboxing techniques to enforce worker safety (Cappello et al., 2005) but the status of this work is unclear. 3.2.1.5.3 Architectural and functional highlights A XWCH setup has a central coordination point which keeps track of tasks, workers and schedules the processing of jobs. Clients interact with this coordination service to submit executables, required data and job requests, for job monitoring and management and result download. Job requests are done through a request specification language (RSL). Workers are responsible for processing the job request and do so by pulling necessary executables and data from the coordination service. Periodically they send the coordination service a “heartbeat” message, which enables the coordination service to conduct job management. Finally, workers upload results to a result collector which is hosted by the coordination service. Workers are able to cooperate amongst themselves without the coordination service having to be involved. This increases the robustness of the service by reducing requirements on the centralized nature of the coordination. Computing resources can be set up as data warehouses which host data files required or produced. These warehouses also play a role when workers which need to cooperate cannot communicate directly.


Figure 6: XtremWeb-CH architecture (XtremWeb-CH, 2009)

Figure 6 gives an overview of the overall system. The original author speculated about deploying the coordination service in a higher degree of distribution, such as on Chord or Pastry (Cappello et al., 2005) but information about further work in that direction has not been found. Figure 7 shows a more detailed view of the internal architecture.

Figure 7: Architecture of the server (Fedak et al., 2000)


The XWCH implementation has not changed the centralized coordination architecture but has improved the system by reducing the coordination service's role in job processing, by enabling workers to communicate directly.
3.2.1.5.4 Scenarios
3.2.1.5.4.1 Job resource departure
The coordination service detects this condition through the heartbeat mechanism and automatically assigns the tasks the worker was responsible for to other resources (Abdennadher, Boesch, 2007). In addition, each worker persistently saves state to enable resumption of task processing once that worker has been restarted (Cappello et al., 2005). These seem to be conflicting approaches which might lead to multiple runs of the same task. No further information about this case has been found.
3.2.1.5.4.2 Job resource – high availability
XWCH does take worker load into account when assigning individual tasks (Abdennadher, Boesch, 2007). For overall management a centralized dispatcher selects tasks which are distributed to a collection of schedulers. The task selection policy is configurable at runtime through the Java reflection API (Fedak et al., 2000). No further information has been found regarding system-initiated optimization techniques, i.e. self-optimization.
3.2.1.5.4.3 Job resource – low availability
See the discussion above regarding high availability.
3.2.1.5.4.4 System resource departure
The effects of coordination service failures have been reduced by the workers' ability to communicate directly with other workers instead of having to route such communication through the centralized service. Furthermore, workers resend result files until notification has been received from the result file server (Cappello et al., 2005). For internal coordination behaviour, the dispatcher server is responsible for selecting bags of tasks to perform and distributes them to a cluster of scheduling servers. The dispatcher notices if scheduler instances fail and is able to redirect workers which were connected to that scheduler. No information has been found on behaviour when other system components fail, such as the main dispatcher server. Aside from adapting to failures of individual schedulers, no information regarding self-healing has been found.
3.2.1.5.4.5 System resource high load
Dispatcher server and coordination server policies are configurable at runtime, which could potentially be used to tackle cases of high load. However, no information has been found regarding self-management capability in this respect and presumably this falls under host environment administration policies and tools.
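A minimal sketch of the heartbeat-based failure handling described above: the coordination service records the last heartbeat received from each worker and periodically collects the workers that have been silent longer than a timeout, so that their tasks can be reassigned. All names are hypothetical; this is not XtremWeb-CH code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative heartbeat monitor: workers report in periodically; workers silent
    // for longer than the timeout are treated as departed and their tasks reassigned.
    public class HeartbeatMonitor {
        private final Map<String, Long> lastSeen = new HashMap<>();   // workerId -> time of last heartbeat (ms)
        private final long timeoutMs;

        public HeartbeatMonitor(long timeoutMs) { this.timeoutMs = timeoutMs; }

        /** Called whenever a worker reports in. */
        public synchronized void heartbeat(String workerId) {
            lastSeen.put(workerId, System.currentTimeMillis());
        }

        /** Returns the workers considered failed and forgets them so their tasks can be reassigned. */
        public synchronized List<String> collectFailed() {
            long now = System.currentTimeMillis();
            List<String> failed = new ArrayList<>();
            for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
                if (now - e.getValue() > timeoutMs) failed.add(e.getKey());
            }
            for (String worker : failed) lastSeen.remove(worker);
            return failed;
        }
    }

The timeout is the main tuning knob: a short timeout reacts quickly to departures but risks declaring slow workers dead, while a long timeout delays reassignment.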


3.2.1.5.4.6 System resource low load
See the discussion above of system resource high load.
3.2.1.5.4.7 Malevolent resources
Traditional techniques of certificate authentication are implemented. It is unclear whether sandboxing techniques are implemented (Cappello et al., 2005).
3.2.1.5.5 XtremWeb conclusion
XtremWeb provides some interesting ideas for consideration, such as redeployment in light of worker failures and attempts at scheduling in light of worker load and random departures. Some capabilities for healing are also present in circumstances of coordination service component failures, i.e. scheduler failure, but it is unclear what happens when more critical components fail, such as the single dispatcher or result server.


3.2.1.6 BOINC 3.2.1.6.1 Overview Berkeley Open Infrastructure for Networked Computing (BOINC) is an infrastructure to build applications that employ the resources of volunteers in a scheme called “volunteer computing” (Anderson, Fedak, 2006). This infrastructure is in use in a number of projects, most notably SETI@home. In the SETI@home setting numbers from 2006 indicate just over 330 thousand contributing hosts (Anderson, Fedak, 2006). The latest version is 6.4.5. 3.2.1.6.2 Main features and capabilities The BOINC infrastructure enables placement of workunits on centralized servers to which volunteer workers connect and download these workunits. Each worker then independently processes the workunit and returns a result. The infrastructure has a built in accounting system that tracks contributions of volunteers. Experience from the SETI@home projects shows this to be a source of motivation for the volunteers (Anderson, 2004). 3.2.1.6.3 Architectural and functional highlights A BOINC project has a centralized science database that contains the base of work to be done. This is replicated to data servers and scheduling servers which handle communication with workers. Workers pull workunits and push workunit results.

Figure 8: BOINC server (Anderson, Korpela, Walton, 2005)

3.2.1.6.4 Scenarios 3.2.1.6.4.1 Job resource departure BOINC does not explicitly handle individual resource failures but is set up to collect a configurable number of workunit results, in what is called “redundant computing” (Anderson, 2004). The infrastructure distributes workunits until this number of results has been achieved or a timeout has been reached. 3.2.1.6.4.2 Job resource – high availability


Not directly applicable. BOINC distribution is based on a backlog of workunits waiting to be processed. BOINC is a generic infrastructure and presumably the break-up of work into workunits and subsequent scheduling is dependent on the application in question. The hardware characteristics of hosts are uploaded to the centralized database (Anderson, Fedak, 2006). This is used to select suitable workunits. This information can presumably also be used to optimize workunit production on the server side.
3.2.1.6.4.3 Job resource – low availability
See the discussion above.
3.2.1.6.4.4 System resource departure
Data and scheduling servers can be replicated to handle client requests. Furthermore, the system uses “exponential backoff” (Anderson, 2004) to prevent overloading servers. This enables a “relatively modest server complex” to support large client bases (Anderson, 2004). No other information has been found regarding automatic behaviour in these circumstances.
3.2.1.6.4.5 System resource high load
See the discussion above on exponential backoff.
3.2.1.6.4.6 System resource low load
No specific information found.
3.2.1.6.4.7 Malevolent resources
Redundant computation of workunits enables the system to compare results and discard faulty ones. Files are signed with certificates to prevent illegitimate injection of data.
3.2.1.6.5 BOINC conclusion
BOINC is designed to operate in environments which are characterized by high churn and heterogeneity of resources. Churn tolerance is evident in the redundant computing techniques and in the exponential backoff used to reduce load on centralized services. These techniques can be of use for the design of YACS. An aspect of BOINC that is ill suited for YACS design is the considerable centralized infrastructure needed, both in services and hardware.
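The exponential backoff mentioned above can be illustrated with the following sketch, in which each failed contact with a server doubles the waiting time up to a cap. The delay values are invented for the example and are not BOINC's actual parameters.

    // Illustrative exponential backoff: each failed contact doubles the waiting time up
    // to a cap, which keeps a large client population from hammering a recovering server.
    public class Backoff {
        private final long baseDelayMs;
        private final long maxDelayMs;
        private int failures = 0;

        public Backoff(long baseDelayMs, long maxDelayMs) {
            this.baseDelayMs = baseDelayMs;
            this.maxDelayMs = maxDelayMs;
        }

        /** Call after a failed request; returns how long to wait before the next attempt. */
        public long nextDelayMs() {
            long delay = baseDelayMs << Math.min(failures, 16);   // double per failure, bounded shift
            failures++;
            return Math.min(delay, maxDelayMs);
        }

        /** Call after a successful request to reset the schedule. */
        public void reset() { failures = 0; }

        public static void main(String[] args) {
            Backoff b = new Backoff(1000, 3_600_000);             // 1 s base, 1 h cap (made-up values)
            for (int i = 0; i < 5; i++) System.out.println(b.nextDelayMs() + " ms");
        }
    }

Real clients typically also add random jitter to the delay so that a large population does not retry in lock-step once a server becomes reachable again.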


3.2.1.7 P3
3.2.1.7.1 Overview
Personal Power Plant (P3) (Shudo, Tanaka, Sekiguchi, 2005) is a peer-to-peer (P2P) application that enables users to share computing power. It was built at the Grid Technology Research Center (GTRC) in Ibaraki, Japan. However, the project appears inactive since 2005 and is not currently listed at GTRC’s website (GTRC, 2009).
3.2.1.7.2 Main features and capabilities
P3 enables participants to share computing power and cooperate on equal ground, in traditional P2P style, on performing computations. Every participant provides computing power to the system and every participant is able to submit jobs that utilize this computing power. The application is built on top of a P2P overlay called JXTA. The JXTA project was begun by Sun Microsystems and remains active (JXTA, 2009). JXTA supports communication through firewalls and NATs, formation of peer groups and discovery of resources without requiring a centralized service. P3 participants join this overlay and form job groups which peers discover and join to contribute their resources. P3 supports two main types of parallel application patterns: a master-worker pattern and a message-passing-interface (MPI) pattern. Security features are enabled by digital signatures. In the master-worker pattern, faulty and potentially dangerous results are discarded through redundant processing and voting. Currently there is no execution sandbox support for host protection.
3.2.1.7.3 Architectural and functional highlights
Through a controller element a peer creates a job group for the job it wants performed and subsequently advertises the existence of this group in the underlying overlay. Peers which receive this advertisement can choose whether to join the group and contribute resources to performing the task. The controller monitors the job progress and provides information to the client. For message passing applications the controller element issues each participating host an identification number. Each host knows the identity of the other hosts. Once this phase is complete the processing can start. For master-worker applications the controller designates one of the hosts to be the master. The master subsequently directs the processing of the job.


Figure 9: Organization of job management software and related peer groups (Shudo, Tanaka, Sekiguchi, 2005)

3.2.1.7.4 Scenarios 3.2.1.7.4.1 Job resource departure In the master-worker scheme the system detects this circumstance through time-outs and redeploys tasks to other peers. No information has been found regarding behaviour in the message passing scheme. 3.2.1.7.4.2 Job resource – high availability In the current version an available host is selected at random. P3 authors suggested less naive selections for future work (Shudo, Tanaka, Sekiguchi, 2005). 3.2.1.7.4.3 Job resource – low availability No specific information found. 3.2.1.7.4.4 System resource departure Not applicable. No centralized components are used by P3, e.g. for resource discovery. 3.2.1.7.4.5 System resource high load Not applicable. No centralized components are used by P3. 3.2.1.7.4.6 System resource low load Not applicable. No centralized components are used by P3. 3.2.1.7.4.7 Malevolent resources Files are digitally signed which prevents external entities from injecting malevolent files (Shudo, Tanaka, Sekiguchi, 2005).


The master-worker scheme supports some self-protection in that tasks are redundantly processed and the results compared. This improves the chances of faulty and malevolent results being correctly rejected.
3.2.1.7.5 P3 conclusion
The architectural approach taken by P3 is interesting due to the scalability inherent in the absence of centralized infrastructure and the self-organizing and healing abilities of P2P overlays. At the application level the system also transparently handles resource departures in the master-worker scheme, demonstrating self-healing abilities. This is of considerable benefit for the developer since P2P communities are characterized by frequent node churn. The downside to the approach taken by P3 is the lack of system-wide information aggregation and of management based upon that information, such as for improving overall resource utilization in the system.
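The redundant-processing-and-voting idea used to reject faulty or malevolent results can be illustrated with the generic majority vote below. It is not the actual P3 implementation and assumes that results can be compared with equals().

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    // Generic majority voting over redundant task results: a result is accepted only if
    // more than half of the returned copies are identical. Not P3 code.
    public class MajorityVote {
        public static <R> Optional<R> accept(List<R> results) {
            Map<R, Integer> counts = new HashMap<>();
            for (R r : results) counts.merge(r, 1, Integer::sum);
            return counts.entrySet().stream()
                    .filter(e -> e.getValue() > results.size() / 2)   // strict majority required
                    .map(Map.Entry::getKey)
                    .findFirst();
        }

        public static void main(String[] args) {
            System.out.println(accept(List.of(42, 42, 7)));   // prints Optional[42]
            System.out.println(accept(List.of(1, 2, 3)));     // prints Optional.empty
        }
    }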


3.2.1.8 Condor
3.2.1.8.1 Overview
Condor is a workload management system maintained by the University of Wisconsin-Madison. It is widely used throughout the world, in various settings from industry to academia, with sites of over one thousand machines (Condor, 2009a). The current version is 7.2.0.
3.2.1.8.2 Main features and capabilities
The main idea of Condor is matchmaking of job requests and resource offerings. A centralized matchmaker resides in a pool of resources and learns of individual resource capabilities and state through an advertisement system called ClassAds. Job management entities, called Agents, which want to have jobs executed also advertise their requirements to the matchmaker. Using this information the matchmaker tries, in an opportunistic manner, to match and schedule requests with appropriate resources. The resources are able to strictly control what kind of jobs they are willing to perform and from which parties. Similarly, the agents can specify detailed requirements regarding the type of resource needed, such as operating system and memory requirements.
Condor offers nine execution environments, called universes, which have different options. These universes are called Standard, Vanilla, MPI, Grid, Java, Scheduler, Local, Parallel and VM. It supports sequential jobs, parallel jobs, and jobs meant for remote resource systems (Condor, 2009b). Of the universes, the Standard universe has particularly notable features such as built-in checkpointing, remote procedure calls which connect transparently to the submitting machine, and the most advanced sandboxing capabilities.
For cross-site sharing of resources Condor supports a number of options (Thain, Tannenbaum, Livny, 2005). In a scheme called flocking, agents can contact matchmakers in other pools for execution of jobs among those pools' resources. Another option is Condor-G, in which an agent has been enabled to communicate with the GRAM of Globus sites instead of using a Condor matchmaker. A further option is called “gliding-in” (Thain, Tannenbaum, Livny, 2005), in which the Condor services, such as matchmaking and execution, are executed as a job in the remote system. The agent can then communicate with this job, i.e. the executable content of the job, as if it were a regular Condor pool.
Notable security features include both Kerberos and Grid Security Infrastructure (GSI) authentication and authorization, secure transfer of data, sandboxing possibilities and limited user accounts for running jobs.
3.2.1.8.3 Architectural and functional highlights
A Condor site, or pool, is composed of a set of resources and a central manager, which keeps track of the state of the system through periodic updates sent by every resource. The central manager also has the role of a matchmaker that matches job requests to available and appropriate resources. The resource machine, shown as “Resource” in figure 10, runs a process that advertises its state to the central manager. Upon being matched with a job, another process is launched that takes care of running the job, including sandboxing if possible, and the necessary communication with the submitting machine.


The submitting machine, shown as “Agent” in figure 10, runs a scheduler that spawns a shadow process for each job. This shadow process conducts the necessary communication and coordination with the sandbox running on the resource machine. Additionally, metaschedulers can run on the submitting machine to process more complex jobs such as directed acyclic graphs (DAG). Worth noting is that any resource in a Condor site can be an agent and act as a scheduler for a given job. This improves robustness. Optionally a special server can be set up to store checkpointing files. Otherwise checkpointing files are stored on the submitting machine.

Figure 10: The Condor kernel (Thain, Tannenbaum, Livny, 2003)

3.2.1.8.4 Scenarios
3.2.1.8.4.1 Job resource departure
Condor agents have built-in support for transparently handling departures by finding a replacement resource. Checkpointing is available for some jobs and universes and reduces the cost of this situation.
3.2.1.8.4.2 Job resource – high availability
The matchmaker's responsibility is to inform of matches between request requirements and resource offerings. The matchmaker's scheduling is done according to a user priority scheme (Tannenbaum et al., 2001) and usage quotas (Condor, 2009b). The majority of scheduling and planning falls on the requestor and the resource. It is up to the requestor to plan the scheduling of its jobs given resource opportunities, i.e. matches. Similarly, the resource can plan how to offer its capabilities, and can refuse matches. This behaviour is highly configurable, dependent on the implementation in question (Thain, Tannenbaum, Livny, 2005).
3.2.1.8.4.3 Job resource – low availability
See the discussion on job resource high availability.
3.2.1.8.4.4 System resource departure


All Condor machines have a dedicated process which monitors the health of the other Condor processes on the machine, restarts them if needed and informs a system administrator. Furthermore, due to the critical and centralized nature of the central manager, i.e. the matchmaker, the pool can be set up with several backup managers. On each manager machine a specialized process, Condor_had, is running and cooperating with the other Condor_hads. Upon failure of the current central manager these processes fully activate a backup manager. The same applies to network partitions: in each partition a manager will be activated (Condor, 2009b).
3.2.1.8.4.5 System resource high load
No conclusive information found regarding behaviour changes in this situation. Presumably local administration policy specifies monitoring mechanisms on the actual hardware.
3.2.1.8.4.6 System resource low load
No conclusive information found regarding behaviour changes in this situation. Presumably local administration policy specifies monitoring mechanisms on the actual hardware.
3.2.1.8.4.7 Malevolent resources
Condor has numerous security options, as introduced in 3.2.1.8.2. A quota system can be used to prevent hoarding of resources. No specific information has been found regarding adaptive behaviour or runtime analysis.
3.2.1.8.5 Condor conclusion
Condor is a mature and extensive system with several possibilities for execution. Of particular interest for YACS design is the distribution of job scheduling to potentially multiple agents. The ClassAd mechanism and opportunistic matching of resources is also interesting since it implies only best-effort attempts at optimal resource usage. The most notable example of self-management behaviour is the potential for self-healing in case of unexpected resource departure. A design feature of Condor that does not fit YACS is the centralized matchmaker. This service can be made robust with replicas but that requires considerable administrative attention.
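To make the opportunistic matchmaking idea concrete, the sketch below pairs a job request with the first resource offer whose attributes satisfy the request's requirements, and vice versa, mirroring the two-sided nature of ClassAd matching. The attribute names and the predicate form are simplified illustrations, not real ClassAd syntax or the Condor matchmaker.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;
    import java.util.function.Predicate;

    // Simplified matchmaking in the spirit of ClassAds: each side describes itself with
    // attributes and states requirements over the other side's attributes. Illustrative only.
    public class Matchmaker {

        static class Ad {
            final Map<String, Object> attributes;
            final Predicate<Map<String, Object>> requirements;
            Ad(Map<String, Object> attributes, Predicate<Map<String, Object>> requirements) {
                this.attributes = attributes;
                this.requirements = requirements;
            }
        }

        /** Returns the first offer that satisfies the request's requirements and accepts the request in turn. */
        static Optional<Ad> match(Ad request, List<Ad> offers) {
            return offers.stream()
                    .filter(offer -> request.requirements.test(offer.attributes)
                                  && offer.requirements.test(request.attributes))
                    .findFirst();
        }

        public static void main(String[] args) {
            Map<String, Object> jobAttrs = new HashMap<>();
            jobAttrs.put("Owner", "alice");
            Ad job = new Ad(jobAttrs,
                    attrs -> "LINUX".equals(attrs.get("OpSys")) && (Integer) attrs.get("MemoryMB") >= 1024);

            Map<String, Object> machineAttrs = new HashMap<>();
            machineAttrs.put("OpSys", "LINUX");
            machineAttrs.put("MemoryMB", 2048);
            Ad machine = new Ad(machineAttrs, attrs -> !"mallory".equals(attrs.get("Owner")));

            System.out.println(match(job, List.of(machine)).isPresent());   // prints true
        }
    }

The two-way test mirrors the fact that in Condor both the requestor and the resource can refuse a match.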


3.2.1.9 Survey conclusion
The preceding part of this survey has covered a number of important aspects regarding execution services and related technologies. It is by no means an exhaustive discussion, as that would take a long time and hundreds or thousands of pages for any one of those technologies. Worth noting is that in many cases little or no concrete information has been found regarding behaviour under the specified scenarios. However, examination of the systems’ architectures, services, roles and behaviours has still produced much information that helped to make reasonable and educated decisions regarding functionality requirements, design considerations, self-management guidelines and other tasks that will be performed during the rest of the YACS design and implementation. The following text summarizes the functional, architectural and management issues uncovered in the survey. Finally, a reflection is presented on what are considered the most important lessons learnt.
3.2.1.9.1 Feature and functionality summary
The basic functionality of an execution service is allowing a job to be submitted, uploading required resources, executing, monitoring and managing the job and providing results. The technologies surveyed can all be said to implement this behaviour. BOINC is a deviation from this pattern as it involves processing a set of pre-defined workunits, i.e. it does not allow “general” submission of jobs. However, it implements the basic functionality and this set could be considered a large batch job awaiting processing. The systems allow access through command line clients, application programming interfaces and, in some instances, graphical user interfaces. This access basically entails submission of jobs, monitoring functions, management functions and collection of results. The grid infrastructures, Globus, Unicore and gLite, benefit from a web service and resource interface.
The services accept a variety of job types, single and multi-jobs. Many have built-in support for parallel applications with MPI interfaces and Master-Worker patterns. Conditional workflows are also commonly supported. Condor is the most mature and specialized execution system of those surveyed and notably provides several universes, including one with built-in checkpointing of jobs. The grid execution management systems are often deployed using Condor or other similar systems. Notably, Unicore has a built-in BPEL engine.

Globus:  single, multi-jobs, serial, parallel, MPI
Unicore: single, multi-jobs, serial, DAGs, BPEL
gLite:   single, multi-jobs, serial, parallel, MPI, DAGs, Interactive jobs
XWCH:    single, multi-jobs, serial, parallel, MPI, DAGs
P3:      single, multi-jobs, serial, parallel, MPI, Master-Worker
Condor:  single, multi-jobs, serial, parallel, MPI, Master-Worker, DAGs, Universes, Checkpointing
Table 2: Listing of job capabilities

The most common security feature is authentication and signing using certificates. The grid middleware and Condor support detailed access lists. The services generally try to limit jobs by running them in as restrictive mode as possible. 3.2.1.9.2 Architectural summary


Fundamentally, all of the technologies surveyed are distributed systems built to enable utilization of distributed resources. However, the scope of this resource utilization differs between systems. The grid middleware, the Globus Toolkit, gLite and Unicore, are for example built to provide unified interfaces to heterogeneous resources and provide more than just execution services, e.g. distributed storage, and do in fact often rely on underlying execution or resource management systems such as Condor. Other systems, such as BOINC, are more exclusively designed for execution using distributed computational resources.
A common approach in these systems is reliance on well-defined, purposely deployed and pre-configured centralized system services to manage the utilization, with P3 being the only exception. In terms of execution management, Globus has the GRAM, gLite has the WMS, Unicore has the Gateway and XNJSs, XtremWeb has the coordination service, BOINC has a centralized server complex and Condor the central manager, i.e. the matchmaker. Yet, despite this “centralized” nature, clients can in some cases interact with multiple sites/instances of these systems, and system instances themselves can cooperate and provide higher-level, cross-site services, for example in the notion of virtual organizations. An alternative architecture is the one employed by P3. This architecture is P2P-like and without specialized system-wide services. With the P2P approach come self-organization and healing abilities on the communication layer as well as more equality in the distribution of responsibilities.
3.2.1.9.3 Self-management summary
Regarding self-management of jobs: many of the systems support automatic and transparent relocation of job tasks if the resource currently processing them is unexpectedly lost; this includes Condor, XtremWeb and BOINC. Others simply report to the user and leave the decision making to the user.
Regarding system self-management: Condor comes closest to self-management of system components. Each node has a “watchdog” process that monitors the health of system processes and restarts them upon failure. It also supports automatic replication of the centralized matchmaker, but this is subject to considerable attention and setup by administration staff.
Overall, evidence of self-management behaviour is rather scarce in these systems, especially when it comes to system components, for healing, optimization and protection. Management of system components seems generally to be reliant on attention by administrative staff and policies. There is some evidence of automatic healing upon processing-resource departure.
3.2.1.9.4 Reflection
As described in the functionality summary, the basic requirements are the ability to submit a job, upload necessary files, process the job while offering management and monitoring, and finally to retrieve results. This functionality can be greatly extended, e.g. with more complex jobs such as bags of tasks, bags of cross-dependent bags, workflows and execution environment options. A sensible approach would have been to tackle this in a step-wise fashion, as is described in the project description, i.e. first individual jobs, then job collections and so on. Another idea was a hybrid approach, using and/or integrating existing tools for this purpose, such as parts of Condor.


The architectural trend of dedicated and centralized services seems to require considerable administration and specialized setup and hardware, per site as well as for cross-site cooperation. This is quite understandable from a number of perspectives: one is that organizations want to have strict control over their site’s resources, another is more optimal resource utilization, and a third is more deterministic behaviour, which can help enable service level agreements and quality of service guarantees. However, questions have been raised about the manageability of this architecture as system size and distribution increase (Foster, Iamnitchi, 2003). In the setting that Grid4All is aiming for, i.e. community grids, pre-configured infrastructure is not desirable (Haridi, Vlassov, Ghodsi, Arad, 2009). These communities are, by nature, not in a position to host dedicated and centralized service points, nor might they have the administrative resources to do so anyway. In this setting self-management is an important goal.
As described, P3 is an alternative design that comes without dedicated and centralized services and is self-managing to a certain degree. However, decentralization also has some negative properties. The lack of system-wide knowledge means that system-wide self-management goals like resource utilization can hardly be optimized; for example, resources in one section of the system could be under-utilized while another section is critically overburdened.
Elements and ideas from both architectures were considered during the design of YACS. Ideas included distributing job management, i.e. pools of nodes hosting job-management components, each of which controls the processing of an individual job. Other ideas were system-wide functional matchmaking components which aggregate resource information and help with resource allocation. A related idea to functional matchmaking was using system-wide management elements for aggregating information about functional resource availability and jobs, and using this information for system self-management and optimization.

3.3 Design
After the initial study phase of the literature and the survey of execution services, the actual YACS development process was started. The process began with creating a formal requirement specification and a preliminary design based upon information learnt during the study and preparation phase. This design was then broken up into manageable parts and each part assigned to an iteration. Each iteration went through a design phase, implementation and testing. In the end three iterations had been done: job management, system management and stronger service. This section of the thesis report summarizes those iterations and presents the final design of the system. Documentation from the development process, i.e. the requirement specification, preliminary design and iterations, can be found in appendix A. This appendix is notable in that the evolution of the design can be deduced from reading through it, from the initial ideas presented in the preliminary design to the eventual end design and implementation presented here.
The remainder of this section is organized as follows:
1. Design guidelines: a listing of the main constraints which guided the design of YACS.
2. Design overview: an overview of the design and its breakup into two main parts:
   a. Job management.
   b. System management and services.
3. Job management: the part focused on the execution of jobs.
   a. Design guidelines: the main factors that guided the design of the job management part.
   b. Entities: main design entities, e.g. functional and self-management components and important data classes.
   c. Relationships: a visual representation of the entities and the relationships between them.
   d. Self-management control loops: a dedicated description of the self-management behaviour done in the job management part.
4. System management and services: the part focused on system state and providing services that the job management part needs.
   a. Design guidelines: the main factors that guided the design of the system management part.
   b. Entities: main design entities, e.g. functional and self-management components and important data classes.
   c. Relationships: a visual representation of the entities and the relationships between them.
   d. Self-management control loops: a dedicated description of the self-management behaviour done in the system management part.
5. Jobs, tasks and execution context: the type of work that can be executed and the services that facilitate this work.

3.3.1 Design guidelines
The design of YACS is subject to a number of important guidelines, requirements and constraints. These form a context for the more detailed analysis in 3.3.2-4. The most immediate requirements come from the role of the system, i.e. utilization of distributed resources for computation of jobs. However, the most important constraint is the environment in which YACS is supposed to run, i.e. community grids, which are characterized by a limited and unstable environment: limited ability for manual management due to complexity, cost and ownership; extensive heterogeneity of resources and network connections; and member churn. This environment’s instability and limited ability for manual management call for self-management abilities, such as self-healing to deal with churn, and self-configuration and optimization to deal with heterogeneous resource properties and changes in supply and demand.

3.3.2 Design overview
The following is an introductory overview of YACS; more detail is presented in the next sections, 3.3.3-5.
3.3.2.1 Functional overview
From a functional perspective YACS is split into two parts, job management and system management. Job management is concerned with managing the execution of a particular instance of a job which has been submitted into the system. System management is concerned with providing system services to the job management part so that it can fulfil its job management and task execution role. This entails monitoring the system’s functional resources and allowing for discover-for-use functionality. The split between the parts is illustrated in figure 11:


Figure 11: Example scenario illustrating the functional job and system management parts

The resource service is a centralized entity, composed of at least one resource service component. It gathers information from the functional resources in the system to form a best-effort view of membership and availability status. It can be described as a best-effort approach as the service does not guarantee that its information is up to date. The information is gathered periodically and it might happen in the meantime that resources join, leave, become busy or free. Those functional resources are master and worker components deployed on physical nodes connected to the network. Figure 11 illustrates how the resource service knows of two free functional components, a master and a worker at physical resources 11 and 12. The service will actually also contain information about the busy functional components, but the free ones are of particular interest as the self-management tries to keep the number of free ones, i.e. availability, above a configurable minimum threshold.
YACS can contain multiple job master components. They can be asked to take on and manage jobs submitted into the system, each managing one job at a time. This involves maintaining job state, finding worker resources for sub-tasks of the job and returning the job result once the job is done. As an example, figure 11 shows a job master component at physical node/resource 4 managing a submitted job identified by #3.
The worker components expose hardware resources to do the actual work. This work is described in a user-programmed task. A job consists of a collection/bag of these tasks. Workers are asked by masters to perform these tasks; if a task is accepted, the worker reports its result back to the master once it is finished. Again as an example, figure 11 shows a worker at physical node/resource 2 executing a sub-task A of job #3.
3.3.2.2 Self-management overview
As explained previously, most recently in 3.3.1, there is a need for self-management capabilities within YACS. Such capabilities can also be split into two parts, one managing the functional job management part, the other managing the functional system management and services part.
Self-management of job management is concerned with self-healing. It monitors the state of masters and workers and redeploys jobs and tasks on replacement resources in the event of failure.
Self-management of system management is concerned with both self-healing and self-configuration. The self-healing is responsible for healing the resource service if any of the components that make up the service fail. This is critical since the resource service is used by all jobs. The self-configuration senses the availability of free functional resources within the system and, if it is below the minimum threshold, takes steps to deploy additional functional components.
Self-management of the system is broken up into management components which have different roles and responsibilities. They form a hierarchy of information gathering and decision making that covers everything from the micro level of individual components up to the macro level of system-wide management. This hierarchy is illustrated in the following figure:


Figure 12: Self-management hierarchy for functional job and system parts

Watchers are the first major level in the self-management hierarchy. They sense state and departures in individual functional component groups through sensors placed at the grouped components. Watchers forward information to the aggregators. The master and worker watchers have the additional role of healing failures of masters or workers assigned to jobs or tasks, by re-deploying those jobs or tasks on replacement components.
Aggregators aggregate information from all individual component groups for a higher-level view of their respective system parts. From the aggregated data they analyze the system state and possibly request actuation on the system, i.e. if availability is below the configured minimum threshold.
The ConfigurationManager is at the top of the hierarchy and has the most global system view, based on the information received from all the aggregators. Based on this global view it has the possibility of making informed decisions about whether to go through with actuation requested by the aggregators.
Worth noting is that Niche provides replication of management elements and restoration upon failure. This is an important property in an unstable environment and one which is used by YACS to provide robust service. In the default deployment of YACS, every single self-management element is replicated three times. All of the replicas will receive all sensing events, but only one will be active and perform actual management in reaction to an event. The other two only modify their state in accordance with the event. If one of these replicas fails, Niche will deploy a replacement replica elsewhere and use either of the two remaining ones to restore state to the new one.
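As a hypothetical illustration of the self-configuration loop described above, and not the actual YACS or Niche code, the sketch below shows an aggregator that keeps a best-effort count of free masters and workers and asks a configuration manager to deploy additional components when availability drops below the configured minimum thresholds.

    // Hypothetical sketch of the availability control loop: none of these types
    // correspond to the real Niche or YACS classes.
    public class AvailabilityAggregator {

        interface ConfigurationManager {
            void requestDeployment(String componentType, int count);   // actuation request
        }

        private final ConfigurationManager manager;
        private final int minFreeMasters;
        private final int minFreeWorkers;

        public AvailabilityAggregator(ConfigurationManager manager, int minFreeMasters, int minFreeWorkers) {
            this.manager = manager;
            this.minFreeMasters = minFreeMasters;
            this.minFreeWorkers = minFreeWorkers;
        }

        /** Called when watchers forward an aggregated availability report. */
        public void onAvailabilityReport(int freeMasters, int freeWorkers) {
            if (freeMasters < minFreeMasters) {
                manager.requestDeployment("master", minFreeMasters - freeMasters);   // ask for the missing masters
            }
            if (freeWorkers < minFreeWorkers) {
                manager.requestDeployment("worker", minFreeWorkers - freeWorkers);   // ask for the missing workers
            }
        }
    }

In YACS the corresponding responsibilities are split between the aggregators and the ConfigurationManager described above, and the actual deployment is carried out through Niche rather than a direct method call.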

3.3.3 Job management The role of the functional job management part is to take on submitted jobs, execute them and return results. The management role is performed by job master components, the execution role is performed by worker components. The management of jobs is the focus of this section, e.g. which component is responsible for the job and how computational resources are found. Section 3.3.5 and appendix B, programmer’s manual, cover the details of the jobs themselves, e.g. what kind of tasks they can contain. 3.3.3.1 Design guidelines, principles and decisions The overriding guideline for design of the job management is distribution of management responsibility. This is done by distributing job management to multiple job master components, each of which is responsible for one job at a time. The distribution is important in an unstable environment where resource joins and failures occur unpredictably. This distribution makes the system more robust since failure or departure of a master affects only one job, not all as would happen if there were a centralized job management entity that failed. This also makes the system more scalable as the work is more evenly distributed among participating resources. This approach is guided by and based on studies of related work in already established systems like Condor and P3 which use multiple job agents and controllers, respectively, for individual job management. 3.3.3.2 Entities 3.3.3.2.1 Functional components Frontend The frontend class is a fractal component and serves as a generic channel that is used to submit a job into the system. It has been used during development, testing and evaluation but is not an integral part of the system. Any Niche capable fractal component can bind to the interface of the system resource service to discover a master resource, and then bind to the found master to submit a job to it. Master The Master class is a fractal component and is responsible for the management of a submitted job. It creates a master group and worker group around the components associated with the particular job, maintains the state of the job, directs its execution and reports results once completed. It continuously tries to discover workers to take on remaining job sub-tasks by querying the system resource service. Each master periodically reports its state to the system resource service to enable the discovery of masters to take on new jobs and to enable monitoring of availability within the system. The master is considered busy from the point of accepting a job and until its result has been reported to the client. Worker The Worker class is a fractal component that is responsible for executing individual tasks as assigned by a Master component.


Each worker periodically reports its state to the system resource service to enable the discovery of workers to take on new tasks and to enable monitoring of availability within the system. The worker is considered busy from the point of accepting a task until its result has been reported to the master. This means that the worker can take on multiple tasks within one job, although just one at a time.
3.3.3.2.2 Other notable functional entities
Job
The Job class contains a collection/bag of user programmed and/or defined tasks meant to be executed. This entity is covered in more detail in 3.3.5, and appendix B contains a full programming guide.
Task
The Task class contains a definition of work that needs to be executed and the associated data needed for that work. This entity is covered in more detail in 3.3.5, and appendix B contains a full programming guide.
3.3.3.2.3 Groups
The global Frontend group
This group contains all of the Frontend components which are deployed in the system.
The global Master group
This group contains all of the Master components deployed in the system. The GroupCreationManager self-management element, described in 3.3.3.2.4, is subscribed to this group and listens for group creation events originating from the Masters.
The global Worker group
This group contains all of the Worker components deployed in the system.
The Resource Service group
This group contains all of the Resource Service components that make up the Resource Service. It is covered in detail in 3.3.4.
Job Master group
Every single job that is submitted into the system results in the creation of two job-specific groups, one containing the Master managing the job, and one containing the Workers executing tasks of the job. Currently there is only one Master managing each job, but having a group around it makes Master failures transparent to the clients of the group, e.g. the Workers. It will also make future changes to the job management implementation transparent to clients, e.g. implementing stronger reliability guarantees by having more Masters. An instance of a MasterWatcher self-management element, described in 3.3.3.2.4, is subscribed to events triggered from members of this group. Those events are the StateChangeEvent, signifying a job state change, and the ComponentFailEvent, signifying the failure of a Master component.
Job Worker group
The job Worker group is the second of the two job-specific groups that are created once a job has been submitted and accepted. It groups all of the Workers executing tasks belonging to that particular job.


An instance of a WorkerWatcher self-management element, described in 3.3.3.2.4, is subscribed to events triggered from members of this group. Those events are the StateChangeEvent, signifying task state change, and the ComponentFailEvent, signifying the failure of a Worker component.

3.3.3.2.4 Self-management elements

CreateJobGroupManager
The GroupCreationManager class is a fractal component and self-management element of type manager. This manager monitors creation of the job specific master and worker groups, which are created by the master after it accepts a job. After detecting creation events it deploys a MasterWatcher and a WorkerWatcher management element, which provide self-management for that particular job.

MasterWatcher
The MasterWatcher class is a fractal component and self-management element of type watcher. Upon initialization it deploys a StateChangeSensor for the master group it is watching. It is then subscribed to events coming from that group, i.e. events of type StateChangeEvent and ComponentFailEvent. The StateChangeEvent contains a meta-checkpoint for the job. The meta-checkpoint contains information which is needed to heal the job in case of master failure, which is reported by the ComponentFailEvent.

WorkerWatcher
The WorkerWatcher class is a fractal component and self-management element of type watcher. Upon initialization it deploys a StateChangeSensor for the worker group it is watching. It is then subscribed to events coming from that group, i.e. events of type StateChangeEvent and ComponentFailEvent. The StateChangeEvent contains a meta-checkpoint for the task. The meta-checkpoint contains information which is needed to heal the task in case of worker failure, which is reported by the ComponentFailEvent.

StateChangeSensor
The StateChangeSensor class is a fractal component and a self-management element of type sensor that channels required information from the sensed functional component to the corresponding self-management watcher. To channel this information a StateChangeEvent is triggered.

3.3.3.3 Relationships

The following figures illustrate the components which are most relevant to job management and the relationships between them in the form of groupings, bindings and subscriptions. Bindings are used for communication with functional components while event generation and subscriptions are used for communication between self-management elements, see section 3.1.6.


Figure 13: Job management related components, groups and bindings


Figure 14: Job management related event subscriptions

Figure 13 shows the functional component types, Frontend, Master and Worker, all of which are grouped into respective system global groups. The resource service is exposed as the Resource Service group. The Frontends are bound to this group and use it to find an available Master. After binding to a found Master the job is submitted to that Master. After accepting a job the Master creates two groups that are associated with that particular job: the Job Master group to group the Master component managing the job, and the Job Worker group to group the Workers executing sub-tasks of the job.

The group abstraction that Niche provides is important as it makes development much easier. First, the Masters can communicate with all of the Workers at "once" without having to bind to every one individually. More important is that the Workers in the worker group don't need to be aware of the members of the master group. They only need to know of the group. This means that Master component failures and healing are transparent to the Workers.

The Master is bound to the Resource Service group and uses the binding to find Worker resources to execute job sub-tasks. After discovery these Workers are added to the Job Worker group. The Master's creation of the respective, job specific groups is detected by the CreateJobGroup management elements which, upon detecting the group creation events, deploy the MasterWatcher and WorkerWatcher self-management elements specific to this job and its groups.

The MasterWatcher and WorkerWatcher are bound to the Resource Service group. This binding is used to find replacements for healing purposes if either current Masters or Workers fail. After finding a resource they bind to that specific resource to instruct it to take on a job or task. They are also bound to their respective job specific group. This is useful for when they fail and are restored, as they can then at once ask all Masters and Workers to checkpoint job and task state so that the watcher will again have an up-to-date view. Niche attempts to record all relevant events that happen during restoration of management elements and replay them after restoration is complete. However, the recording of all events is not guaranteed so the binding method can be used to increase the chances of consistency.

Figure 14 shows event subscriptions. In particular it shows how the watchers are subscribed for failure notifications and state changes of group members in the job group they are watching.

3.3.3.4 Self-management control loops

The primary role of self-management in the functional job management part of YACS is self-healing. The Master and Worker watchers sense both the state of masters and workers and failures if they happen. Upon sensing failures they perform healing and actuation of changes by deploying previously checkpointed state on replacement masters or workers. A worker self-healing process is illustrated in figure 15. The process follows the same pattern when a master component is healed by the MasterWatcher.

Figure 15: The sequence of Worker self-healing steps

The checkpoint state is sensed through a state checkpointing interface. Through this interface changes to a job instance are recorded by the MasterWatcher and changes to individual tasks are recorded by the WorkerWatcher. This recording forms the knowledge base of those watchers. When master or worker failures happen the watchers look up the associated state in the knowledge base and assign it to a replacement master or worker component. The service currently checkpoints the initial job state and after every update made to it due to task changes, e.g. if a worker is found for a sub-task of the job. For tasks the service checkpoints the initial and end states. User defined checkpointing is also possible, as during task execution the task programmer/task itself can ask for a state checkpoint, e.g. so that partial results won't be lost due to a later failure.
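As an illustration of the record/heal pattern just described, the following is a minimal, Niche-independent sketch. The TaskCheckpoint and WorkerHandle types, the method names and the plain map used as knowledge base are assumptions made for this example only; they are not the actual YACS or Niche interfaces.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: plain-Java stand-ins for the watcher pattern described above.
public class WorkerWatcherSketch {

    /** Hypothetical snapshot of a task's progress (the "meta-checkpoint"). */
    public record TaskCheckpoint(String taskId, byte[] partialState) {}

    /** Hypothetical handle to a worker component that can accept a task. */
    public interface WorkerHandle {
        void deployTask(TaskCheckpoint checkpoint);
    }

    // Knowledge base: last known checkpoint per worker, keyed by worker id.
    private final Map<String, TaskCheckpoint> knowledgeBase = new ConcurrentHashMap<>();

    /** Called when a StateChangeEvent-like notification arrives: record the checkpoint. */
    public void onStateChange(String workerId, TaskCheckpoint checkpoint) {
        knowledgeBase.put(workerId, checkpoint);
    }

    /** Called when a ComponentFailEvent-like notification arrives: heal by redeploying. */
    public void onWorkerFailure(String failedWorkerId, WorkerHandle replacement) {
        TaskCheckpoint checkpoint = knowledgeBase.remove(failedWorkerId);
        if (checkpoint != null) {
            // Actuation: the task resumes from its last checkpoint on the replacement worker.
            replacement.deployTask(checkpoint);
        }
    }
}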


A non-primary role for these watchers is collecting data for system wide control loops. Information about started jobs and tasks, their progression and completion is forwarded to the system wide aggregators. System wide self-management is covered in 3.3.4.

3.3.4 System management

The role of system management is to collect information about functional resources in the system, those being master and worker components, so that they can be discovered for use. This role is performed by a resource service, which is a group of at least one resource service component.

3.3.4.1 Design guidelines, principles and decisions

An initial principle for the design of YACS was to avoid using any centralized entities, mainly to improve scalability but also to improve robustness. The approach for the first iteration followed this principle, whereby functional resources were discovered by simply broadcasting. This is similar to what the P3 system does (Shudo, Tanaka, Sekiguchi, 2005). The approach was changed during the second iteration to use a more centralized resource service entity. This entity is a group of at least one resource service component. This was done to improve resource request matching and resource allocation, as well as to collect system information used in system wide self-management. This approach is used in the well established Condor system, which uses a Matchmaker service for request and resource matching (Condor team, 2009b). An important factor in making this change was the capabilities that Niche offers. Component groups and a send-to-any capability, which sends to a random component in the group, coupled with even distribution of resource service components throughout the underlying DKS peer overlay, mean that scalability is maintained. Scalability is further improved by resource service components monitoring their own load and implicitly asking for additional resource service components to be deployed to share the load. Robustness is maintained by self-healing. The approach is different from that of Condor, which uses specially dedicated entities for matchmaking, in that any node capable of hosting a component can take part in the resource service.

3.3.4.2 Entities

3.3.4.2.1 Functional components

ResourceService
The ResourceService class is a fractal component and is responsible for providing a resource management and allocation service to other components in the system. The service receives state information from resources within the system and constructs a view of the system, in terms of free and busy resources, their types and properties. This state is sensed by the self-management part. A number of ResourceService components can be expected to be running at any time and they lazily disseminate their state to other members of the Resource Service group. These components monitor their own load in terms of the number of handled requests in a defined time period. If the number of requests becomes too high then a flag is raised. This flag is sensed by the self-management part which then attempts to deploy additional components to share the load.

3.3.4.2.2 Groups

The Resource Service group
The Resource Service group contains all of the ResourceService components which are deployed in the system. Clients of this group, i.e. Frontends, Masters and Workers, are only aware of the existence of this group, not of the individual ResourceService components. This makes the Resource Service implementation and any changes due to failures or load transparent to its clients. The ServiceWatcher self-management element, described in 3.3.4.2.3, is subscribed to events triggered by members of this group. Those events are the ServiceManagementEvent, containing load or availability information, and the ComponentFailEvent, signifying ResourceService component failure.

3.3.4.2.3 Self-management elements

LoadSensor
The LoadSensor class is a fractal component and a self-management element of type sensor that channels system resource state and service state information sensed from the ResourceService component to the ServiceWatcher.

ServiceWatcher
The ServiceWatcher class is a fractal component and a self-management element of type watcher. It senses the membership of the ResourceService group, i.e. detects joins and failures, and channels state and load information up in the self-management hierarchy to the ServiceAggregator.

ServiceAggregator
The ServiceAggregator class is a fractal component and a self-management element of type aggregator. It collects and analyzes information about the total availability of resources within the system, i.e. ResourceService, Master and Worker components. Upon noticing high or low availability thresholds being crossed it triggers corresponding events to the ConfigurationManager.

WorkerAggregator
The WorkerAggregator class is a fractal component and a self-management element of type aggregator. It collects information about all tasks within the system, e.g. counts of started, completed and deleted tasks. It forwards this information to the ConfigurationManager by triggering events.

MasterAggregator
The MasterAggregator class is a fractal component and a self-management element of type aggregator. It collects information about all jobs within the system, e.g. counts of started, completed and deleted jobs. It forwards this information to the ConfigurationManager by triggering events.

ConfigurationManager
The ConfigurationManager class is a fractal component and a self-management element of type manager. It is ultimately responsible for making changes to the system architecture on request. It has the most global system view, through information from all of the aggregators, and could use this information to decide whether and how to act on requests. Currently its primary function is to process deployment request events from the ServiceAggregator. To do this it attempts to find physical Niche nodes with available resources and to deploy new additional functional components there.

3.3.4.3 Relationships


The following figure illustrates the components which are most relevant to system management and services, and the relationships between them in the form of groupings, bindings and subscriptions.

Figure 16: System management related components, groups, bindings and subscriptions

Figure 16 shows the system wide Resource Service group which needs to contain at least one functional ResourceService component. All of the Master and Worker functional components are bound to this group in order to send periodic status updates to the system resource service. As shown previously in figure 13, and partially in figure 16, the Frontends, Masters, MasterWatchers and WorkerWatchers are also bound to this group for the resource request/discovery interface.

From all of these bindings it can be seen that the resource service is a highly used and loaded part of the system. To improve robustness and scalability the beneficial send-to-any-in-group feature of Niche, discussed in 3.3.4.1, is used for these bindings. The load is then evenly distributed over the ResourceService components in the group. Also, if the load on these components becomes too high they raise a flag which prompts the self-management part to try to increase the distribution degree by adding more ResourceService components to the group.

As incoming updates and requests are handled by only one ResourceService component at a time there will be inconsistencies among them in their view of system state. In order to reduce these inconsistencies each ResourceService component periodically disseminates its view to the others. Therefore, pending no other changes, they eventually reach a stable and consistent view of the system.

Figure 16 also shows the hierarchy of self-management elements. The ServiceAggregator is subscribed to management events from the ServiceWatcher which tell of availability in the system and whether load is too high on the ResourceService components. The MasterAggregator is subscribed to all MasterWatchers and from there collects statistics about active, completed and deleted jobs in the system. Similarly the WorkerAggregator is subscribed to all WorkerWatchers and collects statistics about active, completed and deleted tasks. At the top of the hierarchy sits the ConfigurationManager which is subscribed to management events from all aggregators, thereby learning of availability, load, job statistics and task statistics. Currently, its primary role is to blindly process deployment requests from the ServiceAggregator when it determines that functional ResourceService, Master or Worker components are needed in the system. The ConfigurationManager already has a wealth of data that it could use to perform sophisticated analysis and planning; for example it could choose to ignore requests for additional Masters or Workers if it knows from the job and task statistics that most are about to be completed, thereby increasing free availability again. This is subject to future work.

3.3.4.4 Self-management control loops

The role of self-management in the system management part of YACS is twofold: self-healing of the resource service and self-configuring the system to maintain availability of free functional components. The ServiceWatcher senses the state of the system through sensors on resource service components, as well as failures of resource service components, should they happen. This information is sent to the ServiceAggregator which analyzes the state to see if free availability is too low and actuation is needed to change the system, i.e. by deployment of additional functional components. The ConfigurationManager performs this actuation by trying to find physical nodes with enough free capacity to take on new functional components. Self-healing and self-configuration are illustrated in the following figures:


Figure 17: The sequence of ResourceService self-healing steps


Figure 18: Illustration of system self-configuration steps to maintain availability

The continued sensing of state, analysis of aggregated information and possible actuation based on the analysis are examples of typical control loop behaviours. The current analysis is simple, based on counting and comparing with pre-configured thresholds. However, substantial information about system state, e.g. resource count, type and availability; job type, count and status; task type, count and status, and so on, is already available in the self-management hierarchy. Using this information for more advanced control loops is certainly possible and could be subject to future work.
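As an illustration of the simple threshold check described above, the following is a hedged sketch. The AvailabilityView and Deployer types and their names are assumptions introduced only for this example; in YACS itself the sensing is done by the ServiceWatcher/ServiceAggregator and the actuation by the ConfigurationManager.

// Illustrative sketch of a threshold-based analysis step in a configuration control loop.
public class AvailabilitySketch {

    /** Hypothetical aggregated view of free components of one type. */
    public record AvailabilityView(String componentType, int freeCount) {}

    /** Hypothetical actuation interface: deploy one more component of the given type. */
    public interface Deployer {
        void deployAdditional(String componentType);
    }

    private final int minimumFree; // pre-configured lower threshold, e.g. 1 or 2

    public AvailabilitySketch(int minimumFree) {
        this.minimumFree = minimumFree;
    }

    /** Analysis and planning: request a deployment whenever free availability drops too low. */
    public void analyze(AvailabilityView view, Deployer deployer) {
        if (view.freeCount() < minimumFree) {
            deployer.deployAdditional(view.componentType());
        }
    }
}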

3.3.5 Jobs, Tasks and the execution context

YACS supports jobs as collections/bags of independent tasks. Management of jobs follows a master-worker pattern where the job master assigns tasks to workers for execution. The structure of responsibility, timeline and flow of information is illustrated in figure 19:


Figure 19: Structure of responsibility, timeline and flow of information for jobs and tasks

A Frontend component submits a job to a Master component, which in turn discovers Worker components to execute individual tasks. Task changes are reported to the Frontend as they happen. In the end the Master reports a job result back to the Frontend. YACS is a Java based service and therefore jobs and tasks need to be programmed in Java as well. On the other hand, tasks can of course be programmed to call external non-Java programs or scripts.

3.3.5.1 Jobs

Job is a simple container class used to contain tasks meant for execution, and to contain all task results, collectively forming the job result. The Job class is also used for the so-called meta-checkpoints used by the MasterWatcher for self-healing redeployment in case the Master component fails. Extension of jobs, for example to include user-programmed task scheduling and expansion or reduction of task collections, is predicted to be easy to implement. This could be subject to future work.


3.3.5.2 Tasks

The Task class is a base class for all tasks meant for execution within YACS. The Task class exposes an interface that YACS needs to start execution of the task once it has been deployed and instantiated at a Worker. YACS places no restriction on what the task actually does; it can perform work within the class itself, call external programs or scripts and so on. Task derived classes are contained within a TaskContainer class. This container includes the task class definition which is loaded by a ClassLoader at the Worker so that it can dynamically accept new task types. It also stores an array of initialization variables. This array is also used to store limited task state in the so-called meta-checkpoints. As explained previously those meta-checkpoints are used by the WorkerWatcher for self-healing redeployment in case the Worker component fails.

3.3.5.3 Execution interface and context

The task is executed at a Worker component. This Worker forms an execution context for the task; it accepts the task container containing the task definition, instantiates, deploys and starts execution of it. Once the task has finished execution, the task container and the task result within it are returned to the Master. During task execution the Worker's execution context provides access to service functions which are of benefit for the task. This comes in the form of a meta-checkpointing function which instructs YACS to store task state. This state is provided to the task if it has to be redeployed at another worker. The task can then start from that state instead of having to restart from the beginning. Figure 20 summarizes the structure of jobs, tasks and execution context; a hedged sketch of how a user-defined task might use the checkpointing function is given after the figure.

Figure 20: Jobs, tasks and execution context
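The following is a hedged sketch of what a user-defined task using the meta-checkpointing function could look like. The PrimeCountTask class, the ExecutionContext interface and the execute/checkpoint method names are hypothetical stand-ins for the real interfaces documented in appendix B; only the general pattern, i.e. do work, periodically ask the execution context to store partial state, and return a result to the container, reflects the design described above.

import java.io.Serializable;

// Hedged illustration of the programming model described in 3.3.5; not the actual YACS API.
public class PrimeCountTask implements Serializable {

    /** Hypothetical execution-context interface offered by the Worker. */
    public interface ExecutionContext {
        void checkpoint(Serializable partialState); // ask the service to store task state
    }

    // Progress fields; on redeployment the task would be restored from its last checkpoint.
    private long current = 2;
    private long count = 0;
    private final long limit;

    public PrimeCountTask(long limit) { this.limit = limit; }

    /** Entry point invoked by the Worker's execution context. */
    public Long execute(ExecutionContext ctx) {
        for (; current <= limit; current++) {
            if (isPrime(current)) count++;
            // User-defined checkpointing: protect partial results against Worker failure.
            if (current % 100_000 == 0) ctx.checkpoint(this);
        }
        return count; // the task result is returned to the Master via the container
    }

    private static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }
}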


3.4 Implementation

Implementation of the service exactly follows the design from section 3.2. Functional and self-management components follow the naming introduced in the design, are programmed as Java classes and defined in the extended Fractal component model architecture description language (ADL). The same applies to all design interfaces except that they are defined as Java interfaces, which are implemented by the components. Events are Java classes and named in accordance with the design. Job, TaskContainer and Task are regular Java classes. Appendix B provides a programmer's manual that describes how to program tasks, submit them and receive results. Appendix C provides an administrator's manual that describes how to configure, deploy and start a YACS instance. Appendix D provides information about Javadoc documentation for classes and interfaces in the design.


3.5 Evaluation

Evaluation is meant to determine the quality of a system. Here the notion of quality is the overhead in time resulting from use of YACS under different operating conditions. What has been of most importance for this project is the overhead in time and the efficiency of its self-management capabilities, those being self-healing and self-configuration. A further notion is YACS's ability to tolerate churn.

The evaluation was done in six steps. The first five were interdependent but the last one independent. First an idealized job/work model is defined. This model serves as a baseline to which jobs executed in YACS were compared. These executions were static, i.e. without self-management scenarios, and in turn served as a baseline for the next step when executions subject to self-management were made. Comparing timings of all these runs and models quantifies the overhead involved. The last step was tolerance to membership churn. A job was submitted into YACS, which combined with the churn, forced YACS to do self-management. The notion of tolerance here was whether the service is able to complete the job.

Setup
All of the execution runs were performed on Grid5000 which is a 1597 physical node (Grid5000, 2009a) grid of clusters located in France meant for research into large-scale parallel and distributed systems (Grid5000, 2009b). Step 2 was run on 8, 16 and 32 physical node setups. Steps 3-4 were run on a 32 node setup, step 5 on a 32-33 node setup and step 6 on a 128-192 node setup.

Measurements and scale
All timings presented in the following sections are in milliseconds, unless otherwise specified. A percentage ratio of overhead to the idealized model is used as a common scale to compare runs of different properties, e.g. in the number of Workers used, task lengths, number of tasks and so on.


Abbreviations
In the following sections a number of abbreviations are used frequently. These abbreviations are listed in the following table:

ADWC | Actual Distributed Wall Clock time, i.e. the actual time needed to execute the job. This is measured as the time that elapses between a client job submission and receiving the result.
IDWC | Idealized Distributed Wall Clock time, i.e. the time needed for distributed job execution without any overhead.
HJCFE | Handle Job Component Failure Event. The time needed by the MasterWatcher to process a Master failure.
HTCFE | Handle Task Component Failure Event. The time needed by the WorkerWatcher to process a Worker failure.
JS | Job setup. The time it takes to do the required setup for individual jobs. This includes creating groups, deploying management elements and more.
JTIJ | Join Time Into Job. How long a job has been running before a physical node joins the system. Used to measure how fast node resources can be put to use by the job.
MSEK | Milliseconds.
OH | OverHead in milliseconds compared to the idealized model, i.e. how many extra milliseconds did the job need to complete compared to what the idealized no-overhead model predicted.
OH% | OH as a ratio of idealized model time: (AJET - Model CPU time) / Model CPU time.
TPT | Time per Task. For how long an individual task is executed at a Worker.
TPW | Tasks per Worker. How many tasks a single worker is expected to perform during a particular job.
TTFU | Time To First Utilization. For how long a functional component is sitting idle after joining the system before being put to productive use.
WC | Time needed for non-distributed job execution without any overhead, i.e. idealized non-distributed wall clock time.
WD | Worker Deployment. How long Niche needs to discover a node resource and to deploy a functional Worker component upon the discovered resource.
WORC | With-Out Race Condition. A deployment bug in Niche occasionally causes a period of blocking that can skew timing results.
WS | Worker Started. Time from a physical node joining the system before a worker component has been started at that node.
WU | Worker Utilization. Utilization measured as a ratio of how many workers out of the total workers deployed in the system were put to use.

Table 3: Abbreviations used in evaluation results


3.5.1 Step 1: Definition of an idealized job/work model

The idealized model is a unit of work that takes exactly 60 minutes to perform and contains no overhead. This translates into a WC of:

Minutes | Seconds | Milliseconds
60 | 3.600 | 3.600.000

Table 4: Timing assumptions for the idealized job/work model

The ideal job is implemented in YACS as a job containing a collection of sleep tasks. Each sleep task occupies a Worker resource for a certain amount of time. The combined occupation time of all tasks amounts to the idealized execution time of the job. The reason for choosing a sleep task is that it enables timing assumptions, by eliminating differences in CPU speed and memory, and that it has no IO, which eliminates effects of network and file storage latencies. Factoring out variations in tasks due to different Worker capabilities enables the creation of the idealized model and enables a focused evaluation of the management overhead of the service itself.
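A minimal sketch of such a sleep task is shown below. It assumes a plain execute() entry point rather than the actual YACS Task interface, so it is illustrative only.

// Sketch of the kind of sleep task used for the idealized model.
public class SleepTask {
    private final long millis;

    public SleepTask(long millis) { this.millis = millis; }

    /** Occupies a Worker for a fixed time; independent of CPU speed, memory and IO. */
    public void execute() throws InterruptedException {
        Thread.sleep(millis);
    }
}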

3.5.2 Step 2: Pure job setup time

This step quantifies the time involved in the pure setup of a job, i.e. a job without any tasks. Functional logic creates Master and Worker groups and does the appropriate bindings. The non-functional self-management logic detects these events and deploys Master and Worker watchers. This step was run in setups with different numbers of physical nodes to see if scale was a factor in the setup time.

Setup | average | min | max | average WORC*
8 nodes | 410 | 403 | 425 | -
16 nodes | 388 | 333 | 438 | -
32 nodes | 4.361 | 389 | *10.319 | 361

Table 5: Job setup overhead

These numbers are based on 4 runs of the 8 and 16 node setups and 5 runs of the 32 node setup. The 32 node setup was twice subject to a known race condition within Niche which can occur when a component being deployed happens to end up on the same node as the deployer. A timeout for this condition was set to 10.000 milliseconds and explains the max value of 10.319. The average without the race condition (WORC) is similar to the other setups.

3.5.3 Step 3: Distributed job execution

Here the job defined in 3.5.1 is executed in a number of different settings and compared to the idealized time of 60 minutes. This quantifies the overhead involved, such as from submitting the job, managing it, finding execution resources and returning results. Furthermore the individual distributed executions are compared amongst each other to see the effects of changing the number of Workers involved or the number of tasks within the job. The scenarios are set up in such a way that YACS doesn't have to show any self-management behaviour, i.e. no failures causing self-healing or lack of availability causing self-configuration.
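As a hedged reading of the model values in table 6: for a job of $T$ equal tasks of length $TPT$ executed on $W$ workers, the tabulated IDWC values appear to follow the simple round-based formula below. The formula is inferred from the tabulated numbers and is not stated explicitly in the original design.

$IDWC = \lceil T / W \rceil \cdot TPT, \qquad WC = T \cdot TPT = 3.600.000\ \text{ms}$

For example, 30 tasks of 120.000 ms each on 20 workers give $\lceil 30/20 \rceil \cdot 120.000 = 240.000$ ms, matching the corresponding row of the table.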


Table 6 shows the results of evaluation runs using different numbers of workers, tasks and task lengths (the columns Worker count through WU give the model values; Run# through Oh.% give the actual run results).

Worker count | WC | Tasks | TPT | TPW | IDWC | WU | Run# | ADWC | Oh. | Oh.%
10 | 3.600.000 | 1 | 3.600.000 | 0,10 | 3.600.000 | 10% | 1 | 3.603.123 | 3.123 | 0,09%
10 | 3.600.000 | 5 | 720.000 | 0,50 | 720.000 | 50% | 1 | 721.241 | 1.241 | 0,17%
10 | 3.600.000 | 10 | 360.000 | 1,00 | 360.000 | 100% | 1 | 361.336 | 1.336 | 0,37%
10 | 3.600.000 | 10 | 360.000 | 1,00 | 360.000 | 100% | 2 | 361.345 | 1.345 | 0,37%
10 | 3.600.000 | 10 | 360.000 | 1,00 | 360.000 | 100% | 3 | 361.353 | 1.353 | 0,38%
10 | 3.600.000 | 10 | 360.000 | 1,00 | 360.000 | 100% | 4 | 361.372 | 1.372 | 0,38%
10 | 3.600.000 | 10 | 360.000 | 1,00 | 360.000 | 100% | 5 | 361.207 | 1.207 | 0,34%
10 | 3.600.000 | 20 | 180.000 | 2,00 | 360.000 | 100% | 1 | 363.425 | 3.425 | 0,95%
10 | 3.600.000 | 30 | 120.000 | 3,00 | 360.000 | 100% | 1 | 375.470 | 15.470 | 4,30%
10 | 3.600.000 | 60 | 60.000 | 6,00 | 360.000 | 100% | 1 | 383.345 | 23.345 | 6,48%
10 | 3.600.000 | 120 | 30.000 | 12,00 | 360.000 | 100% | 1 | 466.938 | 106.938 | 29,71%
20 | 3.600.000 | 1 | 3.600.000 | 0,05 | 3.600.000 | 5% | 1 | 3.603.173 | 3.173 | 0,09%
20 | 3.600.000 | 5 | 720.000 | 0,25 | 720.000 | 25% | 1 | 721.275 | 1.275 | 0,18%
20 | 3.600.000 | 10 | 360.000 | 0,50 | 360.000 | 50% | 1 | 361.548 | 1.548 | 0,43%
20 | 3.600.000 | 20 | 180.000 | 1,00 | 180.000 | 100% | 1 | 201.896 | 21.896 | 12,16%
20 | 3.600.000 | 30 | 120.000 | 1,50 | 240.000 | 100% | 1 | 255.019 | 15.019 | 6,26%
20 | 3.600.000 | 60 | 60.000 | 3,00 | 180.000 | 100% | 1 | 195.588 | 15.588 | 8,66%
20 | 3.600.000 | 120 | 30.000 | 6,00 | 180.000 | 100% | 1 | 247.139 | 67.139 | 37,30%
30 | 3.600.000 | 1 | 3.600.000 | 0,03 | 3.600.000 | 3% | 1 | 3.602.901 | 2.901 | 0,08%
30 | 3.600.000 | 5 | 720.000 | 0,17 | 720.000 | 17% | 1 | 720.977 | 977 | 0,14%
30 | 3.600.000 | 10 | 360.000 | 0,33 | 360.000 | 33% | 1 | 361.204 | 1.204 | 0,33%
30 | 3.600.000 | 20 | 180.000 | 0,67 | 180.000 | 67% | 1 | 191.880 | 11.880 | 6,60%
30 | 3.600.000 | 30 | 120.000 | 1,00 | 120.000 | 100% | 1 | 132.372 | 12.372 | 10,31%
30 | 3.600.000 | 60 | 60.000 | 2,00 | 120.000 | 100% | 1 | 134.636 | 14.636 | 12,20%
30 | 3.600.000 | 120 | 30.000 | 4,00 | 120.000 | 100% | 1 | 180.343 | 60.343 | 50,29%

Table 6: Run results and model for distributed job execution

Visualizing these results in figures 21 to 23 shows how the total wall clock execution time is reduced by using distributed execution.


[Chart: comparison of non-distributed and distributed job execution with 10 workers; y-axis: Msek, x-axis: Tasks (1-120); series: non-distributed model time, distributed model time, actual time, overhead time]

Figure 21: Distributed job execution with 10 workers and job breakup into 1-120 tasks

[Chart: comparison of non-distributed and distributed job execution with 20 workers; y-axis: Msek, x-axis: Tasks (1-120); series: non-distributed model time, distributed model time, actual time, overhead time]

Figure 22: Distributed job execution with 20 workers and job breakup into 1-120 tasks


[Chart: comparison of non-distributed and distributed job execution with 30 workers; y-axis: Msek, x-axis: Tasks (1-120); series: non-distributed model time, distributed model time, actual time, overhead time]

Figure 23: Distributed job execution with 30 workers and job breakup into 1-120 tasks

The following figure 24 shows that the trend is for the overhead to increase with the number of tasks within the job. For each task a Worker resource has to be discovered, bound to and assigned the task. This process is expensive. Longer task lists mean more discoveries that have to be made, more bindings and assignments performed, and longer waits for tasks before their turn comes up, resulting in rapid growth of overhead.

[Chart: overhead as percentage increase to idealized model time; y-axis: % increase over model, x-axis: # of tasks (1-120); series: 10 workers, 20 workers, 30 workers]

Figure 24: Overhead trend as the number of tasks increases

One idea for reducing this cost could be to periodically ask for as many Workers as there are remaining tasks, instead of periodically issuing a separate discovery request for each and every remaining task. This approach would reduce the number of discovery function invocations. An added benefit is that overall less load would be placed on the resource service. Another idea could be to assign, in a single binding invocation, multiple tasks to each Worker, thereby reducing the number of bindings and assignments needed. This would require some task queuing mechanism in the Worker.
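A hedged sketch of the second idea, a worker-side task queue, is given below. The QueuedWorkerSketch class and the Task interface are hypothetical names introduced for the example; this mechanism is not implemented in the current YACS.

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Sketch of the suggested worker-side task queue (future work, not the current design).
public class QueuedWorkerSketch {

    public interface Task {
        Object execute();
    }

    private final Queue<Task> pending = new ArrayDeque<>();

    /** A single binding invocation can hand over several tasks at once. */
    public synchronized void assignBatch(List<Task> batch) {
        pending.addAll(batch);
    }

    /** Executes queued tasks one at a time, as the current Worker already does. */
    public synchronized Object executeNext() {
        Task next = pending.poll();
        return next == null ? null : next.execute();
    }
}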

3.5.4 Step 4: Distributed job execution with self-healing scenarios

This step quantifies the overhead resulting from failure of individual functional job components. The step is evaluated in two different scenarios: the first is the killing of a single active Worker component, the second is the killing of a single active Master component.

3.5.4.1 Self-healing after Worker failure

In this step a job with 10 * 6 minute tasks was submitted into a setup of 11 Workers. During the job one of the Workers was killed. This was detected by the WorkerWatcher which healed the job by deploying the associated task to the single remaining free Worker. The task is restarted at the replacement Worker so all time spent at the original Worker is wasted.

Run# | IDWC | Wasted time | IDWC w/waste | ADWC | JC | OH | OH% | HTCFE
1 | 360.000 | 202.456 | 562.456 | 581.515 | 318 | 19.059 | 3,39% | 56
2 | 360.000 | 131.077 | 491.077 | 516.451 | 10.314 | 25.374 | 5,17% | 47
3 | 360.000 | 132.492 | 492.492 | 506.421 | 304 | 13.929 | 2,83% | 81
4 | 360.000 | 127.763 | 487.763 | 501.497 | 306 | 13.734 | 2,82% | 45
5 | 360.000 | 120.836 | 480.836 | 496.480 | 346 | 15.644 | 3,25% | 42

Average OH%: 3,49%; average OH% without race condition (WORC): 3,07%; min: 2,82%; max: 5,17%

Table 7: Run results for the worker healing scenario

The WorkerWatcher handles Worker failure by finding a replacement Worker and redeploying the failed tasks there. The time needed for this handling process (HTCFE) is eclipsed by the total overhead time (OH), most of which stems from the timeout period. The timeout period, used by the Niche failure detector, was configured to be 10.000 milliseconds. The worst case scenario in failure detection is close to double the timeout, i.e. when a node is detected at the end of period N+1 after failing right after sending a heartbeat message at the beginning of period N. An overhead of between 10.000 and 20.000 milliseconds can therefore be expected and this is confirmed by the evaluation runs made. Note that run #2 was subject to a deployment race condition which increases the overhead, but this is not related to self-healing.
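As a short worked restatement of this bound, with $T$ denoting the failure detection timeout period:

$T \le t_{\text{detect}} < 2T \quad\Rightarrow\quad 10.000\ \text{ms} \le t_{\text{detect}} < 20.000\ \text{ms} \quad (T = 10.000\ \text{ms})$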


[Chart: additional overhead from Worker healing; overhead ratio to model for the no-failure (0,37%), worker-healing (3,49%) and worker-healing WORC (3,07%) cases with 10 workers]

Figure 25: Execution overhead in worker healing scenario vs. no failure scenario

Compared to the 10 task, 10 worker run with no failures from 3.5.3, the increase in overhead relative to the idealized model is considerable: the average goes from 0,37% to 3,49%, or 3,07% if the race condition is discarded. However, as a ratio of total time the overhead is still less than 5%. Again, the occurrence of a race condition in one of the results skews the results somewhat at the expense of the healing case.

3.5.4.2 Self-healing after Master failure

In this step a job with 10 * 6 minute tasks was submitted into a setup of 2 Masters. During the job the active Master was killed. This was detected by the MasterWatcher which healed the job by re-deploying it to the single remaining Master.

Run# | IDWC | ADWC | JC | OH | OH% | HJCFE | Total healing time
1 | 360.000 | 363.396 | 293 | 3.396 | 0,94% | 151 | 15.942
2 | 360.000 | 378.247 | 10.314 | 18.247 | 5,07% | 148 | 12.584
3 | 360.000 | 361.568 | 339 | 1.568 | 0,44% | 77 | 12.954

Average OH%: 2,15%; average OH% without race condition (WORC): 0,69%; min: 0,44%; max: 5,07%

Table 8: Run results for the master healing scenario

This scenario is more difficult to compare to the idealized model since the Master runs in parallel to the productive work being done in the Workers. This means that the healing time overhead can be masked by the Worker time. An extra overhead would only occur if the failure happened before task assignment or before returning of results. Time spent by the MasterWatcher on handling the failure (HJCFE) is again eclipsed by the timeout period. Its configuration to 10.000 milliseconds means that detection occurs between 10.000 and 20.000 milliseconds, and an overhead of approximately that time can be expected. This is exemplified by the runs.

[Chart: additional overhead from Master healing; overhead ratio to model for the no-failure (0,37%), master-healing (2,15%) and master-healing WORC (0,69%) cases with 10 workers]

Figure 26: Execution overhead in master healing scenario vs. no failure scenario

Compared to the 10 task, 10 worker run with no failures from 3.5.3, the increase in overhead relative to the idealized model is again considerable: the average goes from 0,37% to 2,15%. Worth noting, however, is the occurrence of the race condition in one of the healing runs. This instance pushes the overhead up to 5,07%. The average of the two other runs is 0,69%, which is a modest increase over the no failure case.

3.5.5 Step 5: Distributed job execution with self-configuration scenarios

This step quantifies the overhead involved in putting a newly joined node to work within the system. The step is evaluated in two different scenarios. In the first, a node joins a system where the availability of free Workers is below the minimum threshold. The second expands the first by adding a job into the picture, to quantify how fast the new Worker can be put to use and how much the job is sped up.

3.5.5.1 Single node join

Here the initial setup of the system contained a single free Worker component. This was below the configured free availability threshold of two such components. Therefore the self-management was continuously trying to find additional physical nodes with resources to deploy Workers upon. One minute after starting the system a node with enough resource capacity to take on a Worker joined the system.

Run # | WS | WD
1 | 7.773 | 766
2 | 16.483 | 846
3 | 6.844 | 740
4 | 15.642 | 925
5 | 8.396 | 894
Average | 11.028 | 834
Min | 6.844 | 740
Max | 16.483 | 925

Table 9: Results of jobless self-configuration scenario runs

YACS's self-configuration works by the ServiceWatcher and ServiceAggregator sensing availability within the system, through the RS components. This sensing was configured to happen every 10 seconds. In addition the ServiceAggregator was configured to allow the system to stabilize for 10 seconds from the last request to the ConfigurationManager before requesting again. The worst case for YACS detecting and deploying on the newly joined node should therefore be 20 seconds. This is exemplified by the evaluation runs: in every case a Worker is started on the new node within 20 seconds. The Niche functions of resource discovery and component deployment invoked by YACS's ConfigurationManager take close to 1 second (WD).

3.5.5.2 Single node join when running a starved job

The effects of a node joining the system when there is a Worker starved job were evaluated by running a scenario where a job with ten 6 minute tasks is being run in a system with 9 Workers. Without overhead, and without any join, this job would need two rounds of tasks, i.e. 12 minutes, to complete. Into this setup a Worker capable node was joined. If there were no overhead from joining and assigning, this should speed the job up since the newly deployed Worker takes on the remaining task.

Run# | TPT | JTIJ | IDWC | ADWC | JC | WS | TTFU | WD | OH | OH%
1 | 360.000 | 180.000 | 540.000 | 561.968 | 300 | 13.988 | 674 | 748 | 21.968 | 4,07%
2 | 360.000 | 180.000 | 540.000 | 561.935 | 334 | 11.684 | 581 | 705 | 21.935 | 4,06%
3 | 360.000 | 90.000 | 450.000 | 471.752 | 299 | 9.621 | 541 | 692 | 21.752 | 4,83%
4 | 360.000 | 100.000 | 460.000 | 481.918 | 347 | 14.695 | 631 | 741 | 21.918 | 4,76%
5 | 360.000 | 90.000 | 450.000 | 471.608 | 335 | 12.659 | 239 | 840 | 21.608 | 4,80%
Average | - | - | - | - | 323 | 12.529 | 533 | 745 | 21.836 | 4,51%
Min | - | - | - | - | 299 | 9.621 | 239 | 692 | 21.608 | 4,06%
Max | - | - | - | - | 347 | 14.695 | 674 | 840 | 21.968 | 4,83%

Table 10: Results of self-configuration scenario runs

Utilization of the newly joined node is subject to a worst case period of 20 seconds, as explained in 3.5.5.1. In each case a functional Worker is deployed within 20 seconds. An additional overhead stems from the way the Master discovers resources for tasks. In the worst case this would add an extra 5 seconds, but as it happens, in each case the new Worker is put to use within 1 second of starting. This can be seen in the "TTFU" (Time To First Utilization) column. A Niche discover and deploy time approaching 1 second is again demonstrated in the "WD" column. As is to be expected, the total time needed for the job is reduced by the introduction of the new worker. According to the model a job of 10 * 6 minute tasks in a 9 worker setup should take 12 minutes, or 720.000 milliseconds. In the join scenario this time is reduced, relative to how far into the job the node joined, see the IDWC and ADWC columns and the following figure 27:


[Chart: Actual Distributed Wall Clock time improvement relative to the 720.000 msec no-join baseline for runs 1-5; the model predicts improvements of roughly 25-38% and the actual runs show roughly 22-34%]

Figure 27: Ratio of time speedup after self-configuration; model and actual result

This scenario changed the Worker count from 9 to 10. Comparing to other 10 task scenarios, with 9 and 10 workers respectively, indicates some additional overhead, as seen in figure 28:

[Chart: comparison of 10 task overhead (average, min and max overhead ratio to idealized time) for the 9 worker, 10 worker, 9-10 worker and 9-10 worker without join time scenarios]

Figure 28: Comparison of 9-10 worker setups with respect to execution overhead


Some overhead is to be expected from the join procedure, as is visualized in the "9-10 workers" columns. However, factoring out the join overhead, as seen in "9-10 workers wo/join time", there still appears to be some extra overhead compared to the stable 9 and 10 worker setups.

3.5.6 Step 6: Distributed job execution subjected to churn

This step evaluates the quality of YACS in terms of tolerance to churn. Therefore, unlike in previous steps, the timing of overhead is not the most immediate notion of quality.

3.5.6.1 Setup

The initial physical setup consisted of 128 connected nodes and 64 nodes in reserve, ready to join. All of these nodes were equal except for the boot node. It was configured to remain free of all functional and management components except for the frontend, through which jobs are submitted and results are sent. The remaining 127 nodes (and all 64 reserve nodes, if joined) had an equal chance of receiving functional and management components. The initial functional setup consisted of 1 frontend, 1 resource service component, 2 masters and 31 workers. The self-management in terms of configuration was directed to try to maintain availability of at least 1 free functional component of each type.

3.5.6.2 Notion of tolerance

To determine the tolerance to churn a job consisting of 30 * 6 minute tasks was submitted into YACS. If YACS managed to complete the job given the frequency of churn it is said to be able to successfully tolerate that particular frequency. The tasks were a modified version of the sleep tasks used in previous steps in that they meta-checkpointed their status every 10 seconds. This was mainly done to reduce the time needed to evaluate; without checkpointing every failure could result in an additional delay of up to 6 minutes. Frequent checkpointing also has a beneficial side-effect for the evaluation because it places considerable load upon the WorkerWatcher responsible for maintaining the meta-checkpoints.

3.5.6.3 Frequency and type of churn

The churn frequency tested was in the range of 240 seconds to 5 seconds. The 240 second scenario means one failure during the lifetime of the job and is equivalent to that tested in 3.5.4. Given the 10 second timeout setting used, the 5 second frequency scenario was meant to test YACS at a point beyond what it is expected to reliably cope with. Three types of churn were used:
• Alternating leaves and joins, beginning with a leave.
• Only leaves.
• Random leaves and joins.
Some runs also had the constraint that all leaves were directed towards nodes participating in the submitted job.

3.5.6.4 Result


Run | Churn rate | Churn type | Only MW leaves? | Master heals | Worker heals | RS leaves | ME reinits | Node count | Tasks finish. | ADWC | OH | OH%
1 | 240.000 | alt | no | 0 | 0 | 0 | 0 | 128 | 30 | 384.584 | 24.584 | 6,83%
2 | 240.000 | alt | no | 0 | 0 | 0 | 0 | 128 | 30 | 382.955 | 22.955 | 6,38%
3 | 240.000 | alt | no | 0 | 1 | 0 | 0 | 128 | 30 | 374.135 | 14.135 | 3,93%
4 | 60.000 | alt | yes | 0 | 3 | 0 | 0 | 132 | 30 | 379.133 | 19.133 | 5,31%
5 | 30.000 | alt | yes | 0 | 5 | 0 | 0 | 135 | 30 | 399.197 | 39.197 | 10,89%
6 | 15.000 | alt | yes | 0 | 9 | 0 | 10 | 145 | 29 | Failure | - | -
7 | 60.000 | leave | yes | 0 | 4 | 0 | 0 | 128 | 30 | 384.042 | 24.042 | 6,68%
8 | 30.000 | leave | yes | 0 | 10 | 0 | 4 | 128 | 30 | 384.258 | 24.258 | 6,74%
9 | 15.000 | leave | yes | 0 | 17 | 0 | 24 | 128 | 29 | Failure | - | -
10 | 30.000 | rand | no | 0 | 2 | 0 | 0 | 137 | 30 | 384.047 | 24.047 | 6,68%
11 | 15.000 | rand | no | 0 | 3 | 0 | 6 | 142 | 30 | 374.279 | 14.279 | 3,97%
12 | 5.000 | rand | no | 0 | 8 | 0 | 8 | 175 | 29 | Failure | - | -
13 | 10.000 | rand | no | 0 | 3 | 0 | 8 | 151 | 30 | 384.264 | 24.264 | 6,74%

Table 11: Result of churn scenario runs

The service is consistently successful until the churn rate is in the vicinity of the configured 10.000 millisecond timeout. Although there are successful runs around that mark, there are also failed runs. This is not unexpected: when the churn frequency is so near the timeout rate it can happen that an entity failure is not detected by the monitoring failure detector before the detector itself also suffers a failure. In all three of the cases that failed the reason was a Worker departure which was not reported to the WorkerWatcher, which could therefore not take steps to re-deploy the associated task to another Worker. Steps could be taken within YACS itself to counter this scenario, e.g. by some failure detection polling mechanism. This could be subject to future work.


4. Discussion

During the course of this thesis project, YACS, a distributed job execution service capable of self-management behaviour, has been implemented. It is set within the context of the Grid4All project, which aims to connect the cheap resources of smaller users, like domestic users and small organizations, institutions and enterprises. These kinds of resources are generally less stable than those of traditional grids and are connected by communication links that vary much more in latency and throughput than those of the traditional grids. In an effort to provide a robust service and good quality of service in such an unstable environment, YACS employs the self-management behaviours of self-healing and self-configuration to repair damage from failures and to adapt to changes in availability and load.

Self-management has been suggested to be an appropriate method of dealing with the increased complexity and cost of computer systems (Ganek, Corbi, 2003), stemming from increased sophistication, scale, distribution and heterogeneity of participants. In this light, YACS's use of self-management to deal with the complexities of its prospective runtime environment is justified and an interesting case study.

Results from testing and initial evaluation indicate that YACS can indeed provide a functioning and adaptive service even in the light of failures and considerable churn. At the individual job execution level it has been shown to heal failures of allocated resources. This provides better quality of service for the client user as the desired work is more likely to complete. This also reduces the responsibilities of the client programmer, as the programmer doesn't need to write specific procedures to deal with these kinds of problems. At the system level YACS has been shown to be able to maintain necessary system services by healing upon resource failure. Job execution relies on these services so this capability is important. Furthermore, YACS has shown a capability for self-configuration by monitoring system state and specifically taking action to change the state and availability by deploying additional functional components. This is important to maintain responsiveness of the service even in the light of membership change or load.

The capabilities of YACS are primarily due to being built upon Niche. Even in a situation of considerable churn Niche's runtime infrastructure appears to enable robust communication as well as maintaining a fault-tolerant self-management hierarchy that provides system state sensing ability and an actuation API to change the state.

4.1 Niche experience

Since YACS relies heavily on Niche, some special attention to Niche is in order. A quantitative evaluation of Niche has implicitly been conducted through the evaluation of YACS. Here, a more subjective and qualitative view of Niche is presented in the form of a series of observations made during the course of the project. First, Niche's role during development is covered, then its role during runtime.

4.1.1 Development support and self-management application model

From a development standpoint the experience is that Niche offers a good and clear application model. Separating functional and self-management parts simplifies design by making responsibilities clearer. It also improves the quality of the implementation as individual components and their source code are focused on one main role, i.e. either functional or self-management functionality. The alternative of mixing those roles would make the logic more complex and harder to program, thereby increasing the chances of faulty code.


The Niche model of management element types and the implied hierarchical structure is also logical and in line with the control loop pattern suggested for self-management implementations. In general it could be said that Sensors form the sensing and monitoring part of the loop. Watchers and Aggregators aggregate data, provide partial analysis and contribute to the knowledge base. Managers can be seen to perform end level analysis, planning and actuation of changes.

Another positive note is the group abstraction. This masks changes resulting from node failures or joins, which simplifies programming considerably. One example of where this simplifies YACS is in the case of the Worker components assigned to a particular job. The Workers are not aware of which Master component is responsible for the job, they only know the group to which it belongs. If there were no group and the Master failed, then to heal the job every single Worker would have to be contacted independently and informed of the replacement Master. The group abstraction also hides implementation details. For example, currently each job is managed by one Master, but if stronger guarantees are to be implemented then increasing the replication degree is a logical step. Clients of the master group, i.e. the Workers, would not need to be aware of the composition of the group or how the members reach consensus on state and action. They simply send their message to the group and let the members take care of the rest. A further point on the benefits of groups is that, coupled with the "send to any/random member" possibility, they provide a convenient way of taking advantage of the scalability inherent in P2P deployments. What can appear to be a single centralized service can really be implemented as a group where load is distributed to a set of components already well distributed in the overlay. This is done in the system resource discovery service.

On the negative side, a notable weak point is the current lack of a component un-deployment API. This means, pending node departures, that management elements will accumulate within the system as the total number of submitted jobs increases. These management elements take up resources, require system administrative attention and cause complications in the event of failures. This also makes it impossible to implement self-management aimed at adapting to environments with a surplus availability of functional resources. However, this is understandable since Niche is still in development.

A notable development problem that was often encountered stems from the use of Fractal, in relation to templates, interfaces, bindings and groups. As it stands, interfaces are named and specified in special Fractal files as both client and server interfaces, depending on role. In source code they are also named and specified during bindings, template specification and group creation. All in all this means the same interface is referenced in many places, and it becomes very easy to make mistakes, as was frequently experienced.

4.1.2 Runtime infrastructure

Niche offers various runtime communication methods, and sensing and actuation support, that enable both functional and self-management behaviour. As mentioned at the beginning of this chapter, the initial indications from the evaluation are that Niche is able, up to the configured timeout point, to provide these services and support even in the event of churn.

At runtime Niche also offers transparent replication and restoration of management elements for fault-tolerant self-management capabilities. This is of course a very important and positive level of service that a self-managing application running in an unstable environment could not be without. However, the fact that the consistency of replicas is not guaranteed makes the work of the application programmer considerably harder. Fortunately this was solvable in YACS.

4.2 Future work

Although much work and effort has been put into this project there are areas where more work needs to be performed, improvements made or extensions introduced. Following are some examples.

Functional improvements
YACS offers only a basic set of job execution functionalities. Ideas for extensions include:
• Support for jobs with dependent tasks, and even workflows.
• Support for cross-communication between tasks, like MPI or PVM.
• Support for user programmed task scheduling.
• Integrated storage management interface.
• Security features, e.g. use of certificates for authentication.

Another issue is client access to the service. Currently there exists a generic Frontend component that launches whatever application the client requires. The application then interacts with the service, i.e. submits jobs and receives results, through the Frontend component. In theory any component that can bind to the required interfaces can submit jobs and receive results, but as of now this is done through the Frontend component. Formalizing how to access the service is required.

Internal logic can be improved in many areas. Notable areas include:
• Making resource discovery by Master components more intelligent and efficient.
• Extending the resource service discovery interface.

Self-management improvements
Current system self-configuration is primitive and only set to reactively maintain static goals of resource availability. The possibilities for extensions here are endless. For example, there already is a lot of information about availability, jobs and tasks within the self-management hierarchy. This can be used for various self-configuration and self-optimization schemes, for example proactive self-configuration based on information learnt from studying usage trends over time.

Evaluation
A considerably more exhaustive evaluation is needed, for example by trying larger deployments than 192 nodes and trying different distributions of churn events. Comparison to other execution services would certainly be interesting.




Appendix A: Iterative system development process

See yacs_thesis_design.doc


Appendix B: Programmer's manual

See yacs_programmers_manual.doc


Appendix C: Administrator's manual

See yacs_admin_manual.doc


Appendix D: Javadoc
