Adaptive Checkpointing in Dynamic Grids for Uncertain Job Durations

Maria Chtepen, Bart Dhoedt, Filip De Turck, Piet Demeester
INTEC-IBBT, Ghent University, Sint-Pietersnieuwstraat 41, Ghent, Belgium
{maria.chtepen, bart.dhoedt, fillip.deturck, piet.demeester}@intec.ugent.be

Filip H.A. Claeys
MOSTforWATER NV, Koning Leopold III-laan 2, Kortrijk, Belgium
[email protected]

Peter A. Vanrolleghem
modelEAU, Université Laval, Québec, Qc, G1K 7P4, Canada
[email protected]

Abstract. Adaptive checkpointing is a relatively new approach that is particularly suitable for providing fault-tolerance in dynamic and unstable grid environments. The approach allows for periodic modification of checkpointing intervals at run-time, when additional information becomes available. In this paper an adaptive algorithm, named MeanFailureCP+, is introduced that deals with checkpointing of grid applications whose execution times are unknown a priori. The algorithm modifies its parameters based on dynamically collected feedback on its performance. Simulation results show that the new algorithm performs even better than adaptive approaches that make use of exact information on job execution times.

Keywords. Grid computing, fault-tolerance, adaptive checkpointing.

1. Introduction

Fault-tolerance is an important issue in the domain of grid computing, since grids are composed of highly distributed, decentrally managed and thus potentially unreliable resources. Application (job) checkpointing is a technique commonly applied to provide fault-tolerance in grids. The efficiency of this technique strongly depends on a good choice of the checkpointing interval: an overly short interval leads to a large number of redundant checkpoints, which delay job processing by consuming computational and network resources; on the other hand, when the interval is too long, a substantial amount of work has to be redone in case of a resource failure. The optimal length of a checkpointing interval is, however, extremely hard to determine before run-time, when no exact knowledge of job and grid parameters (job execution time, resource failure pattern, etc.) is available. Furthermore, grid parameters can change dynamically over time, which implies that even if an appropriate checkpointing interval is initially chosen, the performance of a static checkpointing algorithm that relies on this choice will degrade over time. To deal with this issue, research in the checkpointing area has recently turned its attention to adaptive checkpointing solutions [2-7]. The latter allow for dynamic modification of an initial checkpointing interval as more information becomes available on grid workload and system parameters. (A toy cost model at the end of this section makes the interval tradeoff concrete.)

In this paper a new adaptive checkpointing approach, named MeanFailureCP+, is introduced. MeanFailureCP+ is designed to operate in the absence of exact information on job length. The algorithm avoids unnecessary checkpointing by modifying its internal parameters as a function of dynamically collected feedback on the system performance. We compare the performance of the new algorithm against that of the adaptive solution introduced in our previous work (MeanFailureCP) [1]. MeanFailureCP modifies (increases or decreases) a job checkpointing interval as a function of the mean failure frequency of the resources where the job is being executed and the total job execution time. The main disadvantage of this algorithm is that it relies on the assumption that the exact job length can be provided in advance, while for most existing real-world applications this cannot be taken for granted. Furthermore, our recent research has shown that MeanFailureCP, while significantly outperforming periodic checkpointing, still introduces a considerable amount of redundant state savings. MeanFailureCP+, on the other hand, not only weakens the requirement that the exact job duration be known in advance, but also further reduces the checkpointing overhead.

This paper is organized as follows: Section 2 gives an overview of related work; Section 3 summarizes the operation of MeanFailureCP; in Section 4 the MeanFailureCP performance is evaluated; Section 5 discusses MeanFailureCP+; MeanFailureCP+ is evaluated in Section 6; and, finally, Section 7 concludes the paper.
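The following back-of-the-envelope Python sketch illustrates the interval tradeoff discussed above. It is not the model used in this paper; all parameter values and names are purely illustrative assumptions.

import math

def expected_overhead(T, E=3600.0, C=5.0, M=1800.0):
    # E: job length (s), C: cost of one checkpoint (s),
    # M: mean time between resource failures (s).
    # Checkpoint cost grows as (E / T) * C; expected rework after
    # failures grows as (E / M) * (T / 2).
    return (E / T) * C + (E / M) * (T / 2)

# In this toy model the total overhead is minimized near
# T = sqrt(2 * C * M), roughly 134 s for the values above.
best_T = math.sqrt(2 * 5.0 * 1800.0)

Both very short and very long intervals inflate the total overhead, which is exactly the dilemma an adaptive scheme tries to resolve at run-time.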

2. Related Work

In [7] an on-line checkpointing algorithm is proposed that can be seen as a predecessor of modern adaptive solutions. The algorithm uses on-line knowledge of the current cost of a checkpoint when deciding whether or not a checkpoint has to be taken. The main idea behind the algorithm is to look for points in an application at which its state size is small, where placing a checkpoint is most beneficial. At these points checkpointing is performed frequently, while at points with a high cost, long checkpointing intervals are used. An obvious disadvantage of this approach is that it does not take the resource failure pattern into account. In [4] and [5] the so-called cooperative checkpointing concept is introduced, which addresses system performance and robustness issues by allowing the application programmer, the compiler and the run-time system to jointly decide on the necessity of each checkpoint. The algorithm proposed in this paper is also based on this concept and can thus be seen as a cooperative (adaptive) heuristic. In [6] adaptive checkpointing is applied for fault detection and recovery. Overhead is reduced by differentiating the frequencies of store checkpoints (SCPs) and compare checkpoints (CCPs). The disadvantage of this scheme is that it requires accurate information on the remaining job execution time and the expected remaining number of failures before job termination. [2], in turn, considers dynamic checkpointing interval reduction only when it leads to a computational gain, quantified by the sum of the differences between the means of fault-affected and fault-unaffected job response times. In [3] yet another adaptive fault management scheme (FT-Pro) is discussed. FT-Pro combines adaptive checkpointing with proactive process migration. The approach optimizes application execution time by considering the failure impact and the prevention costs. FT-Pro supports three prevention actions: skip checkpoint, take checkpoint and migrate. An adaptation manager selects an appropriate action in response to failure prediction. The effectiveness of FT-Pro strongly depends on the quality of this prediction.

3. MeanFailureCP

Figure 1. Operation of MeanFailureCP on a resource running a single job

MeanFailureCP is an adaptive algorithm that dynamically modifies the initially specified checkpointing interval to optimize the number of checkpoints taken and thus to reduce the computational overhead. The size of the adopted checkpointing interval (Irj) is determined by the currently remaining job execution time (RErj) and the average failure interval (MFr) of the resource r to which the job j is assigned. The operation of the algorithm is visualized in Fig. 1. MeanFailureCP is first activated a short time period ti (defined by the end-user) after the beginning of job execution (Step 1). Early activation of the algorithm opens the possibility to modify the checkpointing interval at an early stage of job processing. In each iteration the algorithm checkpoints the job state and determines the timestamp of the next checkpointing event as follows:

If RErj < MFr and Irj < α × Erj, where α is a user-specified parameter and Erj is the total execution time of the job j on the resource r: the checkpointing interval is increased, Irjnew = Irjold + I, where I is the length of the initial checkpointing interval provided by the end-user (Step 2). The first condition reduces the checkpointing overhead for sufficiently stable resources or almost finished jobs. The second condition prevents excessive growth of Irj compared to the job length.

If RErj > MFr or Irj ≥ α × Erj: the checkpointing interval is decreased, Irjnew = Irjold − I (Step 3). When reducing the checkpointing interval, the following constraint should be taken into account: C < β × Erj ≤ Irjnew, where β < 1 is a user-defined value that ensures that the time interval between consecutive checkpoints never decreases below the time overhead added to the job execution time by each checkpoint (C). Experiments have shown that, to prevent undesirably steep decreases of the checkpointing interval, the value assigned to β should be at least 0.01, or 1% of the job length.

Finally, modifying Irj in steps of I ensures fast convergence to a (sub)optimal checkpointing frequency in most distributed environments.
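The decision step above can be summarized in a few lines of Python. This is a minimal sketch under the stated rules; the function and variable names are our own, and clamping the decreased interval to the larger of C and β × Erj is an assumption, since the paper only states the constraint C < β × Erj ≤ Irjnew.

def next_checkpoint_interval(I_cur, I_init, RE, MF, E, C, alpha=2.0, beta=0.01):
    # I_cur: current interval Irj; I_init: user-provided initial interval I
    # RE: remaining execution time RErj; MF: mean failure interval MFr
    # E: (estimated) total execution time Erj; C: overhead of one checkpoint
    if RE < MF and I_cur < alpha * E:
        # Stable resource or nearly finished job: checkpoint less often (Step 2).
        return I_cur + I_init
    # Failure-prone resource or oversized interval: checkpoint more often,
    # but never drop below the floor beta * E, itself required to exceed C (Step 3).
    return max(I_cur - I_init, beta * E, C)

A call to such a function would replace Steps 2-3 of Fig. 1 each time the current interval expires.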

4. Performance Evaluation of MeanFailureCP

Figure 2. Probability density function of the job length distribution for example values of the deviation (Dev) parameter (Dev = 10, 100, 1,000; job length in minutes)

MeanFailureCP assumes that the exact job length is known beforehand. However, there are two problems with this assumption. First of all, it is inapplicable to a large group of real-world applications, for which only a very rough estimate of the total job length can be provided in advance. Secondly, recent simulation experiments show that knowledge of the exact job length does not necessarily lead to better algorithm performance. The latter is a consequence of the fact that MeanFailureCP does not generate the optimal number of checkpoints, which leads to some redundancy. By carefully calibrating the algorithm's parameters, this redundancy can be eliminated to a large extent. In contrast to [1], in this section we evaluate the influence of the quality of job length estimates on the performance of MeanFailureCP.

Using the discrete event grid simulation environment DSiDE [1], we model a heavily loaded dynamic grid consisting of 128 computational resources, equally spread over 4 globally distributed sites. Jobs submitted to the grid have a normally distributed length with an average of 1 hour and a standard deviation varying as shown in Fig. 2. The checkpointing overhead C varies from 2 to 5 s, and the data size of each checkpoint, which is transferred over the network to a single checkpointing server, is 10 MB. α and β are initialized with 2 and 0.01, respectively. Two simulation scenarios are considered: in the first scenario grid resources are assumed to be highly unstable (cf. a desktop grid), with the dynamics of the failure occurrence modeled by means of a Weibull distribution with scale parameter 1800 s (30 min) and shape parameter 0.7; in the second scenario failures happen less frequently (scale 10800 s, shape 0.7), which means that jobs have a high probability of executing without being disturbed by a failure. For both scenarios we observe the performance of MeanFailureCP when Erj is calculated using either the exact job length, the average length over all submitted jobs, or a certain deviation from this average. A minimal sketch of this workload and failure model is given at the end of this section.

Fig. 3 and Fig. 4 show, for the unstable grid, the number of successfully executed jobs and the average number of checkpoints saved per job, for varying probability density functions of the job length distribution. Fig. 5 and Fig. 6 show the same parameters for the second simulation scenario. The deviation from the average job length is denoted in the figures with "+" and "−" signs, where, for example, "avg-30%" means that the length of the submitted jobs was assumed to be the average over all jobs decreased by 30%. The simulation results show that MeanFailureCP does not necessarily perform better with the exact job length. For instance, in the case of highly unstable resources (see Fig. 3 and Fig. 4), there is a relatively large set of approximation values for which the algorithm performs just as well or even better. In the example at hand, the system performance improves by 10% when the length of the submitted jobs is assumed to be twice as high as the average value. This can be explained by the fact that the assumed job length, in combination with the mean failure frequency of resources, further optimizes the number of checkpoints performed, compared to the exact algorithm. However, as can be seen in the figures, when the number of checkpoints taken keeps decreasing, the performance of MeanFailureCP degrades considerably. Simulation experiments have shown that there can be several (sub)optima, on the positive and on the negative side of the average; however, one of them, if any, always lies on the positive side. In general, a decrease in the assumed job length below the average value leads to a rapid decrease in the number of checkpoints, since in that case the condition RErj < MFr almost always evaluates to true, which results in growth of the checkpointing interval. The reason for this behavior of the algorithm actually lies in the imposed limitations on the growth/decrease of checkpointing intervals. For instance, the constraint β × Erj ≤ Irjnew ensures that even in the case of a heavily overestimated job length and frequent failures, which would normally lead to exaggerated checkpointing, the interval is limited to a percentage of the predicted job length.
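The following minimal Python sketch reproduces the workload and failure model under the parameters stated above. The names are illustrative assumptions; note that Python's random.weibullvariate takes the scale parameter first, then the shape.

import random

MEAN_JOB_LENGTH_S = 3600.0      # average job length: 1 hour
JOB_LENGTH_STDDEV_S = 100.0     # one of the tested standard deviations

def sample_job_length():
    # Normally distributed job lengths, truncated to stay positive.
    return max(1.0, random.gauss(MEAN_JOB_LENGTH_S, JOB_LENGTH_STDDEV_S))

def sample_time_to_failure(scale_s=1800.0, shape=0.7):
    # Unstable scenario: Weibull with scale 1800 s (30 min) and shape 0.7.
    # The stable scenario would use scale_s=10800.0 instead.
    return random.weibullvariate(scale_s, shape)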

The above results suggest that, despite the fact that MeanFailureCP is more efficient and flexible than periodic checkpointing [1], in unstable grids the algorithm is still open to further performance improvement. On the other hand, when the grid system is stable (see Fig. 5 and Fig. 6), MeanFailureCP performs more or less similarly for all considered job length values. This is the result of the overall limited checkpointing, caused by long failure-free intervals, and the reduced effect of failures on the system performance. Clearly, the optimal job length prediction depends on several parameters, such as the length of the failure-free interval, the limits on the checkpointing interval, the checkpointing overhead, etc. Not only is it hard to collect reliable estimates for these parameters beforehand, but the actual values of the considered parameters will presumably also change over time, which undermines the usability of static estimates. Therefore, in the following section we introduce MeanFailureCP+, which performs a dynamic search for the optimal job length estimate, using run-time information on the system performance.

Figure 3. Average number of jobs executed by MeanFailureCP, with varying job length estimation, in an unstable grid (number of jobs versus standard deviation of the job length)

Figure 4. Average number of checkpoints saved per job by MeanFailureCP, with varying job length estimation, in an unstable grid

Figure 5. Average number of jobs executed by MeanFailureCP, with varying job length estimation, in a stable grid

Figure 6. Average number of checkpoints saved per job by MeanFailureCP, with varying job length estimation, in a stable grid


5. MeanFailureCP+

A typical grid application generates batches of similar jobs, which are submitted to the grid for processing more or less simultaneously. Therefore, in contrast to MeanFailureCP, which considers individual jobs and requires their exact length to be known in advance, MeanFailureCP+ operates on job batches and needs only a rough initial job length estimate (ILb) to be provided by an end-user. Observation of real-world applications leads us to the conclusion that the spread of job lengths within a single batch can be approximated by a normal distribution. The average of this distribution can be derived from historical information on previous application runs and used to initialize ILb. To optimize the system throughput, MeanFailureCP+ dynamically monitors the number of jobs processed during a monitoring interval of predefined length MIb and, based on this feedback, modifies subsequent job length estimates (Lb) in such a way that the checkpointing overhead is minimized without significantly penalizing the system fault-tolerance. The length of the interval MIb should be chosen as a function of ILb.

Similar to MeanFailureCP, MeanFailureCP+ is first activated a short time interval ti after the beginning of the job execution and is afterwards called each time Irj expires. The algorithm proceeds as follows (a sketch of the feedback step is given after this description):

If Tc − Tm < MIb, where Tc is the current time and Tm stands for the start time of the last monitoring interval: MeanFailureCP is run with Erj = Lb = ILb.

If Tc − Tm ≥ MIb and NMI < 2, where NMI is the number of monitoring intervals already elapsed: MeanFailureCP+ slightly increases the job length estimate by a small randomly chosen value, called the deviation value (D), which in our case is set to 0.1: Lb = Lb + Lb × D. Afterwards, MeanFailureCP is executed with the new value of Erj = Lb. The gradual increase of Lb allows the algorithm to explore other job length estimates and to escape from an eventual local maximum. In the following phase the algorithm evaluates the effect of this slight increase in job length; at this point in its execution, however, there is still insufficient performance data collected to perform the evaluation (NMI < 2).

If Tc − Tm ≥ MIb and NMI ≥ 2: the performance of the algorithm over the past two monitoring intervals is evaluated. Each time a job successfully terminates its execution, the job count of the current monitoring interval is incremented. In this phase, the algorithm compares the number of jobs executed during the last monitoring interval (NJL) against the number of jobs executed during the last-but-one monitoring interval (NJLBO). If NJL = NJLBO, the deviation value is again slightly incremented, D = D + 0.1, together with the estimated job length, Lb = Lb + ILb × D. If NJL > NJLBO, recent changes have had a positive effect on the algorithm's performance. Therefore, we again increase the deviation value, D = D + 0.1. Afterwards, the new value of D is compared against the performance increase PI = (NJL − NJLBO) ÷ (NJLBO × 0.01). If PI > D, Lb is modified as Lb = Lb + ILb × PI; otherwise Lb = Lb + ILb × D. This operation ensures that the increase in the estimated job length is at least proportional to the achieved performance increase. Finally, if NJL < NJLBO, the current value of Lb is too high and has to be reduced. The size of the reduction is chosen proportional to the decrease in performance: Lb = Lb − ILb × ((NJLBO − NJL) ÷ (NJLBO × 0.01)).

Once optimal values of Lb and MIb are found for a particular application, they can be saved and reused for subsequent runs of the application.
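The feedback step can be condensed into the following Python sketch. It follows the update rules exactly as stated above, including the percentage-style PI formula; the function name and the tuple return are our own conventions, and NJLBO is assumed to be non-zero.

def update_length_estimate(L, IL, D, njl, njlbo):
    # L: current estimate Lb; IL: initial estimate ILb; D: deviation value
    # njl / njlbo: jobs finished in the last / last-but-one monitoring interval
    if njl == njlbo:
        D += 0.1
        L += IL * D                     # equal throughput: keep probing upwards
    elif njl > njlbo:
        D += 0.1
        pi = (njl - njlbo) / (njlbo * 0.01)   # performance increase, in percent
        L += IL * (pi if pi > D else D)  # grow at least in proportion to the gain
    else:
        # Throughput dropped: shrink the estimate proportionally to the loss.
        L -= IL * ((njlbo - njl) / (njlbo * 0.01))
    return L, D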

6. Performance Evaluation of MeanFailureCP+

We evaluate the performance of MeanFailureCP+ in the simulated grid environment described in Section 4. The initial job length estimate ILb is set to 1 hour, the average length of all submitted jobs, and the monitoring interval MIb is initialized in turn with 30 min, 1 hour and 2 hours. Fig. 7 and Fig. 8 show the simulation results for the two grid failure scenarios. For comparison, next to the number of jobs successfully processed by MeanFailureCP+, the figures depict the number of jobs processed by MeanFailureCP with the exact job lengths. The best result achieved in Section 4 (MFCP, avg+100%) is also presented in the figures. In the case of highly unstable grids, MeanFailureCP+ with the monitoring interval equal to the average job length leads to the best job throughput. However, MeanFailureCP+ with MIb initialized to the other values also performs better than MeanFailureCP. As can be expected, within a stable grid the benefit of MeanFailureCP+ is less significant.

Figure 7. Number of jobs successfully executed by MeanFailureCP+, with varying monitoring interval, in an unstable grid


Figure 8. Number of jobs successfully executed by MeanFailureCP+, with varying monitoring interval, in a stable grid

7. Conclusion

Adaptive job checkpointing is a highly suitable technique for providing fault-tolerance in heterogeneous and decentrally managed grids. Its main advantage is that it allows for dynamic modification of checkpointing intervals as a function of application and system parameters collected at run-time. This paper introduces an adaptive checkpointing algorithm named MeanFailureCP+ that operates in the absence of information on the total job duration. The algorithm initially requires only a rough estimate of the job length, which is modified at run-time based on dynamically collected information on the algorithm's performance. We compare the performance of this new feedback-based approach against the performance of an adaptive checkpointing algorithm named MeanFailureCP. MeanFailureCP determines an appropriate checkpointing frequency based on the job execution time and the resource failure frequency. MeanFailureCP requires, however, that the exact job length be known before run-time. In this paper we show that knowledge of the exact job length is an overly strict requirement, which does not necessarily lead to optimal algorithm performance. Simulation results show that MeanFailureCP+, without a priori knowledge of the exact job length, increases grid throughput by up to 10% compared to the throughput of MeanFailureCP initialized with exact values.

8. References

[1] Chtepen M, Claeys F.H.A., Dhoedt B, De Turck F, Demeester P, Vanrolleghem P.A. Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids. IEEE Transactions on Parallel and Distributed Systems 2009; 20(2): 180-190.

[2] Katsaros P, Angelis L, Lazos C. Performance and Effectiveness Trade-off for Checkpointing in Fault-Tolerant Distributed Systems. Concurrency and Computation: Practice & Experience 2007; 19(1): 37-63.

[3] Lan Z, Li Y. Adaptive Fault Management of Parallel Applications for High-Performance Computing. IEEE Transactions on Computers 2008; 57(12): 1647-1660.

[4] Oliner A, Rudolph L, Sahoo R. Cooperative Checkpointing: a Robust Approach to Large-Scale Systems Reliability. In: Proceedings of the 20th Annual International Conference on Supercomputing; 2006 Jun 28 - Jul 1; Cairns, Queensland, Australia.

[5] Oliner A, Sahoo R. Evaluating Cooperative Checkpointing for Supercomputing Systems. In: Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS'06); 2006 Apr 25-29; Rhodes Island, Greece.

[6] Xiang Y, Li Z, Chen H. Optimizing Adaptive Checkpointing Schemes for Grid Workflow Systems. In: Proceedings of the 5th International Conference on Grid and Cooperative Computing (GCC'06); 2006 Oct 21-23; Changsha, Hunan, China.

[7] Ziv A, Bruck J. An On-Line Algorithm for Checkpoint Placement. IEEE Transactions on Computers 1997; 46(9): 976-985.