Probabilistic Checkpointing in Time Warp Parallel Simulation

Seng Chuan Tay* and Yong Meng Teo†

National University of Singapore 3 Science Drive 2 Singapore 117543

email: [email protected]

Abstract

In the Time Warp (TW) protocol, the system state must be checkpointed to facilitate the rollback operation. While increasing the checkpointing frequency increases the state saving cost, an infrequent scheme escalates the coast forward effort when a large number of executed events must be redone. This paper proposes a probabilistic approach to checkpointing. We derive the rollback probability, and compute the expected coast forward effort if a state is not saved. To reduce implementation overheads, the rollback probability and coast forward cost are predetermined and made available at runtime as a lookup table. Based on the derived expectation, a state vector is saved only if the expected coast forward effort is larger than the state saving cost. Our experiments show that the cost model reduces the simulation elapsed time by close to 30% compared with saving the system state after each event execution and with saving the system state at a predefined interval.

1 Introduction

Time Warp (TW) is an optimistic mechanism [7] used to manage event execution in parallel discrete-event simulation (PDES). In contrast to the conservative mechanism [1, 10], the events processed in TW logical processes (LPs) may violate the causality constraint [4]. An out-of-sequence event message (M), also called a straggler, is identified if the local virtual time (LVT) of the destined LP is greater than the timestamp of the arriving message (TS(M)). When such a causality error occurs, a rollback procedure annuls the events simulated ahead of time. The destined LP recovers by sending notifications to its successors to cancel the messages it has erroneously sent, restoring itself to the latest state before TS(M), and re-executing its simulation from there on. Hence, each LP has to maintain a timestamped state queue so that a correct past state can be recovered.

Checkpointing the simulated system can be a costly overhead in the TW mechanism [11]. The conventional approach is to save the state whenever an event is executed (that is, the checkpointing interval w = 1). Because every event is then a potential recovery point, rollback can be carried out efficiently, i.e., M can be executed immediately after the closest state with timestamp less than TS(M) is restored. Nonetheless, such a checkpointing scheme can become expensive in terms of wall-clock time, especially when the state vector is large.

The proposed probabilistic state saving scheme is a specialization of a recent checkpointing cost model [16]. The decision to save a state vector is based on the probability of restoring the state vector due to rollback, the position of the last checkpoint, and the granularity of intermediate events. We weigh the loss and gain in wall-clock time before a state is saved (or not saved). The rollback probability is derived using mathematical convolution, and applied to a back-tracking algorithm for computing the expected coast forward effort. In the proposed cost model, a state vector is saved only if the expected coast forward effort is larger than the state saving cost. The advantages of the cost model are:

1. It does not need to collect statistical data at runtime, so no extrapolation is performed. Instead, the cost model is based on a strong mathematical foundation.
2. It ensures that the decision to save a state is more cost effective, thus the overall simulation time can be reduced.

The rest of this paper is organized as follows. Section 2 gives an overview of related work on state saving. Section 3 presents the probabilistic cost model for the checkpointing decision. We derive the rollback probability based on statistical distributions and compute the coast forward effort using a back-tracking algorithm. Subsequently, the expected coast forward effort is compared with the state saving time before a checkpointing decision is made. Such a comparison ensures that the system is checkpointed only if it is cost effective. Section 4 investigates the effectiveness of the proposed cost model in reducing the overall simulation elapsed time against two checkpointing schemes. We also evaluate the average number of coasted forward events incurred in a rollback, the hit ratio where coasting forward is not needed, and the ratio of the number of states saved with respect to the number of events executed. The aggregate effect of these factors on the simulation elapsed time is also analyzed. Finally, section 5 contains our concluding remarks and some discussion of future work.

* Centre for Information Technology and Applications, Faculty of Science
† Department of Computer Science

2 Related Work

Over the years, many schemes such as incremental, infrequent, adaptive or hybrid have been proposed to reduce the state saving cost. Instead of saving the entire vector, the incremental approach [3, 18, 22] saves only the changes to the state. Where memory consumption is a concern, the incremental approach is useful if the state vector is large and only a small portion is modified after an event has been executed. However, the incremental approach requires additional processor time to reconstruct the desired state from the incremental changes, thereby incurring a performance penalty.

The infrequent approach [8, 12] reduces the frequency of state saving, i.e., w > 1. As a result, the state saving cost decreases proportionally. However, the infrequent state saving approach also has drawbacks. Suppose a state S of timestamp TS(S) is restored after a straggler is detected. All the events in the time interval from TS(S) to TS(M) will need to be redone before M can be executed. Such an overhead is repeated effort and is proportional to the size of the checkpointing interval.

Adaptive schemes use the dynamics of the simulator at runtime, and allow LPs to adjust their checkpointing interval on the fly with respect to the simulation advancement. Variations of such schemes depend on the parameters used, such as memory usage [2], the time spent in saving state and events and the restoration time [17], and rollback behavior [9]. Such adaptiveness depends on the characteristics of the statistical data collected, thus the decision to save a state is also based on extrapolation of the runtime history. The prediction is accurate provided the system is stable throughout the whole simulation run. Otherwise, the extrapolation may not be appropriate and can produce adverse effects.

Recently, hybrid approaches have been proposed, such as combining the periodic (or infrequent) approach and the probabilistic approach [15], combining event history and the incremental approach [14], embedding the incremental state saving mode on a sparse state saving basis [13], multiplexing the incremental approach and the infrequent approach at a fixed interval [6], and switching automatically between the periodic approach and the incremental approach based on a cost model constructed from runtime statistics [19].
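The trade-off of the infrequent approach described above can be illustrated with a small sketch. It assumes, purely for illustration, that states are saved at event indices 0, w, 2w, ... and that a rollback has to resume from a given event index; the function names are hypothetical and do not come from any of the cited schemes.

```python
def coast_forward_events(target_index: int, w: int) -> int:
    """Events to redo when rolling back to `target_index`, assuming states
    are saved only at indices 0, w, 2w, ... (an illustrative assumption)."""
    if w < 1 or target_index < 0:
        raise ValueError("w must be >= 1 and target_index must be >= 0")
    last_saved = (target_index // w) * w  # nearest saved state at or before the target
    return target_index - last_saved


def states_saved(num_events: int, w: int) -> int:
    """Number of states saved for events with indices 0 .. num_events - 1."""
    if w < 1 or num_events < 0:
        raise ValueError("w must be >= 1 and num_events must be >= 0")
    return (num_events + w - 1) // w
```

With w = 1 no event is ever redone but every state is saved; with a larger w fewer states are saved and up to w - 1 events may have to be redone per rollback.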

3 The Probabilistic Approach

The following assumptions are made in the probabilistic model:

1. the simulator contains p homogeneous LPs and p homogeneous processing elements (PEs)¹
2. the placement of LPs on PEs is one-to-one
3. each arrival event has a corresponding departure event in the same LP, and the execution of each departure event will in turn schedule an arrival event in one of its succeeding LPs
4. the state vector is saved after an event is executed
5. memory space is sufficient to complete the simulation

Table 1 contains a list of parameters used to derive the rollback probability. We assume that the inter-LVT advancement time has an exponential distribution of mean $1/\delta$, where $\delta$ is the LVT advancement rate defined as follows:

$$\delta = \begin{cases} 2\lambda & \text{if } \lambda < \mu \\ \lambda + \mu & \text{otherwise} \end{cases}$$

where $\lambda$ is the arrival rate, and $\mu$ the service rate. The LVT at the n-th state, denoted by $LVT_n$, is modeled based on the following observations:

• The first event processed by an LP is an arrival event. Otherwise, the causality constraint is violated.
• An LP cannot advance its LVT until the first arrival event is processed.
• An LP at the n-th state has advanced its LVT (n − 1) times.

Let $LVT_n$ denote the clock time in an LP when n events are processed. Assume that the interarrival time and service time are independent and identically distributed (IID). We can parameterize $LVT_n$ as the sum of two random variables $R_1$ and $R_2$, where $R_1 \sim \exp(\lambda)$ and $R_2 \sim \mathrm{gamma}(\delta, n-1)$. Let the random variable Z represent the LVT of an LP at the n-th state. The probability density function of Z is given as follows:

$$g(z) = \begin{cases} \lambda\, T^{\,n-1} \left[ e^{-\lambda z} - e^{-\delta z} \sum_{k=0}^{n-2} \frac{(\Delta z)^k}{k!} \right] & \text{if } z \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\Delta = \delta - \lambda$ and $T = \delta/\Delta$ (refer to [20]).

We model the communication delay by two time components: buffer access time and transmission time. The buffer access time is charged to both the sending and the receiving LP. To prevent double accounting, the transmission time is charged to the sender only. The wall-clock duration of each transmission is $T_{buffer} + T_{transit}$, and of each reception is $T_{buffer}$ (figure 1). Therefore, the time for a message to travel from its sender to the receiver is $2 \times T_{buffer} + T_{transit}$, and the number of events processed during this communication delay is

$$c = \left\lceil \frac{2 \times T_{buffer} + T_{transit}}{T_{event}} \right\rceil.$$

In the following analysis, the exponential distribution is used to model the interarrival time and service time because of its memoryless property.

¹ The homogeneity assumption is made in this paper to obtain a mathematically tractable solution. Work is in progress to model heterogeneous LPs and PEs using numerical methods to approximate the mathematical solutions.
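As a numerical sanity check of the LVT model and the communication delay above, the following sketch evaluates the density of the LVT at the n-th state, with the LVT modeled as an exp(arrival rate) variable plus a gamma variable of n − 1 stages at the LVT advancement rate, and integrates it with the trapezoidal rule. The function and parameter names are assumptions made for this illustration, and the rates in the usage note are arbitrary rather than values from the experiments.

```python
import math


def lvt_rate(lam: float, mu: float) -> float:
    """LVT advancement rate: 2*lam if lam < mu, otherwise lam + mu."""
    return 2.0 * lam if lam < mu else lam + mu


def lvt_density(z: float, n: int, lam: float, mu: float) -> float:
    """Density of the LVT at the n-th state (n >= 1), evaluated at z."""
    if n < 1:
        raise ValueError("n must be >= 1")
    if z < 0.0:
        return 0.0
    delta = lvt_rate(lam, mu)
    gap = delta - lam            # the quantity written as Delta in the text
    t = delta / gap              # the quantity written as T in the text
    # Partial exponential series with terms k = 0 .. n-2 (empty when n == 1).
    series = sum((gap * z) ** k / math.factorial(k) for k in range(n - 1))
    return lam * t ** (n - 1) * (math.exp(-lam * z) - math.exp(-delta * z) * series)


def comm_delay_events(t_event: float, t_buffer: float, t_transit: float) -> int:
    """Number of events processed during one communication delay."""
    return math.ceil((2.0 * t_buffer + t_transit) / t_event)


def integrate(f, upper: float, steps: int = 20000) -> float:
    """Trapezoidal rule on [0, upper]."""
    h = upper / steps
    total = 0.5 * (f(0.0) + f(upper))
    total += sum(f(i * h) for i in range(1, steps))
    return total * h
```

For example, `integrate(lambda z: lvt_density(z, 5, 1.0, 2.0), 60.0)` should come out close to 1, and the mean obtained by integrating `z * lvt_density(...)` should be close to 1/lam + (n − 1)/delta. With the measurements reported later in table 2, `comm_delay_events(1200, 2750, 1290)` gives 6.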

Table 1: Parameters and Measures of Probabilistic Checkpointing Cost Model

| Category | Parameter | Description |
|---|---|---|
| system | $\lambda$ | arrival rate (per simulated time) of each LP |
| system | $\mu$ | service rate (per simulated time) of each LP |
| system | $\delta$ | LVT advancement rate (per simulated time) |
| system | $c$ | communication delay (in terms of number of events processed) |
| system | $a$ | lower bound of GVT window (in terms of event index) |
| system | $\widehat{GVT}$ | number of events processed (less number of events rolled back) before a GVT computation is activated |
| measured | $T_{event}$ | event (arrival or departure) execution time |
| measured | $T_{state}$ | state saving time |
| measured | $T_{buffer}$ | buffer (receive or transmit) access time |
| measured | $T_{transit}$ | message transmission time |
| derived (probabilistic rollback) | $rb(I_0, J_0)$ | probability that an event message sent at index $I_0$ will cause a rollback when it is processed at index $J_0$ |
| derived (probabilistic rollback) | $RB^{+}(J_0)$ | probability that a straggler is processed at index $J_0$ |
| derived (probabilistic rollback) | $halt^{I_k}_{J_k}(d_k)$ | probability that a rollback caused by a straggler sent at index $I_k$ of the source LP and processed at index $J_k$ of the destined LP will stop after $d_k$ events are undone |
| derived (checkpointing overhead) | $T_{CCF}(J_0)$ | coast forward effort required to re-execute the events when the state at index $J_0$ is not saved |
| derived (checkpointing overhead) | $\overline{T}_{CCF}(J_0)$ | expectation of $T_{CCF}(J_0)$ |

[Figure 1: Communication Time Accounting. The sender prepares the message ($T_{buffer}$) and transmits it ($T_{transit}$), both charged to the sender; the receiver receives the message ($T_{buffer}$), charged to the receiver.]

GVT re-computation is performed whenever a predefined number of events have been processed. The number of events executed between two GVT computations is denoted by $\widehat{GVT}$.

3.1 Causality Error Characterization

Suppose an event message M is generated at the p-th state of the sending LP, and processed at the q-th state of the receiving LP. Let the timestamp of M be $LVT_{p,send}$, and the LVT of the target LP be $LVT_{q,recv}$. We want to compute $\Pr(LVT_{p,send} < LVT_{q,recv})$, which is the probability that M is out of sequence when it is processed by the receiving LP. Let X and Y be random variables for $LVT_{p,send}$ and $LVT_{q,recv}$ respectively. Similarly, we can model X by the statistical distributions $\exp(\lambda)$ and $\mathrm{gamma}(\delta, p-1)$, and Y by $\exp(\lambda)$ and $\mathrm{gamma}(\delta, q-1)$. The probability density functions, denoted by $g_1(x)$ and $g_2(y)$ respectively, are given as follows:

$$g_1(x) = \begin{cases} \lambda\, T^{\,p-1} \left[ e^{-\lambda x} - e^{-\delta x} \sum_{k=0}^{p-2} \frac{(\Delta x)^k}{k!} \right] & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

$$g_2(y) = \begin{cases} \lambda\, T^{\,q-1} \left[ e^{-\lambda y} - e^{-\delta y} \sum_{l=0}^{q-2} \frac{(\Delta y)^l}{l!} \right] & \text{if } y \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Let $g_{X,Y}$ be the joint density function of X and Y. We have

$$\Pr(LVT_{p,send} < LVT_{q,recv}) = \int_0^{\infty} \int_0^{y} g_{X,Y}(x, y)\, dx\, dy.$$

Assume that $LVT_{p,send}$ and $LVT_{q,recv}$ are also IID. We can replace $g_{X,Y}(x, y)$ by $g_1(x) \times g_2(y)$. The closed-form evaluation of this integral is derived in [20]; it is expressed in terms of powers of $T$, the partial sums $G_p$ and $G_q$, and a triple summation over the indices $k$, $i$ and $l$, where

$$G_n(x) = \sum_{i=0}^{n-2} x^i = \frac{1 - x^{\,n-1}}{1 - x}.$$

For ease of discussion we let $LP_0$ send an event message M to $LP_1$ (see figure 2). Let the index of $LP_0$ be $I_0$ when M is generated, and the index of $LP_1$ be $J_0$ when M is executed. We observe that M will become a straggler in $LP_1$ if the timestamp of M (denoted by $LVT_{I_0,LP_0}$) is less than the LVT of the receiving LP at state $J_0$ (denoted by $LVT_{J_0,LP_1}$). As the progress of both LPs is bounded by a GVT window, we have $|I_0 - J_0| < \widehat{GVT}$. Due to the asynchronous event processing of each LP, $I_0$ can be of any value within the GVT window $[a, (a + \widehat{GVT} - 1)]$, where a is the lower bound (in terms of event number) of the window.

Consider the homogeneity assumption imposed on all LPs and all PEs. If $LP_1$ processes M at state $J_0$, the message is more likely (in terms of probability) to have been generated when $LP_0$ was at state $J_0 - c$, where c is the number of events processed by an LP during the communication delay. Within the GVT window the more likely state corresponds to the $\max(J_0 - c - a, 0)$-th event away from the lower bound. We therefore assign a normalized weight² ($f_N$) peaked at $\max(J_0 - c - a, 0)$ to the rollback probabilities [20]. In the following derivation we let $I_0 = i_0 + a$, and $J_0 = j_0 + a$. It follows that $0 \le i_0, j_0 \le \widehat{GVT} - 1$.

Let $rb(I_0, J_0) = \Pr(LVT_{I_0,LP_0} < LVT_{J_0,LP_1})$, or $\Pr(LVT_{a+i_0,LP_0} < LVT_{J_0,LP_1})$ equivalently. The rollback probability (due to a straggler) at state $J_0$ is

$$RB^{+}(J_0) = \sum_{i_0=0}^{\widehat{GVT}-1} f_N^{\max(J_0-c-a,\,0)}(i_0) \times rb(I_0, J_0).$$
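The rollback probabilities above can be cross-checked by simulation. The sketch below estimates rb by Monte Carlo sampling instead of the closed form of [20], which is a substitution made for illustration only, and then forms the weighted sum over the GVT window. The spread `sigma` of the normal curve is not given in the text, so it is a hypothetical parameter here, as are all function names.

```python
import math
import random


def sample_lvt(n, lam, delta, rng):
    """Draw the LVT at state n: exp(lam) plus a gamma of shape n-1 and rate delta."""
    if n < 1:
        raise ValueError("state index must be >= 1")
    value = rng.expovariate(lam)
    if n > 1:
        # random.gammavariate takes (shape, scale); scale = 1 / rate.
        value += rng.gammavariate(n - 1, 1.0 / delta)
    return value


def estimate_rb(p, q, lam, delta, rng, trials=20000):
    """Monte Carlo estimate of Pr(LVT of sender at state p < LVT of receiver at state q)."""
    hits = sum(
        sample_lvt(p, lam, delta, rng) < sample_lvt(q, lam, delta, rng)
        for _ in range(trials)
    )
    return hits / trials


def normal_weights(size, peak, sigma):
    """Discrete points of a normal curve centred on `peak`, scaled to sum to 1."""
    raw = [math.exp(-0.5 * ((i - peak) / sigma) ** 2) for i in range(size)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_rollback_probability(j0, a, c, gvt, lam, delta, sigma, rng, trials=2000):
    """Weighted sum over i0 = 0 .. gvt-1 of weight(i0) * rb(a + i0, j0)."""
    peak = max(j0 - c - a, 0)
    weights = normal_weights(gvt, peak, sigma)
    return sum(
        wt * estimate_rb(a + i0, j0, lam, delta, rng, trials)
        for i0, wt in enumerate(weights)
    )
```

When sender and receiver are at the same state index, the two LVTs have the same distribution, so the estimate should be close to 0.5; a sender at a much earlier state than the receiver should give a value close to 1.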

The derivation of the halt probability $halt^{I_0}_{J_0}(d_0)$, which is the probability that the rollback will halt after $d_0$ events are undone (see figure 3), is based on the following observations:

• $LVT_{I_0,LP_0} > LVT_{J_0-d_0,LP_1}$. Otherwise the rollback will not halt after $d_0$ events are undone.
• $LVT_{I_0,LP_0} < LVT_{J_0-d_0+1,LP_1}$. Otherwise the number of events undone is less than $d_0$.

As a rollback cannot coast below the lower bound of the GVT window, we impose the total probability constraint on $halt^{I_0}_{J_0}(d_0)$. In general, we ensure the condition $\sum_{d_k=1}^{J_k-a} halt^{I_k}_{J_k}(d_k) = 1$ for $k \ge 0$. This is done by normalizing the halt probability with respect to its sum as follows:

$$halt^{I_k}_{J_k}(d_k) = \frac{(1 - rb(I_k, J_k - d_k)) \times rb(I_k, J_k - d_k + 1)}{R\_SUM^{I_k}_{J_k}}$$

where

$$R\_SUM^{I_k}_{J_k} = \sum_{d_k=1}^{J_k-a} (1 - rb(I_k, J_k - d_k)) \times rb(I_k, J_k - d_k + 1).$$

² The sum of the normalized weights is equal to 1. The weights are the discrete points on a continuous normal curve.

[Figure 2: Causality Error. Two panels show $LP_0$ sending message M to $LP_1$ across the communication delay, with the state indices 0, 1, 2, ..., a, ..., $I_0$ (on $LP_0$) and $J_0$ (on $LP_1$), ..., $a + \widehat{GVT} - 1$, ..., $N_p$ marked on each LP's time scale and the GVT window indicated; one panel shows a timely arrival and the other an untimely arrival.]

[Figure 3: Rollback Events due to Straggler. A time scale of state indices ..., a, ..., $J_0 - d_0$, $J_0 - d_0 + 1$, ..., $J_0$, ..., $a + \widehat{GVT} - 1$, ..., $N_p$ showing the $d_0$ events undone.]
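The normalization of the halt probability can be sketched as follows. The function takes any callable `rb(i, j)`; the logistic `toy_rb` used here is a hypothetical stand-in chosen only so that the example runs, not the rollback probability derived in the paper.

```python
import math


def toy_rb(i: int, j: int) -> float:
    """Hypothetical stand-in for rb(i, j): grows as the receiver index j
    moves ahead of the sender index i."""
    return 1.0 / (1.0 + math.exp(-(j - i) / 2.0))


def halt_probabilities(rb, i_k: int, j_k: int, a: int) -> dict:
    """Return {d: halt probability} for d = 1 .. j_k - a, normalised to sum to 1."""
    raw = {
        d: (1.0 - rb(i_k, j_k - d)) * rb(i_k, j_k - d + 1)
        for d in range(1, j_k - a + 1)
    }
    r_sum = sum(raw.values())  # the normalising sum over all admissible d
    if r_sum == 0.0:
        raise ValueError("all unnormalised halt probabilities are zero")
    return {d: value / r_sum for d, value in raw.items()}
```

Calling `halt_probabilities(toy_rb, 10, 20, 0)` returns twenty values that sum to 1, with most of the mass on rollback distances that bring the receiver back to roughly the sender's index.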

3.2 The Checkpointing Cost Model

The proposed cost model is parameterized by two categories of wall-clock time, namely the coast forward effort ($T_{CCF}$) and the state saving cost ($T_{state}$). Consider a scenario where the $J_0$-th event, $0 \le J_0 \le \widehat{GVT} - 1$, is executed in an LP, and subsequently the LP receives a straggler M after the $J_1$-th event is executed, where $J_0 + 1 \le J_1 \le \widehat{GVT} - 1$ (see figure 4). If M undoes (or rolls back) $J_1 - J_0$ events and the state vector at $J_0$ has been saved, that state will be restored and M can be executed immediately. Otherwise, the straggler will continue to undo the executed events until a saved state is found. Suppose the state restored corresponds to the $J_s$-th event, where $0 \le J_s \le J_0 - 1$. The TW simulator will have to redo (or re-execute) from the $(J_s + 1)$-th event to the $J_0$-th event before M can be executed³. Let $T_{CCF}(J_0)$ be the coast forward effort to re-execute from the $(J_s + 1)$-th event to the $J_0$-th event and $\overline{T}_{CCF}(J_0)$ be its expectation. In our checkpointing scheme, the system state is saved at index $J_0$ provided $\overline{T}_{CCF}(J_0) > T_{state}$.

[Figure 4: Coasting Forward after a State is Restored. A time line with the indices 0, $J_s$, $J_0$, $J_1$ and $\widehat{GVT} - 1$, on which saved and unsaved states are marked: the straggler is received after $J_1$, the events back to $J_0$ are undone, the ideal state to be restored (at $J_0$) is not available, the state at $J_s$ is restored, and the events from $J_s$ to $J_0$ are redone (coast forward).]

To compute the expected coast forward effort, we also have to compute $\Pr(J_0; J_1)$, which is the probability for a rollback to undo from the $J_1$-th event to the $(J_0 + 1)$-th event. This is derived based on the following observations:

• The straggler arrives immediately after the $J_1$-th event is executed, thereby causing the LP to roll back.
• The rollback halts after $J_1 - J_0$ events are undone provided the state at index $J_0$ has been saved.

Therefore, we have

$$\Pr(J_0; J_1) = RB^{+}(J_1) \times halt^{J_0}_{J_1}(J_1 - J_0).$$

However, if the state at index $J_0$ is not saved, the rollback will continue until a saved state is found. In this case a coast forward effort is required. Since the rollback can occur only when $J_1 > J_0$, we derive the expected coast forward effort based on the following summation:

$$\overline{T}_{CCF}(J_0) = \begin{cases} \sum_{J_1=J_0+1}^{\widehat{GVT}-1} \Pr(J_0; J_1) \times T_{CCF}(J_0) & \text{if } J_0 > 0 \\ 0 & \text{if } J_0 = 0 \end{cases}$$

where the value of the coast forward effort $T_{CCF}(J_0)$ is a multiple of $T_{event}$. Figure 5 shows the back-tracking algorithm used in the computation. Given an index $J_0$ whose state is not saved, the coast forward effort required is computed by examining the expected coast forward effort at successively lower indices until one of the expected values is greater than the state saving cost. Subsequently, the actual coast forward effort is computed as the product of the number of redone events and the event granularity in wall-clock time. Such a back-tracking algorithm implements the proposed checkpointing scheme where a state vector is saved provided the expected coast forward cost is larger than the state saving cost.

    if ($\overline{T}_{CCF}(J_0 - 1) > T_{state}$)  $T_{CCF}(J_0) = T_{event}$
    else if ($\overline{T}_{CCF}(J_0 - 2) > T_{state}$)  $T_{CCF}(J_0) = 2 \times T_{event}$
    else if ($\overline{T}_{CCF}(J_0 - 3) > T_{state}$)  $T_{CCF}(J_0) = 3 \times T_{event}$
      :
      :
    else if ($\overline{T}_{CCF}(1) > T_{state}$)  $T_{CCF}(J_0) = J_0 \times T_{event}$
    else  $T_{CCF}(J_0) = (J_0 + 1) \times T_{event}$

Figure 5: Back-Tracking Algorithm for Computing the Coast Forward Effort

³ This is also called the coast forward phase.

4 Performance Analysis

We implemented three checkpointing schemes, namely the conventional (or frequent) approach, the infrequent approach and the proposed cost model, on the Fujitsu AP3000 distributed-memory parallel computer using the simulation workbench SPaDES/C++ (Structured Parallel Discrete-Event Simulation) [21]. The modular design of SPaDES/C++ supports experimental research in synchronization protocols, and eases parallel simulator development without dealing with the intricacies of simulation synchronization and parallelism. To handle the spawning, communication, and synchronization of processes, the PVM (Parallel Virtual Machine) library [5] is adopted.

During the pre-simulation stage, four parameter values used in the cost model are obtained by taking measurements (see table 2) on the implementation platform, the Fujitsu AP3000 distributed-memory parallel computer. The values for the computation costs ($T_{event}$ and $T_{state}$) in table 2 are obtained by timing the execution of the respective code segments in the simulation program over 1000 iterations and taking their average. The buffer access time ($T_{buffer}$) is obtained by clocking the elapsed time of the PVM code segment for packing and unpacking the message and taking the average over 1000 iterations. As additional protocols are required by PVM to allocate memory space for the transmission and reception buffers, $T_{buffer}$ has a high value compared with the other measurements. The transmission time ($T_{transit}$) is also obtained by clocking the average elapsed time of two ping-pong programs over 1000 iterations. The communication delay in number of processed events is $c = \lceil (2 \times 2750 + 1290) / 1200 \rceil = 6$. The GVT interval chosen is $\widehat{GVT} = 50$ after a series of sample runs to get the least elapsed time. The rollback probability and the expected coast forward cost are computed in advance, and implemented by a lookup table in the simulation program. Performance figures presented below have been averaged over 50 replicated simulation runs.

Table 2: Granularity of Parameters

| parameter | time (μsec) |
|---|---|
| $T_{event}$ | 1200 |
| $T_{state}$ | 990 |
| $T_{buffer}$ | 2750 |
| $T_{transit}$ | 1290 |

4.1 Application Examples

Figure 6 consists of (i) MIN (feed-forward configuration) and (ii) Torus (feedback configuration). The exponentially distributed mean inter-arrival time used in the packet generator and the mean service time used in each switching element are 1/100 second. The 4 × 4 torus consists of 16 nodes, each with the same mean service time of 1/60 second. The routing on the torus network is uniformly distributed over the four directions.

[Figure 6: Application Examples. (a) An 8 × 8 Omega MIN with packet generators feeding switching elements arranged in stages 2, 1 and 0. (b) A 4 × 4 torus network.]
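The lookup table mentioned in Section 4, built from the saving rule of Section 3.2 and the back-tracking idea of Figure 5, might be precomputed along the following lines. This is a sketch under stated assumptions rather than the authors' implementation: it counts redone events as the distance from the restored index to the current index, as defined in the text of Section 3.2; it assumes index 0 (the lower bound of the window) can always be restored; and it takes the rollback probability as an arbitrary callable. The geometric `toy_pr` is a hypothetical stand-in. The boundary branches of Figure 5 may count events differently.

```python
def toy_pr(j0: int, j1: int) -> float:
    """Hypothetical stand-in for Pr(J0; J1): decays as J1 moves away from J0."""
    return 0.05 * 0.5 ** (j1 - j0)


def build_checkpoint_table(pr, gvt: int, t_event: float, t_state: float):
    """Return (expected, save) for indices 0 .. gvt-1.

    expected[j] is the expected coast forward effort if the state at j is
    not saved; save[j] is True when that expectation exceeds t_state.
    """
    expected = [0.0] * gvt   # expected[0] stays 0, matching the J0 = 0 case
    save = [False] * gvt
    for j0 in range(1, gvt):
        # Back-track to the nearest lower index whose state would be saved;
        # fall back to index 0 when none is found (assumption).
        js = next((j for j in range(j0 - 1, 0, -1) if save[j]), 0)
        t_ccf = (j0 - js) * t_event                        # redo events js+1 .. j0
        rollback_mass = sum(pr(j0, j1) for j1 in range(j0 + 1, gvt))
        expected[j0] = rollback_mass * t_ccf
        save[j0] = expected[j0] > t_state
    return expected, save
```

At runtime the simulator would then only need to index `save` with the current position in the GVT window. With the values of table 2, a window of 50 events and the toy probability above, the table marks indices 17 and 34 for saving.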

4.2 Checkpointing and Coasting Forward Overheads

Let $\frac{EvCF}{RB}$ be the average number of events coasted forward for a rollback (total no. of coasted forward events / total no. of rollback occurrences), hit ratio be the percentage of rollback occurrences where coasting forward is not needed ($no_r$ / total no. of rollback occurrences, where $no_r$ is the number of rollback occurrences where the restored state corresponds to index $J_0$; see figure 4), and $\frac{State}{EvExe}$ be the percentage of the number of states saved with respect to the number of events executed (total no. of states saved / total no. of executed events). Tables 3 and 4 compare the effectiveness of the cost model against two checkpointing schemes.

Table 3: Comparison of Overheads - 8 × 8 MIN

| scheme | EvCF/RB | hit ratio | State/EvExe |
|---|---|---|---|
| freq. (w = 1) | 0 | 100% | 100% |
| infreq. (w = 20) | 6.5 | 7.3% | 5% |
| prob. cost-based | 3.4 | 87.5% | 16% |

Table 4: Comparison of Overheads - 4 × 4 Torus

| scheme | EvCF/RB | hit ratio | State/EvExe |
|---|---|---|---|
| freq. (w = 1) | 0 | 100% | 100% |
| infreq. (w = 20) | 6.9 | 7.1% | 5% |
| prob. cost-based | 3.6 | 83.4% | 19% |

As observed, the conventional scheme (w = 1) has a 100% hit ratio and coasting forward is not needed because the state vector is saved whenever an event is executed. The overheads incurred by the infrequent approach vary with the size of the checkpointing interval (w), and we present the best experimental result, obtained when w = 20. On average, the number of states saved for the infrequent scheme is inversely proportional to w, and the number of coasted forward events is proportional to the interval. The probabilistic cost approach outperforms the infrequent approach in the number of coasted forward events and in hit ratio, but it saves more states than the infrequent scheme. The aggregate effect of these factors on the simulation elapsed time is analyzed in the next section.
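The three overhead measures defined above are simple ratios of run counters. A minimal sketch follows; the field names are hypothetical, and the numbers in the usage note are made up for illustration rather than taken from the experiments.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    coasted_events: int       # total no. of coasted forward events
    rollbacks: int            # total no. of rollback occurrences
    rollbacks_no_coast: int   # rollbacks whose restored state needs no coasting forward
    states_saved: int         # total no. of states saved
    events_executed: int      # total no. of executed events


def overhead_metrics(c: RunCounters) -> dict:
    """Compute the three overhead ratios as fractions (multiply by 100 for percent)."""
    if c.rollbacks <= 0 or c.events_executed <= 0:
        raise ValueError("need at least one rollback and one executed event")
    return {
        "EvCF/RB": c.coasted_events / c.rollbacks,
        "hit_ratio": c.rollbacks_no_coast / c.rollbacks,
        "State/EvExe": c.states_saved / c.events_executed,
    }
```

For instance, saving one state every 20 events gives `State/EvExe` = 50 / 1000 = 0.05, which is the 5% reported for the infrequent scheme with w = 20.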

4.3 Elapsed Time

Figures 7 and 8 show that the elapsed time of the proposed cost model is better than that of the other two schemes. This effect is due to the rollback probability and cost-effectiveness considerations, which ensure that a state saving decision leads to a net gain in execution time.

[Figure 7: Elapsed Time of MIN Simulation. Elapsed time (sec), 0 to 1600, against simulation duration ($10^4$ seconds), 1 to 6, for the freq. (w = 1), infreq. (w = 20) and prob. cost-based schemes.]

[Figure 8: Elapsed Time of Torus Simulation. Elapsed time (sec), 0 to 2800, against simulation duration ($10^5$ seconds), 5 to 30, for the freq. (w = 1), infreq. (w = 20) and prob. cost-based schemes.]

Although the conventional scheme has a 100% hit ratio and does not incur any overhead to coast forward the simulator (see tables 3 and 4), its elapsed time does not outperform the other two schemes due to the huge amount of time incurred in saving the state vectors. On the other hand, the infrequent scheme has reduced the state saving overhead, but the saved states are statically selected without any consideration for their usefulness or vulnerability to rollback risk. As such, the infrequent scheme has to incur additional overhead to coast forward the simulator. In the infrequent scheme, the gain from not saving state vectors outweighs the loss in coasting forward the simulator, thus it yields a net gain in overall elapsed time compared with the conventional approach.

Out of the three checkpointing schemes, the probabilistic cost model has the best performance. Although the cost model incurs a larger state saving overhead than the infrequent scheme (see tables 3 and 4), its hit ratio is substantially higher, i.e., there is a higher probability of not incurring the coast forward overhead when a causality error occurs. Even when coasting forward is needed, the number of coasted events in the cost model is smaller than in the infrequent scheme. On average, the cost model reduces the elapsed time by 15% as compared to the conventional checkpointing scheme and 12% as compared to the infrequent scheme for the MIN simulation, and 20% and 42% respectively for the torus simulation.

5 Conclusions and Future Work

While the frequent checkpointing scheme incurs a substantial overhead in saving the system states, the infrequent approach introduces a coast forward risk in redoing the executed events. Thus, a cost comparison is necessary to decide whether saving a state gives the lower overhead. This paper proposes a probabilistic cost model that considers two factors, namely coast forward effort and state saving overhead, so that the checkpointing decision is cost effective. The proposed model considers a homogeneous system and derives the rollback probability due to the arrival of a straggler. A back-tracking algorithm is used to compute the coast forward effort, and the system state is saved only if the expected coast forward effort is larger than the state saving cost. As the rollback probability and the expected coast forward effort are computed in advance and implemented as a lookup table, the cost model does not incur a substantial overhead. Our implementation results, compared with two checkpointing schemes, show that the probabilistic cost model approach is effective in reducing the overall elapsed time in both feed-forward and feedback configurations. We are extending the cost model to cover heterogeneous simulations and platforms through different parameterizations of LVT advancement rates and communication delays respectively, and to consider the impact of cascading rollbacks on the coast forward effort.

Acknowledgement

This work is supported by a research grant, RP960715, from the Ministry of Education and PSA Corporation.

References

[1] K. M. Chandy and J. Misra, "Distributed Simulation: A Case Study in Design and Verification of Distributed Program", IEEE Trans. on Software Engineering, Vol. SE-5, No. 5, pp. 440-452, September 1979.
[2] S. R. Das and R. M. Fujimoto, "Adaptive Memory Management and Optimism Control in Time Warp", ACM Trans. on Modeling and Computer Simulation, Vol. 7, No. 2, pp. 239-271, April 1997.
[3] S. Franks, F. Gomes, B. Unger and J. Cleary, "State Saving for Interactive Optimistic Simulation", Proc. of 11th Workshop on Parallel and Distributed Simulation, pp. 72-79, June 1997.
[4] R. M. Fujimoto, "Parallel Discrete Event Simulation", Comm. of ACM, Vol. 33, No. 10, pp. 31-53, October 1990.
[5] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam, "PVM 3 User's Guide and Reference Manual", Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, May 1993.
[6] F. Gomes, S. Franks, B. Unger and J. Cleary, "Multiplexed State Saving for Bounded Rollback", Winter Simulation Conference, 1997.
[7] D. R. Jefferson, "Virtual Time", ACM Transactions on Programming Languages and Systems, Vol. 7, No. 3, pp. 404-425, July 1985.
[8] Y. B. Lin and E. D. Lazowska, "Reducing the State Saving Overhead for Time Warp Parallel Simulation", Technical Report 90-02-03, Dept. of Computer Science, University of Washington, Seattle, Washington, February 1990.
[9] Y. B. Lin and E. Lazowska, "Optimality Consideration for 'Time Warp' Parallel Simulation", Proc. of 1990 SCS Multiconference on Distributed Simulation, pp. 29-34, 1990.
[10] J. Misra and K. M. Chandy, "Asynchronous Distributed Simulation via a Sequence of Parallel Computations", Communication of the ACM, Vol. 24, No. 4, pp. 198-206, April 1981.
[11] D. Nicol and X. Liu, "The Dark Side of Risk", Proc. of 11th Workshop on Parallel and Distributed Simulation (PADS'97), Lockenhaus, Austria, IEEE Computer Society Press, pp. 188-195, June 10-13, 1997.
[12] B. R. Preiss, I. D. MacIntyre and W. M. Loucks, "On the Trade-Off between Time and Space in Optimistic Parallel Discrete-Event Simulation", Proc. of the SCS Multiconference on Distributed Simulation, Vol. 24, No. 3, pp. 33-42, January 1992.
[13] F. Quaglia and V. Cortellessa, "Rollback-Based Parallel Discrete Event Simulation by Using Hybrid State Saving", Proc. 9th European Simulation Symposium, pp. 275-279, October 1997.
[14] F. Quaglia, "Event History Based Sparse State Saving in Time Warp", Proc. of 12th Workshop on Parallel and Distributed Simulation (PADS'98), Alberta, Canada, IEEE Computer Press, pp. 72-79, May 26-29, 1998.
[15] F. Quaglia, "Combining Period and Probabilistic Checkpointing in Optimistic Simulation", Proc. of 13th Workshop on Parallel and Distributed Simulation (PADS'99), Georgia, USA, IEEE Computer Press, pp. 109-116, May 1-4, 1999.
[16] F. Quaglia, "A Cost Model for Selecting Checkpoint Positions in Time Warp Parallel Simulation", Technical Report 12-99, Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza", http://ftp.dis.uniroma1.it/pub/quaglia/t-rep-12-99.ps.
[17] R. Rönngren and R. Ayani, "Adaptive Checkpointing in Time Warp", Proc. of the 8th Workshop on Parallel and Distributed Simulation (PADS'94), ACM Vol. 24, No. 1, pp. 110-117, July 1994.
[18] S. Sköld and R. Rönngren, "Event Sensitive State Saving in Time Warp Parallel Discrete Event Simulation", Proc. of 1996 Winter Simulation Conference, December 1996.
[19] H. M. Soliman, "On the Selection of the State Saving Strategy in Time Warp Parallel Simulations", Trans. of The Society for Computer Simulation International, March 1999.
[20] S. C. Tay, "Parallel Simulation Algorithms and Performance Analysis", Ph.D. Thesis, National University of Singapore, Dept. of Information Systems and Computer Science, 1998.
[21] Y. M. Teo, S. C. Tay and K. T. Kong, "Structured Parallel Simulation Modeling and Programming", Proc. of the 31st Annual Simulation Symposium, Boston, Massachusetts, USA, IEEE Computer Society Press, pp. 135-142, April 5-9, 1998.
[22] D. West and K. Panesar, "Automatic Incremental State Saving", Proc. 10th Workshop on Parallel Simulation, pp. 78-85, May 1996.