An Asynchronous Protocol for Virtual Factory Simulation on Shared Memory Multiprocessor Systems

Boon-Ping Gan
[email protected]
Gintic Institute of Manufacturing Technology
71, Nanyang Drive, Singapore 639798

Stephen J. Turner
[email protected]
University of Exeter
Exeter, EX4 4PT, U.K.

Abstract

The development of parallel simulation technology is seen as an enabler for the implementation of the virtual factory concept, the integrated simulation of all the systems in a factory. One important parallel simulation protocol, the asynchronous deadlock avoidance algorithm proposed by Chandy, Misra, and Bryant, has usually been discussed in the context of distributed memory systems. Also, null messages have normally been associated with this approach for deadlock avoidance. This paper presents a new implementation of the CMB protocol designed for shared memory multiprocessor systems. We have successfully used this protocol, which we call the CMB-SMP protocol, to achieve useful speedups in a manufacturing simulation application, despite the fine granularity of event processing. The implementation eliminates the need for sending null messages, without causing deadlock in the simulation. Double buffering is also used to reduce the overhead of buffer locking. It is shown that the CMB-SMP protocol outperforms a synchronous super-step protocol in terms of the speedups achieved. The paper also discusses the cache behaviour of the CMB-SMP protocol implementation, since cache misses are very expensive with today's high clock speed processors.

Keywords: virtual factory simulation, wafer fabrication modeling, parallel discrete event simulation

1. Introduction

Parallelizing a virtual factory simulation [13], a plant-wide simulation including the modeling of manufacturing processes, business processes, and the communications network, is the major concern in this study¹. Such a simulation allows one to model and analyze the behaviour of the system by looking at the overall activities of the system; the simulation results obtained are then much more accurate and realistic. Due to the anticipated complexity of this detailed modeling, a parallel discrete event simulation (PDES) strategy is applied.
Our initial focus is on the electronics industry, in particular the wafer fabrication plant, without the modeling of business processes and the communications network. This strategy allows us first to exploit parallelism within each model before integrating them to form a single large virtual factory model.

When a system is broken down into logical processes (LPs) that can be simulated in parallel, these LPs must not violate the causality constraint: events must strictly be processed in timestamp order. Many variations of PDES protocols have been proposed in the literature to ensure that the simulation adheres to this constraint. In general, these protocols can be classified into two major classes: conservative

¹ This work is an ongoing collaborative project between the Gintic Institute of Manufacturing Technology and the School of Applied Science in Nanyang Technological University, Singapore.

[1,2,3,17,23] and optimistic [6,16]. The conservative protocols allow the simulation to proceed only up to a safe time that avoids any causality error. This is normally done by computing a time guarantee from the events that are received from the upstream LPs. On the other hand, the optimistic protocols allow the logical time to be advanced without regard to the timestamps of future events that the LP will receive. When the LP receives an event with an earlier timestamp than its current logical time, the LP rolls back to an earlier time and cancels whatever actions it has taken. In this way, the causality constraint is obeyed.

The conservative protocol is the focus of this paper. It can be further broken down into two major classes, namely the synchronous [17,23] and asynchronous [1,2,3] protocols. The synchronous protocol is a more constrained strategy in which LPs are only allowed to proceed in super-steps: every LP must wait for all the other LPs to finish their current super-step before a new super-step can be initiated. The asynchronous protocols do not have this constraint; an LP is allowed to proceed as long as events that can safely be processed exist in the LP's local event list. Both of these strategies have been implemented on a shared memory multiprocessor (SMP) system in our study. In fact, we are currently moving from the synchronous strategy to the asynchronous strategy; the motivations behind this are presented later in this paper. Throughout this paper, the asynchronous protocol that we are referring to is the deadlock avoidance protocol proposed by Chandy, Misra, and Bryant in [1,2], known as the CMB protocol hereafter.

The rest of the paper is organized as follows. Section 2 presents the motivations of the study. Section 3 gives a detailed description of the implementation of the CMB protocol on SMP systems; it also discusses the repeatability of simulation based on the proposed asynchronous protocol.
Section 4 presents the experimental results which compare the performance achieved by the synchronous and asynchronous protocols, both implemented on a SMP system. The cache behavior of the asynchronous protocol is also studied due to its importance in SMP systems. Section 5 compares our approach with related work in this area. Lastly, the conclusions and future work of the study are presented.

2. Motivations

It is often said that the computing power of today's single processor systems has reached its limit. Shared memory multiprocessor (SMP) systems have emerged as a promising technology to further enhance this saturating computing power. All major vendors, such as Sun, SGI, IBM, and HP, have product lines that offer SMP systems, and these systems are relatively affordable. Due to the trend of moving towards the SMP architecture, we are exploring possible techniques for implementing parallel discrete event simulation (PDES) on this platform. More specifically, we are implementing a parallel simulation application based on the Sematech [http://www.sematech.org] wafer fabrication model [18], using conservative simulation protocols. Sematech provides six sample data models that model real world wafer fabrication processes; these six data models are currently used to benchmark our simulator.

The synchronous super-step protocol (which we call super-step-SMP) was first implemented in the study [23,24]. It was found that this protocol imposes a rigid load balancing requirement in order to achieve good performance: each logical process (LP) must have approximately the same amount of computation within a super-step, so that no time is wasted in waiting for other LPs to reach the synchronization point. This shortcoming of the super-step-SMP protocol does not exist in the CMB protocol, in which LPs are allowed to proceed as long as events are safe to be processed; no time is wasted unnecessarily. Consequently, the CMB protocol outperforms the synchronous super-step protocol. This claim can be verified by referring to Table 1, which compares the best possible speedup for the two protocols on four processors, using the six Sematech data models. These numbers are obtained from the parallelism analyzers that we have implemented using the algorithms proposed in [26] and [12] respectively. The superior predicted performance of the CMB protocol urges us to incorporate it into our parallel simulation engine.

Model   Super-step   CMB
1       2.74         3.01
2       2.66         2.84
3       1.45         1.90
4       1.75         2.08
5       2.31         2.33
6       1.91         2.02

Table 1: Best Possible Speedup Achieved by Synchronous and Asynchronous Protocols

3. The Asynchronous Simulation Protocol

To provide background, this section first gives a brief overview of the asynchronous simulation protocol with a deadlock avoidance strategy, also called the CMB protocol. Then, an efficient implementation of the protocol on an SMP system is presented.

3.1 The CMB protocol on distributed memory systems

In a conservative protocol, either asynchronous or synchronous, how fast a logical process (LP) can advance its timestamp depends on the time bound imposed by its upstream LPs. This time bound is defined as the earliest timestamp of the next event that the LP will receive from its upstream LPs. Thus, the safe time bound, also called the safetime, that this LP can choose is the minimum among all the time bounds imposed by its upstream LPs. In this way, it is guaranteed that the LP will process events in non-decreasing timestamp order, and thus adhere to the causality constraint.

Let us consider an example with three LPs connected as shown in Figure 1. Suppose LP3 has two upstream LPs, LP1 and LP2, and the time bound that LP1 imposes on LP3 is 10, while the time bound that LP2 imposes on LP3 is 8. LP3 will take 8 as its safetime, since this is the minimum of 8 and 10. It will then process all events with timestamp less than 8 in its event list, since it knows that no event with timestamp less than 8 will arrive in the future.
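The safetime rule described above can be sketched in a few lines of C++. The `UpstreamLink` structure and the function name here are illustrative assumptions, not taken from the paper's implementation:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical upstream link: the upstream LP's current virtual time
// plus the lookahead of the link into this LP gives the time bound.
struct UpstreamLink {
    double upstream_virtual_time;
    double lookahead;
};

// Safetime is the minimum time bound imposed by the upstream LPs.
// An LP with no upstream LPs is never constrained (bound = infinity).
double compute_safetime(const std::vector<UpstreamLink>& upstream) {
    double safetime = std::numeric_limits<double>::infinity();
    for (const auto& link : upstream) {
        safetime = std::min(safetime,
                            link.upstream_virtual_time + link.lookahead);
    }
    return safetime;
}
```

With the Figure 1 example, bounds of 10 and 8 (zero lookahead for simplicity) yield a safetime of 8.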

The CMB protocol advances the virtual time of each LP following the method described in the above paragraph. In order to achieve a deadlock free simulation on a distributed memory system, the three conditions listed below must be satisfied:

(a) At least one link within any cycle connecting LPs must have non-zero lookahead
(b) Null messages must be sent (as and when necessary) from LPi to LPj whenever there is a connection from LPi to LPj
(c) Events must arrive in non-decreasing timestamp order

Figure 1: An example (LP1 imposes a time bound of 10 on LP3; LP2 imposes a time bound of 8)

Lookahead in condition (a) is defined as the minimum (virtual) time interval between event arrivals from a source LP to a destination LP. Suppose that an event ei is processed at virtual time T in LPi and generates an external event ej for LPj; the timestamp of event ej must then be at least T+LAij, where LAij is the lookahead value between LPi and LPj. In [2], Chandy and Misra have proven that it is sufficient for some link in any cycle of links among LPs to have non-zero LAij in order to achieve a deadlock free simulation.

Whenever an LP updates its local simulation time, the downstream LPs connected to this LP need to be informed, since a new time bound is now imposed on them. This new time bound will be greater than or equal to the old time bound, and thus events having timestamps less than this new time bound can be processed immediately. This process of informing downstream LPs is done by sending null messages. Without these null messages, some LPs that do not receive any event might not be able to progress. Thus, whenever an LP processes an event that advances its own simulation time, it transmits a null message to all its downstream LPs that do not receive any normal event from it (from the processing of the current event). In this way, it is guaranteed that all the LPs can progress without further delay once a better time bound is imposed on them. Other ways of transmitting null messages are also available; since they are not the focus of this paper, they will not be discussed further.

Without condition (c), the protocol cannot derive a safe time bound for each LP. This is because the time bound that an LP computes depends on the timestamps of the events or messages that it receives: these timestamps impose a lower bound on the LP. If events were allowed to arrive out of timestamp order, the previous statement would no longer be valid.
An LP cannot be sure that after it receives an event with timestamp T, it will not receive another event with timestamp less than T. The causality constraint will then be violated and the simulation is no longer valid.

3.2 The CMB protocol on shared memory systems

Our approach to implementing the CMB protocol on a shared memory multiprocessor system (SMP), known as CMB-SMP hereafter, allows us to relax two of the three conditions specified in section 3.1. First of all, there is no need to send null messages in an SMP system. Whenever an LP needs to compute a new time bound, it can directly look at its upstream LPs' local virtual times. This strategy is more like requesting null messages on demand, but at a negligible cost, since the "request" is just a memory reference. The LP also obtains the most up-to-date information, since there is no communication delay involved. Another advantage of this approach is that events can now arrive out of timestamp order at their destination LP. This is mainly due to the fact that computation of the time bound is no longer based on the events or messages that an LP receives; thus, the order in which events arrive is no longer important to the protocol in computing the safe time bound. With this, the only condition that the approach needs to satisfy is non-zero lookahead for at least one link within any cycle. Deadlock might occur if this condition is not satisfied.

One extra step that needs to be taken by this approach is to update the local virtual time of the LP to its safe time bound once all the events with timestamp less than this bound have been simulated. This is to ensure that the simulation progresses; without this update, LPs in a cycle might deadlock. This is easier to understand with an example. Suppose the update of the local virtual time of LPs is not performed in the example shown in Figure 2, and the lookahead values between LP1 and LP2, and between LP2 and LP1, are both 1. A deadlock would happen if the LPs' local virtual times were 1 and the smallest timestamp events in LP1 and LP2 were 3 and 4 respectively. When LP1 looks at its upstream LP, which is LP2 in this case, its safe time bound will be (1+1) = 2.
This time bound still cannot guarantee that the event with timestamp 3 can be processed, since it is still possible for LP1 to receive an event with timestamp earlier than 3. So LP1 will look at LP2 again and again, but the same thing keeps on happening. In fact, the same activity is happening in LP2.

Figure 2: An example with a cycle (LP1 and LP2 connected in a cycle, with LA1,2 = LA2,1 = 1)

If the virtual time of the LP is updated to its own safe time when all the events with timestamp less than this time bound have been processed, the scenario described above will not happen. Immediately after LP1 realizes that it does not have any events to process, it updates its virtual time to the new value, which is 2. Then, at some point when LP2 looks at the local virtual time of LP1, LP2 can update its own local virtual time to (2+1) = 3. Following this, LP1 will eventually see that LP2's virtual time is 3, and thus a new safe time of (3+1) = 4 is computed. The event with timestamp 3 can then be processed safely, and the deadlock is avoided. A proof of the proposed CMB-SMP protocol's correctness will be presented in section 3.3.
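The Figure 2 scenario can be played out in a small sketch (the function name, data layout, and round-robin iteration order are illustrative assumptions): each LP, finding no safe event, advances its virtual time to its safetime, and the bounds ratchet upwards until LP1's event at timestamp 3 becomes safe.

```cpp
#include <algorithm>

// Outcome of iterating the two-LP cycle: how many rounds until LP1's
// earliest event becomes safe, and the final virtual times.
struct Result { int rounds; double vt1, vt2; };

// e1, e2: earliest pending event timestamps in LP1 and LP2;
// la: the (symmetric) lookahead on both links. Virtual times start at 1,
// as in the Figure 2 example.
Result run_cycle(double e1, double e2, double la) {
    double vt[2] = {1.0, 1.0};
    const double next_event[2] = {e1, e2};
    int rounds = 0;
    // Loop while LP1's event is still not safe.
    while (next_event[0] >= vt[1] + la) {
        for (int i = 0; i < 2; ++i) {
            double safetime = vt[1 - i] + la;  // bound imposed by the other LP
            // No safe event to process, so advance the virtual time to the
            // safetime: the extra update step that prevents the deadlock.
            if (next_event[i] >= safetime) vt[i] = std::max(vt[i], safetime);
        }
        ++rounds;
    }
    return {rounds, vt[0], vt[1]};
}
```

With events at timestamps 3 and 4 and lookahead 1, LP1 advances to 2, LP2 then sees the bound (2+1) = 3, and LP1's next bound is (3+1) = 4, making the event at timestamp 3 safe.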

for each LPi do
    // 1) Swapping buffers and merging events
    for each inBuffji, where j is the set of upstream LPs, do
        lock inBuffLockji
        swap inBuffji with old_inBuffji
        unlock inBuffLockji
    endfor
    merge events from all inBuffji

    // 2) Safetime computation
    safetimei = min(virtual_timej + lookaheadj,i), where j is the set of upstream LPs

    // 3) Simulating up to safetime
    while (top_eventi.timestamp < safetimei) do
        etop = local_eventlisti.pop()
        virtual_timei = etop.timestamp
        generated_events = simulate(etop)

        // 4) Insertion of generated events
        for each eg in generated_events do
            if (external_event(eg))
                lock inBuffLockik        // where k is the receiving LP
                old_inBuffik.push(eg)
                unlock inBuffLockik
            else
                local_eventlisti.push(eg)
            endif
        endfor
    endwhile

    // 5) Update virtual time to safetime
    virtual_timei = safetimei
endfor

Figure 3: The CMB-SMP protocol

Figure 3 shows the outline of the CMB-SMP protocol. All LPs that simulate the real world model run the same code as shown. In order to minimize the interference between the sending and receiving LPs, a double buffered approach is employed: all links between LPs are double buffered. The sending LPs always insert events into one buffer, while the receiving LPs always receive events from the other. Before an LP receives the events, it first swaps its input buffer with the output buffer of its sending LP. Since it is possible that the sending LP is accessing the output buffer while the receiving LP attempts to swap it, or the other way round, a mutex lock is associated with each buffer set. The double buffered approach also shortens the locking period, since the buffer set is only locked for the duration of swapping or event insertion. Thus, spin locks can be used to take advantage of this short locking duration. Immediately after the swapping, the events in the input buffer are merged into the local event list of the receiving LP (without posing any interference to the LP's upstream LPs). All this is done in section 1) of Figure 3.

Section 2) of Figure 3 is self-explanatory. It is mainly the computation of the LP's safe time bound, also called the safetime hereafter. The LP can directly look at its upstream LPs' local virtual times without any locking; locking is unnecessary provided that the reading and writing of a simulation time is atomic. The LP can then compute the time guaranteed by adding the local virtual time of the upstream LP and the associated lookahead value. The minimum of all these values will be the new time guaranteed for the LP.

Following the safetime computation, the LP simulates up to this safetime, excluding events whose timestamps fall exactly at the safetime. These events are not simulated to ensure repeatability of the simulation, as discussed in a later section. New events are generated when an LP simulates an event; these are either external or internal events. Section 4) of Figure 3 handles the insertion of these events either into the local event list (internal) or the output buffers (external). Only insertion into the output buffers requires locking of the buffer. Lastly, section 5) shows that the virtual time of the LP is updated to the safetime so that the simulation is deadlock free.
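A minimal sketch of one such double-buffered link is shown below, using a `std::mutex` in place of a spin lock. The class and member names are illustrative, not the authors' actual data structures:

```cpp
#include <mutex>
#include <vector>

struct Event { double timestamp; };

// One directed link between a sending and a receiving LP. The sender
// always pushes into `inbox_`; the receiver swaps `inbox_` with its own
// private buffer under the lock, then merges events without locking.
class DoubleBufferedLink {
public:
    // Called by the sending LP (section 4 of the Figure 3 pseudocode):
    // the lock is held only for the duration of one insertion.
    void push(const Event& e) {
        std::lock_guard<std::mutex> g(lock_);
        inbox_.push_back(e);
    }

    // Called by the receiving LP (section 1): the lock is held only for
    // the duration of the swap, which is why a spin lock would also do.
    std::vector<Event> drain() {
        std::vector<Event> drained;
        {
            std::lock_guard<std::mutex> g(lock_);
            drained.swap(inbox_);
        }
        return drained;  // merged into the local event list, lock-free
    }

private:
    std::mutex lock_;
    std::vector<Event> inbox_;
};
```

The short critical sections (one push, or one pointer swap) are what make the double buffered approach cheap compared with locking a shared queue for the whole merge.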

3.3 Correctness of the algorithm

In this section, we prove that the proposed CMB-SMP protocol adheres to the causality constraint and is deadlock free.

To prove that causality is preserved, we must show that events are executed in the correct order within each LP. Each LP executes the loop shown in Figure 3, where each iteration consists of the five sections shown. Within each iteration of the loop, events with timestamp less than the safe time are popped from the event list and executed in non-decreasing timestamp order. Let safetimei be the safe time computed in a given iteration by LPi. By the definition of the simulation protocol,

    virtual_timej + lookaheadj,i ≥ safetimei    (1)

for all LPj which have a directed link to LPi, where virtual_timej is the virtual time of LPj at the moment LPi computed its safe time for the current iteration. It is sufficient to show that any event e which is merged into LPi's buffer in a later iteration has a timestamp not less than safetimei. Let e' be the event which generated e, and suppose e' is executed by LPj, an LP upstream of LPi. Now

    e'.timestamp ≥ virtual_timej    (2)

since otherwise event e would have been merged into LPi's event list in the current (or a previous) iteration. Then

    e.timestamp ≥ e'.timestamp + lookaheadj,i
    ⇒ e.timestamp ≥ virtual_timej + lookaheadj,i    ...from inequality (2)
    ⇒ e.timestamp ≥ safetimei                       ...from inequality (1)

Thus all events executed in a later iteration have a timestamp not less than safetimei (since arriving events will have a timestamp at least equal to this value).

To prove that deadlock cannot occur, we show that it is impossible for the LPs to reach a state where none of them can advance their virtual time. Suppose that such a state has been reached. Taking an arbitrary LP as the starting LP, we choose the upstream LPj which minimizes

virtual_timej + lookaheadj,i. From this new LP, we choose similarly from its upstream LPs. This is repeated until either we reach an LP with no upstream LPs or a cycle is found. In the first case, there must be an LP with a safe time of infinity, which cannot therefore be blocked. Otherwise we obtain a cycle of deadlocked LPs, say LP1, ..., LPn, where LPi+1 has a directed link to LPi, 1 ≤ i < n, and LP1 has a directed link to LPn. When an LP has no safe events to execute, it sets its virtual time to the latest safe time computed, so that

    virtual_timei = virtual_timei+1 + lookaheadi+1,i    for 1 ≤ i < n
    virtual_timen = virtual_time1 + lookahead1,n

Eventually,

    virtual_timen = virtual_time1 + lookahead1,n
                  = virtual_time2 + lookahead2,1 + lookahead1,n
                  ...
                  = virtual_timen + sum(lookaheadi+1,i) + lookahead1,n    where 1 ≤ i < n

Assuming at least one link in the cycle has non-zero lookahead, the LPs' virtual times can then be advanced (through propagation), and this contradicts the statement that LP1, ..., LPn are deadlocked. Hence deadlock cannot occur.

3.4 Repeatability of the CMB-SMP protocol

Repeatability of a simulation based on the same inputs and initial conditions is one of the important issues in parallel discrete event simulation (PDES) [8]. It is more difficult, but not impossible, to achieve in a PDES environment as compared to a sequential simulation. In general, whether a simulation is repeatable depends very much on how simultaneous events are resolved. As long as these simultaneous events can be resolved in a predictable or deterministic way, the simulation can be repeated when the same inputs and initial conditions are fed to the simulator. In a sequential simulation, it is straightforward to ensure repeatability, since the simulator has the global state information of the simulation to resolve simultaneous events in a deterministic way.
Some common ways of breaking the tie are resolving based on event type, on information associated with the event, and so on. These techniques can also be applied to PDES, but not in a straightforward fashion. This is mainly due to the fact that LPs in a PDES simulation do not have knowledge of the global state: an LP does not know what type of event it will receive in the future, nor at what simulation time. This introduces some nondeterminism into the order of simultaneous event execution. In order to ensure repeatability in the CMB or CMB-SMP protocol, the assumption of non-zero lookahead in at least one link of any cycle must hold. The LP can then make use of the safe time bound that the upstream LPs impose on it to achieve repeatable simulation. The basic idea is that an LP will only process events with timestamp less than the safe time bound. Events that fall at the safe time bound

will not be processed until the safe time is advanced. Thus, when an event is processed, the LP can be sure that all the events with the same timestamp as the currently processed one have already arrived. The tie between these simultaneous events can then be broken in a deterministic way, using the same techniques as a sequential simulation, but with only local state information. Repeatability of the simulation is then achieved. There are ways to ensure repeatable simulation when the non-zero lookahead assumption does not hold; since these are outside the scope of this paper, they will not be discussed here. Readers who are interested can refer to [8] for more information.
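As an illustration of deterministic tie-breaking (the fields used here, an event type and a creation-order id, are assumptions rather than the paper's actual scheme), the event ordering can be expressed as a comparator that every run applies identically:

```cpp
#include <tuple>

// Any information local to the event works for breaking ties, as long
// as it is assigned deterministically; here we assume a type code and
// a serial id given to the event when it is created.
struct Event {
    double timestamp;
    int    type;
    long   id;
};

// Orders events by timestamp, then type, then id, so two events with
// equal timestamps are always executed in the same relative order.
bool before(const Event& a, const Event& b) {
    return std::tie(a.timestamp, a.type, a.id)
         < std::tie(b.timestamp, b.type, b.id);
}
```

Because an LP only executes an event once all events with the same timestamp have arrived, this comparator yields the same execution order on every run.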

4. Experimental Results

In this section, the performance of the CMB-SMP protocol is compared with that of the super-step-SMP protocol. The simulation model used for the comparison is the manufacturing process component of a virtual factory. This model is briefly described before the experimental results are presented. Lastly, the cache behaviour of the CMB-SMP protocol is studied.

4.1 Simulation model

The simulation model used in this experiment is the manufacturing process of a wafer fabrication plant. The data model is obtained from Sematech's Modeling Data Standard (MDS) project [18], which aims "to develop a set of standards that will enable the seamless exchange, sharing, and re-use of data among modeling applications and Manufacturing Execution Systems (MES)". The data models are realistic examples from real world applications.

The MDS uses several files to define the manufacturing processes: the process flow, rework, tool set, operator set, and volume release files. Only the process flow file is described here, to facilitate the discussion of our simulation model; the definitions of the remaining files can be found in [18]. The process flow file defines the workflow of products. It contains information about the steps that wafer lots need to flow through; each step defines the machine set and operator set needed, the processing time incurred, and so on.

Our simulation model is built by modeling the machine sets as simulation objects and the wafer lots as events. The simulation objects are mapped to form logical processes (LPs). The mapping can either be 1-to-1 or multiple-to-1. When the latter mapping is used, a partitioning algorithm is applied, which attempts to group simulation objects such that a certain objective function is optimized. In our case, we use a static partitioning scheme, called the multifit-com strategy, to do the partitioning [9]. The objective function of our partitioning strategy is to minimize load imbalance and maximize lookahead values; this has been found to be the most effective way of achieving good performance [25]. Our simulation model does not consider everything that is defined in the Sematech data model. Some features, such as operators, reworks, and machine down time, are

omitted for the time being. Although these features are omitted, this does not reduce the complexity of the simulation model (in terms of the connectivity of simulation objects). The connectivity of simulation objects remains complex, since different steps in a process flow can share machines.
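The multifit-com strategy itself is not described in this paper. As an illustration of the load-balance half of its objective only, the sketch below groups simulation objects into a fixed number of LPs with a simple greedy (longest-processing-time-first) heuristic; this is a stand-in, not the actual multifit-com algorithm, and all names are hypothetical:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Greedily assigns each simulation object (given its estimated load)
// to an LP, always placing the heaviest unassigned object on the
// currently least-loaded LP. Returns the LP index for each object.
std::vector<int> greedy_partition(std::vector<double> loads, int num_lps) {
    // Object indices sorted by decreasing load.
    std::vector<int> order(loads.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return loads[a] > loads[b]; });

    std::vector<double> lp_load(num_lps, 0.0);
    std::vector<int> assignment(loads.size());
    for (int obj : order) {
        int lp = (int)(std::min_element(lp_load.begin(), lp_load.end())
                       - lp_load.begin());
        assignment[obj] = lp;
        lp_load[lp] += loads[obj];
    }
    return assignment;
}
```

A real scheme like multifit-com would also account for the lookahead between the resulting LPs, since the paper's objective combines load balance with lookahead maximization.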

4.2 Performance of CMB-SMP protocol versus super-step-SMP protocol

Our simulation program is implemented in C++. It uses the Active Thread library [19] as the parallel runtime support, since this package is efficient [21]. We also modified a memory management package [15] to provide memory allocation and deallocation routines that execute efficiently without causing any race conditions. The simulation program is compiled with the GNU g++ compiler (version 2.7.2.1). The timings of the experiments are obtained on a 4-CPU Sun Enterprise 3000 (250 MHz UltraSparc2). Since we have only 4 CPUs, we group the simulation objects to form 4 LPs, with each LP mapped to a single CPU.

Figure 4: Speedup comparison of CMB-SMP and super-step-SMP protocols

Figure 4 compares the speedup achieved by the CMB-SMP and the super-step-SMP protocols. As can be seen, the CMB-SMP protocol outperforms the super-step-SMP protocol for all data sets. This is consistent with the predicted speedups shown in Table 1. The main reason is that under the super-step-SMP protocol, some LPs might waste time waiting for other LPs to reach the barrier synchronization point before they can proceed. This time is better utilized in the CMB-SMP protocol, in the sense that the LPs do not have to wait for barrier synchronization at all: as long as an LP finds that it is safe to proceed, the events in the local event list are processed. Looking from another angle, the super-step-SMP protocol in fact has a more rigid load balancing requirement: it needs each super-step to be balanced in order to achieve good performance. The CMB-SMP protocol does not have this rigid requirement; it can perform well as long as the overall system load is balanced.

4.3 Cache behaviour

In general, today's computer systems usually come with two levels of cache, used to close the gap between processor speed and memory bandwidth. Since caches are expensive, the cache size is normally limited, which has prompted today's programmers to pay special attention to data access patterns when writing a program. A secondary cache miss rate of a mere 10% might cause a 50% degradation in program performance [20]. The cache behaviour of a parallel program is especially important in the SMP environment, since multiple processors can cache the same cache line, which can cause false sharing, a problem that never arises in sequential programs. Thus, it is important to study the cache behaviour of a parallel program to see if there is any performance bottleneck due to cache misses. This is especially true for a fine granularity problem such as ours: each event takes only approximately 100 microseconds or less to process, while a single external cache miss on our Sun Enterprise 3000 system can cost as much as 100 clock cycles [20]. Thus, it is worthwhile to look at the cache behaviour of the simulation.
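The false sharing problem mentioned above is commonly mitigated by padding or aligning each LP's frequently written state to a cache-line boundary. The sketch below assumes a 64-byte line (an assumption; the target machine's actual line size should be used), and the structure and field names are illustrative:

```cpp
#include <cstddef>

// Assumed cache line size; query the target hardware in real code.
constexpr std::size_t kCacheLine = 64;

// Each LP's hot data is aligned to its own cache line, so that one
// LP's writes (e.g. advancing virtual_time) do not invalidate the
// line holding another LP's state on a different processor.
struct alignas(kCacheLine) PerLpState {
    double virtual_time = 0.0;   // read by downstream LPs
    long   events_done  = 0;     // written only by the owning LP
};

// One cache-line-sized slot per LP (4 LPs, as in the experiments).
PerLpState lp_state[4];

static_assert(sizeof(PerLpState) % kCacheLine == 0,
              "each LP's state occupies whole cache lines");
```

Without the `alignas`, several LPs' counters could share one line, and every update by one processor would force the others to reload it.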

Figure 5: Cache miss comparison of CMB-SMP and super-step-SMP protocols

[Pie chart: event execution and safetime computation account for 48% and 43% of cache misses, event merging for 9%]
Figure 6: Breakdown of cache misses for data set 4

Only external cache miss numbers are collected in this experiment, since an external cache miss is approximately 10 times more expensive than an internal one. Figure 5 compares the cache miss percentages of the simulation based on the CMB-SMP and the super-step-SMP protocols. Even though the CMB-SMP protocol outperforms the super-step-SMP protocol in terms of speedup achieved, it exhibits worse cache behaviour. This implies that the performance of the CMB-SMP protocol can be improved further if the cache misses can be reduced.

In order to reduce the cache misses of the CMB-SMP protocol, a measurement of the contributing factors needs to be done first. Basically, the contributing factors to cache misses are event execution, merging of events into the local event list, and the safetime computation (looking at upstream LPs' virtual times). Figure 6 shows this breakdown, and it is clear that event execution and safetime computation are the main contributors to the cache misses. In fact, this is true for the rest of the data sets as well.

The observations from Figure 6 have helped us to identify the portions of the simulator that we should optimize to improve the cache behaviour. For optimization related to the protocol, we could attempt to reduce the cache misses of the safetime computation in the CMB-SMP protocol; this could push the performance of the CMB-SMP protocol even further. Another area we could attempt to improve is the cache miss contribution of event execution; this improvement would shorten the execution time of either the CMB-SMP or the super-step-SMP protocol based simulator.

5. Related Research

This section gives a brief overview of related work on the implementation of conservative protocols on shared memory architectures and compares these approaches with our own.

An early paper by Reed, Malony and McCredie [4] describes experiments on a shared memory architecture using both deadlock avoidance (null message) and deadlock detection and recovery algorithms. Their deadlock avoidance approach is a straightforward implementation of the algorithm described in [2], in which message based communication is implemented via shared access to the message queues. However, their performance results for various queueing network simulation applications are disappointing, except in a few specialized cases such as feed-forward networks.

Other authors have proposed shared memory implementations of synchronous conservative algorithms. These approaches differ mainly in the method used to determine which events are safe to process. Ayani and Rajaei [5,10] present a three-phase algorithm called conservative time windows, in which each cycle of the simulation consists of three phases with a barrier synchronization at the end of each phase. Performance results on an SMP architecture show that the algorithm can perform poorly for heterogeneous applications.

Konas and Yew [11] discuss the overheads of synchronous algorithms. These arise from the global synchronization, the difficulties of load balancing, and the complexity of the safe time computation. A more aggressive safe time computation might expose more parallelism, but this must be offset against the increased overheads.

They describe a synchronous algorithm which has a single barrier synchronization per cycle and which attempts to expose more parallelism only when necessary. Cleary and Tsai [14] describe a shared memory implementation of the deadlock avoidance algorithm. Their approach retains the concept of links between LPs and assumes that the link clocks are monotonic, an assumption that does not hold for our algorithm. Each LP updates the clock value of all outgoing links after processing each message, which corresponds to the sending of null messages; the link clock value is read directly by the receiving LP. Good performance is reported for ATM simulation applications, even with small event granularity. Chen and Bagrodia [22] describe the use of both asynchronous and synchronous algorithms for circuit simulation. They discuss various optimizations for shared memory architectures, including a mechanism in which a time guarantee (null message) is "piggybacked" where possible onto a real message. If an LP has not sent real messages on all of its outgoing links, the link clock values are updated after processing all safe events. Their approach aims to reduce the overhead of polling the link clock values, but means that an LP is not necessarily using the most up-to-date values when calculating its safe time. In contrast, our approach uses the upstream LP's virtual time, which is incremented as each event is executed.
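The contrast between the two shared-memory schemes discussed above can be sketched as follows (a hypothetical illustration, not code from either paper): in the link-clock scheme, the sender writes one clock per outgoing link after each message, the shared-memory analogue of a null message, whereas in our scheme the sender maintains a single virtual-time variable that downstream LPs read directly.

```python
class LinkClockSender:
    """Link-clock scheme: one clock per outgoing link (cf. [14, 22])."""
    def __init__(self, out_links):
        self.link_clock = {link: 0.0 for link in out_links}

    def after_message(self, now, lookahead):
        # The shared-memory equivalent of a null message:
        # update every outgoing link clock to the time guarantee.
        for link in self.link_clock:
            self.link_clock[link] = now + lookahead


class VirtualTimeSender:
    """Our scheme: a single shared virtual-time variable per LP,
    read directly by downstream LPs when computing their safetime."""
    def __init__(self):
        self.virtual_time = 0.0

    def after_event(self, now):
        # Incremented as each event is executed.
        self.virtual_time = now
```

The link-clock scheme costs the sender one write per outgoing link per message, while the single virtual-time variable costs one write per event regardless of fan-out, but concentrates all downstream reads on one shared location.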

6. Conclusions and Future Work
In general, it is difficult, though not impossible, to achieve good speedup for fine granularity problems. Nevertheless, we have achieved a respectable speedup (on 4 processors) with our fine granularity manufacturing process model. The CMB-SMP protocol achieves speedups that the super-step-SMP protocol is incapable of reaching. Another interesting point is that even though the CMB-SMP protocol exhibits worse cache behaviour, it still performs better than the super-step-SMP protocol. This implies that the performance of the CMB-SMP protocol can be pushed still further, and part of our future work will concentrate on improving its cache behaviour. The speedup numbers that we have collected so far are only for a 4-processor system. It will be interesting to study the scalability of the CMB-SMP protocol and compare it against that of the super-step-SMP protocol. One problem that we might face in such a study is that our current model may be too small to give good scalability; thus, we intend to enlarge the model by incorporating the business processes and communication network of the virtual factory simulation. The CMB-SMP protocol has also been proven correct under the assumption that the lookahead is non-zero for some link within any cycle. We are currently exploring the possibility of relaxing this assumption to allow zero lookahead in virtually any part of the system. This might introduce deadlock into the simulation, which would mean that a deadlock detection scheme needs to be developed. Lastly, the only partitioning algorithm attempted in this study is the multifit-com algorithm. This leaves room for further investigation of other

partitioning algorithms, which might give better partitions. A dynamic load-balancing scheme might also be one possibility in achieving even better performance.
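The correctness assumption discussed above, that no cycle may consist entirely of zero-lookahead links, is equivalent to requiring that the subgraph formed by the zero-lookahead links be acyclic, which could be checked once when the model is built. A hypothetical sketch (the paper does not describe such a checker):

```python
def has_zero_lookahead_cycle(links):
    """links: iterable of (src, dst, lookahead) tuples describing the
    LP graph. Returns True if some cycle has zero lookahead on every
    one of its links, i.e. the deadlock-freedom assumption is violated."""
    # Build the subgraph containing only the zero-lookahead links.
    adj = {}
    for src, dst, lookahead in links:
        if lookahead == 0:
            adj.setdefault(src, []).append(dst)

    # Standard depth-first search cycle detection on that subgraph.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def dfs(u):
        colour[u] = GREY
        for v in adj.get(u, []):
            c = colour.get(v, WHITE)
            if c == GREY or (c == WHITE and dfs(v)):
                return True  # back edge found: a zero-lookahead cycle
        colour[u] = BLACK
        return False

    return any(colour.get(u, WHITE) == WHITE and dfs(u) for u in adj)

# A two-LP cycle with positive lookahead on one link satisfies the
# assumption; the same cycle with zero lookahead everywhere does not:
assert not has_zero_lookahead_cycle([("A", "B", 0), ("B", "A", 3)])
assert has_zero_lookahead_cycle([("A", "B", 0), ("B", "A", 0)])
```

If such a check fails, the zero-lookahead relaxation would require the deadlock detection scheme mentioned above rather than pure deadlock avoidance.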

Acknowledgments
This research is supported by the National Science and Technology Board, Singapore, under the project Parallel And Distributed Simulation of Virtual Factory Implementation. It is a collaborative project between the Gintic Institute of Manufacturing Technology, Singapore, and the School of Applied Science at Nanyang Technological University, Singapore. The project is currently hosted at the Centre for Advanced Information Systems at Nanyang Technological University, Singapore. The authors would like to acknowledge the other members of the project: Chu-Cheow Lim, Yoke-Hean Low, Sanjay Jain, Wentong Cai, Shell-Ying Huang, and Wen-Jing Hsu.

References
1. R.E. Bryant, "Simulation of Packet Communications Architecture Computer Systems", MIT-LCS-TR-188, Massachusetts Institute of Technology, 1977
2. K.M. Chandy and J. Misra, "Distributed Simulation: A Case Study in Design and Verification of Distributed Programs", IEEE Trans. on Software Engineering, vol. SE-5, no. 5, pp. 440-452, Sep 1979
3. K.M. Chandy and J. Misra, "Asynchronous Distributed Simulation via a Sequence of Parallel Computations", Communications of the ACM, vol. 24, pp. 198-205, Nov 1981
4. D.A. Reed, A.D. Malony, and B.D. McCredie, "Parallel Discrete Event Simulation Using Shared Memory", IEEE Trans. on Software Engineering, vol. 14, no. 4, pp. 541-553, 1988
5. R. Ayani, "Parallel Discrete Event Simulation on Shared Memory Multiprocessors", Int'l Journal of Computer Simulation, vol. 1, pp. 111-131, 1989
6. R.M. Fujimoto, "Optimistic Approaches to Parallel Discrete Event Simulation", Trans. Soc. Comput. Simulation, vol. 7, pp. 153-191, Jun 1990
7. R.M. Fujimoto, "Parallel Discrete Event Simulation", Communications of the ACM, vol. 33, no. 10, pp. 31-53, Oct 1990
8. H. Mehl, "A Deterministic Tie-Breaking Scheme for Sequential and Distributed Simulation", Proceedings of the Workshop on Parallel and Distributed Simulation, pp. 199-200, 1992
9. C.M. Woodside and G.G. Monforton, "Fast Allocation of Processes in Distributed and Parallel Systems", IEEE Trans. on Parallel and Distributed Systems, vol. 4, no. 2, 1993
10. R. Ayani and H. Rajaei, "Parallel Simulation Based on Conservative Time Windows: A Performance Study", Concurrency: Practice and Experience, vol. 6, no. 2, pp. 119-142, 1994
11. P. Konas and P.-C. Yew, "Improved Parallel Architectural Simulations on Shared-Memory Multiprocessors", Proceedings of the 8th Workshop on Parallel and Distributed Simulation (PADS'94), pp. 156-159, 1994
12. Y.C. Wong, S.Y. Hwang, and Jason Y.B. Lin, "A Parallelism Analyzer for Conservative Parallel Simulation", IEEE Trans. on Parallel and Distributed Systems, vol. 6, no. 6, pp. 628-638, Jun 1995
13. S. Jain, "Virtual Factory Framework: A Key Enabler for Agile Manufacturing", Proceedings of the 1995 INRIA/IEEE Symposium on Emerging Technologies and Factory Automation, Paris, vol. 1, pp. 247-258, IEEE Computer Society Press, Los Alamitos, CA, Oct 1995
14. J.G. Cleary and J.-J. Tsai, "Conservative Parallel Simulation of ATM Networks", Proceedings of the 10th Workshop on Parallel and Distributed Simulation (PADS'96), pp. 30-38, 1996
15. K.P. Vo, "Vmalloc: A General and Efficient Memory Allocator", Software: Practice and Experience, vol. 26, no. 3, pp. 357-374, Mar 1996
16. S.C. Tay, Y.M. Teo, and S.T. Kong, "Speculative Parallel Simulation with an Adaptive Throttle Scheme", Proceedings of the 11th Workshop on Parallel and Distributed Simulation (PADS'97), Austria, pp. 116-123, Jun 1997
17. W. Cai, E. Letertre, and S.J. Turner, "Dag Consistent Parallel Simulation: A Predictable and Robust Conservative Algorithm", Proceedings of the 11th Workshop on Parallel and Distributed Simulation (PADS'97), Austria, pp. 178-181, Jun 1997
18. Sematech, "Modeling Data Standards, Version 1.0", Technical Report, Sematech, Inc., Austin, TX 78741, 1997
19. B. Weissman, "Active Threads Manual", Technical Report TR-97-037, Int'l Computer Science Institute, Berkeley, CA 94704, 1997
20. A. Cockcroft and R. Pettit, "Sun Performance and Tuning: SPARC and Solaris", Sun Press, 1997
21. C.C. Lim, Y.H. Low, W. Cai, S.J. Turner, W.J. Hsu, and S.Y. Huang, "An Empirical Comparison of Runtime Systems for Conservative Parallel Simulations", 2nd Workshop on Runtime Systems for Parallel Programming, Orlando, Florida, USA, Mar 1998
22. Y.A. Chen and R. Bagrodia, "Shared Memory Implementation of a Parallel Switch-Level Circuit Simulator", Proceedings of the 12th Workshop on Parallel and Distributed Simulation (PADS'98), pp. 134-141, May 1998
23. C.C. Lim, Y.H. Low, and S.J. Turner, "Relaxing Safetime Computation of a Conservative Simulation Algorithm", 1998 Int'l Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA'98), Las Vegas, Nevada, USA, 13-16 Jul 1998
24. Y.H. Low, C.C. Lim, B.P. Gan, S. Jain, W. Cai, W.J. Hsu, and S.Y. Huang, "Conservative Parallel Simulation of Manufacturing Systems", 8th Int'l Parallel Computing Workshop (PCW'98), Singapore, pp. 293-300, 7-8 Sep 1998
25. C.C. Lim, Y.H. Low, B.P. Gan, S.J. Turner, S. Jain, W. Cai, W.J. Hsu, and S.Y. Huang, "A Parallel Discrete Event Simulation of Wafer Fabrication Processes", 3rd High Performance Computing Asia Conference & Exhibition, Singapore, pp. 22-25, Sep 1998
26. C.C. Lim, Y.H. Low, and W. Cai, "A Parallelism Analyzer for a Conservative Super-step Simulation Protocol", Hawaii Int'l Conf. on System Sciences (HICSS-32), 5-8 Jan 1999