
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-35, NO. 11, NOVEMBER 1986

Roving Emulation as a Fault Detection Mechanism

M. A. BREUER, FELLOW, IEEE, AND ASAD A. ISMAEEL, MEMBER, IEEE

Abstract-In this paper we present a new built-in test methodology for detecting and locating faults in digital systems. The technique is called roving emulation and consists of an off-line, snapshot-type emulation or simulation of operating components in a system. Its primary application is in testing systems in the field where real-time fault detection is not required. The primary performance measure of this test scheme is taken to be the expected value of the error latency, i.e., the time required to detect a fault once it first occurs. The primary results of this paper deal with deriving equations for the error latency. We present both a probabilistic and a service-waiting model to analyze the expected error latency in a system tested via roving emulation. The effects of various controllable and uncontrollable system parameters on error latency are studied. Finally, the technique is applied to a system consisting of combinational logic modules, and numerical results are presented.

Index Terms-Built-in testing, digital systems testing, emulation, error latency, fault detection, roving emulation, simulation.

I. INTRODUCTION

DEFINITION (Webster): roving-capable of being shifted from place to place.

This paper deals with the self-testing of complex digital devices in the field where real-time error detection is not required. In general, the testing of a system, especially one containing VLSI circuits, is a complex task requiring numerous different techniques. Parts of a system may be tested using stored patterns [1]. This usually requires the costly generation and fault simulation of test vectors [2]. Often extra hardware overhead is added to make test application and/or test generation easier. An example of this is the use of set/scan registers or level sensitive scan design (LSSD) [3]. Hardware can also be added to carry out syndrome testing [4] and/or signature analysis [5]-[8]. In general, the design of a system which can test itself is a complex task.

If error masking is not required, then duplication is a powerful means for detecting errors. If concurrent (instantaneous) error detection is not required, then duplication can be carried out in the background, that is off-line, rather than on-line. One means of achieving such an off-line form of duplication is introduced in this paper, and is called roving emulation. Error latency is defined to be the time interval between the occurrence of a fault and its detection [9]. We have selected the concept of error latency to be the primary criterion by which the performance of a system tested via roving emulation is to be measured. However, we have modified the definition to include only the time between the beginning of testing and detection.

In Section II we discuss what a roving emulator (RE) is, and how a system is tested via roving emulation. In Section III we define the important modeling parameters to be used. In Sections IV and V we develop a probabilistic and a service-waiting model, respectively, for measuring system performance, i.e., the expected error latency time. Some theoretical results are given in Section VI. Applications to combinational circuits are presented in Section VII.

Manuscript received November 17, 1983; revised February 21, 1986. This work was supported in part by the National Science Foundation under Grants MSC78-26153 and MCS-8203485, and by the United States Naval Electronic Systems Command under Contract N00039-80-C-0641. M. A. Breuer is with the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90089. A. A. Ismaeel is with the Department of Electrical and Computer Engineering, University of Kuwait, Kuwait. IEEE Log Number 8610936.

II. THE ROVING EMULATOR AND ROVING EMULATION

Roving emulation is a new concept for testing digital systems. A roving emulator (RE) is essentially a hardware device which, under software and/or firmware control, can either emulate or simulate the operation of another digital system. Testing is done via "software duplication" in an off-line, non-real-time manner. We conceive of the RE as consisting of a special-purpose machine (set of chips) whose sole purpose is to carry out emulation. (In this paper we use the concepts of emulation and simulation synonymously.) Clearly the RE could be easily designed using either special-purpose VLSI chips, standard components, or even general-purpose microprocessors. The architecture of a specific RE, referred to as an emulation engine, which can be implemented on a single PCB, is presented in [11].

Fig. 1 shows a general structure for a system being tested via an RE, and Fig. 2 shows a more specific structure. Here D1, D2, ..., Dn are the major devices of the system, such as VLSI chips or PC cards. S is the interconnection network between the Di's. The database F contains an emulation model of each of the Di's to be checked via emulation. Note that emulation is not an efficient tool to use for checking all types of devices, such as RAM's and other devices which are mainly memory units with little computational power.

The RE has the following capabilities:
1) interrupt the system {D1, D2, ..., Dn};
2) read the internal state of a device Di;
3) emulate the operation of Di;
4) observe and record all the input and output information for Di for a specified time period;
5) select the next device to be processed;
6) compare the results of observed output data and the data produced by the emulator.
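To make these capabilities concrete, the following Python sketch models one testing period of roving emulation for a single device. The device model, the interface names, and the toy next-state/output function are illustrative assumptions of ours, not part of the paper; they simply mirror capabilities 1)-6) above.

```python
import random

class ToyDevice:
    """Toy stand-in for a device D_i: a 4-bit function with an optional output fault.
    (Illustrative only; the paper does not prescribe a device model.)"""
    def __init__(self, faulty=False):
        self.faulty = faulty
        self.state = 0                       # internal state the RE can read out

    def step(self, vec):
        out = vec ^ self.state               # toy output function
        self.state = (self.state + out) & 0xF
        if self.faulty:
            out |= 0x1                       # fault: low output bit stuck at 1
        return out

def golden_model(state, inputs):
    """Fault-free emulation model of ToyDevice, as stored in the database F."""
    outputs = []
    for vec in inputs:
        out = vec ^ state
        state = (state + out) & 0xF
        outputs.append(out)
    return outputs, state

def roving_test(device, w):
    """One testing period: capture state, open a window of w vectors, emulate, compare."""
    s0 = device.state                                    # interrupt and read initial state
    inputs, observed = [], []
    for _ in range(w):                                   # record I/O pairs during the window
        vec = random.randrange(16)                       # operational data modeled as random
        inputs.append(vec)
        observed.append(device.step(vec))
    s_final = device.state                               # second interrupt: read final state
    expected, s_expected = golden_model(s0, inputs)      # background emulation by the RE
    return observed != expected or s_final != s_expected # mismatch => device declared faulty

random.seed(1)
print(roving_test(ToyDevice(faulty=False), w=32))   # expect False: no error detected
print(roving_test(ToyDevice(faulty=True),  w=32))   # expect True: fault detected in the window
```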


Clearly the architecture of the system as well as the Di's to be tested must be such so as to support the concept imposed by roving emulation. First, devices must be built so that their internal memory state can be read. This capability often exists, in part, in microprocessors where interrupts and/or subroutine jumps force systems to "save their state." This requirement can also be easily implemented using scan-out registers. Secondly, the emulator must be aware of what data a device outputs. In a bus architecture, as shown in Fig. 2, this can be achieved by either 1) having a unique control signal from each device to the emulator indicate when a device writes on the bus, or 2) having each device put a unique ID on the bus when it writes. If tristate drivers are used, then the required control signals already exist.

Fig. 1. General structure of a system tested via a roving emulator.
Fig. 2. A bus structure system tested via a roving emulator.

A digital system containing an RE is tested in the field as follows: at time t0 the system is interrupted and the internal state of Di is recorded by the RE. The system then returns to normal operation, during which time the RE monitors and stores input-output vector pairs associated with Di for a time duration consisting of wi clock cycles. This is called the test window. At the end of this period another interruption of system operation can occur, at which time the final state of Di is recorded. Again the system continues its normal operation while the RE emulates (or simulates) the operation of Di given its logical description, initial state, and input data, and compares the result of the emulation (output and final state) to the corresponding output vectors and final state recorded earlier. If the recorded and computed responses are the same, then no error is detected. If they are different, then Di is assumed to be faulty, assuming that the RE and the communication lines are fault free. In the case where no error is detected, the RE will proceed (rove) to another device and continue the test process until either an error is detected or the emulator is disabled. Testing via emulation can be carried out continuously or periodically.

The proposed method for testing has the following important characteristics:
1) can be applied to a large class of digital systems,
2) is easy to implement,
3) both detects and locates faulty devices,
4) puts a small time overhead on the system to be tested, since except for the interrupt times, testing is carried out as a background process,
5) puts a small hardware overhead on the devices to be tested,
6) eliminates the need for test generation, fault simulation, numerous BIT and BITE chip and PCB hardware test structures, and
7) eliminates the need for a fault model.

Note that if a fault, even an intermittent fault, causes either the output or final state of a device to be in error during the testing window, the fault will be detected. If the fault caused no output error, one or more incorrect state transitions, but finally a return to the correct state, then no error will be detected. However, one could argue that no real computational damage has really occurred. Note that the data used to test the device consists of normal operational data. In this paper we will model this data as if it were random data. We assume that the system to be tested has been thoroughly validated via simulation, at both the register and logic levels, and hence accurate simulation models to be processed by the RE exist.

III. PARAMETERS

The relevant parameters associated with the roving emulation testing methodology are separated into two classes. Class 1 consists of controllable parameters which can be changed by our testing approach. Class 2 are the uncontrollable parameters which are inherent to the system and cannot be changed, e.g., the total number of devices in a system. All parameters which have the units of time are measured in the number of clock cycles of the system being tested.

A. Controllable Parameters

Two important controllable parameters associated with Di are wi, the test window size (in clock cycles), and ki, the average time needed by the RE to simulate one input vector and obtain an output. A final parameter, which is a function of all devices, is the order in which they are tested. This order is determined by the scheduling policy. The simplest scheduling policy is called round-robin, where the devices are tested in the order D1, D2, ..., Dn, D1, D2, .... The value of wi is controllable because it is just a parameter in the operating system run on the RE. ki can be varied by building either faster or slower emulation engines [11].

B. Uncontrollable Parameters

The following are some parameters which are inherent to the system under test.

n: The total number of devices in the system.
mi: The average number of input-output vector pairs obtained while testing Di during the opening of a test window of duration wi.
qi: The probability of detecting a fault in Di due to the application of one input vector, given that Di is faulty.
MTBFi: The mean time between failure for Di, where λi = 1/MTBFi.


PFi: The probability that Di is faulty, given that there exists exactly one faulty device in the system.
PDi(mi): The probability of detecting a fault in Di due to the application of mi input vectors, given that Di is faulty.
I1: The initial interrupt period, in clock cycles, used primarily to obtain the initial state of Di.
I2: The final interrupt period, in clock cycles, used primarily to obtain the final state of Di.

The interrupt periods can be varied to some extent by modifying the design of Di so as to affect the time required to dump its internal state, e.g., by using multiple scan paths. To help clarify the units of time, assume that the clock rate of the system under test is 50 MHz. If a window of 0.0001 s is opened, then w = 5 x 10^3. If the device is idle during half of this time, then m = 2.5 x 10^3.

In our analysis we normally assume that the functional data being processed through an actual device in the field can be modeled as if the data were random with a known distribution. For example, one could assume that every possible input vector to a PLA is equally likely. If this assumption is not justified for a particular device, then the device can be operated under normal conditions and actual input data observed. From this information more realistic information on input data statistics can be obtained. Using this data, a fault simulator, and a small set of sample faults, the parameter qi can be estimated. The computation of PDi(mi) for both combinational and sequential circuits is discussed in detail in [12].
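As a sketch of how qi might be estimated from operational input statistics, a fault simulator, and a small set of sample faults, the fragment below runs a simple Monte Carlo over a toy 8-input combinational function with injected stuck-at output faults. The circuit, the fault list, and the uniform input distribution are illustrative assumptions of ours; in practice a real fault simulator and measured input statistics would be used.

```python
import random

def good(x):                        # toy 8-input, 4-output combinational function (assumed)
    return (x & 0x0F) ^ (x >> 4)

def faulty(x, stuck_bit):           # same function with one output bit stuck at 1
    return good(x) | (1 << stuck_bit)

def estimate_q(stuck_bit, samples=100_000, rng=random.Random(0)):
    """Fraction of random input vectors whose response differs from the good circuit."""
    hits = sum(good(x) != faulty(x, stuck_bit)
               for x in (rng.randrange(256) for _ in range(samples)))
    return hits / samples

for bit in range(4):
    print(f"output bit {bit} stuck-at-1: q ≈ {estimate_q(bit):.3f}")
```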

IV. PROBABILISTIC MODEL

In this section we will develop a probabilistic model to calculate the expected value of the error latency in a digital system using roving emulation as the testing vehicle. The following are the main assumptions made in this model: a) exactly one faulty device exists in the system, b) round-robin scheduling, and c) at time t0 we start by testing device 1. We will examine digital systems having identical as well as nonidentical device parameters.

For identical device parameters let T = I1 + w + I2 + km (note that the subscripts are deleted, since the device parameters are identical). T is called the testing period (see Fig. 3). The following four events occur during a testing period.
1) During I1, the system is interrupted and the initial state of a device is recorded.
2) The system returns to normal operation during which time the RE observes and records m input-output vectors for a time duration of w.
3) During I2, the system is again interrupted and the final state of the device is recorded.
4) The RE emulates the device under test and compares the results of the emulation to the corresponding output and final state recorded earlier. This event occurs for a time duration of km.

Let A be the total interrupt time in one testing period (T), i.e., A = I1 + I2.

Fig. 3. The testing period, T = I1 + w + I2 + km.

Let En be the expected value of the error latency for an n-device system. We first consider systems with two devices and then generalize the result to n devices. We let X be the discrete random variable which represents the error latency time. Let P(X = x) be the probability that the error latency is equal to x. Note that x can only assume the values T1, T1 + T2, 2T1 + T2, 2T1 + 2T2, 3T1 + 2T2, .... We have

P(X = T_1) = PF_1 PD_1
P(X = T_1 + T_2) = PF_2 PD_2
P(X = 2T_1 + T_2) = PF_1 PD_1 (1 - PD_1)
\ldots
P(X = iT_1 + iT_2) = PF_2 PD_2 (1 - PD_2)^{i-1}
P(X = (i+1)T_1 + iT_2) = PF_1 PD_1 (1 - PD_1)^{i}.

Thus, the expected value of the error latency is

E_2 = \sum_{i=0}^{\infty} \big[ ((i+1)T_1 + iT_2)\, P(X = (i+1)T_1 + iT_2) + (iT_1 + iT_2)\, P(X = iT_1 + iT_2) \big]
    = PF_1\left(\frac{T_1 + T_2}{PD_1} - T_2\right) + PF_2\,\frac{T_1 + T_2}{PD_2}.   (1)

For an n-device system

PF_i = \frac{1/\mathrm{MTBF}_i}{\sum_{j=1}^{n} 1/\mathrm{MTBF}_j}

where 1/\mathrm{MTBF}_i = \lambda_i is the failure rate of device i. Thus, for n = 2, PF_1 = \lambda_1/(\lambda_1 + \lambda_2) and PF_2 = \lambda_2/(\lambda_1 + \lambda_2). Substituting for the values of PF_1 and PF_2 in (1) gives us

E_2 = \frac{\lambda_1}{\lambda_1 + \lambda_2}\left[\frac{T_1 + T_2}{PD_1} - T_2\right] + \frac{\lambda_2}{\lambda_1 + \lambda_2}\left[\frac{T_1 + T_2}{PD_2}\right].

For systems with n devices, we obtain

E_n = \sum_{i=1}^{n} \frac{\lambda_i}{\Lambda}\left[\frac{\sum_{j=1}^{n} T_j}{PD_i} - \sum_{j=i+1}^{n} T_j\right]   (2)

where

PF_i = \frac{\lambda_i}{\Lambda}

and

\Lambda = \sum_{i=1}^{n} \lambda_i.

If the parameters of all devices are equal, then

E_n = T\left(\frac{n}{PD} + \frac{1 - n}{2}\right)   (3)

and

E_2 = \frac{2T}{PD} - \frac{T}{2}.   (4)
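The closed forms (2)-(4) are easy to evaluate numerically. The following sketch, using arbitrary illustrative parameters of our own choosing, implements (2) and checks that it collapses to (3) and (4) when all devices have identical parameters.

```python
def error_latency(T, lam, PD):
    """Expected error latency E_n from (2); devices listed in their round-robin order.
    T[i] = testing period, lam[i] = failure rate, PD[i] = per-period detection probability."""
    n = len(T)
    Lam = sum(lam)
    T_tot = sum(T)
    return sum((lam[i] / Lam) * (T_tot / PD[i] - sum(T[i + 1:])) for i in range(n))

# Identical devices: compare against (3), E_n = T*(n/PD + (1 - n)/2).
n, T, PD = 4, 1500.0, 0.2
assert abs(error_latency([T]*n, [1e-6]*n, [PD]*n) - T*(n/PD + (1 - n)/2)) < 1e-6

# Two identical devices: compare against (4), E_2 = 2T/PD - T/2.
assert abs(error_latency([T]*2, [1e-6]*2, [PD]*2) - (2*T/PD - T/2)) < 1e-6

# Nonidentical example (arbitrary numbers, for illustration only).
print(error_latency(T=[1000, 2000, 1500], lam=[2e-6, 1e-6, 4e-6], PD=[0.3, 0.1, 0.2]))
```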

V. A SERVICE-WAITING MODEL

In this section we will study the performance of the roving emulator using a service-waiting model. For a system consisting of n devices we will model each device by a queue and the roving emulator by a server. The faults are modeled by customers that arrive to the queues according to a known interarrival distribution. Our queues are unique in that customers (faults) do not line up in the queue, but rather act as one composite customer, i.e., a multiple fault. In this work, however, we will restrict our attention to at most one customer in a queue at a time. Fig. 4 shows a block diagram of the model.

We will model the process of detecting an error in Di by the completion of service for a customer in queue i. Starting from queue 1, the server will provide each queue with a quantum T of service time. If the queue being served contains a customer (fault) and detection occurs within the time interval allocated, then we say that the customer's service has been completed. If, on the other hand, no error is detected, either because no fault exists or more testing is needed, then the server moves to the next queue in line. The server will cycle through the queues using a round-robin schedule until an error is detected. A complete pass through queues 1 to n is called a testing cycle. The density function of the service time for each customer can easily be shown to be geometrically distributed.

We will first examine a system of queues having the same arrival rates (failure rates) and service parameters (PD). We will derive an expression for the average time a customer spends in the system, which is equal to the expected value of the error latency. Then we will consider queues with different arrival rates and different service parameters.

A. Queues with the Same Arrival Rate and Same Service Parameter

In practice, the time between failures in a complex system or a single component tends to obey only a few well-defined probability distributions. In our model we will assume an exponential distribution for the time of occurrence of failures, and a constant failure rate. Hence, the arrival distribution function is given by A(t) = 1 - e^{-\lambda t}. A(t) is merely the probability that the time between arrivals of two consecutive faults is less than or equal to t. The density function associated with A(t) is a(t) = \lambda e^{-\lambda t}.

The service time for Di is chosen independently from a geometric distribution with parameter PD. Adopting queueing notation [10], we let B(x) be the service time distribution function, i.e., B(x) = 1 - (1 - PD)^x. B(x) is the probability that the number of testing periods required for completion of service is less than or equal to x. Note that each testing period for a customer in queue i is independent of the testing periods applied in previous testing cycles. The service time density function is b(x) = PD(1 - PD)^{x-1}, where b(x) is the probability that service is completed (detection occurs) at the xth testing period.

In this analysis we assume that the arrival rate of failures in digital devices is so small that the probability of another failure arrival before detecting an existing one is zero.

Fig. 4. Block diagram of the service-waiting model.

We will first examine systems with two queues, then generalize the results to an n-queue system. The mean time spent in the system is the sum of the mean waiting time (W) and the mean service time (S). Then W = (i - 1)T + (r - 1)T, where i is the queue number (i.e., i is 1 or 2), and r is the average number of quantums (service time) needed by a queue before a service (detection) is completed. If queue 1 has the failure, then W = (r - 1)T. If queue 2 has the failure, then W = rT. Since we assume the same arrival rate for the two queues, W = ((r - 1)T + rT)/2 = rT - T/2, and S = rT. The mean time spent in the system is thus

E_2 = (rT - T/2) + rT.   (5)

Since the service time is geometrically distributed with density function b(x), the average number of quantums needed to complete a service is

r = \sum_{x=1}^{\infty} x\, PD (1 - PD)^{x-1} = \frac{1}{PD}.

Substituting the value of r into (5) gives us

E_2 = (T/PD - T/2) + T/PD.   (6)

This result divides the average time spent in the system into two parts. Part one, which is equal to the term inside the parentheses, represents the average total waiting time. The term T/PD is the average time a queue waits for the server to finish serving the other queues. The term T/2 is the initial waiting time, which is a function of the queue number in the testing cycle. Part two is the mean service time. Note that this result is equal to (4).

In an n-queue system we have

W = \frac{1}{n}\sum_{i=1}^{n}\big[(i - 1)T + (r - 1)(n - 1)T\big] = \frac{T(n - 1)}{2} + T(r - 1)(n - 1).

Therefore, the average time spent in the system is

E_n = W + S = \frac{T(n - 1)}{2} + T(n - 1)\left(\frac{1}{PD} - 1\right) + \frac{T}{PD}.   (7)
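As a cross-check on (7), a small Monte Carlo simulation of the round-robin server can be run: place the single fault in a uniformly chosen queue, charge its initial wait, and draw a geometric number of quantums until detection. The sketch below is our own illustration with arbitrary parameters; it is not part of the paper.

```python
import random

def simulate_En(n, T, PD, trials=200_000, rng=random.Random(2)):
    """Monte Carlo estimate of the error latency for n identical devices under
    round-robin service with quantum T and per-quantum detection probability PD."""
    total = 0.0
    for _ in range(trials):
        i = rng.randrange(n)            # 0-based position of the queue holding the fault
        r = 1
        while rng.random() >= PD:       # geometric number of quantums until detection
            r += 1
        # initial wait of i*T, (r-1) full extra cycles of n*T, minus nothing else:
        total += i * T + (r - 1) * n * T + T
    return total / trials

n, T, PD = 5, 1000.0, 0.15
analytic = T*(n - 1)/2 + T*(n - 1)*(1/PD - 1) + T/PD      # equation (7)
print(simulate_En(n, T, PD), analytic)                    # agree to within Monte Carlo noise
```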


B. Queues with Different Arrival Rates and Service Parameters

We will first examine two-queue systems and then generalize our results to n-queue systems. Let PD_i be the service parameter for queue i and r_i be the mean number of quantums queue i needs from the server to complete its service requirement. For two queues we have the following: if queue 1 contains the failure, then E_2 = W + S = (r_1 - 1)T_2 + r_1 T_1, and if queue 2 contains the failure, then E_2 = r_2 T_1 + r_2 T_2. Thus

E_2 = PF_1[(r_1 - 1)T_2 + r_1 T_1] + PF_2[r_2 T_1 + r_2 T_2].

For an n-queue system we have

W = \sum_{i=1}^{n} PF_i\left[\sum_{j=1}^{i} T_j - T_i + (r_i - 1)\sum_{j \ne i} T_j\right]

and the mean service time is given by the expression S = \sum_{i=1}^{n} PF_i r_i T_i. Therefore, the mean time spent in the system is

E_n = \sum_{i=1}^{n} PF_i\left[\sum_{j=1}^{i} T_j - T_i + (r_i - 1)\sum_{j \ne i} T_j + r_i T_i\right].   (8)
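Once r_i = 1/PD_i and PF_i = λ_i/Λ are substituted, (8) is straightforward to evaluate. The sketch below (arbitrary example values, our own code) does so and also confirms numerically that (8) agrees with the probabilistic-model result (2), as the two models should.

```python
def En_queueing(T, lam, PD):
    """Mean time in system from (8), with r_i = 1/PD_i and PF_i = lam_i / Lambda."""
    n, Lam, T_tot = len(T), sum(lam), sum(T)
    E = 0.0
    for i in range(n):
        PF, r = lam[i] / Lam, 1.0 / PD[i]
        waiting = sum(T[:i + 1]) - T[i] + (r - 1) * (T_tot - T[i])
        service = r * T[i]
        E += PF * (waiting + service)
    return E

def En_probabilistic(T, lam, PD):
    """Expected error latency from (2)."""
    n, Lam, T_tot = len(T), sum(lam), sum(T)
    return sum((lam[i] / Lam) * (T_tot / PD[i] - sum(T[i + 1:])) for i in range(n))

T, lam, PD = [1000, 2500, 1800], [3e-6, 1e-6, 2e-6], [0.25, 0.05, 0.40]
assert abs(En_queueing(T, lam, PD) - En_probabilistic(T, lam, PD)) < 1e-6
print(En_queueing(T, lam, PD))
```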

Note that the term \sum_{i=1}^{n} PF_i (\sum_{j=1}^{i} T_j - T_i) represents the initial waiting time, namely the time between when testing first begins until the queue containing the fault is first served.

VI. THEORETICAL RESULTS ON SYSTEM PERFORMANCE

Theorem 1: Let S be a digital system which consists of n devices. Let T and λ be the same for all devices, and assume that one device is faulty. The expected value of the error latency is not affected by the ordering of devices when round-robin scheduling is used.

Proof: Consider a digital system S consisting of n devices. This system has (n - 1)! different round-robin schedules. Using (8), the expected value of the error latency for any permutation is

E_n = \sum_{i=1}^{n} PF_i\left[\sum_{j=1}^{i} T_j - T_i + (r_i - 1)\sum_{j \ne i} T_j + r_i T_i\right].

Since we have the same value of T and λ for all devices, this equation reduces to

E_n = \frac{T(n + 1)}{2} - nT + \frac{nT}{PD}.

In this equation, we note that each of the three terms is not a function of ordering.

Theorem 2: Let S be a digital system which consists of n devices. Assume that one device is faulty. The expected value of the error latency is minimized if we order the devices in increasing order with respect to the term T_i/λ_i.

Proof: Consider a system S consisting of n devices. Applying (8), with r_i = 1/PD_i and PF_i = λ_i/Λ, gives us the expected value of the error latency as

E_n = \frac{1}{\Lambda}\left[\sum_{i=1}^{n} \lambda_i \sum_{j=1}^{i} T_j - \sum_{i=1}^{n} \lambda_i T_i + \sum_{i=1}^{n} \frac{\lambda_i}{PD_i} \sum_{j \ne i} T_j - \sum_{i=1}^{n} \lambda_i \sum_{j \ne i} T_j + \sum_{i=1}^{n} \frac{\lambda_i}{PD_i} T_i\right].

We note that only the first term is a function of ordering. Examining this term, we note that for any ordering we either have the term λ_j T_k or λ_k T_j for all j, k such that 1 ≤ j, k ≤ n. Assume we order the devices such that

T_1/λ_1 ≤ T_2/λ_2 ≤ ... ≤ T_n/λ_n.

Let A be the sum

A = \sum_{i=1}^{n} \lambda_i \sum_{j=1}^{i} T_j.

For any other ordering let B be the corresponding sum. Then A - B consists of terms of the form λ_j T_k - λ_k T_j where j > k. But from our assumption we have that T_k/λ_k ≤ T_j/λ_j for all j, k such that k < j. Thus λ_j T_k ≤ λ_k T_j and the term λ_j T_k - λ_k T_j is nonpositive, assuming that the λ's and the T's are positive. Hence, the ordering assumed is optimal.

Corollary 1: Let S be a digital system which consists of n devices. Let T be the same for all devices. Assume that one device is faulty. The expected value of the error latency will be minimized if we order the devices in descending order with respect to their λ values.

Corollary 2: Let S be a digital system which consists of n devices. Let λ be the same for all devices. Assume that one device is faulty. The expected value of the error latency is minimized if we order the devices in increasing order with respect to their T values.

VII. APPLICATION TO COMBINATIONAL CIRCUITS

In this section we will examine systems consisting of combinational devices. We assume that each input vector has probability q_i of being a test vector for a failure in D_i. Furthermore, each input vector is independent of previously applied input vectors and its detection probability does not change with time. The probability that none of the m_i input vectors obtained from a test window w_i is a test vector is equal to (1 - q_i)^{m_i}. Thus the probability that at least one input vector among the m_i vectors detects a fault is equal to 1 - (1 - q_i)^{m_i}. This quantity is the previously defined probability PD_i. So, for a combinational device we have

PD_i(m_i) = 1 - (1 - q_i)^{m_i}.   (9)

Fig. 5 shows the relationship between PD and m for different values of q. Note that as q decreases (the failure is harder to detect), the value of m needed for a specific level of detection increases. Substituting the values of PD and T in (1)-(4) gives us the expected value of the error latency as a function of MTBF_i, I1, I2, k_i, q_i, w_i, and m_i. For example, from (4) we obtain

E_2 = \frac{I1 + w + I2 + km}{2} \cdot \frac{3 + (1 - q)^m}{1 - (1 - q)^m}.

In general, E_n can be reduced if we decrease the values of the controllable parameters I1, I2, and k.
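Theorem 2 and its corollaries can be checked numerically by evaluating (8) over all orderings of a small set of devices. The brute-force sketch below uses arbitrary example parameters of our own choosing and confirms that the ordering by increasing T_i/λ_i yields the smallest E_n.

```python
from itertools import permutations

def En(T, lam, PD):
    """Expected error latency from (8) for devices tested in the given (linear) order."""
    n, Lam, T_tot = len(T), sum(lam), sum(T)
    return sum((lam[i] / Lam) * (sum(T[:i + 1]) - T[i]
                                 + (1 / PD[i] - 1) * (T_tot - T[i])
                                 + T[i] / PD[i])
               for i in range(n))

# Arbitrary example devices (illustration only): (T_i, lambda_i, PD_i)
devs = [(1200, 3e-6, 0.2), (900, 1e-6, 0.1), (2000, 4e-6, 0.3), (1500, 2e-6, 0.25)]

best = min(permutations(devs), key=lambda p: En(*map(list, zip(*p))))
by_ratio = sorted(devs, key=lambda d: d[0] / d[1])       # increasing T_i / lambda_i

assert list(best) == by_ratio    # Theorem 2: the T/lambda ordering minimizes E_n
print("optimal round-robin order:", by_ratio)
```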

Fig. 5. PD versus m.

We conclude this section with an example.

Example: Consider a system with two devices having the same parameters, namely k = 100, A = 50, q = 0.01. Let m = w. Fig. 6 shows a normalized curve of E2 (with respect to the minimum value of E2) as a function of m. Fig. 7 shows curves of E2 as a function of m for different values of k. Fig. 8 shows E2 versus m for different values of A. Note that m is a critical parameter when A varies over a large range.

Let α be the average number of test periods required for detection as a function of m, i.e., α(m) = E2/T. Hence, β = mα(m) is the average number of test vectors applied before detection. Fig. 9 shows β as a function of m. The curve shows that the number of test periods multiplied by the number of test vectors per period is approximately a constant as long as PD(m) is not close to 1. We define the concept of "redundant" test vectors as those tests applied after fault detection has been determined. For a large window size, the percent of redundant test vectors increases. This is why the curve in Fig. 9 increases slightly with m. Note that β_min = 1/q and occurs for m = 1. Hence, if q = 0.01, then on the average at least 100 test vectors will be applied to the device before the fault is detected.
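The example is easy to reproduce from (4) and (9). The sketch below (our own code) evaluates E2 as a function of m for the stated parameters k = 100, A = 50, q = 0.01 with m = w, locates the minimum, which should fall near the point (m, E2) ≈ (13, 21575) annotated in the figures, and also evaluates β(m) = m·E2/T.

```python
def E2(m, k=100, A=50, q=0.01):
    """E2 from (4) with T = A + w + k*m, PD(m) = 1 - (1 - q)**m, and m = w."""
    T = A + m + k * m
    PD = 1 - (1 - q) ** m
    return 2 * T / PD - T / 2

best_m = min(range(1, 301), key=E2)
print(best_m, round(E2(best_m)))          # expected: minimum near m = 13, E2 ≈ 2.16e4

# beta(m) = m * E2 / T grows only slowly while PD(m) is not close to 1 (cf. Fig. 9).
print([round(m * E2(m) / (50 + m + 100 * m)) for m in (1, 13, 50, 100, 200)])
```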

Fig. 7. E2 versus m for different values of k.
Fig. 8. E2 versus m, for A = 50 and A = 5000.

VIII. CONCLUSIONS

We have outlined the philosophy behind the use of roving emulation in detecting faults in complex digital systems. Its key features include a minimal interrupt time; no need for test pattern generation; the elimination of built-in test hardware, self-checking circuits, and coding techniques; and applicability to a large variety of digital systems. The probabilistic and service-waiting models presented give us consistent results for the expected value of the error latency; however, the service-waiting model gave us more insight into the factors affecting the error latency, such as separating the service and waiting time. This model can also allow for the arrival of failures during testing, which provides us with a dynamic oriented analysis technique.

In systems built from combinational devices we note that the expected value of the error latency is approximately a constant function with respect to the number of input-output vector pairs obtained during the opening of a window. This means that it is not the size of the window that primarily affects En, but rather the total number of inputs needed for a certain degree of detection. This conclusion holds true as long as the total interrupt period (overhead time) is negligible. The application of roving emulation to sequential circuits will be the subject of a future paper. The interested reader can find these results in [12].

Fig. 6. Normalized E2 versus m (E2,min ≈ 21575 for A = 50, k = 100, q = 0.01).
Fig. 9. β versus m.

A. Fault Coverage

As with other forms of self-testing, roving emulation may not always detect a fault when one exists. For example, assume a fault exists in Di, but whenever a test window is opened for this device an error is never propagated to an output, or, at the end of the window, to a flip-flop. Such a fault will never be detected, even though errors may be created when a window is not open and in fact remain within the system. This situation is quite unlikely, i.e., eventually an observable error will be created when a test window is open for this device, and the fault will be detected. Note that this approach to testing is actually superior to normal self-testing techniques which, if they have a fault coverage of x percent, where normally x < 97, guarantee to never detect (100 - x) percent of the faults of interest.

B. Complexity of Implementation

In practice, testing via roving emulation is not a complex methodology to implement. Construction of a universal roving emulation engine on a single printed circuit card is currently quite feasible. In fact, several engineering workstations have this capability. Systems consisting of printed circuit cards (devices) which plug into a common bus are prime candidates for testing via roving emulation. The only design for test necessary for such cards is that they have the ability to be interrupted, to dump their state, and to signal when they write on the bus. It is also assumed that these devices are designed in a modern top-down manner using CAD tools so that an accurate simulation model for the device exists. For devices, such as some commercial microprocessors, for which models cannot be easily developed, an actual hardware copy of the device can be used as part of the roving emulator.

C. Overhead Time

It is difficult to compare the relative overhead time of testing via roving emulation and a more conventional test method, such as using LSSD, because in the former case most of the test processing is done in the background. Consider, for example, a printed circuit board containing 2 x 10^3 flip-flops, an LSSD design with one long shift register chain, and a test set consisting of 4 x 10^3 test vectors. Then approximately 8 x 10^6 clock cycles are required to test the circuit. Thus, the circuit requires 0.8 s of overhead time to test, assuming a 10 MHz clock. To test via roving emulation, we must interrupt the PCB twice in order to capture its initial and final (optional) state; this requires 4 x 10^3 clock cycles, i.e., only 0.4 ms of overhead time. All other computation is carried out in the background and hence does not contribute to overhead time. Assume that to get approximately the same fault coverage as the 4 x 10^3 deterministically generated test vectors used for the LSSD test methodology, 4 x 10^5 functional test vectors are required and that k = 10^5. Thus, T = 4 x 10^5 x 10^5 = 4 x 10^10 clock cycles, or 1.1 h, are needed to verify the correctness of the response, again assuming a 10 MHz clock. Thus, if there are 20 PCB's with similar characteristics in a system, using a round-robin test scheme the entire system can be tested once each day. If we consider the overhead time to be the time t the system being tested is not carrying out useful computation, then

t(LSSD test time) / t(roving emulator state dump time) = (8 x 10^6 clock cycles) / (4 x 10^3 clock cycles) = 2 x 10^3.

Hence, using a set/scan type test approach leads to a significantly higher time overhead. However, each PCB can be tested more often than when the roving emulation test methodology is employed.
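The comparison above is simple arithmetic; the following sketch reproduces it using only the figures assumed in the text (2 x 10^3 flip-flops, 4 x 10^3 LSSD vectors, a 10 MHz clock, 4 x 10^5 functional vectors, and k = 10^5).

```python
CLOCK_HZ   = 10e6          # assumed system clock
FLIP_FLOPS = 2e3           # flip-flops on the PCB (one LSSD scan chain)
LSSD_VECS  = 4e3           # deterministically generated test vectors

lssd_cycles = FLIP_FLOPS * LSSD_VECS              # shift every vector through the chain
print(lssd_cycles, lssd_cycles / CLOCK_HZ, "s")   # 8e6 cycles -> 0.8 s of overhead

dump_cycles = 2 * FLIP_FLOPS                      # two state dumps (initial and final)
print(dump_cycles, dump_cycles / CLOCK_HZ * 1e3, "ms")   # 4e3 cycles -> 0.4 ms of overhead

emulation_cycles = 4e5 * 1e5                      # 4e5 functional vectors at k = 1e5 cycles each
print(emulation_cycles / CLOCK_HZ / 3600, "h")    # ~1.1 h of background computation per PCB
print(lssd_cycles / dump_cycles)                  # overhead ratio = 2e3
```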

REFERENCES

[1] H. Y. Chang, G. W. Smith, Jr., and R. B. Walford, "LAMP: System description," Bell Syst. Tech. J., vol. 53, pp. 1421-1449, Oct. 1974.
[2] W. G. Bouricius, E. P. Hsieh, G. R. Putzolu, J. P. Roth, P. R. Schneider, and C. J. Tan, "Algorithms for detection of faults in logic circuits," IEEE Trans. Comput., vol. C-20, pp. 1258-1264, Nov. 1971.
[3] E. B. Eichelberger and T. W. Williams, "A logic design structure for LSI testability," J. Des. Automat. Fault-Tolerant Comput., vol. 2, pp. 165-178, May 1978.
[4] J. Savir, "Syndrome-testable design of combinational circuits," in Dig. 9th Int. Symp. Fault-Tolerant Comput., June 1979, pp. 137-141.
[5] H. J. Nadig, "Signature analysis-concepts, examples, and guidelines," Hewlett-Packard J., pp. 15-21, May 1977.
[6] G. Gordon and H. J. Nadig, "Hexadecimal signatures identify troublespots in microprocessor-based industrial products," Electronics, pp. 89-96, Mar. 1977.
[7] E. White, "Signature analysis enhancing the serviceability of microprocessor-based industrial products," in Proc. 4th IECI Ann. Conf., Mar. 1978, pp. 68-76.
[8] B. Konemann, J. Mucha, and G. Zwiehoff, "Built-in logic block observation techniques," in Dig. 1979 Test Conf., Oct. 1979, pp. 37-41.
[9] J. J. Shedletsky and E. J. McCluskey, "The error latency of a fault in a combinational digital circuit," in Dig. 5th Int. Symp. Fault-Tolerant Comput., June 1975, pp. 210-214.
[10] L. Kleinrock, Queueing Systems. New York: Wiley, 1975.
[11] F. Cohen, "The USC roving emulator engine," Dep. Elec. Eng.-Syst., Univ. Southern California, Los Angeles, CA, USC DISC Rep. 82-8, Jan. 1983.
[12] A. A. Ismaeel, "Roving emulation: Theory and analysis," Ph.D. dissertation, Univ. Southern California, Los Angeles, CA, Aug. 1983.

Melvin A. Breuer (S'58-M'65-SM'73-F'85) was born in Los Angeles, CA, on February 1, 1938. He received the B.S. degree in engineering with honors from the University of California, Los Angeles, in 1959, and the M.S. degree in engineering, also from the University of California, Los Angeles, in 1961. In 1965 he received the Ph.D. degree in electrical engineering from the University of California, Berkeley.

In 1965 he joined the staff of the Department of Electrical Engineering, University of Southern California, Los Angeles, where he is currently a Professor of both Electrical Engineering and Computer Science. His main interests are in the areas of switching theory, computer-aided design of computers, fault-tolerant computing, and VLSI circuits. He is the editor and coauthor of Design Automation of Digital Systems: Theory and Techniques (Prentice-Hall), editor of Digital Systems Design Automation: Languages, Simulation and Data Base (Computer Science Press), coauthor of Diagnosis and Reliable Design of Digital Systems (Computer Science Press), and co-editor of Computer Hardware Description Languages and their Applications (North-Holland). He has published over 100 technical papers, was formerly the editor-in-chief of the Journal of Design Automation and Fault-Tolerant Computing and the co-editor of the Journal of Digital Systems, and was the Program Chairman of the Fifth International IFIP Conference on Computer Hardware Description Languages and Their Applications.

Dr. Breuer is a member of Sigma Xi, Tau Beta Pi, and Eta Kappa Nu, and was a Fulbright-Hays Scholar in 1973.

Asad A. Ismaeel (S'81-M'83) was born in Kuwait. He received the B.S. degree in electrical engineering from the University of Southern California, Los Angeles, CA, in 1977, and the M.S. and Ph.D. degrees in computer engineering, also from the University of Southern California.

Currently, he is an Assistant Professor in the Department of Electrical and Computer Engineering, Kuwait University, Kuwait. His research interests include testing and design of digital systems and fault-tolerant computing.