Dynamic fault-tolerant clock synchronization - csail

0 downloads 0 Views 3MB Size Report
synchronized. Assuming a fault-tolerant authentication protocol, the algorithms tolerate both link ... certain actions at roughly the same time. In such a system, each processor usually possesses ...... a clock is set after an adjustment, are all positive ...... that is correct at t, such that CP(t,) > ~ – fs E and ETP(tl). = ~. PROOF.

Dynamic

DANNY RAY

Fault-Tolerant

DOLEV,

Clock Synchronization

JOSEPH

Y.

HALPERN,

BARBARA

SIMONS,

AND

STRONG

IBM Almaden

Research Cente~ San Jose, California

Abstract. This paper gives two simple efficient distributed algorithms: one for keeping clocks in a network synchronized and one for allowing new processors to join the network with their clocks synchronized. Assuming a fault-tolerant authentication protocol, the algorithms tolerate both link and processor failures of any type. The algorithm for maintaining synchronization works for arbitra~ networks (rather than just completely connected networks) and tolerates any number of processor or communication link faults as long as the correct processors remain connected by fault-free paths. It thus represents an improvement over other clock synchronization algorithms such as those of Lamport and Melliar-Smith [1985] and Welch and Lynch [1988], although, unlike them, it does require an authentication protocol to handle Byzantine faults. Our algorithm for allowing new processors to join requires that more than half the processors be correct, a requirement that is provably necessa~. Categories and Subject Descriptors: C.2.4 [Computer-Communications

Networks]: Distributed

applications; distributed databases; network operating systems; C.4 [Performance of Systems]: reliability, availability, and serviceability; D.4. 1 [Operating Systems]: Process Systems—distributed

Management—synchronization; D.4.5 [Operating Systems]: Reliability —~ault-tolerance General Terms: Algorithms, performance, reliability, theory Additional time-of-day

Key Words clock

and Phrases:

Byzantine

failures,

clock synchronization,

fault-tolerance,

1. Introduction In a distributed actions

possesses assumed

system,

at roughly its

own

the

it is often same

time.

independent

to have a bounded

rate

necessary In

for processors

such

physical of drift

a system,

clock from

or real

to perform

each duration

time.

processor timer,

However,

certain usually which

over

is

time,

This is a revised version of a paper entitled Fault-Tolerant Clock Synchronization, which appeared in Proceedings of the 3rd AnnualA CM Symposium on Principles of Distributed Computing. ACM, New York, 1984, pp. 89-102. D. Dolev is also affiliated with Hebrew University. Authors’ present addresses: D. Dolev, Department of Computer Science, Hebrew University, Jerusalem 91904, Israel, [email protected] il; J. Halpern, IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, [email protected]; B. Simons, IBM Application Development Technology Institute, Santa Teresa Laboratories, 555 Bailey Ave., San Jose, CA 95141, [email protected]; R. Strong, JBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, [email protected] almaden.ibm.com. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 01995 ACM 0004-5411/95/0000-0143 $03.50 Journal of the As$oc,at,on for Computmg Machinery, Vol 42, No, 1, Januaq 1995, pp 143–185

144

DOLEV ET AL.

D.

these duration timers nized” periodically. More

precisely,

logical

clock

to drift

we assume

time

has no control)

tend

apart.

that

Thus,

the clocks

each processor

is the sum of the reading and its adjustment

must be “desynchro-

has an adjustment

of its duration

register.

It is these

timer

logical

Its

register.

(over

clock

which

times

it

that

are to be kept close together, even in the presence of processor and link failures. Let the logical clock time of processor i at real time t be represented by

C,(t).

We

require

that

there

be

some

constant

DMAX

(for

maximum

de~iation) such that lC,(t ) – C,(t)l < DJL4X. As is mentioned in Dolev et al. [1986], there are trivial algorithms for keeping logical clocks close together. For example, the logical clock time can always be a constant, say O. Of course, this is not terribly useful in practice. A useful clock synchronization also guarantee that logical clocks stay within some linear duration below clock that

timers

(i.e., the time

by a linear time

is indeed

keeps

linear

function

logical

envelope

on the logical

of the time

a reasonable clocks

of

the

synchronization. A number of recent

of

must

duration

close

is said

have presented

timer),

to real

processors

timers

be bounded

duration

approximation

correct

papers

clock

on the

to

algorithm must enuelope of the

time.

An

together

linear

that

and

logical

algorithm

and

maintain

algorithms

above

so that

within

a

enuelope

maintain

linear

envelope synchronization in the presence of faults [Krishna et al. 1985; Lamport and Melliar-Smith 1985; Marzullo 1983; Srikanth and Toueg 1987; Lundelius et al. 1988]. The algorithms of Lamport and Melliar-Smith [1985], Marzullo process

[1983], that

averaging,

these

processors. [1985] require

and Welch

involves Two

reading

algorithms of the

and the algorithms

col, requires

[1985],

analysis

algorithms

2f

are all based

there

be more

presented

in Lamport [1988]

to handle

f faults;

assumes

deal with

the existence

The

processors processors within an algorithms

a third

algorithm

on an averaging

processors.

and Lynch

+ 1 processors.

is provided,

that

[1988]

of all the other

of Welch

which

also requires 3f + 1 protocol, and 2f + 1 tains synchronization explained later. The

Lynch

the clocks require

3f + 1 processors

Melliar-Smith

and

nonfaulty and

and Krishna algorithm

Because than

Melliar-Smith et al. [1985]

of Lamport

of an authentication

of Srikanth

of

faulty

and Toueg

and proto[1987]

to handle f faults without an authentication with an authentication protocol, but it mainoptimal linear envelope in a precise sense of Marzullo [1983], for which no worst-case

ranges

of times

rather

than

a single

logical

clock

time and therefore are not directly comparable. The algorithm of Krishna et al. [1985], called phase-locking, is very close in spirit to the algorithm presented here, in that both algorithms have processors sending out synchronization messages at predetermined times. However, the algorithm of Krishna et al. [1985] requires that the number of faulty clocks be less than one third of the number of participants, and also requires certain assumptions about the nature of the communication medium. For the most recent work on phase-locking and comparison studies for hardware versus software implementations of clock synchronization algorithms, see Ramanathan et al. [1990]. In this paper, a synchronization algorithm is presented that does not require any minimum number of processors to subnetwork containing the nonfaulty that this does not contradict the lower that only n/3 faults can be tolerated,

handle f processor faults, so long as the processors remains connected. (Notice bound of Dolev et al. [1986], which says since we are assuming an authentication

Qnamic

Fault-Tolerant

protocol

here.)

The crucial

that

the majority

necessary requires is the

Clock

the transmission total

number

145

Synchronization

point

is that

since we do not use averaging,

of processors of at most

of processors

be correct.

n 2 messages in the

Moreover,

our

it is not algorithm

per synchronization

system).

The

(where

algorithms

n

of Srikanth

and Toueg [1987] and Welch and Lynch [1988], and one of the algorithms Lamport and Melliar-Smith [1985] also require only n2 messages; the other

of two

algorithms messages

of Lamport and Melliar-Smith [1985] might need as many as n‘+ J to tolerate ~ faults. A final advantage of our algorithm is that it can

deal with

either

processors [1984]

and

and

connected are no faulty

faults

The

Lynch

in any network,

algorithms

[1988]

deal

provided

of Lamport with

only

algorithm

is based

processors,

can broadcast a message with depending on the frequency

If

there

different

are A

faults,

faulty

times)

Melliar-Smith

processor

on the following

one processor

faults

in

simple

a

its current time once an hour of synchronization required).

however,

then

synchronizer

to different

there

might

processors,

observation.

can act as a synchronizer

would then adjust its clock function accordingly, making necessary for the transmission time of the message. approach.

the nonfaulty

and

network.

synchronization

If there

or link

connected.

Welch

completely The

processor

remain

are

minor

obvious

broadcast or it might

the

allowances

problems same

broadcast

and

(or day, or week, Each processor

with

if this

messages

(i.e.,

the same message

but at different times, or it might “forget” to broadcast the message to some processors. Note that it is not necessary to assume “malevolence” on the part of the synchronizer for such behavior to occur. For example, a synchronizer might fail (halt) in the middle of broadcasting the message “The time is 9 A.M.”, spontaneously recover 5 minutes later, and continue broadcasting the same message. Thus, some of the processors would receive the message “The time is 9 A. M.” Nevertheless,

at 9 A. M., while the remainder the idea of using a synchronizer

efficient synchronization faults. The role of the tries

to

succeeds. protocol

act

as a synchronizer

To ensure that

algorithm synchronizer

that

guarantees

that

that is correct is distributed:

at roughly

this

happens

the

time,

the

processors

resynchronization

mented by a method for initializing that they are close together. It must

and

at least

previously

faulty

algorithms processors

time,”

agree

on the expected

algorithm

must

we use a

be supple-

the clocks of the original participants so also be possible for new processors to join

can be extended to rejoin)

one

same

the system or for previously faulty processors to rejoin the system with clocks synchronized to those of already existing processors. Initializing clocks of the original processors turns out to be an easy task. Moreover, synchronization

an

even in the presence of Every (correct) processor

same

at “roughly

all the correct

time for the next synchronization. In practice, such a periodic

would receive it at 9:05. can be modified to obtain

to allow

the network.

new processors The join

their the our

to join

algorithm

(or

allows

joining processors to join a short time after they request to do so. Our join algorithm requires that fewer than half the processors be faulty during the join process. Again, we can tolerate any number of link failures provided that the nonfaulty processors remain connected. This requirement is provably necessary. The remainder of the paper is organized as follows: In the next section, the problem is formalized, a formal definition of linear envelope synchronization is

D. DOLEV ET AL.

146 given, These

and the precise assumptions underlying the assumptions include the existence of a bounded

duration

timers

of correct

processors,

a known

sion time of messages between correct cate signatures. The resynchronization analyzed

in Section

guaranteed discussion [1986], related

algorithm are described. rate of drift between the

upper

bound

on the transmis-

processors, and the ability to authentialgorithm is described in Section 3 and

4. The worst-case

difference

between

logical

clocks

that

is

by our algorithm is almost as small as possible, but a careful of this property is beyond the scope of this paper (see Dolev et al.

Halpern et al. [1985], Lundelius and Lynch [1984]). We discuss issues to initialization and joining in Section 5. In Section 6, we present a

synchronous update service, which enables all correct processes to agree on which processes are currently joined; this service plays a key role in our join algorithm.

The join

algorithm

is presented

8. In Section

9, we show how to modify

and 7 so that

the logical

a piecewise results and

clock

continuous

in Section

8. Section

is a continuous

function.

We

10. We recommend 2 contains

in Section

7 and analyzed

the algorithms function

conclude that

assumptions

presented of real time

with

some

the casual

reader

and

specifications

in Section

in Sections rather

discussion

of our

skip Sections that

3

than 4, 5,

are important,

but not necessary for a basic understanding of the algorithm. The reader who is interested only in the algorithms might wish to read only Sections 3, 7, and 9.

2. Assumptions In

this

and Specifications

section,

denoted

we

A1–A5,

discuss

and

algorithms.

We break

that

immediately

follow

the

the

five

from

the

of an external

the processors.

source

are met

into

structure

made by our

two parts:

of the

some effort assumptions of “real

Just as Lamport

assumptions

that

these specifications

are deeper properties that require Let us first consider the basic existence

basic

specifications

P1–P4

algorithm,

to prove. of the model.

time, ” not

necessarily

and Melliar-Smith

[1985],

in

our

model,

synchronization are properties while We

CS1–CS4 assume

the

measurable

by

Srikanth

and Toueg

[1987], and Welch and Lynch [1988], we distinguish between real time, as measured on this external clock, and duration time, the time measured on some processor’s duration timer DT. We also adopt the convention that variables and constants that range over real time are written in lowercase and variables and

constants

that

uppercase. We define time by no more than Al:

over

the

processors’

clock

time

a correct duration timer to be one that a bounded amount. More formally,

are

written

drifts

from

in real

Each correct duration timer DT is a monotone increasing function of real time, and there is a known constant p > 0 such that for all real times L,, u, with v > u: (1 +

For

range

technical

P)-’(LJ– u)

reasons

i. We take

some

then

notational

sent by correct

that

our

all

algorithms

i >0.

(Compare

PER (for period), ll&OLX E, and e, with PER > ADI

maintain

the

S2 of PER is a

synchronized;

the

in

fact,

ith

and

CS2

to

properties

Lamport and Melliar-Smith [1985 ].) As we mentioned above, the constant is an estimate on the time between successive synchronizations. All] adjustment that real-time interval

CS1

following

S1 and

bound on the maximum constant e defines the

our

(for and

a processor makes to its clock. The within which all correct clocks are

synchronization

occurs

during

the

interval

[t,, t, + e]. DMXX is the bound on how tightly processors not assume that processors can actually compute DA4XX,

synchronize. We do since it depends on

tdel,

can compute

which

upper

they

bound

in other Again,

may

not

that

CSl(i):

~ < ETP(t)

If

That

is, when their

CS2(i):

If processor

CS3(i):

are

they

on tdel),

required

< ~+,,

have

ET

then

to

between

lCP(t)

p makes

an adjustment

– CP(t)

are set forward

at time

by less than

only

for

correct

– c~(t)l

< DW. of synchronization

t and ~ < ETP(t)

ETP(t)

< ~, = ~ and CP(t)

> ~ – AD.1,

If t is in [t,, t, + e], then ETP(t) is either ~ – ADJ < CP(t) < L( + (1 + p)e,

(d) if t = t, + e, then

ET,(t)

= ~ + PER,

(e) if t > t, + e, then

ETP(t)

> ~ + PER.

If processor p is correct at time then ETP is defined throughout in that

s ~+ ~,

ADJ.

ETP(t)

times

some

they can use

< ADJ.

(b) if t = t,, then

critical

which

hold

the same pair

(a) if t < t,,then (c)

CS4(i):

below

that

bound

are close together.

O s CP(t+)

is, clocks

do assume

an upper

and C defined.

< ETq(t)

processors clocks

then That

have ET

We

(using

computations. all the conditions

processors

values,

know.

E for Dik&LY

interval.

t,~ < ETP(t) the

interval

~

or

~ + PER,

and

< ~+ ~, and t > t, + e, [t, + e, t) and p has no

@namic

Fault-Tolerant

Clock

We now show that LES. THEOREM i >0

2.1.

Synchronization

conditions

PI, P2, and CS1–CS3

If an algorithm

M satisfies PI,

in a run r, then M maintains

+ (1 + p)e}, PROOF.

Assume

[u, L)] in

observe

that

LES

p and q are correct, a run

if

r of M.

To

> DM4X.

part

(a) of CS3(j)

that

and

then

(1)

defined

(1) of LES

holds

trivially.

only if there

in

holds,

Suppose

is some j such

s If < ETP(u). Assume, without 10SS of Since q is correct and ET’’(L) > ~, by

v > tj.Since p is correct

have

ADJ

8 = ADJ.

condition

By CS1, this can happen

we must

for all

A = max{DMXX,

and ETP and ETq are both

prove

that E~P(.v) < ~ < ET~(u) or ET~(u) generahty, that ETP( u) < ~ < ET~(u).

to guarantee

P2, and CSl(i)-CS3(i)

– ADJ),

ICP( u) – C~( u)\ < DJL4X,

ICP(U) – C~(U)l

are enough

with parameters

1, ~ = O, y = PER\(PER

a =

interval

153

and ETP(u)

< ~, by

part (e) of CS3(j) we must have u s tj + e. Thus, t~ < u s tj + e. By part (c)of CS3(j), it now follows that ICP(U) – C~(U)\ < ADJ – (1 + p)e. Thus, in general, \CP(u) – C~(u)l < max{DkL4X, ADJ + (1 + p)e}. For part (2), observe DTP(u) ADJ),

– DTP(u)

that

s CP(V)

P1 and the definition – CP(U).

We

next

of C immediately show

that

give us that

if y = PER/(

PER



then CP(U)

If p makes

– CP(U)

no adjustments

by the definition

< y(DTP(u)

in [u, u], then

of C. Suppose

p makes

+ ADJ.

– D~p(z~))

CP(~I) – CP(U) = llTP(~’) at least

one adjustment

– DT’(u), in [u,L’1.

By

Pl, there is a first and last adjustment in the interval. Let w be the time of the first adjustment and let z be the time of the last. By P1 and P2, since ETP is always have

a multiple CP(Z+)

adjustments changes

of

PER,

– CP(W’) made

in the

the clock

and at adjustments

= (k – I)PER, interval

by at most

CP is equal

where

k

is at

[u, u]. Moreover,

ADJ.

Therefore,

to some

least

the

by CS2, each

-

C,(w+)

of

adjustment

CP(Z+ ) – CP( w + ) < DTP( z) –

DTP(w) + (k – l)A~J. Thus, DTP(z) – DTP(w) > (k – l)(PER y(DTP(z) – DTP(w)) > (k – I)PER. It follows that Cp(z+)

ETP, we

number

< @TP(z)

– ADJ)

and

– DTP(w)).

Now CP(U) because

– CP(Z+)

there

= DTP(u)

are no adjustments

– DTP(z)

< y(DTP(L’)

– DT’(z)),

in (z, u ]. Since the only

adjustment

in [u, w

1

is at w, we also have, by CS2, that CP(w+)

– CP(U)

Summing

these

< DTP(w) inequalities,

C,(L)

Thus,

we

PER/

(PER

get

the

– ADJ),



we conclude

CP(U)

second and

(for

“period”)

condition

– DTP(u)) of

LES,

– DTP(u))

+ADJ.

+ADJ.

with

a = 1,

~ = O,

Y =



as desired.

Algorithm

uses two parameters:

is the time

< y(DTP(w)

that

< y(~TP(u)

S = ADJ,

3. The Basic Resynchronization The basic algorithm

+ADJ

– ~~P(U)

between

PER

and E. Roughly

synchronizations

(and

thus

speaking, corresponds

PER to

D. DOLEV ET AL.

154 the

R of Lamport

[1988]),

while

and

~ (for

Melliar-Smith

estimated

[1985]

maximum

and the

deviation)

P of Welch

is an upper

and Lynch

bound

on the

difference between correct clocks. In the next section, we discuss how parameters should be chosen. For processor p, let ETP (the expected time of the next synchronization),

these A.P

(the adjustment register), and CP (logical clock time) be local variables. DTP 1s a continuously updated variable representing the duration timer (hardware clock) of processor p. When processor p starts running the algorithm, ETP = PER

and

AP =

–DTP.

Recall

that

CP is defined

as DTP + AP. Thus,

initially,

CP is O. (More precisely, if p is initialized at time LL, then we take CP(U) to be undefined and CP(U + ) = O.) In this section we assume that all processors in the network than

start

d. In

running

Section

the algorithm 5, we show

during

how

a real

time

to accomplish

this

processors initially in the network. We use the following abbreviations comprise SIGN SEND The

interval

of length

synchronous

in the description

start

less for

of the two tasks that

the algorithm:

means means

“compute “send

algorithm

a signature

consists

of two

processor. The first task, TM a processor’s clock reads ET synchronization

and append

it to the message.”

out to all neighbors.”

messages

tasks

that

run

continuously

on each

(as in assumption

A3)

from

the other

CP(t) = ETP(t), then processor p signs and sends a message saying “The time is ET” and ET is incremented by PER. Task TM if C = ET

then

correct

(for Time Monitor), deals with the case in which before that processor has received any authentic processors.

If

to all processors

begin

SIGN AND SEND ET G ET + pER;

“The

time

is ET”;

end The second

task, MSG

a processor receives receives an authentic

(for

Message

Manager),

If this message is timei’y, that is, if it comes ET – s . E < C, then processor p updates both out the message.

deals with

the case in which

a message before its clock reads ET. Suppose processor p message with s distinct signatures saying “The time is T.”

Otherwise,

the message

at a time when T = ET and ET and A and signs and sends

is ignored.

Task MSG if {(an authentic message M with s distinct signatures saying “The A (T = ET) A (ET – s “ E < C)} then begin received) SIGN AND SEND “M”; AGET–DT; ET & ET + PER;

time

is T“ is

end This completes the description of the algorithm. Intuitively, the effect of these two tasks is to have correct

processors

at the rate of the fastest “reasonable” processor, that is, one whose pass the timeliness tests. As an example of how the algorithm

running messages operates,

L$namic

Fault-Tolerant

Clock

Synchronization

155

is expected at 11:00 (i.e., suppose PER = 1 hour, and the next synchronization ET = 11). If processor p has not received a timely message (one that passes the tests of MSG)

by 11:00 on its clock,

does receive

a timely

MSG.

one of these

read

Once

12:00. Note

that

it receives saying MSG. In general,

then

before

it executes

11:00, then

tasks is executed,

this means

that

task TM.

it executes

p updates

p will

then

If processor

the body

its local

ignore

variable

any further

p

of Task ET

to

messages

“The time is 11:00,” since they will not pass the tests of Task exactly one of the tasks TM and MSG will run to completion

in a synchronization particular, MSG, but

message

interval,

and it will

be run

many messages saying “The time only one of them will be considered

to completion

only

once.

(In

is T“ may be received by task timely in each synchronization

period.) A message with s signatures saying “The time is T“ might arrive as much as s . E “early” (before ET) and still be considered timely according to the test in MSG. Nonetheless, as we show in the next section, at the completion of a synchronization the correct which is less than E. The

following

interval

during

Suppose

example which

DMXX

(the

processors

illustrates

a message actual

are

synchronized

to within

why the test in Task is considered

maximum

MSG

acceptable

deviation

between

(1 + p)d,

must

to have correct

allow size

the s “E.

clocks)

is 0.1

i receives second and in the algorithm we take E = DMAX = 0.1. If processor a message with three signatures saying “The time is 11:00,” and the message arrives

0.29 seconds

before

processor

i’s clock

reads

11:00,

processor

i will

think that the message is timely according to Task MSG. Suppose, however, that processor j is also correct and is running 0.099 seconds slower than processor i (which is possible since DJL4X = 0.1). If processor j receives processor i’s message almost instantaneously, roughly 0.39 seconds before 11:00 on its clock. signatures, MSG

processor

did

number

not

allow

j will the

of signatures,

Indeed

also consider

interval the

it is straightforward

it timely.

of “timeliness”

message to convert

then j will receive the message Since the message now has four

might

not

However, to grow have

this example

been

if the test in Task as a function considered

to a scenario

of the timely.

in which

any

bound on the size of the interval in which a message is considered timely that is independent of the number of signatures on the message results in an incorrect algorithm. In the next section,

we prove

every run of the algorithm given all i >0. As a consequence, our

4. [email protected] 4.1.

that,

are satisfied,

then

above satisfies P 1–P4 and CS l(i)–CS4(i) algorithm maintains LES.

if assumptions

A1–A5

for

of the Algorithin

INITIALIZATION

ASSUMPTIONS AND PARAMETER

DEFINITIONS.

Let

M be

the algorithm described in Section 3, with parameters E and PER chosen to satisfy the conditions presented below. Assume that there are n processors and that they are all initialized with C = O and ET = PER during a real-time interval of duration less than d. Since we take tO to be the first time some correct processor’s clock reads VO = O, it follows that all correct processors are [t., to + d).If a processor p is initialized at time u, we initialized in the interval take ~TP(u) and AP(u) to be undefined, while ETP(u+) = PER and AP(u’) = –DTP(u),

so that

CP(U+)

= O.

D. DOLEV ET AL.

156 We

choose

following

the

parameters

~e>d; DMAX=(l

+p)e+2p.

. ADJ

l)E;



= (f+

o E > DLL4X o PER > ADJ

this can be done:

First,

which

(~+ l)E. Next, choose PER > ADJ set DMXX = (1 + p)e + 2pPER.

they

satisfy

the

fix

e > d. Then

is possible

so that

choose

by A5. Then,

E such

set ADJ

E > (1 + p)e + 2 pPER.

to

Finally,

CORRECTNESS PROOF

THEOREM

4.2.1.

Under assumption

PI–P4

and CS1( i)–CS4(i)

fewer

than nz synchronization

From

Theorem

COROLLARY

real-time

A 1–A 5, elle~

all i >0.

the correctness

logical

clocks

correct

processors

that

of length

of correct

algorithm

of algorithm

all

correct

e. At the end of the

processors

are within

have the same value

are

M maintains

ith

interval,

namely

LES.

straightforward.

synchronized

(1 + p)e

of ET,

corollaqv

[email protected] is quite

clocks

send

lalue.

2.1 we get the following A 1–A5,

M sotisjies

the correct processors

messages for each synchronization

guarantees

interval

mn of algorithm

Moreoler,

Under assumptions

behind

algorithm

for

4.2.1 and Theorem

4.2.2.

The intuition The

so that

PER;

E > (1 + p)e + 2 p( f + l)E,

4.2.

CS1–CS4

(Drift Inequality); (Separation Inequality).

It is easy to see that that

of conditions

conditions:

within

at time

of each other

~ + PER

a

t, + e, all and all

(which

in this

algorithm is ~+ ~). The next synchronization occurs in the interval [t, + ~, t, + ~ + e]. PER. We also show that during the interval We show that t, ~ ~ – ti is roughly PER

clocks

drift

apart

by

at

most

an

extra

2p”

PER.

This

gives

us

the

expression for D&lXX, which is the right-hand side of the Drift Inequality. In practice, the interval during which clocks are desynchronized, which has duration which

at most

e, is quite

has duration

some typical Although

short,

roughly

while

PER,

the

values for the parameters. the intuition behind the

straightforward, a formal induction, which is why

interval

is quite

long.

correctness

proof requires CS 1–CS4 are

between After of

the the

resynchronizations, proof

we consider

algorithm

some care. We prove the paramet erized by i. The

is quite result proof

by of

Theorem 4.2.1 proceeds through a sequence of lemmas, where we prove the relevant properties one by one (and some added necessary properties). In the proof of these lemmas, we assume that properties A1-A4 hold. LEMMA PROOF.

4.2.3.

Evey

We first

run of ti

prove

most

satisfies Pl, of PI.

P2,

P3,

and P4.

It is easy to see by inspection

of tasks

TM and MSG that AP and ETP are both defined for the same values of t if p is a correct processor and that AP changes value only when ETP changes value. ETP is first defined as PER and when it is changed, it increases by PER, so that it is a monotone nondecreasing step function. AP is also a since it changes only when ETP changes. We prove at the end that AP is nondecreasing. Suppose ETP( t ) is defined. Since multiple of PER, suppose ETP = k “ PER. Since ETP increases

step function, of the lemma it must be a by PER each

@namic time

Fault-Tolerant

it is changed

adjusted

Clock

Synchronization

and starts

no more

than

out

since ETP is a step function interval ending with t,there ETP is constant

itfollows

at PER,

k – 1 times

157 that

in any interval

ETP can have

ending

been

t.Moreover,

with

which assumes only finitely many values in any must be an interval of the form (u, t] such that

in this interval,

and either

ETP(v)

is undefined

or ETP(u’)

#

at L), our initializaETP( ~’). We clearly must have L’ < t. If ETP is first defined tion assumption guarantees that CP( v + ) = ETP( L)+ ) – PER. Otherwise, this fact

is guaranteed

works

by the

code

of tasks

MSG

and

TM.

A similar

argument

in the case of AP.

For PER.

P2 observe Since

ET

that

processors

is changed

that are positive MSG also shows

only

are initialized by adding

with

PER,

integer multiples of PER. An that the synchronization values

ET

A =

–DT

can take

and

ET

on only

=

values

inspection of tasks TM and sent are always equal to the

current value of ET, and after an adjustment, a logical clock is set to the current value of ET. P4 follows from inspection of tasks TM and MSG and our assumption that if p is initialized at time u, then ETP(u’) = PER and CP(U’) = O. For P3, suppose that a processor p is correct and CP is defined at some time

t,which shown,

is not a critical there

is a first

time.

We first

prove

u < t such that

time

that

interval.

Suppose

that

CP(LL) > ETP(u)

continuity strictly

of

definition

of

that

We now prove previous while

guarantees

ETP throughout

w because

CP(U) > ETP(u). particular,

that

paragraph.

that the

be a

– PER. Since AP is and increasing function in this

some

there

u G (L), t].

that

is some

interval

each neighborhood

Itfollows C,(t)

for

L’ must

= ETP( L‘)

Let

inf{u

w =

~

By continuity, w > u and C’,(w) = ETP(w). Thus, by and MSG, we have ETP(w’) = ETP(w) + PER. The

CP then

less than

As has been

AP( L’+ ) = AP(t ). Since

critical time for p, we have by P4 that CP( L‘) constant in the interval (L), t], C’p is a continuous (u, t]:C,(u) > ETP(24)}. inspection of tasks TM

Cp(t) s ETP(t).

x > w such

that

CP is

(w, x). But

this

contradicts

w must

contain

some

of

CP s ETP throughout

the interval

the

u with

(u, t] and, in

s ETp(t).

Cp(t) > ETP(t) Since

ETP is a step function,

CP(V+) there

– PER. = ETP(v’)

Let

L) and

t be defined

– PER,

and

as in the

CP is increasing

must be some u’ > L such that

CP > ETP –

PER throughout the interval throughout (L), u]}. We claim

(L, L)’ ]. Let w = sup{u = (L, t ]: CP > ETP – PER that CP(W ) > ETP( w) – PER. To see this, observe

that

changes

since only finitely

many

to ET

take place

an xl = (v, w) such that ETP is constant throughout w < t, there must be an Xz = (w, t) such that ETP

in (L], t], there (xl, w). In is constant

(w, Xz). By construction, CP(X1 ) > ETP(xI) – PER. Since continuous from the left at w, while ETP is constant

must

be

addition, if throughout

CP is increasing in (xl, w), we

and have

CP(W) > ETP(w) – PER, as desired. If w = t, we are now done. If w < t,then we clearly must have ETP(w + ) > ETP( w). By inspection of tasks TM and MSG, we have CP(WJ+) = ETP(w + ) – PER. Since ETP is constant in (w, Xz ), it follows that CP > ETP – PER throughout [w, Xz). But this contradicts the definition of w. Thus CP(t) > ETP(t) – PER, and this completes the proof of P3. All that remains is to complete the proof of PI by showing that AP nondecreasing. Observe that the only task that changes AP is task MSG.

is If

> ETP(t) task MSG changes AP at time t,then it follows from P3 that AP(t’) – DTP(t) > CP(t) – DTP(t) = AP(t). Thus, AP is also a monotone nondecreasing step function of real time. This completes the proof of PI. ❑

D. DOLEV ET AL.

158 LEMMA and

4.2.4.

CP(t+)

receiLes

Let t be a critical

= O, (b)

CP(t)

a synchronization

time forp.

is defined

message

Then either

and

with

CP(t)

(a) CP(t)

> CP(t+)

synchronization

is undefined

– f “E,

value

or

(c)

p

CP( t + ) at time

t

signed by some other con-ect processor. PROOF.

The

initialized

only

t, (2)

at

way

t can be a critical

that

CP(t)

= ETP(t)

(according

to

time task

for

p

TM),

is if (1) or

(3)

p

Cp(t)

is is

defined and p receives a timely message in task MSG. If (1) holds, then Cp(t) is undefined and Cp(t + ) = O, while if (2) holds, then Cp(t ) = Cp(t’). Thus, suppose (3) holds, and p receives T and s signatures. The timeliness s . E. If s < f, then we are done. message

must

LEMMA

be that

4.2.5.

processor

a timely message with synchromzation value test guarantees that CP(t) = T > CP(t’) – Otherwise,

of a correct

If i >0

that is correct

by A4, one of the signatures

processor,

and t, is finite,

(1) t,< t,+~ and

then

at t, such that CP(t,)

on the ❑

so again we are done.

(2) there is a

> ~ – fs E and ETP(tl)

= ~.

PROOF. Let Zj = min{u : exists a processor p that is correct at time u and CP(U + ) = j “ PER}. If for no time u is it the case that there is a processor correct at u with CP(14+ ) = j “ PER, then we take z] = ~. By P2 and the fact that VI > 0, {t, : i >0, t, finite} is a subset of {z, : j >0, z, finite}. Let p be a processor

correct

at time

CP(ZI ) = j . PER. critical P,

then

integer

time

for p, then

Cp(Zj)

=

multiples

z] such that

CP(ZJ) and ETP(z,) Cp(Z~). of PER

CP(Z,+) = j “ PER. are defined

We want

because

CP(z~)

to show that >0.

ETP(z, ) = CP( ZJ+) by P4. If z] is not a critical

If z, is a time

for

In this case, CP(ZJ) = ETP(z1) since they are both and ETP(z, ) – PER < CP(ZJ ) s ETP(z, ) by P3. Thus,

in either case, ETP(z, ) = j “ PER as desired. Since ETP( z] ) = j . PER, it follows from P1 that for j >1 there exists a y < z, such that ETP( y+) = j “ PER and CP( y+) = ETP(y+) – PER = (j – 1) “ PER. Therefore, z,_ ~ < z,. Since the t,’s are a subset of the ZJ’S, and since the finite ZJ’s are totally ordered, it follows that the finite t,’s are also totally ordered. This proves part (1) of the lemma. For

part

(2), by definition

of

t, there

is a processor

p

correct

at t, with

Cp(t~) = ~. We have ~ = j “ PER for some j with t, = z]. We argued above that in this case, both Cp(tl) and ETP(tl) were defined and ETP(t,) = j . PER = ~. If t, is not a critical time for p, then Cp(t,) = Cp(t~ ) = ~, so (2) holds. If t, is a critical time for p, then by Lemma 4.2.4, one of the following three cases holds:

(a) Cp(t~) (b) (c)

= O,

Cp(t,) is defined, and Cp(tz) > Cp(t~) -f” E, a synchronization message with synchronization from another correct processor.

value

Cp(t~ ) is received

Since i >0 by assumption, case (a) does not hold. Case (c) also does not hold for otherwise (by A2) there would be a correct processor whose clock read ~ at Thus, case (b) must hold and Cp(t,) > Cp(t,+ ) – f” E = ~ – a time prior to t,. f. E. Hence, whether or not t,is a critical and ETP(t, ) = ~. ❑ The next lemma a run is sent out message.

shows that more than

time

for p, we have C,(t,)

the (i + l)st synchronization e time units later than the

> L( -f.

E

message sent out in i th synchronization

L&tamic

Fault-Tolerant 4.2.6.

LEMMA

and ti is finite,

Clock

Synchronization

In eve~ run [email protected]

159

and for all i >0,

ifpart

(a) of CS3(i)

holds

then t,+ ~ > ti + e.

PROOF. Suppose that ti & finite. If t,+~ is infinite, then the lemma clearly holds. Otherwise, by the previous lemma, there is some processor, say p, that is + ~ such that Cp(tL + ~) > ~+ ~ –f. E and ETP(t,+l) correct at time tl = V+l. BY PI, there From

is a u < tl+l

that is the earliest time such that ETP(L~’) = ETP(tl + ~). (a) of CS3(i) and PI, it follows that u > t,. By Pl, we have that

part

:,(u+)

= ETP(u+)

– PER

= ~+1

– PER.

Moreover,

(u, t,+ ,), since, by Pl, A, changes Interval Al, we have that CP(tz+l) s ~+1 – PER ~ – t,).Combining this with + (1 + p)(ti+

CP is continuous

in the

only when ET, changes. Thus, by + (1 + p)(t,+l– u) s ~+1 – PER the earlier inequality CP(tL + ~) z

~+1 – f” E, we get that (1 + p)(ti+,– t,)> PER – f. E. The Drift and Separation Inequalities together imply that PER – f” E > (1 + p)e, so we get that ❑

t,+ ~ > t, + e, as desired. LEMMA

If CS3(i)

4.2.7.

PROOF.

Suppose

be two processors

and CS4(i)

r is a run where that

are correct

F(+l and ~ < ET~(t) s ~+l. Since both ETP(t) and ET~(t) multiples

of PER

~ + PER.

By

in a run of d,

CS3(i)

and CS4(i)

t in run

at time

then so does CSl(i).

hold,

and let p and q

r such that

~ < ETP(t)


t,. are greater than ~, and these values must all be

by P2, we must

P3, it follows

hold

that

have that CP(t)

ETP(t)

> ~

and

> ~ + PER C~(t)

> ~.

and

ETq(t)

By part

>

(c)

of

CS3(i), if t is in the interval [ti, ti + e], then CP(t) < ~ + (1 + p)e, and C~(t) < ~ + (1 + p)e. Thus, both CP(t) and Cg(t) are in the interval (~, ~ + (1 + p)e), so that lCP(t) – C,(t)/ < (1 + p)e, which is less than DMAX. Now the

suppose

interval

interval. there

t > t, + e. By CS4(i),

[ti + e, t).

Suppose

without

can be no point

For if there

were,

Thus,

we have that

CP and

Cg

loss of generality

t’in the interval

then by P3 we would

AP and Aq are constant

are

continuous

that

CP(t ) > C~(t ). We claim

such that

functions

Cp(t’ ) is of the form

have Cp(t’)

= ETP(t’).

Then

in

in this that

k. PER.

by task TM

CS4(i). By parts (c) a synchronization value would be sent at t‘,contradicting and (d) of CS3(i) together with P3, we know that C~(ti + e) > ~ and CP(tl + e) < ~ + (1 + p)e. Since ~. is a multiple of PER, and CP(t) cannot be a multiple of PER in the interval [t, + e, t), we know that CP(t ) s ~ + PER. It is easy to see that we overestimate the maximum separation between CP and C~ at time

t by assuming

(1) CP(ti + e) = ~ + (1 + p)e,

(2) C~(tl

+ e) = ~,

(3) C, runs at the maximum possible rate (1 + p) in the interval [t, + e, t], (4) C~ runs at the minimum possible rate (1 + p)- 1 in this interval, and (5) CP(t) = ~ + PER (so that the interval is as long as possible). Making these assumptions, we see that t = ti + e + (1 + p)-l(PER – e), CP(t) = ~ + PER, and p)-z >1

– e). Thus, Cp(t) – Cq(t) < (1 – (1 + C~(t) = ~ + (1 + p)-2(PER )PER + (1 + p)-ze. Since straightforward algebra shows that (1 + p)-2 – 2p, this expression too is bounded by D&lAX. ❑

LEMMA

PROOF. ~ < ETP(t)

4.2.8.

If CS3(i

Suppose s ~+l.

+ 1) holds

in a run of.&,

p is correct and makes By P4 (which holds

then so does CS2(i).

an adjustment at a time t such that by Lemma 4.2.3), ETP(t) = z+ 1,

D. DOLEV ET AL.

160 ETP(t+)

= ~+1

+ PER,

+ 1), t must

CS3(i follows

from

LEMMA

part

4.2.9.

(c) of CS3(i If CS3(i)

Suppose

PROOF.

hence

faulty,

[tl+l,

+ 1) that

holds

fl+l

in a run of d, at time

then so does CS4(i).

t,~ < ET’(t)

s ~+,,

all correct processors are initialized once ETP is defined it stays defined

it follows

that

ETP

(d) and (e) of

+ e). Since

Cp(t+) = ~+1, it Cp(t’) – Cp(t) < MY. ❑

is defined

d s e and t,z to)in [t, + e,

(since

By parts so t > t,+l.

= ~+1,

interval

p is correct

Since by assumption [tO, tO + d), and since becomes

and CP(t+)

be in the

in the

Suppose

t].

and

in until

interval there

t > t, + e.

the interval processor p

[tO + d, t],

were

a critical

and time

u for p in the interval [t, + e, t). Since u > t, + e, by CS3(i) and PI, it follows that ETP(u) > T(. By P4, ETP(u) = ~ for some j > i. Thus, ETP(u) > ~+ ~. By P4,

we

have

ETP(u+)

= ~ + PER.

contradicting our assumption. interval. CS4(i) now follows. LEMMA

4.2.10.

PROOF.

CS3(i)

We proceed

Thus,

Hence, El

holds for all i >0 by induction

and VO = O by definition,

by

there

Pl,

we

have

is no critical

ETP(t)

time

for

in the

in ezleq run of M.

on i. For the case i = O, recall

and we assumed

> ~+1, p

that

all processors

that

tO = O

are initialized

at

some time in the interval [0, e). Since we have also assumed that if a processor p is initialized at time u we have ET’’(u) undefined, it is easy to see that parts (a) and (b) of CS3(0) hold vacuously. Clearly if a correct processor’s logical clock is not adjusted before time tO + e, then by Al it reads a value in the range [0, (1 + p )e) wherever it is defined in the interval [to, to + e], while its value of ET is PER. On the other hand, if some correct processor’s clock is adjusted in this interval, then tl < tO + e. By Lemma 4.2.5 and the fact that VI > PER, for some correct processor p we have CP(tl) > VI – f. E > PER – f” E. Since p cannot have adjusted its clock prior to tl, we must have (1 + p)e > CP(tl ) > PER – f” E, Thus, no correct processor adjusts as well Now then finite.

contradicting the Separation Inequality. its clock before to + e. This proves part (c)

as part (d). Part (e) follows from PI and P2. assume CS3(i) holds; we show that CS3(i + 1) holds.

so is ~+ ~, by definition, For part

4.2.6 implies have ETP(t)

(a) observe

so CS3(i

+ 1) is vacuous.

that by P2, it follows

that t,+ ~ > t, + e. Suppose > ~+l. By P2, it follows

that

If t,+ ~ is infinite,

So suppose

that

~+ ~ > ~ + PER.

t,+,

is

Lemma

p is correct and for some t < t,+ ~, we that ETP(f) > ~+ ~ + PER. By parts

(a)-(d) of CS3(i), it is easy to see that we must have t > t, + e. We next show that CP must be continuous in [t, + e, t). If not, then there is some adjustment (such a u exists by Pi). in [t, + e, t).Let u be the time of the first adjustment By P4, CP(U+ ) = E~P(u) = ~ for some j. By parts (d) and (e) of CS3(i), conETP(u) > ~, so j > L + 1. If ETP(u) = ~+1, then the fact that ZL < t,+, tradicts the definition of t,+~.If ETP(t4) > ~+,, then by P3, CP(U) > ETP(u) – that PER > ~+,. Since CP is continuous in the interval [ti + e, u ), it follows CP( u) = ~+ ~ for some u in the interval and hence, again by continuity, that CP(L’+) = ~+l. But this contradicts the definition of t,+~.Thus, CP is continuous in [t, + e, t), as claimed. By P3, we have CP(t) > ETP(t) – PER > ~+ ~. Since Cp(t, + e) < ~ + (1 + p)e < ~ + PER s ~+ ~, itfollows from the continuity of CP that for some point u in the interval (t, + e, t), we have

Dynamic cP(u) tion

Fault-Tolerant

= ~+1

Clock

and hence

CP(L1+)

=

But

~+l.

Thus, we must have ETP(t) of t,+,.

of CS3(i

this

again

contradicts

s ~+ ~, as desired.

the defini-

This proves

part

(a)

+ 1).

For part correct

161

Synchronization

(b), first

observe

at t,+ ~ we have

induction

assumption

that

Cr(t, +l)

together

by Lemma > ~+1

with

4.2.5 for some processor

- ~” E. By CSl(i)

Lemmas

(which

4.2.7 and 4.2.9),

p that

holds

is

by the

for every proces-

sor q that is correct at ti+, we have lC’P(tl+l) – C~(tt+l)l < DMZ4X, so Cq(tz+ ~) >~+1 -f. EDMXX>~+l – ADJ. Since ADJ < PER by the Separation Inequality, it follows from P3 that we must have ETq(t,,, ) > ~+ ~. In combination with the previous paragraph, this gives us part (b) of CS3(i + 1). (Since ETP does not change in the interval [t, + e, ti+ ~), we can in fact show that ETP(t,, ~) = ~ + PER, and hence that ~+, = ~ + PER. Thus, we could carry along

as an inductive

hypothesis

that

~ = i . PER

if t, is finite,

but we do not

need this fact here, nor will it hold for our join algorithm.) For part (c) of CS3(i + 1),suppose that p is correct at time e]. There

are

four

cases to

consider:

(1)

ETP(t)

t = [t, + ~, t, + ~ +

< ~+ ~, (2)

ETP(t)

= ~+ ~,

(3) ET (t) = ~+ ~ + PER, and (4) ETp(t) > ~+ ~ + PER. We show that only case (2J or case (3) can hold, and that, in these cases, ~.+, – ADJ < CP(t) < ~+ ~ + (1 + p)e. By Lemma 4.2.6 and the induction assumption applied to part (a) of CS3(i), we can assume assumption applied to CS3(i), Suppose

case (1) holds,

t,+ ~ > t, + e; by Lemma we can assume CS4(i).

so ETP(t)

> t, + e. By part

we have t > t,+l that ~ < ETP(t ) < ~+ ~. By [ti + e, t); ETP is defined

t,+~.By part (b) of CS3(i

4.2.9 and the induction

< ~+ ~. By assumption (e) of CS3(i)

CS4(i),

ETP

is defined

at t by assumption.

throughout

It follows

+ 1),ETP(t, + ~) = ~+,.

and Lemma

and the assumption, that

4.2.6,

we know the

interval

ETP is defined

at

Since t > t,+~,this contradicts

P1. Suppose is constant

case (2) holds, throughout

– ADJ.

By

so ETP(t)

P3,

Cp(t)

= ~+ ~. By CS4(i),

< ETP(t)

[~+~,+~+1 + e] we have ~+1 –ADJ Suppose case (3) holds, so ETP(t) u < t such

that

ETP(u+)

= ~+,

ETP is defined

[ti+ *,t).By part (b) of CS3(i

the interval

= ~+l.

< C,(t) = ~,,

+ PER.

Moreover,

By

< ~+1 + PER.

P1

and

and AP

+ 1), Cp(t, + ~) Al,

+ (1 + p)e. By Pl, there

since

t =

is a first

time

By part (a) of CS3(i + 1), u > t,+ ~. Finally, by PI and the fact that there

by PI, we have C (u+) = ~+l. are no changes to ETP in t ( e interval (u, t),there are no changes to AP, and CP is continuous in this interval. Since ti+, s u < t < t,. ~ + e, using Al and Pl, we get ~+1 < CP(t) < ~+1 + (1 + p)e. Suppose case (4) holds, so ETP(t) > ~+ ~ + PER. Let T = ETP(t). By P2, we

must have T > ~+ ~ + 2PER. By Pl, there is a first time u < t such that some correct processor q has ETq(u’) = T. There are two subcases: (4a) L~ is a critical time for q, and (4b) u is not a critical time for q. Suppose (4a) holds. Then, by P4, C~(u’) = ET~(u’) – PER Thus

t]s u. But ET~(u+)

j > i + 1. By Lemma 4.2.6 + e < t,+ ~, contradicting ti+l

= T > ~+1 and part Lemma

+ 2PER,

so C~(u+)

(a) of CS3(i 4.2.5.

= ~ for some ~.

> ~+1

+ 1), we have

+ PER,

and

t,s LL < t s

Suppose case (4b) holds, so that u is not a critical time for g. Then ETq( u) is defined and ET~(u) > ~+ ~ + PER. By PI, there is a first time u < u such that = ETq(LI+) – PER = T – 2PER > ~+l. Since ETq(L1+) = ETq(u) and Cq(LI+) ETq(u+) > ~+1 + PER, Itfollows from part (a) of CS3(i + 1) that if w is any time

in (L1, u),

then

w > t,+,.Thus

t,+l

s z] < u < t s t,+,

+ e. Now

C~ is

D. DOLEV ET AL.

162 continuous

on (u, u], so Cg(u+)

the fact that

(1 + p)e

< PER.

< C~(U+) But,

for

+ (1 + p)e

< T – PER,

any w in (u, t), C~(w)

P3. Thus, C~(u’) > T – PER, contradicting pletes the proof of part (c) of CS3(i + 1).

C~(u’)

by Al

and

> T – PER,

< T – PER.

This

by

com-

For part (d), suppose that q is correct at t,+ ~ + e and ET~(tl+ ~ + e) # ~+ ~ + PER. By part (c) of CS3(i + 1), we must have ET’’(t, + ~ + e) = ~+ ~. Let p be a processor correct at t,,~ such that CP(t~+ ~) = ~+ ~. We have assumed that all processors are initialized before tO + e. By Lemma 4.2.6, t,+ 1 > to + e. Thus, it follows from the definition of correct that q must be correct throughout the interval [t, + ~, ti+ ~ + e) and that CP(tl + ~) and EP(tl + 1) are both defined. By part (b) of CS3(i + 1), EP(t, +l) = ~+l. We claim that p sends synchronizaby inspection of tion value ~+ ~ at time t,+~.If Cp(t, + ~) = ~+ ~, this follows task TM.

Otherwise

Cp(tl + ~) # Cp(tC~ ~), so that

p must

invoke

task

MSG

at

t,+~,and at that time

p sends synchronization value ~+ ~. We now apply A2 with t = ti+l. Let po, . . . . p~ be the sequence of processors guaranteed to exist by A2, with p = po, q = p~, and k “ tdel < d. Note that

time

for

all

{~+1,

times ~+1

~,t,+ ~ + d], u = [t,+

+ PER}

by part

ICPJU) – CP,(U)I < DiWIX the synchronization value

each

(c) of CS3(i

correct

The base case holds by assumption.

time

with

that

p,+,

value

Suppose

~+ ~ before

has

ETPJu)

G

then

[t,+ ~, t,+ ~ + (j

p, sends the synchronization

value

and j < k. By A2,

p]+ ~ receives

p~ + ~ already

a synchronization

this time,

sent the message

been in the interval

p]

by CSl(i). We show by induction on j that p, sends ~+ ~ at some time in the interval [t, + ~, t, + ~ + j” tddl.

in the interval [t, + ~, t,+ ~ + js tdel], .y+l message before t,+ ~ + (j + 1) “ tdel. If message

processor

+ 1) and, if ET,< = ETP,, = ~+1,

then

it set its clock + l)tdel],

sent

by tasks TM

and MSG

to ~+ ~. This

as desired.

time

this at the

must

have

If p, + ~ did not already

send such a message, then it suffices to show that the message it receives from p] is timely, that is, it passes all the tests of task MSG. Suppose p, sent its message at time u and the message is received by p]+ 1 at time t.Since the [u, t] is contained

interval

in the

interval

[t, + ~, t,+ 1 + e) and

since

we have

assumed that pi+ ~ has not sent ~+ ~ by time t,part (c) of CS3(i + 1) implies that the value of ET for p,+ ~ must be ~+ ~ (the only other choice is ~+ ~ + PER, but by inspection nization value ~+ ~ is sent lCP\u)

-

DMXX. then

CP,+~u)l There

arrives

since

t > u, it follows

with

Thus,

CP,+jt)

one

signature)

> ~+1

– DM21X,

passes the

a message with synchroto ~+ ~ + PER). Thus, that

are now two cases. If p, used task TM

CPJU) = ~+l.

(which

< DIW4X;

of tasks TM and MSG, out when ET is set

C,,+,(t)

> C,,(u)

-

to send out its message,

so in this case the message

timeliness

test.

If

pj

used

task

MSG, pj was responding to a message with s signatures and sending a message with s + 1 signatures. Since p~ found the message timely, CP,(U) > ~+ ~ — s - E, and so CP,+,(t) > ~+ ~ – (S + 1)-E. Since p,+ ~ receives the message with s + 1 signatures, again it passes the timeliness test. By task MSG, it now follows that p~ + ~ sends out a message with synchronization value ~.* sometime in the interval [t, + ~, t,+, + (j + l)tdel). Since k. tdel < d < e, it follows that q sends out such a message before time ti+ ~ + e. By P4 when q sends out this message, it sets ET~ to L( + ~ + PER. By P1 this contradicts the original conclusion that ETq(tl, ~ + e) = ~+ ~. The contradiction completes the proof of (d). Part (e) is immediate from part (d) and P1. ❑ Proof

of Theorem

4.2.1.

By Lemma

4.2.3 @ satisfies

P1–P4

in every run.

By

l&tamic

Fault-Tolerant

Lemma Lemmas For

Clock

Synchronization

163

4.2.10 it satisfies CS3(i) for all i >0 in every run. It now follows by 4.2.7, 4.2.8, and 4.2.9 that it also satisfies CSl(i), CS2(i), and CS4(i).

each synchronization

messages:

one

value,

synchronization

than

n2 message

proof

of Theorem

each correct

message

are sent for

processor

sends at most

to each of its neighbors.

each synchronization

value.

n – 1

Thus,

This

fewer

completes

the



4.2.1.

4.3. PERFORMANCE ISSUES. We now consider some typical values for parameters of the algorithm. Suppose p = 10-6, tdel = 0.1 second, and network is completely connected with n processors. Then, so long as there

the the are

no more than two processor failures and the network remains connected with d = e = 0.2 second, E = diameter at most 2, we can take PER = 1 hour, DMAX = 0.21 second, and ADJ = 0.63 second. If we allow only processor failures

(as is the

Lynch

[1988],

diameter

case in Lamport

then

we

can

of the network

0.11 second, roughly smaller

do

is still

and Melliar–Smith even

better,

since

1. We can take

d = e = 0.1 second,

PER

[1985] we

are

and Welch assured

= 1 hour,

and AD.1 = 0.33 second.

Note

and

that

the

E = DMXX

=

that

is

DW

equal to d. As stated in Section 2, we can make d, and hence DM.4X, by giving the synchronization process high priority in the scheduling of

the operating Since our

system of the processor. algorithm never sets clocks

back,

if duration

timers

have

fixed

rates of drift from real time (as is often the case) and there are no faults, then clocks will run at the rate of the fastest correct duration timer. This means that logical

clocks

worst

case,

we

of correct

PER/(

PER

– ADJ).

have

processors from

Since

will

Theorem ADJ

tend 2.1

= (f+

to run faster that

l)E,

than

processors

if PER

real time. run

>> ADJ

at

and

In the

a rate

of

E = DMAX

= (1 + p)e + 2P” PER (these assumptions will all be typically true in practice), this worst-case rate is approximately equal to 1 + (ADJ\PER) = 1 + (f+ 1)2 p. In Srikanth and Toueg [1987], an algorithm is given that attains optimal synchronization in the sense that logical clocks of real time as duration timers (i.e., (1 + p)-l(u u > u). However, P)( u – U) for Srikanth and Toueg [1987] require less than half the total number necessary, achieve still

even with

is essentially

use our algorithm

the rate

at which

clocks

of correct

Moreover,

ours in completely

but perhaps

logical

then to set duration In our algorithm,

to maintain this optimal synchronization, that the number of faulty processors f be of processors, a requirement they prove

authentication. twice

clocks

timers DikMX

are within the same envelope – u) < C(U) – C(u) < (1 +

decrease

gain

time

the value connected

of DMAX networks.

the rate of speedup

in practice

using

to run slower by that rate. gives an upper bound on the

processors

that

have the same value

our

they

is to measure algorithm,

difference

of ET.

There

interval

their

clocks

differ

by at most

ADJ

and

between may be a

short interval of time (a subinterval of [t,, t, + e]) during which correct sors have different values of ET. By part (c) of CS3(i), itfollows that this short

can

One way to

proceseven in

+ (1 + p)e. If we assume

that p = O and E = DJL4X, then (since ADJ = (f+ l)E) this difference is bounded by approximately ( f + 2)e + 2p. PER. Using the estimates for p and PER given above, we see that the dominant term here is (f+ 2)e. This amount with n. One may be unacceptable in large systems, where f may grow linearly way around this problem is to prevent events that require timing from taking place in this interval, as suggested in Lamport and Melliar-Smith [1985].

164

D. DOLEV ET AL.

However, logical

there

clock

is another

approach.

(without

making

time t, and continue maximum real-time

running duration

distributed starts and clocks will

process, a virtual then undergoing differ

We can simply

any adjustments)

DMXX

rather

than just being

the adjustment

we make

piecewise

to clocks

Since we can take

Dolev

et al. [1986]

bounds to

in question

synchronize

Cristian

this interval,

is the some

continuous

which

may be

functions

of real

We can do this by amortizing interval,

rather

than

in Lamport and Melliar-Smith clocks in Section 9.

e = d in this algorithm,

the bound

doing

[1985].

on synchronization

it

We that

= (1 + p)e + 2 pPER) is essentially within a factor of 2 of d/2 attainable in systems with no clock drift at all (see

and Halpern

et al. [1985]

are guaranteed,

with

much

for further

worst-case

tighter

details).

bounds,

However,

the

and it may be possible

precision

with

high

probability

The

is initializing

(see,

e.g.,

and Joining

There

are two

logical

clocks

issues

that

are started

(joining)

new

or

remain. within

first

less than

repaired

d time

processors

so

units. that

the

[Cristian

et al. 1986; Dolev

initially

in

message

from

the

(thus

neighbors. time.

network another

setting

et al. 1986]. We assume starts

its logical

By assumption

either

processor.

their

clock

A2,

this

that

spontaneously

As soon

sends

requires

logical

so that

is integratclocks

message

are

diffusion

each of the processors or

as a processor

to O) and diffusion

system

The second

synchronized with those of all the other processors. The first task can be accomplished quite easily by a simple

–DT

before

[19891).

5. Initialization

ing

clocks

continuous.

over some time

all at once. This idea was suggested present an algorithm for continuous we maintain (DiM4X of the optimal bound

the “old”

with logical time until the process will suffice. Unadjusted logical

+ A . dur during

significantly less than AD] + (1 + p )e. Yet another approach is to make logical time,

using

events that begin

after a clock adjustment is made. If dur which a clock might be used to time clock coinciding no adjustments

by at most

continue

to time

upon

receipt

starts,

a message less than

of

a

it sets

A =

to all

of its

d units

of real

We now turn our attention to the problem of joining, to which most of the remainder of the paper is devoted. We start with some notation: A previously synchronized group of processors is called a chlster, and a new processor that wants to join the cluster is called a joiner. We want an algorithm that allows a processor to join a cluster within a bounded time of requesting to do so. Such an algorithm is crucial in a dynamic network in which new processors arc being added to the system. If we have a method of fault detection, such a join algorithm also allows faulty processors that have been repaired to rejoin a cluster. An algorithm achieves bounded joining if for some bound b > 0 a correct processor that requests to join a cluster is guaranteed to join within real time b. Unlike the basic clock synchronization algorithm, which does not require that some minimum number of processors be correct, a necessa~ condition for a bounded joining algorithm to be guaranteed to succeed is that a majority of the processors

in the cluster

be correct.

Dynamic

Fault-Tolerant 5.1.

THEOREM

if a processor

Clock

Synchronization

No algorithm

tries to join

165

can maintain

a cluster

LES

and guarantee

where one half

or more

a bounded

join

of the processors

are

faulty. PROOF.

Assume

algorithm

~

maintains

and ~ and bounded joining with processors are correct throughout

t and

choose in the cluster

real time processors

LES

with

parameters

A, a, ~, y,

bound b. Consider a run r where all the run and in the same cluster. Choose

n a

T such that the time on the logical clocks of the at real time t is at most T. Now choose T’ such that

T’ – T > b(y(l -t p) – a(l + p)-l) + (8 – ~) + 2A. The LES condition clocks of all correct guarantees that at some time t’ in run r, the logical processors show a time greater than T’. (This would not necessarily be true if &Z’ were not required logical clocks always

to maintain read O.)

We use r to construct groups the

LES;

two further

X and Y, each of size n/2

first

run,

rx,

the

processors

for example, runs

of ~.

(we assume in

X

are

r might Divide

be a run where

the

for simplicity correct

and

n processors that

all into

n is even).

processor

p

In

(a new

tries to join at time t.All the processors t just as in run r. At time t,processors in Y move into the state they had at time t’ in r. In the second run, rY, the processors in Y are correct and p tries to join at time t’ with the same local state it has when it tries to join in rx at time t.All processors proceed through in run rY until time t’just as in run r. Then just as p tries to join, processors processor,

proceed

not

through

in either run

X

or

tx until

Y)

time

X move into

the state they had at time

tries

to join,

no processor

time

p tries

to join,

each processor By assumption, The clock by at least

the clock

in Y by more p joins

at and after

the time

Moreover,

in X differs

p

at the

from

the clock

of

in an interval

of length

b.

T – T’. at some point

in X differs

p joins.

that

the two scenarios.

of each processor than

the network

of each processor T’ – T when

t in r. Note

can distinguish

from

Condition

the clock

Al,

part

of each processor

(2) of the LES

in Y

condition,

and the choice of T’ guarantee that they differ by at least 2A throughout the interval of length b after p joins. Thus, p will not be within A of the correct processors in at least one of the two scenarios. ❑ Theorem existence will

5.1 does not of an algorithm

in fact eventually

time

required).

were no bound “fast” group rithm

For

join

preclude

the network,

example,

of the time

the

guaranteeing in the

required

possibility that with situation

to join,

of

eventual

a processor

that

no guaranteed sketched the joining

joining requests

upper in the

processor

bound proof, could

(i.e.,

the

to join on the if there tell the

processors to run slower and the “slow” processors to run faster, each still staying within some linear envelope. We conjecture that an algothat achieves LES and eventual joining may exist without the assumption

that less than half the processors are faulty. (The following is an idea for such a possible protocol: A joiner can obtain synchronization values from all participants (this can be done, for example, using the join protocol we describe in Section 7). If a processor sees that the synchronization value it sent is above the average value of the set, then it slows down by an agreed-upon rate; otherwise, it speeds up by this rate. If this rate is sufficiently large, then joiners and all other processors can detect and ignore uncooperative processors. The process is repeated periodically until all synchronization values in the set are

D. DOLEV ET AL.

166 the

same

or ignorable.

since our algorithm

6. A Synchronous In this section, cluster

algorithm [1986],

Update

joiners

can join

the

unanimous

an algorithm

list of processors

to essentially solves

but

only

our

that

enables

in the cluster,

agree

the atomic for

set.)

we assume

However,

for

our join

Seruiee

we present

of the current the

Then

interest in this paper is in bounded joining, that less than half the processors are faulty.

on which

broadcast

special

processors

problem

purpose.

a processor

and enables

are in the

as presented

It is not

to keep track

all the processors cluster.

by Cristian

suggested

in This

et al.

as a general-pur-

pose atomic broadcast algorithm. We use it to update the list of joining processors as of the same clock time on all correct, joined processors in the system. Again,

we

maintains

start

a data

ory. We say that time T identical

with

some

structure replicated

memory

if the replicated as of clock time

guarantees

that

by each processor

definitions.

suggestively

We

called

assume

that

a (synchronous)

is consistent

all updates

to this structure

in a cluster.

in a set of processors

(Note

are made

the similarity

specifications trize

that

agreement algorithm,

satisfies

of the update

the specification all i >0,

memat clock

at the same clock

of these

informal

time

specifica-

[Dolev and Strong 1983; Pease et al. we can maintain the consistency of

replicate memory. We use the algorithm to ensure that agree on who is currently in the cluster. The update algorithm is assumed to run concurrently algorithm

processor

memories on all correct processors in the set are T. We provide a ,gwchronous update algorithm that

tions to those of Byzantine 1980].) Thus, by using the

nization

each

replicated

P 1–P4

algorithm

and that

with

CS 1–CS4.

formally.

by i. We require

all correct

Again,

We

processors

an update

to replicated

a clock now

it is useful

the following

and all correct

processors synchro-

define

the

to parame-

two properties

are

satisfied

for

SUl(i):

If p initiates

SU2(i):

memory for all ~ < ETP(t) < ~+1, then by time t,,~,the replicated with processors that are correct with ET defined at t,+~ is updated UPD. If p updates its replicated memory with UPD at time t with ~ < ETP(t ) s ~+ ~ and CP(t ) = T, then for all processors q correct at time t1+1 with ETq(t, +, ) defined, there exists a time tq < t,+~ such that Cq(tq) = T and q updates replicated memory with UPD at tq.

UPD

p: memory

at time

t such that

Intuitively, SU1 guarantees that if a correct processor initiates an update UPD to replicated memory, all memories are updated with UPD within a bounded real time. SU2 guarantees that if any correct processor updates its replicated memory with UPD, then all correct processors do so, and they do so at the same time U on their local clocks. We now provide an update algorithm. The algorithm has a similar flavor to the clock synchronization algorithm. Just as all the updates to clock values occur at prearranged times ET (which are all multiples of PER), updates to replicated memory occur in the update algorithm at prearranged times which, for technical reasons explained later, we take to be times of the form ET –

Dynamic

Fault-Tolerant

AD.1. We show that

Clock in order

in time

to do the update,

at time

ET

– 3 “ALU.

in earlier

sections

Synchronization to ensure

information

processors

must

start

We have to strengthen

had

throughout

that

the message the

form

PER

appear on the clocks of all correct As in the clock synchronization diffuses

167

the network,

has arrived

about

processors. algorithm,

information

time.

apply

the system

Inequality,

to guarantee

and processors

this message

through

the Separation

> ADJ,

at an acceptable

hear

diffusing

that

which

such

about

times

the

update

tests to determine

Processors

now

if the

maintain

two

sets UPDMSG and PENDING, both containing pairs of the form (T, UPD), where T is a clock time and UPD is an update value to be applied to replicated memory. UPDMSG consists of messages to be sent out and the times they are to be sent out, while PENDING consists of values with which replicated memory is to be updated, and the times that the update is to take place. Finally, MEM is a variable denoting the current replicated memory. We define APPLY(MEM,

UPD)

to be an action

the value UPD. The update algorithm

consists

UPDATE.

UPDINIT,

The

synchronization

first

task,

algorithm.

If

that

updates

of three is the

CP = ET

the replicated

tasks,

UPDINIT,

analogue

– 3. ADJ

memory

with

DIFFUSE,

of task TM

and processor

and

in the clock p has a pair

of the form (T, UPD) = UPDMSG, with (T + 2 “ ADJ, UPD) not already in PENDING and T = ET – 3. ADJ, then, using task UPDINIT, processor p signs and sends a message SYNC(T, UPD) to all its neighbors. We can think of this message as saying “schedule an update UPD to replicated memory at clock time

T + 2 .ADJ(

added

to the

from

the

= ET

PENDING

UPDMSG

list

– ADJ).” list.

On

once

the

This the

means other

message

(T + 2 .ADJ,

hand,

(T, UPD)

is sent.

(In

our

UPD)

must

be

can be removed applications,

we

guarantee that for all pairs (T, UPD) = UPDMSG, T indeed has the form k oPER – 3 “ADJ, so there will be no “useless” pairs in UPDMSG.) If (T+ 2 “ ADJ, UPD) G PENDING, then the update has already been scheduled, so there

is no need

to schedule

Task UPDINIT if {((T, UPD) e UPDMSG) A(T=ET–3” ADJ)} then

it again.

A ((T

+ 2. ADJ,

UPD)

@ PENDING)

A (c

=

T)

begin

SIGN AND SEND SYNC(T, 17PD); U {(T+ 2 “ ADJ, PENDING e PENDING UPDMSG

+- UPDMSG

UPD)};

– {(T, UPD)};

end Task

DIFFUSE

is the

analogue

of task

MSG

in our

clock

synchronization

algorithm. It guarantees that a SYNC( T, UPD) message will be passed along, provided the message is “convincing.” In order for the message SYNC(T, UPD) to reach processor q convincingly, it must pass two tests. The first just checks that T = ET – 3. ADJ. To show that a message is convincing, we need to show that when a message sent by a correct processor p reaches q, the value of ETP when the message was sent is the same as the value of ET~ when the message is received. This is done in Lemma 6.1 below. The second test verifies that if s is the number of signatures on the message, then T – s cE < C~ < T + 2s . E. Unlike Again,

the test in task MSG, this test is a two-sided test, and is asymmetric. the size of the acceptable interval depends on the number of messages,

168

D.

DOLEV ET AL.

so that a message considered convincing by p and then forwarded to q will still be considered convincing by q. The reason for the factor of 2 in the right-hand side of the inequality is that one multiple of E is needed to allow for the difference between the clocks of p and q, and another to allow for the time taken by the message to diffuse from p to q. Task

DIFFUSE

if {(an

authentic

message

distinct

signatures

A(T–

S” E
4. ADJ

8p(t

(Strong

we need

Separation

this

inequality

can be satisfied,

we need

factor

that

above,

the Separation

not

for protocol as before, Inequality

Inequality).

together to strengthen

with

the

drift

assumption

inequality A5 by adding

(E

> the

of 4: + 1) t, + e, and parts (a) and (b) of CS3(i + 1), ifp is correct at time t, ~ < ETP(t) < v ,+1, and ETP(t) – 4 “ADJ + DiVL4X < CP(t) < ETP(t) – ADJ – 2 .DMAX, then tl+e t,.In

Since since

Dynamic ETP(t) PER. >4

Fault-Tolerant and

~

by part

must

both

and

DihlAX

(c) of CS3(i),

t+d ETP(t)

Since “ADJ

Clock

we must

t,+l

is

+ DiW4X it follows

have

infinite,

itfollows

of PER,

– 4. ADJ

> (1 + p)e,

169

and, that

that

by our

CP(t)

is

immediate.

> ~ + PER

> ~ + (1 + p)e.

t > t, + e. We want

this

ETP(t)

constraints

to show

If

not,

that

there

So,

in fact is

some

processor, say q’, which is correct at t,+ ~. By part (b) of CS3(i + 1), we must have ETq,(t, +l) = ~+ ~. Let u = rnin{tz+l, t}. Because tl+l > t, + e, Csd(i) implies

that

both

CP( u ) and

C~,( v ) are

defined.

Using

parts

(a) and

(b)

of

CS3(i + 1), we have that ETP(LI) and ET~(u) are both < ~+ ~. Using part (e) > ~. CSl(i) now implies of CS3(i), we have that ETP( L ) and ETq,( u) are both that

lCP(L~) – C~,(Zl)l

DMXik ~+ ~ – ADJ

by

+

part

that

– Cq(u)

> DMAX>

(1 + p)e

> (1 + p)d.

t,+~ > u. Since u = min{t, t,+~},itfollows that L) = t and t < t,+~.CM(i) [t, t,+ ~), so, since C~,( t, + ~) – C~l u) implies that Aq, is constant in the interval > (1 + p)d, by Al we must have t,+l> t + d.

Thus,

Suppose

that

u c [t, t + d],processor

q is correct

at time

u, and

ET~(u)

is

defined. By CS3(i) and part (a) of CS3(i + 1), we have ~ + PER < ET~(u) < q is correct and C(I is continuous in the interval [t,u]. In v [+1. By CS4(i), particular, this means that q does not adjust its clock in this interval. From CSl(i), it follows that lCP(t) – C~(t)[ < DillAX. We have assumed that ETj(t) < CP(t) < ETP(t) – ADJ – 2. DM.4X. Since PER > – 4 “ ADJ + DikMX 4 “ ADJ

by the Strong ETP(t)

Since

q does

Separation – PER

not

adjust

Inequality,

< C~(t) its

clock

we have that

< ETP(t) in

the

– ADJ interval

– DMAX. [t, u],

we

have,

for

any

u’ = [t, u], ETP(t) By P3, we know

Since ETP(t) and ETP(t) = ET~(u’).

THEOREM synchronous is a positiue

6.2.

< C~(u’)

< ETP(t).

– PER

< C~(L1’) < ET~(u’).

that E~l(u’)

We now prove

– PER

ET~(u’) ❑

are both

the correctness

multiples

PER

by P2, it follows

that

of the algorithm.

If a run of the algorithm

memoiy integer.

of

above satisfies P2,

the?z all updates

are cam”ed out at a time of the form k . PER For each i >0, if a run satisfies PI–P4,

– ADJ, CS l(i),

to

where k CS3(i),

CS4(i), parts (a) and (b) of CS3(i + 1),and ft, is finite, then t,+ ~ > t, + e, and Moreover, each update to synchronous memoty also satisjies SU l(i)and SU2(i). requires at most n 2 messages. PROOF.

By Task

out by a correct tasks UPDINIT

UPDATE,

t,then for

T = T’ + 2. ADJ, some positive integer

only property

an update

to synchronous

memory

is carried

processor p only if Cp(t) = T and (T, UPD) = PENDINGP. By and DIFFUSE, if (T, UPD) is inserted into PENDINGP at time

used here.

where T’ = ETP(t) – 3. ADJ. By P2, ETP(t) = PER k, so that T = k . PER – ADJ. Note that P2 is the

170

D. DOLEV

To

prove

algorithm CS3(i

the

remainder

that

satisfies

of

+ 1), and if t, is finite

then

SUl(i

that

t,+l

) and SU2(i)

the

P1–P4, then

hold

theorem,

CSl(i),

assume

CS3(i),

we

CS4(i),

t,+ ~ > t, + e. First

vacuously;

have parts

a run (a)

ET AL. of

and

the

(b)

of

note

that

if t, is infinite

so we may assume

that

t, is finite

and

> t, + e.

CLAIM

to PENDING,

(a) If p is the first correct processor to add (T+ 2” AM, UPD) and it does so at a time t with ~ < ETP(t) < ~+ ~, then 3 “ AD.7 interval

and

correct processor with ET [t,t + d) will have added (T + 2. ADI,

some time

every

For

with

(a) of the

the update

SYNC(T, UPD)

memory

part

initiated

UPD)

at time

– the at

in this interval.

(b) If q is a correct processor time t~ < t,+* such that cated

T = ETP(t)

defined throughout UPD) to PENDING

UPD claim,

using

defined at t,+ ~, then there will be some = T + 2 “AD.1 and q will update its repli-

With

ET

Cq(tq) at time there

tq. are two

task UPDINIT

cases to consider: by signing

(1) Processor

and sending

p

the message

a convincing message SYNC(T, at time t and (2) p received t,and thus added (T + 2. ADJ, UPD) to PENDING using task

DIFFUSE. In

case (l),

We now

show

of processors within

task

UPDINIT

that

the message

that

real time

guarantees

are correct

d. It suffices

SYNC(T, and

have

that

CP(t)

UPD) ET

= T = ETP(t)

diffuses

defined

to show that when

through

in the

a correct

– 3 “ AD~. the network

interval

processor

[t,t + d] q’ sends a

message SYNC(T, UPD) to its neighbor q, the message reaches q convincingly if q is correct and has ET defined. Suppose the message reaches q at time t’ with s signatures. By A2, it follows that t’ < t + d, so itfollows from Lemma 6.1 that

T = ETP(t)

the

test.

first

– 3 “AIM

Moreover,

synchronizations

occur

Since,

we have

by CSl(i),

= ET~(t’)

Lemma in the

interval

lC~(t’)

– 3 “ ADJ.

6.1 guarantees

[t,t’1. Thus,

– CP(t’)l

Thus,

that

< DAL4X

the

message

passes

t’ < t + d < t,+ ~, so no

Cp(t’)< C (t)+ (1 + P)d. and CP(t~ = T, itfollows

that T – DTL4X < C~(t’) < T + Dk!XX + (1 + p)d. Our constraints now guarantee that T – E < C~(t’) < T + 2E, so the message passes the second test. In case (2), p must receive a SYNC(T, UPD) which was convincing Suppose the message has s signatures. These must be the signatures processors

(otherwise

p

would

not

be

the

first

correct

processor

at time t. of faulty to

add

(T + 2 “ADJ, UPD) to PENDING), have T–s. EL CP(t)~T+2s”

so we must have s s f E and T= ETP(t)–3’ADJ-

by’ A4. We must Since s~f

it follows and ADJ = (f+ l)E, ET (t) – ADJ – 2” DTL4X. Thus, Ta~ing q and q’ as in the previous

that ETP(t) – 4. ADJ + DW < c’P(t) < the hypotheses of Lemma 6.1 are satisfied. paragraph and using the same reasoning as

above, we can again show that T = ETP(t) – 3 “ADJ = ET~(t’) – 3 “ ADJ, so < T + 2(s + l)E, the first constraint is satisfied. Also, T – (s + 1)“E < C$t’) so the second constraint is satisfied as well. Since s s f, we have T – ADJ < C.q(t’) < T + 2” ADJ; we use this fact below. Again, the message successfully diffuses throughout the network, and part (a) is proven. For part (b) of the claim, suppose q is correct and has ET defined at t,. ~. By Lemma 6.1, we have t, + e < t. Therefore, by CS4(i) and CS3(i + l)(b), q is

@tamic

Fault-Tolerant

correct

and

arguments

has

Clock

ET~

above

that

Synchronization

defined

in

the

171 [t, t,+~]. It

interval

q adds (T + 2. All,

UPD)

follows

to PENDING

from

our

at some

time

t‘ in the interval [t, t + d). Thus, it suffices to show that there is a time tq G [t,t,+~) such that Cq(tq)= T + 2. AD], since it is clear that using task UPDATE, replicated memory will be updated at such a time tq.Suppose that C~(t’ ) = T’. We have shown that T – AM < C~(t’) < T + 2. ADJ. In particular, this means that the update is scheduled for a time in the future. From CS4(i) hence,

and CS3(i + l)(b), it follows that C~ is continuous in this interval.

v 1+1 – AD] there

> ET~(t’)

must

– AD.1 = ETP(t)

be a time

with

suppose

~ < ETP(t)

– ADJ

tq in the interval

part (b) and hence the entire It is easy to see that SUl(i) For SU2(i),

[t’, t,+ ~); A~ is constant in the interval By CS3(i + l)(b), we have C~(t,, ~) >

that

claim. follows

p updates

s lf+l.

Suppose

= T + 2 .ADJ.

when

C~(t~)

immediately

from

its replicated that

part

memory

CP(t)

By

continuity,

= T + 2. AD.1,

proving

(a) of the claim.

with

= T + 2. AD.1.

UPD

at time

Then,

by

UPDATE, it must be the case that (T + 2. ADJ, UPD) = PENDING. P3, it follows T + 2” ADJ s ETP(t) < T + 2 “AD.1 + PER. Suppose that the

first

does

processor

so at

to add

t’.We

time

(T + 2 “ AD.1, UPD) now

prove

arguments showed, we have tasks UPDINIT and DIFFUSE lows that

lETq(t’)

– ETP(t)/

that

to PENDING,

ETq(t’)

and

= ETP(t).

As

t

task By q is

suppose our

it

earlier

T – ADJ < Cq( t’ ) < T + 2 “ ADJ. In addition, guarantee that T = ET~(t’) – 3. ADJ. It fol-

< PER

and hence

(since

ET

is always

a multiple

of PER), that ET~(t’) = ETP(t). Thus, ~ < ETq(t’) s ~+ ~. By the claim, it follows that for all processors q’ with ET defined at ti+~,there is some time t~, < t,+~ such that C~ (tq,)= T + 2. ADJ memory with UPD at this time. The

nz

update

neighbors. We

message

message

can improve

sends

the performance estimate

on DMAX. thetwo-sided

+ s o(D

is straightforward,

processor

that

since

at most

they

update

it is clear

one

message

replicated

that to

for

each

each of

its

H

can get a better estimate replace

bound

each

and

+ E).

7. A Synchronization

E. Recall

algorithm

that

somewhat

E was meant

if we

to be an

If we can get an improved estimate D on d, then we can test T–s. E< C< T+2S. E by T–s. E to+ d, there

For joined

The

we require

it is only

throughout

next

to

(A7)

another

than

A7:

that

says that

processor

it wants

during

holds

a join

~ processors

at all times

process.)

that

are correct

and

[t, t + d]. at all

that

t a correct

times

is joined

This interval [t, t + d + (1 + p )PER]. that a joining processor has a neighbor processors

the assumption

to hold

are more

the interval

assumption

connected

that

required

and

correct

assumption will that it can rely

processor

is

throughout

the

be used to guarantee on to notify the other

to join.

For all processors p and all times t > to, there is a Correct joined processor q that is a neighbor of p such that q is correct throughout the

[t,t + d + (1 + p) PER1.

interval A7

can

be eliminated,

algorithm. of A7.

Instead,

The

parameters, multiple of Roughly nization

role

although

we assume

the

that

of the parameter

result

PER PER

speaking,

if there

take place

a resynchronization

will

LPER.

take place within

complicated

mitigating

the strength

.& is now

shared

by two

should be thought of as a large values are multiples of PER.

are no processors

once every

be a more

thus

in algorithm

LPER and PER, where LPER PER. As before, resynchronization

will

would

is small,

trying

to join,

If processors PER,

then

a resynchro-

are trying

thus minimizing

to join,

then

the amount

time joining processors have to wait to join the cluster. In addition to PER and LPER, the algorithm uses the parameters

E, ADJ,

and ~ (so that, informally, processors know an upper bound on the number failures). Again, each processor has local variables ET, xl, and C, as well local

variables

cluster), other

CLUSTER

JOINERS

variables

(describing

(describing

we shall

which

describe

which

shortly.

joined at time t if ET’’(t) is defined. processor to cover processors that join sor p is correct

at time

t if it follows

joined, its duration timer has been it joined through time t. We assume all

the

that

processors

a joiner in

the

knows network

processors

processors

want

Formally, We after

who

extend the initialization

its neighbors

signature functions (the SP of assumption of all processors in the network, as well

the

of as

in the

a processor

of p is

definition of correct as follows: a proces-

specification,

(i.e., has satisfied

(including

currently

and a number

we say that

its algorithmic

correct

are

to join),

of

Al)

and, if it is from

the time

are. We also assume

that

joiners)

own

know

their

A3) and how to check the signatures as the values of the parameters D, ~,

E, PER, and LPER. For simplicity, we assume that the signature function of a joining processor is distinct from all other signature functions that were ever used in the network. (In particular, this means that if a processor is rejoining after A8:

being

repaired,

it must

use a new name

and signature.)

possesses a signature of processor p If at time t some correct processor and if p is correct at time t,then p has been correct since it issued the signature.

At the end of Section 8, we indicate how to remove this assumption, at the cost of a slight increase in the complexity of the algorithm and an increase to a worst-case time requirement for all joins. For simplicity, we also assume that the string representing the name of any processor p is unforgeable. (For example, we could identify p with Sfl applied to the empty body.)

Dynamic

Fault-Tolerant

We assume timer, how

but

a correct

its variables

they

assume

that

Clock

become

an initial

all initialized

processor

ET,

defined cluster

(using

Synchronization

173

that wants

to join

has a correct

A, C, and CLUSTER

are all undefined.

during

of

the

execution

RO containing

more

the initialization

than

algorithm

the

duration We show

algorithm.

~ correct

discussed

We

joined

also

processors,

in Section

5) during

[t., to+ d). The correct members of the initial cluster are initialthe interval ized with A = – DT, ET = PER, JOINERS = 0, and CLUSTER = RO. The first task of the algorithm is called RTJ (for request processor wants to join a cluster, it sends out a special message

of the form

p to decide Task

RTJ

when (Request

if processor SIGN

RTJ( p) to its neighbors.

it wants

AND

for

for joiner)

to join

SEND

some mechanism

a

to join.)

to Join;

p wants

(We assume

to join). When “request-to-join”

then

begin

RTJ(P);

end All

correct

when

processors

a processor

processor

p

must

agree

p in the cluster

schedules

on which

receives

an update

to replicated

clock time after it receives the message Thus, if p receives a message before sending

of a SYNC

message

for time

send the message

in this synchronization

to be sent at time

ET

to receive

+ PER

a request-to-join

processors memory

ET

– 3. ADJ. period,

that

all processors

If not, then

It is possible

from

in the cluster

time, this guarantees that By including q’s signature the

other

processors

request-to-join faulty Task

processor

first

Thus, from

q,

possible

it is too late to

q before

time that

ET

our

the update

processor

– 30 ADJ,

replicated

Since

perform

is scheduled

for one correct

while

memory

update

is

algorithm

at the same clock

all processors in the cluster will agree on JOINERS. on the request-to-join message, p is “proving” to all

that

message.

at the

and the message

another does not. Our later tasks will ensure updated only once. The result of the update adds q to JOINERS. ensures

to join.

message

(by appropriately updating UPDMSG). it schedules the time ET – 3. ADJ,

– 3 “ ADJ. message

want

a request-to-join

the

update

Without

to arrange

message

this

was

requirement,

for “phantom”

sent

in

it would

processors

to join

response

to

a

be possible

for

a

the network.

ADD

if {(joined) begin if C< ADJ;

A (an

authentic

ET–3SADJ

UPDMSG

e

message then

UPDMSG

M

with

T~ET–3”

body ADJ

RTJ(q) else

is received)}

then

T~ET+PER–3”

U ({T, M)};

end The next task TM’ invoked

when

is the analogue

a processor’s

clock

of task TM.

reads

ET

Just like TM,

before

the task TM’

the processor

is

has received

any authentic synchronization messages. However, there are some differences between TM and TM’. The most important is that TM’ must also add new processors to CLUSTER. Thus, when a processor p invokes TM’, it sends out a message “J(ET, JOINERS u CLUSTER)” which says (essentially) “The time is ET; set CLUSTER to JOINERS U CLUSTER.” The message is sent to all of p’s neighbors in JOINERS U CLUSTER. (This is how we interpret tive SEND below.) The second half of the message does not convey

the primiany useful

174

D.

information they

to the

all agree

information processors,

on

processors

currently

JOINERS

to the joiners. they will learn

and

in the

cluster,

CLUSTER.

since,

However,

DOLEV

ET

AL.

as we shall

it does convey

see,

useful

By getting copies of this message from a number of which processors ought to be in the current cluster.

(In general, we would need to include the complete contents of replicated memory in this message, so that a joining processor would be able to set its replicated

memory

replicated Another

memory here.) difference between

would

like

appropriately.

to take

PER

to join

not

to desynchronize

want since

this

unnecessary

CLUSTER would

divides

of PER).

Therefore,

tions

occur

will

There

is one

Thus, means

during

last

it knows

in this

there

is no other

of optimization. algorithm,

units

if there TM’

traffic

to

due

to

only if JOINERS

of some new processors is an appropriately there

a

we do

are no requests

of message

invokes

in which

We

to allow

to do so. However,

amounts

LPER

intervals

minor

time

a processor

(where

every

is the result

requesting

in excessive

ET

roughly

PER

we assume

small

after

every

result

# 0 (which

or if LPER

and TM’

soon

synchronizations.

– CLUSTER join)

TM

simplicity,

to be relatively

processor join,

For

that want

chosen

are no joins,

to

multiple

synchroniza-

LPER.

subtlety

in TM’.

We

mentioned

above

that

it is

possible that a joined processor p receives a request-to-join from q before time ET – 3 “ ADJ on p’s clock. while another joined processor p‘ receives q‘s request-to-join after ET – 3 “ ADJ. Assuming that p remains correct long enough

to

initiate

JOINERS

for

message

telling

an

update

all processors, everyone

to

replicated

although

to add

p’

memory, will

q to the

q

also have

list

of

will

be

scheduled

JOINERS

the

set

sending

during

synchronization period. Since this would be an unnecessary processors q e JOINERS, we remove from the UPDA4SG list

in

the

a

next

update, for all all pairs of the

form (T, M), where the body of M is RTJ(q). Let REMOVE(UPDMSG, JOINERS) be the task which does this. TM is run only by processors in the cluster (since they are the only ones with

C defined).

updated

After

appropriately.

the

message

Besides

is sent

the variables

uses new variables LASTV and LASTJ that value sent out and the last value for JOINERS.

out,

a number

mentioned

of variables

already,

are

algorithm

J%’

record the last synchronization (Initially, LASTV is undefined

and LASTJ is 0.) After the message is sent, LASTV is set to ET and LASTJ is set to JOINERS. In addition, ET is updated (by adding PER), CLUSTER is redefined to JOINERS U CLUSTER, and JOINERS is set to 0. Task TM if C = ET then begin if {(JOINERS – CLUSTER + 0] or ( LPER divides SIGN AND SEND J( ET, JOINERS U CLUSTER); LASTV +- ET; CLUSTER G JOINERS U CLUSTER; LASTJ G JOINERS; REMOVE(UPDMSG, JOINERS); JOINERS b D; end; ET G ET + PER; end

ET)}

then

begin

~narnic

Fault-Tolerant

We next describe detail,

task

form ET

Task

MSG

J(T,

R)

– ISIGI”

Clock

MSG,

works

signed

Synchronization which

as follows:

by the

is the analogue

If processor

processors

in

E < C, and R = JOINERS

on the message, p keeps

track

JOINERS

to

adjusts

175

SIG

that

then, ET

its clock

to ET,

and increases

of ET

and

and

sets

JOINERS,

JOINERS

to

MSG.

0.

In more

a message

is tinze~,

U CLUSTER,

of the last values CLUSTER,

of Task

p receives

of the

i.e.,

T = ET,

as before,

p passes

by PER.

In addition,

adds the processors One

new

feature

in here

(whose importance will become more apparent when we consider the next task) is that p records which processors signed the message, using a variable MSIG. MSIG consists of tuples of the form (T, R, SIG), where SIG is the set of processors (other than p itself) that are known to have signed a message of the form

J(T,

always = SIG Task

R). Initially

has at most

MSIG

is empty.

one tuple

if (T, R, SIG)

For

of the form

= MSIG;

each

T and

(T, R, SIG).

otherwise

we take

R, we ensure

We define

MSIG(T,

that

p

MSIG(T,

R)

R) = 0.

MSG

if {(an authentic received]

message

M of the form

A (JOINERS

– CLUSTER

J(T, # 0

R) with or

signature

set SIG

LPER

divides

G CLUSTER)

A (ET

ET)

is

A (T =

ET) AR

U CLUSTER)

= JOINERS

A (SIG



ISIGI - E

< c)} then

begin

SIGN AND SEND AGET– DT; LASTV

M;

G ET;

CLUSTER * JOINERS LASTJ G JOINERS; REMOVE(UPDMSG,

u CLUSTER; JOINERS);

JOINERS G D; ET +- ET + pER; MSIG(T,

R) +- SIG;

end A joiner

q is able

~ + 1 processors the cluster sufficient not

alone support

forward

more

the

in the cluster.

by getting

Task MSG’

to join

a message

is not sufficient to join. than

cluster

A joiner

to guarantee

message

is that of the

forwarded the first one, p will set ET to message can no longer satisfy the requirement task FORWARD, T # ET. In fact, ●

LASTJ

J(T, that

from

R) with a joining

a processor form

support

of at least

a processor

p in

q = R signed

by p.

processor

will

p in the cluster

J( T, R),

since

after

p

get will has

ET + PER, so the second such T = ET. By using the following

p may still pass on a message of the form J(T, R) even if p will do so if all the following conditions are met:

# @ (so that

o T = LASTV

q gets the

q gets support

of the form

The problem one

when

and

there

are some processors

R = CLUSTER

(so that

waiting

the message

to join), is one that

p sent

before it adjusted its clock), . IMSIG(T, T)l < f (so that p does not know of ~ processors besides itself who have previously signed this message) ● SIG — MSIG( T, R) # @ (so that there are some new signatures on this message).

176

D.

Task

DOLEV ET AL.

FORW~D

if {(an authentic message M of the form Y( T, R) with signature set SIG A (LASTJ # 0) A (T = LASTV) A (R = CLUSTER) A received) (IMSIG(T, R)l f + 1)) and q = R (so that q is one of the JOINERS), then q sets C to T, sets ET to T + PER, sets CLUSTER to R, and sets JOINERS and LASTJ to 0. At that point q has joined the cluster. Recall that if q is correct, then q is joined if and only if q has ET defined. As we prove formally in the next section, our assumptions guarantee that q collects f + 1 signatures on a message processor other

correct

Task

of the

form

sets its clock

J(T,

R) within

to T; thus,

processors

a short

q’s clock

time

is indeed

after close

the

first

to that

correct

of all the

at this point.

JOIN

if {a processor of the form if {IMSIG(T,

in R with ET undefined receives an authentic message J( T, R) with signature set SIG} then begin R) - MSIG(T, R) u SIG R)[ < f + 1} then MSIG(T,

if {I MSIG(T, R)l > f + 1} then A~T– DT; ET +- T + PER; JOINERS LASTJ

M

begin

G 0; ~

0;

CLUSTER

G R;

end end This

completes

8. Analysis

the description

of the algorithm.

of the Join Algorithm

In this section,

we choose

section and of conditions Our parameter definitions

the parameters

used in the algorithm

of the previous

CS1–CS4 so that they satisfy the following conditions. are similar to those used in algorithm M, but there

are some differences: we use the Strong Separation Inequality, we use LPER rather than PER in defining DiWIX, and we assume e > 2 d rather than e > d. This latter choice allows a bigger window to give the joiners time to join the cluster. We choose the parameters for Algorithm J2? as follows:

● e>2d, o LPER ●

DMAX=

is an integer (1 + p)e

= ADJ = (f + l)E, . E > DMXX, and ~ PER > 4. ADJ.

multiple

of PER,

+ 2p- LPER,

L&mmic

Fault-Tolerant

Let

~

Clock

be the join

parameters

chosen

~

synchronization

to satisfy

THEOREM 8.1. algon”thm

Synchronization algorithm

the conditions

Under assumptions

satisjies

17’7

P1–P4

A1-A4,

and

described

in Section

7, with

above. A5’,

A6,

CSl(i)–CS4(i)

A7,

for

andA8,

all

euery run of

i > 0. Moreok’er,

a

correct processor p that requests to join will do so within (1 + p)(PER + 3. ADJ + DAL4X) + 3d of the time the request is sent. In addition, fewer than nz messages are sent for each synchronization n joined

processors,

and fewer

tion ualue for which From

Theorem

value for which

than (k + f + I)nz

there are k > 1 joiners 2.1, we immediately

there are no joiners

and

messages for each synchroniza-

and n joined

processors.

get the following

corollary

to Theorem

8.1: COROLLARY under The

proof

following the

8.2.

assumptions

Algorithm A1–A4,

of Theorem

sequence

rather

than

tasks

with

of

M

and

changes

LEMMA

8.3.

LES

to that

of Theorem

that

occasionally

of these

Every run of S

of task JOIN,

LEMMA receives

T must

8.4.

CP(t+)

identical

considering

to those

the

considering

lemmas

tasks

LPER

to the reader,

satisfies Pl,

P2,

of

instead

indicating

of ~ of

only

P3, and P4.

4.2.3.

We

Let

assumption

A4,

be a multiple

t be a critical

= O, (b)

CP(t)

a synchronization

must

check

that

these

processors as well as for processors already in the is showing that P2 holds for the joining processors. joining processor, say p, for which P2 fails. By and our assumptions

tion, we can show that p sets ET to T + PER, where value that must have been sent by at least one correct

and

join

4.2.1. We have the

are almost

4 (modulo

CHANGES FROM THE PROOF OF LEMMA

by hypothesis

a bounded

required.

properties hold for joining cluster. The only difficulty If not, consider the first inspection

and achieves

and A8.

proofs

in Section

Thus, we leave the proofs

the major

A7,

8.1 is similar

lemmas

the

A6,

of lemmas,

corresponding

PER).

A%’ maintains

A5’,

of PER,

message with

and

CP(t)

(a) CP(t)

> CP(t+)

synchronization

initializa-



we are done.

time for p. Then either

is defined

about

T is a synchronization joined processor. Since

–f”

ualue

is undefined E,

or

p

(c)

CP(t + ) by time

t

signed by some other correct processor. CHANGES TO THE PROOF OF LEMMA a critical received

sors, one of which LEMMA processor LEMMA

must

8.5. If i >0 that is correct 8.6.

and t, is finite, LEMMA

4.2.4.

It is now possible

that

t could

be

value for p because p joined at t. But in this case p must have messages with synchronization value CP( t+) signed by ~ + 1 proces-

8.7.

be correct

by A4. Thus,

(c) holds.



and t, is ji%ite, then (1) t, < t,, ~ and (2) there is a at t, such that CP(ti) > ~ – f. E and ETP(t,) = ~.

In every run [email protected]

and for

all i >0,

if part

(a)

of CS3(i)

holds

then t,+ ~ > t, + e. If CS3(i)

and CS4(i)

hold

in a run of 97, then so does CSl(i).

CHANGES TO THE PROOF OF LEMMA 4.2.7. that there could be no point t’in the interval

Whereas before we could show such that CP(t’ ) = PER, we can

D. DOLEV ET

178

AL.

now show that there is no point t’ in the interval such that CP(t’) = LPER. Thus, we need to replace PER by LPER in the expression for DMAX. ❑ LEMMA

8.8.

If CS3(i

LEMMA

8.9.

If CS3(i)

that ETP is defined Note

that

the

corresponding

v 1+1 clause that

now

then so does CS2(i).

and if ~ < ETP(t)

the interval

[t, + e, t], then CS4(i)

of

8.9 are

Lemma

since

we now

is defined

since

with

prove

4.2.9,

ETP

is necessa~;

a processor

We need

hypotheses

that

in a run of 9,

holds in a run [email protected]

throughout

Lemma

implies

+ 1) holds

have

throughout

we now

allow

joining,

stronger clause

the

interval

it is not

an analogue

some additional

of Lemma

hypotheses.

4.2.10.

In

also holds.

than

the

J( < ETP(t ) < ~+ ~ has been joined

< ~+ ~ implies

those

“if

[t, + e, t]”.

necessarily

the s This

the

since time

order

of

~. < ET’(t)

case

t, + e.

to do this,

we will

Define:

NEWO(i). If i >1, then no processors can join in the interval [t, _ ~ + e, t, ]; moreover, if processor p joins after time ti _ ~ + e, then it must be as a result of receiving NEWl(i).

a message

If a correct

does so first correct

of the form

processor

during

If

i >1

the interval NEW’3(i). ERSP(t,)

and

prior

< ETP(t)

~.l

to that < ~,

q are joined correct and CLUSTER

If p

LEMMA

PROOF.

is correct

= ~ 8.10.

and NEW5(i)

it

in R still

ETP

is defined

throughout

at

time

t,, then

processors at time + e) = CL USTER~(tl

t < t,+ ~ and

LASwP(t)

.lOLV-

t, + e, then + e).

is defined,

then

for some j 5 i.

CS3(i),

NEWO(i),

hold for all i >0 We

R), then

time. then

processors at time = Cluster.

NEW’4(i). If p and q are joined correct JOINERSP(t, + e) = 0 and CL USTERP(t, LAS~P(t)

I(Y,

[t, _, + e, t].

If p and = Joiners

NE W5(i).

j > i and p = R. of the form

[t,, ti+ d] and all the processors

the interval

at t,+ 2 d have joined

NEW’2(i).

1( ~, R) with

signs a message

proceed

NEWl(i),

NEW2(i),

NEW3(i),

NEW4(i),

in every run of 9.

by induction

on

i. For

the

case

i = O, the

proof

of

CS3(0) proceeds just as in Lemma 4.2.10, so we omit it. NEWO(0) and NEW2(0) are vacuously true since O z 1. NEW1(0) holds because VO = O by definition, and, by P2, no correct processor signs a message of the form J(O, R). For NEW3(0), note that there are no joined correct processors at time to.For NEW4(0), suppose that p and q are joined correct processors at time tO + e. Notice that p and q must have been part of the initial cluster, since no processor can join until after a synchronization value has been sent out. This cannot happen before some initially correct processor has executed task TM’ or MSG’, and, by Lemma 8.6, that cannot happen before time tO + e. When p and q are initialized (which, by assumption, happens at some time in the interval [tO, tO + e)), then JOINERS is set to 0 and CLUSTER is set to RO. JOINERS is changed from this initial setting only by using the synchronous update service. By P2 and Theorem 6.2, an update by the synchronous update service is performed only at a clock time of the form ET – ADJ, which is at least PER – ADJ. By Al and Pl, no correct processor’s clock reads PER – ADJ until after time tO + e. Thus, we must have JOINERSP(tO + e) = JOINERSq(tO

Dynamic

Fault-Tolerant

+ e) = 0.

Since

Clock

CLUSTER

is sent out, it follows NEW5(0),

observe

or task MSG’, For show

the

hold

processor

LASTV

p

step,

for

joins

assume

at

time

a new synchronization

+ e) = CL USTER~(to

is defined

processor

only by execution

has LASTV

that

i + 1. We

179

only when

CL USTERP(tO

so no correct

inductive

they

is updated

that that

Synchronization

first

all

our

prove

t > t, + e. It

of either

defined

hypotheses NEWO(i

must

be

value

+ e) = RO. For task TM’

until

t;.

hold

for

+ 1).

j s i; we

Suppose

as a result

correct

of

receiving

messages of the form J(T, R) with a total of at least ~ + 1 signatures p = R. By A4, one of these signatures must be that of a correct processor, q. Tasks

TM’,

MSG’,

and FORWARD

guarantee

that

T must

be a synchro-

value ~ with j > 1. Be definition of t~ we must have t > t].To prove + 1), it suffices to show that j > i + 1. Suppose j < i + 1 so that we

nization NEWO(i

can apply our induction hypotheses. Thus, by NEWl(j), q must t,+ d and p must have joined or message .7( ~, R) before

tj+ 2 d. By A8, p cannot Thus,

and say

p

original

cannot

have

assumption

correctly

joined

request

at or

t > ti + e. This

to join

twice

with

have sent the failed before

the same name.

t,+ 2 d s t, + e, contradicting

after proves

NEWO(i

the

+ 1).

NEW2(i + 1) is immediate from NEWO(i + 1). Next we show that the hypotheses of Theorem

6.2 hold:

we prove

CSl(i),

then t,+ ~ > t, + e. CS4(i), parts (a) and (b) of CS3(i + 1),and if ti is finite From Lemma 8.9 and NEW2(i + 1) we have CS4(i), and by Lemma 8.7, we have CSl(i). Moreover, the proof that parts (a) and (b) of CS3(i + 1) hold is now identical the

proof

to that

of Lemma

of Lemma 4.2.10

interval

[t, + e, t,+ ~]. We

NEWO(i

+ l).)

completes

Now,

chronous

algorithm

NEW3(i

use

case that

p and

that

JOINERS

in the

synchronous time

q were

JOINERSP(tz update

t,+ ~, so

CL USTERP only when

interval

joins

joined

t, is finite,

since then

this part

can join it

of

in the

follows

from

t,, ~ > t, + e. This

6.2. SUl(i)

that

p and

and

SU2(i)

correct

processors

of the

syn-

By SU2(i),

happen the

+ ~) = JOINERS~(ti+ in the interval value

correct

at ti + e. By NEW4(i), All

the

as a result

updates ~). There

have

the fact that

we

updates

to

of using

the

occurred

by

updates

to

all

are

[t, + e, t,, ~), since

is sent. Thus,

proces-

[t, + e, t,+ ~1, it must be the

+ e) = 0.

[ ti + e, t, + ~] must

CL USTER~

q are joined

in the interval

+ e) = JOINERS~(t,

algorithm.

a synchronization

that

hold.

JOINERSP(ti

and

(Note

no processor

assumption,

if

properties

+ 1) we suppose

sors at t,+ ~. Since no processor know

that

this

8.6,

that

and is omitted. fact

for Theorem

6.2, we know

update

To prove

can

by Lemma

the hypotheses

By Theorem

4.2.10,

uses the

no

updates

occur

CL USTERP(t,

+ ~)

= CL USTER~(t,, ~) follows from NEW4(i). (If there is an update to CL USTERP or CL USTER~ at t,.~, then we may have CL USTERP(t~~, ) + CL USTERq(t~+ ~).) This proves NEW3(i + 1). The proof

of part

(c) of CS3(i

+ 1) is the same as that

of Lemma

4.2.10.

We next prove NEWl(i + 1). Suppose some p is a correct processor that signs a message of the form J(V + ~, R). The first time p signs such a message, it must be the result of executing either task TM’ or task MSG’. By inspection of the algorithm, only a joined processor with ET set to ~., can correctly sign a message of the form J(K+ ~, R) using task TM’ or MSG’. By NEWO(i + 1), if a processor q joins after time t, + e, then q must set its initial value of ET to at least ~+ ~ + PER so q cannot sign such a message. Thus, p must have been joined

at time

t, + e. An

additional

inspection

of the algorithm

shows that

we

D. DOLEV ET AL.

180 must

have R = CL USTERP(t,

other

processor

q that

+ ~) U JOINERSP(t,

is joined

at time

+, ). By NEW3(i

t,+ ~, we must

+ 1), for any

have cLusi%R~(ti+

~) =

CL USTERP(tl, ~) and JOINERS~(t,, ~) = JOINERSP(t,. ~). Using this observation, as in the proof of Lemma 4.10, we can show that if q is correct and joined at t,+,and q is still

correct at time t,+~ + d, then q sets ET to ~+ ~ + PER at t G [t, + ~,t,+~ + d).At time t~, q signs and sends out a message of

some time

the form J/~+,, R) and sets JOINERS, = 0, CLUSTER, = R, and ET, = ~+ ~ + PER. This proves the first half of NEWl(i + 1). We still must prove that any processor q ● R that is correct at t,, ~ + 2d has joined

before

that

time.

Without

loss of generality,

assume

that

q has not

before t,.~ + 2d, and in joined by time t,+,.We now show q has joined addition that when q joins, it sets ET~ to ~+ ~ + PER. This will prove part (d) of CS3(i + 1) in addition to providing NEWl(i + 1). It suffices to show that q receives a total of ~ + 1 signatures on messages of the form J( ~ + ~, R) by time

t,+~ + 2d. correct

By assumption A6, there are at least ~ + 1 processors that are that is and joined in the interval [t, + ~,t,+~ + d). Let p be a processor

correct

and

joined

in

this

interval.

By

previous

arguments,

p

sends

out

a

the J(Z + ~, R) at some time u in the interval [t, + ~,t,+~ + d). Consider to exist sequence of processors p], ..., pk with p = pl and q = p~ guaranteed message

by A2, with

t = u. If p’s message

there is some p, that earlier ~ + 1 signatures. Thus, either 2d,

or q has already

~ + 1 signatures

received

by time

does not

diffuse

to q, this must

be because

sent out messages of this form with a total q receives p’s message by time u + 2d < t,+, messages

u + 2d. Since

of the there

required

are

f+

form

1 correct

with

a total

joined

of + of

proces-

by sors, q will receive messages of this form with a total of f + 1 signatures Hence, time t,+~ + 2d. When q gets these messages, it sets ETq appropriately.

+l,tl+l+ 2d], and at time tq,q sets C~ = ~+1, q joins at some time tq G [t, ETq = ~,, + PER, JOINERS = 0, and CL US TERSq = R. Part (e) of CS3(i + 1) follows from part (d) for processors that were already joined

t1+1

at +

e,

message

t,+ ~ + e, as in Lemma NEWO(i + 1) shows that

4.2.10. For a processor p that joins after p must have joined as a result of receiving a

of the form J(V, R), with j > i + 1.Thus, at this point p sets ETP to + PER. The result now follows from P1. + 1), observe that for any correct processor p, LASWP is

~+PER>~+l For NEW5(i initially

undefined;

by inspection

of the tasks of %, h is clear

reset only at a critical time for p. Moreover, LASWP(U+) = ETP(u), ETP(u+) = ETP(u)

that

US~P

is

if LASWP is reset at time u, then + PER, and ETP(u) = ~ for some

j. If LASWP(t) is defined, then LASWP(t) < ETP(t) – PER. If t s t,+,,then ETP(t) s ~+, by parts (a) and (b) of CS3(i + 1). Thus, LASWP(t) < ~+ ~, and LASWP(t) = ~ for some ~ s i. This proves NEW5(i + 1). It remains to prove NEW4(i + 1). Suppose that q is a joined correct ~ + e) = 0 at time t,+ ~ + e. We want to show that JOINERS~(t,+ and CLUSTER~(t, h ~ + e) = R,+ ~. If q is already joined at time t,+~,then our + ~,t,+ ~ + d), q sets CL USprevious arguments show that at some time tq G [t,

processor

= 0. JOINERS~ can become nonempty after tq TERq = R,, ~ and JOINERS~ only if there is an update to synchronous memory. By Theorem 6.2, there can be such an update only at time t such that C~(t) = k . PER – ADJ for some k. By part (c) of CS3(i + 1), there cannot be such a time in the interval an inspection of the , + e) = 0. Similarly, [t,+ ,, t,+, + e]. Thus, JOINERS,(tL+ tasks of ~ shows that CL USTERq can change values only at a critical time for

Qynamic

Fault-Tolerant

Clock

Synchronization

181

q. Since ET~(t~) = F(+ ~ + PER, the next critical time for q after tq must come 8.6, t,+ ~ > t,+ ~ + e. Thus, we can conclude that at or after t,+~. By Lemma CLUSTER~(tl+,

+ e) = R,+l.

Now, suppose

that

q joins

this must be as a result

at some time

of receiving

t~ = [t,+ ~, t,, ~ + e]. By NEWO(i

a message

of the form

.J(~,

+ 1),

R) with

q = R

and j > i + 1. By the same arguments as used in the proof of NEWO(i + 1), q must receive such a message signed by a correct processor, say q’. By inspection of the tasks of ~, it is immediate only if T = ETg (t ) or T = LASW~,(t LASW~,(t)

is defined,

then

~.+l+PERift ET~(t)

– PER

q receives

SYNC(ET (t) – 3 “ADJ, M) reads ET~~t) – 3. ADJ. (The also recewed

q receives

u, and it is connected

at least

(1 + p) . PER

is guaranteed

(1 + p)(PER

p’s

message

+

q remains

algorithm, after which time p is to guarantee that p joins the

p’s

request-to-join

message

and

(2)

C~(t)

by P3 and message,

> ET~(t)

q remains

it follows

that

and started

an update.)

M

at

are two cases – 3 “ ADJ.

correct the

is signed and sent by q at or before message may be sent earlier if another

p‘s request-to-join

after

to exist

+ 30 ADJ

is straightforward:

By A2, we have t – u < d. There

< ~+l.

C~(t)

at time

for

sufficiently long to invoke the update to JOINERS. Then NEW2 is invoked

t,and ~ < ET~(t)

For

remains message.

cluster soon thereafter. In more detail, suppose time

p requests

q that

for

at

message its clock processor

By Theorem

at 6.1, all processors still correct at time t,+~ will have added q to JOINERS time ET~(t ) – ADJ on their local clocks. It follows that JOINERS – CLUSTER will

be nonempty

at local

clock

time

ET~(t ) – ADJ,

synchronization attempt will take place with value and t,+~ is the first time a correct processor sends tion value ET~(t). Since we have assumed that q time (1 + p)PER, it is easy to show that q is still time is no more than (1 + p) PER after q receives

from

which

we get that

a

ETq(t ). Thus, ~., == ETq(t ), a message with synchronizaremains correct for at least correct at time ti+~,and this p’s message. From NEWl(i

182

D.

+ 1), it follows then.

Thus,

that

if p is still

p joins

join message< For case (2),

within

since

by

(1 + p) “ PER

after

ET~(t)

– 3 “ADJ,

+ PER

3 “ADJ, joined

M)

unless

processor

ET~(t) + PER then t, occurs

p’s

q

time

attempt

remains

no later

take

correct

ET

C~(t’)

with

value

by

Thus, from and

the

assumption). that

+ ADJ.

By

C~(t’) By

time reads

SYNC(ET~(t)



if some

on its clock. that

ETq( t) + PER,

a

so that

at time t,,suppose q’ is a correct joined Cq(t~) s ETq(t) + PER by CS3(j)(b) and

that

Cq(t’)

the

Strong

> ET~(t)

P2

part (a) of CS3(j fact

and

P3

it follows

that

> ET~(t)

– 3. ADJ

+ DMXX),

Since by NEWl(j)

so

it

that

– DkMX.

(PER

> ~.

ET~(t’)

follows

and joined

= ETq(t) + PER C~(t) > ET~(t )

Inequality ETq(t)

– 1), we have t’> t,_~.From

q’ must be correct

C~ (t’)

Since

follows

> ~_ ~ + ADJ,

we get that

t,s t’+ (1 + p)(3 “ADJ

C (t’)

t &ince

Separation

+ ADJ. it

NEW2(j)

+ DM4X).

by

is ~, with j = i + 1 or j = i + 2. If q is stall correct at time t,, at most (1 + p)(PER + 3. ADJ) after q receives p’s request-

we know

> ~_*

least

may happen

– 3. ADJ

P3. We know that q is correct at the time t’such that — 3 “ADJ, and this time is at most (1 + p) PER after – 3 “ADJ

at

its clock

As in case (1), it now follows

place

to-join message. If q is not correct processor at time t,.We must have

4 sADJ),

for when

(This

than

have joined

p sends its request-to-

q is correct

the cluster.

in case (l).)

will

+ 2d, p will

q sends the message

joined

p’s message

t,+*

+ 3d of when

message,

by which

p has already

we are back

synchronization

at time

assumption

q receives

received

If this happens,

correct

(1 + p) . PER

DOLEV ET AL.

> ~.

part (c)of that

From

~ + PER.

CS3(j

– 1)

t’ > t,_ ~ + e.

and from at t’, this

>

~, we have

CSl(j

By

– 1),

it follows

that

that

t,s t + (1 + p)(PER + 3 “ADJ we have that p joins by time t]+ 2 d (if it is still

correct then), and p sends its request-to-join message at most d before t, we get the desired bounds. At most nz messages are sent if no processor requests to join, just as in the case of Algorithm causes one update addition,

if there

~ + 1 messages

are joining (using

or (k + f + l)nz In

general,

algorithm,

J% If k processors request to join, each request-to-join to replicated memory, resulting in k . nz messages. In

in all.

the

processors,

task FORWARD),

each joined giving

processor

a further

may send up to

(~+

l)nz

in

DMAX.

messages,



(1 + p)e

e = 2 d, whereas

term

is the

in the basic

dominant

term

resynchronization

algorithm,

In

this

we have

e = d. This factor of 2 is introduced by the late signature gathering process. It can be eliminated by having yet another synchronization after all the processors have joined. This is essentially the technique used in an earlier version of this paper [Halpern We now discuss

et al. 1984]. how to relax

assumption

A8,

which

states

that

rejoining

processors must use new signatures. If the JOIN task is modified so that a processor will continue to advance its clock according to JOIN (i.e., continue to execute JOIN) until an interval of length (1 + p)(PER + 3. ADJ + DMAX) + 3d has elapsed from the time it requested to join, then we no longer need the assumption that a rejoining processor must use a new signature. A processor may be convinced to set its clock using messages left over from a previous attempt to join; but provided our other assumptions hold, it will have advanced to the correct time within the prescribed time bound. Of course, it may not actually send any synchronization messages or be considered to have a defined

Dynamic

Fault-Tolerant

ET

the time

until

Clock

bound

Synchronization

has elapsed

183

on its duration

timer.

The details

are left

no name

is ever

to the reader. We

have

removed names

assumed

from

of processors

processors paper,

One

are left

a cluster From

that

or deciding

forever

to time

using

The

that

method

be removed

accomplishing

memory

and

it may be convenient

participate.

they should

for

replicated

grows

time

no longer

that

mechanism

synchronous

the

of detecting

is outside

removal

a task analogous

to remove such

the scope of this

is by

an update

to ADD.

Again,

of

details

to the reader.

9. A Continuous The

that

CLUSTER.

logical

algorithm

Clock

clock

Solution

defined

by

is not continuous,

processor

p’s

current

clock

since it may be set forward

in

the

previous

by any amount

smaller

than Alll. It is clearly piecewise continuous. There are some applications for which it may be advantageous to have a continuous clock. As already noted by Lamport

and

amortizing presented tion

Melliar-Smith

clock

in Section

also works

The

clock

introduce SAVE Task

two

= DT MSG,

.

SAVE

G DT;



OLDA

e A;

Let

INT

algorithm

required

We briefly in order

keeping

variables,

sketch

construc-

7.

be a constant

A +- ET

chosen

Strong Separation Inequality, continuous approximation

matters,

we first

continuous

SAKE.

We

and add the following

the line

how

by

to do this. A similar

piecewise

and

discontinuities algorithm

To simplify

the

OLDA

these

the

of Section

are minimal.

while

at initialization, before

time.

3 can be modified

C’,

new

we can eliminate

over

for the join

modifications

continuous

[1985],

adjustments

lines

set

clock OLDA

add a C. We

= A

and

to the pseudocode

of

– DT:

such

that

O < INT

s PER

– ADJ.

(By

the

such a choice is possible.) We introduce A’, a to A. Suppose that A(t) # A(t+). We set

OLDA(t+) +- A(t), thereby saving the old value of A before updating it. Then, instead of increasing the value of A’ immediately to A(t + ), we amortize this increase

over an interval

of length

INT.

Thus,

we have the following

definition

of A: if DT

s SAVE

+ INT

then

A

~

OLDA

+ (A

– OLDA)(DT

– SAVE)/INT

else A’ +A. Define function A’(t)

C’(t) of

s A(t),

= DT(t)

time,

and

+ A’(t). hence

and if either

It

is easy to check

so is

OLDA(t)

A(t) = xl’(t). Our revised algorithm then C(t+) = ET – PER, since SAVE is adjusted. PER – AD],

C’. = A(t)

Moreover,

that at

or DT(t)

A’ any

is a continuous time

> SAVE(t)

we

have

+ INT,

t

then

guarantees that if DT(t) = SAVE(t+), is set to DT at exactly the time t that A

It follows that if C(t) > ET and hence that C(t) = C’(t).

– ADJ, then DT(t) > SAVE(t) + With this observation, it is easY to

check that we could have replaced C by C’ in algorithm same result for every test where C was used. We leave

& and obtained the details to the reader.

10. Conclusion We have algorithm

described an algorithm that periodically can tolerate arbitrary link and processor

desynchronizes failures as long

clocks. The as messages

D. DOLEV ET AL.

184 can diffuse have

through

provided

the network

a technique

within

for

some preassigned

initializing

clocks,

time

and

bound.

have

We also

shown

how

our

algorithm could be extended to allow new processors to join the network. The constants in our algorithm are reasonable for many practical applications.

We

have

performance

suggested

of the

improvements

a number

algorithm

are possible.

could

A variant

of ways

throughout

be improved.

We

of this algorithm,

the

suspect

for which

paper that

that

further

the join

is not

so fault-tolerant, has been implemented for a prototype highly available system at the IBM Almaden Research Center [Griefer and Strong 1988]. The join algorithm provided in this paper represents a compromise between the simplicity the

of allowing

complexity

logically

of

synchronous

cated memory.

scheduled

very small. much

Then,

larger

updates

replicated times.

to simplify

memory [1978a,

The

structure

these

fast response overhead

is scheduled

algorithm

we call

the process

by allowing

minimal there

resynchronization

demand,

to a data

We provide

unless

synchronous replicated pioneered by Lamport

on

memory

we provide

period,

only when join

We have chosen

ing synchronous periodic

joining

providing

synchronous

on repli-

of joining

and maintain-

processes

to run

time

by making

by desynchronizing

is a processor

and

depends

waiting

to join.

only

at

this period only with Our

is in the spirit of the state-machine 1978b, 1984]. Moreover, our basic

a

use of

approach, resynchro-

nization algorithm without its timeliness tests is a minor variant of a scheme proposed by Lamport [1978a]. The advantage and main contribution of our approach fault-tolerance

lies in the simplicity properties (not shared

ACKNOWLEDGMENTS. taking

to read

The

the entire

authors paper

of our algorithms together with by the original Lamport scheme).

would

carefully

like

to thank

and for many

the referees helpful

their

for under-

suggestions.

REFERENCES CRISTIAN,F. 1989. Probabilistic clock synchronization. Drst. Cm-nput. 3, 3 (July), 146-158. CRMTIAN,F., AGHIL1,H., STRONG,H. R., AND DOLEV, D. 1986. Atomic broadcast: from simple message diffusion to Byzantine agreement. IBM Tech. Rep. RJ 5244. IBM. San Jose, Calif. DOLEV, D., H.ALPERN,J. Y., SIMONS,B. B., AND STRONG, H. R. 1987. A new look at fault tolerant network routing. Zrzf. Comput. 72, 180-196. DOLEV, D., HALPERN, J. Y., AND STRONG, H. R. 1986. On the possibility and impossibility of Syst. ,SCL.32, 2 (Apr.), 230–250. achieving clock synchronization. J. [email protected] algorithms for Byzantine agreement. DOLEV, D. AND STRONG, H. R. 1983. Authenticated SLAM J. Cot?zput, 12>4 (Nov.), 656-666. GRIEFER, A. D., AND STRONG, H. R. 1988. DCF: Distributed communication with fault tolerance. In Proceedings of the 7th A nrzzlal A CM Symposizwn on Principles of Dist?ibufed Computing

(Toronto, Ont., Canada, Aug. 15-17). ACM, New York, pp. 18-27. HALPERN.

J. Y..

MEGIDDO,

N., AND MUNSHI,

A.

1985.

Optimal

preckion

in the

presence

of

uncertainty. J. Complexity 1, 2 (June), 170–196. HALPERN, J. Y., SIMONS, B. B., STRONG, H. R., AND DOLEV, D. 1984. Fault-tolerant clock synchronization. In Proceedings of the 3rd Annual A CM Symposium on Pri>zciples of Distributed Corrzputiizg (Vancouver, B. C., Canada, Aug. 27–29). ACM, New York, pp. 89-10?. KRISHNA, C. M., SHIN, K. G., AND BUTLER, R. W. 1985. Ensuring fault tolerance of phase-locked clocks. lEEE Tirozs. Comput. C-34. 8, 752–756. LAMPORT, L. 1978a. Time, clocks and the ordering of events in a distributed system. Commun. ACM 21, 7, (July), 558-565. LAMPORT, L. 1978b. The implementation of reliable distributed multiprocess systems. Corrzput. Netw. 2, 2 (May), 95-114. L.WPORT, L. 1984, Using time instead of timeout for fault-tolerant distributed systems. ACM Trans. Prog. Lang. Syst. 6, 2 (Apr.) 254-280.

Dynamic

Fault-Tolerant

Clock

LAMPORT. L., AND MELLIAR-SMITH, J. ACM 32, 1 (Jan.), 52-78.

LUNDELIUS,J., ANDLYNCH,N. Control

62, 21 (Aug. /Sept.),

Synchronization P. M.

1985.

Anupper

1984.

185

Synchronizing andlower

clocks in the presence of fwlts.

bound forelock

synchronization.

Znf.

190-204.

MARZULLO, K. 1983. dissertation. Stanford PEASE, M., SHOSTAK, R., (Apr.), J. ACM27,2,

Loosely-coupled distributed services: A distributed time system. Ph.D. Univ., Stanford, Calif. AND LAMPORT, L. 1980. Reaching agreement inthepresence of faults.

RWEST,R. L.,

A., AND ADELMAN,

228-234.

Amethod for obtaining digital signatures 2 (Feb.), 120–126. so ftheACM21, clock synchronization RAMANATHAN,P., SHIN, K. G., AND BUTLER,R. W. 1990. Fault-tolerant in distributed systems. LZEEComput.(Ott.),33-42. SCHNEIDER, F. B. 1987. Understanding protocols for byzantine clock synchronization. Tech. Rep. Dept. Computer Science, Cornell University, Ithaca, N.Y. SRIKANTH, T. K., AND TOUEG, S. 1987. Optimal clock synchronization. J. ACM34, 3 (July), 626-645. WELCH, J. LUNDELIUS, AND LYNCH, N. 1988. Anew fault-tolerant algorithm forclocksynchronization. Inf. Comput. 77, 1,1–36. andpublic-key

RECEIVED

SHAMIR,

cryptosystems.

MARCH 1989;

L.

1978.

Communication

REVISED JULY

1989; ACCEPTED

Journdof

FEBRUARY

1994.

the AssocMlonforCumputNgMach uuzry.Vol

4?, N0 l. Jdnu.%~1995