Dynamic Fault-Tolerant Clock Synchronization

DANNY DOLEV, JOSEPH Y. HALPERN, BARBARA SIMONS, AND RAY STRONG

IBM Almaden Research Center, San Jose, California
Abstract. This paper gives two simple efficient distributed algorithms: one for keeping clocks in a network synchronized and one for allowing new processors to join the network with their clocks synchronized. Assuming a fault-tolerant authentication protocol, the algorithms tolerate both link and processor failures of any type. The algorithm for maintaining synchronization works for arbitrary networks (rather than just completely connected networks) and tolerates any number of processor or communication link faults as long as the correct processors remain connected by fault-free paths. It thus represents an improvement over other clock synchronization algorithms such as those of Lamport and Melliar-Smith [1985] and Welch and Lynch [1988], although, unlike them, it does require an authentication protocol to handle Byzantine faults. Our algorithm for allowing new processors to join requires that more than half the processors be correct, a requirement that is provably necessary.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems—distributed applications; distributed databases; network operating systems; C.4 [Performance of Systems]: reliability, availability, and serviceability; D.4.1 [Operating Systems]: Process Management—synchronization; D.4.5 [Operating Systems]: Reliability—fault-tolerance

General Terms: Algorithms, Performance, Reliability, Theory

Additional Key Words and Phrases: Byzantine failures, clock synchronization, fault-tolerance, time-of-day clock
1. Introduction

In a distributed system, it is often necessary for processors to perform certain actions at roughly the same time. In such a system, each processor possesses its own independent physical clock or duration timer, which is usually assumed to have a bounded rate of drift from real time. However, over time,
This is a revised version of a paper entitled Fault-Tolerant Clock Synchronization, which appeared in Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing. ACM, New York, 1984, pp. 89-102.

D. Dolev is also affiliated with Hebrew University.

Authors' present addresses: D. Dolev, Department of Computer Science, Hebrew University, Jerusalem 91904, Israel, [email protected]; J. Halpern, IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, [email protected]; B. Simons, IBM Application Development Technology Institute, Santa Teresa Laboratories, 555 Bailey Ave., San Jose, CA 95141, [email protected]; R. Strong, IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, [email protected].

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1995 ACM 0004-5411/95/0000-0143 $03.50

Journal of the Association for Computing Machinery, Vol. 42, No. 1, January 1995, pp. 143-185
144 D. DOLEV ET AL.
these duration timers tend to drift apart. Thus, the clocks must be resynchronized periodically. More precisely, we assume that each processor has an adjustment register. Its logical clock time is the sum of the reading of its duration timer (over which it has no control) and its adjustment register. It is these logical clock times that are to be kept close together, even in the presence of processor and link failures. Let the logical clock time of processor i at real time t be represented by C_i(t). We require that there be some constant DMAX (for maximum deviation) such that |C_i(t) - C_j(t)| < DMAX. As is mentioned in Dolev et al. [1986], there are trivial algorithms for keeping logical clocks close together. For example, the logical clock time can always be a constant, say 0. Of course, this is not terribly useful in practice. A useful clock synchronization algorithm must also guarantee that logical clocks stay within some linear envelope of the duration timers (i.e., the time on the logical clock must be bounded above and below by a linear function of the time of the duration timer), so that logical clock time is indeed a reasonable approximation of real time. An algorithm that keeps the logical clocks of correct processors close together and within a linear envelope of the duration timers is said to maintain linear envelope synchronization. A number of recent papers have presented clock synchronization algorithms that maintain linear envelope synchronization in the presence of faults [Krishna et al. 1985; Lamport and Melliar-Smith 1985; Marzullo 1983; Srikanth and Toueg 1987; Lundelius et al. 1988].

The algorithms of Lamport and Melliar-Smith [1985], Marzullo
[1983], and Welch and Lynch [1988] are all based on an averaging process that involves reading the clocks of all the other processors. Because of the averaging, these algorithms require that there be more nonfaulty than faulty processors. Two of the algorithms presented in Lamport and Melliar-Smith [1985], the algorithm of Welch and Lynch [1988], and the algorithm of Krishna et al. [1985] require 3f + 1 processors to handle f faults; a third algorithm of Lamport and Melliar-Smith [1985], which assumes the existence of an authentication protocol, requires 2f + 1 processors. The algorithm of Srikanth and Toueg [1987] requires 3f + 1 processors to handle f faults without an authentication protocol, and 2f + 1 processors with an authentication protocol, but it maintains synchronization within an optimal linear envelope in a precise sense explained later. The algorithms of Marzullo [1983], for which no worst-case analysis is provided, deal with ranges of times rather than a single logical clock
time and therefore are not directly comparable. The algorithm of Krishna et al. [1985], called phase-locking, is very close in spirit to the algorithm presented here, in that both algorithms have processors sending out synchronization messages at predetermined times. However, the algorithm of Krishna et al. [1985] requires that the number of faulty clocks be less than one third of the number of participants, and also requires certain assumptions about the nature of the communication medium. For the most recent work on phase-locking and comparison studies for hardware versus software implementations of clock synchronization algorithms, see Ramanathan et al. [1990].

In this paper, a synchronization algorithm is presented that does not require any minimum number of processors to handle f processor faults, so long as the subnetwork containing the nonfaulty processors remains connected. (Notice that this does not contradict the lower bound of Dolev et al. [1986], which says that only n/3 faults can be tolerated, since we are assuming an authentication protocol here.) The crucial point is that since we do not use averaging, it is not necessary that the majority of processors be correct. Moreover, our algorithm requires the transmission of at most n^2 messages per synchronization (where n is the total number of processors in the system). The algorithms of Srikanth and Toueg [1987] and Welch and Lynch [1988], and one of the algorithms of Lamport and Melliar-Smith [1985] also require only n^2 messages; the other
two algorithms of Lamport and Melliar-Smith [1985] might need as many as n^(f+1) messages to tolerate f faults. A final advantage of our algorithm is that it can
deal with either processor or link faults in any network, provided the nonfaulty processors remain connected. The algorithms of Lamport and Melliar-Smith [1985] and Welch and Lynch [1988] deal only with completely connected networks.

The synchronization algorithm is based on the following simple observation. If there are no faulty processors, one processor can act as a synchronizer and broadcast a message with its current time once an hour (or day, or week, depending on the frequency of synchronization required). Each processor would then adjust its clock function accordingly, making minor allowances for the transmission time of the message. If there are faults, however, then there are obvious problems with this approach. A faulty synchronizer might broadcast different messages (i.e., different times) to different processors, or it might broadcast the same message but at different times, or it might "forget" to broadcast the message to some processors. Note that it is not necessary to assume "malevolence" on the part of the synchronizer for such behavior to occur. For example, a synchronizer might fail (halt) in the middle of broadcasting the message "The time is 9 A.M.", spontaneously recover 5 minutes later, and continue broadcasting the same message. Thus, some of the processors would receive the message "The time is 9 A.M."
at 9 A.M., while the remainder would receive it at 9:05. Nevertheless, the idea of using a synchronizer can be modified to obtain an efficient synchronization algorithm that guarantees synchronization even in the presence of faults. The role of the synchronizer is distributed: Every (correct) processor tries to act as a synchronizer, and at least one that is correct succeeds. To ensure that this happens at roughly the same time, all the correct processors agree on the expected time for the next synchronization.

In practice, such a periodic resynchronization protocol must be supplemented by a method for initializing the clocks of the original participants so that they are close together. It must also be possible for new processors to join the system (or for previously faulty processors to rejoin the system) with clocks synchronized to those of already existing processors. Initializing the clocks of the original processors turns out to be an easy task. Moreover, our synchronization algorithm can be extended to allow new processors to join the network. The join algorithm allows
joining processors to join a short time after they request to do so. Our join algorithm requires that fewer than half the processors be faulty during the join process. Again, we can tolerate any number of link failures provided that the nonfaulty processors remain connected. This requirement is provably necessary. The remainder of the paper is organized as follows: In the next section, the problem is formalized, a formal definition of linear envelope synchronization is
given, and the precise assumptions underlying the algorithm are described. These assumptions include the existence of a bounded rate of drift between the duration timers of correct processors, a known upper bound on the transmission time of messages between correct processors, and the ability to authenticate signatures. The resynchronization algorithm is described in Section 3 and analyzed in Section 4. The worst-case difference between logical clocks that is guaranteed by our algorithm is almost as small as possible, but a careful discussion of this property is beyond the scope of this paper (see Dolev et al. [1986], Halpern et al. [1985], Lundelius and Lynch [1984]). We discuss issues related to initialization and joining in Section 5. In Section 6, we present a
synchronous update service, which enables all correct processes to agree on which processes are currently joined; this service plays a key role in our join algorithm.
The join algorithm is presented in Section 7 and analyzed in Section 8. In Section 9, we show how to modify the algorithms presented in Sections 3 and 7 so that the logical clock is a continuous function of real time rather than a piecewise continuous function. We conclude with some discussion of our results in Section 10. We recommend that the casual reader skip Sections 4, 5, and 8. Section 2 contains assumptions and specifications that are important,
but not necessary for a basic understanding of the algorithm. The reader who is interested only in the algorithms might wish to read only Sections 3, 7, and 9.
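The clock model used throughout the paper (a read-only duration timer plus an adjustment register, with the logical clock defined as their sum) can be sketched in executable form. The class and method names below are our own illustration, not the paper's; the paper defines only DT, A, and C = DT + A.

```python
# Sketch of the clock model: a processor cannot change its duration timer DT,
# only its adjustment register A; the logical clock is C = DT + A.
# (Illustrative names; the paper defines only DT, A, and C.)

class LogicalClock:
    def __init__(self):
        self.adjustment = 0.0  # the adjustment register A

    def read(self, dt_now: float) -> float:
        """Logical clock time C = DT + A, given the current duration-timer reading."""
        return dt_now + self.adjustment

    def set_to(self, target: float, dt_now: float) -> None:
        """Set the logical clock to `target` by changing only the adjustment register."""
        self.adjustment = target - dt_now

clock = LogicalClock()
clock.set_to(0.0, dt_now=5.0)   # initialization: A = -DT, so C starts at 0
assert clock.read(5.0) == 0.0
assert clock.read(7.5) == 2.5   # C advances at the rate of DT
```

Note that setting the clock never touches the duration timer: only the adjustment register moves, which is exactly why the logical clock inherits the duration timer's drift rate between adjustments.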
2. Assumptions and Specifications

In this section, we discuss the five assumptions made by our algorithms, denoted A1-A5, and the specifications of the synchronization algorithm. We break the specifications into two parts: basic specifications P1-P4, which follow immediately from the structure of the algorithm, and properties CS1-CS4, which are deeper properties that require some effort to prove.

Let us first consider the basic assumptions of the model. We assume, in our model, the existence of an external source of "real time," not necessarily measurable by the processors. Just as Lamport and Melliar-Smith [1985], Srikanth and Toueg
[1987], and Welch and Lynch [1988], we distinguish between real time, as measured on this external clock, and duration time, the time measured on some processor’s duration timer DT. We also adopt the convention that variables and constants that range over real time are written in lowercase and variables and
constants that range over the processors' clock time are written in uppercase.

We define a correct duration timer to be one that drifts from real time by no more than a bounded amount. More formally,

A1:
Each correct duration timer DT is a monotone increasing function of real time, and there is a known constant ρ > 0 such that for all real times u, v with v > u:

    (1 + ρ)^(-1)(v - u) ≤ DT(v) - DT(u) ≤ (1 + ρ)(v - u).
For technical reasons, we take t_i to be the first real time at which some correct processor sends a synchronization message with the ith synchronization value V_i, for all i > 0. Our algorithms use the constants PER (for period), ADJ, E, and e, with PER > ADJ. (Compare the properties CS1 and CS2 below to properties S1 and S2 of Lamport and Melliar-Smith [1985].) As we mentioned above, the constant PER is an estimate on the time between successive synchronizations. ADJ is a bound on the maximum adjustment that a processor makes to its clock. The constant e defines the real-time interval within which all correct clocks are synchronized; in fact, the ith synchronization occurs during the interval
[t_i, t_i + e]. DMAX is the bound on how tightly processors synchronize. We do not assume that processors can actually compute DMAX, since it depends on tdel, which they may not know. (Again, they can use an upper bound on tdel in other computations.)
We do assume that processors have an upper bound E on DMAX, which they do know (using E for DMAX in their computations). All the conditions below hold for correct processors p and q and for times t at which the relevant values of ET and C are defined.

CS1(i): If V_i ≤ ET_p(t) ≤ V_{i+1} and V_i ≤ ET_q(t) ≤ V_{i+1}, then |C_p(t) - C_q(t)| < DMAX. That is, processors that have the same pair of ET values have clocks that are close together.

CS2(i): If processor p makes an adjustment at time t and V_i ≤ ET_p(t) ≤ V_{i+1}, then 0 ≤ C_p(t+) - C_p(t) < ADJ. That is, clocks are set forward at critical times by less than ADJ.

CS3(i): If p is correct at time t, then
(a) if t < t_i, then ET_p(t) ≤ V_i;
(b) if t = t_i, then ET_p(t) = V_i and C_p(t) > V_i - ADJ;
(c) if t is in [t_i, t_i + e], then ET_p(t) is either V_i or V_i + PER, and V_i - ADJ < C_p(t) < V_i + (1 + ρ)e;
(d) if t = t_i + e, then ET_p(t) = V_i + PER;
(e) if t > t_i + e, then ET_p(t) > V_i + PER.

CS4(i): If processor p is correct at time t, V_i < ET_p(t) ≤ V_{i+1}, and t > t_i + e, then ET_p is defined throughout the interval [t_i + e, t) and p has no critical times in that interval.
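Part (c) of CS3 already bounds how far apart two correct clocks can be during the resynchronization interval [t_i, t_i + e]: both lie in the open interval (V_i - ADJ, V_i + (1 + ρ)e), so they differ by less than ADJ + (1 + ρ)e, the second term of the deviation bound max{DMAX, ADJ + (1 + ρ)e} used later. A small numeric check, with illustrative parameter values of our own choosing:

```python
# During [t_i, t_i + e], CS3(c) puts every correct clock in
# (V_i - ADJ, V_i + (1 + rho) * e), so any two differ by < ADJ + (1 + rho) * e.
# Parameter values below are illustrative, not taken from the paper.

rho, e, ADJ, V = 1e-4, 0.01, 0.4, 3600.0
lo, hi = V - ADJ, V + (1 + rho) * e

# Any two clocks anywhere in the CS3(c) interval satisfy the bound:
for c_p in (lo + 1e-9, hi - 1e-9):
    for c_q in (lo + 1e-9, hi - 1e-9):
        assert abs(c_p - c_q) < ADJ + (1 + rho) * e
```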
We now show that conditions P1, P2, and CS1-CS3 are enough to guarantee LES.

THEOREM 2.1. If an algorithm M satisfies P1, P2, and CS1(i)-CS3(i) for all i > 0 in a run r, then M maintains LES with parameters α = 1, β = 0, γ = PER/(PER - ADJ), and Δ = max{DMAX, ADJ + (1 + ρ)e}.

PROOF. Assume that p and q are correct, and that ET_p and ET_q are both defined in an interval [u, v] in a run r of M. To prove part (1), observe that if |C_p(u) - C_q(u)| ≤ DMAX, then (1) of LES holds trivially. Suppose that |C_p(u) - C_q(u)| > DMAX. By CS1, this can happen only if there is some j such that ET_p(u) ≤ V_j < ET_q(u) or ET_q(u) ≤ V_j < ET_p(u). Assume, without loss of generality, that ET_p(u) ≤ V_j < ET_q(u). Since q is correct and ET_q(u) > V_j, by part (a) of CS3(j) we must have u ≥ t_j. Since p is correct and ET_p(u) ≤ V_j, by part (e) of CS3(j) we must have u ≤ t_j + e. Thus, t_j ≤ u ≤ t_j + e. By part (c) of CS3(j), it now follows that |C_p(u) - C_q(u)| < ADJ + (1 + ρ)e. Thus, in general, |C_p(u) - C_q(u)| < max{DMAX, ADJ + (1 + ρ)e}.

For part (2), observe that P1 and the definition of C immediately give us that DT_p(v) - DT_p(u) ≤ C_p(v) - C_p(u). We next show that if γ = PER/(PER - ADJ),
then C_p(v) - C_p(u) ≤ γ(DT_p(v) - DT_p(u)) + ADJ. If p makes no adjustments in [u, v], then C_p(v) - C_p(u) = DT_p(v) - DT_p(u), by the definition of C. Suppose p makes at least one adjustment in [u, v]. By P1, there is a first and a last adjustment in the interval. Let w be the time of the first adjustment and let z be the time of the last. By P1 and P2, since ET_p is always a multiple of PER, we have C_p(z+) - C_p(w+) = (k - 1)PER, where k is at least the number of adjustments made in the interval [u, v]. Moreover, by CS2, each adjustment changes the clock by at most ADJ. Therefore, C_p(z+) - C_p(w+) ≤ DT_p(z) - DT_p(w) + (k - 1)ADJ. Thus, DT_p(z) - DT_p(w) ≥ (k - 1)(PER - ADJ) and γ(DT_p(z) - DT_p(w)) ≥ (k - 1)PER. It follows that C_p(z+) - C_p(w+) ≤ γ(DT_p(z) - DT_p(w)). Now C_p(v) - C_p(z+) = DT_p(v) - DT_p(z) ≤ γ(DT_p(v) - DT_p(z)), because there are no adjustments in (z, v]. Since the only adjustment in [u, w] is at w, we also have, by CS2, that C_p(w+) - C_p(u) ≤ DT_p(w) - DT_p(u) + ADJ ≤ γ(DT_p(w) - DT_p(u)) + ADJ. Summing these inequalities, we get C_p(v) - C_p(u) ≤ γ(DT_p(v) - DT_p(u)) + ADJ. Thus, we conclude the second condition of LES, with α = 1, β = 0, γ = PER/(PER - ADJ), and δ = ADJ, as desired. □

3. The Basic Resynchronization Algorithm

The basic algorithm uses two parameters: PER and E. Roughly speaking, PER is the time between synchronizations (and thus corresponds to
the R of Lamport and Melliar-Smith [1985] and the P of Welch and Lynch [1988]), while E (for estimated maximum deviation) is an upper bound on the difference between correct clocks. In the next section, we discuss how these parameters should be chosen.

For processor p, let ET_p (the expected time of the next synchronization), A_p
(the adjustment register), and C_p (logical clock time) be local variables. DT_p is a continuously updated variable representing the duration timer (hardware clock) of processor p. When processor p starts running the algorithm, ET_p = PER
and A_p = -DT_p. Recall that C_p is defined as DT_p + A_p. Thus, initially, C_p is 0. (More precisely, if p is initialized at time u, then we take C_p(u) to be undefined and C_p(u+) = 0.) In this section, we assume that all processors in the network start running the algorithm during a real-time interval of length less than d. In Section 5, we show how to accomplish this synchronous start for the processors initially in the network.

We use the following abbreviations in the description of the two tasks that comprise the algorithm:

SIGN means "compute a signature and append it to the message."
SEND means "send out to all neighbors."

The algorithm consists of two tasks that run continuously on each correct processor. The first task, TM (for Time Monitor), deals with the case in which a processor's clock reads ET before that processor has received any authentic synchronization messages (as in assumption A3) from the other processors. If
C_p(t) = ET_p(t), then processor p signs and sends to all processors a message saying "The time is ET," and ET is incremented by PER.

Task TM:
    if C = ET then
    begin
        SIGN AND SEND "The time is ET";
        ET ← ET + PER;
    end

The second
task, MSG (for Message Manager), deals with the case in which a processor receives a message before its clock reads ET. Suppose processor p receives an authentic message with s distinct signatures saying "The time is T." If this message is timely, that is, if it comes at a time when T = ET and ET - s · E < C, then processor p updates both ET and A and signs and sends out the message. Otherwise, the message is ignored.
Task MSG:
    if {(an authentic message M with s distinct signatures saying "The time is T" is received) ∧ (T = ET) ∧ (ET - s · E < C)} then
    begin
        SIGN AND SEND "M";
        A ← ET - DT;
        ET ← ET + PER;
    end

This completes the description of the algorithm. Intuitively, the effect of these two tasks is to have correct processors running at the rate of the fastest "reasonable" processor, that is, one whose messages pass the timeliness tests. As an example of how the algorithm operates,
suppose PER = 1 hour, and the next synchronization is expected at 11:00 (i.e., ET = 11). If processor p has not received a timely message (one that passes the tests of MSG) by 11:00 on its clock, then it executes task TM. If processor p does receive a timely message saying "The time is 11:00" before 11:00, then it executes the body of task MSG. Once one of these tasks is executed, p updates its local variable ET to read 12:00. Note that this means that p will then ignore any further messages saying "The time is 11:00," since they will not pass the tests of Task MSG. In general, exactly one of the tasks TM and MSG will run to completion in a synchronization interval, and it will be run to completion only once. (In particular, many messages saying "The time is T" may be received by task MSG, but only one of them will be considered timely in each synchronization period.)

A message with s signatures saying "The time is T" might arrive as much as s · E "early" (before ET) and still be considered timely according to the test in MSG. Nonetheless, as we show in the next section, at the completion of a synchronization interval, the correct processors are synchronized to within (1 + ρ)d, which is less than E.
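The control structure of tasks TM and MSG can be rendered directly in executable form. The sketch below is our own simplification, not the paper's implementation: authentication is abstracted away, a message is modeled as a pair (T, s) with s signatures assumed valid, and the parameter values are illustrative.

```python
# Sketch of one processor running tasks TM and MSG.
# State: ET (expected time of the next synchronization), A (adjustment register);
# the logical clock is C = DT + A. Signature checking is abstracted away.

PER, E = 3600.0, 0.1   # illustrative parameter values

class Processor:
    def __init__(self, dt_now):
        self.ET = PER
        self.A = -dt_now           # so C starts at 0

    def C(self, dt_now):
        return dt_now + self.A

    def task_tm(self, dt_now):
        """Time Monitor: fire when the local clock reaches ET."""
        if self.C(dt_now) >= self.ET:
            msg = (self.ET, 1)     # SIGN AND SEND "The time is ET"
            self.ET += PER
            return msg
        return None

    def task_msg(self, dt_now, msg):
        """Message Manager: accept a timely message and resynchronize."""
        T, s = msg
        if T == self.ET and self.ET - s * E < self.C(dt_now):
            self.A = self.ET - dt_now       # A <- ET - DT, so C jumps to ET
            self.ET += PER
            return (T, s + 1)               # SIGN AND SEND the message onward
        return None                         # otherwise the message is ignored

p = Processor(dt_now=0.0)
# A 2-signature message claiming "The time is 3600" arrives 0.15 early on p's clock:
relayed = p.task_msg(dt_now=3600.0 - 0.15, msg=(3600.0, 2))
assert relayed == (3600.0, 3)              # timely: within the 2 * E = 0.2 window
assert p.C(3600.0 - 0.15) == 3600.0        # clock set forward to ET
assert p.ET == 2 * PER                     # next synchronization expected at 7200
```

As in the text, once either task fires, ET has advanced by PER, so any further message carrying the old synchronization value fails the T = ET test and is ignored.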
The following example illustrates why the test in Task MSG must allow the interval during which a message is considered acceptable to have size s · E. Suppose DMAX (the actual maximum deviation between correct clocks) is 0.1
second, and in the algorithm we take E = DMAX = 0.1. If processor i receives a message with three signatures saying "The time is 11:00," and the message arrives
0.29 seconds before processor i's clock reads 11:00, processor i will
think that the message is timely according to Task MSG. Suppose, however, that processor j is also correct and is running 0.099 seconds slower than processor i (which is possible since DMAX = 0.1). If processor j receives processor i's message almost instantaneously, then j will receive the message roughly 0.39 seconds before 11:00 on its clock. Since the message now has four signatures, processor j will also consider it timely. However, if the test in Task MSG did not allow the interval of "timeliness" to grow as a function of the number of signatures, then the message might not have been considered timely. Indeed, it is straightforward to convert this example to a scenario in which any bound on the size of the interval in which a message is considered timely that is independent of the number of signatures on the message results in an incorrect algorithm.
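The example can be replayed numerically. With E = DMAX = 0.1 as in the text, the timeliness window for a message with s signatures is s · E, and the relayed four-signature message is 0.29 + 0.099 = 0.389 early on processor j's clock:

```python
# Replaying the example: the timeliness test ET - s*E < C means a message
# with s signatures may arrive up to s*E early. E = DMAX = 0.1 as in the text.

E = 0.1

def timely(early_by, s):
    return early_by < s * E

# Processor i: 3 signatures, 0.29 early -> inside the 0.3 window.
assert timely(0.29, s=3)

# Processor j is 0.099 slower, so the relayed 4-signature message is
# 0.29 + 0.099 = 0.389 early on j's clock -> still inside the 0.4 window.
assert timely(0.29 + 0.099, s=4)

# A fixed window of 3*E, independent of the signature count, would reject it:
assert not timely(0.29 + 0.099, s=3)
```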
In the next section, we prove that, if assumptions A1-A5 are satisfied, then every run of the algorithm given above satisfies P1-P4 and CS1(i)-CS4(i) for all i > 0. As a consequence, our algorithm maintains LES.

4. Analysis of the Algorithm

4.1. INITIALIZATION ASSUMPTIONS AND PARAMETER DEFINITIONS. Let M be
the algorithm described in Section 3, with parameters E and PER chosen to satisfy the conditions presented below. Assume that there are n processors and that they are all initialized with C = 0 and ET = PER during a real-time interval of duration less than d. Since we take t_0 to be the first time some correct processor's clock reads V_0 = 0, it follows that all correct processors are initialized in the interval [t_0, t_0 + d). If a processor p is initialized at time u, we take ET_p(u) and A_p(u) to be undefined, while ET_p(u+) = PER and A_p(u+) = -DT_p(u), so that C_p(u+) = 0.
We choose the parameters so that they satisfy the following conditions:

● e ≥ d;
● DMAX = (1 + ρ)e + 2ρPER;
● ADJ = (f + 1)E;
● E > DMAX (Drift Inequality);
● PER > ADJ (Separation Inequality).

It is easy to see that this can be done: First, fix e ≥ d. Then choose E such that E > (1 + ρ)e + 2ρ(f + 1)E, which is possible by A5. Then, set ADJ to (f + 1)E. Next, choose PER > ADJ so that E > (1 + ρ)e + 2ρPER. Finally, set DMAX = (1 + ρ)e + 2ρPER.
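The parameter recipe can be followed mechanically. In the sketch below, the inputs d, ρ, and f are illustrative values of our own choosing; the parameters are then derived in the order given above and the two inequalities are checked:

```python
# Choosing the parameters in the order given in Section 4.1.
# d, rho, f below are illustrative inputs, not values from the paper.

d, rho, f = 0.005, 1e-5, 3

e = d                                   # first, fix e >= d
# choose E with E > (1 + rho)*e + 2*rho*(f + 1)*E (possible when rho is small);
# taking roughly twice the minimum keeps the inequality strict:
E = 2 * (1 + rho) * e / (1 - 2 * rho * (f + 1))
ADJ = (f + 1) * E                       # then set ADJ = (f + 1) * E
# choose PER > ADJ, still keeping E > (1 + rho)*e + 2*rho*PER:
PER = 1.5 * ADJ
DMAX = (1 + rho) * e + 2 * rho * PER    # finally, set DMAX

assert e >= d
assert E > DMAX        # Drift Inequality
assert PER > ADJ       # Separation Inequality
```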
4.2. CORRECTNESS PROOF

THEOREM 4.2.1. Under assumptions A1-A5, every run of algorithm M satisfies P1-P4 and CS1(i)-CS4(i) for all i > 0.

From Theorem 4.2.1 and Theorem 2.1 we get the following corollary:

COROLLARY 4.2.2. Under assumptions A1-A5, every run of algorithm M maintains LES. Moreover, the correct processors send fewer than n^2 synchronization messages for each synchronization interval.

The intuition behind the correctness of algorithm M is quite straightforward. The algorithm guarantees that
at time t_i + e, all logical clocks of correct processors are within (1 + ρ)e of each other and all correct processors have the same value of ET, namely V_i + PER (which in this algorithm is V_{i+1}). The next synchronization occurs in the interval [t_{i+1}, t_{i+1} + e]. We show that t_{i+1} - t_i is roughly PER. We also show that during the interval between resynchronizations, clocks drift apart by at most an extra 2ρ · PER. This gives us the expression for DMAX, which is the right-hand side of the Drift Inequality. In practice, the interval during which clocks are desynchronized, which has duration at most e, is quite short, while the interval between resynchronizations, which has duration roughly PER, is quite long. After the proof of the correctness of the algorithm, we consider some typical values for the parameters.

Although the intuition behind the correctness proof is quite straightforward, a formal proof requires some care. We prove the result by induction, which is why the conditions CS1-CS4 are parameterized by i. The proof of
Theorem 4.2.1 proceeds through a sequence of lemmas, where we prove the relevant properties one by one (and some added necessary properties). In the proof of these lemmas, we assume that properties A1-A4 hold.

LEMMA 4.2.3. Every run of M satisfies P1, P2, P3, and P4.

PROOF. We first prove most of P1. It is easy to see by inspection of tasks TM and MSG that A_p and ET_p are both defined for the same values of t if p is a correct processor and that A_p changes value only when ET_p changes value. ET_p is first defined as PER and when it is changed, it increases by PER, so that it is a monotone nondecreasing step function. A_p is also a step function, since it changes only when ET_p changes. We prove at the end of the lemma that A_p is nondecreasing. Suppose ET_p(t) is defined. Since it must be a multiple of PER, suppose ET_p = k · PER. Since ET_p increases by PER each
time it is changed and starts out at PER, it follows that ET_p can have been adjusted no more than k - 1 times in any interval ending with t. Moreover, since ET_p is a step function which assumes only finitely many values in any interval ending with t, there must be an interval of the form (v, t] such that ET_p is constant in this interval, and either ET_p(v) is undefined or ET_p(v+) ≠
ET_p(v). We clearly must have v < t. If ET_p is first defined at v, our initialization assumption guarantees that C_p(v+) = ET_p(v+) - PER. Otherwise, this fact is guaranteed by the code of tasks MSG and TM. A similar argument works in the case of A_p.
For P2, observe that processors are initialized with A = -DT and ET = PER. Since ET is changed only by adding PER, ET can take on only values that are positive integer multiples of PER. An inspection of tasks TM and MSG also shows that the synchronization values sent are always equal to the
current value of ET, and after an adjustment, a logical clock is set to the current value of ET. P4 follows from inspection of tasks TM and MSG and our assumption that if p is initialized at time u, then ET_p(u+) = PER and C_p(u+) = 0. For P3, suppose that a processor p is correct and C_p is defined at some time
t, which is not a critical time. We first prove that C_p(t) ≤ ET_p(t). As has been shown, there is a first time v < t such that ET_p is constant throughout (v, t] and A_p(v+) = A_p(t). Since v must be a critical time for p, we have by P4 that C_p(v+) = ET_p(v+) - PER. Since A_p is constant in the interval (v, t], C_p is a continuous and strictly increasing function in this interval. Suppose that C_p(u) > ET_p(u) for some u ∈ (v, t]. Let w = inf{u ∈ (v, t] : C_p(u) > ET_p(u)}. By continuity, w > v and C_p(w) = ET_p(w). Thus, by inspection of tasks TM and MSG, we have ET_p(w+) = ET_p(w) + PER. The definition of w guarantees that each neighborhood of w must contain some u with C_p(u) > ET_p(u). But because C_p is continuous while ET_p jumps by PER at w, there is some x > w such that C_p is less than ET_p throughout (w, x), which is a contradiction. Thus, C_p ≤ ET_p throughout the interval (v, t] and, in particular, C_p(t) ≤ ET_p(t).

We now prove that
C_p(t) > ET_p(t) - PER. Let v and t be defined as in the previous paragraph. Since ET_p is a step function, C_p(v+) = ET_p(v+) - PER, and C_p is increasing throughout (v, t], there must be some v' > v such that C_p > ET_p - PER throughout (v, v']. Let w = sup{u ∈ (v, t] : C_p > ET_p - PER throughout (v, u]}. We claim that C_p(w) > ET_p(w) - PER. To see this, observe that since only finitely many changes to ET take place in (v, t], there must be an x_1 ∈ (v, w) such that ET_p is constant throughout (x_1, w). In addition, if w < t, there must be an x_2 ∈ (w, t) such that ET_p is constant throughout (w, x_2). By construction, C_p(x_1) > ET_p(x_1) - PER. Since C_p is increasing and continuous from the left at w, while ET_p is constant in (x_1, w), we have C_p(w) > ET_p(w) - PER, as desired. If w = t, we are now done. If w < t, then we clearly must have ET_p(w+) > ET_p(w). By inspection of tasks TM and MSG, we have C_p(w+) = ET_p(w+) - PER. Since ET_p is constant in (w, x_2), it follows that C_p > ET_p - PER throughout [w, x_2). But this contradicts the definition of w. Thus C_p(t) > ET_p(t) - PER, and this completes the proof of P3.

All that remains is to complete the proof of P1 by showing that A_p is nondecreasing. Observe that the only task that changes A_p is task MSG. If task MSG changes A_p at time t, then it follows from P3 that A_p(t+) = ET_p(t) - DT_p(t) > C_p(t) - DT_p(t) = A_p(t). Thus, A_p is also a monotone nondecreasing step function of real time. This completes the proof of P1. □
LEMMA 4.2.4. Let t be a critical time for p. Then either (a) C_p(t) is undefined and C_p(t+) = 0, (b) C_p(t) is defined and C_p(t) > C_p(t+) - f · E, or (c) p receives a synchronization message at time t with synchronization value C_p(t+)
signed by some other correct processor.

PROOF. The only way t can be a critical time for p is if (1) p is initialized at t, (2) C_p(t) = ET_p(t) (according to task TM), or (3) C_p(t) is
defined and p receives a timely message in task MSG. If (1) holds, then C_p(t) is undefined and C_p(t+) = 0, while if (2) holds, then C_p(t) = C_p(t+). Thus, suppose (3) holds, and p receives a timely message with synchronization value T and s signatures. The timeliness test guarantees that C_p(t) > C_p(t+) - s · E. If s ≤ f, then we are done. Otherwise, by A4, one of the signatures on the message must be that of a correct processor, so again we are done. □

LEMMA 4.2.5. If i > 0 and t_i is finite, then (1) t_i < t_{i+1} and (2) there is a processor p that is correct at t_i such that C_p(t_i) > V_i - f · E and ET_p(t_i) = V_i.
= ~.
PROOF. Let Zj = min{u : exists a processor p that is correct at time u and CP(U + ) = j “ PER}. If for no time u is it the case that there is a processor correct at u with CP(14+ ) = j “ PER, then we take z] = ~. By P2 and the fact that VI > 0, {t, : i >0, t, finite} is a subset of {z, : j >0, z, finite}. Let p be a processor
correct
at time
CP(ZI ) = j . PER. critical P,
then
integer
time
for p, then
Cp(Zj)
=
multiples
z] such that
CP(ZJ) and ETP(z,) Cp(Z~). of PER
CP(Z,+) = j “ PER. are defined
We want
because
CP(z~)
to show that >0.
ETP(z, ) = CP( ZJ+) by P4. If z] is not a critical
If z, is a time
for
In this case, CP(ZJ) = ETP(z1) since they are both and ETP(z, ) – PER < CP(ZJ ) s ETP(z, ) by P3. Thus,
in either case, ETP(z, ) = j “ PER as desired. Since ETP( z] ) = j . PER, it follows from P1 that for j >1 there exists a y < z, such that ETP( y+) = j “ PER and CP( y+) = ETP(y+) – PER = (j – 1) “ PER. Therefore, z,_ ~ < z,. Since the t,’s are a subset of the ZJ’S, and since the finite ZJ’s are totally ordered, it follows that the finite t,’s are also totally ordered. This proves part (1) of the lemma. For
part
(2), by definition
of
t, there
is a processor
p
correct
at t, with
Cp(t~) = ~. We have ~ = j “ PER for some j with t, = z]. We argued above that in this case, both Cp(tl) and ETP(tl) were defined and ETP(t,) = j . PER = ~. If t, is not a critical time for p, then Cp(t,) = Cp(t~ ) = ~, so (2) holds. If t, is a critical time for p, then by Lemma 4.2.4, one of the following three cases holds:
(a) Cp(t~) (b) (c)
= O,
Cp(t,) is defined, and Cp(tz) > Cp(t~) -f” E, a synchronization message with synchronization from another correct processor.
value
Cp(t~ ) is received
Since i >0 by assumption, case (a) does not hold. Case (c) also does not hold for otherwise (by A2) there would be a correct processor whose clock read ~ at Thus, case (b) must hold and Cp(t,) > Cp(t,+ ) – f” E = ~ – a time prior to t,. f. E. Hence, whether or not t,is a critical and ETP(t, ) = ~. ❑ The next lemma a run is sent out message.
shows that more than
time
for p, we have C,(t,)
the (i + l)st synchronization e time units later than the
> L( -f.
E
message sent out in i th synchronization
LEMMA 4.2.6. In every run of the algorithm, for all i ≥ 0, if part (a) of CS3(i) holds and t_i is finite, then t_{i+1} > t_i + e.
PROOF. Suppose that t_i is finite. If t_{i+1} is infinite, then the lemma clearly holds. Otherwise, by the previous lemma, there is some processor, say p, that is correct at time t_{i+1} such that C_p(t_{i+1}) > V_{i+1} − f·E and ET_p(t_{i+1}) = V_{i+1}. By P1, there is a u < t_{i+1} that is the earliest time such that ET_p(u+) = ET_p(t_{i+1}). From part (a) of CS3(i) and P1, it follows that u > t_i. By P1, we have that

C_p(u+) = ET_p(u+) − PER = V_{i+1} − PER.

Moreover, C_p is continuous in the interval (u, t_{i+1}), since, by P1, A_p changes only when ET_p changes. Thus, by A1, we have that C_p(t_{i+1}) ≤ V_{i+1} − PER + (1 + ρ)(t_{i+1} − u) ≤ V_{i+1} − PER + (1 + ρ)(t_{i+1} − t_i). Combining this with the earlier inequality C_p(t_{i+1}) > V_{i+1} − f·E, we get that (1 + ρ)(t_{i+1} − t_i) > PER − f·E. The Drift and Separation Inequalities together imply that PER − f·E > (1 + ρ)e, so we get that t_{i+1} > t_i + e, as desired. ❑

LEMMA 4.2.7. If CS3(i) and CS4(i) hold in a run of the algorithm, then so does CS1(i).

PROOF. Suppose r is a run where CS3(i) and CS4(i) hold, and let p and q be two processors that are correct at time t in run r such that V_i < ET_p(t) ≤ V_{i+1} and V_i < ET_q(t) ≤ V_{i+1}. Since both ET_p(t) and ET_q(t) are greater than V_i, and these values must all be multiples of PER, by P2, we must have that ET_p(t) ≥ V_i + PER and ET_q(t) ≥ V_i + PER. By P3, it follows that C_p(t) > V_i and C_q(t) > V_i. By part (c) of CS3(i), if t is in the interval [t_i, t_i + e], then C_p(t) < V_i + (1 + ρ)e, and C_q(t) < V_i + (1 + ρ)e. Thus, both C_p(t) and C_q(t) are in the interval (V_i, V_i + (1 + ρ)e), so that |C_p(t) − C_q(t)| < (1 + ρ)e, which is less than DMAX.

Now suppose t > t_i + e. By CS4(i), we have that A_p and A_q are constant in the interval [t_i + e, t). Thus, C_p and C_q are continuous functions in this interval. Suppose without loss of generality that C_p(t) ≥ C_q(t). We claim there can be no point t′ in the interval such that C_p(t′) is of the form k·PER. For if there were, then by P3 we would have C_p(t′) = ET_p(t′). Then by task TM a synchronization value would be sent at t′, contradicting CS4(i). By parts (c) and (d) of CS3(i) together with P3, we know that C_q(t_i + e) > V_i and C_p(t_i + e) < V_i + (1 + ρ)e. Since V_i is a multiple of PER, and C_p(t) cannot be a multiple of PER in the interval [t_i + e, t), we know that C_p(t) ≤ V_i + PER. It is easy to see that we overestimate the maximum separation between C_p and C_q at time t by assuming

(1) C_p(t_i + e) = V_i + (1 + ρ)e,
(2) C_q(t_i + e) = V_i,
(3) C_p runs at the maximum possible rate (1 + ρ) in the interval [t_i + e, t],
(4) C_q runs at the minimum possible rate (1 + ρ)⁻¹ in this interval, and
(5) C_p(t) = V_i + PER (so that the interval is as long as possible).

Making these assumptions, we see that t = t_i + e + (1 + ρ)⁻¹(PER − e), C_p(t) = V_i + PER, and C_q(t) = V_i + (1 + ρ)⁻²(PER − e). Thus, C_p(t) − C_q(t) ≤ (1 − (1 + ρ)⁻²)PER + (1 + ρ)⁻²e. Since straightforward algebra shows that (1 + ρ)⁻² > 1 − 2ρ, this expression too is bounded by DMAX. ❑
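As a numerical sanity check (ours, not part of the original proof), the final bound can be verified with the sample parameter values used later in Section 4.3:

```python
# Check (added by us) of the separation bound from Lemma 4.2.7, using the
# Section 4.3 sample parameters: rho = 1e-6, e = 0.2 s, PER = 1 hour.
# The overestimated worst-case separation
#     (1 - (1 + rho)**-2) * PER + (1 + rho)**-2 * e
# should indeed be bounded by DMAX = (1 + rho)*e + 2*rho*PER.

rho = 1e-6      # bound on duration-timer drift rate
e = 0.2         # message diffusion time bound (seconds)
PER = 3600.0    # resynchronization period (seconds)

worst_sep = (1 - (1 + rho) ** -2) * PER + (1 + rho) ** -2 * e
dmax = (1 + rho) * e + 2 * rho * PER

# The algebraic step used in the proof: (1 + rho)**-2 > 1 - 2*rho.
assert (1 + rho) ** -2 > 1 - 2 * rho
assert worst_sep < dmax
```

The margin is small (both quantities are about 0.207 s with these values), which is why the proof needs the exact algebraic inequality rather than a cruder estimate.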
LEMMA 4.2.8. If CS3(i + 1) holds in a run of the algorithm, then so does CS2(i).

PROOF. Suppose p is correct and makes an adjustment at a time t such that V_i < ET_p(t) ≤ V_{i+1}. By P4 (which holds by Lemma 4.2.3), ET_p(t) = V_{i+1}, ET_p(t+) = V_{i+1} + PER, and C_p(t+) = V_{i+1}, so t ≥ t_{i+1}. By parts (d) and (e) of CS3(i + 1), t must be in the interval [t_{i+1}, t_{i+1} + e). Since C_p(t+) = V_{i+1}, it follows from part (c) of CS3(i + 1) that C_p(t+) − C_p(t) < ADJ. ❑

LEMMA 4.2.9. If CS3(i) holds in a run of the algorithm, then so does CS4(i).

PROOF. Suppose p is correct at time t, V_i < ET_p(t) ≤ V_{i+1}, and t > t_i + e. Since by assumption all correct processors are initialized in the interval [t_0, t_0 + d), and since once ET_p is defined it stays defined until processor p becomes faulty, it follows that ET_p is defined in the interval [t_0 + d, t], and hence (since d ≤ e and t_i ≥ t_0) in [t_i + e, t]. Suppose there were a critical time u for p in the interval [t_i + e, t). Since u > t_i + e, by CS3(i) and P1, it follows that ET_p(u) > V_i. By P4, ET_p(u) = V_j for some j > i. Thus, ET_p(u) ≥ V_{i+1}. By P4, we have ET_p(u+) = V_j + PER. Hence, by P1, we have ET_p(t) > V_{i+1}, contradicting our assumption. Thus, there is no critical time for p in the interval. CS4(i) now follows. ❑

LEMMA 4.2.10. CS3(i) holds for all i ≥ 0 in every run of the algorithm.

PROOF. We proceed by induction on i. For the case i = 0, recall that t_0 = 0 and V_0 = 0 by definition, and we assumed that all processors are initialized at
some time in the interval [0, e). Since we have also assumed that if a processor p is initialized at time u we have ET_p(u) undefined, it is easy to see that parts (a) and (b) of CS3(0) hold vacuously. Clearly if a correct processor's logical clock is not adjusted before time t_0 + e, then by A1 it reads a value in the range [0, (1 + ρ)e) wherever it is defined in the interval [t_0, t_0 + e], while its value of ET is PER. On the other hand, if some correct processor's clock is adjusted in this interval, then t_1 < t_0 + e. By Lemma 4.2.5 and the fact that V_1 ≥ PER, for some correct processor p we have C_p(t_1) > V_1 − f·E ≥ PER − f·E. Since p cannot have adjusted its clock prior to t_1, we must have (1 + ρ)e > C_p(t_1) > PER − f·E, contradicting the Separation Inequality. Thus, no correct processor adjusts its clock before t_0 + e. This proves part (c), as well as part (d). Part (e) follows from P1 and P2.

Now assume CS3(i) holds; we show that CS3(i + 1) holds. If t_{i+1} is infinite, then so is V_{i+1}, by definition, so CS3(i + 1) is vacuous. So suppose that t_{i+1} is finite. For part (a), observe that by P2, it follows that V_{i+1} ≥ V_i + PER. Lemma 4.2.6 implies that t_{i+1} > t_i + e. Suppose p is correct and for some t < t_{i+1} we have ET_p(t) > V_{i+1}. By P2, it follows that ET_p(t) ≥ V_{i+1} + PER. By parts (a)–(d) of CS3(i), it is easy to see that we must have t > t_i + e. We next show that C_p must be continuous in [t_i + e, t). If not, then there is some adjustment in [t_i + e, t). Let u be the time of the first adjustment (such a u exists by P1). By P4, C_p(u+) = ET_p(u) = V_j for some j. By parts (d) and (e) of CS3(i), ET_p(u) > V_i, so j ≥ i + 1. If ET_p(u) = V_{i+1}, then the fact that u < t_{i+1} contradicts the definition of t_{i+1}. If ET_p(u) > V_{i+1}, then by P3, C_p(u) > ET_p(u) − PER ≥ V_{i+1}. Since C_p is continuous in the interval [t_i + e, u), it follows that C_p(v) = V_{i+1} for some v in the interval, and hence, again by continuity, that C_p(v+) = V_{i+1}. But this contradicts the definition of t_{i+1}. Thus, C_p is continuous in [t_i + e, t), as claimed. By P3, we have C_p(t) > ET_p(t) − PER ≥ V_{i+1}. Since C_p(t_i + e) < V_i + (1 + ρ)e < V_i + PER ≤ V_{i+1}, it follows from the continuity of C_p that for some point u in the interval (t_i + e, t), we have
C_p(u) = V_{i+1} and hence C_p(u+) = V_{i+1}. But this again contradicts the definition of t_{i+1}. Thus, we must have ET_p(t) ≤ V_{i+1}, as desired. This proves part (a) of CS3(i + 1).
For part (b), first observe that by Lemma 4.2.5, for some processor p that is correct at t_{i+1} we have C_p(t_{i+1}) > V_{i+1} − f·E. By CS1(i) (which holds by the induction assumption together with Lemmas 4.2.7 and 4.2.9), for every processor q that is correct at t_{i+1} we have |C_p(t_{i+1}) − C_q(t_{i+1})| < DMAX, so C_q(t_{i+1}) > V_{i+1} − f·E − DMAX > V_{i+1} − ADJ. Since ADJ < PER by the Separation Inequality, it follows from P3 that we must have ET_q(t_{i+1}) ≥ V_{i+1}. In combination with the previous paragraph, this gives us part (b) of CS3(i + 1). (Since ET_p does not change in the interval [t_i + e, t_{i+1}), we can in fact show that ET_p(t_{i+1}) = V_i + PER, and hence that V_{i+1} = V_i + PER. Thus, we could carry along as an inductive hypothesis that V_i = i·PER if t_i is finite, but we do not need this fact here, nor will it hold for our join algorithm.)

For part (c) of CS3(i + 1), suppose that p is correct at time t ∈ [t_{i+1}, t_{i+1} + e]. There are four cases to consider: (1) ET_p(t) < V_{i+1}, (2) ET_p(t) = V_{i+1}, (3) ET_p(t) = V_{i+1} + PER, and (4) ET_p(t) > V_{i+1} + PER. We show that only case (2) or case (3) can hold, and that, in these cases, V_{i+1} − ADJ < C_p(t) < V_{i+1} + (1 + ρ)e. By Lemma 4.2.6 and the induction assumption applied to part (a) of CS3(i), we can assume t_{i+1} > t_i + e; by Lemma 4.2.9 and the induction assumption applied to CS3(i), we can assume CS4(i).

Suppose case (1) holds, so ET_p(t) < V_{i+1}. By assumption we have t ≥ t_{i+1} > t_i + e and V_i < ET_p(t) < V_{i+1}. By part (e) of CS3(i), Lemma 4.2.6, and the assumption, we know that ET_p is defined throughout the interval [t_i + e, t); ET_p is defined at t by assumption. By part (b) of CS3(i + 1), ET_p(t_{i+1}) = V_{i+1}. Since t ≥ t_{i+1}, this contradicts P1.

Suppose case (2) holds, so ET_p(t) = V_{i+1}. By CS4(i), ET_p is defined and A_p is constant throughout [t_{i+1}, t). By part (b) of CS3(i + 1), C_p(t_{i+1}) > V_{i+1} − ADJ. By P3, C_p(t) ≤ ET_p(t) = V_{i+1}. Thus, throughout [t_{i+1}, t_{i+1} + e] we have V_{i+1} − ADJ < C_p(t) ≤ V_{i+1} < V_{i+1} + (1 + ρ)e.

Suppose case (3) holds, so ET_p(t) = V_{i+1} + PER. By P1, there is a first time u < t such that ET_p(u+) = V_{i+1} + PER, and by P1 we have C_p(u+) = V_{i+1}. By part (a) of CS3(i + 1), u ≥ t_{i+1}. Finally, by P1 and the fact that there are no changes to ET_p in the interval (u, t), there are no changes to A_p, and C_p is continuous in this interval. Since t_{i+1} ≤ u < t ≤ t_{i+1} + e, using A1 and P1, we get V_{i+1} < C_p(t) < V_{i+1} + (1 + ρ)e.

Suppose case (4) holds, so ET_p(t) > V_{i+1} + PER. Let T = ET_p(t). By P2, we must have T ≥ V_{i+1} + 2·PER. By P1, there is a first time u ≤ t such that some correct processor q has ET_q(u+) = T. There are two subcases: (4a) u is a critical time for q, and (4b) u is not a critical time for q.

Suppose (4a) holds. Then, by P4, C_q(u+) = ET_q(u+) − PER, so C_q(u+) = V_j for some j, and thus t_j ≤ u. But ET_q(u+) = T ≥ V_{i+1} + 2·PER, so C_q(u+) ≥ V_{i+1} + PER, and j > i + 1. Then t_j ≤ u < t ≤ t_{i+1} + e, while by Lemma 4.2.6 and part (a) of CS3(i + 1) we have t_{i+1} + e < t_j, contradicting Lemma 4.2.5.

Suppose case (4b) holds, so that u is not a critical time for q. Then ET_q(u) is defined and ET_q(u) ≥ V_{i+1} + PER. By P1, there is a first time v < u such that ET_q(v+) = ET_q(u), and C_q(v+) = ET_q(v+) − PER = T − 2·PER ≥ V_{i+1}. Since ET_q(v+) = ET_q(u) and ET_q(u) ≥ V_{i+1} + PER, it follows from part (a) of CS3(i + 1) that if w is any time in (v, u), then w > t_{i+1}. Thus t_{i+1} ≤ v < u < t ≤ t_{i+1} + e. Now C_q is continuous on (v, u], so C_q(u+) ≤ C_q(v+) + (1 + ρ)e < T − PER, by A1 and the fact that (1 + ρ)e < PER. But, for any w in (u, t), C_q(w) > T − PER, by P3. Thus, C_q(u+) ≥ T − PER, contradicting C_q(u+) < T − PER. This completes the proof of part (c) of CS3(i + 1).
For part (d), suppose that q is correct at t_{i+1} + e and ET_q(t_{i+1} + e) ≠ V_{i+1} + PER. By part (c) of CS3(i + 1), we must have ET_q(t_{i+1} + e) = V_{i+1}. Let p be a processor correct at t_{i+1} such that C_p(t_{i+1}+) = V_{i+1}. We have assumed that all processors are initialized before t_0 + e. By Lemma 4.2.6, t_{i+1} > t_0 + e. Thus, it follows from the definition of correct that q must be correct throughout the interval [t_{i+1}, t_{i+1} + e) and that C_p(t_{i+1}) and ET_p(t_{i+1}) are both defined. By part (b) of CS3(i + 1), ET_p(t_{i+1}) = V_{i+1}. We claim that p sends synchronization value V_{i+1} at time t_{i+1}. If C_p(t_{i+1}) = V_{i+1}, this follows by inspection of task TM. Otherwise C_p(t_{i+1}) ≠ C_p(t_{i+1}+), so that p must invoke task MSG at t_{i+1}, and at that time p sends synchronization value V_{i+1}.

We now apply A2 with t = t_{i+1}. Let p_0, ..., p_k be the sequence of processors guaranteed to exist by A2, with p = p_0, q = p_k, and k·tdel < d. Note that each processor that is correct at a time u ∈ [t_{i+1}, t_{i+1} + d] has its value of ET in {V_{i+1}, V_{i+1} + PER} by part (c) of CS3(i + 1) and, if ET_{p_j}(u) = ET_{p_j'}(u) = V_{i+1}, then |C_{p_j}(u) − C_{p_j'}(u)| < DMAX by CS1(i). We show by induction on j that p_j sends the synchronization value V_{i+1} at some time in the interval [t_{i+1}, t_{i+1} + j·tdel]. The base case holds by assumption. Suppose p_j sends the synchronization value V_{i+1} at a time in the interval [t_{i+1}, t_{i+1} + j·tdel], and j < k. By A2, p_{j+1} receives a synchronization message with value V_{i+1} before t_{i+1} + (j + 1)·tdel. If p_{j+1} already sent the V_{i+1} message before this time, then it must have sent it at a time in the interval [t_{i+1}, t_{i+1} + (j + 1)·tdel], and by tasks TM and MSG it set its clock to V_{i+1} at the time this message was sent, as desired. If p_{j+1} did not already send such a message, then it suffices to show that the message it receives from p_j is timely, that is, that it passes all the tests of task MSG. Suppose p_j sent its message at time u and the message is received by p_{j+1} at time t. Since the interval [u, t] is contained in the interval [t_{i+1}, t_{i+1} + e), and since we have assumed that p_{j+1} has not sent V_{i+1} by time t, part (c) of CS3(i + 1) implies that the value of ET for p_{j+1} must be V_{i+1} (the only other choice is V_{i+1} + PER, but by inspection of tasks TM and MSG, a message with synchronization value V_{i+1} is sent out when ET is set to V_{i+1} + PER). Thus, |C_{p_j}(u) − C_{p_{j+1}}(u)| < DMAX; since t > u, it follows that C_{p_{j+1}}(t) > C_{p_j}(u) − DMAX.

There are now two cases. If p_j used task TM to send out its message, then C_{p_j}(u) = V_{i+1}. Thus, C_{p_{j+1}}(t) > V_{i+1} − DMAX, so in this case the message (which arrives with one signature) passes the timeliness test. If p_j used task MSG, then p_j was responding to a message with s signatures and sending a message with s + 1 signatures. Since p_j found the message timely, C_{p_j}(u) > V_{i+1} − s·E, and so C_{p_{j+1}}(t) > V_{i+1} − (s + 1)·E. Since p_{j+1} receives the message with s + 1 signatures, again it passes the timeliness test. By task MSG, it now follows that p_{j+1} sends out a message with synchronization value V_{i+1} sometime in the interval [t_{i+1}, t_{i+1} + (j + 1)·tdel), completing the induction.

Since k·tdel < d ≤ e, it follows that q sends out such a message before time t_{i+1} + e. By P4, when q sends out this message, it sets ET_q to V_{i+1} + PER. By P1, this contradicts the original conclusion that ET_q(t_{i+1} + e) = V_{i+1}. The contradiction completes the proof of (d). Part (e) is immediate from part (d) and P1. ❑

PROOF OF THEOREM 4.2.1. By Lemma 4.2.3, the algorithm satisfies P1–P4 in every run. By Lemma 4.2.10, it satisfies CS3(i) for all i ≥ 0 in every run. It now follows by Lemmas 4.2.7, 4.2.8, and 4.2.9 that it also satisfies CS1(i), CS2(i), and CS4(i). For each synchronization value, each correct processor sends at most one message to each of its neighbors, that is, at most n − 1 messages. Thus, fewer than n² messages are sent for each synchronization value. This completes the proof of Theorem 4.2.1. ❑
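The timeliness test of task MSG, which drives the diffusion argument in the proof of part (d) above, can be sketched in code. This is our illustration, not the paper's: the function names `is_timely` and `relay` are ours, and the constant is the sample value of E from Section 4.3.

```python
# Sketch (ours) of the timeliness test from task MSG: a synchronization
# message carrying value T with s signatures is accepted by a processor
# whose clock reads c only if c > T - s*E.  A correct processor that
# accepts it sets its clock to T and relays the message with its own
# signature appended, so the next hop sees s + 1 signatures.

E = 0.21  # per-hop error allowance (the parameter E; sample value)

def is_timely(T, s, c, E=E):
    """Timeliness test of task MSG for a message (T, s) at local clock c."""
    return c > T - s * E

def relay(T, s, c, E=E):
    """If timely, accept: clock jumps to T; message goes out with s+1 sigs."""
    if is_timely(T, s, c, E):
        return T, s + 1
    return c, None  # message discarded

# A message timely at one hop stays timely at the next: the receiver's
# clock is within E of the sender's new clock T, so c' > T - (s + 1)*E.
clock, sigs = relay(10.0, 1, 10.0 - 0.5 * E)
assert (clock, sigs) == (10.0, 2)
```

This mirrors the induction in part (d): each forwarding step widens the acceptance window by one multiple of E, exactly compensating for the per-hop clock difference bounded by DMAX ≤ E.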
4.3. PERFORMANCE ISSUES. We now consider some typical values for the parameters of the algorithm. Suppose ρ = 10⁻⁶, tdel = 0.1 second, and the network is completely connected with n processors. Then, so long as there are no more than two processor failures and the network remains connected with diameter at most 2, we can take d = e = 0.2 second, E = DMAX = 0.21 second, PER = 1 hour, and ADJ = 0.63 second. If we allow only processor failures (as is the case in Lamport and Melliar-Smith [1985] and Welch and Lynch [1988]), then we can do even better, since we are assured that the diameter of the network is still 1. We can take d = e = 0.1 second, PER = 1 hour, E = DMAX = 0.11 second, and ADJ = 0.33 second. Note that DMAX is roughly equal to d. As stated in Section 2, we can make d, and hence DMAX, smaller by giving the synchronization process high priority in the scheduling of the operating system of the processor.

Since our algorithm never sets clocks back, if duration timers have fixed rates of drift from real time (as is often the case) and there are no faults, then clocks will run at the rate of the fastest correct duration timer. This means that logical clocks of correct processors will tend to run faster than real time. In the worst case, we have from Theorem 2.1 that processors run at a rate of PER/(PER − ADJ). Since ADJ = (f + 1)E, if PER ≫ ADJ and E = DMAX = (1 + ρ)e + 2ρ·PER (these assumptions will all typically be true in practice), this worst-case rate is approximately equal to 1 + (ADJ/PER) ≈ 1 + 2(f + 1)ρ.

In Srikanth and Toueg [1987], an algorithm is given that attains optimal synchronization in the sense that logical clocks of correct processors are within the same envelope of real time as duration timers (i.e., (1 + ρ)⁻¹(v − u) ≤ C(v) − C(u) ≤ (1 + ρ)(v − u) for v > u). However, Srikanth and Toueg [1987] require, to maintain this optimal synchronization, that the number of faulty processors f be less than half the total number of processors, a requirement they prove necessary, even with authentication. One way to use our algorithm to achieve essentially this optimal synchronization is to measure the rate of speedup at which logical clocks gain time using our algorithm, and then to set duration timers to run slower by that rate. Moreover, our DMAX is essentially twice theirs in completely connected networks, but perhaps we can still decrease the value of DMAX in practice.

In our algorithm, DMAX gives an upper bound on the difference between clocks of correct processors that have the same value of ET. There may be a short interval of time (a subinterval of [t_i, t_i + e]) during which correct processors have different values of ET. By part (c) of CS3(i), it follows that even in this short interval their clocks differ by at most ADJ + (1 + ρ)e. If we assume that ρ ≈ 0 and E = DMAX, then (since ADJ = (f + 1)E) this difference is bounded by approximately (f + 2)e + 2ρ·PER. Using the estimates for ρ and PER given above, we see that the dominant term here is (f + 2)e. This amount may be unacceptable in large systems, where f may grow linearly with n. One way around this problem is to prevent events that require timing from taking place in this interval, as suggested in Lamport and Melliar-Smith [1985].
However, there is another approach. We can simply start, for each distributed process, a virtual clock that coincides with the logical clock when the process starts and then undergoes no adjustments; this virtual clock is then used to time the events of that process. Unadjusted logical clocks will differ by at most DMAX plus a term proportional to dur, where dur is the maximum real-time duration for which a clock might be used to time events; this may be significantly less than ADJ + (1 + ρ)e.

Yet another approach is to make logical clocks continuous, rather than just piecewise continuous. We can do this by amortizing the adjustment we make to clocks over some time interval, rather than making it all at once. This idea was suggested in Lamport and Melliar-Smith [1985]. We present an algorithm for continuous clocks in Section 9. Since we can take e = d in this algorithm, the bound on synchronization that we maintain (DMAX = (1 + ρ)e + 2ρ·PER) is essentially within a factor of 2 of the optimal bound of d/2 attainable in systems with no clock drift at all (see Dolev et al. [1986] and Halpern et al. [1985] for further details). However, the bounds we guarantee are worst-case bounds, and it may be possible to synchronize with much tighter precision with high probability (see, e.g., Cristian [1989]).

5. Initialization and Joining

There are two issues that remain. The first is initializing new or repaired processors so that their logical clocks are started within less than d time units of each other. The second is integrating (joining) new processors so that their logical clocks are synchronized with those of all the other processors. The first task can be accomplished quite easily by a simple diffusion [Cristian et al. 1986; Dolev et al. 1986]. We assume that each of the processors in the network either starts its logical clock spontaneously (thus setting it to 0) or starts it upon receipt of a message from another processor. As soon as a processor starts, it sends a message to all of its neighbors. By assumption A2, this diffusion requires less than d units of real time.
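The initialization diffusion just described can be sketched as a breadth-first flood. This toy simulation is ours (the paper gives no code); it records, for each processor, how many message hops elapse before its clock starts.

```python
# Toy illustration (ours) of initialization by diffusion: a processor
# starts its logical clock at 0 either spontaneously or on the first
# message from a neighbor, and immediately notifies all its neighbors.
# With per-hop delay at most tdel, every processor starts within
# (number of hops) * tdel < d of the first spontaneous starter.

from collections import deque

def diffusion_start_hops(neighbors, spontaneous):
    """Return, per processor, the hop count at which its clock starts."""
    hops = {p: 0 for p in spontaneous}
    queue = deque(spontaneous)
    while queue:
        p = queue.popleft()
        for q in neighbors[p]:        # "sends a message to all neighbors"
            if q not in hops:         # q starts its clock on first message
                hops[q] = hops[p] + 1
                queue.append(q)
    return hops

# A 5-processor line network where only processor 0 starts spontaneously.
net = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
hops = diffusion_start_hops(net, spontaneous=[0])
assert hops == {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
```

In terms of assumption A2, the maximum hop count here is bounded by the length of the guaranteed fault-free paths, which is what bounds the spread of starting times by d.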
We now turn our attention to the problem of joining, to which most of the remainder of the paper is devoted. We start with some notation: A previously synchronized group of processors is called a cluster, and a new processor that wants to join the cluster is called a joiner. We want an algorithm that allows a processor to join a cluster within a bounded time of requesting to do so. Such an algorithm is crucial in a dynamic network in which new processors are being added to the system. If we have a method of fault detection, such a join algorithm also allows faulty processors that have been repaired to rejoin a cluster. An algorithm achieves bounded joining if for some bound b > 0 a correct processor that requests to join a cluster is guaranteed to join within real time b. Unlike the basic clock synchronization algorithm, which does not require that some minimum number of processors be correct, a necessary condition for a bounded joining algorithm to be guaranteed to succeed is that a majority of the processors in the cluster be correct.
THEOREM 5.1. No algorithm can maintain LES and guarantee a bounded join if a processor tries to join a cluster where one half or more of the processors are faulty.
PROOF. Assume that some algorithm maintains LES with parameters Δ, α, β, γ, and δ, and maintains bounded joining with bound b. Consider a run r where all n processors in the cluster are correct throughout the run and in the same cluster. Choose a real time t, and choose T such that the time on the logical clocks of the processors at real time t is at most T. Now choose T′ such that

T′ − T > b(γ(1 + ρ) − α(1 + ρ)⁻¹) + (δ − β) + 2Δ.

The LES condition guarantees that at some time t′ in run r, the logical clocks of all correct processors show a time greater than T′. (This would not necessarily be true if the algorithm were not required to maintain LES; for example, r might be a run where logical clocks always read 0.)

We use r to construct two further runs of the algorithm. Divide the n processors into groups X and Y, each of size n/2 (we assume for simplicity that n is even). In the first run, r_X, the processors in X are correct and processor p (a new processor, not in either X or Y) tries to join at time t. All the processors proceed through time t just as in run r. At time t, processors in Y move into the state they had at time t′ in r. In the second run, r_Y, the processors in Y are correct and p tries to join at time t′ with the same local state it has when it tries to join in r_X at time t. All processors proceed through run r_Y until time t′ just as in run r. Then, just as p tries to join, processors in X move into the state they had at time t in r. Note that no processor can distinguish the two scenarios. Moreover, at the time p tries to join, the clock of each processor in X differs from the clock of each processor in Y by at least T′ − T.

By assumption, p joins the network in an interval of length b at and after the time p tries to join. The clock of each processor in X differs from the clock of each processor in Y by at least T′ − T when p tries to join. Condition A1, part (2) of the LES condition, and the choice of T′ guarantee that they differ by at least 2Δ throughout the interval of length b after p joins. Thus, p will not be within Δ of the correct processors in at least one of the two scenarios. ❑
Theorem 5.1 does not preclude the possibility of eventual joining (i.e., the existence of an algorithm guaranteeing that a processor that requests to join will in fact eventually join the network, with no guaranteed upper bound on the time required). For example, in the situation sketched in the proof, if there were no bound on the time required to join, the joining processor could tell the "fast" group of processors to run slower and the "slow" processors to run faster, each still staying within some linear envelope. We conjecture that an algorithm that achieves LES and eventual joining may exist without the assumption that less than half the processors are faulty. (The following is an idea for such a possible protocol: A joiner can obtain synchronization values from all participants (this can be done, for example, using the join protocol we describe in Section 7). If a processor sees that the synchronization value it sent is above the average value of the set, then it slows down by an agreed-upon rate; otherwise, it speeds up by this rate. If this rate is sufficiently large, then joiners and all other processors can detect and ignore uncooperative processors. The process is repeated periodically until all synchronization values in the set are
the same or ignorable. Then joiners can join the unanimous set.) However, our interest in this paper is in bounded joining, so for our join algorithm we assume that less than half the processors are faulty.

6. A Synchronous Update Service

In this section, we present an algorithm that enables a processor to keep track of the current list of processors in the cluster and enables all the processors in the cluster to agree on which processors are in the cluster. This algorithm essentially solves the atomic broadcast problem as presented by Cristian et al. [1986], but only for our special purpose. It is not suggested as a general-purpose atomic broadcast algorithm. We use it to update the list of joining processors as of the same clock time on all correct, joined processors in the system.

Again, we start with some definitions. We assume that each processor maintains a data structure suggestively called a (synchronous) replicated memory. We say that replicated memory in a set of processors is consistent as of clock time T if the replicated memories on all correct processors in the set are identical as of clock time T. We provide a synchronous update algorithm that guarantees that all updates to this structure are made at the same clock time by each processor in a cluster. (Note the similarity of these informal specifications to those of Byzantine agreement [Dolev and Strong 1983; Pease et al. 1980].) Thus, by using the update algorithm, we can maintain the consistency of replicated memory. We use the algorithm to ensure that all correct processors agree on who is currently in the cluster. The update algorithm is assumed to run concurrently with a clock synchronization algorithm that satisfies P1–P4 and CS1–CS4. We now define the specification of the update algorithm formally. Again, it is useful to parametrize by i. We require that the following two properties are satisfied for all i ≥ 0:

SU1(i): If p initiates an update UPD to replicated memory at time t such that V_i < ET_p(t) ≤ V_{i+1}, then by time t_{i+1} the replicated memory of all processors that are correct with ET defined at t_{i+1} is updated with UPD.

SU2(i): If p updates its replicated memory with UPD at time t with V_i < ET_p(t) ≤ V_{i+1} and C_p(t) = T, then for all processors q correct at time t_{i+1} with ET_q(t_{i+1}) defined, there exists a time t_q ≤ t_{i+1} such that C_q(t_q) = T and q updates replicated memory with UPD at t_q.
Intuitively, SU1 guarantees that if a correct processor initiates an update UPD to replicated memory, all memories are updated with UPD within a bounded real time. SU2 guarantees that if any correct processor updates its replicated memory with UPD, then all correct processors do so, and they do so at the same time on their local clocks.

We now provide an update algorithm. The algorithm has a similar flavor to the clock synchronization algorithm. Just as all the updates to clock values occur at prearranged times ET (which are all multiples of PER), updates to replicated memory occur in the update algorithm at prearranged times which, for technical reasons explained later, we take to be times of the form ET − ADJ. We show that in order to ensure that the information about the update has diffused throughout the network in time to do the update, processors must start diffusing information about the update at time ET − 3·ADJ. We have to strengthen the Separation Inequality that we had in earlier sections to guarantee that such times appear on the clocks of all correct processors. As in the clock synchronization algorithm, a message about the update diffuses through the system, and processors apply tests to determine if the message has arrived at an acceptable time.
sets UPDMSG and PENDING, both containing pairs of the form (T, UPD), where T is a clock time and UPD is an update value to be applied to replicated memory. UPDMSG consists of messages to be sent out and the times they are to be sent out, while PENDING consists of values with which replicated memory is to be updated, and the times that the update is to take place. Finally, MEM is a variable denoting the current replicated memory. We define APPLY(MEM,
UPD) to be an action that updates the replicated memory MEM with the value UPD. The update algorithm consists of three tasks, UPDINIT, DIFFUSE, and UPDATE. The first task, UPDINIT, is the analogue of task TM in the clock synchronization algorithm. If C = ET − 3·ADJ and processor p has a pair of the form (T, UPD) ∈ UPDMSG, with (T + 2·ADJ, UPD) not already in PENDING and T = ET − 3·ADJ, then, using task UPDINIT, processor p signs and sends a message SYNC(T, UPD) to all its neighbors. We can think of this message as saying "schedule an update UPD to replicated memory at clock time T + 2·ADJ (= ET − ADJ)." This means that (T + 2·ADJ, UPD) must be added to the PENDING list. On the other hand, (T, UPD) can be removed from the UPDMSG list once the message is sent. (In our applications, we guarantee that for all pairs (T, UPD) ∈ UPDMSG, T indeed has the form k·PER − 3·ADJ, so there will be no "useless" pairs in UPDMSG.) If (T + 2·ADJ, UPD) ∈ PENDING, then the update has already been scheduled, so there is no need to schedule it again.

Task UPDINIT
  if {((T, UPD) ∈ UPDMSG) ∧ (T = ET − 3·ADJ) ∧ ((T + 2·ADJ, UPD) ∉ PENDING) ∧ (C = T)} then begin
    SIGN AND SEND SYNC(T, UPD);
    PENDING ← PENDING ∪ {(T + 2·ADJ, UPD)};
    UPDMSG ← UPDMSG − {(T, UPD)};
  end

Task DIFFUSE is the analogue of task MSG in our clock synchronization algorithm. It guarantees that a SYNC(T, UPD) message will be passed along, provided the message is "convincing." In order for the message SYNC(T, UPD) to reach processor q convincingly, it must pass two tests. The first just checks that T = ET − 3·ADJ. To show that a message is convincing, we need to show that when a message sent by a correct processor p reaches q, the value of ET_p when the message was sent is the same as the value of ET_q when the message is received. This is done in Lemma 6.1 below. The second test verifies that if s is the number of signatures on the message, then T − s·E < C_q < T + 2s·E. Unlike the test in task MSG, this test is a two-sided test, and is asymmetric. Again, the size of the acceptable interval depends on the number of signatures, so that a message considered convincing by p and then forwarded to q will still be considered convincing by q. The reason for the factor of 2 in the right-hand side of the inequality is that one multiple of E is needed to allow for the difference between the clocks of p and q, and another to allow for the time taken by the message to diffuse from p to q.
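The two acceptance tests are simple arithmetic and can be sketched executably. The Python fragment below is an illustration only, with hypothetical parameter values (f = 1, E = 2, ADJ = (f + 1)·E = 4, none taken from the paper); `et` and `c` stand for the receiving processor's logical time ET and clock reading C.

```python
def convincing(T, et, c, s, E, ADJ):
    """Sketch of the two DIFFUSE acceptance tests for a SYNC(T, UPD) message.

    Test 1: the scheduled time matches the receiver's target, T = ET - 3*ADJ.
    Test 2: the receiver's clock C lies in the signature-dependent window
            T - s*E < C < T + 2*s*E (slack E per signature on the left;
            twice that on the right, to cover both the clock difference
            between sender and receiver and the diffusion time).
    """
    passes_first = (T == et - 3 * ADJ)
    passes_second = (T - s * E < c < T + 2 * s * E)
    return passes_first and passes_second

# Hypothetical values: f = 1, E = 2, ADJ = (f + 1) * E = 4.
E, ADJ = 2, 4
print(convincing(88, 100, 88.5, 1, E, ADJ))  # receiver expects T = 100 - 12 = 88
print(convincing(88, 100, 80.0, 1, E, ADJ))  # clock reading outside the window
```

Note how the window widens with the number of signatures s, which is exactly what lets a message accepted by p remain acceptable one hop later at q.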
Task DIFFUSE
  if {(an authentic message SYNC(T, UPD) with s distinct signatures is received) ∧ (T = ET − 3·ADJ) ∧ (T − s·E < C < T + 2s·E) ∧ ((T + 2·ADJ, UPD) ∉ PENDING)} then begin
    SIGN AND SEND SYNC(T, UPD);
    PENDING ← PENDING ∪ {(T + 2·ADJ, UPD)};
  end

For this protocol, we need not the Separation Inequality as before, but the Strong Separation Inequality, PER > 4·ADJ; that is, we strengthen the factor in the inequality to 4. To guarantee that this inequality can be satisfied together with the drift assumption, we also need to strengthen inequality A5 by adding an inequality on E; we refer to the strengthened assumption as A5′.

LEMMA 6.1. For each i ≥ 0, if a run satisfies P1–P4, CS1(i), CS3(i), CS4(i), and parts (a) and (b) of CS3(i + 1), if p is correct at time t, v_i ≤ ET_p(t) ≤ v_{i+1}, and ET_p(t) − 4·ADJ + DMAX < C_p(t) < ET_p(t) − ADJ − 2·DMAX, then t > t_i + e and t + d < t_{i+1}. Moreover, if u ∈ [t, t + d], processor q is correct at time u, and ET_q(u) is defined, then ET_q(u) = ET_p(t) and ET_p(t) − PER < C_q(u) < ET_p(t).
PROOF. Since ET_p(t) − 4·ADJ + DMAX < C_p(t), and ET_p(t) ≥ v_i + PER by part (c) of CS3(i), and since PER > 4·ADJ and DMAX > (1 + ρ)e by our constraints, it follows that C_p(t) > v_i + PER − 4·ADJ + DMAX > v_i + (1 + ρ)e. By A1, it follows that t > t_i + e.

We want to show that in fact t + d < t_{i+1}. If t_{i+1} is infinite, this is immediate. If not, there is some processor, say q′, which is correct at t_{i+1}. By part (b) of CS3(i + 1), we must have ET_{q′}(t_{i+1}) = v_{i+1}. Let u = min{t_{i+1}, t}. Because t_{i+1} > t_i + e, CS4(i) implies that both C_p(u) and C_{q′}(u) are defined. Using parts (a) and (b) of CS3(i + 1), we have that ET_p(u) and ET_{q′}(u) are both ≤ v_{i+1}. Using part (e) of CS3(i), we have that ET_p(u) and ET_{q′}(u) are both ≥ v_i. CS1(i) now implies that |C_p(u) − C_{q′}(u)| < DMAX. Since C_p(u) ≤ C_p(t) < ET_p(t) − ADJ − 2·DMAX ≤ v_{i+1} − ADJ − 2·DMAX, it follows that C_{q′}(u) < v_{i+1} − ADJ − DMAX, and hence that C_{q′}(t_{i+1}) − C_{q′}(u) > DMAX > (1 + ρ)e > (1 + ρ)d. Thus, t_{i+1} > u. Since u = min{t, t_{i+1}}, it follows that u = t and t < t_{i+1}. CS4(i) implies that Δ_{q′} is constant in the interval [t, t_{i+1}), so, since C_{q′}(t_{i+1}) − C_{q′}(u) > (1 + ρ)d, by A1 we must have t_{i+1} > t + d.

Suppose that u ∈ [t, t + d], processor q is correct at time u, and ET_q(u) is defined. By CS3(i) and part (a) of CS3(i + 1), we have v_i + PER ≤ ET_q(u) ≤ v_{i+1}. By CS4(i), q is correct and C_q is continuous in the interval [t, u]. In particular, this means that q does not adjust its clock in this interval. From CS1(i), it follows that |C_p(t) − C_q(t)| < DMAX. We have assumed that ET_p(t) − 4·ADJ + DMAX < C_p(t) < ET_p(t) − ADJ − 2·DMAX. Since PER > 4·ADJ by the Strong Separation Inequality, we have that ET_p(t) − PER < C_q(t) < ET_p(t) − ADJ − DMAX. Since q does not adjust its clock in the interval [t, u], we have, for any u′ ∈ [t, u], ET_p(t) − PER < C_q(u′) < ET_p(t). By P3, we know that ET_q(u′) − PER < C_q(u′) ≤ ET_q(u′). Since ET_p(t) and ET_q(u′) are both multiples of PER by P2, it follows that ET_p(t) = ET_q(u′). □

We now prove the correctness of the algorithm.

THEOREM 6.2. If a run of the algorithm satisfies P2, then all updates to synchronous memory are carried out at a time of the form k·PER − ADJ, where k is a positive integer. For each i ≥ 0, if a run satisfies P1–P4, CS1(i), CS3(i), CS4(i), parts (a) and (b) of CS3(i + 1), and t_i is finite, then t_{i+1} > t_i + e, and each update to synchronous memory also satisfies SU1(i) and SU2(i). Moreover, each update requires at most n² messages.
PROOF. By Task UPDATE, an update to synchronous memory is carried out by a correct processor p only if C_p(t) = T and (T, UPD) ∈ PENDING_p. By tasks UPDINIT and DIFFUSE, if (T, UPD) is inserted into PENDING_p at time t, then T = T′ + 2·ADJ, where T′ = ET_p(t) − 3·ADJ. By P2, ET_p(t) = PER·k for some positive integer k, so that T = k·PER − ADJ. Note that P2 is the only property used here.
To prove the remainder of the theorem, assume that we have a run of the algorithm that satisfies P1–P4, CS1(i), CS3(i), CS4(i), parts (a) and (b) of CS3(i + 1), and if t_i is finite then t_{i+1} > t_i + e; we show that SU1(i) and SU2(i) then hold. First note that if t_i is infinite, then SU1(i) and SU2(i) hold vacuously; so we may assume that t_i is finite and t_{i+1} > t_i + e.
CLAIM. (a) If p is the first correct processor to add (T + 2·ADJ, UPD) to PENDING, and it does so at a time t with v_i ≤ ET_p(t) ≤ v_{i+1}, then T = ET_p(t) − 3·ADJ, and every correct processor with ET defined throughout the interval [t, t + d) will have added (T + 2·ADJ, UPD) to PENDING at some time in this interval. (b) If q is a correct processor with ET defined at t_{i+1}, then there will be some time t_q < t_{i+1} such that C_q(t_q) = T + 2·ADJ, and q will update its replicated memory with UPD at time t_q.

For part (a) of the claim, there are two cases to consider: (1) processor p initiated the update using task UPDINIT by signing and sending the message SYNC(T, UPD) at time t, and (2) p received a convincing message at time t, and thus added (T + 2·ADJ, UPD) to PENDING using task DIFFUSE.

In case (1), task UPDINIT guarantees that C_p(t) = T = ET_p(t) − 3·ADJ. We now show that the message SYNC(T, UPD) diffuses, within real time d, through the network of processors that are correct and have ET defined in the interval [t, t + d]. It suffices to show that when a correct processor q′ sends a message SYNC(T, UPD) to its neighbor q, the message reaches q convincingly if q is correct and has ET defined. Suppose the message reaches q at time t′ with s signatures. By A2, it follows that t′ < t + d, so it follows from Lemma 6.1 that T = ET_p(t) − 3·ADJ = ET_q(t′) − 3·ADJ. Thus, the message passes the first test. Moreover, Lemma 6.1 guarantees that t′ < t + d < t_{i+1}, so no synchronizations occur in the interval [t, t′]. Thus, C_p(t′) < C_p(t) + (1 + ρ)d. Since, by CS1(i), we have |C_q(t′) − C_p(t′)| < DMAX, and C_p(t) = T, it follows that T − DMAX < C_q(t′) < T + DMAX + (1 + ρ)d. Our constraints now guarantee that T − E < C_q(t′) < T + 2E, so the message passes the second test.

In case (2), p must receive a SYNC(T, UPD) which was convincing at time t. Suppose the message has s signatures. These must be the signatures of faulty processors (otherwise p would not be the first correct processor to add (T + 2·ADJ, UPD) to PENDING), so we must have s ≤ f by A4. We must have T − s·E ≤ C_p(t) ≤ T + 2s·E and T = ET_p(t) − 3·ADJ. Since s ≤ f and ADJ = (f + 1)E, it follows that ET_p(t) − 4·ADJ + DMAX < C_p(t) < ET_p(t) − ADJ − 2·DMAX. Thus, the hypotheses of Lemma 6.1 are satisfied. Taking q and q′ as in the previous paragraph and using the same reasoning as above, we can again show that T = ET_p(t) − 3·ADJ = ET_q(t′) − 3·ADJ, so the first constraint is satisfied. Also, T − (s + 1)·E < C_q(t′) < T + 2(s + 1)·E, so the second constraint is satisfied as well. Since s ≤ f, we have T − ADJ < C_q(t′) < T + 2·ADJ; we use this fact below. Again, the message successfully diffuses throughout the network, and part (a) is proven.

For part (b) of the claim, suppose q is correct and has ET defined at t_{i+1}. By Lemma 6.1, we have t_i + e < t. Therefore, by CS4(i) and CS3(i + 1)(b), q is correct and has ET_q defined in the interval [t, t_{i+1}]. It follows from our arguments above that q adds (T + 2·ADJ, UPD) to PENDING at some time t′ in the interval [t, t + d). Thus, it suffices to show that there is a time t_q ∈ [t′, t_{i+1}) such that C_q(t_q) = T + 2·ADJ, since it is clear that, using task UPDATE, replicated memory will be updated at such a time t_q. We have shown that T − ADJ < C_q(t′) < T + 2·ADJ. In particular, this means that the update is scheduled for a time in the future. From CS4(i) and CS3(i + 1)(b), it follows that C_q is continuous in this interval; hence, Δ_q is constant in the interval [t′, t_{i+1}). By CS3(i + 1)(b), we have C_q(t_{i+1}) ≥ v_{i+1} − ADJ ≥ ET_q(t′) − ADJ = ET_p(t) − ADJ = T + 2·ADJ. By continuity, there must be a time t_q in the interval [t′, t_{i+1}) with C_q(t_q) = T + 2·ADJ, proving part (b) and hence the entire claim.

It is easy to see that SU1(i) follows immediately from part (a) of the claim. For SU2(i), suppose that p updates its replicated memory with UPD at time t, when C_p(t) = T + 2·ADJ. Then, by UPDATE, it must be the case that (T + 2·ADJ, UPD) ∈ PENDING. By P3, it follows that T + 2·ADJ ≤ ET_p(t) < T + 2·ADJ + PER. Suppose that q is the first processor to add (T + 2·ADJ, UPD) to PENDING, and suppose it does so at time t′. We now prove that ET_q(t′) = ET_p(t). As our earlier arguments showed, we have T − ADJ < C_q(t′) < T + 2·ADJ. In addition, tasks UPDINIT and DIFFUSE guarantee that T = ET_q(t′) − 3·ADJ. It follows that |ET_q(t′) − ET_p(t)| < PER and hence (since ET is always a multiple of PER), that ET_q(t′) = ET_p(t). Thus, v_i ≤ ET_q(t′) ≤ v_{i+1}. By the claim, it follows that for all processors q′ with ET defined at t_{i+1}, there is some time t_{q′} < t_{i+1} such that C_{q′}(t_{q′}) = T + 2·ADJ, and they update replicated memory with UPD at this time.
The n² bound on messages is straightforward, since it is clear that each processor sends at most one update message for each update to replicated memory to each of its neighbors. □

We can improve the performance of the algorithm somewhat if we can get a better estimate on DMAX than E. Recall that E was meant to be an a priori upper bound on DMAX. If we can get an improved estimate D on d, then we can replace the two-sided test T − s·E < C < T + 2s·E by the test T − s·E < C < T + s·(D + E).

7. A Synchronization Algorithm

For the join algorithm, we require one additional assumption. (The assumption that more than f processors are correct must hold at all times t ≥ t_0 + d; the next assumption, by contrast, is only required to hold during a join process.) This assumption will be used to guarantee that a joining processor has a correct joined neighbor that it can rely on to notify the other processors that it wants to join.

A7: For all processors p and all times t ≥ t_0, there is a correct joined processor q that is a neighbor of p such that q is correct throughout the interval [t, t + d + (1 + ρ)PER].

A7
can be eliminated, although the result would be a more complicated algorithm. Instead, we assume A7, while mitigating its strength. The role of the parameter PER in algorithm M is now shared by two parameters, LPER and PER, where LPER should be thought of as a large multiple of PER. As before, resynchronization values are multiples of PER. Roughly speaking, if there are no processors trying to join, then a resynchronization will take place once every LPER; if processors are trying to join, then resynchronizations take place once every PER, thus minimizing the amount of time joining processors have to wait to join the cluster. In addition to PER and LPER, the algorithm uses the parameters
E, ADJ,
and f (so that, informally, processors know an upper bound on the number of failures). Again, each processor has local variables ET, Δ, and C, as well as local variables CLUSTER (describing which processors are currently in the cluster), JOINERS (describing which processors want to join), and a number of other variables which we shall describe shortly. Formally, we say that a processor is joined at time t if ET(t) is defined. We extend the definition of correct as follows, to cover processors that join: a processor p is correct at time t if it follows its algorithmic specification, and, if it is joined, its duration timer has been correct (i.e., has satisfied A1) from the time it joined through time t. We assume that all processors in the network (including joiners) know their own signature functions (the S_p of assumption A3) and how to check the signatures of all processors in the network, as well as the values of the parameters D, f, E, PER, and LPER. For simplicity, we assume that the signature function of a joining processor is distinct from all other signature functions that were ever used in the network. (In particular, this means that if a processor is rejoining after being repaired, it must use a new name and signature.)

A8: If at time t some correct processor possesses a signature of processor p and if p is correct at time t, then p has been correct since it issued the signature.
At the end of Section 8, we indicate how to remove this assumption, at the cost of a slight increase in the complexity of the algorithm and an increase to a worst-case time requirement for all joins. For simplicity, we also assume that the string representing the name of any processor p is unforgeable. (For example, we could identify p with Sfl applied to the empty body.)
We assume that a correct processor that wants to join has a correct duration timer, but that its variables ET, Δ, C, and CLUSTER are all undefined; they become defined during the execution of the algorithm. We also assume an initial cluster R_0 containing more than f correct joined processors, all initialized (using the initialization algorithm discussed in Section 5) during the interval [t_0, t_0 + d). The correct members of the initial cluster are initialized with Δ = −DT, ET = PER, JOINERS = ∅, and CLUSTER = R_0.

The first task of the algorithm is called RTJ (for Request to Join). When a processor wants to join a cluster, it sends out a special "request-to-join" message of the form RTJ(p) to its neighbors. (We assume some mechanism for p to decide that it wants to join.)

Task RTJ
  if processor p wants to join then begin
    SIGN AND SEND RTJ(p);
  end

All correct processors in the cluster must agree on which processors want to join. Thus, when a processor receives a request-to-join message from a processor q, it schedules an update to replicated memory that adds q to JOINERS (by appropriately updating UPDMSG). If p receives the message before time ET − 3·ADJ, it schedules the update message to be sent at time ET − 3·ADJ. If not, then it is too late to send the message in this synchronization period, so it schedules the message to be sent at time ET + PER − 3·ADJ. It is possible for one correct processor to receive a request-to-join message from q before time ET − 3·ADJ, while another does not. Our later tasks will ensure that replicated memory is updated only once. Since our algorithm ensures that all processors in the cluster perform the replicated memory update at the same clock time, this guarantees that all processors in the cluster will agree on JOINERS. By including q's signature on the request-to-join message, p is "proving" to all the other processors that the update message was sent in response to a request-to-join message. Without this requirement, it would be possible for a faulty processor to arrange for "phantom" processors to join the network.

Task ADD
  if {(joined) ∧ (an authentic message M with body RTJ(q) is received)} then begin
    if C < ET − 3·ADJ then T ← ET − 3·ADJ else T ← ET + PER − 3·ADJ;
    UPDMSG ← UPDMSG ∪ {(T, M)};
  end
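Task ADD's choice of send time is a simple branch, sketched below in Python with hypothetical clock values (not from the paper); `C` and `ET` are the receiving processor's clock and logical time.

```python
def add_send_time(C, ET, PER, ADJ):
    """Sketch of task ADD's choice of send time for an RTJ-triggered update.

    A request arriving before clock time ET - 3*ADJ still makes the current
    synchronization period; a later one is deferred to the next period.
    """
    if C < ET - 3 * ADJ:
        return ET - 3 * ADJ
    return ET + PER - 3 * ADJ

# Hypothetical values: ET = 200, PER = 100, ADJ = 20, so the cutoff is 140.
print(add_send_time(130, 200, 100, 20))  # before the cutoff
print(add_send_time(150, 200, 100, 20))  # after the cutoff: next period
```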
The next task TM′ is the analogue of task TM. Just like TM, the task TM′ is invoked when a processor's clock reads ET before the processor has received any authentic synchronization messages. However, there are some differences between TM and TM′. The most important is that TM′ must also add new processors to CLUSTER. Thus, when a processor p invokes TM′, it sends out a message "J(ET, JOINERS ∪ CLUSTER)" which says (essentially) "The time is ET; set CLUSTER to JOINERS ∪ CLUSTER." The message is sent to all of p's neighbors in JOINERS ∪ CLUSTER. (This is how we interpret the primitive SEND below.) The second half of the message does not convey any useful information to the processors currently in the cluster, since, as we shall see, they all agree on JOINERS and CLUSTER. However, it does convey useful information to the joiners. By getting copies of this message from a number of processors, they will learn which processors ought to be in the current cluster.
(In general, we would need to include the complete contents of replicated memory in this message, so that a joining processor would be able to set its replicated
memory appropriately. We do not consider the contents of replicated memory here.)

Another difference between TM and TM′ is the result of an optimization. We would like to allow a processor requesting to join to do so soon after it decides that it wants to join. However, we do not want to synchronize every PER units of time when there are no requests to join, since this unnecessary synchronization would result in excessive amounts of message traffic. Therefore, in this algorithm, a processor invokes TM′ only if JOINERS − CLUSTER ≠ ∅ (which means that there are requests of some new processors to join) or if LPER divides ET (where LPER is an appropriately chosen multiple of PER). Thus, if there are no joins, synchronizations occur roughly every LPER time units; if there are processors that want to join, TM′ will be invoked within roughly PER time units of the request. For simplicity, we assume PER to be relatively small, so that a processor requesting to join can do so soon after it makes the request.

There is one last minor subtlety in TM′. We mentioned above that it is possible that a joined processor p receives a request-to-join from q before time ET − 3·ADJ on p's clock, while another joined processor p′ receives q's request-to-join after ET − 3·ADJ. Assuming that p remains correct long enough to initiate an update to replicated memory, q will be added to the JOINERS set of all processors during the next synchronization period, although p′ will also have scheduled the sending of a message telling everyone to add q to the list of JOINERS. Since this would be an unnecessary update, for all processors q ∈ JOINERS, we remove from the UPDMSG list all pairs of the form (T, M), where the body of M is RTJ(q). Let REMOVE(UPDMSG, JOINERS) be the task which does this.

TM′ is run only by processors in the cluster (since they are the only ones with C defined). After the message is sent, a number of variables are updated appropriately. Besides the variables mentioned already, the algorithm uses new variables LASTV and LASTJ that record the last synchronization value sent out and the last value for JOINERS. (Initially, LASTV is undefined and LASTJ is ∅.) After the message is sent, LASTV is set to ET and LASTJ is set to JOINERS. In addition, ET is updated (by adding PER), CLUSTER is redefined to JOINERS ∪ CLUSTER, and JOINERS is set to ∅.

Task TM′
  if C = ET then begin
    if {(JOINERS − CLUSTER ≠ ∅) or (LPER divides ET)} then begin
      SIGN AND SEND J(ET, JOINERS ∪ CLUSTER);
      LASTV ← ET;
      CLUSTER ← JOINERS ∪ CLUSTER;
      LASTJ ← JOINERS;
      REMOVE(UPDMSG, JOINERS);
      JOINERS ← ∅;
    end;
    ET ← ET + PER;
  end
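The condition under which TM′ actually resynchronizes can be sketched as a predicate. This is an illustration, not the paper's code; `joiners` and `cluster` are Python sets standing for JOINERS and CLUSTER, and the numeric values are hypothetical.

```python
def tm_prime_fires(C, ET, joiners, cluster, LPER):
    """Sketch of when task TM' actually sends a synchronization message.

    TM' is triggered at C = ET, but it resynchronizes only if there are
    pending joiners (JOINERS - CLUSTER nonempty) or LPER divides ET.
    """
    if C != ET:
        return False
    return bool(joiners - cluster) or ET % LPER == 0

# Hypothetical values with LPER = 1000 (a multiple of PER = 100).
print(tm_prime_fires(300, 300, {"q"}, {"p"}, 1000))    # joiner pending
print(tm_prime_fires(300, 300, set(), {"p"}, 1000))    # nothing to do
print(tm_prime_fires(1000, 1000, set(), {"p"}, 1000))  # LPER boundary
```

This is what makes synchronization traffic scale with LPER when the membership is stable, yet drop back to a PER-length wait as soon as a join request is pending.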
We next describe task MSG′, which is the analogue of task MSG. In more detail, MSG′ works as follows: If processor p receives a message of the form J(T, R), signed by the set SIG of processors in CLUSTER, that is timely, i.e., T = ET, ET − |SIG|·E < C, and R = JOINERS ∪ CLUSTER, then, as before, p passes on the message, adjusts its clock to ET, sets CLUSTER to JOINERS ∪ CLUSTER, sets JOINERS to ∅, and increases ET by PER. In addition, p keeps track of the last values of ET and JOINERS. One new feature here (whose importance will become more apparent when we consider the next task) is that p records which processors signed the message, using a variable MSIG. MSIG consists of tuples of the form (T, R, SIG), where SIG is the set of processors (other than p itself) that are known to have signed a message of the form J(T, R). Initially MSIG is empty. For each T and R, we ensure that p always has at most one tuple of the form (T, R, SIG). We define MSIG(T, R) = SIG if (T, R, SIG) ∈ MSIG; otherwise we take MSIG(T, R) = ∅.

Task MSG′
  if {(an authentic message M of the form J(T, R) with signature set SIG is received) ∧ (JOINERS − CLUSTER ≠ ∅ or LPER divides ET) ∧ (SIG ⊆ CLUSTER) ∧ (T = ET) ∧ (R = JOINERS ∪ CLUSTER) ∧ (ET − |SIG|·E < C)} then begin
    SIGN AND SEND M;
    Δ ← ET − DT;
    LASTV ← ET;
    CLUSTER ← JOINERS ∪ CLUSTER;
    LASTJ ← JOINERS;
    REMOVE(UPDMSG, JOINERS);
    JOINERS ← ∅;
    ET ← ET + PER;
    MSIG(T, R) ← SIG;
  end
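The MSIG bookkeeping amounts to keeping one signature set per (T, R) pair. Below is a minimal Python sketch under the assumption that processors are represented by strings and R by a tuple; names and values are hypothetical.

```python
def record_signatures(msig, T, R, sig, me):
    """Sketch of the MSIG bookkeeping in task MSG'.

    msig maps a (T, R) pair to the set of processors other than `me`
    known to have signed J(T, R); at most one entry per (T, R) is kept.
    """
    key = (T, R)
    msig[key] = msig.get(key, set()) | (set(sig) - {me})
    return msig

# Hypothetical processors 'p' and 'q' sign the same J(500, R) message.
msig = {}
R = ("p", "q", "me")
record_signatures(msig, 500, R, {"p"}, "me")
record_signatures(msig, 500, R, {"q", "me"}, "me")
print(sorted(msig[(500, R)]))
```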
A joiner q is able to join by getting the support of at least f + 1 processors in the cluster. A joiner q gets the support of a processor p in the cluster when q gets a message of the form J(T, R) with q ∈ R signed by p. The problem is that forwarding the cluster message alone is not sufficient to guarantee that a joining processor will get sufficient support to join. The first message of the form J(T, R) that p forwards is not enough, since after forwarding the first one, p will set ET to ET + PER, so the second such message can no longer satisfy the requirement that T = ET. By using the following task FORWARD, p may still pass on a message of the form J(T, R) even if T ≠ ET. In fact, p will do so if all the following conditions are met:

● LASTJ ≠ ∅ (so that there are some processors waiting to join),
● T = LASTV and R = CLUSTER (so that the message is one that p sent before it adjusted its clock),
● |MSIG(T, R)| < f (so that p does not know of f processors besides itself who have previously signed this message),
● SIG − MSIG(T, R) ≠ ∅ (so that there are some new signatures on this message).

Task FORWARD
  if {(an authentic message M of the form J(T, R) with signature set SIG is received) ∧ (LASTJ ≠ ∅) ∧ (T = LASTV) ∧ (R = CLUSTER) ∧ (|MSIG(T, R)| < f) ∧ (SIG − MSIG(T, R) ≠ ∅)} then begin
    SIGN AND SEND M;
    MSIG(T, R) ← MSIG(T, R) ∪ SIG;
  end

The last task, JOIN, describes how a joining processor actually joins. When a processor q with ET undefined has seen messages of the form J(T, R) with at least f + 1 distinct signatures (so that |MSIG(T, R)| ≥ f + 1) and q ∈ R (so that q is one of the JOINERS), then q sets C to T, sets ET to T + PER, sets CLUSTER to R, and sets JOINERS and LASTJ to ∅. At that point q has joined the cluster. Recall that if q is correct, then q is joined if and only if q has ET defined. As we prove formally in the next section, our assumptions guarantee that q collects f + 1 signatures on a message of the form J(T, R) within a short time after the first correct processor sets its clock to T; thus, q's clock is indeed close to that of all the other correct processors at this point.

Task JOIN
  if {a processor in R with ET undefined receives an authentic message of the form J(T, R) with signature set SIG} then begin
    if {|MSIG(T, R)| < f + 1} then MSIG(T, R) ← MSIG(T, R) ∪ SIG;
    if {|MSIG(T, R)| ≥ f + 1} then begin
      Δ ← T − DT;
      ET ← T + PER;
      JOINERS ← ∅;
      LASTJ ← ∅;
      CLUSTER ← R;
    end
  end

This completes the description of the algorithm.
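Task JOIN's threshold behavior can be sketched as follows. This is an illustration only: the tuple returned stands for the joiner's new (Δ, ET, CLUSTER) state, and all names and numeric values are hypothetical.

```python
def try_join(msig, T, R, sig, me, f, PER, DT):
    """Sketch of task JOIN for a processor `me` appearing in R with ET undefined.

    Signatures on J(T, R) accumulate in msig; once at least f + 1 distinct
    processors have signed, `me` adopts clock time T. Returns the joiner's
    new (Delta, ET, CLUSTER) state once joined, and None before that.
    """
    key = (T, R)
    msig[key] = msig.get(key, set()) | set(sig)
    if len(msig[key]) >= f + 1:
        return (T - DT, T + PER, R)
    return None

# Hypothetical run with f = 1, PER = 100, DT = 5: joining needs 2 signatures.
msig = {}
R = ("p1", "p2", "me")
print(try_join(msig, 700, R, {"p1"}, "me", 1, 100, 5))  # one signature: not yet
print(try_join(msig, 700, R, {"p2"}, "me", 1, 100, 5))  # second signature: joined
```

The f + 1 threshold is what guarantees, via A4, that at least one of the signers is correct.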
8. Analysis of the Join Algorithm

In this section, we choose the parameters used in the algorithm of the previous section and the conditions CS1–CS4 so that they satisfy the following conditions. Our parameter definitions are similar to those used in algorithm M, but there are some differences: we use the Strong Separation Inequality, we use LPER rather than PER in defining DMAX, and we assume e > 2d rather than e > d. This latter choice allows a bigger window to give the joiners time to join the cluster. We choose the parameters for Algorithm 𝒥 as follows:
● e > 2d,
● LPER is an integer multiple of PER,
● DMAX = (1 + ρ)e + 2ρ·LPER,
● ADJ = (f + 1)E,
● E > DMAX, and
● PER > 4·ADJ.
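The six conditions can be checked mechanically. The values below are purely illustrative (they are not taken from the paper); the sketch merely verifies that the conditions are simultaneously satisfiable.

```python
def parameters_consistent(e, d, PER, LPER, E, ADJ, DMAX, f, rho):
    """Check the six conditions on the join algorithm's parameters."""
    return (e > 2 * d
            and LPER % PER == 0
            and DMAX == (1 + rho) * e + 2 * rho * LPER
            and ADJ == (f + 1) * E
            and E > DMAX
            and PER > 4 * ADJ)

# One hypothetical assignment satisfying all six conditions.
rho, d, f = 1e-6, 0.01, 2
e = 0.03
PER, LPER = 1.0, 1000.0
DMAX = (1 + rho) * e + 2 * rho * LPER
E = 0.05
ADJ = (f + 1) * E
print(parameters_consistent(e, d, PER, LPER, E, ADJ, DMAX, f, rho))
```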
Let 𝒥 be the join synchronization algorithm described in Section 7, with parameters chosen to satisfy the conditions above.

THEOREM 8.1. Under assumptions A1–A4, A5′, A6, A7, and A8, every run of algorithm 𝒥 satisfies P1–P4 and CS1(i)–CS4(i) for all i ≥ 0. Moreover, a correct processor p that requests to join will do so within (1 + ρ)(PER + 3·ADJ + DMAX) + 3d of the time the request is sent. In addition, fewer than n² messages are sent for each synchronization value for which there are no joiners and n joined processors, and fewer than (k + f + 1)·n² messages for each synchronization value for which there are k ≥ 1 joiners and n joined processors.

From Theorem 2.1, we immediately get the following corollary to Theorem 8.1:

COROLLARY 8.2. Under assumptions A1–A4, A5′, A6, A7, and A8, algorithm 𝒥 maintains LES and achieves a bounded join time.

The proof of Theorem 8.1 is similar to that of Theorem 4, proceeding by a sequence of lemmas that are almost identical to the corresponding lemmas of Section 4 (modulo a bounded number of changes: the tasks TM′ and MSG′ of 𝒥 rather than tasks TM and MSG of M, occasionally task JOIN, and LPER instead of PER). Thus, we leave the proofs of these lemmas to the reader, indicating only the major changes to the corresponding proofs.

LEMMA 8.3. Every run of 𝒥 satisfies P1, P2, P3, and P4.

CHANGES FROM THE PROOF OF LEMMA 4.2.1. We must check that these properties hold for joining processors as well as for processors already in the cluster. The only difficulty is showing that P2 holds for the joining processors. If not, consider the first joining processor, say p, for which P2 fails. By inspection of task JOIN and our assumptions about initialization, we can show that p sets ET to T + PER, where T is a synchronization value that must have been sent by at least one correct joined processor. Since by hypothesis T must be a multiple of PER, we are done. □

LEMMA 8.4. Let t be a critical time for p. Then either (a) C_p(t+) = 0, (b) C_p(t) is undefined and C_p(t+) is defined, or (c) by time t, p receives a synchronization message, signed by some other correct processor, with synchronization value greater than C_p(t+) − f·E.

CHANGES TO THE PROOF OF LEMMA 4.2.4. It is now possible that t could be a critical time for p because p joined at t. But in this case p must have received messages with synchronization value C_p(t+) signed by f + 1 processors, one of which must be correct by A4. Thus, (c) holds. □

LEMMA 8.5. If i ≥ 0 and t_i is finite, then (1) t_i < t_{i+1} and (2) there is a processor p that is correct at t_i such that C_p(t_i) > v_i − f·E and ET_p(t_i) = v_i.

LEMMA 8.6. In every run of 𝒥 and for all i ≥ 0, if part (a) of CS3(i) holds and t_i is finite, then t_{i+1} > t_i + e.

LEMMA 8.7. If CS3(i) and CS4(i) hold in a run of 𝒥, then so does CS1(i).

CHANGES TO THE PROOF OF LEMMA 4.2.7. Whereas before we could show that there could be no point t′ in the interval such that C_p(t′) = PER, we can now show that there is no point t′ in the interval such that C_p(t′) = LPER. Thus, we need to replace PER by LPER in the expression for DMAX. □
LEMMA 8.8. If CS3(i + 1) holds, then so does CS2(i).

LEMMA 8.9. If CS3(i) holds in a run of 𝒥, and if v_i ≤ ET_p(t) ≤ v_{i+1} implies that ET_p is defined throughout the interval [t_i + e, t], then CS4(i) also holds.

Note that the hypotheses of Lemma 8.9 are stronger than those of the corresponding Lemma 4.2.9, since we now have the clause "v_i ≤ ET_p(t) ≤ v_{i+1} implies that ET_p is defined throughout the interval [t_i + e, t]". This clause is necessary; since we now allow joining, it is not necessarily the case that a processor with v_i < ET_p(t) ≤ v_{i+1} has been joined since the time t_i + e. We also need some additional hypotheses in order to prove an analogue of Lemma 4.2.10. Define:
NEW0(i). If i ≥ 1, then no processors can join in the interval [t_{i−1} + e, t_i]; moreover, if processor p joins after time t_{i−1} + e, then it must be as a result of receiving a message of the form J(v_j, R) with j ≥ i and p ∈ R.

NEW1(i). If a correct processor signs a message of the form J(v_i, R), then it does so first during the interval [t_i, t_i + d], and all the processors in R still correct at t_i + 2d have joined prior to that time.

NEW2(i). If i ≥ 1 and v_{i−1} < ET_p(t) ≤ v_i, then ET_p is defined throughout the interval [t_{i−1} + e, t].

NEW3(i). If p and q are joined correct processors at time t_i, then JOINERS_p(t_i) = JOINERS_q(t_i) and CLUSTER_p(t_i) = CLUSTER_q(t_i).

NEW4(i). If p and q are joined correct processors at time t_i + e, then JOINERS_p(t_i + e) = ∅ and CLUSTER_p(t_i + e) = CLUSTER_q(t_i + e).

NEW5(i). If p is correct at time t < t_{i+1} and LASTV_p(t) is defined, then LASTV_p(t) = v_j for some j ≤ i.

LEMMA 8.10. CS3(i), NEW0(i), NEW1(i), NEW2(i), NEW3(i), NEW4(i), and NEW5(i) hold for all i ≥ 0 in every run of 𝒥.

PROOF. We proceed by induction on i. For the case i = 0, the proof of
CS3(0) proceeds just as in Lemma 4.2.10, so we omit it. NEW0(0) and NEW2(0) are vacuously true, since they apply only when i ≥ 1. NEW1(0) holds because v_0 = 0 by definition, and, by P2, no correct processor signs a message of the form J(0, R). For NEW3(0), note that there are no joined correct processors at time t_0. For NEW4(0), suppose that p and q are joined correct processors at time t_0 + e. Notice that p and q must have been part of the initial cluster, since no processor can join until after a synchronization value has been sent out. This cannot happen before some initially correct processor has executed task TM′ or MSG′, and, by Lemma 8.6, that cannot happen before time t_0 + e. When p and q are initialized (which, by assumption, happens at some time in the interval [t_0, t_0 + e)), JOINERS is set to ∅ and CLUSTER is set to R_0. JOINERS is changed from this initial setting only by using the synchronous update service. By P2 and Theorem 6.2, an update by the synchronous update service is performed only at a clock time of the form ET − ADJ, which is at least PER − ADJ. By A1 and P1, no correct processor's clock reads PER − ADJ until after time t_0 + e. Thus, we must have JOINERS_p(t_0 + e) = JOINERS_q(t_0
+ e) = ∅. Since CLUSTER is updated only when a new synchronization value is sent out, it follows that CLUSTER_p(t_0 + e) = CLUSTER_q(t_0 + e) = R_0. For NEW5(0), observe that LASTV is defined only by execution of either task TM′ or task MSG′, so no correct processor has LASTV defined until t_1.

For the inductive step, assume that all our hypotheses hold for j ≤ i; we show that they hold for i + 1. We first prove NEW0(i + 1). Suppose processor p joins at time t > t_i + e. It must be as a result of receiving
messages of the form J(T, R) with a total of at least f + 1 signatures and p ∈ R. By A4, one of these signatures must be that of a correct processor, say q. Tasks TM′, MSG′, and FORWARD guarantee that T must be a synchronization value v_j with j ≥ 1. By definition of t_j, we must have t > t_j. To prove NEW0(i + 1), it suffices to show that j ≥ i + 1. Suppose j < i + 1, so that we can apply our induction hypotheses. Thus, by NEW1(j), q must have signed the message J(v_j, R) during the interval [t_j, t_j + d], and p must have joined or failed before t_j + 2d. By A8, p cannot correctly request to join twice with the same name. Thus, p must have joined at or before t_j + 2d ≤ t_i + e, contradicting the original assumption that t > t_i + e. This proves NEW0(i + 1).
NEW2(i + 1) is immediate from NEW0(i + 1).

Next we show that the hypotheses of Theorem 6.2 hold: P1–P4, CS1(i), CS4(i), parts (a) and (b) of CS3(i + 1), and if t_i is finite then t_{i+1} > t_i + e. From Lemma 8.9 and NEW2(i + 1) we have CS4(i), and by Lemma 8.7, we have CS1(i). Moreover, the proof that parts (a) and (b) of CS3(i + 1) hold is now identical to that of Lemma 4.2.10, and is omitted. (Note that the proof of Lemma 4.2.10 uses the fact that no processor joins in the interval [t_i + e, t_{i+1}]; this follows from NEW0(i + 1).) By Lemma 8.6, since t_i is finite, t_{i+1} > t_i + e. This completes the proof that the hypotheses of Theorem 6.2 hold. By Theorem 6.2, SU1(i) and SU2(i) hold.

To prove NEW3(i + 1), we suppose that p and q are joined correct processors at t_{i+1}. Since no processor can join in the interval [t_i + e, t_{i+1}], it must be the case that p and q were joined correct at t_i + e. By NEW4(i), JOINERS_p(t_i + e) = JOINERS_q(t_i + e) = ∅. All the updates to JOINERS in the interval [t_i + e, t_{i+1}] must have occurred as a result of using the synchronous update service of the algorithm. By SU2(i), all these updates are performed by both p and q, so we know that JOINERS_p(t_{i+1}) = JOINERS_q(t_{i+1}). There are updates to CLUSTER_p and CLUSTER_q only when a synchronization value is sent. Thus, since no updates to CLUSTER_p and CLUSTER_q occur in the interval [t_i + e, t_{i+1}), CLUSTER_p(t_{i+1}) = CLUSTER_q(t_{i+1}) follows from NEW4(i). (If there is an update to CLUSTER_p or CLUSTER_q at t_{i+1}, then we may have CLUSTER_p(t_{i+1}+) ≠ CLUSTER_q(t_{i+1}+).) This proves NEW3(i + 1).

The proof of part (c) of CS3(i + 1) is the same as that of Lemma 4.2.10.
We next prove NEW1(i + 1). Suppose some p is a correct processor that signs a message of the form J(T_{i+1}, R). The first time p signs such a message, it must be as the result of executing either task TM' or task MSG'. By inspection of the algorithm, only a joined processor with ET set to T_{i+1} can correctly sign a message of the form J(T_{i+1}, R) using task TM' or MSG'. By NEW0(i + 1), if a processor q joins after time t_i + e, then q must set its initial value of ET to at least T_{i+1} + PER, so q cannot sign such a message. Thus, p must have been joined at time t_i + e. An additional inspection of the algorithm shows that we must have R = CLUSTER_p(t_{i+1}) ∪ JOINERS_p(t_{i+1}). By NEW3(i + 1), for any other processor q that is joined at time t_{i+1}, we must have CLUSTER_q(t_{i+1}) = CLUSTER_p(t_{i+1}) and JOINERS_q(t_{i+1}) = JOINERS_p(t_{i+1}). Using this observation, as in the proof of Lemma 4.2.10, we can show that if q is correct and joined at t_{i+1}, and q is still correct at time t_{i+1} + d, then q sets ET to T_{i+1} + PER at some time t_q ∈ [t_{i+1}, t_{i+1} + d). At time t_q, q signs and sends out a message of the form J(T_{i+1}, R) and sets JOINERS_q = ∅, CLUSTER_q = R, and ET_q = T_{i+1} + PER. This proves the first half of NEW1(i + 1).

We still must prove that any processor q ∈ R that is correct at t_{i+1} + 2d has joined before that time. Without loss of generality, assume that q has not joined by time t_{i+1}. We now show that q has joined before t_{i+1} + 2d, and in addition that when q joins, it sets ET_q to T_{i+1} + PER. This will prove part (d) of CS3(i + 1) in addition to proving NEW1(i + 1). It suffices to show that q receives a total of f + 1 signatures on messages of the form J(T_{i+1}, R) by time t_{i+1} + 2d. By assumption A6, there are at least f + 1 processors that are correct and joined in the interval [t_{i+1}, t_{i+1} + d). Let p be a processor that is correct and joined in this interval. By previous arguments, p sends out a message of the form J(T_{i+1}, R) at some time u in the interval [t_{i+1}, t_{i+1} + d). Consider the sequence of processors p_1, ..., p_k with p = p_1 and q = p_k guaranteed to exist by A2, with t = u. If p's message does not diffuse to q, this must be because some p_i earlier sent out messages of this form with a total of f + 1 signatures. Thus, either q receives p's message by time u + 2d ≤ t_{i+1} + 2d, or q has already received messages of the required form with a total of f + 1 signatures by time u + 2d. Since there are f + 1 correct joined processors, q will receive messages of this form with a total of f + 1 signatures by time t_{i+1} + 2d. When q gets these messages, it sets ET_q appropriately. Hence, q joins at some time t_q ∈ [t_{i+1}, t_{i+1} + 2d], and at time t_q, q sets C_q = T_{i+1}, ET_q = T_{i+1} + PER, JOINERS_q = ∅, and CLUSTER_q = R.

Part (e) of CS3(i + 1) follows from part (d) for processors that were already joined at t_{i+1} + e, as in Lemma 4.2.10. For a processor p that joins after t_{i+1} + e, NEW0(i + 1) shows that p must have joined as a result of receiving a message of the form J(T_j, R) with j > i + 1. Thus, at this point p sets ET_p to T_j + PER ≥ T_{i+1} + PER. The result now follows from P1.

For NEW5(i + 1), observe that for any correct processor p, LASTV_p is initially undefined; by inspection of the tasks of the algorithm, it is clear that LASTV_p is reset only at a critical time for p. Moreover, if LASTV_p is reset at time u, then LASTV_p(u+) = ET_p(u), ET_p(u+) = ET_p(u) + PER, and ET_p(u) = T_j for some j. If LASTV_p(t) is defined, then LASTV_p(t) ≤ ET_p(t) − PER. If t ≤ t_{i+1}, then ET_p(t) ≤ T_{i+1} by parts (a) and (b) of CS3(i + 1). Thus, LASTV_p(t) < T_{i+1}, and LASTV_p(t) = T_j for some j ≤ i. This proves NEW5(i + 1).

It remains to prove NEW4(i + 1). Suppose that q is a joined correct processor at time t_{i+1} + e. We want to show that JOINERS_q(t_{i+1} + e) = ∅ and CLUSTER_q(t_{i+1} + e) = R_{i+1}. If q is already joined at time t_{i+1}, then our previous arguments show that at some time t_q ∈ [t_{i+1}, t_{i+1} + d), q sets CLUSTER_q = R_{i+1} and JOINERS_q = ∅. JOINERS_q can become nonempty after t_q only if there is an update to synchronous memory. By Theorem 6.2, there can be such an update only at a time t such that C_q(t) = k · PER − ADJ for some k. By part (c) of CS3(i + 1), there cannot be such a time in the interval [t_{i+1}, t_{i+1} + e]. Thus, JOINERS_q(t_{i+1} + e) = ∅. Similarly, an inspection of the tasks of the algorithm shows that CLUSTER_q can change values only at a critical time for q. Since ET_q(t_q) = T_{i+1} + PER, the next critical time for q after t_q must come at or after t_{i+2}. By Lemma 8.6, t_{i+2} > t_{i+1} + e. Thus, we can conclude that CLUSTER_q(t_{i+1} + e) = R_{i+1}.
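The state transition performed at the moment of joining, as used in this proof (set C to T_{i+1}, ET to T_{i+1} + PER, empty JOINERS, and install the new cluster R), can be sketched as follows. The class and function names are ours, for illustration only; the field names follow the paper's variables.

```python
from dataclasses import dataclass, field

@dataclass
class ClockState:
    """Illustrative per-processor state (not the paper's pseudocode)."""
    C: float = 0.0          # logical clock value
    ET: float = 0.0         # end time of the current period, on the logical clock
    JOINERS: set = field(default_factory=set)
    CLUSTER: set = field(default_factory=set)

def complete_join(state, T_next, PER, R):
    """On accepting synchronization value T_{i+1} with cluster R, a joining
    processor q sets C_q = T_{i+1}, ET_q = T_{i+1} + PER, JOINERS_q = {},
    and CLUSTER_q = R, as in the proof of NEW1(i + 1)."""
    state.C = T_next
    state.ET = T_next + PER
    state.JOINERS = set()
    state.CLUSTER = set(R)

q = ClockState()
complete_join(q, T_next=300.0, PER=100.0, R={"p", "q", "r"})
assert q.ET == 400.0 and q.JOINERS == set()
```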
Now, suppose that q joins at some time t_q ∈ [t_{i+1}, t_{i+1} + e]. By NEW0(i + 1), this must be as a result of receiving a message of the form J(T_j, R) with q ∈ R and j ≥ i + 1. By the same arguments as used in the proof of NEW0(i + 1), q must receive such a message signed by a correct processor, say q'. By inspection of the tasks of the algorithm, it is immediate that q' can correctly sign such a message at time t only if T_j = ET_{q'}(t) or T_j = LASTV_{q'}(t). If LASTV_{q'}(t) is defined, then LASTV_{q'}(t) = ET_{q'}(t) − PER, and ET_{q'}(t) ≤ T_{i+1} + PER for t ≤ t_{i+1} + e. In either case T_j ≤ T_{i+1}, so j = i + 1, and when q joins it sets CLUSTER_q = R_{i+1} and JOINERS_q = ∅. As above, these values cannot change before t_{i+1} + e. This proves NEW4(i + 1), completing the induction.

Finally, we bound how long it takes a processor to join. Suppose p requests to join at time u, and that it is connected to a joined processor q that remains correct sufficiently long to invoke the update to JOINERS; this is enough to guarantee that p joins the cluster soon thereafter. In more detail, suppose q receives p's request-to-join message at time t. By A2, we have t − u ≤ d. There are two cases: (1) C_q(t) ≤ ET_q(t) − 3·ADJ, and (2) C_q(t) > ET_q(t) − 3·ADJ.

Case (1) is straightforward: q signs and sends the message SYNC(ET_q(t) − 3·ADJ, M) at or before the time its clock reads ET_q(t) − 3·ADJ. (The message may be sent earlier if some other processor also received p's request-to-join message and started an update.) By Theorem 6.1, all processors still correct at time t_{i+1} will have added p to JOINERS by time ET_q(t) − ADJ on their local clocks. It follows that JOINERS − CLUSTER will be nonempty at local clock time ET_q(t) − ADJ, from which we get that a synchronization attempt will take place with value ET_q(t). Thus, T_{i+1} = ET_q(t), and t_{i+1} is the first time a correct processor sends a message with synchronization value ET_q(t). Since we have assumed that q remains correct for at least (1 + ρ)·PER, it is easy to show that q is still correct at time t_{i+1}, and this time is no more than (1 + ρ)·PER after q receives p's message. From NEW1(i + 1), it follows that if p is still correct at t_{i+1} + 2d, p will have joined by then. Thus, p joins within (1 + ρ)·PER + 3d of when p sends its request-to-join message.

For case (2), the update adding p to JOINERS may happen too late for a synchronization attempt to take place with value ET_q(t). If this happens, then unless p has already joined, a synchronization attempt will take place with value ET_q(t) + PER, so the corresponding synchronization value is T_j with j = i + 1 or j = i + 2. (If p has already joined, we are back in the situation of case (1).) If q is still correct at time t_j, then q signs and sends the message SYNC(ET_q(t) + PER − 3·ADJ, M) by the time its clock reads ET_q(t) + PER − 3·ADJ, and this time is at most (1 + ρ)(PER + 3·ADJ) after q receives p's request-to-join message. As in case (1), it now follows that p joins at most (1 + ρ)(PER + 3·ADJ) + 3d after it sends its request-to-join message.

If q is not correct at time t_j, suppose q' is a correct joined processor at time t_j. We must have C_{q'}(t_j) ≤ ET_q(t) + PER by CS3(j)(b) and P3. We know that q is correct at the time t' at which its clock reads ET_q(t) + PER − 3·ADJ, and this time is at most (1 + ρ)·PER after q receives p's request-to-join message. Since C_q(t') = ET_q(t) + PER − 3·ADJ > T_{j−1} + ADJ, it follows from NEW2(j − 1), part (a) of CS3(j − 1), and CS1(j − 1) that t' > t_{j−1} + e and that C_{q'}(t') > C_q(t') − DMAX > T_{j−1} + ADJ − DMAX. From part (c) of CS3(j − 1), by P2 and P3 it then follows that t_j ≤ t' + (1 + ρ)(3·ADJ + DMAX). Thus, t_j ≤ t + (1 + ρ)(PER + 3·ADJ + DMAX). Since p joins by time t_j + 2d (if it is still correct then), and p sends its request-to-join message at most d before t, we get the desired bounds.

At most n² messages are sent if no processor requests to join, just as in the case of the basic algorithm. If k processors request to join, each request-to-join causes one update to replicated memory, resulting in k·n² messages. In addition, each joined processor may send up to f + 1 messages (using task FORWARD), giving a further (f + 1)n² messages, or (k + f + 1)n² messages in all. ❑
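The message count above is a simple product. A sketch, with the paper's n (processors), k (join requests), and f (faults tolerated); the function name is ours:

```python
def total_messages(n, k, f):
    """Message bound from the complexity argument: n^2 messages for the
    basic resynchronization, k*n^2 from the k request-to-join updates to
    replicated memory, and up to (f+1)*n^2 from forwarding (task FORWARD),
    for (k + f + 1) * n^2 in all."""
    return (k + f + 1) * n * n

# For example, with n = 10 processors, k = 2 joiners, and f = 1:
assert total_messages(10, 2, 1) == 400
```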
In this algorithm, the dominant term in DMAX is (1 + ρ)e, and we have e = 2d, whereas in the basic resynchronization algorithm, we have e = d. This factor of 2 is introduced by the late signature gathering process. It can be eliminated by having yet another synchronization after all the processors have joined. This is essentially the technique used in an earlier version of this paper [Halpern et al. 1984].

We now discuss how to relax assumption A8, which states that rejoining processors must use new signatures. If the JOIN task is modified so that a processor will continue to advance its clock according to JOIN (i.e., continue to execute JOIN) until an interval of length (1 + ρ)(PER + 3·ADJ + DMAX) + 3d has elapsed from the time it requested to join, then we no longer need the assumption that a rejoining processor must use a new signature. A processor may be convinced to set its clock using messages left over from a previous attempt to join; but provided our other assumptions hold, it will have advanced to the correct time within the prescribed time bound. Of course, it may not actually send any synchronization messages or be considered to have a defined ET until the time bound has elapsed on its duration timer. The details are left to the reader.

We have assumed that no name is ever removed from the names of processors that are left in a cluster. Since the replicated memory grows as processors join, from time to time it may be convenient to remove the names of processors that no longer participate. The mechanism for detecting or deciding that they should be removed is outside the scope of this paper. One method of accomplishing such removal is by an update to synchronous replicated memory, using a task analogous to ADD. Again, the details are left to the reader.
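The waiting interval in the modified JOIN task above is a fixed function of the algorithm's constants. A small sketch; the numeric constants in the example are illustrative, not taken from the paper:

```python
def rejoin_wait(rho, PER, ADJ, DMAX, d):
    """Interval a rejoining processor must keep executing JOIN so that
    stale messages from an earlier join attempt cannot leave it
    unsynchronized: (1 + rho)(PER + 3*ADJ + DMAX) + 3d."""
    return (1 + rho) * (PER + 3 * ADJ + DMAX) + 3 * d

# Illustrative constants: PER = 100, ADJ = 2, DMAX = 5, d = 1, rho = 1e-4.
w = rejoin_wait(1e-4, 100.0, 2.0, 5.0, 1.0)
# The wait strictly exceeds the drift-free sum PER + 3*ADJ + DMAX + 3d:
assert w > 100.0 + 3 * 2.0 + 5.0 + 3 * 1.0
```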
9. A Continuous Clock Solution

The logical clock defined by processor p's current clock C in the previous algorithm is not continuous, since it may be set forward by any amount smaller than ADJ. It is clearly piecewise continuous. There are some applications for which it may be advantageous to have a continuous clock. As already noted by Lamport and Melliar-Smith [1985], we can eliminate these discontinuities by amortizing the clock adjustments over time. We briefly sketch how the algorithm of Section 3 can be modified to do this; the required modifications are minimal. A similar construction also works for the join algorithm presented in Section 7.

To simplify matters, we first show how to keep a clock C', defined below, continuous. We introduce two new variables, OLDA and SAVE. We set OLDA = A and SAVE = DT at initialization, and add the following lines to task MSG, before the line A ← ET − DT:

● SAVE ← DT;
● OLDA ← A;

Let INT be a constant chosen such that 0 < INT ≤ PER − ADJ. (By the Strong Separation Inequality, such a choice is possible.) We introduce A', a continuous approximation to A. Suppose that A(t) ≠ A(t+). We set OLDA(t+) ← A(t), thereby saving the old value of A before updating it. Then, instead of increasing the value of A' immediately to A(t+), we amortize this increase over an interval of length INT. Thus, we have the following definition of A':

if DT ≤ SAVE + INT
  then A' ← OLDA + (A − OLDA)(DT − SAVE)/INT
  else A' ← A.

Define C'(t) = DT(t) + A'(t). It is easy to check that A' is a continuous function of time, and hence so is C'. Moreover, at any time t we have A'(t) ≤ A(t), and if either OLDA(t) = A(t) or DT(t) > SAVE(t) + INT, then A'(t) = A(t). Our revised algorithm guarantees that if C(t+) = ET − PER, then DT(t) = SAVE(t+), since SAVE is set to DT at exactly the time t that A is adjusted. It follows that if C(t) > ET − ADJ, then DT(t) > SAVE(t) + PER − ADJ ≥ SAVE(t) + INT, and hence that C(t) = C'(t). With this observation, it is easy to check that we could have replaced C by C' in the algorithm and obtained the same result for every test where C was used. We leave the details to the reader.
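The amortization rule of this section translates directly into code. The following sketch mirrors the definitions of A' and C' above (the function names are ours); it assumes a single adjustment of A, recorded in OLDA and SAVE:

```python
def amortized_adjustment(DT, A, OLDA, SAVE, INT):
    """Continuous approximation A' to the adjustment A: after A jumps at
    duration-timer reading SAVE, interpolate linearly from OLDA to A over
    an interval of length INT, then track A exactly."""
    if DT <= SAVE + INT:
        return OLDA + (A - OLDA) * (DT - SAVE) / INT
    return A

def continuous_clock(DT, A, OLDA, SAVE, INT):
    """C'(t) = DT(t) + A'(t); continuous in DT even when A jumps."""
    return DT + amortized_adjustment(DT, A, OLDA, SAVE, INT)

# Suppose A jumped from 4 to 6 when the duration timer read SAVE = 50,
# with INT = 10. At the jump C' still reads 50 + 4; by DT = 60 the full
# adjustment is in effect; in between, C' rises continuously.
assert continuous_clock(50.0, 6.0, 4.0, 50.0, 10.0) == 54.0
assert continuous_clock(55.0, 6.0, 4.0, 50.0, 10.0) == 60.0
assert continuous_clock(60.0, 6.0, 4.0, 50.0, 10.0) == 66.0
```

Note that C' runs slightly fast (at rate 1 + (A − OLDA)/INT per unit of DT) during the amortization interval, which is exactly the behavior the choice INT ≤ PER − ADJ makes harmless.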
10. Conclusion

We have described an algorithm that periodically resynchronizes clocks. The algorithm can tolerate arbitrary link and processor failures as long as messages can diffuse through the network within some preassigned time bound. We have also provided a technique for initializing clocks, and have shown how our algorithm could be extended to allow new processors to join the network. The constants in our algorithm are reasonable for many practical applications. We have suggested a number of ways throughout the paper that the performance of the algorithm could be improved, and we suspect that further improvements are possible. A variant of this algorithm, for which the join is not so fault-tolerant, has been implemented for a prototype highly available system at the IBM Almaden Research Center [Griefer and Strong 1988].

The join algorithm provided in this paper represents a compromise between the simplicity of allowing updates to replicated memory on demand and the complexity of providing fast response to updates of a data structure we call synchronous replicated memory. We provide logically synchronous updates to replicated memory only at scheduled times. We have chosen to simplify the process of joining and maintaining synchronous replicated memory by scheduling these updates only when we desynchronize clocks, so the overhead is very small unless there is a processor waiting to join; the time to join depends on the resynchronization period, and providing updates on demand would require a much larger overhead. Our use of synchronous replicated memory is in the spirit of the state-machine approach pioneered by Lamport [1978a, 1978b, 1984]. Moreover, our basic resynchronization algorithm without its timeliness tests is a minor variant of a scheme proposed by Lamport [1978a]. The advantage and main contribution of our approach lies in the simplicity of our algorithms together with their fault-tolerance properties (not shared by the original Lamport scheme).

ACKNOWLEDGMENTS. The authors would like to thank the referees for undertaking to read the entire paper carefully and for many helpful suggestions.
REFERENCES

CRISTIAN, F. 1989. Probabilistic clock synchronization. Distrib. Comput. 3, 3 (July), 146-158.
CRISTIAN, F., AGHILI, H., STRONG, H. R., AND DOLEV, D. 1986. Atomic broadcast: From simple message diffusion to Byzantine agreement. IBM Tech. Rep. RJ 5244. IBM, San Jose, Calif.
DOLEV, D., HALPERN, J. Y., SIMONS, B. B., AND STRONG, H. R. 1987. A new look at fault tolerant network routing. Inf. Comput. 72, 180-196.
DOLEV, D., HALPERN, J. Y., AND STRONG, H. R. 1986. On the possibility and impossibility of achieving clock synchronization. J. Comput. Syst. Sci. 32, 2 (Apr.), 230-250.
DOLEV, D., AND STRONG, H. R. 1983. Authenticated algorithms for Byzantine agreement. SIAM J. Comput. 12, 4 (Nov.), 656-666.
GRIEFER, A. D., AND STRONG, H. R. 1988. DCF: Distributed communication with fault tolerance. In Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing (Toronto, Ont., Canada, Aug. 15-17). ACM, New York, pp. 18-27.
HALPERN, J. Y., MEGIDDO, N., AND MUNSHI, A. 1985. Optimal precision in the presence of uncertainty. J. Complexity 1, 2 (June), 170-196.
HALPERN, J. Y., SIMONS, B. B., STRONG, H. R., AND DOLEV, D. 1984. Fault-tolerant clock synchronization. In Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing (Vancouver, B.C., Canada, Aug. 27-29). ACM, New York, pp. 89-102.
KRISHNA, C. M., SHIN, K. G., AND BUTLER, R. W. 1985. Ensuring fault tolerance of phase-locked clocks. IEEE Trans. Comput. C-34, 8, 752-756.
LAMPORT, L. 1978a. Time, clocks and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558-565.
LAMPORT, L. 1978b. The implementation of reliable distributed multiprocess systems. Comput. Netw. 2, 2 (May), 95-114.
LAMPORT, L. 1984. Using time instead of timeout for fault-tolerant distributed systems. ACM Trans. Prog. Lang. Syst. 6, 2 (Apr.), 254-280.
LAMPORT, L., AND MELLIAR-SMITH, P. M. 1985. Synchronizing clocks in the presence of faults. J. ACM 32, 1 (Jan.), 52-78.
LUNDELIUS, J., AND LYNCH, N. 1984. An upper and lower bound for clock synchronization. Inf. Control 62, 2/3 (Aug./Sept.), 190-204.
MARZULLO, K. 1983. Loosely-coupled distributed services: A distributed time system. Ph.D. dissertation. Stanford Univ., Stanford, Calif.
PEASE, M., SHOSTAK, R., AND LAMPORT, L. 1980. Reaching agreement in the presence of faults. J. ACM 27, 2 (Apr.), 228-234.
RAMANATHAN, P., SHIN, K. G., AND BUTLER, R. W. 1990. Fault-tolerant clock synchronization in distributed systems. IEEE Comput. (Oct.), 33-42.
RIVEST, R. L., SHAMIR, A., AND ADLEMAN, L. 1978. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 21, 2 (Feb.), 120-126.
SCHNEIDER, F. B. 1987. Understanding protocols for Byzantine clock synchronization. Tech. Rep. Dept. of Computer Science, Cornell Univ., Ithaca, N.Y.
SRIKANTH, T. K., AND TOUEG, S. 1987. Optimal clock synchronization. J. ACM 34, 3 (July), 626-645.
WELCH, J. LUNDELIUS, AND LYNCH, N. 1988. A new fault-tolerant algorithm for clock synchronization. Inf. Comput. 77, 1, 1-36.

RECEIVED MARCH 1989; REVISED JULY 1989; ACCEPTED FEBRUARY 1994

Journal of the Association for Computing Machinery, Vol. 42, No. 1, January 1995.