An Integrated Fault Tolerance Framework for Service Oriented Computing

Stephen Hall

2010

Thesis submitted for the degree of Doctor of Philosophy at Lancaster University.

Department of Computing, Infolab 21, South Drive, Lancaster University, Lancaster, Lancashire, LA1 4YW.

Abstract

Late binding to services in business to business operations poses a serious problem for dependable system operation and trust. If third party services are to be trusted they need to be dependable. Fault tolerance, or FT, builds reliable systems using mediated replication techniques. However, these techniques are yet to be adopted in a systematic way within service oriented computing. The barriers to adoption are the perceived high cost of replication, a disagreement on whether dynamic binding is a realistic option and a lack of experience with FT, so that techniques are unproven. We have developed an FT framework for service oriented computing that can support multiple protocols. Our approach is based on comparing the reliability and performance characteristics of different FT configurations. The framework is unique in several key respects. It supports an asynchronous messaging model with LAMB, a novel service oriented architecture. LAMB works by routing SOAP messages to WSDL described services based on their content, in addition to adhering to a set of principles. The FT process model is supported by Sandbox, a container that exposes protocols as services. Sandbox makes FT service development easier by supporting a set of facilities for logging, authentication, crash failure detection, synchronisation and deterministic view changes. LAMB and Sandbox are enabled by a scalable P2P platform based on JXTA. This allows fast dynamic service discovery and overall control of the framework. We have developed implementations of seminal FT protocols including Recovery Block, N-Version Programming, Paxos and CLBFT. The framework can even switch between protocols at runtime based on QoS-based service differentiation. We have produced a detailed case study using a real-world trade floor application. This study uses doping and metrics to evaluate each protocol configuration in a number of scenarios covering all failure models, group membership properties and runtime reconfiguration.

Acknowledgements and Dedications

I would like to acknowledge Professor Ian Sommerville for giving me the opportunity to start this PhD and for his support and advice in the early days. I would like to thank Dr Gerald Kotonya, who took me on as his PhD student at my request and has kept faith through what has been a long, and at times hard, process. I would like to acknowledge the help, support, food and lodgings offered to me by my Dad and Lesley whilst I was writing up and completing the corrections.

This thesis is dedicated to the memory of my mother, Sylvia, whom I miss so much. This thesis is also dedicated to my wife, Rowan, who has patiently supported me even when it seemed that the work would take forever. You will forever have my love and gratitude.

Finally, to my unborn child, may this work inspire you and make you proud of your Dad.

Contents

1 Introduction
    1.1 Problem Statement
    1.2 Motivation
    1.3 Research Objectives
    1.4 Research Contributions
    1.5 Thesis Structure

2 Reliable Distributed Computing
    2.1 Concepts
        2.1.1 Other FT Factors
    2.2 Abstractions
        2.2.1 Failure Detection
            Practical Failure Detection
            Leader Elections
        2.2.2 Broadcasting
            Total Order Broadcast
        2.2.3 Consensus
            Using Consensus for Atomic Broadcast
            Byzantine Agreement
            Randomised Byzantine Agreement
        2.2.4 Group Membership
        2.2.5 State Machine Replication
        2.2.6 More FT Abstractions
    2.3 FT Protocols
        2.3.1 Recovery Blocks
        2.3.2 N-Version Programming
        2.3.3 Combining Active and Passive Replication
        2.3.4 Paxos
        2.3.5 Paxos for Byzantine Agreement
        2.3.6 CLBFT
            Normal Operation
            Recovery
            Performance
            BASE
        2.3.7 FT Protocols Summary
    2.4 Summary

3 Service Oriented Computing
    3.1 Service Oriented Architecture
    3.2 Service Messaging
        3.2.1 REST
        3.2.2 MOM
        3.2.3 SOAP with WS-Addressing
        3.2.4 WS-Reliable Messaging
        3.2.5 Discussion
    3.3 Service Discovery
        3.3.1 Universal Description and Discovery Interface
        3.3.2 Alternative Standards for Discovery
        3.3.3 Semantic based Discovery
        3.3.4 Distributed Discovery
        3.3.5 Discussion
    3.4 Service Composition
    3.5 Peer-to-Peer
        3.5.1 Distributed Hash Table
        3.5.2 JXTA
            Discovery
            Performance
    3.6 Summary

4 Existing Approaches to FT-SoC
    4.1 Criteria for Review
    4.2 Generic FT Container
    4.3 FAWS
    4.4 FT with WS-BPEL
    4.5 Fault Tolerance Connectors
    4.6 Client based Frameworks
    4.7 CORBA Based FT
    4.8 Transparent FT
    4.9 Group Membership based Frameworks
    4.10 WS-BUS
    4.11 FT-Net Traveler
    4.12 RWSI
    4.13 Byzantine FT Frameworks
    4.14 Discussion
    4.15 Summary

5 An Integrated FT-SoC Framework
    5.1 Late Asynchronous Message Brokering
        5.1.1 System Model
        5.1.2 Service Bindings
        5.1.3 LAMB with JXTA
        5.1.4 QoS Selection
        5.1.5 Delivery
    5.2 Sandbox
        5.2.1 System Model
        5.2.2 Logging with Domesday
        5.2.3 Authentication with Gatekeeper
        5.2.4 Crash Failure Detection with Eternity
        5.2.5 Synchronisation with Clockwork
        5.2.6 Deterministic View Changes with Viewpoint
        5.2.7 Summary
    5.3 Platform Model
        5.3.1 Managing Peers
        5.3.2 Managing Services
        5.3.3 User Interface
    5.4 Fault Tolerance Services
        5.4.1 Passive Replication with Patmos
            Discussion
        5.4.2 Active Replication with Elegance
        5.4.3 Active Replication with Atakos
            Discussion
        5.4.4 State Machine Replication with Ionian
            Discussion
        5.4.5 A Non-Blocking Variant of Ionian
            Discussion
        5.4.6 State Machine Replication for Byzantine Failure with Andros
            Recovery Protocol
            Discussion
    5.5 Summary

6 Evaluation
    6.1 Trading Floor Application
    6.2 Testbed
        6.2.1 Cloud
        6.2.2 Doping
        6.2.3 Request Injection
        6.2.4 Metrics
    6.3 Evaluation
    6.4 FT Protocol Evaluation
        6.4.1 Normal Operation
        6.4.2 Fail-Silent
        6.4.3 Omission
        6.4.4 Timing
        6.4.5 Commission
        6.4.6 Denial Of Service Attack
        6.4.7 Divisive Attack
        6.4.8 Byzantine
        6.4.9 Evaluation Summary
    6.5 Framework Evaluation
        6.5.1 FT-SoC Literature Survey Revisited
    6.6 Summary

7 Conclusions
    7.1 Research Objectives Revisited
    7.2 Assessment
    7.3 Limitations of our Work
    7.4 Further Research Directions
    7.5 Final Remarks

Bibliography

A FT Protocol Code Listings

B Screenshots

C Additional Scenarios
    C.1 Runtime Reconfiguration Test Case
    C.2 Group Membership Test Case
        C.2.1 Sequential Group Membership
        C.2.2 Concurrent Group Membership
        C.2.3 Sequential Fail-Stop
        C.2.4 Concurrent Fail-Stop
        C.2.5 Primary Fail-Stop

List of Figures

2.1 Process model of FT.
2.2 Process Automata as a Stack
2.3 Ordered Failure Models
2.4 Vector Clocks [Raynal96].
2.5 Byzantine Generals Problem [Lamport82].
2.6 Rampart Normal Operation [Reiter94].
2.7 Recovery Blocks [Pullman01, Randell75].
2.8 N-Version Programming [Avizienis85].
2.9 Consensus Recovery Blocks [Torres-Pomales00].
2.10 Multi-Paxos [Turner07].
2.11 Paxos at War [Zielinski04].
2.12 CLBFT Normal Operation [Liskov07].
2.13 CLBFT Recovery Protocol [Castro02].

3.1 SOC the extended SoA [Papazoglou03].
3.2 Service Oriented Architecture.
3.3 REST [Fielding00].
3.4 MOM Exchange Patterns [Brambilla04].
3.5 WSBUS [Erradi05].
3.6 SOAP Message with WS-Addressing [Gudgin06].
3.7 WS-RM [Davis07].
3.8 Anatomy of UDDI.
3.9 Service Description with OWL-S.
3.10 WS-BPEL
3.11 P2P Messaging (Flood).
3.12 Chord Ring Topology [Stoica03].
3.13 JXTA Net Peer Group Advertisement [Oaks02].
3.14 JXTA SRDI [Li03].

4.1 IWSD [Salatge07].
4.2 FT-SOAP [Fang04].
4.3 FTWeb [Santos05].
4.4 Transparent FT [Dialani02].
4.5 Thema [Merideth05].
4.6 BFT-WS [Zhao07].

5.1 Architecture of WSPBFT.
5.2 FT Agnostic Flow in LAMB.
5.3 Anatomy of LAMB.
5.4 LAMB SOAP Bindings.
5.5 LAMB WSDL Message Bindings.
5.6 Advertisement Classes for JXTA Service Discovery.
5.7 A Web Service Advertisement.
5.8 LAMB Selection Priorities.
5.9 Sandbox Service Class Inheritance Example.
5.10 Anatomy of Sandbox.
5.11 Domesday Log Interface.
5.12 Gatekeeper Startup Sequence.
5.13 Gatekeeper Header.
5.14 Gatekeeper Interface to FT Services.
5.15 Eternity Interface.
5.16 Clockwork Bindings in WSDL.
5.17 Clockwork Interface.
5.18 Interface provided by Viewpoint.
5.19 Platform Model.
5.20 Peer Discovery Sequence.
5.21 Service View through User Interface.
5.22 UI forwarding HTTP requests between Peers.

6.1 The Trading Floor Screen.
6.2 Trading Floor Services.
6.3 Testbed.
6.4 Lifecycle of a Dope.
6.5 Configuration of a Dope.
6.6 Doping Scenarios through the Web Interface.
6.7 Anatomy of Metric Gathering.
6.8 Metrics Interface.
6.9 Graphing the Metrics.
6.10 Contrasting Latencies at Injection of 1 request/s.
6.11 Configurations (n = 7) under Increasing Load.
6.12 Comparing Andros Configurations.
6.13 Fail-Silent (f) Results.
6.14 Fail-Silent (f + 1) Results.
6.15 Omission Results.
6.16 Timing (f) Results.
6.17 Timing (f + 1) Results.
6.18 Timing Alternate (f + 1) Results.
6.19 Timing Stochastic Results.
6.20 Commission Results for f and f + 1 Respectively.
6.21 DoS Attack (f) Results.
6.22 DoS Attack (f + 1) Results.
6.23 Divisive Attack (f) Results.
6.24 Divisive Attack (f + 1) Results.
6.25 Byzantine (f) Results.
6.26 Byzantine (f + 1) Results.
6.27 Byzantine Stochastic Results.

B.1 Doping and Metrics Screenshot
B.2 LAMB Service Discovery Screenshot

C.1 Runtime Reconfiguration Results.
C.2 Sequential Group Membership.
C.3 Concurrent Group Membership Results.
C.4 Sequential Fail-Stop Results.
C.5 Concurrent (n − 1) Fail-Stop Results.
C.6 Concurrent (n) Fail-Stop Results.
C.7 Primary Fail-Stop Results.

List of Tables

2.1 Distributed System Models.
2.2 Failure Detection Classification [Chandra96a].
2.3 FT Protocols Summary.

3.1 Service Messaging Review.
3.2 Service Discovery Review.

4.1 FT-SOC Frameworks.

6.1 Configurations.
6.2 Normal Operation with Soak Results.
6.3 FT Protocol Evaluation Summary.
6.4 Summary of Framework Test Cases.
6.5 FT-SoC Literature Survey Revisited.

List of Algorithms

1 Group Membership [Cristian91].
2 State Machine Replication with Atomic Broadcast.
3 Paxos [Lamport01, Lamport04, Turner07].
4 Checkpointing Garbage Collector in CLBFT [Castro02].
5 Normal Operation of CLBFT (without recovery) [Castro02].
6 Calculating a View-Change in CLBFT [Castro02].
7 Primary Decision for CLBFT View-Change [Castro02].
8 DHT Lookup [Wiley03].
9 Chord Search [Stoica03, Wiley03].
10 Patmos (Part 1)
11 Patmos (Part 2)
12 Atakos
13 Ionian (Part 1) Proposer
14 Ionian (Part 2) Acceptor, Learner and Recovery
15 IonianNB
16 Andros (Part 1) Primary
17 Andros (Part 2) Backup Instances (Normal Operation)
18 Andros (Part 3) Recovery Protocol

Chapter 1

Introduction

1.1 Problem Statement

Service-oriented architectures (SoA) such as web services pose a serious problem for dependable system operation because they promise late binding. Late binding delegates the decision to trust a service to an external software agent. However, if third party services are to be trusted they need to be dependable. Fault tolerance, or FT, is a well studied area of research that builds reliable systems using mediated replication techniques. These techniques have not yet been adopted in any widespread or systematic way within service oriented computing (SoC). Indeed, the only open service standard providing FT is Web Service Reliable Messaging [Davis07]. The barrier to adoption is multifaceted. Firstly, there is a high perceived cost to replicating services. Secondly, there is a general disagreement on whether machine-based agents are able to perform effective discovery because of semantic barriers [Alonso02, Shirky02]. However, many FT techniques require this dynamism. Lastly, there is a lack of experience with FT techniques within the service development community. Without widespread adoption of FT, the currently available protocols have yet to be proven in action.

1.2 Motivation

We are motivated to address the research problem because of the failings in the present "state of the art" FT approaches to SoC. These inadequacies are summarised as follows:

• Limited coverage of FT techniques. The FT community has provided a range of paradigm agnostic FT protocols for distributed systems. However, coverage of these protocols on a framework by framework basis is limited. To support this statement we cite the lack of support for Paxos [Lamport01], a well known protocol that provides state machine replication. Some frameworks provide extensibility mechanisms but these are limited to simple active and passive replication techniques. Other approaches are direct implementations of a specific FT protocol such as CLBFT [Castro01, Merideth05, Zhao07]. A direct consequence of this limited coverage is a lack of any comprehensive evaluation of known FT protocols with regards to reliability and performance. This in turn leads to the question of whether these defined FT protocols are fit for purpose in modern SoC.

• Poor adoption of the FT process model. The FT process model [Guerraoui06] describes a pure asynchronous messaging environment to remove all implicit timing assumptions about interactions. Current frameworks struggle in general to work in conjunction with standard SoAs to provide asynchronism. Some are based on SoAs that only support synchronous request-response exchanges because they are tied to an underlying transport protocol such as HTTP [Looker05, Jayasinghe05, Santos05, Sommerville05]. Others bypass the SoA in favour of a hard-wired approach [Merideth05, Zhao07], thereby removing all chances of extensibility.

• FT is treated as an orthogonal issue to SoA. Current SoAs make no provision for FT. Frameworks either apply FT at the transport level invoked by indirection [Looker05, Fang04, Santos05, Sommerville05] or alternatively embed the FT protocols within the application logic [Dobson05]. In both cases FT is transparent to SoA. The biggest obstacle lies not with the frameworks themselves but with the SoAs underpinning those frameworks. FT protocols rely on the plurality of services, making discovery of primary importance. In practice, current SoAs require human intercession to provide discovery because of the semantic complexity of matching. However, the vision of late binding is based on the principle of autonomous service discovery. This lack of semantic-based discovery limits the ability of the SoA to provide FT by choosing services at runtime based upon their quality of service (QoS). None of the existing FT frameworks for SoC provide a means for a client to differentiate between FT services (that fulfil the same well-known role [Alonso02]). This supports our assertion that current SoAs are insufficient to provide FT.

• Lack of decentralisation. Current solutions generally suffer from the problem of single points of failure. Mediated approaches such as [Looker05, Jayasinghe05, Sommerville05] are noteworthy because the failure of the intermediary (proxy) triggers total system failure. Some frameworks provide a singleton topology for management, repository and detection components [Fang04, Erradi05], again resulting in potentially catastrophic failure. Other approaches rely on a centralised repository such as UDDI for service discovery [Fang04, Santos05, Dialani02].

1.3 Research Objectives

Our principal research objective is to support the reliability characteristics of SoC by improving the provision of FT in conjunction with the SoA. This objective decomposes, based on the issues identified in section 1.2, into the following:

• Increase the range of FT protocols available to service oriented systems.
    - Simplify the deployment of FT protocols as services.
    - Make FT services pluggable and swappable.
    - Enhance FT protocols, where necessary, to operate with the SoA.
    - Evaluate and contrast different FT protocols with regards to reliability and performance.

• Adapt current SoAs to make them less orthogonal to FT.
    - Enhance SoA by enforcing asynchronous messaging, thereby facilitating the FT process model.
    - Provide autonomous runtime service discovery.
    - Allow FT services to be differentiated based on their QoS metrics.

• Address the lack of decentralisation in current FT-SoC approaches.
    - Distribute service discovery infrastructure across many nodes.
    - Integrate SoA with Peer to Peer (P2P) protocols to remove centralisation.

1.4 Research Contributions

Our principal research contribution is the development of new models and associated framework for SoC derived from a comprehensive and critical review into existing work in the eld. This work makes the following contributions:

• Increases the range of FT protocols available to SoC to six (covering all failure models). All protocols have been enhanced from their basic FT denitions, to include facilities such as multi-threading, to successfully operate in a SoC environment.

• Simplifies the deployment of FT by providing an introspective hosting environment for protocols whilst providing generic solutions to several FT sub-problems such as failure detection, explicit synchronism and authentication.

• Enables new FT protocols to be rapidly prototyped through a pluggable infrastructure.

• An evolved SoA model, Late Asynchronous Message Brokering (LAMB), that enables FT by: enforcing an asynchronous messaging environment (FT process model); providing service matching based on unique names (URIs) working under the well-known service assumption [Alonso02]; supplying QoS based service differentiation.

• A demonstration of how to use P2P protocols (implemented in JXTA) to truly decentralise the SoA and FT infrastructure.

A secondary key contribution is the extensive experimental evaluation of six different FT protocols operating with a real world trading floor test case. To support this we have developed an integrated environment for controlled injection of interactions and failures along with metric capture.

1.5

Thesis Structure

Chapter 2 provides an introduction to FT. It starts by discussing the principal concepts such as the process, failure, timing and distributed system models. The second part is used to describe fault tolerant abstractions. These abstractions include timing-based failure detection and leader elections, broadcast with total-ordering, consensus and Byzantine agreement, group membership and finally state machine replication. The last part of the chapter is dedicated to an incremental survey of FT protocols that provide passive, active and state-machine replication.

Chapter 3 is principally a literature survey of SoA but also covers P2P networks. We survey the key aspects of SoA, namely messaging, discovery and composition with reference to our research objectives. The second part of the chapter is devoted to a survey of P2P protocols and frameworks.

Chapter 4 is a review of existing FT approaches that apply to SoC. The chapter reviews the different frameworks against criteria derived from our research objectives. The chapter ends with a table and discussion of the gaps for development of a new framework.

Chapter 5 describes the design and implementation of our FT framework. It introduces LAMB, a new asynchronous message brokerage system that routes SOAP messages to services described by WSDL, and Sandbox, a container for FT web services. The discussion also outlines a set of FT facilities provided to services within Sandbox. A platform derived from JXTA Peer-to-Peer is presented with discussion of peer, service and user management. This chapter concludes with a set of FT services that mirror the key protocols described in Chapter 2.

Chapter 6 presents a two part evaluation. We introduce the Trading Floor application from which we derive configurations to test in the evaluation. We present the doping mechanism used to inject requests and controlled failures to build evaluation scenarios. A metric interface is described that is used to gather experimental information relating to the reliability and performance of different configurations. We show the chart interface used to visualise the metrics. The first part of the evaluation is dedicated to testing different FT protocols as services to fulfil a particular research objective. The second part of the evaluation is a set of test cases that check whether the framework fulfils the rest of the research objectives. The summary of the literature survey from Chapter 4 is revisited to demonstrate the gaps addressed by our framework.

Chapter 7 assesses the framework from the perspective of our top level research objective. We provide a discussion of the limitations of our work and areas for further work. Finally, we provide some final remarks on this work.


Chapter 2

Reliable Distributed Computing

The dependability of a program equates to the human perception of its trustworthiness [Sommerville01]. Dependability is composed of a series of interrelated ilities such as security, safety, reliability, availability, performance and scalability. In this chapter we are concerned with reliability and the techniques used to achieve it, namely fault-tolerance or FT. This chapter reviews the research on distributed FT. The first part of the chapter reviews FT concepts in the form of problems and the abstractions that solve them. The second part of the chapter reviews the protocols built upon FT abstractions. Performance and scalability are important orthogonal issues in any reliability achieving system.

2.1

Concepts

Reliability is defined as the probabilistic confidence that a service will perform its tasks to a defined quality of service (QoS) within an established environment. The term QoS encompasses all non-functional requirements for a system which include response times, expected network behaviours and even dependability metrics [Dobson07]. Laprie et al [Avizienis01, Laprie95] simply call this continuity of correct service. Another expression of reliability is the mean time between failures (MTBF) [Avizienis01]. A more common measure of reliability is the ratio of failures to overall executions, often represented as a percentage.

Performance is defined as a measure of how a system matches a QoS expectation. Informally, we limit the term performance to acceptable response times. This is also referred to as latency.

There is no agreed definition of scalability for distributed systems. However, there are two associated terms: load-scaling is where throughput changes roughly in proportion with the rate of inputs, i.e. in Big O notation this is O(n) or better; structural-scaling is the property of adding actors without disproportionately affecting performance [Bondi00]. Routing protocols are accepted as n-scalable if they are able to resolve queries in the order of O(log n). The Internet itself is an example of a highly scalable network since its overall throughput can increase roughly with the number of hosts.

Reliability is affected by a number of factors including hardware breakages, network partitioning, faulty software and human influence. The result of these factors is service failure where a delivered service deviates from its intended operation [Sommerville01]. Once a failure occurs, a correct service becomes faulty. Laprie et al [Laprie92, Laprie95, Avizienis01] describe the so-called fault-error-failure chain to show the pathology of failures. A fault is an incorrect system state that remains dormant until activated. Once activated a fault becomes an error (an exceptional execution flow). A failure is the manifestation of an error. If unchecked a fault within a sub-component can result in overall failure of a system. This is demonstrated by the famous quote by Lamport [Lamport78a]: "A distributed system is one in which the failure of a computer you did not even know existed can render your own computer inoperable."

Reliable systems are achieved by fault prevention, removal, forecasting [Pullman01] or more commonly by tolerance. Fault tolerance consists of three types of action (1) detection, (2) recovery and (3) masking. Detection is the observation of the occurrence of a failure. Recovery is the process of restoring a faulty service to correctness, removing the error. Detection and recovery usually operate together in reliability protocols such as recovery block [Randell75]. Masking is used when failures are too subtle to be detected. It uses multiple copies of a service to mask the occurrence of failures. Both masking and recovery require multiple instances of the same service, or replicas, that are coordinated to provide FT. When coordinated sequentially it is called passive replication; when concurrently it is called active replication or redundancy. Replication avoids the problem of single-point-of-failure, the primary cause of total system failure. This makes decentralisation of system functionality desirable.

In a formal model of FT, there are two properties that must be upheld. Safety is an invariant property that must hold for the lifetime of the service. It states that nothing bad should happen making the system reach an unacceptable state [Owicki82]. Violation renders the service faulty. For example, messages should not be invented out of thin air [Guerraoui06]. Liveness, or progress [Lamport01], states that something good should eventually happen leaving the system in a desirable state [Owicki82].

To describe the problem domain for FT a number of models are used that form the distributed model of computation. The first abstraction is the process model as shown in Figure 2.1. A distributed system is composed of n processes represented by the set Π or {p, q, r}. Message delivery has two primitives, send and deliver. Messages are asynchronously exchanged between processes over logical links in an execution step. A global clock is often assumed making all events sequential rather than concurrent [Lamport78b]. The existence of the global clock, however, is unlikely but it simplifies presentation.

Figure 2.1: Process model of FT.



A logical link is a communication step. It represents a reliable or unreliable connection over a network of processes. Most links have key correctness properties. Integrity ensures the link does not invent or randomly duplicate messages. Validity ensures that if a message is sent to p_i it is delivered by p_i. Perfect links guarantee that the message gets delivered. Authenticated links use digital signatures or MACs so that recipients can verify the sender and check that messages have not been tampered with.

Figure 2.2: Process Automata as a Stack

The process model is refined in Figure 2.2 where each process automaton is divided into a stack of layers. Each layer enables a given abstraction such as consensus. Higher abstractions can extend, or transform, lower ones. They communicate through the send and deliver primitives. FT research [Barborak93, Castro01, Chandra96c, Dolev87, Dwork88, Hadzilacos94, Reiter95, Schneider90] uses a model identical or similar to this. Indeed this model has been utilised in a Java based implementation called Appia [Carvalho03] which is used for teaching purposes.

Failures themselves can be modelled. Failure models are a logical grouping of failures based on their characteristics [Guerraoui06]. These are in turn summarised in terms of controllability, consistency and consequences in [Avizienis01]. Failure models ordered by their severity are shown in Figure 2.3. Signalled failures are the simplest to tolerate since their presence is detected internally and a notification sent. This failure model is akin to an exception raised within a program space. In programming terms, an idealised component [Randell95] is one that signals a failure.

Figure 2.3: Ordered Failure Models

A crash-stop failure occurs when a process stops sending and delivering messages forever. In this failure model a process is correct before the failure occurs and faulty thereafter. Network partitioning is the classic example of the crash-stop failure model. When a process crashes for a period of time but then subsequently recovers, its failure model is said to be crash-recovery. The omission failure model occurs when some of the messages being sent between processes are arbitrarily lost. It is the arbitrariness that marks omission apart from crash failures. Timing failures occur when a process delays delivering a message for an arbitrary period of time. This failure model is only meaningful when there is an upper-bound on the message delivery times, a state known as synchrony. Commission failures are messages that are incorrect by some acceptance measure; therefore, they are in the value domain whereas all others are in the timing domain. Commission failures can target application or control messages. Byzantine failures encompass any other type of failure model including ones that have not been thought of yet. The very arbitrariness of Byzantine failures makes them difficult to defeat and mitigate against. Lamport et al have characterised arbitrary failures using the Byzantine Generals Problem [Lamport82]. This scenario ascribes failures to treacherous generals. However, in real systems Byzantine failures are often attributed to non-malicious sources such as software defects.

Other types of failures that coincide with the above failure models are the common-mode and denial of service (DoS). Common-mode failures occur when all replicas possess the same latent fault and therefore fail simultaneously. DoS failures occur when a third party saturates a set of services with requests to prevent others accessing those services. DoS failures can encompass the crash, omission and timing failure models. The most commonly studied failure models are crash and Byzantine.

FT is strongly affected by the assumptions that can be made about how and more specifically when processes communicate with each other in normal and faulty operation. This model is called timing assumptions.

A distributed system is said to be synchronous if there is an upper bound ∆ on communication delays (communication synchrony), an upper bound Φ on how much faster the local clock of process p_i can run relative to process p_k (process synchrony) and an upper bound on the time taken for a computation (computation synchrony) [Dwork88]. For simplicity we can say that process clocks cannot deviate from a global clock by more than Φ. If there are no bounds for ∆ and Φ the system is asynchronous. A fully synchronous system allows for simple and fast algorithms; on the other hand, timing assumptions are dangerous to rely on especially in large scale heterogeneous networks, for example the Internet [Kursawe02].

Asynchronous means no timing assumptions can be made about the process model, making liveness assertions problematic. Fischer et al [Fischer85] show that it is impossible to achieve a consensus between processes in an asynchronous environment even if only one process crashes. This result is called the Fischer, Lynch and Paterson (FLP) impossibility result. Its argument stems from an inability to determine whether a process has crashed or is acting slowly (i.e. a timing failure).

A great deal of research has taken place into addressing the problem. Partial Synchrony [Dolev87, Dwork88] comprises two intermediate models of synchrony where processes are assumed to act synchronously most of the time, only deviating from this under certain conditions such as heavy load. The first model is where there are values for ∆ and Φ but these are not known before the system starts. A second model uses fixed values for ∆ and Φ but these only hold after an unknown Global Stabilisation Time. Failure Detection Oracles [Chandra96a] prove even weaker assumptions can solve consensus; this is discussed in more detail in 2.2.1. Finally, randomisation can be used to weaken the termination property to p-termination where consensus is only reached with a high probability [Correia06]. Unfortunately, there are no empirical or comparative studies that contrast randomisation with partial synchronisation or failure detectors.

Even in a system without physical clocks, time can be monitored with regards to message passing by using logical clocks [Lamport78b]. Their purpose is to establish a causality between messages. If m1 → m2, m1 happened before m2. This relationship holds if (a) m1 and m2 were sent by the same process but temporally m1 occurred before m2, or (b) the delivery of m1 resulted in the sending of m2, even if that included the sending of intermediate messages.

By combining process, failure and timing models distributed environments can be classified by the distributed system model. Common distributed system models are presented in Table 2.1. It is common for distributed system models to be described as failure models and vice-versa.

Designation              Failure Model   Communication Model         Timing Assumptions
Fail-Stop                Crash-Stop      Perfect Links               Synchronous P
Fail-Noisy               Crash-Stop      Perfect Links               Partial ◇P
Fail-Silent              Crash-Stop      Perfect Links               Asynchronous (safety), Partial (liveness) ∆
Fail-Recovery            Crash-Recovery  Stubborn Links              Partial ◇P
Byzantine                Arbitrary       Fair Loss Links, Multicast  Asynchronous (safety), Partial (liveness) ∆
Authenticated Byzantine  Arbitrary       Authenticated Links         Asynchronous (safety), Partial (liveness) ∆

Table 2.1: Distributed System Models.

2.1.1

Other FT Factors

FT is affected by a number of additional factors. In order to calculate the cardinality of replicas required (n) we define the resilience in terms of the number of concurrent failures (f) that may occur without total failure. This expression depends almost entirely upon the distributed system model though there are examples where resilience is traded for performance [Martin06]. In the fail-stop model n = f + 1, i.e. at least one replica must maintain correct operation. In a fail-silent model n = 2f + 1, i.e. a majority of processes must remain correct. Finally, in the Byzantine models n = 3f + 1 because two thirds of processes must provide a (possibly incorrect) response to ensure that a (third plus one) majority provide a correct response.

The true performance of a distributed system can only be verified through experimental evaluation. However, when systems are expressed in a formal notation they use several measures to reason about performance. The round-complexity, or ε, is the number of near synchronous steps an algorithm takes to complete [Correia06]. For example Paxos, a crash tolerant agreement protocol, reaches consensus in ε = 2 [Lamport01], whereas CLBFT, its Byzantine tolerant equivalent, reaches consensus in ε = 3 [Castro99]. The total number of messages exchanged in an algorithm instance is called the messaging-complexity or ϕ. Sometimes the messages per round is expressed as ϕ_round = ϕ / ε [Guerraoui06].

Other factors influence performance. Communication-complexity is a measure of how large messages are. Protocols with causal relationships may embed other messages in them causing a high communication complexity. Infinite capacity in communication links cannot be assumed [Correia06]. Cryptographic techniques like asymmetric key encryption have a high computational-complexity. In the past this has made FT protocols, such as Rampart and CLBFT-PK [Correia06, Castro01, Reiter95], impractical. Message Authentication Codes that use symmetrical key encryption are a weaker but faster alternative [Castro02].

Redundant, exact copies of software components alone cannot increase reliability in the face of design faults [Pullman01] because those faults may be common-mode. Diversity, or N-Version Programming, denotes a situation where replicas have different implementations of the same original specification [Avizienis85]; in this case they become variants. It avoids the occurrence of common-mode failures. However, diversity is expensive and its cost often makes it prohibitive for most applications [Torres-Pomales00].
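The resilience bounds and the per-round message measure given in this section can be worked through with a short sketch. The model labels and function names below are illustrative choices made for this example, not part of the framework.

```python
# Replica counts implied by the resilience bounds quoted in the text.
# The dictionary keys are labels chosen for this sketch only.
REPLICA_BOUNDS = {
    "fail-stop": 1,    # n = f + 1
    "fail-silent": 2,  # n = 2f + 1
    "byzantine": 3,    # n = 3f + 1
}


def replicas_required(model: str, f: int) -> int:
    """Smallest n that tolerates f concurrent failures in the given model."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return REPLICA_BOUNDS[model] * f + 1


def tolerated_failures(model: str, n: int) -> int:
    """Largest f such that replicas_required(model, f) <= n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // REPLICA_BOUNDS[model]


def messages_per_round(phi: float, epsilon: int) -> float:
    """phi_round = phi / epsilon, the average number of messages per round."""
    return phi / epsilon


if __name__ == "__main__":
    for model in REPLICA_BOUNDS:
        print(model, [replicas_required(model, f) for f in range(4)])
```

Inverting the bound is often the more useful direction in practice: given a fixed pool of n replicas, `tolerated_failures` reports how many concurrent failures each model can absorb.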

2.2

Abstractions

Abstractions are the solutions to specific FT problems defined in Section 2.1. They neatly map to the layers in the automata stack shown in Figure 2.2 and therefore eventually form part of the process model. An abstraction achieves FT by addressing detection, recovery or masking. They include failure detection, broadcasting, consensus and group membership.

2.2.1

Failure Detection

A failure detector formalises the timing model to recover from crashes or ensure liveness in more severe failure models [Chandra96a]. The topology varies widely but often they are seen as local oracles that advise processes as to the state of other processes. Detectors are classified by two properties. Completeness is the suspicion of faulty processes. Accuracy is the non-suspicion of correct processes. The classifications are summarised in Table 2.2.

A perfect failure detector P is achieved simply by processes pushing heartbeats to each other. In a synchronous environment heartbeats are guaranteed to be received in 2∆ + Φ; if no heartbeat is received the sender has crashed. An eventually perfect failure detector ◇P is achieved by allowing suspected crashes to be repealed if a later heartbeat is received. The heartbeat bound 2∆ + Φ is then increased to ensure no more inaccurate suspicions. It has been demonstrated that even the weakest failure detector ◇W can circumvent the FLP result when suspicions are received by a majority agreement of correct processes [Chandra96a]. This class of failure detector uses a weaker model than partial synchrony [Dwork88] where both bounds ∆ and Φ and the global stabilisation time are unknown.

Designation               Completeness  Accuracy           Timing Model
P   Perfect               Strong        Strong             Synchronous
S   Strong                Strong        Weak               Synchronous
Q                         Weak          Strong             Synchronous
W   Weak                  Weak          Weak               Synchronous
◇P  Eventually Perfect    Strong        Eventually Strong  Partially Synchronous
◇S  Eventually Strong     Strong        Eventually Weak    Partially Synchronous
◇Q                        Weak          Eventually Strong  Partially Synchronous
◇W  Eventually Weak       Weak          Eventually Weak    Partially Synchronous

Table 2.2: Failure Detection Classification [Chandra96a].
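The eventually perfect detector described before Table 2.2 can be sketched as follows. This is a minimal single-node illustration under stated assumptions: time is passed in explicitly rather than read from a clock, the initial `timeout` stands in for the bound 2∆ + Φ, and the class name and the `increment` parameter are hypothetical.

```python
class EventuallyPerfectDetector:
    """Heartbeat detector whose suspicions can be repealed.

    `timeout` plays the role of the heartbeat bound (2*delta + phi in the
    text); `increment` is an assumed amount by which the bound grows after
    each false suspicion.
    """

    def __init__(self, processes, timeout, increment):
        self.timeout = timeout
        self.increment = increment
        self.last_heartbeat = {p: 0.0 for p in processes}
        self.suspected = set()

    def heartbeat(self, process, now):
        """Record a heartbeat; repeal a suspicion and widen the bound if needed."""
        self.last_heartbeat[process] = now
        if process in self.suspected:
            # The earlier suspicion was inaccurate, so relax the bound.
            self.suspected.discard(process)
            self.timeout += self.increment

    def check(self, now):
        """Suspect every process whose last heartbeat is older than the bound."""
        for process, seen in self.last_heartbeat.items():
            if now - seen > self.timeout:
                self.suspected.add(process)
        return set(self.suspected)
```

Setting `increment` to zero and never repealing would give the behaviour of the perfect detector P, which is only sound when the bound genuinely holds.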



Practical Failure Detection

As mentioned in Section 2.2.1, failure detection is broadly achieved by processes sending heartbeats to each other. However, there are practical considerations. Heartbeats must change to traverse system topology [Hayashibara02]. More importantly the number of messages ϕ sent each round is n², causing the phenomenon of message explosion. This explosion affects the performance and possibly correctness of communications and processes.

The simplest optimisation is to move from a pull to a push mechanism. Pulling requires two messages, the request and response. Pushing is simply sending alive messages. Of course pushing requires that each process has a more detailed knowledge of other processes. Failure detection can be partitioned so that heartbeats do not traverse sub-networks [Felber99]. This approach requires a hierarchy of detectors where privileged ones can share information between sub-networks. This approach is not adaptable.

A lazy failure detector uses application messages and their acknowledgements to piggyback heartbeats [Fetzer01]. When no application messages are forthcoming pull-style heartbeats are sent. This detector satisfies ◇P by adapting the upper-bound T of the maximum round-trip time when suspicions arise. The detector includes a built-in mistake time allowing for decisions to be repealed. A lazy detector is efficient under high loads by not adding messaging complexity to the application. Another adaptable failure detection protocol takes QoS measurements from the environment to guess ∆ and Φ [Ma07]. Links between processes are given weighted probabilities in terms of reliability and delivery times.

A novel approach to failure detection relies on gossiping between processes [Renesse98]. Instead of heartbeats, processes periodically share lists of clocks for known correct processes. Every T (gossip) a process increments its internal clock and sends its list of known correct clocks to a randomly chosen set of k processes. After receiving a list a process merges it by taking the highest clock for each listed process. A crashed process leads to a clock that never gets incremented. After a time this can be used to suspect the crashed process. Gossiping satisfies ◇P because suspicions can be repealed when a clock gets incremented. The main advantage of gossiping is that ϕ = k compared to n². The value k is a trade-off between messaging complexity and detection time. Processes are free to join the network by simply sending their clock to another process. Finally, gossiping can be adapted into a hierarchy that partitions networks.
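The gossip-style detector described above can be illustrated with a small sketch. It is not the protocol of [Renesse98] in full: the class name, the `fail_after` threshold (counted in local gossip periods) and the helper `gossip_targets` are assumptions made for the example, and message transport is left to the caller.

```python
import random


class GossipNode:
    """One process in a gossip-style failure detector (illustrative only)."""

    def __init__(self, name, fail_after):
        self.name = name
        self.fail_after = fail_after      # periods without growth before suspecting
        self.period = 0                   # local count of gossip periods
        self.clocks = {name: 0}           # highest clock known for each process
        self.updated_at = {name: 0}       # local period when each clock last grew

    def tick(self):
        """Start a gossip period: bump our own clock and return the list to send."""
        self.period += 1
        self.clocks[self.name] += 1
        self.updated_at[self.name] = self.period
        return dict(self.clocks)

    def receive(self, clocks):
        """Merge a received list, keeping the highest clock per process."""
        for process, clock in clocks.items():
            if clock > self.clocks.get(process, -1):
                self.clocks[process] = clock
                self.updated_at[process] = self.period

    def suspected(self):
        """Processes whose clock has not grown for fail_after periods."""
        return {p for p, seen in self.updated_at.items()
                if self.period - seen >= self.fail_after}


def gossip_targets(me, processes, k, rng=random):
    """Pick up to k random peers, so each period costs k messages per process."""
    others = [p for p in processes if p != me]
    return rng.sample(others, min(k, len(others)))
```

A suspicion disappears as soon as a fresher clock for that process arrives, which is the repeal behaviour the text relies on; joining the network amounts to sending a first clock list to any existing node.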

Leader Elections

Leader elections are an abstraction based on failure detection that guarantees the liveness of a FT protocol. It asserts that one process is the current leader and is able to coordinate operation. Unlike an intermediary process, a leader changes throughout the lifetime of the protocol in relation to failure or deadlock events. A perfect leader election is based on a monarchical list of processes {p1 ... pn}. The top process on the list, typically ordered by process name, is the leader. A perfect failure detector definitively indicates when a process has crashed. If the leader crashes the next one on the list is accepted as leader.

An eventual leader election Ω extends the monarchical list with an epoch. When a process believes itself to be leader for a current epoch it signals its intent along with the epoch to all other processes. If a process receives an intent it checks the epoch number against its own monarchical list. A message of subservience is sent to the prospective leader if a process decides the epoch and list agree. When a prospective leader receives n/2 + 1 such messages it can act as leader. Internal epochs are increased every time a failure suspicion is received from a ◇P detector. Ω is used in the Paxos family of protocols [Lamport01].

An evolution on leader elections is the view-change abstraction. A view is synonymous with an epoch in Ω. Using a monarchical list and a view a leader, or primary, is chosen. If a crash is detected by process p_i using a P detector then it increments its view to v + 1 and sends it in a view change message to all n processes. When any process receives f + 1 view change messages for v + 1 then v + 1 is installed and the new primary for v + 1 can coordinate. View-change is part of a protocol called Viewstamped Replication [Oki88], made up from view and timestamp, in this case a logical clock [Lamport78b]. This combination is used to sequence requests for the purpose of state replication. Viewstamped replication is for crash failures only. However, it has been adapted in the CLBFT protocol to maintain liveness even in the presence of Byzantine failures [Castro99, Liskov07].
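The election rules in this section reduce to a few small predicates, sketched below. The function names are hypothetical, and n/2 + 1 is read as integer division, which is an assumption about how the threshold is rounded.

```python
def monarchical_leader(ranked, crashed):
    """Perfect election: the highest ranked process not reported as crashed."""
    for process in ranked:
        if process not in crashed:
            return process
    return None  # every process on the list has crashed


def can_act_as_leader(subservience_messages, n):
    """Eventual election: a prospective leader needs n/2 + 1 messages."""
    return subservience_messages >= n // 2 + 1


def view_installed(view_change_messages, f):
    """View change: view v + 1 is installed after f + 1 matching messages."""
    return view_change_messages >= f + 1


if __name__ == "__main__":
    ranked = ["p1", "p2", "p3"]
    print(monarchical_leader(ranked, crashed={"p1"}))
```

The first predicate depends on the failure detector never being wrong; the other two replace that certainty with a count of agreeing processes.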

2.2.2

Broadcasting

Broadcast is a one-to-many delivery abstraction. A group of correctness properties define the type of broadcast [Hadzilacos94]. Validity and integrity stem trivially from links: if a message gets sent it is delivered and no messages shall be invented or duplicated. Agreement ensures that correct processes deliver the same set of messages. Uniformity ensures that a message sent to Π gets delivered to all processes in Π, whether faulty or not, or not at all. The property of uniformity is impossible to impose in the Byzantine failure model [Défago04, Hadzilacos94].

The simplest form of broadcast is IP Multicast. It is a simple efficient primitive supported by many networks using the UDP/IP protocol to any listening machines. Broadcast is achieved in ε = 1 and ϕ = 1 but with no validity, integrity or agreement guarantees. Best effort broadcast is a series of perfect links between a sender and destinations. It derives validity and integrity from the perfect links but the sender may crash during the broadcast preventing agreement. It is sometimes called unreliable broadcast.

Uniform reliable broadcast guarantees agreement. It works by receiving processes signalling that they have received a message to all other processes in Π. The message is delivered by process p_i if the set of acknowledgements is equal to Π for the fail-stop distributed system model. In the fail-silent model the set of acknowledgements must be greater than n/2. Reliable broadcast delivers a message in ε = 2 and ϕ_round = n².
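The delivery rule for uniform reliable broadcast can be sketched as an acknowledgement tracker held by one process. The class and method names are assumptions for illustration, and the network layer that carries acknowledgements is omitted.

```python
class UniformReliableBroadcast:
    """Tracks acknowledgements for one process and decides when to deliver."""

    def __init__(self, processes, model):
        if model not in ("fail-stop", "fail-silent"):
            raise ValueError("unsupported model: " + model)
        self.processes = frozenset(processes)
        self.model = model
        self.acks = {}        # message -> set of processes that acknowledged it
        self.delivered = []   # messages in local delivery order

    def _enough(self, acks):
        if self.model == "fail-stop":
            # Every process in the group must have acknowledged.
            return acks == self.processes
        # Fail-silent: more than n/2 acknowledgements are required.
        return len(acks) > len(self.processes) / 2

    def on_ack(self, message, sender):
        """Record an acknowledgement; return True if it triggers delivery."""
        if sender not in self.processes:
            return False
        acks = self.acks.setdefault(message, set())
        acks.add(sender)
        if message not in self.delivered and self._enough(acks):
            self.delivered.append(message)  # integrity: deliver at most once
            return True
        return False
```

Since every process relays an acknowledgement to every other process, this rule is what produces the n² messages per round quoted above.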

Epidemic Broadcast uses non-determinism for gain. It has properties that are satisfied with a certain probability thus circumventing the FLP result [Birman99, Correia06, Gupta00, Toueg84]. Scalability is achieved because messages are sent to k other processes in r rounds where k and r are not tied to n. The term epidemic arises from observations of natural phenomena such as virus spread. It works as follows. A process broadcasts a message to k others in round r, they then broadcast to k others in round r − 1 until r = 0. It achieves broadcast in ϕ_round = k and ε = r.

Bimodal Multicast [Birman99] attempts to broadcast using the efficient IP multicast. Gossiping between k processes is used to find gaps in the messages received. Using point-to-point links the missing messages are received and delivered. This broadcast can take ϕ = 1 + (ε_gossip n) + 2n but effective multicast can mean delivery in ϕ = 1. A probabilistic leader election [Gupta00] extends bimodal multicast. By hashing a message concatenated with the IP address of the process an index is formed. If the index is greater than a factor K/n, where K is a constant, then the process takes part in the leader election. When the leader is chosen the result is multicast by the chosen few to all n processes. This election trades accuracy for scalability.

Ordering defines how messages are organised by processes between reception and delivery. It is useful if not essential [Hadzilacos94] that replicas process messages in the same order as each other. The simplest form of order is First In First Out or FIFO where a process delivers messages in the order they were broadcast. Causal ordering is defined as: if m1 causally precedes m2 [Lamport78b], then no correct process delivers m2 unless it first delivers m1. Of course causal ordering subsumes FIFO because messages broadcast from the same process are causally related. To achieve causal ordering a message is broadcast along with a history log of messages that causally precede it. When a process p_i receives a causal broadcast message it must first deliver all the messages in the history if they haven't been already. As messages are delivered they get added to the local history log. When p_i becomes a sender it too sends the history log with the message. This broadcast is no-waiting. This approach takes ε = 2 and ϕ_round = n² just like reliable broadcast but has potentially a massive communication complexity.

Vector clocks [Raynal96] are an alternative approach to causal ordering where every process holds a vector of logical clock timestamps for all processes. When sending a message the process p_i updates its own timestamp and embeds the vector. When a vector is received it is merged with the local one, always taking the highest timestamp for all processes such that vector_local[i] = max(vector_local[i], vector_pi[i]), as shown in Figure 2.4. When receiving a causally broadcast message m_i from p_i the process p_x compares the two vectors vector_pi and vector_local to see if there are messages to be delivered before m_i. It is possible that those messages have not been received yet so the protocol blocks until they are. Vector clocks reduce the communication complexity.

Figure 2.4: Vector Clocks [Raynal96].
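The vector clock merge rule, together with one common way of deciding whether a causally broadcast message must wait, can be sketched as below. The merge follows the max rule stated in the text; the `deliverable` test is the standard causal delivery condition and is an assumption here, since the text does not spell out the comparison. Processes are identified by list index for brevity.

```python
def stamp(local, sender):
    """Vector embedded in a message sent by process `sender`."""
    stamped = list(local)
    stamped[sender] += 1  # the sender advances only its own entry
    return stamped


def merge(local, received):
    """Element-wise maximum of the local and received vectors."""
    return [max(a, b) for a, b in zip(local, received)]


def deliverable(local, message_vector, sender):
    """Assumed causal delivery test.

    The message must be the next one expected from `sender`, and it must not
    depend on messages from other processes that are still missing locally.
    """
    if message_vector[sender] != local[sender] + 1:
        return False
    return all(message_vector[k] <= local[k]
               for k in range(len(local)) if k != sender)


if __name__ == "__main__":
    m1 = stamp([0, 0, 0], sender=0)                  # sent by process 0
    m2 = stamp(merge([0, 0, 0], m1), sender=1)       # process 1 replies after m1
    print(deliverable([0, 0, 0], m2, sender=1))      # must wait for m1
```

Each message carries only n integers instead of a full history log, which is where the reduction in communication complexity comes from.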

Total Order Broadcast Total-order guarantees that all correct processes deliver exactly the same sequence of messages. This order is irrespective of the sequence in which the messages were sent, total ordering can be combined with FIFO and causal. If fact total-causal-order is often referred to as just total-order[Hadzilacos94]. There are several approaches to total-ordering [Défago04].

Priviledge-Based

uses a token that is passed between sender processes. The to-

ken can only be passed when the last message is delivered by all correct processes. Delivery is simply then a FIFO broadcast problem.

Communication History

[Lamport78b] allows

senders to broadcast at any time but receivers log all incoming messages. In conjunction with vector clocks a causal history is obtained, this is then transformed periodically into total order history with a knowledge of timeslots for senders. 30


Sequencing is where one process is elected to assign logical timestamps to incoming messages and then unreliably broadcasts to the other processes. This is the most common way to achieve total order broadcast [Castro99, Lamport01, Oki88]. If the sequencer fails a new one is elected. A process may unicast a message directly to a sequencer who then broadcasts the message with timestamp to the other processes. This unicast-broadcast takes ε = 2 and ϕ = n. To prevent message omission the sender may unreliably broadcast the message to all processes; again the sequencer broadcasts the timestamp. This broadcast-broadcast takes ϕ = 2n. Finally, the sender may acquire a timestamp from the sequencer before broadcasting the message and timestamp to all processes. This unicast-unicast-broadcast takes ϕ = n + 2.
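A minimal Python sketch of the unicast-broadcast variant follows. It is an illustration only: the class names `Sequencer` and `Receiver` are hypothetical, the network is replaced by direct method calls, and sequencer failure and re-election are not modelled. Receivers hold back any message whose timestamp is ahead of the next one expected, so every receiver delivers the same sequence regardless of arrival order.

```python
class Sequencer:
    """Elected process that assigns consecutive logical timestamps."""

    def __init__(self):
        self.next_ts = 0

    def stamp(self, message):
        ts = self.next_ts
        self.next_ts += 1
        return ts, message


class Receiver:
    """Delivers stamped messages strictly in timestamp order."""

    def __init__(self):
        self.expected = 0    # timestamp of the next message to deliver
        self.pending = {}    # hold-back queue: timestamp -> message
        self.delivered = []

    def receive(self, ts, message):
        self.pending[ts] = message
        # Deliver every message that is now contiguous with the sequence.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1


sequencer = Sequencer()
stamped = [sequencer.stamp(m) for m in ["a", "b", "c"]]

in_order, reordered = Receiver(), Receiver()
for ts, m in stamped:
    in_order.receive(ts, m)
for ts, m in reversed(stamped):   # the network reorders the broadcasts
    reordered.receive(ts, m)
```

Both receivers end with the same delivered sequence even though the second one saw the broadcasts in reverse.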

2.2.3 Consensus

Consensus is a high-level abstraction that is very similar to broadcast. It has two primitives, propose and decide. It differs from broadcast in that many proposals lead to just one value being decided. It has a symbiotic relationship with broadcast. Total-order broadcast requires consensus that in turn requires an unreliable broadcast. The following properties must be satisfied to achieve consensus. Agreement, or consistency [Dwork88], so that no two processes can decide differently. Termination, where every correct process decides eventually; this guarantees liveness. Validity, so that every value decided was first proposed by a correct process [Doudou05]. Finally, integrity, where processes can only decide once.

The simplest form of consensus is hierarchical [Guerraoui06]. A process may receive a proposal at any time either locally or through a multicast. All processes move through a series of rounds. For each round one process is the highest ranked. Every T the highest ranked process checks to see if a proposal has been received; if it has then the process decides that proposal and multicasts the proposal and round to all other processes. These choose the proposal if the sender is higher ranked than the receiver. All processes then increment the round number and the next process decides the current proposal. Hierarchical consensus will only work in the fail-stop distributed system model, otherwise termination cannot be guaranteed. To complete it takes ε = n rounds with ϕround = n.

Without the fail-stop model consensus can only be achieved if enough processes agree. A quorum is where enough processes agree to make progress. A quorum is indicated by a certificate, a collection of messages that show a quorum has been reached. A quorum certificate is f + 1 in the crash failure model and 2f + 1 in the Byzantine failure model. A weak certificate is also f + 1. Other types of certificate are protocol specific. Failure to gain a certificate tends to eventually result in a reconfiguration such as a leader election. A proof is a certificate composed of authenticated messages.
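The certificate sizes quoted above can be captured in a short Python sketch. The helper names `certificate_size` and `certified_value` are hypothetical, and the sketch assumes that a certificate counts distinct senders backing the same value, which the text does not spell out.

```python
from collections import defaultdict


def certificate_size(f, model):
    """Messages needed for a quorum certificate, as quoted in the text."""
    if model == "crash":
        return f + 1
    if model == "byzantine":
        return 2 * f + 1
    raise ValueError("unknown failure model: " + model)


def certified_value(messages, f, model):
    """Return a value backed by a certificate, or None.

    messages is an iterable of (sender, value) pairs. Duplicate messages
    from one sender count once (an assumption of this sketch).
    """
    senders = defaultdict(set)
    for sender, value in messages:
        senders[value].add(sender)
    needed = certificate_size(f, model)
    for value, backers in senders.items():
        if len(backers) >= needed:
            return value
    return None


votes = [("p1", "v"), ("p2", "v"), ("p1", "v")]   # p1's message is duplicated
crash_result = certified_value(votes, f=1, model="crash")
byzantine_result = certified_value(votes, f=1, model="byzantine")
byzantine_later = certified_value(votes + [("p3", "v")], f=1, model="byzantine")
```

With f = 1 two distinct senders are enough in the crash model, whereas the Byzantine model needs a third matching message.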

Using Consensus for Atomic Broadcast

Atomic broadcast is just another name for total-order broadcast. Consensus can be used to achieve atomic broadcast in both the crash and Byzantine failure models; this is sometimes called destinations agreement. Défago et al [Défago04] provide a comprehensive survey of destinations agreement protocols. An example of atomic broadcast [Chandra96c] uses an earlier consensus protocol called the rotating coordinator [Dwork88]. The coordinator extends hierarchical consensus with an estimation round that allows several proposals to be unicast to the coordinator. It waits until it has a majority of estimates before it deterministically proposes one. The coordinator then broadcasts the proposal to a set of witness processes that send back either an ACK or a NACK depending on whether the coordinator is suspected by P. It uses a partially synchronous timing model. The atomic broadcast protocol receives the set of messages decided by the rotating coordinator and delivers them in a deterministic order based on the message identifiers. It completes in ε = 4 and ϕ = 4n. Protocols like [Chandra96c] are limited by their reliance on a partially synchronous timing model for safety; this means they cannot tolerate Byzantine failures. In Section 2.3 we describe FT protocols that use synchrony for liveness only [Castro99, Lamport01]; these are informally described as asynchronous but they are not. The FLP result means that synchrony or randomisation is required.



Byzantine Agreement

Byzantine agreement is a metaphor for tolerating arbitrary failures, described by Lamport et al in their seminal work [Lamport82]. The Byzantine army is encamped outside Edessa. Each division is led by a general. After observing the enemy the generals formulate a plan of attack. Some of the generals may be traitors intent on stopping loyal generals from reaching agreement. The protocol to prevent this must ensure that loyal generals decide on the same action plan and that the plan is not distorted. This problem is reduced to a series of Byzantine Generals problems where one general is a commander. In this case all loyal lieutenants must obey the same order and, if the commanding general is himself loyal, that order must be the one he sent. Byzantine agreement is achieved by m rounds of the Generals problem.

Figure 2.5: Byzantine Generals Problem [Lamport82].

In the first case messages are transmitted between generals orally by messenger. The messenger could also be a traitor. It is assumed, though, that the receiver knows the sender and that the absence of a message can be detected. The Byzantine Generals protocol runs over f + 1 rounds. It starts with the commander declaring attack or retreat in round one; the lieutenants duly record the command. Then the next general sends whether he thought the command was attack or retreat in round two; again the command is noted. This is repeated until after round f + 1, when all generals choose the majority command from their records.
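The oral-message protocol can be sketched in Python as the recursive OM(m) algorithm from [Lamport82]. This is a simulation under stated assumptions rather than a networked implementation: generals are integers, the function names `om` and `majority` are hypothetical, and the behaviour of a traitor (telling odd-numbered receivers to attack and even-numbered ones to retreat) is invented purely for the demonstration.

```python
from collections import Counter


def majority(values, default="retreat"):
    """Most common value, falling back to a default on a tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return default
    return counts[0][0]


def om(commander, lieutenants, value, m, traitors):
    """Simulate OM(m); returns {lieutenant: value it settles on}."""

    def send(sender, receiver, v):
        if sender in traitors:
            # Hypothetical traitor behaviour, chosen for the demonstration.
            return "attack" if receiver % 2 else "retreat"
        return v

    received = {l: send(commander, l, value) for l in lieutenants}
    if m == 0:
        return received

    # Each lieutenant relays what it received, acting as the commander
    # of a smaller OM(m - 1) instance among the other lieutenants.
    relayed = {}
    for l in lieutenants:
        others = [x for x in lieutenants if x != l]
        relayed[l] = om(l, others, received[l], m - 1, traitors)

    decisions = {}
    for l in lieutenants:
        votes = [received[l]] + [relayed[j][l] for j in lieutenants if j != l]
        decisions[l] = majority(votes)
    return decisions


# n = 4 generals and f = 1 traitor, so OM(1) runs for f + 1 = 2 rounds.
loyal_commander = om(0, [1, 2, 3], "attack", 1, traitors={3})
traitor_commander = om(0, [1, 2, 3], "attack", 1, traitors={0})
```

With a loyal commander the two loyal lieutenants follow his order despite the traitorous lieutenant; with a traitorous commander the three loyal lieutenants still settle on one common order.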



To ensure an attack 2f + 1 commands must be gathered so that there are f + 1 loyal attacks to f treacherous retreats. Of course f generals may not state their commands in the first place. In total 3f + 1 generals are needed to ensure a consistent plan; a full proof of this is given by Lamport et al [Lamport82]. The protocol is inherently expensive: if f = 2 then ε = 3 and ϕround = 7.

Authenticated Byzantine Agreement is a special case of Byzantine agreement where it is impossible for traitors to lie. It is assumed that a loyal general's signature cannot be forged, messages are tamper proof and anyone can check the authenticity of the signature. If inconsistency is detected between commands it can be attributed to a treacherous commander rather than a bad lieutenant. After f + 1 rounds, where each general states the command he believes along with the signature, loyal lieutenants should only ever have one correct order. The impact of this is that only 2f + 1 generals are required. Again a full proof is given by [Lamport82]. Unforgeable signatures can be achieved with digital signatures generated with asymmetrical key cryptography. Messages signed with a private key can only be authenticated with the corresponding public key. If the message is tampered with then the authentication will automatically fail. It is thought that the cost of computing digital signatures outweighs the benefit of only requiring 2f + 1 processes. It is interesting to note that Byzantine agreement pre-dates the FLP result and so the metaphor includes no reference to the timing model. Practical implementations that stem from Byzantine agreement assert upper bounds for processor and transmission times [Castro01, Castro02].

Randomised Byzantine Agreement

Ben-Or [Ben-Or83] describes a randomised method of reaching Byzantine agreement without any synchrony assumptions. Like the original [Lamport82] the method can only reach binary Byzantine agreement. Messages cannot be authenticated and it is inefficient, possibly taking as many as ε = 2^n rounds to terminate. This work has been extended into coin-flipping techniques [Rabin83, Toueg84]. They and their successors use variants of a Shared Secret Scheme. This line of research is ongoing but because agreement is limited to {0, 1} it is a cul-de-sac for our work.

2.2.4 Group Membership

Algorithm 1 Group Membership [Cristian91].

Group Membership (GM)
Consensus (C)
Perfect Failure Detector (P)

Initialisation:
    viewid = 0
    view = (viewid, Π)
    processesview = Π
    GM.setView(view)

Upon P.crash(pi)
    processesview = processesview \ {pi}
    waitForLock()
    lock()
    C.propose(viewid + 1, processesview)

Upon C.decide(viewnext, processesnext)
    viewid = viewnext
    view = (viewid, processesnext)
    unlock()

Most FT protocols assume a fixed set of processes Π for the lifetime of the algorithm. In the fail-stop distributed system model processes may leave by crashing but none may join. Modern distributed systems such as peer-to-peer of course have transient processes. Group Membership, coined by [Cristian91], allows processes to join an agreement protocol. It works by processes delivering a group view consisting of (viewepoch, {p1...pn}) every time a process joins, leaves or crashes. Group membership has the following properties. Views must be delivered in numerical order. Agreement on the processes {p1...pn} contained in a group view. For completeness a new view must not contain a process pi that has crashed. Accuracy ensures that if pi ∉ Π then pi has crashed.

Group membership is an agreement problem that can be solved by consensus as shown in Algorithm 1. It is only applicable to crashes detected with P in the fail-stop distributed system model. Weak group membership [Chandra96b] allows suspected processes to be removed from a new view, waiving the accuracy property. A process may later join in the next view following the fail-recovery model. Weak group membership circumvents the FLP result with a weak partial synchrony W.
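The structure of Algorithm 1 can be mirrored in a small Python sketch. It rests on several assumptions: the class name `GroupMembership` is hypothetical, the consensus abstraction is replaced by a plain function that decides whatever is proposed, the lock is reduced to a flag, and crash notifications arrive one at a time.

```python
class GroupMembership:
    """Single-process view of Algorithm 1 with a stand-in for consensus."""

    def __init__(self, processes, consensus):
        self.view_id = 0
        self.members = frozenset(processes)
        self.consensus = consensus   # stand-in for C.propose / C.decide
        self.locked = False

    def on_crash(self, process):
        """Upon P.crash(pi): propose the next view without the crashed process."""
        proposed = self.members - {process}
        self.locked = True           # stands in for waitForLock() and lock()
        view_id, members = self.consensus(self.view_id + 1, proposed)
        self.on_decide(view_id, members)

    def on_decide(self, view_id, members):
        """Upon C.decide: install the agreed view and release the lock."""
        self.view_id = view_id
        self.members = frozenset(members)
        self.locked = False

    @property
    def view(self):
        return (self.view_id, self.members)


def trivial_consensus(view_id, members):
    # Hypothetical stand-in: a lone proposer always has its proposal decided.
    return view_id, members


gm = GroupMembership({"p1", "p2", "p3"}, trivial_consensus)
gm.on_crash("p2")
```

After the crash of p2 the installed view is numbered 1 and no longer contains p2, which matches the numerical-order and completeness properties.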

Figure 2.6: Rampart Normal Operation [Reiter94].

Rampart [Reiter94, Reiter95] is a well known platform that integrates group membership protocols and Byzantine agreement for high-integrity services. Its model is a combination of authenticated Byzantine agreement and viewstamped replication [Oki88]. Rampart authenticates all messages using digital signatures. During normal operation a leader broadcasts a view identifier and message to all backups. They echo the message with a digital signature and digest hash. The leader then aggregates the signatures and hashes and broadcasts them. The normal operation of Rampart is shown in Figure 2.6 where x is the view, d(n) is the hash and sig(i) is the signature of pi. Rampart replicas queue authorised commit messages until a stable group is obtained. The requests add(x, pnew) and remove(x, pbad) establish the new view. When n/3 + 1 processes echo a command it is accepted by all correct replicas. Once the new view is installed all of the queued messages are delivered.

View-synchrony is a special case of group membership. It was first seen in the ISIS platform [Birman87]. In addition to group membership view-synchrony supports the following properties. Virtual Synchrony requires that before a new view is installed all correct processes must deliver the same set of messages in the old view [Vitenberg99]. Sending View Delivery states that if a message is sent in a view it must be delivered by all correct processes in the same view. View synchrony requires blocking during view transitions [Guerraoui06].

The HORUS toolkit [Friedman96] succeeds ISIS by allowing weak virtual synchrony where if pi and pj both deliver mx they do so in the same view. This means the blocking protocol can be replaced with a scheme that allows suggested-views to precede regular views, allowing messages to be delivered during the view change. The performance of HORUS is sufficiently good to be practical [Renesse95]; however, the membership can become partitioned [Vitenberg99, Sussman00]. Another problem with HORUS is that new processes cannot immediately join during a view change transition.

2.2.5 State Machine Replication

State machine replication or SMR is a means of coordinating inputs to replicas so that they can maintain the same internal state. It was first discussed in [Lamport78b] and later surveyed in [Schneider90]. At the heart of SMR is atomic broadcast as shown in Algorithm 2; all replicas must deliver the same inputs to guarantee safety.

Algorithm 2 State Machine Replication with Atomic Broadcast.

State Machine Replication (SMR)
Atomic Broadcast (AB)

Initialisation:
    state = stateinitial

Upon SMR.execute(ci)
    AB.broadcast(ci)

Upon AB.deliver(ci)
    transition = getTransition(state, ci)
    transition.execute()
    ∀ valueoutput ∈ transition : SMR.output(valueoutput)
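A minimal Python sketch shows why identical delivery order matters. It is illustrative only: the `Replica` class and its two commands are hypothetical, and atomic broadcast is replaced by a loop that hands every replica the same commands in the same order.

```python
class Replica:
    """Deterministic state machine over a single integer."""

    def __init__(self):
        self.state = 0

    def deliver(self, command):
        op, arg = command
        if op == "add":
            self.state += arg
        elif op == "mul":
            self.state *= arg
        else:
            raise ValueError("unknown command: " + op)
        return self.state   # the output produced by this transition


def atomic_broadcast(replicas, commands):
    """Stand-in for AB: same commands, same order, at every replica."""
    return [[r.deliver(c) for r in replicas] for c in commands]


commands = [("add", 2), ("mul", 5), ("add", 1)]
replicas = [Replica() for _ in range(3)]
outputs = atomic_broadcast(replicas, commands)
states = [r.state for r in replicas]

# A replica that sees the same commands in a different order diverges.
stray = Replica()
for c in reversed(commands):
    stray.deliver(c)
```

All three replicas finish in the same state, while the replica fed the reversed sequence ends in a different one, which is the safety violation that atomic broadcast prevents.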



There are several important observations about SMR [Schneider90]. The client is responsible for collecting f + 1 identical responses to ensure that one, and therefore all, is correct. We cannot assume a client is correct. Clients too can be replicated, so the same request may be received from two clients. Finally, when a set of replicas access external services, failure of those external services can produce inconsistent state. In this case, all outgoing requests are captured by the external service until it receives f + 1. It then processes the request, sending the response back to all replicas. Effectively, all external services are treated like clients.

2.2.6 More FT Abstractions

We have not reviewed other FT abstractions as their application is outside the scope of the research discussed in this thesis. Checkpoints are persisted representations of an application's state that can be used to restore that application after a crash. Elnozahy et al [Elnozahy02] provide a comprehensive survey of checkpoint techniques. We do not include them in our survey as the overlying application must support persistence of its state, something we do not expect in service oriented applications. Shared-memory is a virtualisation layer for consensus; its domain is distributed storage where clients are too many and too ephemeral for message passing abstractions alone.

2.3 FT Protocols

FT or replication protocols cross the different models of computation to solve well known reliability problems. They leverage the abstractions described earlier in this chapter and are presented next.

2.3.1 Recovery Blocks

Recovery Blocks or RB is one of the earliest forms of FT [Randell75]. It is an example of passive replication because only one replica is polled at any one time. Failures are detected either by fail-stop or through an acceptance test. Each failure results in selection of the next replica as shown in Figure 2.7. The correctness properties are relatively weak. Integrity means that messages are not duplicated or fabricated. Any message sent is subsequently delivered by one replica for validity. The client receives a response or notification of failure, ensuring termination.

Figure 2.7: Recovery Blocks [Pullman01, Randell75].

The original version of RB is not equipped for distributed computing. Adaptations include a specific intermediary process or proxy to coordinate the protocol. Unfortunately, a proxy is a single point of failure. Other implementations embed the proxy inside the client itself [Dobson06]. In general the order in which the replicas are chosen is arbitrary. The original pre-dates timing research, though termination is clearly not possible with asynchrony. In many examples partial synchrony is implicit, provided by the underlying TCP/IP connection. Figure 2.7 shows that RB includes check-pointing; this is generally omitted from more modern counterparts [Dialani02, Dobson06, Jayasinghe05, Fang04, Sommerville05]. RB can incorporate variants, but in practice diversity is of little appeal because common-mode failures are usually in the value domain, so would go undetected. It is then pot-luck as to whether a correct variant is chosen or not. The protocol has strong resilience where f = n − 1 for n replicas. RB is fast, delivering the message to the replica in ϕ = 2 assuming an intermediary. In the worst case when f processes fail sequentially ε = f + 1 and ϕ = 2f + 2.
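The control flow of RB without check-pointing can be sketched in a few lines of Python. The function `recovery_block`, the three example replicas and the acceptance test are all hypothetical; a raised exception stands in for a fail-stop failure and replicas are ordinary callables rather than remote services.

```python
def recovery_block(replicas, acceptance_test, request):
    """Try replicas one at a time until a result passes the acceptance test.

    Returns (result, attempts). Raises RuntimeError once every replica has
    failed, which is the notification of failure that ensures termination.
    """
    for attempt, replica in enumerate(replicas, start=1):
        try:
            result = replica(request)
        except Exception:
            continue   # fail-stop failure: move on to the next replica
        if acceptance_test(request, result):
            return result, attempt
    raise RuntimeError("all replicas failed")


def crashing(x):
    raise ConnectionError("replica unavailable")


def faulty(x):
    return -1          # wrong in the value domain, caught by the test


def correct(x):
    return x * 2


def non_negative(request, result):
    return result >= 0


result, attempts = recovery_block([crashing, faulty, correct], non_negative, 5)
```

Here f = 2 replicas fail in sequence, so the response arrives on attempt f + 1 = 3, in line with the worst case ε = f + 1 quoted above. The acceptance test only rejects the obviously wrong value, which also illustrates how weak such tests can be.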

2.3.2 N-Version Programming

N-Version Programming or NVP is an example of using active replication to leverage diversity [Avizienis85]. The simple protocol is shown in Figure 2.8. It introduces agreement where a majority of correct variants must agree on the same output. Validity ensures that all variants receive the input.

Figure 2.8: N-Version Programming [Avizienis85].

In common with RB, an intermediary can be used to perform the concurrent invocation. Logically, this is often grouped with the output combinator [Sommerville05]. Other examples embed both the invoker and combinator in the client [Looker05]. Agreement depends on the scheme used for selection. First Past the Post is where the first response is used and the rest discarded. This provides resilience to f = n − 1 failures in the timing domain but is generally considered a waste of resources. Voting is a selection technique where the responses are examined and, if a majority agree within a certain tolerance, the correct outputs are merged. There are several variants on the voting theme including formalized majority voter, generalized median voter, formalized plurality voter, and weighted averaging techniques [Torres-Pomales00]. Voting allows diversity to be realised with a resilience of f = (n − 1)/2.

Another esoteric selection method, f/(n−1)-Variant programming [Xu97], replaces voting with system level fault diagnosis. It is a statistical technique that assesses the diagnosable measure f/(n−1) based on a set of outputs. This measure is then used to find at least one correct process that is subsequently nominated to produce the output. The complexity of f/(n−1)-Variant programming grows linearly with n.

NVP does not rely on synchrony for safety. Instead bounds are required to ensure liveness. These can be built into the selection component where the timer is triggered by the first response for a given message [Hall07]. When voting is used for agreement NVP can tolerate Byzantine failures. There are three caveats: variants must be stateless between operations because of a lack of total-ordering guarantees; the selection algorithm must be above reproach because it is a single point of failure; and only bona fide variants may participate in the vote, because this form of replication can be fooled by spoof responses. In a Byzantine failure model 3f + 1 variants must participate to ensure 2f + 1 responses enter the vote, which in turn ensures f + 1 correct responses. NVP is the only known method of tolerating common-mode failures. Assuming an intermediary for both invocation and selection, NVP can deliver a correct response in ϕ = 2n + 2. This performance is invariant as it does not grow with f.
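An exact-match majority voter, the simplest of the voting schemes mentioned above, can be sketched in Python as follows. The function name `majority_vote` is hypothetical, a missing response is represented by None, and no tolerance or merging of near-equal outputs is attempted.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the output produced by more than half of the n variants.

    outputs holds one entry per variant, with None for a variant that did
    not respond in time. Returns None when no strict majority exists.
    """
    present = [o for o in outputs if o is not None]
    if not present:
        return None
    value, count = Counter(present).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


agreed = majority_vote([42, 42, 7])        # one faulty variant out of three
undecided = majority_vote([42, 7, None])   # two variants have failed
```

With n = 3 the voter masks one faulty variant, matching f = (n − 1)/2 = 1, and returns no result once a second variant fails.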

2.3.3 Combining Active and Passive Replication

Self Checking NVP [Torres-Pomales00] applies acceptance tests to self-check each variant before results are sent to the selection algorithm. Acceptance tests are of limited practical use because they can only check for extreme erroneous results.

Figure 2.9: Consensus Recovery Blocks [Torres-Pomales00].

Consensus Recovery Blocks [Scott87] uses NVP in the first instance to try to produce a response. When NVP fails the protocol reverts to an RB scheme as shown in Figure 2.9. The authors of the protocol claim consensus recovery blocks is more reliable than either RB or NVP. It is true that it is more likely to produce an output than NVP alone but the usefulness of the acceptance tests is questionable and erroneous outputs are possible. It would complete in approximately ϕ = 2n + f + 2.

2.3.4 Paxos

Paxos is a family of protocols developed by Lamport [Lamport01] based on the synod algorithm [Lamport98]. Classic Paxos guarantees the agreement, termination and validity consensus properties. When Paxos instances are strung together they provide total-ordering as required for state machine replication. The protocol is shown in Algorithm 3.

The clarity of Paxos is achieved by separating processes into specific roles. A process may take one, some or all roles. A proposer advocates a client's request by assigning a sequence number to it. The natural ordering of sequence numbers defines the sequence by which messages get delivered and, hence, produces a total order of those messages. Proposers are chosen by an Ω leader election to be distinguished so that livelock cannot occur when proposers duel.

Acceptors provide the quorum by which proposals are advocated by broadcasting accepted messages to all processes. In a crash-stop failure model a quorum is defined as f + 1 processes that advocate. During the Ω phase an acceptor promises not to advocate any sequence less than the epoch sent by the provisional distinguished proposer. It may of course advocate a higher sequence number at any time.

Learners receive accepted messages that advocate a proposal. When f + 1 are gained the message can be decided. It is possible that proposals get decided out of order so semantics are required to order the messages before delivery. Once a message is delivered the response is sent back to the client. During a leader election the distinguished proposer receives the last accepted value that has not been delivered from each acceptor. It then must choose one of these values if there is a majority. If not the proposer is free to advocate any incoming client request.



Algorithm 3 Paxos [Lamport01, Lamport04, Turner07].

Classic Paxos (PAXOS)
Fair Loss Links (FLL)
Eventual Leader Election (Ω)

Initialisation:
    pproposer = p1
    proposali = startpoint(pself)    //For Proposer to generate next sequence
    promisecurrent = −1
    acceptors ⊆ Π
    learners ⊆ Π
    quorum = |acceptors|/2 + 1

Upon Ω.trust(pz) //ALL
    pproposer = pz
    if pproposer = pself then
        ∀px ∈ acceptors : FLL.send(prepare, px, proposali)
    end if

Upon PAXOS.propose(m) //PROPOSER
    received[received.length() − 1] = m
    if pproposer = pself then
        ∀px ∈ acceptors : FLL.send(prepare, px, proposali)
        proposali = proposali + 1
    end if

Upon FLL.deliver(prepare, i) //ACCEPTOR
    if i > promisecurrent then
        promisecurrent = i
        FLL.send(promise, pproposer, i, accepted[i])    //OR ⊥ if no messages accepted
    else
        FLL.send(nak, pproposer, promisecurrent)
    end if

Upon FLL.deliver(promise, i, m) //PROPOSER
    preaccepted[i] = preaccepted[i] ∪ {m}
    promisecount[i] = promisecount[i] + 1
    if promisecount[i] ≥ quorum then
        mchosen = choose(preaccepted[i], received[received.length() − 1])
        ∀px ∈ acceptors : FLL.send(accept!, px, i, mchosen)
    end if

Upon FLL.deliver(nak, pi, promisecurrent) //PROPOSER
    if pproposer = pself then
        proposali = promisecurrent + 1
        ∀px ∈ acceptors : FLL.send(prepare, px, proposali)
    end if

Upon FLL.deliver(accept!, i, m) //ACCEPTOR
    if i ≥ promisecurrent then
        accepted[i] = m
        ∀px ∈ learners ∪ {pproposer} : FLL.send(accepted, px, i, m)
    end if

Upon FLL.deliver(accepted, i, m) //PROPOSER + LEARNER
    if pself ∈ learners then
        acceptedcount[i] = acceptedcount[i] + 1
        if acceptedcount[i] ≥ quorum then
            PAXOS.decide(i, m)
        end if
    end if
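The prepare and accept phases of a single Paxos instance can be exercised with a small synchronous Python simulation. This is a sketch under stated assumptions, not a transcription of Algorithm 3: message passing is replaced by direct method calls, a crashed acceptor is represented by None, leader election and learners are omitted, and the open-ended choose function is replaced by the usual rule of adopting the value accepted under the highest sequence number. The names `Acceptor` and `propose` are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1     # highest sequence number promised so far
        self.accepted = None   # (sequence, value) last accepted, if any

    def on_prepare(self, seq):
        if seq > self.promised:
            self.promised = seq
            return ("promise", seq, self.accepted)
        return ("nak", self.promised, None)

    def on_accept(self, seq, value):
        if seq >= self.promised:
            self.promised = seq
            self.accepted = (seq, value)
            return ("accepted", seq, value)
        return ("nak", self.promised, None)


def propose(acceptors, seq, value):
    """Run one prepare/accept exchange; return the decided value or None."""
    quorum = len(acceptors) // 2 + 1
    live = [a for a in acceptors if a is not None]

    promises = [r for r in (a.on_prepare(seq) for a in live) if r[0] == "promise"]
    if len(promises) < quorum:
        return None

    # Adopt any value already accepted under the highest sequence number.
    prior = [r[2] for r in promises if r[2] is not None]
    if prior:
        value = max(prior, key=lambda sv: sv[0])[1]

    acks = [r for r in (a.on_accept(seq, value) for a in live) if r[0] == "accepted"]
    return value if len(acks) >= quorum else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "x")    # decides "x"
second = propose(acceptors, 2, "y")   # must re-decide "x", not "y"
stale = propose(acceptors, 1, "z")    # rejected: sequence number too low

degraded = propose([Acceptor(), Acceptor(), None], 1, "v")   # one crash, n = 3
stalled = propose([Acceptor(), None, None], 1, "v")          # two crashes
```

The second proposal shows the agreement property: once "x" has been accepted by a quorum, a later proposer with a different value is forced to adopt it. The last two calls show that n = 3 tolerates f = 1 crash but makes no progress after two.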



The actual semantics of the choose function are left open-ended provided the choice is deterministic [Lamport01]. Paxos tolerates f = (n − 1)/2 crash failures with the fail-silent distributed system model. The failure of a process that is not the distinguished proposer is simply tolerated provided f + 1 are still correct. When the proposer crashes the underlying Ω election causes a new distinguished proposer to be elected, a reconfiguration. It is possible that proposers can duel, leading to a livelock situation: a failed proposer may recover and start proposing after the next distinguished proposer has been elected [Turner07]. Classic Paxos consensus takes four steps, ε = 4, and ϕ = n² + 3n to complete, assuming that all processes are acceptors and learners. In the worst case where f proposers fail sequentially it can take ϕ = n² + n(2f + 3).

Figure 2.10: Multi-Paxos [Turner07].

Multi-Paxos is a simple optimisation that strings classic Paxos instances together but only requires a prepare phase when a proposer has crashed. It does not sacrifice safety. Proposers in steady-state simply send proposals to the acceptors. Multi-Paxos is shown in Figure 2.10. Multi-Paxos decides a message in three steps, ε = 3, and ϕ = n² + n + 1.



When proposers fail it performs the same as classic Paxos. Multi-Paxos has been used by Google in their locking distributed file system called Chubby [Chandra07]. From now on a reference to Paxos is a reference to Multi-Paxos unless otherwise stated. There have been many recent optimisations to Paxos that try to increase the resilience or reduce the number of steps taken to reach consensus. The simplest one is to use IP multicast to replace n message exchanges. Multicast in conjunction with Paxos is first mentioned in [Lamport05].

Dynamic Paxos [Lamport04] is where the SMR itself executes a proposer request to reconfigure the processes {p1...pn} in view in a way similar to group membership. The proposer achieves this by proposing a sequence of i − α, where α is a constant, with a message containing G, the set of acceptors. The state after i − α contains the set of acceptors for command i to be correct.

Cheap Paxos [Lamport04] uses dynamic Paxos to allow resilience of f = n − 1, rather than f = (n − 1)/2. It works by adding f auxiliary processes when a failure of any process is detected. Up to 2f processes could fail-silent simultaneously and still a quorum can be reached. The quorum in question is for the i − α reconfiguration request to recalibrate n, removing all the failed processes. The clear advantage is that auxiliary processes are only used when failure occurs so the task could be performed by lesser hardware. The downside is the extra round required to reconfigure, with cheap Paxos deciding in ϕ = n²(f + 1) + 2nf + 3n.

Fast Paxos [Lamport06] reduces the number of rounds ε to two in some cases. A Paxos client is able to propose a request directly to a set of acceptors in a fast round, saving one round from client to proposer. Consensus cannot be achieved in less than two rounds [Keidar01]. Fast Paxos starts like Paxos but a distinguished proposer can issue an any command with the ith proposal to start a fast round. Acceptors can then expect requests directly from the client. Fast rounds only cause problems when clients issue requests in near concurrency, causing collisions and acceptors receiving messages out of order. This problem can be greatly reduced by using multicast between client and acceptors [Lamport06]. A coordinated strategy has the distinguished proposer observing the same message mx in two different quorums for sequences i and i + 1. It then issues a proposal for mx with sequence i + 1, ending the fast round. An uncoordinated strategy has the acceptors forming i-quorums, allowing them to see if a message was missed in the ith round. The problem with this strategy is that groups of messages may have been missed and their order cannot be determined. The solution is for the proposer to indicate which i-quorum should be used when it announces the fast round with an any(i). Fast Paxos can decide in ε = 2 and ϕ = n² + n in a fast round. However, this falls back to Paxos in normal rounds, for example when a collision is detected.

Generalised Paxos [Lamport05] promises consensus in two steps by creating partially ordered sets of commands and batch execution. Command-structure sets allow non-conflicting commutative commands. Conflicting commands are determined by a fixed commutativity table with C-Struct sets that define a commutative order over sequences. Near concurrent proposals are also ordered by the commutativity table. This approach to Paxos is used primarily with the shared memory abstraction [Turner07]. Generalised Paxos starts when a distinguished proposer is elected for the ith instance. It sends a startPhase2 to the acceptors who now, as in Fast Paxos, listen for requests directly from clients. When acceptors see a commutative command they learn it without a quorum of acceptances and keep it in the command set; they also broadcast an acceptance for all requests. Non-commutative commands are learnt with a quorum of acceptances. The command set is automatically ordered by the commutativity table semantics. When there are no gaps the command set is decided and delivered. This is the only Paxos protocol where ordering semantics need to be known a priori. A commutative command can be learnt in one step, ε = 1 and ϕ = n, but it must be accompanied by ε = 2 and ϕ = n² + n before the command set is decided and delivered.

2.3.5 Paxos for Byzantine Agreement

Paxos cannot perform Byzantine agreement. To achieve it the following changes must be made. Firstly, there must be more than twice as many correct processes as faulty ones so n = 3f + 1 [Lamport82]. Secondly, there must be another round of exchanges as a verification stage. Upon receiving a proposal all acceptors broadcast a verify message. Once an acceptor has collected a 2f + 1 quorum of verifications it is assured that a message has been endorsed, even if the proposer sends corrupt messages to f acceptors. Now the protocol proceeds like Paxos, broadcasting accepted messages to be learnt. Safety is assured by the verification step.

Byzantine Paxos [Turner07] can be tricked by processes sending messages under the guise of other processes, or by messages being tampered with in transit. To prevent this problem messages must be signed and authenticated. Section 2.3.6 describes CLBFT, a protocol similar to Byzantine Paxos but with authenticated messages. Byzantine Paxos can decide a message in three steps, ε = 3, and ϕ = 2n + 1, assuming multicast and without including the client request.

Paxos at War [Zielinski04] can decide in an optimal two steps in an optimistic case when (n + 3f)/2 acceptors accept the same proposal. It does not use digital signatures for failure-free runs and uses multicast for all communications. The basic operation is demonstrated in Figure 2.11. After a proposer has been elected proposals are broadcast to all acceptors. They weakly accept it. If a quorum of (n + 3f)/2 accept the proposal then it is decided immediately.

If x acceptances are received by an acceptor such that 2f + 1 ≤ x
3f and only liveness is affected if n < 5f + 1. The protocol requires that acceptors endorse a proposal with a digitally signed accepted message. It allows acceptors to collect a commit proof of (|acceptors| + f + 1)/2. Where the acceptor and learner are co-located the commit proof is used to decide the command in two steps. If not, the acceptors transmit an accepted round to the learners; of course this means the protocol decides in three steps. Learners can only decide if they are in possession of a commit proof.

2.3.6 CLBFT

CLBFT [Castro99, Castro01, Castro02] is a FT protocol that achieves SMR and Byzantine agreement in three-step with a resilience of f = (n − 1)/3. In normal operation CLBFT is similar to Byzantine Paxos [Li07] but has a stricter policy on message authentication. It exclusively uses multicast. CLBFT provides state-recovery, garbage collection and viewstamped replication. There are real implementations of this protocol for web services [Merideth05, Zhao07].

CLBFT introduces its own terminology. A primary is the equivalent of a Paxos proposer or leader. A backup is a replica process that is not the primary.

Authentication is crucial to CLBFT. Unlike other Byzantine protocols [Li07, Martin06, Turner07, Zielinski04] all messages are authenticated including those from clients. Two fair assumptions are made about CLBFT cryptography. Adversaries are computationally bound and cannot subvert the cryptographic signatures applied to messages. Messages may be cryptographically hashed, making it impossible to hide tampering. An early version, CLBFT-PK [Castro99], uses DSA 128-bit asymmetric key encryption. This is extremely expensive in terms of computational complexity [Castro01, Chandra07].

Message Authentication Codes or MACs are used instead of DSA for normal operation in CLBFT and can be computed three orders of magnitude faster [Castro99]. MACs are generated from a message and a key shared between two parties. They work as follows. Two parties A and B share a secret key at the beginning of a session using asymmetric cryptography. Party A computes µ_a = mac(m + k_a) for message m with key k_a. A sends m with µ_a to party B. B receives m with µ_a and computes µ_b = digest(m + k_b) with key k_b. B verifies that µ_a = µ_b; if it does then k_a = k_b = k. From this B can deduce whether the message was sent by A and whether it has been tampered with. The digest makes discovering the secret key in transit impossible. Unfortunately, MACs are not as powerful as digital signatures [Castro01]. They cannot be used to convince a third party without sharing the secret key. CLBFT cannot use authentication proof in normal operation unlike CLBFT-PK. When multicasting messages, vectors of MACs must be computed by the sender, one for each destination. These vectors are called authenticators.
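The MAC and authenticator scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CLBFT implementation: it substitutes HMAC-SHA256 from the Python standard library for the mac(m + k) construction, and the replica names, session keys and request are hypothetical.

```python
import hashlib
import hmac


def mac(message, key):
    """Compute a MAC over the message with a shared session key.

    HMAC-SHA256 is an assumption made for this sketch.
    """
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(message, tag, key):
    """Recompute the MAC on the receiving side and compare in constant time."""
    return hmac.compare_digest(mac(message, key), tag)


def authenticator(message, session_keys):
    """Build the vector of MACs for a multicast, one entry per destination."""
    return {replica: mac(message, key) for replica, key in session_keys.items()}


# Hypothetical session keys that replica p0 shares with p1, p2 and p3.
keys = {"p1": b"key-p0-p1", "p2": b"key-p0-p2", "p3": b"key-p0-p3"}
request = b"(request, transfer, 42, client7)"
alpha = authenticator(request, keys)

# Each destination checks only its own entry of the authenticator.
print(verify(request, alpha["p2"], keys["p2"]))
```

Because every entry is keyed to a single pair of parties, a receiver can check its own entry but cannot use it to convince a third party, which is the limitation noted above.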

A solitary MAC used for a unicast between p_i and p_j and appended to m is designated mµ_pi,pj. An authenticator used in a multicast from p_i and appended to m′ is designated m′α_pi.

Viewstamped Replication [Oki88, Liskov07] is one of the key abstractions of CLBFT. Throughout its life the protocol moves through a set of configurations called views. Backups observe sequence numbers to watch for irregularities in order. They also have a timeout ∆ to ensure that the primary is making progress. Backups indicate that a view-change is needed by broadcasting a digitally signed message. When a new primary has collected a weak certificate of f + 1 view-changes it starts the new view. Viewstamped replication ensures ordering by combining a view designator v with a logical clock timestamp ts. When a view change occurs gaps in sequence numbers are filled with special no-op commands. In any view the process chosen as primary is p_i where i = v mod n.

Logging to persistent state is used in CLBFT. Before a message can be logged the following must apply. The message must be authenticated with a MAC or digital signature. It must have a view number assigned that is the current view. Finally, it must be assigned a sequence number sn such that h < sn ≤ H, where h and H are the low and high water marks respectively.

A garbage collector is used to prevent message logs growing without bound. It cannot simply remove entries relating to executed commands as some of those entries, for example the prepared certificate, may be needed for safety after a view-change has taken place. Instead the checkpointing routine shown in Algorithm 4 is used. Every Kth time a command is executed by the SMR the replica multicasts and logs (checkpoint, sn, digest(SMR.getState()), p_i)α_pi. The state is application specific and responsibility is delegated to the SMR itself. Replicas log all authentic checkpoint messages from other replicas. Once a replica collects a quorum of 2f + 1 checkpoint messages for the same sequence number it gains a stable certificate for sequence number sn. This allows a replica to advance h, the low water mark, to sn and remove all entries that have a sequence number less than sn. Variations of this checkpointing technique are presented by [Castro01].

Algorithm 4 Checkpointing Garbage Collector in CLBFT [Castro02].
Castro-Liskov BFT (CLBFT), Multicast (M), Authenticator (A), State Machine Replication (SMR)
Initialise:
    K = 128

Upon CLBFT.decide(v, sn, m, client, timestamp)
    if sn mod K = 0 then
        d = digest(SMR.getState())
        log = log ∪ {(checkpoint, sn, d, p_self)}
        M.broadcast((checkpoint, sn, d, p_self)α_pself)
    end if

Upon M.deliver((checkpoint, sn, d, p_i)α_pi)
    if Authenticator.accept(α_pi) ∧ ¬(∃(checkpoint, sn, d, p_i) ∈ log) then
        log = log ∪ {(checkpoint, sn, d, p_i)}
        if ∃C, C ⊆ log : |C| ≥ 2f + 1 ∧ (∀(checkpoint, sn′, d′, p?) ∈ C : sn′ = sn ∧ d′ = d) then
            log = log ∪ {(stablecheckpoint, sn, d)} // stable certificate
            for all entry ∈ log : entry.sn ≤ sn do
                if (entry.type = checkpoint ∧ entry.sn < sn) ∨ entry.type ≠ checkpoint then
                    log = log \ {entry}
                end if
            end for
            h = sn
        end if
    end if
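The checkpointing rule can be illustrated with a short sketch. It is a simplification under stated assumptions: messages are taken to be already authenticated, log entries are reduced to hypothetical (kind, sn) pairs, and the class and method names are invented for this example.

```python
from collections import defaultdict


class CheckpointCollector:
    """Tracks checkpoint votes and truncates the log once a checkpoint is stable."""

    def __init__(self, f):
        self.f = f
        self.h = 0                      # low water mark
        self.log = []                   # protocol entries as (kind, sn) pairs
        self.votes = defaultdict(set)   # (sn, digest) -> replicas that sent it

    def on_checkpoint(self, sn, digest, replica):
        """Record an authenticated checkpoint; return True if it became stable."""
        if sn <= self.h:
            return False                # already covered by a stable checkpoint
        self.votes[(sn, digest)].add(replica)   # a set ignores duplicate senders
        if len(self.votes[(sn, digest)]) < 2 * self.f + 1:
            return False
        # Stable certificate: drop protocol entries up to sn and older votes.
        self.log = [(kind, s) for kind, s in self.log if s > sn]
        kept = {key: voters for key, voters in self.votes.items() if key[0] >= sn}
        self.votes = defaultdict(set, kept)
        self.h = sn
        return True


gc = CheckpointCollector(f=1)
gc.log = [("prepare", 100), ("commit", 128), ("preprepare", 129)]
results = [gc.on_checkpoint(128, "d1", p) for p in ("p0", "p1", "p1", "p2")]
print(results, gc.h, gc.log)
```

With f = 1 the third distinct sender completes the quorum of 2f + 1, the low water mark advances to 128 and only the entry for sequence number 129 survives.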

Normal Operation

The normal-case operation is shown in Figure 2.12 and more formally in Algorithm 5. Clients multicast a (request, command, timestamp, client)α_i to all replica processes, which log it if it is authentic. A client will retransmit a request if a response has not been received within the bound ∆. The client awaits f + 1 (reply, v, timestamp, client, sn, result)α_i responses from the state machine. Having f + 1 identical responses guarantees that at least one response is from a correct replica and, by inference, that all the identical ones are correct also. Multiple clients can work in tandem to prevent faulty clients being a point of failure.

Figure 2.12: CLBFT Normal Operation [Liskov07].

Request. When a primary process receives a message (request, m, timestamp, client)α_i the message is authenticated and the timestamp checked. It then multicasts (preprepare, v, sn, digest(m)) where sn is the next sequence number. Digests are sent to reduce communication complexity. The primary also logs the pre-prepare message.

Pre-prepare. When a replica receives an authentic pre-prepare message that message is logged. The replica now multicasts (prepare, v, sn, digest(m), p_i) to all replicas and logs the pre-prepare and prepare messages.

Prepare. When a replica receives an authentic prepare message that message is logged. If a replica collects 2f prepare messages for the same v, sn and digest from different replicas then it has gained a prepared certificate. It can now multicast a (commit, v, sn, p_i). A prepared certificate guarantees the safety of a quorum agreeing to assign v, sn to m.

Commit. Replicas may collect prepared certificates for the same sn but in different views and with different requests. The commit phase solves this possible violation of ordering by requiring a committed certificate. When a replica receives an authentic commit message that message is logged. When 2f + 1 commit messages are collected for the same v, sn and m from different replicas a committed certificate is gained. The consensus decides m and it is executed by the SMR. The SMR executes a command and sends (reply, v, timestamp, client, sn, result)α_i to the client. The client is obtained from the corresponding request entry in the log. In normal operation all 3f + 1 replicas execute a request in their local state machine but only the first f + 1 identical responses are accepted by the client.

Algorithm 5 Normal Operation of CLBFT (without recovery) [Castro02].
Castro-Liskov BFT (CLBFT), Multicast (M), Authenticator (A)
Initialise:
    v_current = 1
    h = 1
    H = h + |log size|
    sn_generator = 1

Upon M.deliver((request, m, ts, c)α_pi)
    if Authenticator.accept(α_pi) then
        log = log ∪ {(request, m, ts, c)}
        if isPrimary?(p_self, v_current) then
            M.broadcast((preprepare, v_current, sn_generator, digest(m))α_pself)
            log = log ∪ {(preprepare, v_current, sn_generator, digest(m))}
            sn_generator = sn_generator + 1
        end if
    end if

Upon M.deliver((preprepare, v, sn, d, p_sender)α_pi)
    if ¬isPrimary?(p_self, v_current) ∧ Authenticator.accept(α_pi) ∧ (h < sn ≤ H) ∧ ¬(∃d′ : d ≠ d′ ∧ (preprepare, v, sn, d′, ?) ∈ log) then
        if ∃m : (request, m, ts, c) ∈ log ∧ digest(m) = d then
            M.broadcast((prepare, v_current, sn, d, p_self)α_pself)
            log = log ∪ {(preprepare, v, sn, d, p_sender)}
            log = log ∪ {(prepare, v, sn, d, p_self)}
        end if
    end if

Upon M.deliver((prepare, v, sn, d, p_sender)α_pi)
    if Authenticator.accept(α_pi) then
        log = log ∪ {(prepare, v, sn, d, p_sender)}
        if ∃p, p ⊆ log : (∀(prepare, v′, sn′, d′, ?) ∈ p : v = v′ ∧ sn = sn′ ∧ d = d′) ∧ |p| ≥ 2f then
            M.broadcast((commit, v, sn, p_self)α_pself)
            log = log ∪ {(commit, v, sn, p_self)}
        end if
    end if

Upon M.deliver((commit, v, sn, p_sender)α_pi)
    if Authenticator.accept(α_pi) then
        log = log ∪ {(commit, v, sn, p_sender)}
        if ∃c, c ⊆ log : (∀(commit, v′, sn′, ?) ∈ c : v = v′ ∧ sn = sn′) ∧ |c| ≥ 2f + 1 then
            r = extractRequest(v, sn, log)
            CLBFT.decide(v, sn, r.message, r.client, r.timestamp)
        end if
    end if
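The two certificate thresholds used in normal operation can be made concrete with a small sketch. It assumes messages have already been authenticated and counts only distinct senders; the class and method names are hypothetical and chosen for this illustration.

```python
from collections import defaultdict


class CertificateTracker:
    """Counts distinct senders of prepare and commit messages per slot."""

    def __init__(self, f):
        self.f = f
        self.prepares = defaultdict(set)   # (v, sn, d) -> senders
        self.commits = defaultdict(set)    # (v, sn) -> senders

    def on_prepare(self, v, sn, d, sender):
        """Return True once a prepared certificate (2f prepares) is held."""
        self.prepares[(v, sn, d)].add(sender)
        return len(self.prepares[(v, sn, d)]) >= 2 * self.f

    def on_commit(self, v, sn, sender):
        """Return True once a committed certificate (2f + 1 commits) is held."""
        self.commits[(v, sn)].add(sender)
        return len(self.commits[(v, sn)]) >= 2 * self.f + 1


# With f = 1 (n = 4) a replica needs 2 prepares and 3 commits for a slot.
tracker = CertificateTracker(f=1)
prepared = [tracker.on_prepare(1, 7, "d7", p) for p in ("p1", "p1", "p2")]
committed = [tracker.on_commit(1, 7, p) for p in ("p0", "p1", "p2")]
print(prepared, committed)
```

Using sets keyed by view, sequence number and digest means that a retransmitted or duplicated message from the same replica never counts twice towards a certificate.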

Recovery

CLBFT recovery is shown in Figure 2.13. When a replica notices gaps in sequence numbers or times out before committing a request it suspects the primary of being faulty.

Figure 2.13: CLBFT Recovery Protocol [Castro02].

It computes a view change message (viewchange, v + 1, h, C, P, Q, p_i)α_pi as shown in Algorithm 6. The index h is the low water mark and C is a set of (sn, d_checkpoint) elements representing logged checkpoints. The set P contains (v, sn, d) elements representing all requests that have gained a prepared certificate and, finally, Q is the set of requests for which a pre-prepare message has been received. Once these sets are calculated all pre-prepare, prepare and commit messages are removed from the log.

Algorithm 6 Calculating a View-Change in CLBFT [Castro02].
Upon Timeout_∆ ∨ (sn ≠ sn_previous + 1)
    P = ∅
    Q = ∅
    C = {(sn′, d′) | ∀(checkpoint, sn′, d′, p′) ∈ log}
    for all sn ∈ h < sn ≤ (h + |log size|) do
        if ∃(prepared, v, sn, d, p) ∈ X, X ⊆ log : |X| ≥ 2f + 1 ∧ (∀(prepared, v′, sn′, d′, p′) ∈ X : v = v′ ∧ sn = sn′ ∧ d = d′) then
            if ∃(v′, sn, d′) ∈ P then
                P = P \ {(v′, sn, d′)}
            end if
            P = P ∪ {(v, sn, d)}
        end if
        if ∃(preprepare, v, sn, d) ∈ log then
            if ∃(v′, sn, d′) ∈ Q then
                Q = Q \ {(v′, sn, d′)}
            end if
            Q = Q ∪ {(v, sn, d)}
        end if
    end for
    M.broadcast((viewchange, v + 1, h, C, P, Q, p_self)α_pself)
    for all entry ∈ log : entry.type ∈ {preprepare, prepare, commit} do
        log = log \ {entry}
    end for

Replica processes collect authentic and different view-change messages for new view v + 1, accepting them if the tuples in P and Q contain view numbers ≤ v. They acknowledge by sending a (viewchangeack, v + 1, p_i, p_sender, digest(m))µ_pi message, for m = (viewchange, v + 1, h, C, P, Q, p_i), to the primary for v + 1. The acknowledgements allow the new primary to check the authenticity of view-change messages. A new primary collects view-changes and acknowledgements from other replicas. Once 2f − 1 acknowledgements pertaining to p_sender have been gained, the view-change generated by p_sender is added to a set S. The digests in the acknowledgements are used to authenticate the acknowledgement against the view-change from p_sender. A view-change in S with 2f − 1 acknowledgements, including one generated by the new primary, forms a view-change certificate.
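The way a new primary admits view-change messages into S can be sketched as below. This is a simplified illustration under stated assumptions: SHA-256 stands in for the digest function, messages are opaque byte strings, the acknowledgement generated by the new primary itself is not modelled separately, and the class name and replica identifiers are hypothetical.

```python
import hashlib
from collections import defaultdict


def digest(message):
    """Digest of a view-change message (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(message).hexdigest()


class NewPrimary:
    """Admits a view-change into S once 2f - 1 matching acknowledgements exist."""

    def __init__(self, f):
        self.f = f
        self.view_changes = {}            # sender -> view-change message
        self.acks = defaultdict(set)      # (sender, digest) -> acknowledging replicas
        self.S = {}                       # vouched-for view-changes

    def on_view_change(self, sender, message):
        self.view_changes[sender] = message
        self._certify(sender)

    def on_ack(self, acker, sender, d):
        self.acks[(sender, d)].add(acker)
        self._certify(sender)

    def _certify(self, sender):
        message = self.view_changes.get(sender)
        if message is None:
            return                        # acknowledgements arrived first; wait
        if len(self.acks[(sender, digest(message))]) >= 2 * self.f - 1:
            self.S[sender] = message


primary = NewPrimary(f=2)
vc = b"(viewchange, 2, h, C, P, Q, p3)"   # placeholder payload
primary.on_view_change("p3", vc)
for acker in ("p0", "p1"):
    primary.on_ack(acker, "p3", digest(vc))
primary.on_ack("p2", "p3", digest(b"forged"))   # digest mismatch, not counted
primary.on_ack("p4", "p3", digest(vc))
print(sorted(primary.S))
```

Keying acknowledgements by the digest they carry is what lets the primary discard acknowledgements for a message that differs from the view-change it actually holds.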

Algorithm 7 Primary Decision for CLBFT View-Change [Castro02].
Upon view-change message added to S
 1: C_viewchange = {(sn, d_checkpoint) | ∃X, Y ⊆ S : |X| ≥ 2f + 1 ∧ |Y| ≥ f + 1 ∧ (∀m ∈ X : m.h ≤ sn) ∧ (∀m′ ∈ Y : (sn, d_checkpoint) ∈ m′.C)}
 2:
 3: if ∃(h, d_checkpoint) ∈ C_viewchange : (∀(sn′, d′_checkpoint) ∈ C_viewchange : sn′ ≤ h) then
 4:     R = ∅
 5:     for all sn ∈ h < sn ≤ h + |log size| do
 6:         if ∃m ∈ S : (v, sn, d) ∈ m.P ∧ (∀m′ ∈ M, M ⊆ S : |M| ≥ 2f + 1 ∧ m′.h < sn ∧ ∀(v′, sn, d′) ∈ m′.P : v′ < v ∨ (v′ = v ∧ d′ = d)) ∧ (∀m″ ∈ N, N ⊆ S : |N| ≥ f + 1 ∧ ∃(v″, sn, d″) ∈ m″.Q : v″ = v ∧ d″ = d) then
 7:             R = R ∪ {(request, m : digest(m) = d, ts, c)} // d from line 6
 8:         else
 9:             if ∃m ∈ P, P ⊆ S : |P| ≥ 2f + 1 ∧ m.h < sn ∧ (v?, sn, d?) ∉ m.P then
10:                 R = R ∪ {(request, no-op)}
11:             end if
12:         end if
13:     end for
14: end if
15: if |R| ≥ |log size| then
16:     V = ∅
17:     for all entry ∈ S do
18:         V = V ∪ {(entry.p_sender, entry)}
19:     end for
20:     X = C_viewchange ∪ R
21:     M.broadcast((newview, v + 1, V, X)α_pself)
22:     return
23: end if

Every time a new view-change is added to S the decision defined in Algorithm 7 is made. The starting point is to form a set C_viewchange of (sn, d_checkpoint) from view-change messages in S where a quorum certificate (2f + 1) is granted for view-change messages with a low water mark h such that h ≤ sn. Also a weak certificate (f + 1) is granted for view-change messages that contain (sn, d_checkpoint); this ensures C_viewchange only contains correct checkpoints. From C_viewchange the highest sequence number possible is assigned to h. This corresponds to lines 1 and 3 of Algorithm 7. The decision iterates over all possible sequence numbers starting at the highest correct checkpoint number h up to h + |log size|, line 5. Note how this corresponds to the low and high water marks used by the garbage collector. If a request m was committed in the previous view v then that request must be selected, line 6. The synopsis of line 6 is as follows: it ensures that the primary selects a client request that some replica in the quorum claims to have prepared in view v [Castro02]. Alternatively, for a given sn there is a quorum (2f + 1) of replicas that did not prepare any requests, line 9. In this case a special request, no-op, is selected which performs a NULL operation in the SMR.

The decision algorithm terminates for primary v + 1 when a request has been selected for every sn such that h < sn ≤ h + |log size|. This may require |S| > n − f view-change messages, but a primary is always able to complete once it receives all view-change messages sent by correct replicas. A primary may now multicast (newview, v + 1, V, X) where V is all the view-change messages in S indexed by process identifier and X is the union of the checkpoints C_viewchange and the requests R selected by the primary. All the view-changes in V are called the new-view certificate. The primary creates a pre-prepare message in the log for every request message in X, numbering them from h onwards for view v + 1. If a primary does not have a record of every request message in X it initiates a fetch checkpoint protocol. For brevity we have omitted this protocol but it is presented at length in [Castro02]. We assume that the primary has all these requests. It now logs all the requests in X as pre-prepared for view v + 1 and h < sn ≤ H.

A backup collects view-change and new-view messages for v + 1 until it has matching view-change messages for every entry in V. If a backup is missing a view-change message in its log for a given process p_i in V then it can use f + 1 collected view-change-acks, pertaining to p_i, to form a weak certificate to vouch for the view-change message. If there are not enough view-change-acks the backup can request the primary for v + 1 to multicast the view-change for p_sender once again. All backups must acknowledge this with acknowledgement messages. A backup may bypass the authentication for the view-change message it received from the primary since the signature is for p_sender. It logs the view-change message from p_sender, sent via the primary, and continues. Finally, the backups must validate the decisions made by the primary for v + 1. To do this they re-run the decision algorithm (lines 1-14 of Algorithm 7) to validate V, X against the view-changes they have received, S′. If the validation is successful the backup logs a pre-prepare message for each request m in X in common with the primary and also multicasts a prepare message for m in v + 1. Normal operation is then resumed. If the validation on V, X fails then the replica multicasts (viewchange, v + 2, h, C, P, Q) and the recovery protocol starts again.
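When validation fails and recovery restarts, each retry moves to the next view and therefore to the next primary under the rule i = v mod n. The sketch below illustrates that rotation together with the resilience bound; the function names are hypothetical and rounding f down to a whole number of replicas is an assumption of the sketch.

```python
def max_faults(n):
    """Largest number of faulty replicas tolerated, f = (n - 1)/3 rounded down."""
    return (n - 1) // 3


def primary_for(view, n):
    """Index i of the primary p_i in a given view, i = v mod n."""
    return view % n


def next_views(view, n, attempts):
    """Views and primaries tried if successive view-changes have to be repeated."""
    return [(v, primary_for(v, n)) for v in range(view + 1, view + 1 + attempts)]


# Four replicas tolerate one fault; starting from view 2 the primaries rotate.
print(max_faults(4), next_views(2, 4, 3))
```

Because the primary is a deterministic function of the view number, every correct replica agrees on who should lead view v + 1 or v + 2 without any further election.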

Performance

CLBFT can deliver messages in ε = 3 and ϕ = 2n with multicast. The recovery protocol takes ε = 3 and ϕ = 2(n − f). In both cases there is a high computational complexity associated with computing the MACs for every exchange. Additionally, the recovery protocol is extremely complex with a number of nested loops; this only serves to increase the time taken to decide a message. The authors, realising the overheads of the CLBFT protocol, suggest the following optimisations [Castro02]. A client may specify a distinguished replica that sends a full response to the client whilst the rest send digest replies. If the distinguished replica is faulty the request is retransmitted without a distinguished replica specified. Tentative execution can be used to decide requests in two-step. A replica may tentatively decide a message after gaining only a prepared certificate, resulting in a (replytentative, v, timestamp, client, sn, result)α_i response to the client. A client must receive 2f + 1 identical tentative responses otherwise it retransmits. A replica state machine that has executed a tentative message must be rolled back to the last checkpointed state when a view-change occurs. Finally, if an operation is read-only then it can be executed by the state machine once the request has been authenticated. A client must wait for 2f + 1 identical responses.
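The client-side acceptance rule differs between normal and optimised replies, which the following sketch makes explicit. It assumes replies are already authenticated and that results are hashable values; the function name and the sample replies are hypothetical.

```python
from collections import Counter


def accepted_result(replies, f, tentative=False):
    """Return the result backed by enough identical replies, else None.

    replies maps a replica identifier to the result it returned. Normal replies
    need f + 1 matching results; tentative (and read-only) replies need 2f + 1.
    """
    needed = 2 * f + 1 if tentative else f + 1
    if not replies:
        return None
    result, count = Counter(replies.values()).most_common(1)[0]
    return result if count >= needed else None


replies = {"p0": "ok", "p1": "ok", "p2": "bad"}
print(accepted_result(replies, f=1))                   # f + 1 = 2 matching replies
print(accepted_result(replies, f=1, tentative=True))   # needs 2f + 1 = 3
```

A client that receives None would keep waiting for more replies or retransmit the request, as described above.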

BASE

The authors of CLBFT have created a C++ library implementation, BFT, which is used by Thema [Merideth05]. It has been adapted into BFT with Abstract Specification or BASE [Rodrigues01], a library for opportunistic NVP. It allows replica state machines to be implemented with components off the shelf or COTS. Integrating COTS is difficult because of the non-deterministic differences between them. Of course determinism is crucial to SMR. BASE requires a common specification of all variant implementations and a set of conformance wrappers. A similar approach is used by [Sommerville05] but with stateless variants. BASE, like NVP, requires 2f + 1 responses to produce a meaningful vote of f + 1. However, SMR based protocols only guarantee to produce f + 1 identical responses, or similar responses in the case of BASE. An optimistic approach is for the selection algorithm to expect 2f + 1 responses but time out if they are not gained and use x responses instead where f + 1 ≤ x < 2f + 1. The pessimistic alternative is to perform selection on just f + 1.
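The optimistic and pessimistic collection strategies can be sketched with a blocking queue and a deadline. This is an illustration only and is not taken from BASE: the function name, the use of a thread-safe queue as the response inbox and the timeout value are all assumptions.

```python
import queue
import time


def collect_responses(inbox, f, timeout, optimistic=True):
    """Gather variant responses for the selection algorithm.

    Optimistic: wait for 2f + 1 responses, but settle for x responses with
    f + 1 <= x < 2f + 1 if the timeout expires. Pessimistic: stop at f + 1.
    """
    wanted = 2 * f + 1 if optimistic else f + 1
    minimum = f + 1
    responses = []
    deadline = time.monotonic() + timeout
    while len(responses) < wanted:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            responses.append(inbox.get(timeout=remaining))
        except queue.Empty:
            break
    if len(responses) < minimum:
        raise TimeoutError("fewer than f + 1 responses were received")
    return responses


inbox = queue.Queue()
for response in ("a", "a"):
    inbox.put(response)
# Only two of the hoped-for three responses arrive before the timeout (f = 1).
print(collect_responses(inbox, f=1, timeout=0.05))
```

The optimistic variant trades a bounded wait for a larger vote when all variants respond in time, while the pessimistic variant returns as soon as the guaranteed f + 1 responses are available.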

2.3.7

FT Protocols Summary

The FT protocols surveyed are presented in Table 2.3. From this it can be seen that there is a clear division between the simple protocols providing passive and active replication and the more complex protocols based on atomic broadcast. This is most clearly demonstrated by the performance expressions: for the early protocols the expressions are simple to quantify but for the later protocols they are complex.

As Paxos As Paxos

Crash

Crash

Crash

Byzantine

Fast-Paxos [Lamport06]

Generalized-Paxos [Lamport05] Byzantine-Paxos [Turner07]


Byzantine

Byzantine

FaB-Paxos [Martin06]

CLBFT [Castro02]

As Paxos

As Paxos

As Paxos

As Paxos Multicast FLL Ω Election Byzantine-Agmt SMR FLL View-Change Byzantine-Agmt SMR FLL View-Change Byzantine-Agmt SMR Multicast View Stamped Repl. Authentication Byzantine-Agmt SMR

As Paxos

FLL Ω Election Consensus SMR As Paxos

FLL

FLL

Fair Loss Links

Abstractions

Table 2.3: FT Protocols Summary.

f = n−1 3

f f = n−2t−1 3

f = n−1 5

c Opt: f = 2a−n 3 Pes: f = n−1 3

f = n−1 3

f = n−1 2

f = n−1 2

f = n − ax − 1a

f = n−1 2

f = n−1 2

f = n−1 2

f =n−1

Resilience

ε=3 ϕ = 2n

ε=2 d ϕ=n+1 ε=3 e ϕ = 2n + 1 ε=2 ϕ = 15f 2 +13f +2

ε=3 ϕ = 2n

ε = 3(f + 1) ϕ = 2(n(f + 1) − f 2 )

ε = 5f + 2 ϕ = 30f 2 + 16f + 2

ε = 4f + 3 ϕ = (f + 1)(2n + 1)n

ε = 4(f + 1) ϕ = n2 + n(2f + 3)n ε = 3f + 1 ϕ = 2(n(f + 1) − f 2 )

ε=1 ϕ=n

ε=2 ϕ = n2 + n

ε = 4(f + 1) ϕ = n2 (f + 1) + 2nf + 3n ε=f +4 ϕ = n2 + (f + 3)n ε=3 b + 1) ϕ = (n − 1)( 2n 3

ε=f +4 ϕ = n2 + n(f + 3)

ε=f ϕ = 2n + f + 1

ε=1 ϕ = 2n

ε=f ϕ = 2f + 2

Performance Worst

ε=2 ϕ = n2 + n

ε=4 ϕ = n2 + 3n

ε=1 ϕ = 2n

ε=1 ϕ = 2n

ε=1 ϕ=2

Performance Best

BASE g

8

8

8

8

8

8

8

4

4

Limited

Diversity

a With ax auxiliary processes.
b When requests conflict.
c Where a is the number of acceptors.
d Optimistic execution n ≥ a.
e Pessimistic execution n < a.
f Parametrised FaB Paxos where t is the number of failures that allow two-step execution.
g BFT with Abstract Specification, using opportunistic N-Version Programming.

Byzantine

Paxos-at-War [Zielinski04]

As Paxos

Agreement Termination Validity Total-Order As Paxos

As NVP

Cheap-Paxos [Lamport04]

(Multi)Paxos [Lamport01]

Consensus-RB [Scott87]

Crash Byzantine Common-mode Crash Byzantine Common-mode Crash

NVP [Avizienis85]

Integrity Validity Termination As RB Agreement

Correctness Properties

Crash

Failure Model

RB [Randell75]

Name

2.4

Summary

This chapter has surveyed the fault tolerance abstractions and protocols in the distributed computing domain. It started by presenting the concepts as a set of challenges such as the FLP result and Byzantine failures. We looked at the failure, timing and process models that form the distributed system model. Specific problems were addressed in the form of abstractions: we discussed failure detectors and broadcast, including the problem of total order. We discussed consensus and in particular Byzantine agreement. We also described the higher level abstractions of state machine replication and group membership. In the third part of this chapter we incrementally surveyed replication protocols that provide FT. These range from the humble Recovery Block to the extremely complex CLBFT.

Chapter 3

Service Oriented Computing

This chapter surveys the aspects of Service Oriented Computing (SoC) that impact on FT. SoC [Papazoglou03] extends SoA, also addressing overarching concerns such as coordination, security and SLAs. It is represented by the pyramid in Figure 3.1. We assume the reader is familiar with the concept of Web services and the principles, technologies and specifications that support them including XML, SOAP and WSDL. The early part of this chapter is a review of service oriented architectures. We focus on messaging patterns and service discovery as these are the most pertinent to our research. The later part of this chapter is an exposition and survey of Peer to Peer (P2P), a paradigm closely related to SoC but used to provide FT by decentralisation.

3.1

Service Oriented Architecture

Service oriented architecture (SoA) is a paradigm for organizing and utilizing distributed capabilities that may be under the control of different ownership domains. It provides a uniform means to offer, discover, interact with and use capabilities to produce desired effects consistent with measurable preconditions and expectations [MacKenzie06]. SoA instances are generally domain specific but they derive from the general paradigm shown in Figure 3.2.

Figure 3.1: SOC the extended SoA [Papazoglou03].

Figure 3.2: Service Oriented Architecture.

A provider hosts a service. Service descriptions in WSDL are published to brokers by providers, who also accept bindings from consumers. A broker accepts descriptions and publishes them; it also allows discovery requests from consumers. Consumers, or clients, discover services from a broker and then bind to those services with the provider. SoA raises three important questions. How is information exchanged with other institutions? How are those other institutions and the services they offer discovered? Finally, what are the other institutions? These questions are discussed in the following sections.

3.2

Service Messaging

A review follows covering the different messaging schemes available with SoA. Messaging is the protocol used for exchanging information between institutions. When agreement is reached on the protocol a service is said to be bound. It is more than simply sending a message down the wire. Message exchanges are influenced by the underlying transport patterns.

3.2.1

REST

REpresentational State Transfer or REST was introduced in [Fielding00]. It is the embodiment of synchronous messaging, relying on a retrospective abstraction of the principles that make the World Wide Web scalable [Muehlen05]. The scheme is exclusively tied to the ubiquitous HTTP protocol. REST does not require any message format such as SOAP because this is left to the implementing application. Resources are the central concept. Represented as URIs, they are accessed by sending a command, the verb, to a resource URI. A command derives from HTTP where command ∈ {get, put, post, delete}. The client gets the resource as a document, typically XML, which it does not understand but the application does, as shown in Figure 3.3. To effect a change a client posts an amended document to the server.

Figure 3.3: REST [Fielding00].

REST brings useful properties to a SoA. It performs well because of the limited amount of information processing that is required by HTTP. It is provably scalable because it is based on the WWW. It can be tunnelled for security and web cached for read-only performance. REST is limited when considering integration with FT. Firstly, it requires synchronous interactions because of its HTTP heritage, thus there is an imposed synchrony between client and service (the client blocks execution until a response or timeout is received). Secondly, because of the lack of a state container within the protocol itself, meta data cannot be transferred between services. Meta data transfers are needed for FT protocols to establish causality or sequence. Lastly, REST is tied closely to the application using it, meaning FT infrastructure would need to reside in the application itself.

3.2.2

MOM

Message Oriented Middleware (MOM) is a paradigm that allows asynchronous, document-oriented interactions [Brambilla04]. Based on the broker pattern (an intermediary) it models another great paradigm of the Internet, electronic mail. MOM is an important refinement on the SoA shown in Figure 3.2, putting the service broker in-line between the consumer and provider as part of the bind process. Possible MOM interaction patterns are shown in Figure 3.4. Polling consists of two synchronous exchanges: the first obtains a ticket and the second obtains the response with the ticket. Callback is a true asynchronous one-way exchange where both parties provide services. It is the basis for peers in Peer-to-Peer as discussed in Section 3.5. Callback is more efficient than polling as it halves the number of messages exchanged. Publish-Subscribe is a generalisation of callback where a node can receive several notifications per subscription.
Figure 3.4: MOM Exchange Patterns [Brambilla04].

MOM offers several benefits to those requiring a FT SoA. A broker may ensure correct message delivery offering reliability and flexibility [Banavar99, Brambilla04]. Unlike REST where a client has to be aware of the location of a service, a MOM client only needs to know the location of a broker. Initially, the broker does not know the destination of a received message but has a message flow graph topology [Banavar99] that will deliver a message, adapting to failures and load. The broker may transform the message as necessary for imprecisely matching services. MOM may route messages to queues based on QoS matching [Erradi05]. All interactions in callback MOM are based on asynchronous exchanges, therefore removing arbitrary timing assumptions from a FT perspective. MOM explicitly requires the message queue as the final destination for a brokered message. Though this may be to stable storage (thus making messages survivable) a message queue is inherently a singleton component. To overcome this problem a message may be delivered to multiple destinations. When compared to REST, MOM intermediaries perform more processing on a routed message, affecting the performance of the system. The standard Web Services stack does not directly refer to MOM as a means of messaging. However, SOAP [Mitra07] and WS-Addressing [Gudgin06] both have provision for MOM in their specifications. Additionally, WS-Reliable Messaging [Davis07] supports MOM-like asynchronous messaging.

Java Messaging Service (JMS) [Hapner02] is a concrete but vendor independent implementation of MOM. It supports a common set of interfaces that allows developers to use the same API to access different MOM systems. It supports publish/subscribe and point-to-point queuing. It can support FT broadcast primitives through a virtual channel called a topic. JMS has been used in real-world systems such as the New York Stock Exchange to implement a service-oriented architecture for their trading systems [Wang03]. However, it is Java specific so does not fit into the web service stack. The key advantage of JMS is the predefined adaptors for a range of other MOM and other service related paradigms such as Narada Brokering and JXTA. Other MOMs such as Microsoft's MSMQ are even more proprietary, being limited to programs operating on the Windows platform. There are open standard examples for messaging.

Extensible Messaging and Presence Protocol or XMPP is used for instant messaging and real-time chat applications. It does not have wide acceptance because it must compete with the proprietary OSCAR, TOC and MSNP protocols.

Advanced Message Queuing Protocol or AMQP [Dattatri06] is an attempt to standardise MOM protocols. It consists of a model of the messaging capabilities and a set of abstract bindings to storage and routing components. AMQP also has a wire-protocol that describes how message producers interact with the model. The authors of AMQP counter-intuitively use the term client for the consumer. Routing by AMQP is based on criteria in the routing key and information in message headers. AMQP supports queue callback, publish-subscribe and store-forward where consumers are chosen in a round robin fashion. The AMQP specification is conceived to support the semantics and performance required by the financial service industries [Dattatri06]. Apache Qpid is an open-source implementation of AMQP that has bindings to many common programming environments including JMS but not Web Services directly.

Web Service Bus (WSBUS) [Erradi05] is a MOM framework designed to operate specifically with web services SoA; it is shown in Figure 3.5. It evolved from an earlier framework, WSMQ [Maheshwari04]. It specifically uses SOAP as the wire protocol and has ports to support JMS and MSMQ MOMs. Messages get added to the correct queues by the information held in the routing key (within the SOAP header). Uniquely for a MOM, WSBUS can use external QoS data such as monitoring to inform queue (and hence service) selection. This is the first time we see informed service matching, an important aspect of integration between FT and SoA. Additionally, WSBUS supports admission security to prevent erroneous consumers and clients.

Figure 3.5: WSBUS [Erradi05].

Narada Brokering [Fox02, Pallickara03] is a network of MOM brokers, providing an asynchronous publish-subscribe system, that overlay a "P2P like" network. The broker organisation protocol manages the addition of new brokers and oversees connections to them. This organisation allows the creation of broker network maps (BNMs). BNMs are based on DNS domains and create a hierarchy of brokers. The structure of the hierarchy generated by BNMs prevents message flooding by including IP discriminators, geographical location, cluster size and concurrent connection thresholds. Messages are wrapped within event containers that include Narada specific headers. These headers help brokers route messages to their eventual consumer queue. Unlike other MOM, Narada Brokering is focused on scalability. Message routing typically takes O(log n) thanks to the small world nature and network graph optimisation of BNMs. Every broker maintains a snapshot of the overlay network, making efficient routing decisions and thus transferring messages with the fewest hops. Another advantage of network optimisation is the pre-emptive application of FT where single points of failure are avoided. Narada Brokering supports conduits to both JMS and the JXTA P2P standard.

3.2.3

SOAP with WS-Addressing

We assume the reader is familiar with SOAP as an XML-based wire protocol for web services [Mitra07]. When SOAP is bound to HTTP through the protocol binding framework it is typically restricted to simple synchronous request-response exchanges, as seen in REST. However, the SOAP specification [Gudgin07a] includes a processing model that supports an asynchronous multi-hop messaging environment, as seen in MOM. Nodes participating in SOAP message routing may take a role designated by a URI such as http://www.w3.org/2003/05/soap-envelope/role/ultimateReceiver. Another processing action that a node may implement is the mustUnderstand attribute. This dictates that a payload can only be delivered if the node understands the header block to which the attribute is attached. This block may be a digital signature for authentication, for example. Though SOAP is more often than not bound to HTTP, the specification provides a framework for binding to other protocols such as SMTP or TCP/IP.

The problem with SOAP is the poor performance associated with XML processing for little technological gain over a REST based implementation. Though SOAP is fully capable of providing any Message Exchange Pattern, it is tied by implementations to limited RPC programming environments such as .NET. This criticism is being addressed through the SOAP extensibility mechanism, which allows new schemes such as WS-Addressing [Gudgin06] to make SOAP operate as an extensible MOM.

WS-Addressing [Gudgin06] provides a container within the SOAP header for application or middleware routing information, as shown in Figure 3.6. It uses URIs for identification and endpoint references (EPRs). An EPR allows meta-data to be associated with a URI. For example, there may be a clause telling a node to process the message at a certain time in the future. WS-Addressing allows arbitrary clauses for any application through the XML extensibility mechanism.

Figure 3.6: SOAP Message with WS-Addressing [Gudgin06]. The message carries the URIs http://example.com/6B29FC40-CA47-1067, http://example.com/business/client1, http://example.com/fabrikam/Purchasing, http://example.com/fabrikam/SubmitPO and http://example.com/6B29FC40-CA47-1066.

SOAP with WS-Addressing currently lacks a reference implementation of the kind available for JMS, but has the advantage of being more extensible and not tied to message queues. The implementation is free to provide optimisations, such as the scalability and FT seen in Narada Brokering.
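As an illustration of the header container described above, the following is a minimal sketch that builds a SOAP 1.2 envelope carrying WS-Addressing properties and reads the routing information back out. The function names build_envelope and routing_info are hypothetical, and the assignment of the example URIs to MessageID, ReplyTo, To and Action is our assumption rather than something stated in the text.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of SOAP 1.2 and WS-Addressing 1.0.
SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"
WSA_NS = "http://www.w3.org/2005/08/addressing"
ET.register_namespace("S", SOAP_NS)
ET.register_namespace("wsa", WSA_NS)


def build_envelope(message_id, to, action, reply_to, body_text=""):
    """Return a SOAP envelope whose header carries WS-Addressing properties."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    header = ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
    ET.SubElement(header, f"{{{WSA_NS}}}MessageID").text = message_id
    # ReplyTo is an endpoint reference (EPR): an address plus optional meta-data.
    reply = ET.SubElement(header, f"{{{WSA_NS}}}ReplyTo")
    ET.SubElement(reply, f"{{{WSA_NS}}}Address").text = reply_to
    ET.SubElement(header, f"{{{WSA_NS}}}To").text = to
    ET.SubElement(header, f"{{{WSA_NS}}}Action").text = action
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.text = body_text
    return envelope


def routing_info(envelope):
    """Extract the properties an intermediary would need to route the message."""
    header = envelope.find(f"{{{SOAP_NS}}}Header")
    return {
        "message_id": header.findtext(f"{{{WSA_NS}}}MessageID"),
        "to": header.findtext(f"{{{WSA_NS}}}To"),
        "action": header.findtext(f"{{{WSA_NS}}}Action"),
        "reply_to": header.findtext(f"{{{WSA_NS}}}ReplyTo/{{{WSA_NS}}}Address"),
    }


if __name__ == "__main__":
    env = build_envelope(
        message_id="http://example.com/6B29FC40-CA47-1067",
        to="http://example.com/fabrikam/Purchasing",
        action="http://example.com/fabrikam/SubmitPO",
        reply_to="http://example.com/business/client1",
    )
    print(ET.tostring(env, encoding="unicode"))
    print(routing_info(env))
```

Because the routing properties live in the header, an intermediary can forward the message without inspecting the body.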

3.2.4

WS-Reliable Messaging

WS-Reliable Messaging (WS-RM) [Davis07] is a standard from the OASIS consortium. It is designed to work in conjunction with SOAP and WS-Addressing to provide the FT abstractions of stubborn and perfect links in addition to FIFO order guarantees. The protocol provides delivery assurances for end-to-end (potentially multi-hop) unicasts. The standard works by the sender indexing outgoing messages whilst requiring the receiver to send acknowledgements for all messages. This is demonstrated in Figure 3.7, where the FIFO order of messages is maintained despite message 2 being initially lost. The receiver simply indicates that it has only received messages 1 and 3, causing the sender to retransmit message 2. Using these simple mechanisms the following guarantees can be made: at least once, at most once, exactly once, or delivery in the order the messages were sent.

WS-Addressing (and therefore SOAP) is used as the container for sequence headers.

Figure 3.7: WS-RM [Davis07].

Sequences must be started and terminated; they are identified by a unique URI and form the context for a set of FIFO indexed messages. To achieve "at least once" delivery a sender retransmits a message with index i until i is acknowledged. "At most once" delivery extends this by ensuring the receiver only delivers the message once.

WS-RM is unique amongst our messaging schemes in that it extends a basic SoA to provide FT by guaranteeing message delivery. It demonstrates one of the goals of SOAP extensibility (and WS-Addressing): to provide further delivery semantics than simply down the wire. It sits neatly in the web service stack.

There are drawbacks to WS-RM. Firstly, it does not support the broadcast primitives required for state machine replication, for example. There are no reference implementations of WS-RM, so it is left to the service developer to provide such a scheme. WS-RM provides poor performance because of the overhead of starting and terminating sequences and of the acknowledgements. This scheme is inefficient for short sequences. Additionally, it would provide little advantage when used with FT protocols such as Paxos, as they work by majority and therefore do not require perfect link abstractions.
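The indexing, acknowledgement and retransmission mechanism described above can be sketched as a small simulation. This is not the WS-RM wire format: the Receiver class, the send_sequence function and the lose_once parameter (which drops the first transmission of the listed indexes) are hypothetical names introduced only for illustration.

```python
class Receiver:
    """Acknowledges every index seen and delivers payloads once, in FIFO order."""

    def __init__(self):
        self.seen = set()      # indexes received so far (acknowledged)
        self.buffer = {}       # out-of-order payloads waiting for a gap to close
        self.delivered = []    # payloads handed to the application, in order

    def on_message(self, index, payload):
        if index not in self.seen:          # duplicates are never delivered twice
            self.seen.add(index)
            self.buffer[index] = payload
        next_index = len(self.delivered) + 1
        while next_index in self.buffer:    # release the contiguous prefix
            self.delivered.append(self.buffer.pop(next_index))
            next_index += 1
        return sorted(self.seen)            # acknowledgement of all received indexes


def send_sequence(payloads, receiver, lose_once=frozenset(), max_rounds=10):
    """Retransmit each indexed message until it is acknowledged.

    Returns the list of indexes in the order they were put on the wire.
    """
    unacked = {i + 1: p for i, p in enumerate(payloads)}   # indexes start at 1
    to_lose = set(lose_once)
    transmissions = []
    for _ in range(max_rounds):
        if not unacked:
            break
        acked = set()
        for index in sorted(unacked):
            transmissions.append(index)
            if index in to_lose:            # simulate a message lost in transit
                to_lose.discard(index)
                continue
            acked.update(receiver.on_message(index, unacked[index]))
        for index in acked:
            unacked.pop(index, None)
    return transmissions


if __name__ == "__main__":
    receiver = Receiver()
    print(send_sequence(["a", "b", "c"], receiver, lose_once={2}))
    print(receiver.delivered)
```

With message 2 lost once, the receiver acknowledges 1 and 3, the sender retransmits 2, and delivery to the application still follows the order of sending.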

3.2.5

Discussion

We present the review of messaging schemes in Table 3.1. The advantages and disadvantages are given in relation to using each scheme with FT and SoC. To support our objective of adapting a SoA to support FT with SoC, we need to cherry-pick the following advantages: asynchronous messaging from MOM (the FT process model); QoS based selection from WSBUS; and the lack of message queues from SOAP/WS-Addressing. Counter-intuitively, we do not need FT policy enforcement (as provided by WS-RM), since no assumptions about FT should be made if we are to support the widest range of FT protocols. This supports our objective of increasing the range of FT protocols available to SoC. Finally, we need the performance, scalability and distribution of Narada Brokering to support our objective of addressing the lack of decentralisation in current frameworks.

Table 3.1: Service Messaging Review.

Scheme | Messaging Pattern | Advantages | Disadvantages
REST | Synchronous (RPC). | Performance, scalability and simplicity. | HTTP blocking, lack of state handling, FT in application.
MOM | Asynchronous, polling, callback and P2P. | Adaptation, reliability and flexibility of in-line brokering. Conforms to FT process model. | Performance compared to REST, message queues, singleton components, no WS- standard.
JMS | As MOM. | Has reference implementation. Adapters to other MOMs. | Java specific, no WS- standard.
MSMQ | As MOM. | Real world implementations. | Proprietary, Windows specific.
AMQP | As MOM. Store forward. | Has reference implementation. Standardised. | Web Services not supported.
WSBUS | As MOM. | QoS based selection. Supports SOAP, MSMQ and JMS. | No WS- standard.
Narada Brokering | As MOM. | Performance and scalability O(log n). Supports SOAP, JMS and JXTA. | No WS- standard.
SOAP/WS-Addressing | (A)Synchronous. | WS- standards, no message queues, conforms to FT process model. | No reference implementation.
WS-RM | Asynchronous. | WS- standards. Enforces FT policy. | Not compatible with FT protocols such as Paxos.

The "Adapts SoA?" column of the table marks seven of the nine schemes.

3.3

Service Discovery

Service discovery is the interaction between the consumer and broker to match suitable services. Service publishing is usually included in a discovery scheme because of the relationship between the two interactions. The following is a review of the discovery schemes currently available for web services.

3.3.1

Universal Description, Discovery and Integration

Universal Description, Discovery and Integration or UDDI [Clement04] is a standard for web service discovery supported by the OASIS consortium. The anatomy of a UDDI repository is shown in Figure 3.8.

Figure 3.8: Anatomy of UDDI.

The data model is how UDDI organises its information. It consists of the following. White pages provide semi-structured indexes for humans to find businesses and services. A structured taxonomy of business and service classification, primarily for machine interactions, provides the yellow pages. Extensibility allows schemes such as the North American Industry Classification System (NAICS). Green pages are where the technical models containing WSDL descriptions are held.

Coloured pages represent how the data is conceptually accessed. The data is stored as XML representing real-world and software entities. All entities have unique URI keys generated to globally identify them within the UDDI repository. Every entity has a bag of cross-referencing keys such as the D&B DUNS number. The entities are as follows. A business entity is white pages information about the business. Logically grouped services are represented by a service entity, which references a business entity key to enforce a many-to-one relationship between the two. A binding template schema maps service entities to concrete implementation details in the green pages. Abstract representations of a service formed from WSDL are called tModels. They form the technical fingerprint of a binding template schema. Finally, a publisher assertion defines a reciprocal relationship between businesses, allowing search by association.

SOAP based services are used to access the UDDI data model. The publication service allows entities to be added, updated and removed through save_xx() and corresponding delete operations. Security only allows the publisher to change an entity, though the publisher itself may change through the custody transfer service. The UDDI inquiry service provides service discovery to consumers. The browse pattern allows a consumer to get lists of entity keys matching search terms. A drill-down pattern can use the entity keys to get binding templates and WSDL. By getting a binding template a consumer can perform the invocation pattern. This would be used when a service with a cached template is inaccessible. UDDI stops short of the broker pattern, where it would invoke a service on behalf of a consumer, preferring synchronous message exchanges.

A primary problem with UDDI is that it tries to address the gap between human based and autonomous service discovery. The white pages provided by UDDI are intended for humans to locate businesses and their services. These are more complex than using a simple directory web site such as http://www.xmethods.net/. However, in practice, the yellow pages are insufficient for machine based discovery of arbitrary services because computers cannot yet fully recognise semantics [Alonso02, Garofalakis04, Makris06, Shirky02]. Shirky [Shirky02] calls this problem the semantic horizon, and it applies to all discovery schemes. True semantic based discovery is a future aspiration for SoA, though we later review frameworks that apply semantics to QoS based matching.

UDDI does support service entities, which are uniquely identified by URI. These represent logically grouped services based on their interface. By locating these abstract services in the yellow pages it is possible to discover concrete implementations through the binding template schema. This follows the pattern known as the well known service (or interface) [Alonso02]. Such a pattern allows abstract services to be bound at design time while the implementations are bound autonomously at runtime. In effect this is true late-binding. UDDI does not enforce that logically grouped services provide the same interface, but the scheme can be extended so that this is true. The well known service pattern allows the FT SoA to side-step the overarching question of semantic based discovery.

UDDI can be critiqued further with regard to its application in FT with SoA. Like the REST messaging pattern, discussed in section 3.2.1, UDDI is limited to synchronous message patterns and thus cannot be used as an intermediary broker as is seen in MOM schemes. Finally, UDDI is designed as a monolithic repository without distribution. In addition to being a performance bottleneck [Li04], its failure presents a problem to service based applications that depend on it. Different repositories can share data through UDDI cloud service interfaces, but the decentralisation is not systematic. Despite the problems with UDDI, version 3 is now an accepted standard and is included in the WS-I profiles.
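The browse and drill-down patterns, and the late binding that the well known service pattern permits, can be sketched with a small in-memory model. This is not the UDDI API: Registry, ServiceEntity, BindingTemplate and bind are hypothetical names, the keys and URLs are invented, and choosing the first binding template is an arbitrary assumption made to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class BindingTemplate:
    """Concrete implementation detail (green pages) for one service entity."""
    access_point: str   # URL at which the implementation can be invoked
    tmodel_key: str     # key of the tModel (interface fingerprint) it implements


@dataclass
class ServiceEntity:
    """Logical (abstract) service, owned by exactly one business entity."""
    key: str
    name: str
    business_key: str
    bindings: list = field(default_factory=list)


class Registry:
    """Hypothetical in-memory stand-in for a UDDI repository."""

    def __init__(self):
        self._services = {}

    def save_service(self, service):
        self._services[service.key] = service

    def browse(self, term):
        """Browse pattern: keys of service entities whose name contains term."""
        term = term.lower()
        return sorted(k for k, s in self._services.items() if term in s.name.lower())

    def drill_down(self, service_key):
        """Drill-down pattern: the binding templates of one service entity."""
        return list(self._services[service_key].bindings)


def bind(registry, well_known_key):
    """Late binding: the abstract key is fixed at design time, while the
    implementation is picked at runtime (here simply the first template)."""
    templates = registry.drill_down(well_known_key)
    if not templates:
        raise LookupError("no implementation published for " + well_known_key)
    return templates[0].access_point


if __name__ == "__main__":
    registry = Registry()
    registry.save_service(ServiceEntity(
        key="uddi:example:quote",
        name="Stock Quote",
        business_key="uddi:example:acme",
        bindings=[BindingTemplate("http://example.org/quote/a", "uddi:example:tmodel:quote")],
    ))
    print(registry.browse("quote"))
    print(bind(registry, "uddi:example:quote"))
```

The consumer only ever hard-codes the abstract service key, so a replacement implementation can be published without changing the consumer.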

3.3.2

Alternative Standards for Discovery

WS-Discovery is a technical specification supported by the OASIS consortium that is used to find web services over the local network. The process consists of the consumer multicasting a SOAP message with an LDAP style query indicating what is being searched for, and WS-Addressing identifiers to route and correlate responses. Listening nodes with matches send a SOAP response back with the WS-Addressing relates-to set to the identifier of the original probe. The obvious problem with this approach is the limited scope of requests (IP multicast generally goes no further than the local network). However, the broadcast pattern would be a useful one if it could be applied to an inter-institutional scenario.

Web Service Inspection Language (WSIL) [Ballinger01] is an early proposal for discovery. WSIL takes a bottom-up approach of aggregating references to services into one document that is polled by consumers to locate services. The advantage of WSIL is that service descriptions can be aggregated along with more abstract entities such as UDDI service entity keys and even other WSDL documents. WSIL does not provide the structured access to arbitrary services that UDDI or even WS-Discovery provides, but it is useful for collating transient runtime service information in paradigms such as Grid. The web counterparts of WSIL are Rich Site Summary (RSS) and Atom, which are used for aggregating web pages or even REST based services.

WSBUS [Erradi05] is primarily a MOM but it also provides service discovery. Unlike UDDI, a MOM provides the broker pattern, where discovery is performed in-line with the consumer blindly binding to the provider. This approach is optimal in an asynchronous messaging environment. The routing is based on the routing key, a named index (URI) that enables the broker to identify the message. UDDI provides indexes for locating services but not messages; in fact UDDI never makes the assumption that the message itself can identify potential services. The second contribution from WSBUS is the use of QoS based metrics to provide informed service discovery. The authors [Erradi05] do not provide implementation details on the QoS matching except to say that the metrics come from a QoS monitoring console.
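To make the in-line broker pattern concrete, the following is a minimal sketch in which a broker selects a provider from the routing key carried by the message and a monitored QoS score. Since the QoS matching of WSBUS is not documented, the single numeric score and the "highest score wins" rule are purely our assumptions, and Broker, register, report_qos and route are hypothetical names.

```python
class Broker:
    """In-line broker: the consumer sends a message, the broker picks the provider."""

    def __init__(self):
        self._providers = {}   # routing key -> {provider name: handler}
        self._scores = {}      # provider name -> latest monitored QoS score

    def register(self, routing_key, name, handler):
        self._providers.setdefault(routing_key, {})[name] = handler
        self._scores.setdefault(name, 0.0)

    def report_qos(self, name, score):
        """Called by a (hypothetical) monitoring component."""
        self._scores[name] = score

    def route(self, routing_key, message):
        candidates = self._providers.get(routing_key)
        if not candidates:
            raise LookupError("no provider for routing key " + routing_key)
        # Highest score wins; sorting the names first makes ties deterministic.
        best = max(sorted(candidates), key=lambda name: self._scores[name])
        return best, candidates[best](message)


if __name__ == "__main__":
    broker = Broker()
    broker.register("urn:example:quote", "provider-a", lambda m: "A:" + m)
    broker.register("urn:example:quote", "provider-b", lambda m: "B:" + m)
    broker.report_qos("provider-a", 0.91)
    broker.report_qos("provider-b", 0.97)
    print(broker.route("urn:example:quote", "IBM"))
```

The consumer never learns which provider served the request unless the broker tells it, which is what "blindly binding" amounts to in this sketch.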

3.3.3

Semantic based Discovery

UDDI uses boolean keyword matching to locate appropriate services [Garofalakis04]. This is far too imprecise for autonomous discovery. Machine based discovery can only be achieved with service descriptions grounded in ontological descriptions. All research in this area is focused on the Web Ontology Language for Services (OWL-S) [Bechhofer04], supported by the W3C consortium. OWL is an XML/class based ontological language that defines taxonomies of definitions. OWL-S joins service definitions (OWL-S process model) to a fuller semantic description of the domain (OWL-S service profile). This is joined to a WSDL description using groundings. OWL-S services must be atomic but are then combined by the process model to form composite services. The interrelations between the different aspects of OWL-S are shown in Figure 3.9.

Figure 3.9: Service Description with OWL-S.

Matching Engine [Garofalakis04] is a system that matches web services using descriptions written in DAML-S (the forerunner to OWL-S). The system requires that requirements and advertisements are sufficiently similar; this is of course an ambiguous match. The algorithm determines a match as a relative score by recursively decomposing the description into its ontological parts. Because these scores are ambiguous in a global sense they can only be used to rank matches for a defined set of services. This approach integrates with UDDI by adding service attributes to a given tModel. Unfortunately, tModels can only match on keywords and indexes, so the approach provides a parallel site where the DAML-S descriptions are hosted; this site also hosts the matchmaker service that provides the matching engine algorithm. Overhage [Makris06] is similar to the matching engine but extends UDDI by adding blue pages to associate OWL-S with service entities rather than tModels. METEOR-S [Patil04] is a matching algorithm that converts OWL-S, WSDL and any other XML Schema documents into an intermediate format known as SchemaGraph. The actual matching can use different techniques from corpus linguistics, including WordNet, the Porter stemmer and n-gram matching, to provide more meaningful scores.

There are problems with semantic matching techniques with regard to FT and SoA. Semantic matching is an expensive operation; matching one request against several candidate services will cause an undesirable delay. This assertion is reflected by the fact that all the matching engines provide a human rather than machine interface [Garofalakis04, Makris06, Patil04]. The matching provided by these schemes is approximate rather than definitive, therefore agents are unable to use them to match services based on functional (application) requirements. However, these schemes are useful in a smaller, well-defined domain, namely Quality of Service (QoS).

QoSOnt [Dobson05, Dobson07] is an ontology described in OWL for QoS attributes such as reliability and performance. These attributes are measured in a way described by metrics. The ontology has a relatively complete set of domain independent semantics for dependability concepts. Service QoS Requirements Matcher or SQRM [Dobson05] is a tool that performs service differentiation using QoSOnt. Clients insert their own requirements expressed in QoSOnt, for example MTTF ≥ 10 days and MeanAvailability > 98%. A Java based user interface allows a human to see the requirements as a logical tree. These ontologies are stored in a repository independent of UDDI. Discovery starts as a keyword search of the repository that finds a set α of candidate services. This process assumes the services found conform to the idea of a well known service. The first round matches the QoSOnt requirements with the descriptions of the services in α, finding an intersection of attributes and metrics. A second round rates the intersection of requirements and descriptions. It orders α into a set of pairs (service, rating) based on the strength of the matches. SQRM, like other semantic matching schemes, offers poor performance when used in a SoA, so it is limited to being a design-time tool for human inspection.

A QoS certification authority is proposed by [Ran03] to enhance the SoA. It introduces a quality information entity to UDDI that sits adjacent to the binding template schema for a service. This contains QoS statements such as availability = 0.91. A query of the UDDI registry fetches the QoS information with the WSDL. The QoS statements are accompanied by a signature that is used to verify the statement with the authority services. This scheme is novel in that it uses certification to enable the consumer to trust that the QoS information is correct and from a reputable source. No mention is made of how the QoS information is obtained, but we assume it comes from service monitoring.

QoWS [Makris06] is a novel approach that proposes service discovery based on monitoring relative QoS metrics. Service hosts s_i are assessed by network distance d_i from the assessor and the number of services provided f_i. Candidate servers are stored in a set Γ where s_i ∈ Γ. Using contour selection based on Cartesian coordinates (−d_i, f_i), some servers are pruned from Γ. QoS measurements are taken for servers in Γ by an agent that monitors all transactions between the assessor and those servers. A consumer may query a local or remote agent for QoS ratings associated with candidate services. QoWS provides live QoS information, unlike [Dobson05, Erradi05, Ran03].
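The two-round matching described for SQRM can be illustrated with a small sketch. It does not use QoSOnt or OWL: requirements and descriptions are plain dictionaries, the metric names are invented, and the rating (the fraction of requirements a service satisfies) is our own stand-in for the strength of a match, which the text does not define.

```python
import operator

# Comparison operators a requirement may use.
OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


def match(requirements, candidates):
    """Rank candidate services against QoS requirements.

    requirements: metric name -> (operator symbol, threshold)
    candidates:   service name -> {metric name: advertised value}
    Returns (service, rating) pairs, strongest match first.
    """
    ranked = []
    for service, description in candidates.items():
        # Round 1: intersect the required metrics with the advertised ones.
        shared = requirements.keys() & description.keys()
        if not shared:
            continue
        # Round 2: rate the intersection (assumed rating: fraction satisfied).
        satisfied = sum(
            1 for metric in shared
            if OPS[requirements[metric][0]](description[metric], requirements[metric][1])
        )
        ranked.append((service, satisfied / len(requirements)))
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


if __name__ == "__main__":
    requirements = {"MTTF_days": (">=", 10), "MeanAvailability": (">", 0.98)}
    candidates = {
        "A": {"MTTF_days": 12, "MeanAvailability": 0.99},
        "B": {"MTTF_days": 8, "MeanAvailability": 0.995},
        "C": {"Latency_ms": 20},
    }
    print(match(requirements, candidates))
```

Restricting the match to a handful of numeric QoS metrics is what keeps this kind of comparison cheap compared with matching full functional descriptions.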

3.3.4

Distributed Discovery

Apart from the semantics of service matching, the other concern is the potential single point of failure that occurs with UDDI and other initiatives. An alternative is to build decentralised repositories that are optimised for small-worldliness, as we have already seen in Narada Brokering [Pallickara03]. To ensure that a distributed lookup has acceptable performance it needs to scale as O(log n), where n is the number of services; notably, UDDI cloud services do not scale [Garofalakis04]. The general approach is to build the repository on P2P protocols, which are surveyed in section 3.5. Peer-to-peer Web Service Discovery (PWSD) [Li04] is based on the Chord P2P protocol.


Service description keywords are hashed to provide keys to index services in a distributed hash table. When consumers want to locate services, their requirement keywords are hashed in an attempt to perform a lookup of matching services. Like Chord, PWSD performs lookups in O(log n), thus the scheme is fully scalable. To further refine service discovery, PWSD performs a simple node splitting method to form an NVTree data structure from the WSDL, from which the key/service pairs are drawn. NIPPERS [Makris05] uses a different P2P overlay topology, based on the static interpolation search tree, to provide the same distributed hashing as PWSD but with lookups taking only O(log(log n)).

The advantage of these P2P based service discovery schemes is the true decentralisation they provide. When service discovery data is replicated between adjacent nodes, as can be achieved in most distributed hashing schemes, the discovery mechanism itself becomes immune to single points of failure. This approach has been adopted by Resilient Web Service Infrastructure (RWSI) [Norcross05] to ensure that services (in this case mobile implementations) are replicated on adjacent nodes, so the failure of one node means the service is ported to the next. RWSI is reviewed as a FT framework for SoC in chapter 4.

Using keywords to create indexes is a less precise means of matching services than using semantics. HyperCuP [Schlosser02] proposes distributing semantic repositories over a P2P network. The P2P topology used is the hypercube, an approach similar to CAN [Ratnasamy01]. An ontology is broken along its intersections to generate a set of concept coordinates. Nodes, services and queries all have ontological representations that generate coordinates. The network is kept stable by a closeness matching of ontological information. Queries are broken down into a graph of mini-terms, which are then used to match nodes. The strength of this approach is that the search terms are themselves routing principles.
The limitation of HyperCuP is the use of its own proprietary ontological representations. Edutella/JXTA [Qu04] is a scheme to integrate the industry standard JXTA P2P platform with the RDF based Edutella, a subset of the OWL language. The scheme works by integrating the Edutella search service into the set of JXTA core services. All appropriate JXTA services are endowed with RDF based descriptions, which the search service uses to locate them based on queries composed in RDF-QEL. Effectively, JXTA services are discovered using Edutella ontologies. The scheme is scalable because of its JXTA foundations. Edutella/JXTA does not directly integrate with Web services but the authors propose a series of service adaptors to ensure compatibility. Like the centralised semantic approaches [Garofalakis04, Makris06, Patil04], HyperCuP and Edutella do not provide rich enough semantic information to address the semantic horizon [Shirky02].
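The keyword hashing that underlies a scheme such as PWSD can be sketched as follows. This is a simplification under stated assumptions: the ring is held in one sorted list, so the owner of a key is computed directly rather than by Chord's O(log n) finger-table routing, and neither the NVTree refinement nor replication is modelled. KeywordIndex and sha1_key are hypothetical names and the addresses and URLs are invented.

```python
import hashlib
from bisect import bisect_left


def sha1_key(text):
    """Map a string onto the keyspace by hashing it with SHA-1."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


class KeywordIndex:
    """Keyword -> service index partitioned over peers on a hash ring."""

    def __init__(self, peer_addresses):
        # Peers are placed on the ring by hashing their addresses.
        self._ring = sorted((sha1_key(a), a) for a in peer_addresses)
        self._keys = [k for k, _ in self._ring]
        self._tables = {a: {} for a in peer_addresses}   # per-peer local table

    def owner(self, key):
        """The first peer clockwise from the key (wrapping round the ring)."""
        position = bisect_left(self._keys, key) % len(self._ring)
        return self._ring[position][1]

    def publish(self, service_url, keywords):
        for word in keywords:
            key = sha1_key(word.lower())
            self._tables[self.owner(key)].setdefault(key, set()).add(service_url)

    def lookup(self, keyword):
        key = sha1_key(keyword.lower())
        return set(self._tables[self.owner(key)].get(key, ()))


if __name__ == "__main__":
    index = KeywordIndex(["10.0.0.%d" % i for i in range(1, 6)])
    index.publish("http://example.org/quote", ["stock", "quote"])
    index.publish("http://example.org/fx", ["currency", "quote"])
    print(sorted(index.lookup("quote")))
```

Only exact keywords hash to the same key, which illustrates why such indexes match less precisely than semantic descriptions.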

3.3.5

Discussion

The review of service discovery is summarised in Table 3.2. To support our objective of making SoA less orthogonal to FT we require that service discovery is autonomous at runtime. It is impractical to use semantic based solutions for matching services because of the lack of rich semantic descriptions for services. Keyword based solutions such as PWSD are too inaccurate for an autonomous SoA. The best solution is a combination of indexed abstract services from UDDI and indexed messages from WSBUS. This follows the pattern of the well known service [Alonso02], where service/message matching is designated at design time. To address the objective of providing QoS based differentiation between FT services we need to adopt the QoS matching semantics of SQRM. Since QoS is a limited domain it is safe to apply the current approximate matching of semantic discovery. To address performance the QoS definitions need to be limited. It would be advantageous to FT if QoS information were live, as proposed by QoWS. There is a clear performance advantage to using the broker pattern from WSBUS to perform service discovery since it generally reduces the steps needed for a consumer to invoke a service. To support our objective of decentralising the FT framework, combining service discovery with P2P protocols is clearly advantageous.

Table 3.2: Service Discovery Review.

Scheme | Class | Advantages | Disadvantages
UDDI | Default | WS- standard. Indexed abstract services. | Centralised, synchronous, poor human and machine support.
WS-Discovery | Multicast | WS- standard, simplicity, performance. | Limited to local network (enterprise) hence not scalable.
WSIL | Aggregation | Simplicity. | Unstructured discovery. No longer supported.
WSBUS | MOM | Broker pattern, indexed messages, QoS based selection. | No WS- standard. Message queues.
Matching Engine | Semantic | Semantic matching. UDDI integration. | Semantic horizons. Approximate matching. Poor performance.
Overhage | Semantic | UDDI integration (blue pages). | As Matching Engine.
METEOR-S | Semantic | Multi-format matching. Improved matching. | As Matching Engine.
SQRM | Semantic/QoS | Semantic horizon limited to QoS. | Poor performance.
QoWS | QoS Monitoring | Provides monitoring to provide live QoS information. |
PWSD | Distributed/Keyword | Scalable, resilient. | Imprecise service matching.
NIPPERS | Distributed/Keyword | Double scalable, resilient. | Imprecise service matching.
RWSI | Distributed/Index | Resilient mobile services. |
HyperCuP | Distributed/Semantic | | Semantic horizons. Non-standards based ontology representation.
Edutella/JXTA | Distributed/Semantic | Standards based ontology. | Poor integration with web services SoA.

3.4

Service Composition

Composition is a representation of a system consisting of atomic services combined to achieve a higher purpose [Peltz03]. Abstract composition is called choreography; it describes the message exchanges required to achieve a goal state. Web Service Choreography Interface or WSCI [Arkin02], supported by W3C, defines relationships in terms of WSDL operations. It also describes transactional boundaries, exception handling, threads, operational context and dynamic participation.

Orchestration is a concrete composition that creates a new service or business process. This interacts with services to achieve the goal state. This approach has subsumed choreography because it is simpler to coordinate from one process. There are many orchestration proposals including XLANG, WSFL, PDL, XPDL, BPSS, BPML, WSEL, ecXML, WS-Coordination, WS-CAF and WS-BPEL [Peltz03].

WS Business Process Execution Language or WS-BPEL [Alves07] has become the standard for orchestration. It is described in XML and references a set of WSDL descriptions. Compositions can be abstract or concrete by asserting partner relationships with services. When concrete, the partner link defines the possibly asynchronous connections between the business process and other services. A hosting environment such as Apache Axis allows execution of a concrete business process.

Figure 3.10: WS-BPEL

Activities are actions taken by the business process. Partner activities send or receive messages using the invoke and reply primitives. WS-BPEL may contain other processes within the workflow and store state in the form of instantiated XSD data-types. A process is instantiated when a receive activity, with an instance attribute, is called. Different processes on the same host may interact. Control activities coordinate other activities. Sequence makes nested elements activate sequentially, flow concurrently. Threads are only terminated with a boolean join condition to allow them to correlate chosen data. When a join condition evaluates to false a fault is raised. Dead-path elimination allows completed threads to be abandoned. Link activities are used to synchronise points in a concurrent flow. One thread will block until another has reached the corresponding link. Other activities are more like those of traditional programming languages, including variable assignment, if and while. All events are triggered by metronomes, receive activities or variable state changes. A scope activity supports WS-Transactions. It allows state changes to be committed or rolled back. Lastly, WS-BPEL includes a fault handling mechanism with compensation activities triggered by faults raised by the business process or other partners.

Peer-to-Peer

Chapter 3:


There is a clear overlap between a composition language such as WS-BPEL and our research objective of adapting SoA to make it less orthogonal to FT. FT services are themselves compositions of application services. WS-BPEL supports our requirement for asynchrony with the FT process model. There are a few problems with WS-BPEL with regard to FT. Firstly, it only supports UDDI as a service discovery medium, with little integration available for QoS solutions. Next, WS-BPEL offers no support for distributed protocols such as P2P, making our decentralising objective difficult. WS-BPEL does not scale well [Milanovic04]. With respect to our objective of simplifying the deployment of FT protocols, WS-BPEL code is difficult to program and hard to read [Milanovic04]. Supporting a pluggable system of FT protocols would require new definitions within BPEL. It is difficult to extract an abstract representation for the purpose of verification. More abstract representations such as π-calculus and Petri nets offer more verifiable constructs.

3.5

Peer-to-Peer

The Peer-to-Peer (P2P) paradigm is distinct from service oriented computing. It describes ways of overlaying physical networks and the Internet to allow survivable and scalable dissemination and messaging. The inclusion of P2P in this thesis is because of the overlap with SOC, in particular with asynchronous messaging patterns. Several of the frameworks reviewed in Chapter 4 are made scalable by overlaying them over P2P. Early P2P initiatives were related to le sharing started by the inception of Napster. However, the messages of these systems have MOM like properties. They are asynchronous and containing routing semantics such as

Time to Live

or TTL, the number of peers a message can hop. Asyn-

chronous messages are possible because peers act like both clients and services. Peers, or nodes (we use the terms interchangeably), are given equality within a decentralised network. Flooding

is a simple approach used by networks such as Gnutella to search for les.

It works as shown in Figure 3.11. It is pure P2P because all peers participate equally. 84

Peer-to-Peer

Chapter 3:


Figure 3.11: P2P Messaging (Flood).

Gnutella finds a set of matches in O(n) [Chawathe03] but, unlike the protocols in Chapter 2, n is arbitrarily large. It was of the order of two million peers as of 2005. Gnutella proves flooding offers excellent survivability but with poor scalability and performance. Messages are prevented from propagating too far by a TTL header.

Random walking [Chawathe03] propagates search messages to a random selection of peers. Messages may get held up at heavily loaded nodes, damaging response times. It is possible that the message does not reach peers with matches.

Gnutella2 and Gia optimise their network by having hubs, high capacity peers, that are the sole connection points for leaf peers. A query is propagated by a peer to a set of hubs. Biased random walk targets the most linked and highest capacity hubs, reducing the number of hops a message needs to take and maximising search results. These approaches improve scalability because performance is decoupled from the network size.

Topology adaptation chooses hubs to send a search to based on their capacity. Each peer maintains a cache Γ of (IP, port, capacity) entries pertaining to known hubs. These are sourced from web lists and exchanges with other peers. The peer also keeps a cache Ψ of hubs that search messages will get propagated to. Periodically, the peer compares every p such that p ∈ Γ ∧ p ∉ Ψ with the entries of Ψ. It ensures that the highest capacity hubs are in Ψ. It continues until the optimal topology is reached.
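One round of the comparison between the two hub caches might look like the sketch below. It is a simplification under stated assumptions: the function name adapt_topology, the fixed size of Ψ and the tuple layout (ip, port, capacity) are illustrative choices, and real implementations also weigh factors such as hub degree that are not modelled here.

```python
def adapt_topology(known, active, size):
    """Return the `size` highest-capacity hubs drawn from both caches.

    known  -- cache of candidate hubs (the role played by Γ), as (ip, port, capacity)
    active -- hubs currently used for searches (the role played by Ψ)
    """
    # Key on (ip, port) so a hub present in both caches is only counted once.
    candidates = {hub[:2]: hub for hub in list(known) + list(active)}
    ranked = sorted(candidates.values(), key=lambda hub: hub[2], reverse=True)
    return ranked[:size]

# Hypothetical caches: 10.0.0.2 and 10.0.0.3 should displace the weaker active hubs.
KNOWN = [("10.0.0.1", 6346, 10), ("10.0.0.2", 6346, 100), ("10.0.0.3", 6346, 50)]
ACTIVE = [("10.0.0.1", 6346, 10), ("10.0.0.4", 6346, 20)]
```

Repeating this step whenever new hubs are learnt moves the active set towards the highest capacity hubs known to the peer.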



One-hop replication makes indexes of content in addition to peers. Gnutella2 calls this the query routing table or QRT. Searches are terminated at hubs rather than leaf peers. At start up a peer shares its QRT with any hubs it connects to. Connected hubs are notified of subsequent changes to the QRT. Finally, flow-control ensures a peer can only propagate a search to hub η if that hub has been given a token. Tokens are distributed with a start time fair queuing scheme.

3.5.1 Distributed Hash Table

The Distributed Hash Table or DHT [Balakrishnan03, Stoica03, Wiley03] is a distributed way of storing information with scalable lookups. All DHT peers provide the primitives lookup(name) and store(name, value). Instead of keywords, data is indexed by keys created by hashing, typically SHA1 [Stoica03]. All data resides in a keyspace, as do peers, whose keys are produced by hashing their IP addresses. Most DHT protocols represent the keyspace as a ring topology like Figure 3.12. DHT relies on a distance function between two keys. A simple DHT is shown in Algorithm 8. Its inefficiency derives from the fact that a search goes from peer to peer round the ring. It may traverse the whole ring, meaning the search takes O(n). This can be improved.
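The successor-by-successor search of the simple DHT can be sketched in a few lines. This is an in-process illustration, not a networked implementation: the Ring class, the peer addresses and the hop counter are assumptions introduced here, and message passing over fair loss links is replaced by a loop. The distance function follows the one given in Algorithm 8.

```python
import hashlib

M = 160                  # SHA-1 produces 160-bit keys
RING = 2 ** M

def key_of(name):
    """Hash a name (or peer address) onto the ring with SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def distance(key1, key2):
    """Clockwise distance from key1 to key2."""
    return key2 - key1 if key1 <= key2 else RING + (key2 - key1)

class Ring:
    def __init__(self, addresses):
        self.ids = sorted(key_of(a) for a in addresses)
        self.tables = {peer_id: {} for peer_id in self.ids}   # local hash tables

    def successor(self, peer_id):
        return self.ids[(self.ids.index(peer_id) + 1) % len(self.ids)]

    def search(self, start, key, value=None):
        """Forward while the successor is closer to the key; return (result, hops)."""
        peer, hops = start, 0
        while distance(peer, key) > distance(self.successor(peer), key):
            peer, hops = self.successor(peer), hops + 1
        if value is None:                       # lookup
            return self.tables[peer].get(key), hops
        self.tables[peer][key] = value          # store
        return value, hops

    def store(self, name, value):
        return self.search(self.ids[0], key_of(name), value)

    def lookup(self, name, start=None):
        return self.search(self.ids[0] if start is None else start, key_of(name))

# Hypothetical eight-peer ring holding one item.
ring = Ring(["10.0.0.%d" % i for i in range(1, 9)])
ring.store("report.pdf", "contents")
```

Looking the item up from each of the eight peers in turn takes between 0 and 7 hops, which is the linear cost in the size of the ring noted above.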

Figure 3.12: Chord Ring Topology [Stoica03].

Algorithm 8 DHT Lookup [Wiley03].
  Distributed Hash Table (DHT)
  Fair Loss Links (FLL)
  Hashing algorithm (SHA1)

  Initialisation:
    p_self.id = SHA1.hash(myIPAddress)
    hash_local = ∅
    p_successor = // Defined by stabilisation algorithm

  function distance(key1, key2) // Based on Chord
    if key1 ≤ key2 then
      return key2 − key1
    else
      return 2^m + (key2 − key1)
    end if

  function search(key, p_originator, value)
    if distance(p_self.id, key) > distance(p_successor.id, key) then
      FLL.send(SEARCH, p_successor, key, p_originator, value)
    else
      if value = ⊥ then
        FLL.send(FOUND, p_originator, key, hash_local[key])
      else
        hash_local[key] = value
      end if
    end if

  Upon DHT.lookup(name)
    search(SHA1.hash(name), p_self, ⊥)

  Upon DHT.store(name, value)
    search(SHA1.hash(name), p_self, value)

  Upon FLL.deliver(SEARCH, key, p_originator, value)
    search(key, p_originator, value)

  Upon FLL.deliver(FOUND, key, value)
    DHT.found(key, value)

Chord [Balakrishnan03, Stoica03] improves the DHT by providing a more efficient lookup scheme and ring stabilisation algorithm. This algorithm allows peers to join, leave and fail. Chord maintains a finger table, as shown in Figure 3.12, that keeps the key and address of peers 1/2 way, 1/4 way, 1/8 way, and so on in powers of two, round the ring. When searching for a peer indexed by k, the highest key in the finger table that does not exceed k indexes a peer that is at least halfway along the remaining keyspace towards k. This divide and conquer ensures the O(log n) lookup. The graph of nodes forms the overlay network. Stabilisation maintains the Chord ring in a peer's successor list and finger table. The algorithm relies on the find successor function. A peer joins by asking a member of the ring to find its successor. This successor sets its predecessor as the incoming peer and vice-versa for the predecessor. Operations between peers are reflexive. The finger table is then populated by a routine that performs a peer walk.

Algorithm 9 Chord Search [Stoica03, Wiley03].
  Chord extends DHT
  Fair Loss Links (FLL)

  function search(key, p_originator, value)
    dispatched? = false
    for i = m down to 1 do
      if finger[i].key ≤ key then
        FLL.send(SEARCH, p_successor, key, p_originator, value)
        dispatched? = true
      end if
    end for
    if ¬dispatched? then
      if value = ⊥ then
        FLL.send(FOUND, p_originator, key, hash_local[key])
      else
        hash_local[key] = value
      end if
    end if

Pastry [Rowstron01] chooses random keys for peers to ensure geographic diversity. The overlay consists of a leaf table with half the closest peers having a higher key and the other half having a lower key. A routing table consists of log_{2^b} n peer keys where b is a tuning parameter. The i-th key shares its first i bits with the owner's key. When a peer receives a lookup request for k_s it attempts to match it in the leaf table. Failing that, it finds the entry in the routing table with the longest shared bit prefix with k_s. For example 11010001 and 1101111 have a symmetrical distance of four because they share the 4-bit prefix 1101. This distance function is called tree-like routing because of the structure of the prefixes. It guarantees lookup in log_{2^b} n.
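The prefix matching behind tree-like routing can be shown with a small sketch. The helper names and the example routing table are hypothetical; keys are written as bit strings for readability, and the leaf table and the tuning parameter b are left out.

```python
def shared_prefix_len(key_a, key_b):
    """Number of leading bits the two bit strings have in common."""
    count = 0
    for bit_a, bit_b in zip(key_a, key_b):
        if bit_a != bit_b:
            break
        count += 1
    return count

def next_hop(own_key, routing_table, target):
    """Pick the routing-table key sharing the longest prefix with `target`.

    Returns None when no entry improves on the peer's own shared prefix.
    """
    best = max(routing_table, key=lambda k: shared_prefix_len(k, target), default=None)
    if best is None:
        return None
    if shared_prefix_len(best, target) <= shared_prefix_len(own_key, target):
        return None
    return best

# Hypothetical routing table of a peer whose own key is 00000000.
TABLE = ["10000000", "11000000", "11010000"]
```

Each hop chosen this way lengthens the prefix shared with the target, which is why the number of hops is bounded by the key length rather than by the number of peers.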

Tapestry [Zhao01] is a similar scheme to Pastry. Leaf tables are required because the distance function can match k entries. A network-proximity metric can be used to prune the k entries to enable proximity routing. Lastly, the distance function is symmetrical so when a peer receives a query from an unknown peer it can add it to its routing table. This means no stabilisation algorithm is required.

Kademlia [Maymounkov02] provides a distance function which performs an exclusive-OR between peer keys. This function is not only unidirectional but also symmetrical. No leaf table or stabilisation protocol is required.

Content Addressable Network or CAN [Ratnasamy01] uses a d-dimensional Cartesian coordinate space to partition peers. Each peer has its own zone in the coordinate space.



It maintains a routing table of peers that share a (d − 1)-dimensional hyperplane with its zone. A lookup follows a linear route from the start peer to the peer containing the query key. It involves the query being passed to peers whose zone the route traverses. Ties are broken arbitrarily. The cost of lookup is O(d n^(1/d)). A peer p_x joins with random coordinates in the keyspace. It asks a member peer to join. This member searches for a node p_o that owns the zone containing the coordinates of p_x. Kindly, p_o splits its zone in two, giving half to p_x. The routing table of p_x is initialised from p_o because p_x.table ⊂ p_o.table. Finally, p_x announces itself to peers in its routing table.
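The O(d n^(1/d)) lookup cost can be checked numerically with a sketch. It rests on simplifying assumptions that are not part of the description above: all n zones are equal in size and arranged as a grid with `side` zones per dimension (so n = side^d), the coordinate space wraps round at its edges, and each hop moves to an adjacent zone along one axis.

```python
from itertools import product

def can_hops(src, dst, side):
    """Hops between two zones when each hop moves one zone along one axis."""
    return sum(min(abs(a - b), side - abs(a - b)) for a, b in zip(src, dst))

def worst_case_hops(d, side):
    """Largest hop count from the origin zone to any zone in a side**d grid."""
    origin = (0,) * d
    return max(can_hops(origin, zone, side) for zone in product(range(side), repeat=d))
```

For n = 64 zones the worst case is 8 hops with d = 2 and 6 hops with d = 3, both within d · n^(1/d) (16 and 12 respectively). Raising d shortens routes at the price of a larger routing table per peer.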

3.5.2 JXTA

JXTA [Brookshier02, Oaks02] is a set of industry standard XML-based abstractions for P2P interactions. It includes a set of programming bindings for Java and C. The intention is for JXTA to be language agnostic.

A Peer is the building block of JXTA. It has a logical one-to-one mapping with physical machines. Peers do not map to users. They are accessed through endpoints over a variety of transports including HTTP, TCP and UDP for IP multicast. Peers are identified by optional names and globally unique names. Each peer can provide peer services that reside on only that peer. Super peers perform extra roles related to discovery or message routing. An edge peer is any non-super peer.

Peergroups are a forum for peers to interact. Peers cannot interact without being in the same group. To bootstrap peers together the Net group is used. From this a peer can create, discover and join a user defined peergroup. The Net group is seeded with super peers from http://jxta.org.

Group services operate in the context of a group with implementations on multiple peers. The following are core group services. A resolver allows generic queries to be sent between peers with replies routed back to the originator. Peers provide handlers that map to queries and send responses. Other group services extend the resolver. Endpoint routing creates the illusion that peers have a direct connection. In fact it queries adjacent peers using the resolver to determine the next hop to the endpoint. To build these routes on a per message basis is tremendously inefficient, therefore the protocol caches the route for the next message. A cache has a lifetime of 15-20 minutes, optimal for performance versus dynamism [Oaks02].

The core services are enhanced with the standard set as follows. Discovery is used to find entities such as peers, groups and pipes. The membership service provides authentication and authorisation for group access. Peers may be required to provide the service with password or digital signature credentials. It awards a JXTA credential that validates all group communications. A service provider peer may query the access interface to determine if a client peer has rights of access. Pipes are virtual communications conduits that are independent of peer endpoints. The information service is used to find runtime information for other peers.

urn:jxta:jxta-NetGroup
urn:jxta:uuid-DEADBEEFDEAFBABAFEEDBABE0000000010206
NetPeerGroup
NetPeerGroup by default

Figure 3.13: JXTA Net Peer Group Advertisement [Oaks02].

Advertisements are XML-based descriptions, as shown in Figure 3.13, that represent peers, peergroups, pipes, routes and other internal JXTA entities. They combine names, unique identifiers, endpoints and implementation specific information. An advertisement has a lifetime after which it is removed by all peers.
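The effect of an advertisement lifetime on a peer's local cache can be sketched as follows. The AdvertisementCache class is hypothetical and is not the JXTA API; the current time is passed in explicitly so that the behaviour is easy to follow.

```python
class AdvertisementCache:
    """Local cache that forgets advertisements once their lifetime has passed."""

    def __init__(self):
        self._entries = {}          # advertisement id -> (xml text, expiry time)

    def publish(self, adv_id, xml, lifetime, now):
        """Store an advertisement that expires `lifetime` seconds after `now`."""
        self._entries[adv_id] = (xml, now + lifetime)

    def get(self, adv_id, now):
        """Return the advertisement, or None if it is unknown or has expired."""
        entry = self._entries.get(adv_id)
        if entry is None:
            return None
        xml, expires = entry
        if now >= expires:
            del self._entries[adv_id]   # expired: remove it from the cache
            return None
        return xml

# Hypothetical pipe advertisement published at time 0 with a 60 second lifetime.
cache = AdvertisementCache()
cache.publish("pipe-1", "<adv/>", lifetime=60, now=0)
```

Because every peer applies the same rule independently, a stale advertisement disappears from the network without any explicit delete message.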

Discovery

The rendezvous is a special super-peer that propagates resolver query messages beyond the local network. Rendezvous peers maintain a DHT, the Shared Resource Distributed Index or SRDI, by holding a Rendezvous Peer View or RPV of known rendezvous peers ordered by their UIDs. The SRDI is loosely consistent because the RPVs can become un-synchronised until the underlying ring stabilises and they again converge. Stabilisation occurs by a rendezvous periodically sharing an arbitrary subset of its RPV with a set of randomly selected peers. Any peer can be, or change to be, a rendezvous at runtime. Edge peers connect to a rendezvous using a lease mechanism called the rendezvous protocol.

An edge peer sends a message to a rendezvous through the propagate service. The rendezvous finds the closest matching rendezvous by identifier in the RPV and forwards the message. If the second rendezvous knows about the destination edge peer it again forwards the message to it. If a message does not have a destination it is broadcast to all rendezvous peers in the RPV, eventually reaching all edge peers within a hopping distance less than the TTL parameter. Messages are uniquely identified, so duplicate messages are filtered by the rendezvous. Indexes are replicated between adjacent rendezvous, making the SRDI fault tolerant. Lost rendezvous peers are tolerated provided enough survive between stabilisations. If a message cannot be delivered, the SRDI is assumed to be presently unstable and a random walk round the rendezvous peers is performed in an attempt to find a matching index.
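Two of the steps above, choosing the closest rendezvous in the RPV and filtering duplicate messages, are sketched below. The code is illustrative only: identifiers are modelled as plain integers, "closest" is assumed to mean numerically nearest, and the function and class names are not part of JXTA.

```python
import bisect

def closest_rendezvous(rpv, target_id):
    """Return the entry of the sorted RPV that is numerically nearest to target_id."""
    i = bisect.bisect_left(rpv, target_id)
    candidates = rpv[max(i - 1, 0):i + 1]     # neighbours either side of the target
    return min(candidates, key=lambda rid: abs(rid - target_id))

class DuplicateFilter:
    """Remembers message identifiers so that repeats can be dropped."""

    def __init__(self):
        self.seen = set()

    def accept(self, message_id):
        """True the first time an identifier is seen, False for duplicates."""
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        return True

# Hypothetical RPV of four rendezvous identifiers, kept in ascending order.
RPV = [10, 40, 70, 90]
```

Keeping the RPV ordered is what allows the nearest entry to be found with a binary search rather than a scan.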

Figure 3.14: JXTA SRDI [Li03].

Discovery is about sharing and searching for advertisements with the primitives share and search. All peers index known advertisements in their local cache. Sharing sends locally cached advertisements to other peers within scope. Searching can use any index, including name-based wildcard matches, to instigate responses from other peers containing advertisements. All new advertisements are cached. Discovery has two models. IP-multicast is used to directly share and search advertisements between peers on a local network. The alternative is to use the SRDI as shown in Figure 3.14. When an edge peer has a lease with a rendezvous it periodically shares its cached advertisements with the SRDI. In periods of stabilisation all advertisements are stored in the SRDI.

Pipes are virtual communication channels that abstract not only multi-hop transport but endpoints also. They use the endpoint routing service to calculate and cache a route. The pipe binding protocol allows peers to discover pipe advertisements and bind to them at both ends. This means that a message sender and receiver do not know each other, though pipes can be directly associated with peers by embedding the pipe in a peer advertisement. Pipes come in the following flavours. Unicast pipes are one way unicast communications. The Secure unicast pipe uses asymmetrical cryptography to secure the transfer. Propagate pipes broadcast messages to multiple receivers. Finally, a duplex pipe is used for two-way point to point message exchanges. This type of pipe is also called a JXTA socket.

Gateways are a type of super-peer that relays messages to peers without direct connections. Often peers are hidden behind firewalls or are network address translated, making initiating a connection to them impossible. Instead the gateway stores messages in a queue that is polled periodically by the hidden peers. Any peer can automatically start as a gateway unless blocked. Hidden peers automatically connect to a gateway.

Performance

JXTA is supposed to be scalable thanks to the SRDI and efficient routing protocols that use direct communications where possible. Caching is also ubiquitous. Offsetting this, it uses large XML messages that often embed routing information, giving high message complexity. Its performance and scalability have been studied [Antoniu05, Halepovic03]. The following issues are raised. Pipe latency is of the order of 12-20 times slower than a TCP socket. Unicast pipes are unable to transfer in the sub-millisecond range because of large, often surplus, XML message headers. JXTA 2.0+ provides very good message size scalability. The n-scalability is O(log n) where n < 32 thanks to the SRDI. Overall, JXTA has continually improving throughput, largely thanks to a move from a random peer walk topology to a ring-based DHT. The nature of JXTA makes it good for slower speed networks or the Internet. It is also suitable for large data transfers such as Grid.

3.6 Summary

This chapter has given a literature survey of service oriented architecture focused on service messaging, service discovery and composition. We have discussed the schemes currently available and critiqued them with regard to our research objectives. The second part of this chapter was a background to P2P, a paradigm closely related to SoC that enables FT by allowing infrastructure to be decentralised.


Chapter 4

Existing Approaches to FT-SoC

This chapter is a review of frameworks specifically designed to apply FT protocols and techniques to service oriented systems; we call this FT-SoC for short. We start by defining our criteria for the review, then discuss the operation of each framework and follow that with a critique. The last part of this chapter is a review summary and discussion of the findings.

4.1 Criteria for Review

The criteria for this review stem from our broad research objectives:

  - What kind of replication is provided?
  - Are the messages ordered?
  - What failure models is the framework tolerant of? For clarity we limit these to crash and Byzantine.
  - Does the framework support diverse service implementations?
  - Can the framework support group membership protocols?

• Adapt current SoAs to make less orthogonal to FT.

  - Does the SoA within the framework make timing assumptions or is it asynchronous?
  - Does the SoA within the framework support QoS based service selection?
  - Does the SoA within the framework support dynamic FT service selection (late protocol binding)?

• Address the lack of decentralisation in current FT-SoC approaches.

  - Is the framework decentralised?
  - Is the framework scalable?

4.2 Generic FT Container

The Generic FT Container [Sommerville05] is a mediated approach to synchronous SOAP interactions. It supports the principle of the well known service [Alonso02] but requires service endpoints to be hard-coded into the framework. It is pluggable, supporting several protocols covering both passive and active replication. FT protocols are represented by a combination of XML and Java classes to produce each model. Models are instantiated within the container. The container uses indirection to receive messages destined for functional services. Messages are conceptually passed through the model, invoking the procedure classes as they go. The leaves of the model are proxy classes that send messages to destination services. All interactions are synchronous so the functional services send responses back through the container. These again get mediated and possibly combined to form reliable responses to the original client.

• Passive replication tries destination services sequentially until one produces a timely response.

• First past the post redundancy ensures a message is concurrently sent to all destination services, with the fastest response being sent back to the client.

• Active replication combines responses using a simple majority voting algorithm.

Failure detection is implicit because it depends on the timeout ∆ of underlying TCP connections; this can be changed by the keep-alive property. The container can be informed of crashes by a plug-in heartbeat detector.

The framework is limited to supporting passive and active replication for crash and Byzantine (for stateless transactions) failures. The FT container can support diverse service implementations but has no facility for group membership because endpoints are hard-wired. The SoA provided by the framework lacks any sophistication. The underlying messaging model is synchronous, based on SOAP. A lack of service discovery and of any QoS metrics prevents late or optimised service selection. Lastly, the framework is not decentralised but is effectively scalable owing to its efficiency and simplicity.
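The three replication models offered by the container can be contrasted in a short sketch. This is not the container's XML and Java model; replicas are represented as plain Python callables, any raised exception stands in for a failed or untimely replica, and timeouts are not modelled. All names are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

def passive(replicas, request):
    """Try replicas in order until one answers without raising."""
    for call in replicas:
        try:
            return call(request)
        except Exception:
            continue                    # treat any error as a failed replica
    raise RuntimeError("all replicas failed")

def first_past_the_post(replicas, request):
    """Send to all replicas concurrently; return the first successful response."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(call, request) for call in replicas]
        for future in as_completed(futures):
            if future.exception() is None:
                return future.result()
    raise RuntimeError("all replicas failed")

def active(replicas, request):
    """Collect every response and return the value held by a strict majority."""
    responses = []
    for call in replicas:
        try:
            responses.append(call(request))
        except Exception:
            pass                        # a crashed replica simply casts no vote
    if responses:
        value, votes = Counter(responses).most_common(1)[0]
        if votes > len(replicas) // 2:
            return value
    raise RuntimeError("no majority")

# Hypothetical replicas: one correct, one crashing, one returning a wrong value.
def correct(x):
    return x * 2

def crashing(x):
    raise IOError("replica down")

def faulty(x):
    return -1
```

Passive replication and first past the post both mask the crashing replica, but only the voting of active replication masks the faulty one, and then only while correct replicas remain in the majority.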

4.3 FAWS

FAult tolerance for Web Services or FAWS [Jayasinghe05] is a mediated framework. The proxy component FT-Front routes SOAP messages to replicas using the same well known service interface. A management component FT-Admin provides service selection information over Java RMI to the fronter. All replicas are monitored using a failure detector, though the author omits how this works. This information is passed back through the FT-Admin component that in turn instructs the fronter. At time of publication the author describes passive replication only but suggests active replication is possible.

The framework is limited to crash failures only. Diversity and group membership are not supported. Like the FT container, the underlying SoA lacks substance; for example, there is no obvious means to dynamically discover services and no provision for QoS data. Not enough details are provided to ascertain whether asynchronous messaging is available; it is suspected not. The framework is centralised using a proxy component. Again, like the previous framework, FAWS is scalable because of its simplicity.

4.4 FT with WS-BPEL

A scheme integrated into WS-BPEL provides FT inside a business activity [Dobson06]. It consists of special failure handlers, compensation handlers and scope constructs. Passive replication is achieved as follows. When the first service is invoked a failure handler catches the service's failure. When this happens the failure handler calls a compensation handler that invokes the next service. The strength is that the compensation handler resides within the scope construct of the first invocation, which allows state to be held between invocations. Active replication is also supported. The flow activity is used to perform concurrent tasks. A pick activity is used to perform first-past-the-post redundancy with the fastest variant response. Voting is performed on 2f + 1 responses. The algorithm is implemented as a series of nested switch activities or combined into one data-type. XPath is used to combine responses, in conjunction with an assign activity. This framework is the first to use WS- standards to provide FT.

FT/WS-BPEL can support active and passive replication with coverage of crash and Byzantine (for stateless transactions) failures. Because it is a mediated approach group membership is not supported, and neither are diverse implementations. The underlying SoA is not clearly defined, because operations reside at the application level, but apparently the SoA can support asynchronous interaction. There is no support for dynamic FT service discovery or QoS based service selection. The framework is not decentralised but is scalable due to its efficiency.

4.5 Fault Tolerance Connectors

Fault tolerance connectors are provided by the Infrastructure for Web Services Dependability (IWSD) [Salatge07]. A connector is essentially a mediating web service, as shown in Figure 4.1. IWSD allows the connector to represent an abstract web service with its own WSDL description. This service abstracts the operations of a set of unreliable equivalent services. IWSD, therefore, supports the well-known service notion [Alonso02]. Abstract services are generated by a design time tool provided with the framework. Each abstract service is made concrete within the IWSD by adding runtime assertions (acceptance tests) to check for errors and choosing a recovery strategy (FT process model). A connector provides a pre and post operation in addition to the recovery strategy and exception reports. Specific FT connectors (SFTC) are deployed to a host proxy by a management service.

Figure 4.1: IWSD [Salatge07].

The authors are not explicit about asynchronous messaging capabilities. Passive and active replication enables crash and Byzantine (for stateless transactions) failure coverage. The framework provides message journalling or triggering of save/restore operations where available on target services. Journalled messages can be replayed to subsequent services in a passive replication scheme. IWSD supports an SoA that swaps FT connectors (the FT protocol) at runtime, though this does not support QoS based service selection. Human intervention is required to decide the appropriate FT connector to be chosen. Group membership is not supported but some FT connectors allow diverse implementations. The framework is not decentralised but is scalable due to its efficiency.

4.6 Client based Frameworks

The WS-Fault Tolerance Mechanism or WS-FTM [Looker05] is a limited implementation providing diverse active replication. The concurrent invoker and selection algorithm are hosted inside the client space, the Java based Apache Axis. A client stub invokes n diverse service operations to collect the r results required for a r/2 + 1 decision. The values n and r are dependent on the topology of the implementation. Voting uses the Java Comparable interface to delegate the voting to the overlying application. The advantage of WS-FTM is performance: it requires only 2n exchanges to achieve selection.

This framework is designed to provide coverage of Byzantine failures with stateless interaction using active replication. It works with diversity but has no autonomy to provide group membership. Its middleware is based on a simple synchronous SoA embedded in the application that is not QoS enabled, and the FT protocol is fixed. Finally, the framework is not decentralised but does provide a scalable solution due to its relative simplicity.

A Middleware for Replicated Web Services (MidRWS) extends WS-FTM by adding an ordered broadcast to the active replication [Ye05]. To improve scalability MidRWS uses a probabilistic multicast, through the TOPBCAST protocol, provided by the JGroup package. The system works by wrapping services in a proxy web service site (PWSS). These proxies communicate through the TOPBCAST protocol with the client and each other. When a message is decided by the PWSS it is delivered to the real web service. Unlike IWSD, for example, the PWSS is not a well-known service [Alonso02, Salatge07]; it operates orthogonally to the web service stack.

MidRWS provides one FT protocol, a probabilistic total order state machine that can approximate to full Byzantine coverage. It also supports asynchronous exchanges. The authors suggest that group membership could be enabled but it was not available as presented. Target services can be diverse. The SoA is wrapped in a Java API that endows clients with the ability to invoke groups of services. No autonomy over FT choice, using QoS metrics or otherwise, can be applied since the framework is limited to one protocol. The framework is decentralised as clients themselves communicate with a set of services (group) using the Java API. Scalability is achieved because the framework is limited to a set number of services.

4.7 CORBA Based FT

CORBA is an open distributed component model with its own FT scheme called FT-CORBA. This scheme has been adapted by two FT-SoC frameworks.

FT-SOAP [Fang04] provides FT by creating service groups mediated by a replication manager. This orchestrates the creation of services in a group then inserts the current state into a UDDI repository. A service group is a set of services conforming to the well known service notion. Unlike other mediated approaches the application possesses a management component that is independent of the SOAP client. FT-SOAP is shown in Figure 4.2. The SOAP engine inside the client is responsible for deciding which services in a group will receive a message. It is informed by the application administration module, an embedded mediator in the SOAP client.

Figure 4.2: FT-SOAP [Fang04].

Hosts contain message logging and recovery mechanisms; these persist messages to a reliable central stable storage from which any service host can try to recover. When a service host cannot recover, failure notification is routed to the replication manager through an arrangement of local and global failure detectors. This reconfigures the service group without the failed services, and the group is updated in UDDI. From here a client can see changes to the service group and change the destinations for the next message. Host recovery is coordinated by the replication manager using the local message logs and recovery mechanisms.

FT-SOAP supports passive and active replication so has coverage over crash and Byzantine (for stateless transactions) failures. Diversity is supported. Group membership protocols are not directly supported but services can effectively join and leave during execution by their insertion in the service group in the UDDI repository. The underlying SoA provided by Apache Axis v1.1 is limited to using synchronous RPC interactions. The FT protocol cannot be changed autonomously at runtime through the SoA and there is no provision for QoS based selection. Lastly, FT-SOAP relies on singleton components such as UDDI and the replication manager, making it a centralised solution; this prevents the solution from being scalable.

FTWeb [Santos05] shares logical components with FT-SOAP. However, the arrangement favours a classical mediated approach without many of the singleton components of FT-SOAP (except UDDI). A client discovers the intermediary service, the WSDispatcher, from UDDI. Every WSDispatcher has backups in case it fails. Synchronous SOAP requests are sent to the dispatcher which, through an invoker, sends messages to destination hosts. The WSDispatcher can monitor destination hosts to change routing decisions, therefore services can be dynamically composed.

Figure 4.3: FTWeb [Santos05].

FTWeb supports active and passive replication in addition to a new hot-passive replication. This replication technique maintains backup replicas in the same state as the primary but ignores their outputs. It should be noted that no message order is enforced on the backups so FTWeb cannot support SMR. Crash and Byzantine (for stateless transactions) failures are covered. Diversity is supported through a response analyser that provides voting operations. Group membership is not supported. The derived SoA is limited to synchronous RPC interactions in common with FT-SOAP. QoS selection and dynamic FT service selection are not available. The framework uses a classical intermediary approach but the intermediary itself is passively monitored by backup intermediaries. During operation all service group information is loaded by primary and backup intermediaries, thus the framework is decentralised. This dynamism avoids the overheads of UDDI and therefore makes the solution scalable.

4.8 Transparent FT

Transparent FT [Dialani02] is a framework for OGSI Grid services. It supports the crash-recovery failure model using checkpoints of state data provided by OGSI. Grid services have the ability to take checkpoints from their hosting environment. Transparent FT has infrastructure split between an application and service layer. The messenger is a modified SOAP layer of a service context. It logs and responds to queries about messages the service has received. This layer interacts with global fault managers to receive checkpoints to gain a globally consistent state. A local fault manager coordinates the messenger to perform blocking or non-blocking recovery through the replay of service messages. Failure detectors are external entities that are registered with services and the application. The global fault manager mediates detections into service and application notifications. A service itself decides whether the failure can be recovered from a checkpoint and subsequent replay of messages. This takes place within a sandbox. A service notifies the global fault manager when it is returned to being correct. The fault manager notifies the application.
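Recovery from a checkpoint followed by replay of logged messages can be sketched as below. The Messenger and CounterService classes are toy stand-ins invented for this illustration; they do not reproduce the OGSI checkpoint interface or the fault managers, and the service is assumed to be deterministic so that replay reproduces its state.

```python
import copy

class CounterService:
    """Toy stateful service: each message adds an amount to a running total."""

    def __init__(self):
        self.total = 0

    def handle(self, amount):
        self.total += amount
        return self.total

class Messenger:
    """Logs delivered messages and keeps the latest checkpoint of service state."""

    def __init__(self, service):
        self.service = service
        self.checkpoint = copy.deepcopy(service.__dict__)
        self.log = []                   # messages delivered since the checkpoint

    def deliver(self, message):
        self.log.append(message)
        return self.service.handle(message)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.service.__dict__)
        self.log.clear()

    def recover(self, fresh_service):
        """Restore the checkpoint into a fresh instance, then replay the log."""
        fresh_service.__dict__.update(copy.deepcopy(self.checkpoint))
        for message in self.log:
            fresh_service.handle(message)
        self.service = fresh_service
        return fresh_service

def demo():
    messenger = Messenger(CounterService())
    messenger.deliver(5)
    messenger.take_checkpoint()         # state captured with total == 5
    messenger.deliver(2)
    messenger.deliver(3)
    # Simulate a crash: a fresh instance is rebuilt from checkpoint plus log.
    return messenger.recover(CounterService()).total
```

The recovered instance ends with the same total as the failed one because the checkpoint supplies the state up to the last save and the log supplies everything delivered afterwards.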

Figure 4.4: Transparent FT [Dialani02].

The aim of this framework is to provide checkpointed recovery, hence the requirement for enhanced Grid services. It supports passive (with checkpointing) and active replication. It covers the crash and Byzantine (for stateless transactions) failure models. Support for diversity cannot be ascertained but group membership is not supported. The SoA is set at design time within the fault manager component, therefore FT protocols cannot be changed at runtime. Though not asserted by the authors, checkpointing implies that message delivery is asynchronous. There is no support for QoS metrics. The framework is centralised, relying on the fault manager as the intermediary and on singleton components for failure detection. Transparent FT is not scalable.

4.9 Group Membership based Frameworks

Replication for web services (WS-Replication) is a mediating framework that integrates group membership protocols and multicast to provide service replication [Salas06]. Like FT-SOAP, this framework assumes the client must be local to a replica service, however all replicas can mediate. Upon a client invocation a proxy component is generated that forwards the request to the dispatcher and the response back to the client. The dispatcher sends a message to the real web service and also multicasts to all other replicas. To ensure full decentralisation WS-Multicast is provided through a group communication stack, JGroups. This group membership protocol makes all replicas are aware of each other enabling an eective multicast primitive. WS-Replication only provides one FT protocol, active replication, eectively covering crash and Byzantine (for state less transactions) failures. The solution enables diversity and is based on group membership protocols (JGroup). The SoA provides asynchronous interactions but is unable to provide dynamic service or QoS based selection. The group membership enables the framework to be full decentralised and the WS-Multicast enables scalability. Replication based Middleware for web services (RMWS) integrates FT into the Apache Axis 2 platform [Osrael07]. Like WS-Replication, it provides logical group membership support but it does this through the package Spread. This facility allows all replicas share a common view of the partition. Spread is also responsible for multicasting messages between replicas, though unicast is used between the client and primary replica. Currently, the framework only supports the hot passive FT model where only the primary instance communicates with the consumer, this means that RMWS is limited to tolerating crash failures. The authors claim that more FT protocols will be added in the future. Interactions can be stateful provided the application web services support save and restore checkpointing. 
The authors propose an integration with resource addressing frameworks such as WS-RF [Czajkowski04]. RMWS does not support message ordering and so cannot achieve SMR. The framework does not support diversity because it only provides passive replication. Of course group membership is at the heart of the protocol. The framework supports asynchronous interactions but without dynamic FT service or QoS based selection. RMWS is decentralised with recovery and group membership in every replica. The scalability of this solution is affected by the scalability of the Spread protocol.

4.10 WS-BUS

WS-BUS [Erradi05, Maheshwari04] has been previously reviewed in Chapter 3 as both a service messaging and discovery approach. Its inclusion in this chapter is based on the unique facilities it offers as a FT-SoC framework. The core of WS-BUS is a QoS based mediator. Based on [Yu04], clients are prioritised into a discrete set {gold, silver...} that is used to partition service capacity. Services are monitored for QoS metrics such as current response times. Client requests are then processed on the basis of their priority and the QoS resources available. The implementation uses a weighted round-robin scheme to ensure that services with the greatest capacity consume the most and highest prioritised requests. WS-BUS is primarily a MOM and so it only supports implied replication techniques. The message queue weakly models passive replication whereas publish-subscribe supports a more active replication. This implies coverage of crash and Byzantine (for stateless interactions) failures. Diversity is implied but group membership is not supported. The SoA formed by the broker pattern mediation of WS-BUS is purely asynchronous. WS-BUS offers QoS based service selection, though this is limited to application services. The (implied) FT protocol is set at design time so late protocol binding is impossible. A mediator pattern means that FT is centralised, and message queues are not considered scalable.
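The priority and weighted round-robin idea described above can be illustrated with a minimal sketch. This is not the WS-BUS implementation: the service weights, the extra `bronze` class and the function names are assumptions made purely for illustration.

```python
import heapq
from itertools import cycle

# Priority classes, best first. "bronze" is a hypothetical third class.
RANK = {"gold": 0, "silver": 1, "bronze": 2}


def weighted_schedule(weights):
    """Return one weighted round-robin cycle of service names.

    A service with weight w appears w times per cycle, interleaved so that
    higher-capacity services are visited first in every round.
    """
    order = sorted(weights, key=lambda s: (-weights[s], s))
    rounds = max(weights.values())
    return [s for r in range(rounds) for s in order if weights[s] > r]


def dispatch(requests, weights):
    """Assign (client, priority_class) requests to services.

    Higher priority classes are served first; ties keep arrival order.
    Services are visited in weighted round-robin order.
    """
    queue = [(RANK[cls], seq, client) for seq, (client, cls) in enumerate(requests)]
    heapq.heapify(queue)
    slots = cycle(weighted_schedule(weights))
    assignments = []
    while queue:
        _, _, client = heapq.heappop(queue)
        assignments.append((client, next(slots)))
    return assignments
```

In this sketch the weights stand in for monitored capacity; a mediator that tracks response times would presumably recompute them between cycles.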

4.11 FT-Net Traveler

FT-Net Traveler [Caituiro-monge07] is a recent proposal that provides decentralised asynchronous message brokering using P2P protocols. It is a framework of replicating web service brokers that forward messages to implementations of well known services. FT is implied by having many instances of brokers and scalability is implied by using P2P protocols. The paper focuses on the distribution of brokers into a Cartesian graph topology, with replication as one coordinate and load distribution as the other, in a method similar to the Content Addressable Network (CAN) [Ratnasamy01]. Information on this framework is extremely limited; its inclusion is based on the novelty of the Cartesian graph topology and the associated decentralisation and scalability. The authors weakly imply that services are replicated, with backups invoked on the failure of the primary node. This provides a passive FT model that tolerates crash failures of nodes only. Diversity and group membership are irrelevant because only passive replication is supported. The underlying SoA supports asynchronous multi-hop exchanges because of its P2P foundations. QoS metrics are not provided and the FT protocol is fixed. The solution is optimal in terms of decentralisation and it inherits double logarithmic O(log(log n)) scalability from CAN [Ratnasamy01].

4.12 RWSI

Resilient Web Services Infrastructure, or RWSI [Norcross05], is a service-oriented framework that overlays a DHT-based P2P network and has the ability to shift service implementations between hosts. The JChord DHT enhances Chord by replicating data between adjacent nodes in the successor list. When a node fails another node becomes the primary and has all the same information. RWSI uses a scheme called RRT to expose the interfaces of arbitrary components as services. Cingal bundles components with XML bindings. These bundles can be moved to and deployed on any Cingal host. Every bundle has a unique key that is used to disseminate it on the JChord network. Every JChord node is also an RRT and Cingal host.

A service $s$ has a unique URI identifier $URI_s$. This is SHA1 hashed to produce the key $KEY_s$. Such a key is used to find the endpoint URLs of instances of $s$, $\{s_1, s_2 \ldots s_n\}$. A service directory implemented on every node indexes endpoints against $KEY_s$. This directory is itself exposed as a service through RRT. JChord replication ensures the service directory is never lost whilst there is one or more nodes.

Lookups work as follows. A client searches for a service known by $URI_i$. It queries an arbitrary node that forwards to the node holding $KEY_i$. This node looks in its service directory to find the endpoints of service instances. Lookups are guaranteed to provide correct service endpoints eventually.

Hosts are deployed as follows. Firstly, a host registers with a RWSI node using a JChord join. The new node gets inserted into the ring and it populates its successor list. Certificates prevent arbitrary hosts from joining. The host directory maps hosts to keys generated from the certificate, $KEY_h = SHA1.hash(CERT_h)$. A host directory allows replication of hosts for deployment purposes.

RWSI allows autonomic service deployments. An Autonomic Manager observes the number of hosts that have a service instantiated. If due to failures this number falls below a threshold, the manager finds the corresponding bundle on the network, then deploys it on available hosts from the host directory. Autonomic managers are also registered in the network. This allows service instances to interact with them through an RRT interface to pass information such as load. It is the responsibility of the autonomic manager to remove defunct entries after hosts fail.

RWSI is limited to an implied passive replication, though this is extended through the notion of mobile service implementations so that services can survive failed hosts (to the extreme of a network of one host only). The framework has coverage of crash failures only. Passive replication does not support the notion of diverse services or group membership protocols. The SoA adopts the asynchronous messaging pattern of the underlying Chord protocol. QoS metrics are not supported and the FT protocol is fixed. RWSI, like Chord, is decentralised and scalable with O(log n).
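The key derivation and replicated service directory described above can be sketched as follows. This is a toy, in-memory model rather than JChord or RWSI code: the `Ring` class, the node names and the replication factor of two are assumptions chosen only to show how a SHA1 key selects a node and how a successor can answer when the primary has failed.

```python
import hashlib
from bisect import bisect_left


def sha1_key(text):
    """Hash a URI (or certificate text) to a 160-bit integer key."""
    return int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy ring of nodes, each holding a slice of the service directory."""

    def __init__(self, node_names):
        # Nodes are placed on the ring by hashing their (hypothetical) names.
        self.nodes = sorted((sha1_key(n), n) for n in node_names)
        self.directory = {name: {} for name in node_names}

    def successors(self, key, replicas=2):
        """Return the node responsible for key followed by its successors."""
        ids = [node_id for node_id, _ in self.nodes]
        start = bisect_left(ids, key) % len(self.nodes)
        wanted = min(replicas, len(self.nodes))
        return [self.nodes[(start + i) % len(self.nodes)][1] for i in range(wanted)]

    def register(self, service_uri, endpoint):
        """Index an endpoint against the service key on each replica node."""
        key = sha1_key(service_uri)
        for node in self.successors(key):
            self.directory[node].setdefault(key, []).append(endpoint)

    def lookup(self, service_uri, failed=()):
        """Return endpoints from the first responsible node that is alive."""
        key = sha1_key(service_uri)
        for node in self.successors(key):
            if node not in failed:
                return list(self.directory[node].get(key, []))
        return []
```

A lookup that names the primary node in `failed` still returns the registered endpoints, which mirrors the claim that the directory survives as long as a replica of the key remains.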

4.13 Byzantine FT Frameworks



The frameworks surveyed thus far have provided limited support for the Byzantine failure model. Whilst active replication provides a weak tolerance of Byzantine failures for stateless services, complete stateful tolerance requires total ordering of authenticated messages to achieve state machine replication [Schneider90]. We now present two frameworks that provide state machine replication in the Byzantine failure model.

Thema [Merideth05] is an implementation of the CLBFT protocol [Castro99, Castro01, Castro02] that uses the BASE [Rodrigues01] libraries. The architecture is shown in Figure 4.5. There are three libraries that extend BASE to provide integration with gSOAP and Apache Axis. Thema-C2RS allows clients to access Byzantine FT services. The server library Thema-RS facilitates the management of these services. And finally, Thema-US wraps third party services outside the Byzantine environment.

Figure 4.5: Thema [Merideth05].

The primary role of Thema-C2RS and Thema-US is to route HTTP transported messages to the unreliable IP-multicast. For a full discourse on the CLBFT protocol underpinning Thema refer to Chapter 2. Thema operates as follows. A client sends a SOAP message to the Thema-C2RS library. This bundles the message into a BASE request and transfers the message to all BFT replicas using multicast. The collection of BFT replicas perform CLBFT agreement to assign a sequence number to each request. When $2f + 1$ replicas agree they deliver the client request to their own instance of the application service. If an application service needs to access an external, non-replicated service, its outbound message is intercepted by Thema-RS, which bundles it into a BASE request and forwards it to the Thema-US library on an external system. Thema-US waits for $f + 1$ identical requests and then forwards only one to the non-replicated service. This then sends a reply $m_{noft}$ back to Thema-US, which in turn multicasts $m_{noft}$ back to all BFT replicas. Thema-RS treats $m_{noft}$ as another client request and uses CLBFT agreement to assign a sequence number to it. When $2f + 1$ replicas decide $m_{noft}$ it gets delivered to the SOAP engine. Finally, each replica service sends a SOAP message response $m_{ft}$ back to the original client using multicast. Thema-C2RS intercepts this. When $f + 1$ identical $m_{ft}$ messages are received by Thema-C2RS it delivers one $m_{ft}$ to the SOAP client.

Thema provides the greatest reliability available provided no more than $\frac{n-1}{3}$ replicas fail. The price for this arrangement is a lack of flexibility. Replicas must share the same local network to provide multicast. The complexity of the CLBFT protocol, in addition to authenticated communications, means Thema will have poor performance in relation to simpler schemes.

BFT-WS [Zhao07] is another Byzantine FT framework that addresses the flexibility issue of Thema by applying CLBFT over the top of reliable messaging, WS-RM [Davis07]. It consists of a series of BFT-WS handlers within the Apache Axis2 framework as shown in Figure 4.6. The security handler Rampart uses RSA asymmetrical keys to authenticate all interactions between clients and replicas.

Figure 4.6: BFT-WS [Zhao07].

BFT-WS operates as follows. A client handler BFT-Out processes outbound messages. It generates a create-sequence message when the first message from the client is sent. A corresponding terminate-sequence is also sent by BFT-Out. The client generates a Universal Unique ID, or UUID, with the create-sequence message. A multicast-sender maps the sequence between the client and service provider to a virtual sequence between client and replicas. Messages are actually left in an out-queue assigned to a group endpoint. The multicast-sender polls the out-queue and broadcasts the next message to all replicas in the group, without multicast. A message is kept in the out-queue until acknowledgements are received from all destinations. The Global-in handler filters all incoming messages to the group endpoint for duplicates. On the server side application messages are stored in the in-queue for processing. All message queues are persisted to stable storage by the storage manager. A Total-order-invoker runs on a separate thread polling the in-queue for messages that are in order within a sequence. These messages are passed to the total-order-manager for the CLBFT agreement between the replicas. Eventually messages get delivered to the application services. Responses are notified to the total-order-invoker, which delivers them to the client. On the client, messages are moved from the global-in to the voter. When $f + 1$ identical messages are received they are passed to the in-queue for final delivery to the client application.

BFT-WS is interesting for two reasons. Firstly, it guarantees FIFO-total order. Secondly, it partially removes the need for multicast. The performance is expected to be poor because of the use of RSA based authentication for all message exchanges. Also there are many extra sequencing and acknowledgement messages for WS-RM, in addition to the messages for CLBFT agreement. Despite this the authors provide flattering performance metrics.

We present the critique of Thema and BFT-WS together as they provide the same BFT protocol using different implementations. BFT is fixed in supporting only the CLBFT protocol. Through total order messaging, BFT provides true stateful Byzantine failure coverage (and by implication all nested failure models). It has no associated group membership protocol. The SoA of BFT follows the FT process model from CLBFT and is therefore asynchronous. There is no provision for QoS and of course the protocol is fixed. The framework is decentralised with the clients participating as services. BFT is not scalable because of the high number of messages exchanged in each iteration.
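The $f + 1$ matching rule used by Thema-US, Thema-C2RS and the BFT-WS voter can be sketched as below. The reasoning is that at most $f$ replicas are faulty, so $f + 1$ identical messages from distinct replicas must include at least one from a correct replica. The `Voter` class and its method names are hypothetical; neither framework's actual API is shown.

```python
from collections import defaultdict


def max_faulty(n):
    """Largest f such that n >= 3f + 1 replicas."""
    return (n - 1) // 3


class Voter:
    """Deliver a payload once f + 1 distinct replicas have sent it."""

    def __init__(self, f):
        self.needed = f + 1
        self.senders = defaultdict(set)  # payload -> ids of replicas that sent it
        self.delivered = False

    def receive(self, replica_id, payload):
        """Record one message; return the payload when it is first accepted."""
        if self.delivered:
            return None  # later copies are suppressed
        self.senders[payload].add(replica_id)
        if len(self.senders[payload]) >= self.needed:
            self.delivered = True
            return payload
        return None
```

With four replicas `max_faulty(4)` is 1, so the voter accepts a response as soon as two distinct replicas agree, while the agreement step among the replicas themselves waits for $2f + 1 = 3$.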

4.14 Discussion

We have presented our survey of FT-SoC frameworks in Table 4.1. The criteria that form the fields in this table are discussed at the beginning of this chapter in section 4.1. From our survey it can be seen that support for our criteria and objectives is patchy. Every criterion is met by at least one framework but no framework meets more than two thirds of the criteria.

Table 4.1: FT-SOC Frameworks.

Framework                     | Class        | PR(a) | AR(b) | Crash | Byzantine | Ordering    | Async(c) | DC(d) | SMR | GM(e) | QoS Matching | Scalable | Diversity | Late Protocol Binding
Container [Sommerville05]     | Intermediary | ✓     | ✓     | ✓     | ∼(f)      | ✗           | ✗        | ✗     | ✗   | ✗     | ✗            | ✓        | ✓         | ✗
FAWS [Jayasinghe05]           | Intermediary | ✓     | ✗     | ✓     | ✗         | ✗           | ✗        | ✗     | ✗   | ✗     | ✗            | ✓        | ✗         | ✗
WS-BPEL FT [Dobson06]         | Intermediary | ✓     | ✓     | ✓     | ∼         | ✗           | ✓        | ✗     | ✗   | ✗     | ✗            | ✓        | ✓         | ✗
IWSD [Salatge07]              | Connector    | ✓     | ✓     | ✓     | ∼         | ✗           | ✗        | ✗     | ✗   | ✗     | ✗            | ✓        | ✓         | ✗
WS-FTM [Looker05]             | Client       | ✗     | ✓     | ✓     | ∼         | ✗           | ✗        | ✗     | ✗   | ✗     | ✗            | ✓        | ✓         | ✗
MidRWS [Ye05]                 | Client       | ✗     | ✓     | ✓     | ∼         | PROB. TOTAL | ✓        | ✓     | ✓   | ✗     | ✗            | ✓        | ✓         | ✗
FT-SOAP [Fang04]              | CORBA        | ✓     | ✓     | ✓     | ∼         | ✗           | ✗        | ✗     | ✗   | ∼     | ✗            | ✗        | ✓         | ✗
FTWeb [Santos05]              | CORBA        | ✓     | ✓     | ✓     | ∼         | ✗           | ✗        | ✓     | ✗   | ✗     | ✗            | ✓        | ∼         | ✗
TransparentFT [Dialani02]     | Recovery     | ✓     | ✓     | ✓     | ∼         | ✗           | ✓        | ✗     | ✗   | ✗     | ✗            | ✗        | ∼         | ✗
WS-Replication [Salas06]      | Group Mem.   | ✗     | ✓     | ✓     | ∼         | ✗           | ✓        | ✓     | ✗   | ✓     | ✗            | ✓        | ✓         | ✗
RMWS [Osrael07]               | Group Mem.   | ✓     | ✗     | ✓     | ✗         | ✗           | ✓        | ✓     | ✗   | ✓     | ✗            | ✓        | ✗         | ✗
WS-BUS [Erradi05]             | MOM          | ∼     | ∼     | ✓     | ∼         | ✗           | ✓        | ✗     | ✗   | ✗     | ✓            | ✗        | ∼         | ✗
FT-Net [Caituiro-monge07]     | P2P          | ∼     | ✗     | ✓     | ✗         | ✗           | ✓        | ✓     | ✗   | ✗     | ✗            | ✓        | ✗         | ✗
RWSI [Norcross05]             | P2P          | ∼     | ✗     | ✓     | ✗         | ✗           | ✓        | ✓     | ✗   | ✗     | ✗            | ✓        | ✗         | ✗
Thema [Merideth05]            | CLBFT        | ✗     | ✗     | ✓     | ✓         | TOTAL       | ✓        | ✓     | ✓   | ✗     | ✗            | ✗        | ✗         | ✗
BFT-WS [Zhao07]               | CLBFT        | ✗     | ✗     | ✓     | ✓         | TOTAL       | ✓        | ✓     | ✓   | ✗     | ✗            | ✗        | ✗         | ✗
WSPBFT [Hall07]               | P2P          | ✓     | ✓     | ✓     | ∼         | ✗           | ✓        | ✓     | ✗   | ✗     | ✗            | ✓        | ✓         | ✓

a Passive Replication.
b Active Replication.
c Asynchronous: no implied timing assumptions.
d Decentralised.
e Group Membership.
f ∼: Implied or Partial Support.



An enhanced FT framework for SoC must support multiple FT protocols with coverage over passive, active and state-machine replication. None of the frameworks reviewed provided this. Mediated approaches [Dialani02, Dobson06, Jayasinghe05, Looker05, Salatge07, Santos05, Sommerville05] support active and passive replication only. Only [Merideth05, Zhao07] support true SMR but this is at the cost of simplicity, performance and scalability. Diversity is supported by those frameworks that provide the active replication that can utilise it. Group membership protocols that allow nodes to join in the FT protocol at runtime are only supported by [Osrael07, Salas06]. Our framework must provide all the FT protocols available to other frameworks whilst enabling group membership.

To support multiple FT protocols the underlying SoA of a framework must emulate the FT process model. This is achieved by messages being exchanged exclusively asynchronously, thereby removing any underlying timing assumptions. Frameworks based on SMR protocols such as BFT [Merideth05, Zhao07] are asynchronous, as are those based on Group Membership [Osrael07, Salas06] and P2P [Caituiro-monge07, Norcross05].

The use of QoS is limited to WS-BUS [Erradi05] (technically a MOM rather than an FT framework). This is not surprising given that all the asynchronous frameworks presented have hard-wired endpoints (keys in the P2P frameworks). Our framework needs to support the in-line broker pattern from MOM (and WS-BUS) to allow services to be discovered at runtime. This is not currently provided by any other true FT framework. The ability to choose an appropriate FT protocol implementation at runtime (late protocol binding) is not supported by any reviewed framework because this requires in-line service discovery. Our framework must support late protocol binding.

Many of the frameworks cannot provide true FT because their very infrastructure is composed of singleton components that are not immune from failure.
In most cases the singleton component is the intermediary (or proxy) indirecting exchanges between the client and service [Dialani02, Dobson06, Jayasinghe05, Looker05, Salatge07, Santos05, Sommerville05]. In other cases management components are singletons, especially where UDDI is used [Dialani02, Fang04]. Some frameworks provide decentralisation by the SoA being derived from the FT protocols provided by the framework [Merideth05, Osrael07, Salas06, Zhao07]. These solutions are in general not scalable because of the many message exchanges required by the protocol whilst having no topology optimisation. Solutions using P2P give the best scalability [Caituiro-monge07, Norcross05] because they address how the SoA should be logically distributed. Our FT framework should employ P2P protocols to make it not only decentralised but also scalable.

4.15 Summary

This chapter provides a detailed literature review of existing frameworks that address FT in SoC. We started by describing each framework in turn and closed with an in-depth discussion, including a table that summarises the review.


Chapter 5

An Integrated FT-SoC Framework

In this chapter we present an integrated FT framework for Service Oriented Computing. Our framework is known as WS Peer based Fault Tolerance (WSPBFT). Figure 5.1 shows a high-level overview of the architecture.

Figure 5.1: Architecture of WSPBFT.

The framework starts with a new SoA, Late Asynchronous Message Brokering (LAMB). This SoA, like a MOM, is based on the broker pattern that routes a message to its destination, in this case a web service rather than a message queue. To support this SoA (as well as FT protocols) we provide Sandbox, an introspective container for services that offers special facilities for FT services including failure detection, synchrony and a binding to LAMB itself. Services are distinguished as those that provide the application (these are called functional in requirements parlance) and those that provide FT. All FT services in our framework give an extended version of the interface provided by the application services because FT services mediate application services.

Every host that forms part of our architecture possesses one instance of LAMB and Sandbox. Communications between hosts (also called nodes or peers) take place over the JXTA P2P protocols discussed in section 3.5.2. We adopt JXTA to provide a distributed information model based on adapted WSDL (with bindings to LAMB and QoS metrics). This is achieved by wrapping WSDL in a new type of JXTA advertisement, the WS-Advertisement, forming the basis of service discovery. We also make use of JXTA's ability to share peer information and self organize (with DHT and rendezvous). Finally, we use the JXTA pipe, which abstracts unreliable asynchronous unicast and broadcast primitives. JXTA supports implementation bindings but these are passive. To link the LAMB and Sandbox infrastructure to JXTA we provide the WSPBFT platform. This code organises JXTA as well as providing a single codebase for our entire architecture. The platform additionally supports a web server that in turn provides an HTML user interface.

This chapter is organised as follows. We initially describe LAMB with its defining propositions and its bindings to SOAP, WSDL and JXTA. This includes the simple priority based QoS selection scheme. The chapter then describes the operation of Sandbox with reference to the FT facilities offered to simplify the development of FT replication. Next we discuss the WSPBFT platform with reference to peer and service management, flowing on to the HTML based user interface.
Finally, this chapter discusses the adapted FT protocols derived from those presented in Section 2.3.

Throughout this chapter we provide a set of simple scenarios based on the illustrative example of an IT help desk within a university. The help desk consists of an administrator that passes incoming help requests (a metaphor for messages) to IT experts (a metaphor for services). These experts may perform tasks like refilling printers or resetting passwords. These scenarios are used where necessary to illustrate a proposition or architecture.

5.1 Late Asynchronous Message Brokering


Late Asynchronous Message Brokering (LAMB) is a SoA that routes SOAP messages asynchronously, based on their content, to suitable services. LAMB is based on a model where messages are alternated between services and FT models [Hall07]. Using our IT help desk example, LAMB can be seen as an administrator that takes incoming requests for help and distributes them to the IT experts that he knows can help. LAMB is based on the broker pattern described by MOM frameworks [Erradi05, Fox02, Hapner02, Pallickara03], perhaps most closely resembling Narada Brokering. However, the LAMB SoA differs from MOM in two ways: firstly, it does not support message queues; secondly, it adheres to the following set of propositions.

Proposition 1  A message $m_x \mapsto \{i_1 \ldots i_n\}$ where $i_x$ is a service interface. In turn the interface $i_x \mapsto \{s_1 \ldots s_n\}$ where $s_x$ is a service definition. Finally, the service $s_x \mapsto \{e_1 \ldots e_n\}$ where $e_x$ is a service endpoint.

A message has a name that identifies the content. This name is used by LAMB to identify candidate services that are able to consume the message. A WSDL defined service provides a set of endpoints to which the message can be routed. The assumptive part of this proposition is that a message will always uniquely identify services. It is practical to make this assumption. Namespaces diversify identical message names by prefixing a URI. All names are qualified. We follow the well known service notion [Alonso02]. Finally, if a message is consumed by two different interfaces then they intersect to form another interface, $m_i \mapsto i_x \wedge m_i \mapsto i_y \Rightarrow m_i \mapsto i_z : i_z \equiv i_x \cap i_y$.

All LAMB entities $\{m_x, i_x, s_x, e_x\}$ are identified by URI. For services and interfaces this is their name combined with the target namespace of the WSDL description. SOAP message URIs are formed from the first element of the body content and its namespace. This URI may be over-ruled by the action, an attribute set in a special LAMB header. The action is akin to the SOAP action used in HTTP bindings for SOAP [Gudgin07b] or a routing key in MOM.

Taking our IT help desk example, this proposition can be seen as an email message requesting help with printing. The administrator sees that the problem is with printing, looks at his list to see who has expertise in printing and forwards the email message to them.

Uniquely for a SoA, LAMB allows one service to be co-hosted, providing different endpoints. When a LAMB broker chooses a service it routes a message to all endpoints relating to it. If there is a transport optimisation such as multicast or JXTA pipe then LAMB will use it, otherwise it broadcasts to all related endpoints. Using our example this is seen as the administrator forwarding the printing problem message to a mailing list consisting of a panel of printing experts.
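The mapping chain of Proposition 1 can be sketched in a few lines. The code below is illustrative only: the LAMB header namespace `urn:example:lamb`, the `header` element name and the dictionary-based registries are assumptions, and the `namespace::name` URI form simply follows the values visible in the LAMB header figure later in this chapter.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
LAMB_NS = "urn:example:lamb"  # hypothetical namespace for the LAMB header


def message_uri(envelope_xml):
    """Derive the message URI used for brokering.

    The action attribute of the (assumed) LAMB header wins if present;
    otherwise the first element of the SOAP body and its namespace are used.
    """
    root = ET.fromstring(envelope_xml)
    lamb = root.find(f"{{{SOAP_NS}}}Header/{{{LAMB_NS}}}header")
    if lamb is not None and lamb.get("action"):
        return lamb.get("action")
    first = root.find(f"{{{SOAP_NS}}}Body")[0]
    if first.tag.startswith("{"):
        namespace, _, local = first.tag[1:].partition("}")
    else:
        namespace, local = "", first.tag
    return f"{namespace}::{local}"


def resolve(uri, interfaces, services, endpoints):
    """Follow message -> interfaces -> services -> endpoints.

    Each argument after uri is a plain dict standing in for a registry.
    Returns a dict mapping every candidate service to all of its endpoints.
    """
    routes = {}
    for interface in interfaces.get(uri, ()):
        for service in services.get(interface, ()):
            routes[service] = list(endpoints.get(service, ()))
    return routes
```

A broker built this way would pick one key of the returned dictionary during selection and then deliver to every endpoint listed for it, matching the co-hosting behaviour described above.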

Proposition 2  Everything is exposed as a service that consumes one-way (asynchronous) messages.

From the perspective of the LAMB SoA there exist only messages and services. Services may be functional, i.e. an application service, or they may be an FT service; thus FT is not treated as an orthogonal issue. Messages are always one-way and take an arbitrary amount of time to be received. A service must provide a WSDL description or it cannot be brokered by LAMB. Propositions 1 and 2 remove the need for a composition layer such as WS-BPEL. Compositions break down to a set of matching message/service definitions at design time that resolve to service selection (matching) at runtime. Clients in the client-server sense are also services; they must provide interfaces (and WSDL) for responses. With the IT help desk scenario, IT experts will accept mail (post or email) requests but not direct phone calls where they have to solve problems there and then.

Proposition 3  Policy Agnostic.

A LAMB broker is not restrained by any specific policy (other than these propositions). Unlike MOM [Erradi05, Fox02, Hapner02, Pallickara03], LAMB does not assume a publish-subscribe mechanism and certainly does not support message queues. Of course the LAMB SoA could use a message queue if it was provided as a service. The advantage of this agnosticism is that any FT protocol could be swapped for another (if they provide the same interfaces). This is shown in Figure 5.2.

Figure 5.2: FT Agnostic Flow in LAMB.

LAMB may provide a modicum of FT through smart routing decisions and co-hosted services. To ensure LAMB is independent of any one scheme the actual service selection process is pluggable. A version of the selection process may include arbitrary, round-robin or QoS matching based on priority information [Yu04, Makris06]. A footnote to this proposition is that fail-stopped services (where a crash has been detected) are automatically removed from the LAMB registry.

Within the IT help desk example this proposition is described as the administrator not having any specific dictates for whom to choose to field, for example, a printing problem. The administrator is free to try several ideas for who is the best expert to choose given a specific topic. As a footnote, the administrator is fully aware of experts that are on holiday and he will not forward help requests to them.
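A pluggable selection process of the kind described in Proposition 3 might look like the sketch below. It assumes a strategy is simply a function from the currently registered services to one chosen name; the `Registry` class, the numeric priority (larger is preferred) and the function names are hypothetical and are not taken from the LAMB code.

```python
import random
from itertools import count


def arbitrary(services):
    """Pick any registered service."""
    return random.choice(sorted(services))


def make_round_robin():
    """Build a strategy that cycles through services in name order."""
    ticket = count()

    def strategy(services):
        names = sorted(services)
        return names[next(ticket) % len(names)]

    return strategy


def highest_priority(services):
    """Pick the service with the largest priority value (an assumption)."""
    return max(sorted(services), key=lambda name: services[name])


class Registry:
    """Holds live services and delegates the choice to a plug-in strategy."""

    def __init__(self, strategy):
        self.services = {}  # service name -> priority
        self.strategy = strategy

    def add(self, name, priority=0):
        self.services[name] = priority

    def fail_stop(self, name):
        """Drop a service once a crash has been detected."""
        self.services.pop(name, None)

    def select(self):
        if not self.services:
            return None
        return self.strategy(dict(self.services))
```

Swapping the strategy passed to `Registry` changes the selection behaviour without touching the broker, which is the sense in which the process is independent of any one scheme.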

Proposition 4  Transport Agnostic.

A LAMB broker uses any transport protocol as defined by the endpoint URI. The usual suspects are HTTP, TCP/IP, IP Multicast and the JXTA pipe. Using the IT help desk scenario this proposition can be explained as accepting help requests by post, email, web interface, in person or by telephone.

Proposition 5  Optimistic Brokering.

The LAMB SoA is best-effort and makes no guarantees about message delivery, thus we call it optimistic. Any guarantee would require a specific policy for enforcement and would thus break proposition 3. Guarantee and recovery are the responsibility of the services themselves, providing the need for FT services. With the IT help desk example it is not the responsibility of the administrator to ensure that requests for help are dealt with. They simply forward the requests to the matching IT expert. The experts themselves may implement a scheme that ensures that help requests are not lost or incorrectly answered.

Proposition 6  Interoperability with SOAP and WSDL.

LAMB is interoperable through bindings to SOAP and WSDL. In common with WS-Addressing [Gudgin06], LAMB annotates SOAP messages with a special header. WSDL descriptions can be annotated with LAMB mark-up. For example, LAMB message definitions can be used for shorthand within WSDL. These bindings are described in Section 5.1.2. This can be seen as the administrator of the IT help desk making annotations on incoming help requests and the CVs of all experts to make the matching process simpler for themselves.

Proposition 7  Support for stateful services.

Services may hold state between interactions. It may be appropriate to revisit a service in the same context because it contains relevant state. LAMB assumes this is always the case, unless otherwise instructed. It is up to the developer to ensure services are partitioned on state. LAMB keeps a history of services visited by a message or any of its causal [Lamport78b] ancestors. Before delivering a message $m$ to service $s_y$ a LAMB broker records $(m, i_y, s_y)$ in the causal history of the LAMB header. If later a causal descendant of $m$, $m_1$, gets matched to $i_y$ then $s_y$ must be chosen irrespective of selection. The causal history is ordered. If an interface appears twice in the history the most recent service of the two is chosen.

To explain this using the IT help desk example all communications should be seen as email. When emails are forwarded or replied to they embed the previous conversation in them. If all communications are through the administrator then they can inspect the previous conversation to determine which IT expert had previously been dealing with a printing help request (provided they have expertise in printing). This allows the expert to build up feedback information to aid in the resolution of the problem.
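The causal history rule of Proposition 7 can be sketched as follows. The function names and the list-of-tuples history are assumptions; in particular, restricting the sticky choice to services that are still candidates is an added assumption motivated by the removal of fail-stopped services, not something the proposition states.

```python
def record(history, message, interface, service):
    """Append (m, i_y, s_y) before a message is delivered to a service."""
    history.append((message, interface, service))


def sticky_service(history, interface, candidates):
    """Return the most recent service seen for this interface, if any.

    The history is ordered oldest first, so it is scanned in reverse.
    Services no longer among the candidates are skipped (an assumption).
    """
    for _, seen_interface, seen_service in reversed(history):
        if seen_interface == interface and seen_service in candidates:
            return seen_service
    return None


def select(history, interface, candidates, plugin):
    """Prefer the causal history; fall back to the selection plug-in."""
    return sticky_service(history, interface, candidates) or plugin(candidates)
```

Here `plugin` is any callable that picks one name from the candidates, so the history check sits in front of whichever selection scheme is in use.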

Proposition 8  All brokers are equal.

LAMB brokers can reside on any host. All brokers are equal; within a domain they will see the same set of services. There are no topological or geographical weightings like [Fox02, Pallickara03]. A LAMB broker may forward a message to other brokers but this behaviour is discouraged as it leads to flooding. Instead, LAMB relies on a scalable underlying dissemination mechanism such as a DHT to ensure full consistency between brokers. The IT help desk example is extended so that every department has their own IT help desk. Within the university all IT help desks have access to all IT experts even if they are in other departments.

Proposition 9  LAMB Enabled Services.

A service must be adapted to work with LAMB. All responses, that is outbound messages relative to a service, must be passed to a LAMB broker. This applies to clients too. Lastly, services must keep the causal and contextual relationship between messages by copying the LAMB header from incoming to outgoing messages. Without this, proposition 7 could not hold. Within our IT help desk scenario an IT expert will only accept help requests that have come from the administrator and in turn will only respond to the client through the administrator. All expert responses keep the same information that was in the request so that the administrator can correlate the client and IT expert in subsequent requests. The same applies to the client if they want the problem dealt with successfully.

Proposition 10  Zero Recursion.

LAMB prevents a broker from selecting the service a message has just come from, otherwise recursion is the likely outcome. This may happen with the intermediary pattern. A message may go to a third party and then return. This proposition can be bypassed by directly addressing messages. Within the IT help desk example imagine an IT expert (expert 1) who does not have the knowledge to deal with a printing help request, so they send the request back to the administrator. They would not expect the request to be immediately returned to them (expert 1). Of course the request may go to another expert (expert 2) who explains what is needed to fix the problem, so the request goes back to the original expert (expert 1) to action.

LAMB addresses our research objectives in two ways. Firstly, it enforces an asynchronous messaging environment that later facilitates the introduction of FT protocols based on the FT process model. Secondly, LAMB provides a fully autonomous runtime where messages are brokered to service providers on behalf of the client.

5.1.1 System Model

LAMB comprises three core tasks as shown in Figure 5.3. Discovery takes a URI for $m_x$ and procures a set of WSDL descriptions for services $\{s_1 \ldots s_n\}$. Every service exposed to LAMB must provide the message URIs it supports. Selection takes a set of WSDL descriptions and chooses the endpoints that are most appropriate, taking the causal history into account. The mode used for selection is dictated by a directive attribute in the LAMB header. Finally, delivery sends a message to a set of endpoints. It always calculates the most efficient transport (for example a JXTA pipe). Delivery is also responsible for updating the causal history and the service URI. LAMB exposes any combination of discovery, selection and delivery, including all three in the order they are presented.

Figure 5.3: Anatomy of LAMB.

The discovery and delivery tasks have fixed operation. LAMB selection can be controlled through the directive element in the header, and thus can be controlled by a client. LAMB provides the following directives. The default directive is assumed without being set on the header. It states that one service should be chosen, giving preference to the last matching service in the causal history. When a match is not found in the causal history the current selection plug-in is used to find a new service. Our demonstration selection plug-in provides priority based matching, described later. New is similar to default except any service in the causal history cannot be chosen. The local directive is similar to default except only a service with a local endpoint is chosen. A message is only then sent to that endpoint. Broadcast selects all matching services and endpoints. Lastly, forward sends the message to another LAMB broker with a TTL.

The IT help desk example explains discovery as the collating of CVs of candidates with relevant expertise in IT. Selection is the administrator making an arbitrary choice of IT expert based on what they think is the best criteria. Finally, delivery is the sending on of the request to the IT expert (or panel) using the addresses listed in their CV.
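The selection directives can be illustrated with a minimal Java sketch. It is not the LAMB implementation: the class, the `Candidate` type and the "first remaining candidate" fallback (standing in for the selection plug-in) are all hypothetical names introduced here. The forward directive and the rule that a message is not returned to the service it has just come from are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DirectiveSelectionSketch {

    enum Directive { DEFAULT, NEW, LOCAL, BROADCAST }

    /** Hypothetical view of a discovered service: its URI and whether it has a local endpoint. */
    static final class Candidate {
        final String serviceUri;
        final boolean local;
        Candidate(String serviceUri, boolean local) {
            this.serviceUri = serviceUri;
            this.local = local;
        }
    }

    /** Chooses candidates for a message given the directive and the causal history of service URIs. */
    static List<Candidate> select(Directive directive, List<Candidate> matches, List<String> causalHistory) {
        List<Candidate> pool = new ArrayList<>();
        for (Candidate c : matches) {
            boolean visited = causalHistory.contains(c.serviceUri);
            if (directive == Directive.NEW && visited) continue;    // new: never revisit a service
            if (directive == Directive.LOCAL && !c.local) continue; // local: local endpoints only
            pool.add(c);
        }
        if (directive == Directive.BROADCAST || pool.isEmpty()) {
            return pool; // broadcast: every matching service
        }
        if (directive != Directive.NEW) {
            // default and local: prefer the last matching service in the causal history
            for (int i = causalHistory.size() - 1; i >= 0; i--) {
                for (Candidate c : pool) {
                    if (c.serviceUri.equals(causalHistory.get(i))) {
                        return Collections.singletonList(c);
                    }
                }
            }
        }
        // assumption: stand-in for the selection plug-in, take the first remaining candidate
        return Collections.singletonList(pool.get(0));
    }
}
```

In a fuller version the fallback would be replaced by a pluggable selector such as the priority based matching of Section 5.1.4.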

5.1.2 Service Bindings

The LAMB header is a block of XML, similar to WS-Addressing, that contains routing information required by brokers. It is shown in Figure 5.4. We have previously described some of the fields such as directive and action. Causal is the container for the history of endpoints visited from causal ancestors. The priority field is used for QoS matching of services. Finally, the context is used to scope system state for application and FT services.

[Figure 5.4 listing: new, http://trading.com/::commit, http://trading.com/clbft::Andros, ..]

Figure 5.4: LAMB SOAP Bindings.

WSDL service descriptions can be annotated with LAMB bindings as seen in Figure 5.5. Message is syntactic sugar to allow a message URI to be registered within an interface (port type) without having to define the associated WSDL operation and message blocks. LAMB has specific bindings for endpoints. These are used to reference new transports without existing WSDL bindings (for example JXTA pipes). Endpoints can also be expressed as SOAP addresses for traditional protocols such as HTTP.

5.1.3 LAMB with JXTA

The JXTA P2P overlay framework (described in Chapter 3) is used to support the implementation of LAMB brokers for the following reasons. JXTA is policy agnostic: it does not apply message queues or geographical optimisations [Erradi05, Fox02, Hapner02, Pallickara03]. Information is globally disseminated, as is consistent with LAMB proposition 8. Brokers can take advantage of a scalable, O(log n), and survivable storage system provided by the JXTA SRDI. Finally, JXTA provides pipes, useful transport abstractions to aid message delivery.

Figure 5.5: LAMB WSDL Message Bindings.

SoC has been integrated with JXTA before. JXTA Bridge [Hajamohideen03] and JXTA-SOAP [Amoretti08] are both examples. They use a JXTA abstraction called modules to represent services. Modules are the JXTA equivalent of WSDL and are disseminated as advertisements. JXTA-SOAP inserts a WSDL description into a Module Specification Advertisement or MSA. This approach is limited as MSAs are only indexed on their unique identifier. Instead we create our own custom JXTA advertisement, WSAdvertisement, that has the extensible indexes required by LAMB. We show in Figure 5.6 how it is created by extending the JXTA class ExtendableAdvertisement. The WSAdvertisement class is instantiated with a WSDL description that it essentially wraps. Message URIs are extracted from the WSDL, combined into a comma separated string and placed in the indexed field M. It is then possible to search for matching WSAdvertisements using a wild-carded message URI with the JXTA discovery service. The XML of a WSAdvertisement is shown in Figure 5.7. WSAdvertisement has two other indexes. One associates each WSAdvertisement with a peer. A second provides a unique identifier for each service instance so that it cannot be discovered twice. The unique service identifier consists of the peer identifier and the service URI.

Figure 5.6: Advertisement Classes for JXTA Service Discovery.

[Figure 5.7 listing: service identifier urn:jxta:uuid-00CFA::http://tradefloor.org/::TFSourceAndrosRSAService; peer identifier urn:jxta:uuid-00CFA; M field http://tradefloor.org/andros/::FetchIndicator, http://tradefloor.org/andros/::preprepare, http://tradefloor.org/andros/::prepare, http://tradefloor.org/andros/::commit,]

Figure 5.7: A Web Service Advertisement.

Publishing is performed through the LAMB discovery task in conjunction with the JXTA discovery service. A WSDL description is wrapped in a WSAdvertisement and stored in the local JXTA cache. The advertisement is then pushed to adjacent peers. Distribution of the advertisement is then the responsibility of the SRDI. Eventually, all peers in the JXTA network receive the advertisement in their local cache. Periodically, every peer transmits a pull request for any WSAdvertisement, which accelerates distribution. WSAdvertisements have an infinite lifespan: they are only removed from a peer's cache if a service is undeployed or its host is known to have crashed.

Discovery also uses the JXTA discovery service. Message URIs are matched against service WSAdvertisements in the local cache using the M field. The WSDL document is extracted from matching WSAdvertisements and presented to the client. WSDL documents are additionally cached by the LAMB discovery task because serialising them from JXTA is an expensive operation. A fail-stop detector, Eternity, is used to filter out results from crashed hosts. Service lookups are a scalable O(log n). The framework is further optimised because the LAMB infrastructure periodically polls for new WSAdvertisements from the JXTA DHT. Brokering only needs to search the local cache for services, making discovery extremely fast.

The integration of LAMB with JXTA solves our research objective of addressing the lack of decentralisation in many FT frameworks. Trivially, it does this because JXTA is based on the P2P paradigm. All centralised components are removed because all hosts in the network formed through JXTA possess a LAMB broker of equal ability, as required by proposition 8.
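The indexing idea behind WSAdvertisement can be sketched in plain Java without any JXTA dependency. The class and method names below are hypothetical, the `*` wildcard syntax is an assumption, and the `::` separator in the service identifier is inferred from Figure 5.7; the real advertisement extends ExtendableAdvertisement and is searched through the JXTA discovery service.

```java
import java.util.List;
import java.util.regex.Pattern;

final class MessageIndexSketch {

    /** Joins the message URIs of one service into the comma separated M field. */
    static String buildM(List<String> messageUris) {
        return String.join(",", messageUris);
    }

    /** Unique service identifier: peer identifier plus service URI ("::" separator is an assumption). */
    static String serviceId(String peerId, String serviceUri) {
        return peerId + "::" + serviceUri;
    }

    /** True if any URI in the M field matches the query, where '*' matches any run of characters. */
    static boolean matches(String mField, String wildcardQuery) {
        // Quote the query literally, then re-open the quoting around each '*' to insert ".*"
        String regex = Pattern.quote(wildcardQuery).replace("*", "\\E.*\\Q");
        for (String uri : mField.split(",")) {
            if (uri.trim().matches(regex)) {
                return true;
            }
        }
        return false;
    }
}
```

A query such as `*::commit` would then match any advertisement whose M field lists a commit message, regardless of the service namespace.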

5.1.4 QoS Selection

The problem is that, given a set of interfaces that successfully match a message, it is impossible to differentiate an application (functional) service from one that supports FT. In [Hall07] this was solved by separating services and FT models, but in LAMB all entities are services. QoS markup on both types of service provides the means to distinguish the two, thus the service matching becomes QoS enabled. Examples of service selection with QoS include [Dobson05, Makris06, Ran03, Yu04]. QoS based selection addresses our research objective of enabling service differentiation, thereby enabling the SoA's integration with FT.

We can model QoS selection using our IT help desk example. It is the same as the administrator inspecting the resumes of IT experts looking for how much experience an expert has. The more experience an IT expert has the less availability they are expected to have. So given a help request the administrator can choose between a likely better or faster resolution. This choice may be influenced by the fact that the help request may have urgent emblazoned on it.

We propose a simple QoS matching scheme based on two metrics for reliability ($0 \le rel \le 100$) and performance ($0 \le perf \le 100$) as shown in Figure 5.8.

Figure 5.8: LAMB Selection Priorities.

The scheme works as follows. A service defines a priority $(perf_s, rel_s)$ as shown in Figure 5.5. The LAMB annotated message defines a priority $(perf_m, rel_m)$ as shown in Figure 5.4. Each match is scored as

$$\mu = (101 - perf_m) \cdot |perf_m - perf_s| + (101 - rel_m) \cdot |rel_m - rel_s|.$$

The service with the lowest $\mu$ is selected. The scheme not only matches priorities, it also allows the message to favour the reliability or the performance metric of the service under inspection. Trading reliability against performance allows specific FT protocols to be chosen at runtime based on the client's needs. Of course having only two metrics is limiting but it is sufficient to demonstrate the approach.
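The scoring rule can be worked through with a short Java sketch. The class and type names are hypothetical; only the formula for µ and the "lowest score wins" rule come from the text.

```java
import java.util.List;

final class QosSelectionSketch {

    /** A (perf, rel) priority pair, each expected in the range 0..100. */
    static final class Priority {
        final int perf;
        final int rel;
        Priority(int perf, int rel) {
            this.perf = perf;
            this.rel = rel;
        }
    }

    /** mu = (101 - perf_m) * |perf_m - perf_s| + (101 - rel_m) * |rel_m - rel_s| */
    static int score(Priority message, Priority service) {
        return (101 - message.perf) * Math.abs(message.perf - service.perf)
             + (101 - message.rel) * Math.abs(message.rel - service.rel);
    }

    /** Returns the index of the service with the lowest score, or -1 for an empty list. */
    static int selectLowest(Priority message, List<Priority> services) {
        int best = -1;
        for (int i = 0; i < services.size(); i++) {
            if (best < 0 || score(message, services.get(i)) < score(message, services.get(best))) {
                best = i;
            }
        }
        return best;
    }
}
```

For a message priority of (90, 10), a service at (80, 50) scores 11 · 10 + 91 · 40 = 3750 and a service at (50, 10) scores 11 · 40 + 91 · 0 = 440, so the second service is selected.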

5.1.5 Delivery

LAMB delivery takes a set of endpoints and routes a message to them. Endpoints fall into two categories. A network service provides an endpoint described by a URI and is accessed by a protocol such as HTTP. We provide a proxy that supports duplex communications to synchronous network services. This proxy takes a response, populates it with the context from the request and passes it back to the LAMB broker as per proposition 9. Alternatively, a framework service uses a JXTA addressing mechanism. We describe these services in Section 5.2. Framework services receive communication in two ways. If no endpoint is specified in the WSDL then LAMB sends the message optimistically to the peer hosting the service. If a pipe abstraction is specified within an endpoint then that is used. The advantage of a pipe is that it can be used to propagate messages opportunistically over multicast. To ensure that services bind to a propagate pipe a seed attribute is set; this is shown in the pipe element within Figure 5.5.

5.2 Sandbox

Sandbox is a container for FT framework services. It provides a direct interface to the adjacent LAMB SoA broker, supporting asynchronous message delivery only. In addition it gives the hosted service access to a number of facilities that simplify the task of developing FT protocols, such as explicit synchrony and failure detection. The container aspect stems from [Hall07, Sommerville05]; these both use a mixture of XML and Java (without other facilities) to provide a FT service. This approach has proven unnecessarily complex and slow. Instead, Sandbox supports services written purely in Java.

To demonstrate the need for Sandbox we can look towards the IT help desk illustrative example. A FT service hosted in Sandbox can be compared to an IT expert working from his office with a range of computers to provide diagnostic tests and communication to the outside world. Alternatively, a FT service not hosted in Sandbox is like an IT expert working on the street.

Sandbox borrows from the Appia framework [Carvalho03, Guerraoui06]. This uses layers of FT abstractions to achieve a FT protocol. The functionality of an abstraction is provided in a specialised session class; messages are represented by an event class. Layer classes represent the capabilities of a session. Every session must provide a handle(event) method that delegates to functionality based on the message in the event. By merging layer instances a QoS object is created. The Appia kernel uses the QoS object to route an event. Internal events to session objects are passed back to the Appia kernel. The advantage of Appia is that it provides predefined abstractions such as P and ◇P detectors. Unfortunately, Appia is over-verbose and is not thread safe, leading to synchronisation problems especially with more complex protocols.

Sandbox improves on Appia by amalgamating session and layer classes into a service class. Unfortunately, Java prohibits multiple inheritance and Ruby-like mix-ins so abstractions must be delegated out to external facilities. RWSI [Norcross05] provides a facility called the RRT that exposes arbitrary Java objects as services. Sandbox also provides this facility using Java introspection, as used in JavaBean technologies. This exposes any public method on the service class. Sandbox has two conventions:

1. Method names match message names (without the qualifier).
2. All exposed methods receive one parameter, the SOAP message.

The service class diagram is shown in Figure 5.9.

Figure 5.9: Sandbox Service Class Inheritance Example.
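The two conventions lend themselves to reflection based dispatch. The following is a minimal sketch under stated assumptions: `SoapMessage` is a hypothetical stand-in for a real SOAP message, `EchoService` is an invented example service, and the qualifier is assumed to be everything up to the last `::` of the message URI (as the URIs in Figure 5.4 suggest).

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

final class IntrospectiveDispatchSketch {

    /** Hypothetical stand-in for a SOAP message: the message URI plus a body. */
    static final class SoapMessage {
        final String messageUri;
        final String body;
        SoapMessage(String messageUri, String body) {
            this.messageUri = messageUri;
            this.body = body;
        }
    }

    /** Example service class: each public method is named after a message and takes the message. */
    public static final class EchoService {
        public String last;
        public void prepare(SoapMessage m) { last = "prepare:" + m.body; }
        public void commit(SoapMessage m) { last = "commit:" + m.body; }
    }

    /** Strips the qualifier, e.g. "http://trading.com/::commit" becomes "commit". */
    static String methodName(String messageUri) {
        int i = messageUri.lastIndexOf("::");
        return i < 0 ? messageUri : messageUri.substring(i + 2);
    }

    /** Finds the public method named after the message and invokes it with the message. */
    static void dispatch(Object service, SoapMessage message) {
        try {
            Method m = service.getClass().getMethod(methodName(message.messageUri), SoapMessage.class);
            m.invoke(service, message);
        } catch (NoSuchMethodException | IllegalAccessException | InvocationTargetException e) {
            throw new IllegalArgumentException("cannot dispatch " + message.messageUri, e);
        }
    }
}
```

Dispatching a message with URI `http://trading.com/::commit` to an `EchoService` instance invokes its `commit` method; a message with no matching public method is rejected.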

5.2.1 System Model

The system model for Sandbox is shown in Figure 5.10. Sandbox itself has two roles. Firstly, it routes messages. Sandbox inspects an incoming message for a service URI and context identifier. The service URI is used to look up a deployed service class. The context identifier is used to look up an instance. If an instance is not present then Sandbox creates one from the service class. Finally, the message is passed as a parameter to a method of the same name. The second role provided by Sandbox is to manage the lifecycle of service instances. A service must provide the init and destroy methods. When initialising a service the WSDL description is passed along with the context identifier and a reference to Sandbox itself. Destroy is called when the service is undeployed. Sandbox additionally provides logging, authentication, fail-stop detection, synchronisation and view-change facilities.

Figure 5.10: Anatomy of Sandbox.

To ensure proposition 7 Sandbox supports stateful services. Sandbox reflects the abilities of WS-RF services [Czajkowski04]. State management works as follows. Every LAMB enhanced SOAP message may contain an optional context element with a path attribute. When a stateful message reaches Sandbox this path is used in conjunction with the service URI to form an instance identifier. Sandbox checks to see if an instance has been registered against this identifier; a new service instance is created if there is no current one for this identifier. A context path may be set by a client or any service wanting to partition the state of subsequent service invocations. From [Hall07] we inherit correlation identifiers. These are unrelated to the state partitioning provided by Sandbox but instead causally relate a set of messages. Correlating messages is useful for gathering metrics between the application start and end point, so we know how long operations are taking and whether transactions are completing successfully.
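The instance lookup described above can be sketched as a map keyed on the service URI and context path. This is an illustrative sketch only: the class name, the `Supplier` based factory and the `#` separator used to form the instance identifier are assumptions, and lifecycle calls such as init and destroy are omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class InstanceRoutingSketch {

    // Deployed service classes, represented here by factories keyed on service URI
    private final Map<String, Supplier<Object>> deployed = new HashMap<>();
    // Live instances keyed on the instance identifier
    private final Map<String, Object> instances = new HashMap<>();

    void deploy(String serviceUri, Supplier<Object> factory) {
        deployed.put(serviceUri, factory);
    }

    /** Returns the instance for (service URI, context path), creating it on first use. */
    Object route(String serviceUri, String contextPath) {
        Supplier<Object> factory = deployed.get(serviceUri);
        if (factory == null) {
            throw new IllegalArgumentException("not deployed: " + serviceUri);
        }
        // assumption: the instance identifier is the service URI joined to the context path
        String instanceId = serviceUri + "#" + (contextPath == null ? "" : contextPath);
        return instances.computeIfAbsent(instanceId, id -> factory.get());
    }
}
```

Two messages carrying the same service URI and context path reach the same instance, while a different context path partitions state into a fresh instance.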

5.2.2 Logging with Domesday

Domesday is a logging facility that extends Java data structures to provide timestamped storage of messages and other arbitrary objects. The log is critical to more complex FT protocols [Lamport01, Castro99]. The original version of Domesday [Hall07] used an XML object model to provide logging. However, this system provided unacceptable performance so we have created a similar model based on Java lists and maps. A service may instantiate as many Domesday logs (DLogs) as it wishes but these are unrelated. The interface for DLog is shown in Figure 5.11. Domesday does not provide persistence to stable storage so cannot be used for crash-recovery.

Figure 5.11: Domesday Log Interface.

In the IT help desk illustrative example, logging can be seen as an essential activity for IT experts, not only to aid their work organisation but also to allow their work to be assessed. Domesday could be seen as a computer system with which IT experts log all their work so that information can be accessed again at a later time.

5.2.3 Authentication with Gatekeeper

Gatekeeper provides message and host authentication through the Java Cryptography Extension or JCE. It can generate asymmetrical keys, symmetrical keys and digests. Gatekeeper provides two ways to authenticate messages. Digital Signatures use asymmetrical cryptography to universally guarantee a message is from a sender and that it has not been tampered with. Message Authentication Codes or MACs provide the same services as digital signatures but between two known parties only.

Figure 5.12: Gatekeeper Startup Sequence.

Before Gatekeeper can provide authentication it must be initialised. This process is discussed further in Section 5.3. When peer hosts know about each other they perform a key exchange as shown in Figure 5.12. This works as follows. Peer $p_a$ generates a public/private key pair using a KeyPairGenerator obtained from java.security. The algorithm used is RSA. All asymmetrical keys generated are cached in the KeyStore because of generation overheads. When peer $p_b$ is discovered $p_a$ sends a public key with its identifier to it. Peer $p_b$ responds with its own public key and both peers store each other's keys. A random string, the secret, is generated by $p_a$, encrypted using the public key of $p_b$ (with a Cipher object) and sent to $p_b$. This guarantees only $p_b$ will be able to read the secret. Peer $p_b$ acknowledges $p_a$ with a null message. Finally, both peers store the secret in the KeyStore where it is used to generate MACs.

Digital signatures are generated using the Signature object, the peer's private RSA key and the body of the message to be signed. The signature is stored in a gatekeeper header of the message as shown in Figure 5.13. In addition the peer identifier is included. The peer identifier attribute is used by a receiver to look up the public key of the sender. A public key and message body are passed to the Signature object, which indicates whether the message is authentic.

[Figure 5.13 listing: a Base64-encoded signature followed by Base64-encoded MACs.]

Figure 5.13: Gatekeeper Header.

MACs are generated by the sender using a Mac object, secret key and the message body. When generating a MAC a sender needs to know the secret key of each of the recipients. The algorithm used is HMAC with SHA1. Many MACs are stored in the gatekeeper header as shown in Figure 5.13. When a MAC is received the secret key is found from the sender peer identifier. A receiver recreates a MAC with the Mac object, secret key and message body. If the sender MAC and receiver MAC are equivalent then the message is authentic.

Gatekeeper is accessed by framework services through the interface shown in Figure 5.14. This includes methods for authentication, encryption and digest generation.
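The two authentication paths can be illustrated with the standard Java security classes named above. This is a sketch, not the Gatekeeper interface: the class and method names are hypothetical, the 2048-bit key size and the SHA256withRSA signature algorithm are assumptions (the text only states that RSA keys are used), and key storage and the header encoding are omitted. The MAC uses HMAC with SHA1 as stated.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class GatekeeperSketch {

    /** Generates an RSA key pair (key size is an assumption). */
    static KeyPair newRsaKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Signs the message body with the sender's private key. */
    static byte[] sign(PrivateKey key, String body) {
        try {
            Signature s = Signature.getInstance("SHA256withRSA"); // assumed digest
            s.initSign(key);
            s.update(body.getBytes(StandardCharsets.UTF_8));
            return s.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Checks a signature against the sender's public key and the received body. */
    static boolean verify(PublicKey key, String body, byte[] signature) {
        try {
            Signature s = Signature.getInstance("SHA256withRSA");
            s.initVerify(key);
            s.update(body.getBytes(StandardCharsets.UTF_8));
            return s.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Computes an HMAC-SHA1 code over the body using the shared secret. */
    static byte[] mac(byte[] secret, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secret, "HmacSHA1"));
            return mac.doFinal(body.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** The receiver recreates the MAC and compares it with the one received. */
    static boolean macMatches(byte[] secret, String body, byte[] received) {
        return MessageDigest.isEqual(mac(secret, body), received);
    }
}
```

A signature can be verified by anyone holding the sender's public key, whereas a MAC only convinces a party that already shares the secret, which is why one MAC per recipient is needed.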


Figure 5.14: Gatekeeper Interface to FT Services.

5.2.4 Crash Failure Detection with Eternity

The Eternity Failure Detector is presented in [Hall09]. It provides detection of crashed or partitioned peers. Services can register a listener with Eternity through Sandbox and get notified when a peer has crashed. The Eternity model is described in Figure 5.15.

Figure 5.15: Eternity Interface.

In the IT help desk example Eternity can be seen as a system that informs the administrator and IT experts when their colleagues are not available through holiday or perhaps departments shutting down.

5.2.5 Synchronisation with Clockwork

The Clockwork facility provides synchronisation by logging an in event and generating a timeout notification if a corresponding out event is not logged within a certain upper bound. The two messages are tied together by a correlation identifier in the context of a LAMB header. Every service has its own instance of Clockwork that is configured by a binding in the WSDL document. This is shown in Figure 5.16.

Figure 5.16: Clockwork Bindings in WSDL.

To receive notification of a synchronisation failure a service implements the ClockworkSyncFailListener that it registers with the Clockwork object as shown in Figure 5.17. The service is free to choose where the synchronisation points are; usually they are associated with incoming and outgoing messages. When the syncIn() method is called a countdown timer is started for the time defined in the bindings. If a syncOut() method for the correct message URI and corresponding correlation identifier is called the countdown is cancelled. If not, all listeners registered for a synchronisation failure event are notified. A failure event usually causes some form of reconfiguration but this is FT service specific.

Figure 5.17: Clockwork Interface.

A problem with the basic Clockwork system is that it can raise too many failure events.

If a FT service becomes overloaded throughput may be reduced to a point where every message raises a synchronisation event. This will typically cause reconfiguration after reconfiguration, preventing the service from recovering. Therefore, Clockwork allows a limit to how often synchronisation failures can be raised. This is set in the WSDL bindings. If limit is set to 10000 then one synchronisation failure is allowed every ten seconds. Clockwork provides a reset method that allows all current countdowns to be forgotten. This may be used after a reconfiguration to allow clearance of a backlog of messages.

Our framework supports the liveness property through a combination of Eternity and Clockwork. Our FT services defined in Section 5.4, where appropriate, listen for triggers from Eternity and in particular Clockwork to enforce a change in protocol topology such as eventual leader election or view change. This operation ensures that no protocol remains deadlocked, i.e. eventually something good will happen.
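The interplay of syncIn, syncOut, limit and reset can be sketched as follows. This is a simplified, single-threaded illustration rather than the Clockwork implementation: instead of real countdown timers and listeners it takes the current time as a parameter and returns expired entries from a hypothetical `poll` method, and it assumes that failures suppressed by the limit are simply dropped.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ClockworkSketch {

    private final long timeoutMillis; // upper bound between an in event and its out event
    private final long limitMillis;   // minimum gap between two raised failures
    private final Map<String, Long> deadlines = new LinkedHashMap<>();
    private Long lastFailureAt = null;

    ClockworkSketch(long timeoutMillis, long limitMillis) {
        this.timeoutMillis = timeoutMillis;
        this.limitMillis = limitMillis;
    }

    private static String key(String messageUri, String correlationId) {
        return messageUri + "|" + correlationId;
    }

    /** Starts a countdown for the given message URI and correlation identifier. */
    void syncIn(String messageUri, String correlationId, long nowMillis) {
        deadlines.put(key(messageUri, correlationId), nowMillis + timeoutMillis);
    }

    /** Cancels the countdown when the corresponding out event is logged. */
    void syncOut(String messageUri, String correlationId) {
        deadlines.remove(key(messageUri, correlationId));
    }

    /** Returns the expired countdowns that may be raised as failures under the limit. */
    List<String> poll(long nowMillis) {
        List<String> raised = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() > nowMillis) {
                continue; // still counting down
            }
            String expired = entry.getKey();
            it.remove();
            if (lastFailureAt == null || nowMillis - lastFailureAt >= limitMillis) {
                raised.add(expired);
                lastFailureAt = nowMillis;
            }
        }
        return raised;
    }

    /** Forgets all current countdowns, for example after a reconfiguration. */
    void reset() {
        deadlines.clear();
    }
}
```

With a limit of 10000, two countdowns that expire at the same moment produce a single failure, and the next failure can only be raised ten seconds later.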

5.2.6 Deterministic View Changes with Viewpoint

Viewpoint provides a deterministic way of deciding which node in a group should be primary. This is required by abstractions such as Ω Eventual Leader Election and viewstamped replication [Oki88, Liskov07]. Viewpoint consists of a population of identifiers and a view epoch. If these are the same on two peers then they are both guaranteed to choose the ith peer where i = view % |population|. We use JXTA peer identifiers as the population.

Figure 5.18: Interface provided by Viewpoint.

Viewpoint may be used in two ways. In the internal form the population and view epoch are managed by a Viewpoint instance. Changes in view are notified by advancing the view as shown in Figure 5.18. Changes in population are similarly notified but these cause the view to be advanced again. A service may register a listener to be notified when the view, and therefore the primary, changes. Alternatively, the service may manage the view identifier and population itself. In this case it accesses Viewpoint statically as a singleton to compute the primary given a view and population.
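The static form of the calculation can be sketched in a few lines of Java. The class and method names are hypothetical. The sketch also assumes that every peer orders the population the same way, here by sorting the identifiers, since i = view % |population| only agrees across peers if the ith position refers to the same identifier everywhere.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

final class ViewpointSketch {

    /** Returns the primary for a view: the ith identifier where i = view % |population|. */
    static String primary(long view, Collection<String> population) {
        if (population.isEmpty()) {
            throw new IllegalArgumentException("empty population");
        }
        List<String> ordered = new ArrayList<>(population);
        Collections.sort(ordered); // assumption: a shared total order over peer identifiers
        int i = (int) Math.floorMod(view, (long) ordered.size());
        return ordered.get(i);
    }
}
```

Two peers holding the same population and view, in any local order, compute the same primary; advancing the view by one moves the primary to the next identifier in the shared order.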

5.2.7 Summary

Sandbox addresses our research objectives as follows. It greatly simplifies deployment of a FT service because it pre-solves reliability problems such as failure detection and explicit synchrony. Materially a service consists of a Java class and associated WSDL description. FT protocols are pluggable: many can be hosted simultaneously and then chosen using the QoS selection scheme. By supporting these objectives it goes a long way to increasing the utility of different FT models with SoC.

5.3 Platform Model

Our framework, Web Services Peer Based Fault-Tolerance or WSPBFT, provides LAMB brokerage and Sandbox framework services on top of a platform that uses JXTA as its overlay network as shown in Figure 5.1. JXTA is the best choice to provide decentralisation, scalability and programmability. In this section we discuss the platform including the web-based user interface.

Each WSPBFT peer contains a code base consisting of JXTA and our platform libraries. Peers are able to join the platform peergroup by virtue of this code base. When a peer starts it first joins the JXTA netgroup, from which it attempts to locate another peer hosting the platform group. Some peers will automatically become rendezvous peers if their endpoints are listed in a configuration file called rendezvous.txt in the code base. If the platform peergroup is found over rendezvous or on the local network then the peer joins the group. If after a fixed period no platform peergroup is found then one is created locally. Once the platform peergroup has been established locally a Group Service object is created. A Group Service is responsible for the WSPBFT object model. This includes LAMB, Sandbox, Gatekeeper, Eternity and the peer manager and service manager as shown in Figure 5.19.

Figure 5.19: Platform Model.

5.3.1 Managing Peers

The JXTA model is passive. A peer manager is responsible for discovering and maintaining representations of all known peers. Each peer representation has the facility of outbound unicast inter-peer communications based on the JXTA pipe abstraction. A local peer representation has a JXTA pipe for incoming communications to which listeners can be added. Finally, the peer manager keeps a representation of the peergroup that is used to broadcast messages group wide over a JXTA propagate pipe; the peer manager allows listeners to be added to the group pipe. Therefore, the peer manager provides a fully connected asynchronous inter-peer communication subsystem.

A peer manager holds several maps of known peers. These include the global peer table, a mapping of peer identifier to the Channel object that handles the sending or receiving of messages. Secondly, there is a global lookup of peer names to identifiers and vice-versa. Finally, the peer manager holds a blacklist that prevents peers known from Eternity to have crashed from rejoining the platform peergroup. A thread is used to periodically discover new peers from the JXTA SRDI. Before a discovered peer is accepted into the global peer table an exchange must take place in conjunction with the Gatekeeper preliminaries in Section 5.2.3.

Figure 5.20: Peer Discovery Sequence.

An exchange consists of the following as shown in Figure 5.20. A new peer identifier is checked against the global peer table and blacklist to ensure it is new and valid. A new peer channel is created and bound to the input end of the JXTA pipe. Public keys and shared secret keys are exchanged in conjunction with Gatekeeper. Failure of the key exchange results in the peer being blacklisted. Valid peers are added to the global peer tables. Lastly, the Eternity failure detector [Hall09] is restarted using the new global peer table. Peer advertisements are embedded in all exchanges so that both peers are able to set each other up concurrently.
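The admission decision within an exchange can be sketched as below. This is a reduced illustration with hypothetical names: the Gatekeeper key exchange is represented by a caller supplied predicate, and channel creation, the name lookup tables and the restart of Eternity are omitted.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

final class PeerAdmissionSketch {

    private final Set<String> peerTable = new HashSet<>();
    private final Set<String> blacklist = new HashSet<>();

    /**
     * Admits a discovered peer if it is new, not blacklisted and the key exchange succeeds.
     * The keyExchange predicate stands in for the Gatekeeper exchange.
     */
    boolean admit(String peerId, Predicate<String> keyExchange) {
        if (peerTable.contains(peerId) || blacklist.contains(peerId)) {
            return false; // already known, or previously rejected or crashed
        }
        if (!keyExchange.test(peerId)) {
            blacklist.add(peerId); // failed key exchange results in blacklisting
            return false;
        }
        peerTable.add(peerId);
        return true;
    }

    boolean isKnown(String peerId) {
        return peerTable.contains(peerId);
    }

    boolean isBlacklisted(String peerId) {
        return blacklist.contains(peerId);
    }
}
```

A peer whose key exchange fails once stays blacklisted, so a later discovery of the same identifier is rejected without a further exchange.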

5.3.2 Managing Services

All services are represented as WSDL documents. The service manager maintains all actions associated with services described as WSDL. Its primary role is to interact with Sandbox and LAMB to deploy and publish services respectively. It is controlled by the WSPBFT user interface as shown in Figure 5.21. All services are cached in stable storage to persist between restarts.

Figure 5.21: Service View through User Interface.

A WSDL document is uploaded to the service manager through the user interface. Once uploaded it is cached and remains in an undeployed state. The local peer identifier is added to the WSDL because it is needed for the unique service URI. Services may be removed locally or globally from the user interface. Other peers may receive the service WSDL through a push mechanism.

When a service is deployed the following happens. If it is a framework service then the WSDL is passed to Sandbox. Sandbox then looks for a class definition as shown in Figure 5.5. The local JVM is searched for a class of the same name; when found it is loaded and held in reference by Sandbox. A deployed service class gets instantiated at the point of need when a corresponding message is received. A network service omits this step. The service is then published to LAMB using the discovery task. A service may be undeployed by notifying both Sandbox and LAMB discovery. The service manager records whether a service is deployed or not by moving the cached WSDL between two folders, available and deployed. If a peer host is restarted then all previously deployed services get redeployed.

5.3.3 User Interface

Every WSPBFT enabled peer is endowed with an embedded Apache Tomcat web server. The configuration can be accessed in a browser pointing to http://localhost:1984/localjsp/index.jsp. It is split into two portions. A navigation pane contains links to other peers, views and various trace options. The view pane shows all the tools for configuring different aspects of WSPBFT such as Services as shown broadly in Figure 5.21.

Figure 5.22: UI forwarding HTTP requests between Peers.

When the user interface starts its current context is the local peer. Any views on the right-hand side are to configure the local peer. If the user selects a link to another peer the context changes to the new peer. This resets any views. The user interface allows any peer to be configured even if the host is behind a firewall or NAT. JXTA tunnelling is used to forward web requests over the peer network as shown in Figure 5.22.

The navigation pane contains two pointers to the home and current peer context. Clicking on a view link causes a new view to appear in the right-hand pane. All available views may be displayed simultaneously for a given peer. There are many views available (for example the view shown in Figure 5.21). The discovery view allows direct access to the JXTA discovery service to find advertisements. A peer view displays all the peers visible in the current peer context. The rendezvous view shows the current topology of the SRDI in terms of rendezvous and edge peers. The diagnostics view is a set of tools used to diagnose the P2P network. There are views for doping and metrics; these are described in the evaluation chapter.

5.4 Fault Tolerance Services



An objective of our research is to provide and support a greater range of FT protocols. This is aided by the provision of a consistent SoA, LAMB, and an introspective hosting environment, Sandbox, with its FT abstraction facilities such as failure detection and synchrony. In this section we address a lack of FT range by implementing a set of six protocols covering passive, active and state machine replication techniques. From the literature review in Chapter 4 it can be seen that these protocols provide a greater coverage of FT than any other framework to date. This naturally addresses our objective of increasing the utility of FT in SoC. The FT services derived from the protocols in this section have been enhanced for SoA by addressing practical issues, such as optimal communication mediums, multithreading and handling message flooding, that are not addressed in the reference protocols (Paxos, CLBFT).

The protocols themselves enforce the safety property of FT, i.e. something bad will not happen. The safety property is decomposed into a set of correctness properties such as maintaining total order message delivery. When these properties are not contravened the protocol is said to be safe. The set of correctness properties that constitute the safety property varies from protocol to protocol.

The FT services are based on the following design principles. From LAMB proposition 2, all FT services expose WSDL descriptions and consume SOAP messages asynchronously. Every FT service implementation is generic and not tied to a specific domain, such that it can be sub-classed for any application. This precludes the use of transactions, state recovery or checkpoints because these must be enabled by the application services themselves. All protocols are designed to operate in the Sandbox facility whilst making maximum use of the FT facilities (LAMB interface, Domesday, Gatekeeper, Clockwork, Eternity and Viewpoint). All the FT services use plurality where plausible with distributed protocol instances enabled by service co-hosting (explained in LAMB proposition 1). Protocol instances support group membership protocols based on input from LAMB with regards to the number of co-hosted service instances belonging to a protocol. This allows a FT service to determine Π, n and f and thus reconfigure in response to topology changes. Finally, the FT services automatically use any optimisations available. Such optimisations include the use of LAMB to calculate plurality and in particular the use of JXTA pipes to employ opportunistic multicast.

Some of the following FT services are used in conjunction with each other. Atakos, our active replication protocol, can be used in conjunction with our SMR protocols Ionian, IonianNB and Andros. This enables the use of diversity in the form of NVP in our framework (described fully in Section 5.4.3) to prevent the occurrence of common-mode failures where f = n.

5.4.1 Passive Replication with Patmos

Patmos is a passive replication protocol nominally based on the Recovery Blocks protocol [Randell75]. The difference between Patmos and similar schemes [Dialani02, Dobson06, Fang04, Jayasinghe05, Santos05, Sommerville05] is that our protocol is fully decentralised. It supports crash failures in the fail-silent distributed system model and provides a resilience of $f = \frac{n-1}{2}$.
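As an illustration only, the resilience bound can be evaluated for small groups. The Python sketch below is not part of the Java implementation in Appendix A. The function name is hypothetical, and it assumes the bound is rounded down to a whole number of instances, which the text does not state explicitly.

```python
def max_crash_faults(n: int) -> int:
    """Largest whole f satisfying f <= (n - 1) / 2 for a group of n instances.

    Assumption: the bound is rounded down, so a strict majority stays correct.
    """
    if n < 1:
        raise ValueError("a protocol group needs at least one instance")
    return (n - 1) // 2


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, max_crash_faults(n))
```

Under this assumption a group of three or four instances tolerates one crash, and a group of five tolerates two.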

To aid the reader's comprehension of Patmos we refer to our IT help desk example. Every IT expert on the team has their own personal assistant. The assistants meet up when necessary to agree which of their IT experts will handle incoming help requests. Their agreement depends on a predefined rota. Once agreement has been reached on the unlucky individual that must handle help requests, the requests start coming in from the administrator. All incoming requests go to all personal assistants but only the one with the nominated IT expert passes the help request on for solving. At the same time they notify all personal assistants of each request that is being processed. If, after a certain amount of time, the personal assistants see that help requests are not being passed to the IT expert, because the expert or their personal assistant is away, then a new meeting is arranged where another IT expert is nominated.


Algorithm 10 Patmos (Part 1)
Patmos: LAMB, Domesday, Viewpoint, Eternity, Clockwork

Upon init(sandbox, serviceId, wsdl)
    promisedView = 0
    amIDistinguished? = false
    m_last = ⊥
    endpoints = ∅
    in, promised ↦ Domesday
    Eternity.addListener(this)
    Clockwork.addListener(this)
    Viewpoint.addListener(this)

Upon Incoming(m_in)
    endpoints = m_in.lamb.causal.endpoints[wsdl.serviceURI]
    Viewpoint.setPopulation(m_in.lamb.participants[wsdl.serviceURI])
    m_last = m_in
    in.add(m_in.lamb.correlation, m_in)
    checkIfIAmDistinguished()
    Clockwork.syncIn(m_in)

function checkIfIAmDistinguished()
    if Viewpoint.amIPrimary? ∧ amIDistinguished? = false then
        m_prepare = m_last
        m_prepare.lamb.action = "::Prepare"
        m_prepare.lamb.view = Viewpoint.view
        m_prepare.lamb.replyto = wsdl.endpoint
        LAMB.deliver(m_prepare, endpoints)
    end if
    if Viewpoint.amIPrimary? = false ∧ amIDistinguished? = true then
        amIDistinguished? = false
        Stop Delivering
    end if

Upon Prepare(m_prepare)
    if m_prepare.lamb.view > promisedView then
        promisedView = m_prepare.lamb.view
        m_prepare.lamb.action = "::Promise"
        LAMB.deliver(m_prepare, m_prepare.lamb.replyto)
    else
        m_prepare.lamb.view = promisedView
        m_prepare.lamb.action = "::Nack"
        LAMB.deliver(m_prepare, m_prepare.lamb.replyto)
    end if

Upon Promise(m_promise)
    promised.add(m_promise.lamb.view, m_promise)
    if promised.size(m_promise.lamb.view) > Viewpoint.population.size() / 2 then
        amIDistinguished? = true
        Start Delivering
    end if

Upon Nack(m_nack)
    if m_nack.lamb.view > Viewpoint.view then
        Viewpoint.setView(m_nack.lamb.view)
    end if

Algorithm 11 Patmos (Part 2)

Every deliver() // once started, runs periodically
    m_x = in.pop()
    m_x.lamb.resetAction()
    m_x.lamb.directive = "local"
    LAMB.process(m_x)
    m_x.lamb.action = "::Delivered"
    LAMB.deliver(m_x, endpoints)

Upon Delivered(m_delivered)
    in.remove(m_delivered.lamb.correlation)
    Clockwork.syncOut(m_delivered)

Upon onViewChanged(newView) // implementing EternityListener
    checkIfIAmDistinguished()

Upon onChange(peerId) // implementing EternityListener
    Viewpoint.remove?(peerId)

Upon doSyncFail(m_fail) // implementing ClockworkSyncFailListener
    Viewpoint.advanceView()

Patmos satisfies the following correctness properties. A message received by the protocol is guaranteed to be delivered by one instance to a functional service through the local LAMB directive; this is termination. Messages are delivered in best effort FIFO order, that is, messages are delivered by an instance in the order in which they are received. The protocol retains liveness, provided messages continue to be received, by using Clockwork to assert upper bounds on delivery times. Violations cause the protocol to reconfigure in accordance with an Ω election. Finally, Patmos supports Group Membership by enforcing an Ω election when Π changes.

A simplified version of Patmos is presented in Algorithms 10 and 11. The actual implementation is presented in Appendix A. Messages are broadcast from the client to all instances of Patmos. Each instance stores the message using a Domesday incoming message queue indexed by the correlation identifier. All instances also start a timeout for each message by calling syncIn on Clockwork. Finally, every instance checks to see if it should be the distinguished deliverer based on hints from Viewpoint.

Before an instance can be distinguished it must solicit an Ω eventual leader election. It sends a prepare message to all other backup instances with the view epoch in which it believes it is primary. If the view epoch is higher than the last one seen by the other instances they send a promise back to the candidate primary. When it collects $\frac{n}{2} + 1$ promises it becomes the distinguished deliverer. If an instance receives a prepare with a view epoch lower than or equal to the current one it replies with a Nack message.

In normal operation the distinguished deliverer periodically delivers messages from the front of the incoming message queue. It also sends a delivered notification to all other backup instances. When a backup instance receives a delivered notification it cancels the timeout in Clockwork using the syncOut operation. A backup instance also removes the message from its own incoming message log.

A view-change and subsequent Ω election can be triggered at any time by notification of a crash by Eternity, a Clockwork synchronisation failure or a Viewpoint view-change. All these events are routed through Viewpoint to see if the view epoch changes. If it does, the instance checks to see if it should be the new distinguished deliverer. When a new instance joins the protocol, incoming messages contain the endpoint of this new instance in their headers. This is fed into Viewpoint through the setPopulation method, which in turn causes a view changed event that triggers an Ω election.

Group membership is supported by extending the Ω election. When an instance receives a prepare message with a view epoch that is less than or equal to the current view it sends a nack response. This nack message contains the backup instance's last promised view. The purpose of this mechanism is to allow newly joined instances to maintain out of date epochs until they incorrectly believe themselves the distinguished deliverer. Their inaccuracies will be rectified by a nack response.
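The prepare, promise and nack exchange described above can be sketched in a few lines of Python. This is an illustrative model only, not the Java implementation of Appendix A: the class and function names are hypothetical, messaging is replaced by direct method calls, and the candidate is assumed to count its own promise as one of the population.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Backup:
    """Backup side of the election: promise only to strictly newer view epochs."""
    promised_view: int = 0

    def on_prepare(self, view: int) -> Tuple[str, int]:
        if view > self.promised_view:
            self.promised_view = view
            return ("promise", view)
        # Stale candidate: report the last promised view so it can catch up.
        return ("nack", self.promised_view)


@dataclass
class Candidate:
    """Candidate side: distinguished once a majority of the population promises."""
    population: int
    view: int
    promises: int = 0
    distinguished: bool = False

    def on_reply(self, kind: str, view: int) -> None:
        if kind == "promise" and view == self.view:
            self.promises += 1
            if self.promises > self.population / 2:
                self.distinguished = True
        elif kind == "nack" and view > self.view:
            # Adopt the newer epoch; any promises for the old view are void.
            self.view = view
            self.promises = 0
            self.distinguished = False


def run_election(candidate: Candidate, instances: List[Backup]) -> bool:
    """Send one prepare for the candidate's current view to every instance."""
    view = candidate.view
    for instance in instances:
        candidate.on_reply(*instance.on_prepare(view))
    return candidate.distinguished
```

A candidate with an out of date epoch collects only nacks, adopts the reported epoch and does not become the distinguished deliverer, which mirrors the rectification of newly joined instances described above.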

Discussion

In normal operation Patmos can deliver a message in an efficient one step. The protocol becomes less efficient in periods of instability because a typical leadership election takes $\frac{3n}{2}$ message exchanges. It is conceivable that many elections can occur concurrently, taking more time before a distinguished deliverer can operate. Like all our protocols, Patmos will use multicast if it is available.

As far as we know Patmos is the first protocol that uses the Ω election abstraction to provide fully decentralised passive replication. Our approach is clearly not as efficient as using a single intermediary but it does guarantee safety provided no more than $\frac{n}{2}$ instances fail, removing the f = 0 of a single point of failure. In common with all passive replication models, Patmos provides a best effort FIFO order only.

Our protocol provides group membership. This is triggered when LAMB signals a new instance, or Eternity a lost instance (by leaving or crashing). The group membership change causes an Ω election. Unlike the group membership based frameworks [Osrael07, Salas06], our framework does not abstract the GM protocol away from the FT protocols. Lastly, whilst Patmos can only guarantee safety when $f = \frac{n-1}{2}$, it may anecdotally tolerate up to f = n − 1 if the distinguished deliverer remains correct.

5.4.2 Active Replication with Elegance

Elegant is an active replication protocol that can perform first-past-the-post redundancy. What makes Elegant unique is that it requires no protocol implementation at all, hence the name. It relies on the default semantics of LAMB to deliver a message to different instances of the same service on n hosts. Responses from replicas are forwarded to a destination service that chooses the first message it receives. Elegant provides tolerance of crash failures in the fail-silent distributed model provided n > f.

We refer to our IT help desk example to aid comprehension of Elegant. When an incoming help request is received by the administrator it is distributed to a group of IT experts. They all independently try to solve the problem, sending their solutions back through the administrator to the client. The client accepts the first solution.

Elegant provides the following correctness properties. The protocol guarantees termination provided n > f. When n > 1 Elegant guarantees full decentralisation with no delays. The protocol takes two steps to deliver a message to the destination service. Assuming n replicas and k destination service instances, Elegant delivers in ϕ = n + k message exchanges provided n > 1 ∧ k > 1.
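Although Elegant needs no protocol code, the first-past-the-post choice made at the destination can be illustrated with a small Python sketch. The class and method names are hypothetical, and keying replies by a correlation identifier is an assumption borrowed from the way our other protocols index messages.

```python
from typing import Dict, Optional


class FirstPastThePost:
    """Destination-side selection: keep the first reply seen per correlation id."""

    def __init__(self) -> None:
        self._accepted: Dict[str, str] = {}

    def on_response(self, correlation: str, body: str) -> bool:
        """Return True if this reply is accepted, False if it arrived too late."""
        if correlation in self._accepted:
            return False
        self._accepted[correlation] = body
        return True

    def result(self, correlation: str) -> Optional[str]:
        """The accepted reply for a request, or None if no replica has answered."""
        return self._accepted.get(correlation)
```

As long as at least one of the n replicas answers, a result exists, which is the sense in which termination holds for n > f.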


5.4.3 Active Replication with Atakos

Atakos is an active replication protocol that extends Elegant to allow diversity with NVP. It is not a selection algorithm but a platform for one. Atakos mediates a destination service to collect a quorum of responses before initialising selection and routing the response to the destination. It nominally tolerates Byzantine failures without ordering guarantees provided it receives enough responses. In practice Atakos provides tolerance of commission and common-mode failures.

We refer to our IT help desk example to aid comprehension of Atakos. Like Elegant, all incoming help requests pass from the administrator to all IT experts. Their solutions go back to the administrator, who then sends them to the help desk manager, who is free to use whatever mechanism they choose to decide what should go back to the client. In our reference implementation, if a majority of IT experts agree on a solution, that solution is sent by the manager through the administrator back to the client.

Atakos provides the following correctness properties. The protocol guarantees termination provided r ≥ quorum(n), where r is the number of responses. Termination means that Atakos will perform a merge, not that it will provide an acceptable response. Typically, $quorum(n) = \frac{n}{2} + 1$, but this can be changed in instantiations of the protocol. Atakos is not decentralised in the agreement sense. Atakos does not provide Group Membership but is able to reconfigure in relation to n.

Atakos is presented in Algorithm 12. Additionally, the Java implementation is shown in Appendix A. The protocol starts by calculating the value of n by performing LAMB discovery and counting the results. This approach is the only way to guarantee the correct value of n because SMR protocols might perturb the causal history, preventing that method of counting. Once the protocol is initialised, n is recalculated periodically on a separate thread of execution.

In normal operation Atakos receives incoming responses and logs them to a Domesday log indexed by correlation identifier. When there is a quorum of responses for a certain correlation identifier in relation to n, all responses are pulled from the log and passed to a merge operation. The implementation of merge is delegated to sub-classes of Atakos but typically these provide voting.

Algorithm 12 Atakos
Atakos: Domesday, LAMB

Upon init(sandbox, serviceId, wsdl)
    in ↦ Domesday
    n = |LAMB.discover(getReplicaServiceURI())|

Upon Incoming(m_in)
    in.add(m_in.lamb.correlation, m_in)
    if in.size(m_in.lamb.correlation) ≥ quorum(n) then
        m_merge = merge(in.getMessages(m_in.lamb.correlation))
        m_merge.lamb.directive = "local"
        m_merge.lamb.resetAction()
        LAMB.process(m_merge)
    end if

Every T
    n = |LAMB.discover(getReplicaServiceURI())|

function quorum(n)
    return n/2 + 1

function merge({m_1 ... m_k})
    return // Overridden by implementation

function getReplicaServiceURI()
    return // Overridden by implementation
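The quorum test and a voting merge of the kind described above might be sketched as follows. This Python fragment is illustrative rather than a transcription of the Java implementation in Appendix A: the class name is hypothetical, quorum(n) is assumed to use integer division, and majority voting is only one of the merge strategies a sub-class could supply.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Optional


def quorum(n: int) -> int:
    """Typical quorum size n/2 + 1, assuming integer division."""
    return n // 2 + 1


class MajorityMerger:
    """Logs replies per correlation id and votes once a quorum has arrived."""

    def __init__(self, n: int) -> None:
        self.n = n
        self._log: DefaultDict[str, List[str]] = defaultdict(list)

    def on_response(self, correlation: str, body: str) -> Optional[str]:
        """Return the majority value, or None while waiting or when there is none."""
        replies = self._log[correlation]
        replies.append(body)
        if len(replies) < quorum(self.n):
            return None  # not enough responses to attempt a merge yet
        value, votes = Counter(replies).most_common(1)[0]
        # A merge is performed, but it only yields a value if a majority agrees.
        return value if votes >= quorum(self.n) else None
```

The None result after a quorum has been reached reflects the distinction drawn above: termination guarantees that a merge is attempted, not that an acceptable response emerges from it.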

Discussion

Atakos only adds the overhead of calculating a merged response to our Elegant protocol. Calculating n is performed on a separate thread from requests so it does not influence performance. Uniquely, this protocol is hosted with the destination rather than the replica service. Finally, Atakos can be co-deployed with SMR protocols to add the potential of diversity.


5.4.4 State Machine Replication with Ionian

Ionian is an implementation of the Multi-Paxos protocol [Lamport01] providing SMR for replica services. Like Patmos, instances of Ionian are co-deployed with the functional services they mediate. As far as we are aware Ionian is the only implementation of Paxos for SoC. Ionian provides tolerance of crash failures in the fail-silent distributed system model with a resilience of $f = \frac{n-1}{2}$.

For comprehension purposes we refer the reader to the IT help desk example. Like Patmos, all the IT experts have personal assistants but this time they meet and nominate a personal assistant (rather than an IT expert) for triaging incoming help requests. Again this is determined by a predefined rota. With each incoming request the triage assistant adds a numbered ticket, i.e. this is the nth request. The other assistants see each incoming request and give the triage assistant a certain amount of time to add a ticket to that request. Normally, a triage assistant assigns a ticket to a request and then sends that ticket/request to all other personal assistants, who endorse the choice by sending it on to all other assistants. When more than half of the assistants endorse a ticket/request it is passed to the corresponding IT experts for processing. With Ionian, a triage assistant cannot assign a ticket to a request until the last request has been passed for processing. This guarantees that all IT experts receive requests in the same order (other mechanisms are used in later protocols). If the assistants can see that tickets are not being assigned quickly enough or that the triage assistant has left their post they are free to hold another meeting to nominate another triage assistant.

Ionian satisfies the following correctness properties. It guarantees decentralisation provided n > 1. Agreement means that if one correct Ionian instance delivers a message m_x then all correct instances also deliver m_x. Messages are delivered in total order, so that if a message m_i is the ith message delivered by a correct instance it is also the ith message delivered by all correct instances. Every message received by a correct node will eventually be delivered, providing termination. Group Membership is provided by an extended Ω election when a new instance joins the protocol or when one leaves or crashes. Ionian does not support View Synchrony because there is no coordination between message sending and delivery within the context of view changes. Finally, Ionian uses Clockwork and Ω election to maintain liveness.

Ionian is shown in Algorithms 13 and 14. Additionally, the Java implementation is shown in Appendix A. The Ionian protocol starts by electing a distinguished proposer through an Ω election. Once elected, the distinguished proposer for a view epoch starts sending proposal messages, with a sequence number attached, to all backup instances. When instances send promises to the new proposer they attach the correlation of the last message they accepted but did not deliver. A newly elected proposer then has to choose a message whose correlation is in the majority of promises; this is the FT memory aspect of Paxos. If there is no such majority the proposer is free to choose the next message in its incoming buffer.

Unlike Paxos, all instances receive all incoming messages, which they FIFO buffer; they also start Clockwork countdowns for each message correlation identifier and set the population within Viewpoint. Instances use the causal history associated with each message to determine the endpoints of all instances in the protocol. These endpoints are used for inter-protocol communications so LAMB discovery is not required.

When a backup instance, or Acceptor, receives a proposal it checks to see if the sequence and view epoch are higher than any previously accepted. If they are, it broadcasts an accepted message to the other instances. This is the agreement phase. All instances collect accepted messages that are greater than any seen. When they have collected $\frac{n}{2} + 1$ accepted messages they learn that message. This means the message can be delivered.

Ionian differs from Paxos in the mechanics of its order guarantee. Before a distinguished proposer can propose the next message, the current one must be learnt by a majority of learners. The same scheme is used by Fast Byzantine Paxos [Martin06]. It adds an extra message step to the Paxos protocol where backups send learnt notifications to the proposer when they deliver the message. A proposer must collect $\frac{n}{2} + 1$ learnt messages before it makes its next proposal. This guarantees there is only a backlog of one message in the backups at any one time, thus reducing the load upon them. There are two clear overheads of this mechanism. Firstly, the proposal cycle is never smaller than the amount of time taken for the protocol
to decide a message. Secondly, an extra communication step affects latency times.

The recovery portion of Ionian is the Ω election triggered by a group membership change, a crash failure detected by Eternity or a Clockwork synchronisation failure. Ionian uses receipt of a valid proposal to synchronise Clockwork. All our other recovery protocols use the more intuitive point of message delivery for this purpose. However, when Ionian is in a period of stability no more proposals are made, making the proposal synchronisation point more sensitive. In common with Patmos, Ionian uses the nack mechanism to propagate the correct view epoch to newly joined instances.

Algorithm 13 Ionian (Part 1) Proposer
Ionian: Domesday, LAMB, Viewpoint, Gatekeeper, Eternity, Clockwork

Upon init(sandbox, serviceId, wsdl)
    in, promised, accepted, learnt, undelivered ↦ Domesday
    Eternity.addListener(this)
    Clockwork.addListener(this)
    Viewpoint.addListener(this)

Upon Incoming(m_in) // ALL instances
    endpoints = m_in.lamb.causal.endpoints[wsdl.serviceURI]
    Viewpoint.setPopulation(m_in.lamb.participants[wsdl.serviceURI])
    m_last = m_in
    in.add(m_in.lamb.correlation, m_in)
    checkIfIAmDistinguished()
    Clockwork.syncIn(m_in)

function checkIfIAmDistinguished()
    if Viewpoint.amIPrimary? ∧ amIDistinguished? = false then
        m_prepare = m_last
        m_prepare.lamb.action = "::Prepare"
        m_prepare.lamb.view = Viewpoint.view
        m_prepare.lamb.replyto = wsdl.endpoint
        LAMB.deliver(m_prepare, endpoints)
    end if
    if Viewpoint.amIPrimary? = false ∧ amIDistinguished? = true then
        amIDistinguished? = false
        Stop Proposing
    end if

Upon Promise(m_promise) // PRIMARY instance
    undelivered.add(m_promise.lamb.correlation, m_promise)
    if Viewpoint.view = m_promise.lamb.view then
        promised.add(m_promise.lamb.view, m_promise)
        if promised.size(m_promise.lamb.view) > Viewpoint.populationSize() / 2 then
            amIDistinguished? = true
            allowProposal? = true
            sequence = 1
            Start Proposing
        end if
    end if

Upon Nack(m_nack) // MISTAKEN PRIMARY instance
    if m_nack.lamb.view > Viewpoint.view then
        Viewpoint.setView(m_nack.lamb.view)
    end if

Every propose()
    if allowProposal? then
        m_propose = choose()
        m_propose.lamb.view = Viewpoint.view
        m_propose.lamb.sequence = getGlobalSequence(Viewpoint.view, sequence)
        m_propose.lamb.action = "::Propose"
        sequence = sequence + 1
        allowProposal? = false
        LAMB.deliver(m_propose, endpoints)
    end if

function choose() // returns the next message for proposal
    // Choose correlation from undelivered
    m_choose = in.getFirst(correlation_x where |correlation_x| > n/2)
    undelivered.removeAll()
    if m_choose = ⊥ then
        if correlation_last ≠ ⊥ then
            m_choose = in.getNextMessageAfter(correlation_last)
        else
            m_choose = in.getFirst()
        end if
    end if
    if m_choose ≠ ⊥ then
        correlation_last = m_choose.lamb.correlation
    end if
    return m_choose

Algorithm 14 Ionian (Part 2) Acceptor, Learner and Recovery

Upon Prepare(m_prepare) // ACCEPTOR instance
    if m_prepare.lamb.view > view_epoch then
        view_epoch = m_prepare.lamb.view
        m_prepare.lamb.undelivered = in.getFirst().lamb.correlation
        m_prepare.lamb.action = "::Promise"
    else
        m_prepare.lamb.view = view_epoch
        m_prepare.lamb.action = "::Nack"
    end if
    LAMB.deliver(m_prepare, m_prepare.lamb.replyto)

Upon Propose(m_propose) // ACCEPTOR instance
    if m_propose.lamb.sequence > proposed_epoch then
        Clockwork.syncOut(m_propose)
        proposed_epoch = m_propose.lamb.sequence
        m_propose.lamb.action = "::Accepted"
        LAMB.deliver(m_propose, endpoints)
    end if

Upon Accepted(m_accepted) // LEARNER instance
    if m_accepted.lamb.sequence > accepted_epoch then
        accepted.add(m_accepted.lamb.sequence, m_accepted)
        if accepted.size(m_accepted.lamb.sequence) > Viewpoint.populationSize() / 2 then
            accepted_epoch = m_accepted.lamb.sequence
            in.removeAll(m_accepted.lamb.correlation)
            accepted.removeAll(m_accepted.lamb.sequence)
            m_accepted.lamb.action = ⊥
            m_accepted.lamb.directive = "local"
            LAMB.process(m_accepted)
            m_accepted.lamb.action = "::Learnt"
            LAMB.deliver(m_accepted, m_accepted.lamb.replyto)
        end if
    end if

Upon Learnt(m_learnt) // DISTINGUISHED PROPOSER instance
    if amIDistinguished? = true ∧ m_learnt.lamb.sequence > learnt_epoch then
        learnt.add(m_learnt.lamb.sequence, m_learnt)
        if learnt.size(m_learnt.lamb.sequence) > Viewpoint.populationSize() / 2 then
            learnt_epoch = m_learnt.lamb.sequence
            allowProposal? = true
        end if
    end if

Upon onViewChanged(newView) // implementing EternityListener
    checkIfIAmDistinguished()

Upon onChange(peerId) // implementing EternityListener
    Viewpoint.remove?(peerId)

Upon doSyncFail(m_fail) // implementing ClockworkSyncFailListener
    Viewpoint.advanceView()
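The proposal blocking rule, in which the proposer waits for a majority of learnt notifications before proposing again, can be illustrated in isolation. The Python sketch below is a simplification under stated assumptions: the class and method names are hypothetical, the incoming buffer is a local queue rather than a Domesday log, the choose step is reduced to taking the next buffered message, and learners are identified by plain strings.

```python
from collections import deque
from typing import Deque, Dict, Optional, Set, Tuple


class BlockingProposer:
    """Proposes one message at a time; unblocked only by a majority of learners."""

    def __init__(self, population: int) -> None:
        self.population = population
        self.buffer: Deque[str] = deque()
        self.sequence = 1
        self.allow_proposal = True
        self._learnt: Dict[int, Set[str]] = {}

    def propose(self) -> Optional[Tuple[int, str]]:
        """Return (sequence, message) to broadcast, or None if blocked or idle."""
        if not self.allow_proposal or not self.buffer:
            return None
        proposal = (self.sequence, self.buffer.popleft())
        self.sequence += 1
        self.allow_proposal = False  # block until the proposal has been learnt
        return proposal

    def on_learnt(self, sequence: int, learner: str) -> None:
        """Record a learnt notification; unblock on a majority for the last proposal."""
        voters = self._learnt.setdefault(sequence, set())
        voters.add(learner)
        if sequence == self.sequence - 1 and len(voters) > self.population / 2:
            self.allow_proposal = True
```

With a population of three, a second call to propose returns nothing until two distinct learners have reported the first sequence, so at most one proposal is outstanding at any time.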

Discussion

Ionian adds an extra round to Paxos: delivering a message and allowing proposal of the next takes ϕ = n(n + 3) message exchanges. An election typically takes $\frac{3n}{2}$ extra messages. Ionian has interesting behaviour that derives from the proposal blocking mechanism. It has a theoretical maximum throughput that can never be surpassed. However, it can tolerate spikes in input without overloading instances because agreement always takes place at a constant rate.

5.4.5 A Non-Blocking Variant of Ionian

IonianNB is a non-blocking variant of the Ionian protocol. It does not require a message to be delivered by a majority of correct instances before the next is proposed. Instead the sequence numbers themselves are used to enforce delivery order. IonianNB removes the extra message step of Ionian, giving it the same performance as Paxos. It allows backups to be flooded in the agreement phase. IonianNB, like Ionian, can tolerate crash failures in the fail-silent distributed system model with a resilience of $f = \frac{n-1}{2}$. It shares its correctness properties with Ionian.

With our IT help desk example, IonianNB is identical to Ionian except that a triage assistant is free to assign a ticket to every incoming help request even before the last one has been processed. It is the responsibility of every assistant to ensure, just before a help request is passed to the IT expert for processing, that all requests with a lower ticket number have already been processed. In this case the delivery order is guaranteed by the order imposed by the ticket number.

Algorithm 15 IonianNB
IonianNB extends Ionian

Every propose()
    m_propose = choose()
    m_propose.lamb.view = Viewpoint.view
    m_propose.lamb.sequence = getGlobalSequence(Viewpoint.view, sequence)
    m_propose.lamb.action = "::Propose"
    sequence = sequence + 1
    allowProposal? = false
    LAMB.deliver(m_propose, endpoints)

Upon Propose(m_propose) // ACCEPTOR instance
    if m_propose.lamb.sequence > proposed_epoch then
        proposed_epoch = m_propose.lamb.sequence
        m_propose.lamb.action = "::Accepted"
        LAMB.deliver(m_propose, endpoints)
    end if

Upon Accepted(m_accepted) // LEARNER instance
    if m_accepted.lamb.sequence ∉ learnt then
        accepted.add(m_accepted.lamb.sequence, m_accepted)
        if accepted.size(m_accepted.lamb.sequence) > Viewpoint.populationSize() / 2 then
            learnt = learnt ∪ {m_accepted.lamb.sequence}
            accepted.removeAll(m_accepted.lamb.sequence)
            m_accepted.lamb.action = ⊥
            deliverInOrder(m_accepted, getLocalSequence(m_accepted.lamb.sequence))
        end if
    end if

function deliverInOrder(m_learnt, sn)
    if sn ≥ delivery_epoch then
        pending.add(sn, m_learnt)
        if sn = delivery_epoch then
            while pending.size(sn) > 0 do
                m_deliver = pending.remove(sn)
                Clockwork.syncOut(m_deliver)
                m_deliver.lamb.directive = "local"
                LAMB.process(m_deliver)
                in.removeAll(m_deliver.lamb.correlation)
                sn = sn + 1
            end while
            delivery_epoch = sn
        end if
    end if

IonianNB extends Ionian. We present it in Algorithm 15; the Java implementation is shown in Appendix A. The protocol operates in the same way as Ionian, with all instances buffering incoming messages. A distinguished proposer is elected and must choose a message to deliver. When this message is proposed IonianNB is free to propose the next and so forth until the incoming buffer is exhausted. This is the flooding aspect. Instances receive proposals but they do not cancel the Clockwork countdowns like they would in Ionian. When an instance receives an accepted message it is not checked against an epoch because sequences may be received out of order. When agreement is learnt the message is passed for provisional delivery. Instead of delivering a learnt message immediately it is buffered and indexed by a sequence number. If the local sequence is equal to the delivery epoch the message is delivered immediately. Once sequence_i is delivered, we check if sequence_{i+1} is in the buffer; if it is then we deliver it also, then check for sequence_{i+2} and so forth. The delivery epoch is always set to the sequence number of the last delivered message. The goal is to deliver a sequence of messages without any gaps.
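The gap-free delivery rule described above can be shown with a short Python sketch. It is illustrative only: the class and attribute names are hypothetical, messages are plain strings, and the first sequence number is assumed to be 1. Here the counter holds the next sequence awaiting delivery, as the delivery epoch does on leaving deliverInOrder in Algorithm 15.

```python
from typing import Dict, List


class InOrderDelivery:
    """Buffers learnt messages and releases them strictly in sequence order."""

    def __init__(self, first_sequence: int = 1) -> None:
        self.next_sequence = first_sequence  # the next sequence awaiting delivery
        self.pending: Dict[int, str] = {}
        self.delivered: List[str] = []

    def on_learnt(self, sequence: int, message: str) -> None:
        if sequence < self.next_sequence:
            return  # already delivered, so a late duplicate is ignored
        self.pending[sequence] = message
        # Drain every consecutive sequence starting at the next expected one.
        while self.next_sequence in self.pending:
            self.delivered.append(self.pending.pop(self.next_sequence))
            self.next_sequence += 1
```

Messages learnt as sequences 2 and 3 wait in the buffer until sequence 1 arrives, at which point all three are delivered in order.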

Discussion

IonianNB improves on Ionian by delivering in ϕ = n(n + 2) message exchanges. The distinguished proposer has no governance over the rate of proposals, so effectively all the messages in the incoming buffer are proposed near concurrently. This floods all instances with sequences to agree. The sequence numbering itself enforces the total ordering property. We have created the two variations of Ionian to observe the difference in correctness and performance between blocking and flooding.
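To make the difference between the two variants concrete, the message exchange counts quoted for them, ϕ = n(n + 3) for Ionian and ϕ = n(n + 2) for IonianNB, can be tabulated. The Python function names below are hypothetical and the fragment simply evaluates those two expressions.

```python
def ionian_exchanges(n: int) -> int:
    """Message exchanges per delivered message for blocking Ionian: n(n + 3)."""
    return n * (n + 3)


def ionian_nb_exchanges(n: int) -> int:
    """Message exchanges per delivered message for IonianNB: n(n + 2)."""
    return n * (n + 2)


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(n, ionian_exchanges(n), ionian_nb_exchanges(n))
```

For n = 5 this gives 40 exchanges for Ionian against 35 for IonianNB; in general the non-blocking variant saves n exchanges per message, which corresponds to the learnt notification step it removes.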

5.4.6 State Machine Replication for Byzantine Failure with Andros

Andros is a SMR protocol that provides Byzantine agreement. It is based on the CLBFT protocol [Castro99, Castro01, Castro02]. Like Patmos it is co-deployed with the functional service it mediates, delivering messages locally through LAMB. Uniquely, Andros uses the Gatekeeper to provide digital signature and MAC based message authentication. This protocol can tolerate Byzantine failures with a resilience of $f = \frac{n-1}{3}$. By inference Andros can tolerate any lesser failure model.

For comprehension purposes we refer the reader to the IT help desk example. Andros is similar to IonianNB except that the personal assistants may not be working for the
benefit of the help desk. To counteract this problem all exchanges between assistants must be face to face to prevent a bad assistant falsifying a request or endorsement. Secondly, two rounds of endorsement rather than one are required for a given ticket/request to be processed. This allows the good assistants (who must outnumber the bad ones by 2 to 1) to ensure that a false ticket/request is voted down. A meeting of assistants that chooses the triage assistant works slightly differently from Patmos, Ionian and IonianNB. Who is to be the next triage assistant is still pre-determined but a meeting can only be started when more than a third of assistants agree face to face that it should. This again ensures that at least one good assistant agrees there should be a meeting and a new triage assistant chosen. In the other protocols the meeting can be triggered at any time because all assistants are assumed good.

Andros satisfies the following correctness properties. Andros is fully decentralised. It provides agreement: when a correct instance delivers a message m_i then all correct instances also deliver m_i. Message delivery is totally ordered: if m_i is the ith message delivered by a correct instance then it is the ith message delivered by all correct instances. Termination is guaranteed provided fewer than $\frac{n}{3}$ instances fail in any way. Instances are allowed to join, leave or crash; when they do the view is changed by agreement, hence Group Membership is supported. Finally, liveness is maintained by the Clockwork synchronisation upper bounds and a view-change protocol.

Andros is presented in Algorithms 16 and 17. Additionally, the Java implementation is presented in Appendix A. We provide two variations of Andros. The first uses RSA digital signatures for authentication, the second uses MACs. Unlike earlier FT services, Andros requires that client messages are signed with a digital signature or MAC, and are timestamped. This guarantees that the client is authentic and the request has not been tampered with. A client broadcasts its request to all instances. Instances authenticate the message with Gatekeeper and check that the timestamp is greater than the timestamp of the last message received. If not, the message is simply rejected. Unlike Ionian, Andros manages its own view epoch and population, using Viewpoint simply
158


Chapter 5:


Algorithm 16 Andros (Part 1) Primary Andros Domesday LAMB Viewpoint Gatekeeper Clockwork Upon init(sandbox, serviceId, wsdl) in, preprepared, prepared, commited, pending 7→ Domesday

Eternity.addListener(this) Clockwork.addListener(this) function sign(mx ) return

//message signed by Gatekeeper using MAC or RSA digital signature

function authenticate(mx ) return

//message authenticated by Gatekeeper using MAC or RSA digital signature

Upon Incoming(min ) // All instances if authenticate(min ) then

participants = min .lamb.causal.participants[wsdl.serviceU RI] inspectP articipants(participants) //Provided by Andros Recovery endpoints = min .lamb.causal.endpoints[wsdl.serviceU RI] if min .lamb.timestamp > timestampepoch then timestampepoch = min .lamb.timestamp mlast = min in.add(min .lamb.correlation, min ) Clockwork.syncIn(min ) checkIf AmP rimary() end if end if

function checkIf IAmP rimary() if V iewpoint.amIP rimary?(viewepoch , participants, idlocal ) then

sequence = 1

Start Proposing

else

Stop Proposing

end if

Every propose() mpp = in.pop() if mpp 6=⊥ then mpp .lamb.view = viewepoch mpp .lamb.sequence = sequence mpp .gatekeeper.digest = Gatekeeper.getDigest(mpp ) sequence = sequence + 1 sign(mpp ) mpp .lamb.action = ” :: P reprepare” LAMB.deliver(mpp , endpoints)

end if

159

Protocol


Chapter 5:


Algorithm 17 Andros (Part 2) Backup Instances (Normal Operation)

Upon Preprepare(m_pp) // All instances
    from = m_pp.gatekeeper.signer
    if m_pp.lamb.view = view_epoch ∧ Viewpoint.amIPrimary?(view_epoch, participants, from) ∧ authenticate(m_pp) then
        gsn = getGlobalSequence(view_epoch, m_pp.lamb.sequence)
        if preprepared.size(gsn) = 0 ∨ preprepared.getFirst(gsn) = m_pp.gatekeeper.digest then
            preprepared.add(gsn, m_pp.gatekeeper.digest)
            m_pp.lamb.action = ":Prepare"
            sign(m_pp)
            LAMB.deliver(m_pp, endpoints)
        end if
    end if

Upon Prepare(m_p) // All instances
    if m_p.lamb.view = view_epoch ∧ authenticate(m_p) then
        sn = m_p.lamb.sequence
        gsn = getGlobalSequence(view_epoch, sn)
        if ¬locallyCommitedSet.contains(sn) then
            from = m_p.gatekeeper.signer
            prepared.add(gsn, from)
            if prepared.size(gsn) >= 2(|participants| − 1)/3 then
                if preprepared.getFirst(gsn) = m_p.gatekeeper.digest then
                    locallyCommitedSet.add(sn)
                    m_p.lamb.action = ":Commit"
                    sign(m_p)
                    LAMB.deliver(m_p, endpoints)
                end if
            end if
        end if
    end if

Upon Commit(m_c) // All instances
    if m_c.lamb.view = view_epoch ∧ authenticate(m_c) then
        sn = m_c.lamb.sequence
        gsn = getGlobalSequence(view_epoch, sn)
        if ¬locallyDeliveredSet.contains(sn) then
            from = m_c.gatekeeper.signer
            committed.add(gsn, from)
            if committed.size(gsn) > 2(|participants| − 1)/3 then
                locallyDeliveredSet.add(sn)
                deliverInOrder(m_c, sn)
            end if
        end if
    end if

function deliverInOrder(m_x, sn)
    if sn ≥ delivery_epoch then
        pending.add(sn, m_x)
        if sn = delivery_epoch then
            while pending.size(sn) > 0 do
                m_deliver = pending.remove(sn)
                m_deliver.lamb.directive = "local"
                LAMB.process(m_deliver)
                in.removeAll(m_deliver.lamb.correlation)
                Clockwork.syncOut(m_deliver)
                sn = sn + 1
            end while
            delivery_epoch = sn
        end if
    end if

to hint at the primary. Andros extracts the population of instance endpoints from the causal history of the message. The group membership is then explicitly handled by the protocol. Each instance logs the request in its incoming buffer and starts a Clockwork synchronisation countdown. Finally, every instance checks whether it should be the primary in this view.

A primary instance sends a signed preprepare message containing the view epoch, a sequence number and a digest of the message body for the first message in the incoming buffer. Unlike Ionian, Andros primaries cannot perform a choose step, since a view change is propagated by agreement (see Section 5.4.6).

When an instance, including the primary, receives a preprepare message, it authenticates the message, checks the view epoch against the current one and checks that the signer is the current primary. The attached digest is used to check whether the view/sequence has already been preprepared for another message; if it has, the message is rejected. When the message is accepted, the digest is logged against the global sequence gs = view_epoch × offset + sequence. Finally, the instance signs and sends a prepare message to all instances; this has the same view epoch, sequence and digest.

When a prepare message is received, an instance checks that the view epoch is current, the message is authentic and the sequence has not already been committed. If the message is valid it is logged in the prepared log against the global sequence number. When an instance collects 2f prepare messages for the same global sequence number, it can commit that message to the global sequence. The global sequence is logged and the message is changed into a commit message, signed and sent to all instances, starting the second agreement phase.

Upon receipt of a commit message an instance checks that the view is current, the message is authentic and the sequence has not already been passed for delivery. If valid, the message is logged in the commit log. When 2f + 1 messages are collected for the same global sequence the message can be provisionally delivered.

Like IonianNB, instead of delivering a committed message immediately, Andros buffers it, indexed by its sequence number. If the sequence is equal to the delivery epoch the message is delivered immediately. Once sequence i is delivered, we check whether sequence i + 1 is in the buffer; if it is then we deliver it also, then check for sequence i + 2 and so forth. The delivery epoch is then advanced past the last delivered message, so it always holds the next sequence number awaiting delivery. The goal is to deliver a sequence of messages without any gaps.
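The buffered, gap-free delivery step can be sketched as follows. This is a minimal Python illustration rather than the Java implementation from Appendix A; the class and method names (InOrderDelivery, commit) are hypothetical, and the delivery epoch is modelled as the next sequence number awaiting delivery, following the deliverInOrder pseudocode.

```python
class InOrderDelivery:
    """Buffers committed messages and releases them without gaps."""

    def __init__(self, first_sequence=1):
        self.delivery_epoch = first_sequence  # next sequence number to deliver
        self.pending = {}                     # sequence number -> message
        self.delivered = []                   # messages handed to the service

    def commit(self, sn, message):
        """Record a committed message and deliver every consecutive one."""
        if self.delivery_epoch > sn:
            return  # already delivered, so the duplicate is ignored
        self.pending[sn] = message
        # Drain the buffer while the next expected sequence is present.
        while self.delivery_epoch in self.pending:
            self.delivered.append(self.pending.pop(self.delivery_epoch))
            self.delivery_epoch += 1


buffer = InOrderDelivery()
buffer.commit(2, "m2")  # held back: sequence 1 has not arrived yet
buffer.commit(1, "m1")  # releases m1, then the buffered m2
buffer.commit(4, "m4")  # held back: sequence 3 is missing
```

After these three calls only m1 and m2 have been delivered; m4 stays in the buffer until sequence 3 is committed.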



Andros, like other Byzantine atomic broadcast protocols, guarantees that 2f + 1 replicas deliver a message in order. The actual instances of the functional service may also be faulty. Therefore, a destination service has to collect (n − 1)/3 + 1 (that is, f + 1) identical responses to be sure that one, and therefore all, are correct.
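The f + 1 rule for the destination service can be illustrated with a short sketch. It assumes n = 3f + 1 replicas, as in the Andros configurations, and the function names (max_faulty, accept_response) are hypothetical helpers, not part of the framework.

```python
from collections import Counter


def max_faulty(n):
    """Largest f tolerated by n Byzantine replicas, from f = (n - 1) / 3."""
    return (n - 1) // 3


def accept_response(responses, n):
    """Return a response once f + 1 identical copies exist, else None."""
    if not responses:
        return None
    needed = max_faulty(n) + 1
    value, count = Counter(responses).most_common(1)[0]
    return value if count >= needed else None


# With n = 4 replicas, f = 1, so two matching responses are enough.
accepted = accept_response(["42", "13", "42"], n=4)
undecided = accept_response(["42", "13"], n=4)
```

Because at most f replicas are faulty, any value returned by f + 1 of them must have come from at least one correct replica.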

Recovery Protocol

We present the recovery protocol of Andros in Algorithm 18. Additionally, the full Java implementation is shown in Appendix A. Our recovery protocol is a simplification of the CLBFT viewstamped-replication technique, for two reasons. Firstly, the complexity of the CLBFT recovery protocol (as shown in Chapter 2) would make Andros too slow within our framework. CLBFT and BASE implementations are written in the faster C++ language and executed natively, trading portability for performance. Secondly, Andros does not assume that the application services it mediates provide state transfer for checkpoints. We only assume that the functional services have been adapted to work with LAMB.

A reconfiguration is triggered by a change in Π, a crash failure detection by Eternity or a Clockwork synchronisation failure between correlated incoming and delivered messages. Unlike Ionian, Ω is not used because faulty instances may try to get elected by preparing a high view epoch. Instead, f + 1 different instances must agree to change to a new consistent view. When a correct instance collects this view change certificate for view v + 1 it is duty bound to change to the new view. To prevent faulty instances from tricking the recovery protocol, all view-change messages are authenticated, ensuring that an instance can vote only once.

When a view change is triggered by agreement the following happens. Firstly, all Domesday logs except the incoming buffer are flushed. Unlike CLBFT, we restart the protocol on the incoming buffer rather than on a hashed set of checkpoints and pre-prepared messages; Andros has no state recovery mechanisms. Since 2f + 1 commits are required before message delivery, we rely on reasonable synchronisation between instances to keep the number of internal messages (being processed by Andros instances at any one time) to a minimum.



Algorithm 18 Andros (Part 3) Recovery Protocol

Upon ViewChange(m_vc)
    if m_vc.lamb.newView > view_epoch ∧ Gatekeeper.authenticate(m_vc) then
        viewChanges.add(m_vc.lamb.newView, m_vc.gatekeeper.signer)
        if viewChanges.size(m_vc.lamb.newView) > |participants|/3 then
            view_epoch = m_vc.lamb.newView
            Reset All Logs (except in)
            checkIfIAmPrimary()
        end if
    end if

function sendViewChange()
    m_vc = m_last
    m_vc.lamb.newView = view_epoch + 1
    m_vc.lamb.action = "::ViewChange"
    Gatekeeper.sign(m_vc)
    LAMB.deliver(m_vc, endpoints)

function inspectParticipants(participantSet) // implementing EternityListener
    if participant_count > 0 ∧ participant_count ≠ participantSet.size() then
        sendViewChange()
    end if
    participant_count = participantSet.size()

Upon onChange(peerId) // implementing EternityListener
    if participants.remove(peerId) then
        inspectParticipants(participants)
    end if

Upon doSyncFail(m_fail) // implementing ClockworkSyncFailListener
    sendViewChange()

It is difficult to contrast the recovery times of Andros and a true CLBFT implementation directly, but we believe the simplicity of ours would make it faster. When a new view is agreed, the deterministically chosen primary takes over by proposing messages from its incoming buffer.
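The agreement step of the recovery protocol, in which more than |participants|/3 distinct signers must vote for the same new view, can be sketched as below. This is an illustrative Python model only: the class name ViewChangeCollector is hypothetical, and the membership check on the signer is an assumption standing in for Gatekeeper authentication.

```python
class ViewChangeCollector:
    """Counts one view-change vote per signer for each proposed view."""

    def __init__(self, participants, view_epoch=0):
        self.participants = set(participants)
        self.view_epoch = view_epoch
        self.votes = {}  # proposed view -> set of distinct signers

    def on_view_change(self, new_view, signer):
        """Return True when this vote causes a switch to new_view."""
        if self.view_epoch >= new_view or signer not in self.participants:
            return False  # stale view, or a sender we cannot authenticate
        self.votes.setdefault(new_view, set()).add(signer)
        if len(self.votes[new_view]) > len(self.participants) / 3:
            self.view_epoch = new_view
            self.votes.clear()  # stands in for "Reset All Logs (except in)"
            return True
        return False


collector = ViewChangeCollector(["a", "b", "c", "d"])
first = collector.on_view_change(1, "a")   # one vote, not enough
repeat = collector.on_view_change(1, "a")  # same signer, still one vote
second = collector.on_view_change(1, "b")  # two distinct votes, f + 1 for n = 4
```

Storing signers in a set is what ensures that an instance can vote only once for a given view.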

Discussion

In normal operation, Andros is equivalent to CLBFT, delivering messages in three steps: ε = 3 and ϕ = 2n assuming a multicast primitive, otherwise ϕ = 2n(n + 1). A view change requires ϕ = n(f + 1), but this operates concurrently with normal operation, limiting the effect on performance.

5.5 Summary

This chapter has provided a description of our FT framework for SoC. The framework addresses the majority of our research objectives as follows. LAMB is a new SoA that is less orthogonal to FT than the other examples in Chapter 4. It firstly enforces an asynchronous messaging environment (derived from the MOM broker pattern) that better facilitates implementation of the FT process model. LAMB operates autonomously at runtime because it uses name based matching (like the MOM routing key). These names can be automatically derived from name-spaced elements within a SOAP message and WSDL description, so LAMB can operate out of the box with web services.

LAMB has been extended to address further research objectives. Firstly, it is integrated with JXTA using WS-Advertisments to provide the decentralised infrastructure, addressing the lack of decentralisation in competing frameworks. LAMB uses the decentralisation afforded by JXTA P2P to replicate itself across all hosts in a peer network. Secondly, LAMB includes a pluggable QoS selection scheme that uses simple QoS metrics on SOAP messages and WSDL descriptions to aid service matching. The priority scheme allows application and FT services to be differentiated and used at appropriate times.

To support our research objective of increasing the utility of FT protocols in SoC, we have developed the Sandbox introspective container. It makes FT protocols easier to deploy because they are simply represented as a Java class and a WSDL description. Furthermore, Sandbox provides a set of facilities that offer any FT protocol failure detection, synchrony, logging, authentication and deterministic view changes. Sandbox makes FT protocols pluggable to the extent that they can be deployed and removed at runtime.

This chapter described the platform that integrates LAMB, Sandbox and the underlying information model provided by JXTA. Whilst not addressing any research objectives specifically, the platform is essential because the JXTA framework is passive and does not offer the services needed for an autonomous SoA like LAMB. The platform also provides the HTML based user interface for controlling the framework.

The last part of this chapter was devoted to describing a set of FT services based on reference protocols from Chapter 2. This addresses the objective of increasing the utility of FT protocols, to the extent that we provide greater coverage of reference protocols than any other FT framework for SoC. These protocols are pluggable: one can directly replace another autonomously at runtime (so called runtime reconfiguration). It was not a trivial task to make reference protocols operate in a real world service environment. Our protocols support the group membership abstraction, which the reference protocols cannot. Additionally, we have adapted the reference protocols to include multi-threading and to use the facilities provided by Sandbox.

Chapter 6

Evaluation

This chapter provides a two part evaluation, first of the FT protocols and then of the framework itself. We start by introducing a case study based on the Trading Floor application. From this we describe the different configurations that are to be used. A testbed based on a cloud topology is presented. We describe how requests and controlled failures are injected into the testbed to drive scenarios. We then describe the metric API that is used to gather and then present the results of the scenarios.

The first part of the evaluation is a fulfilment of the research objective to evaluate and contrast the reliability and performance of different FT protocols. It is based on a range of failure models from crash to Byzantine. The second part of the evaluation is focused on the framework itself. It takes the form of a set of test cases that ensure the framework fulfils the research objectives presented in Chapter 1. Finally, we revisit the literature survey from Chapter 4, contrasting our framework with existing work to verify that it fulfils the broad research objectives.

6.1 Trading Floor Application

Our case study is a highly modified version of a system used by one of London's largest financial institutions. Trading Floor displays real-time stocks, bonds, commodities, derivatives and currency information to many traders simultaneously. This application, as it appears to the trader, is shown in Figure 6.1. The information for all the indicators is sourced from disparate systems distributed across a corporate network that is tunnelled over the Internet with 128 bit encryption. For FT purposes each indicator can be sourced from different providers.

Figure 6.1: The Trading Floor Screen.

The Trading Floor application consists of three core services, as shown in Figure 6.2. A coordinator initiates the cycle by sending a fetch indicator message to one or more source services. Once the source has fetched an indicator it sends a show indicator message to all available screen services. Each screen service shows the indicator value on the chart shown in Figure 6.1. The screen service keeps a record of all indicators so that a line chart can be drawn on the display. To complete the cycle a screen service sends a log indicator message back to the coordinator. The coordinator can therefore determine which screen services are currently operating and whether a source service has failed.

Figure 6.2: Trading Floor Services.

This simple service oriented application is appropriate for our case study because both source and screen services depend on plurality. Source services require that fetch messages, originating from the coordinator, are received in the correct order. However, a source service does not contain an internal state machine, so it will not crash, and thereby break the test conditions, if a message is received out of order. The application allows us to contrast different FT protocols, namely active and state-machine replication. These would not necessarily be used in the same application domain.

Early versions of this case study sourced indicator data from the Internet [Hall07]. Unfortunately this introduced real-world failures, making it a challenge to produce a comparative study, especially for intentional errors. In this case study we use a constrained environment. Source services produce the same pseudo random values based on a clock incremented by the coordinator service. Provided the clock and indicator name are not perturbed, the source services are guaranteed to generate the same "random" value for each indicator.

From the arrangement of the Trading Floor services we derived a configuration table for the evaluation; these configurations are shown in Table 6.1. The full set of configurations was not used for all test and evaluation cases because they were not needed to demonstrate each point. The No-FT configuration was derived from the original version of the Trading Floor application and hence has no FT services deployed during execution. In Table 6.1, n is the number of hosts that provide each group of services. More complex FT configurations have an FT service, for example Ionian, and an application service, such as TF Source, deployed together (or co-hosted).

Configuration  Coordinator Host(s)         Source Host(s)                              Screen Host(s)
               Services        Instances   Services               n         f          Services           Instances
No-FT          TF Coordinator  1           TF Source              1         0          TF Screen          1
Elegant        TF Coordinator  1           TF Source              3, 5, 7   2, 4, 6    TF Screen          1
Atakos         TF Coordinator  1           TF Source              3, 5, 7   1, 2, 3    TF Screen, Atakos  1
Patmos         TF Coordinator  1           TF Source, Patmos      3, 5, 7   2, 4, 6    TF Screen          1
Ionian         TF Coordinator  1           TF Source, Ionian      3, 5, 7   1, 2, 3    TF Screen, Atakos  1
IonianNB       TF Coordinator  1           TF Source, IonianNB    3, 5, 7   1, 2, 3    TF Screen, Atakos  1
Andros MAC     TF Coordinator  1           TF Source, Andros MAC  4, 7, 10  1, 2, 3    TF Screen, Atakos  1
Andros RSA     TF Coordinator  1           TF Source, Andros RSA  4, 7, 10  1, 2, 3    TF Screen, Atakos  1

Table 6.1: Configurations.

By varying the number of hosts (n) containing a service combination, for example 3, 5 and 7, we were able to evaluate a given protocol with different resilience (f). The Ionian, IonianNB and Andros configurations included an Atakos service on the screen host because they would typically be deployed together.
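The f values listed in Table 6.1 are consistent with three simple rules applied to n: f = n − 1, f = (n − 1)/2 and f = (n − 1)/3, the last being the usual Byzantine bound that matches the Andros rows. The sketch below reproduces the three value sets; the helper name resilience is hypothetical, and the rules are stated independently of any particular protocol.

```python
def resilience(n, divisor):
    """Number of faulty hosts tolerated when f = (n - 1) / divisor."""
    return (n - 1) // divisor


# Each rule is evaluated for the host counts used alongside it in Table 6.1.
rows = {
    "n - 1": [resilience(n, 1) for n in (3, 5, 7)],
    "(n - 1) / 2": [resilience(n, 2) for n in (3, 5, 7)],
    "(n - 1) / 3": [resilience(n, 3) for n in (4, 7, 10)],
}

for rule, values in rows.items():
    print(rule, values)
```

Choosing n = 4, 7, 10 for the Byzantine rule and n = 3, 5, 7 for the majority rule yields the same resilience levels, f = 1, 2, 3, which makes the configurations comparable.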

6.2 Testbed

Our testbed consisted of fifteen PC machines, half physical and half virtual. All the machines ran the latest Linux kernels. A feature of the testbed was its heterogeneity, with processors ranging from a Quad-Core Xeon with 8GB of memory to virtual machines with equivalent single core processors. The two most powerful machines hosted three and two virtual machines respectively using virtualisation software. Lastly, there were two powerful virtual machines hosted by a virtualisation suite; these were used for client injection only, as their power was matched by a lack of memory.

Figure 6.3: Testbed.

On each machine we started a single instance of the Java WSPBFT application using the build scripts. A name stored in the system's host file was used as the name of the JXTA peer. The underlying JXTA network was configured to have only one rendezvous peer to aid network speed and start-up times. Multicast was enabled in JXTA but this did not extend to the virtual machines.

6.2.1 Cloud

Manually copying, installing, starting and stopping instances on fifteen machines is a laborious task. Instead, all instances were run from one NFS file share that is visible from all machines. Using a cloud script and two folders on the share called available and enabled, we were able to start and stop an arbitrary number of nodes simultaneously. When the cloud script starts on a host it inserts a file with the same name as the host in the available folder. If that file is subsequently copied to the enabled folder then a WSPBFT instance will start on that host. If the file is removed from enabled the WSPBFT instance will stop. Finally, if a file is deleted from available the corresponding cloud script will stop. We extended this principle by placing start network and stop network icons on our desktops. An entire network restart can take place in 40-50 seconds.
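The folder-driven control loop described above can be sketched as a single decision function. This is an assumed reconstruction in Python, not the actual cloud script: the function name next_action and the returned action labels are hypothetical, and a temporary directory stands in for the NFS share.

```python
from pathlib import Path
import tempfile


def next_action(host, available_dir, enabled_dir, instance_running):
    """Decide what the cloud script on `host` should do next."""
    if not (Path(available_dir) / host).exists():
        return "exit-script"      # marker deleted from the available folder
    enabled = (Path(enabled_dir) / host).exists()
    if enabled and not instance_running:
        return "start-instance"   # marker copied into the enabled folder
    if instance_running and not enabled:
        return "stop-instance"    # marker removed from the enabled folder
    return "idle"


with tempfile.TemporaryDirectory() as share:
    available = Path(share) / "available"
    enabled = Path(share) / "enabled"
    available.mkdir()
    enabled.mkdir()
    (available / "host1").touch()  # the script announces itself
    before = next_action("host1", available, enabled, instance_running=False)
    (enabled / "host1").touch()    # an operator enables the host
    after = next_action("host1", available, enabled, instance_running=False)
```

Because every host polls the same share, copying or deleting a batch of marker files is enough to start or stop many nodes at once.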

6.2.2 Doping

Our doping mechanism was used primarily to inject controlled failures within the framework. A dope is a Java object that is mixed with other dopes in an XML-based scenario script. A dope may be configured to perform some action once, repeatedly or every time a SOAP message is received. Every dope has a lifecycle as shown in Figure 6.4.

Figure 6.4: Lifecycle of a Dope.

When a scenario script is started, all the dopes within it are synchronised so that all subsequent events occur chronologically relative to this baseline. The XML configuration template for a dope is shown in Figure 6.5. It includes lifecycle attributes for the send, activate and retire times; these are relative to each other and expressed in seconds. All dopes must specify a set of destinations upon which they will be activated.

::FetchIndicator patstat1 patstat0

Figure 6.5: Configuration of a Dope.

Dopes are grouped into passive, active and cyclic. Passive dopes only operate in relation to an invoke action event. This event happens when a given SOAP message is received by a node. A SOAP message is passed to all currently live passive dopes for inspection before being routed to its destination service. The Message Perturb dope, as shown in Figure 6.5, is a passive dope; it is able to alter a message before it reaches its destination.

Passive dopes include guard clauses that tailor the actions they perform on messages. For example, a dope may be stochastic, only activating with a given probability; this may be fixed or change over time. The example in Figure 6.5 shows a dope that activates with a zero probability, which then increases in increments of five every 10 seconds until the dope retires. Guard clauses may also match only certain message signatures, therefore targeting only a subset of messages.

Active dopes perform a single action once activated. A classic example is an active crash dope; this sends an exit code to the Java Virtual Machine of a WSPBFT instance, instantly crashing the peer. Lastly, the cyclic dope executes periodically once activated. The client dope is an example. It is used as the primary mechanism for request injection. Dopes are also able to perform administrative tasks such as deploying services and starting peers during runtime.
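A stochastic guard clause of the kind described above might look like the following sketch. It is written in Python for brevity, whereas dopes are Java objects; the function names are hypothetical, and reading the increment of five as percentage points is an assumption.

```python
import random


def activation_probability(elapsed_s, start=0.0, step=5.0, interval_s=10.0):
    """Activation probability in percent after `elapsed_s` seconds."""
    increments = int(elapsed_s // interval_s)
    return min(100.0, start + step * increments)  # never exceeds certainty


def guard_fires(elapsed_s, rng):
    """Stochastic guard: True when the dope should act on this message."""
    return activation_probability(elapsed_s) > rng.random() * 100.0


rng = random.Random(2010)  # seeded so that a scenario run is repeatable
schedule = [activation_probability(t) for t in (0, 9, 10, 25, 60)]
```

Passing a seeded generator keeps the injected failures repeatable between runs, which matters when configurations are being compared.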

Figure 6.6: Doping Scenarios through the Web Interface.

Figure 6.6 shows how doping scenarios are started from the web interface. In addition to the test cases and request injection, doping scenarios are used to configure services. Service configurations can be confirmed by inspecting the LAMB registry.

6.2.3 Request Injection

A client dope uses send methods on the Trading Floor coordinator service to perform each request. Parameters can be passed from the dope to the coordinator service; for example, we can define which indicators will be sourced or how often the pseudo random generator clock is updated. We have two forms of injection. Soak injection creates client requests at a fixed rate for a given time. Load injection increases the rate of requests over time. Two different coordinator services can be used to inject loads higher than five requests per second.
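The difference between the two injection forms can be expressed as two rate schedules. The sketch below is illustrative only: the function names and the default start rate, step and interval are assumptions, and the naming follows Section 6.4.1, where the soak case uses a fixed rate and the load case an increasing one.

```python
def soak_rate(elapsed_s, rate=1.0):
    """Soak injection: the same request rate for the whole run."""
    return rate


def load_rate(elapsed_s, start=1.0, step=1.0, interval_s=60.0):
    """Load injection: the rate rises by `step` after every interval."""
    return start + step * int(elapsed_s // interval_s)


# Requests per second requested from the client dope at sample times.
times = (0, 59, 60, 185)
soak = [soak_rate(t) for t in times]
load = [load_rate(t) for t in times]
```

A soak run therefore exposes steady-state latency and loss, while a load run reveals the point at which throughput stops following the input rate.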

6.2.4 Metrics

The metrics interface provides the data upon which we assess the correctness and performance of different configurations. Figure 6.7 shows how the metric interface works by intercepting messages that match a message URI and service URI combination. When a match is found a snapshot is taken and passed to a monitoring peer. Snapshots are transported through a separate HTTP-based communication subsystem for efficiency. A snapshot consists of the message and service URIs and, most importantly, the correlation identifier of the message as created by the coordinator service. To avoid the effects of clock drift between nodes, the monitor timestamps a message when it is received. Snapshots are stored in a data structure that orders them chronologically.

We take snapshots at two points, the start and end point, that represent where messages enter and leave the application provided by the SoA. These two related snapshots share the same correlation identifier. We group them to form correlated pairs. These too are stored in chronological order. A metric sampler is run periodically on the set of snapshots and correlated pairs to calculate the metrics. By counting the number of snapshots with timestamps in the current sample period we can calculate the in count. Counting the correlations gives the out count. Using the sample period we calculate the input and output rates and loss. Latency is the difference between the timestamps for the two snapshots in a correlated pair. The correlation identifiers have a clock part that allows the assertion of the order in which messages should be timestamped by the monitor. If identifiers are received out of order then a reorder event is registered. During a run the sampler keeps running totals to allow the calculation of cumulative variants of the metrics.

Figure 6.7: Anatomy of Metric Gathering.

Metric gathering is controlled using the interface shown in Figure 6.8. It enables the definition of start and end points for monitoring and the sample period for the metrics. Once running it shows the results in a table. Values shown in brackets are the cumulative totals. A save button enables the table on screen to be saved in a comma separated format (CSV) log. Metrics taken by this system are:

• Runtime. The amount of time passed since the metric gathering was started.

• Messages In. Snapshot of start point messages in the current period.

• Messages Out. Snapshot of end point messages that are correlated with start point messages in the current period. This number can be greater than messages in because the corresponding start point messages may have been counted in an earlier sample period. The cumulative messages out can never exceed cumulative messages in.

• In Throughput. The number of in messages divided by the sample time to give the rate of incoming messages.

• Out Throughput. The number of out messages divided by the sample time to give the rate of outgoing messages.

• Latency. The difference between the timestamps of the start and end point snapshots within a correlated pair. An average is taken across all correlations.

• Loss %. The difference between the cumulative in and out totals divided by the in total. The loss is cumulative to prevent negative totals on the charts produced. Loss charts are therefore logarithmic in nature.

• Reorders. Using the clock component of the snapshot correlation identifier we identify when end point messages are delivered out of sequence. This metric cannot be used as a definitive measure of FIFO or total-order as the metric subsystem itself can introduce instances of reordering due to network non-determinism.

Figure 6.8: Metrics Interface.
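The per-period calculations listed above can be sketched as one sampler function. This is a simplified Python model, not the framework's metric API: the name sample_metrics, the dictionary-based inputs and the millisecond timestamps are assumptions, and latency is averaged over the pairs completed in the current period.

```python
def sample_metrics(starts, ends, period_start_ms, period_ms):
    """Compute metrics for one sample period.

    starts and ends map a correlation identifier to the monitor timestamp
    (in milliseconds) of its start point and end point snapshot.
    """
    period_end_ms = period_start_ms + period_ms
    seconds = period_ms / 1000.0

    def in_period(t):
        return period_end_ms > t >= period_start_ms

    in_count = sum(1 for t in starts.values() if in_period(t))
    # A correlated pair exists once both snapshots share an identifier.
    pairs = [(starts[c], end) for c, end in ends.items() if c in starts]
    finished = [(s, e) for s, e in pairs if in_period(e)]
    out_count = len(finished)
    latency = sum(e - s for s, e in finished) / out_count if out_count else None
    # Loss uses cumulative totals so that it can never become negative.
    total_in = sum(1 for t in starts.values() if period_end_ms > t)
    total_out = sum(1 for _, e in pairs if period_end_ms > e)
    loss = 100.0 * (total_in - total_out) / total_in if total_in else 0.0
    return {
        "in": in_count,
        "out": out_count,
        "in_throughput": in_count / seconds,
        "out_throughput": out_count / seconds,
        "latency_ms": latency,
        "loss_pct": loss,
    }


starts = {"c1": 200, "c2": 600, "c3": 1100}
ends = {"c1": 500, "c2": 1300}
first_period = sample_metrics(starts, ends, 0, 1000)
second_period = sample_metrics(starts, ends, 1000, 1000)
```

In the second period the message c2 is counted as out although it entered in the first period, which is why a per-period out count can exceed the in count while the cumulative totals cannot.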

Figure 6.9: Graphing the Metrics.

The chart interface is used to visualise the metrics gathered. It uses the saved logs rather than presenting live information. This enables the combination of several overlapping datasets in one chart (otherwise there would be over five hundred charts to display in this chapter). Using a client side database provided by Google Gears we combine the data on the field headings [Google09]. Figure 6.9 shows the chart interface, consisting of a dataset selector, chart parameters (mapping the x, y and series to fields in the dataset) and finally the chart itself. This chart interface provides all the charts seen throughout this chapter and in the appendices.

6.3 Evaluation

The purpose of the evaluation was two-fold. The first part of the evaluation was concerned with fulfilling the research objective of evaluating different FT protocols for reliability and performance. Here the scenarios were derived from the failure models described in Section 2.1 and from the Trading Floor application described in Section 6.1. One of the scenarios was normal case operation so that the relative performances of the FT configurations could be assessed.

The second part of the evaluation was a set of cases to test whether our framework had acceptably met the research objectives defined in Chapter 1. These cases were derived directly from the objectives themselves. To support our assertions we performed some scenarios on configurations from the Trading Floor application. These tests, in common with the first part of the evaluation, were driven using the doping mechanism for controlled injection of requests and failures, with metrics gathered through the special API and then visualised. These specific scenarios are shown in Appendix C. In places we have used the results from Section 6.4 as supporting evidence that our framework has achieved the test case. Finally, in the cases where we could not make assertions experimentally we made specific observations to support our case; these may refer the reader to the source code of the framework.

6.4 FT Protocol Evaluation

This section is a fulfilment of the research objective of evaluating different FT protocols for reliability and performance. The results we have produced from this protocol evaluation were also used for demonstrating the efficacy of our approach in the later test cases (shown in Section 6.5). This evaluation consisted of several scenarios, each run with multiple configurations of the Trading Floor application. Some scenarios were left out of this section and included in Appendix C (C.1 and C.2) because these formed pass/fail test cases rather than comparable results for discussion.

Throughout this section we have presented results captured directly from our metric gathering subsystem and displayed them using the chart interface. Each scenario is presented as follows: a brief description of the scenario; expectations derived from knowledge of the relevant reference protocols, in addition to knowledge of the framework's operation; the results in chart format; and an ongoing discussion of the results (relating to items of interest in the charts).

6.4.1 Normal Operation

This scenario demonstrated the normal-case (failure free) operation of each configuration. It evaluated the general performance and scalability. Firstly, the soak case injected requests at a fixed rate for the duration of the scenario. By inputting different values for n (for example we tested Andros with n = {4, 7, 10}) we observed the properties of n-scalability directly. Lastly, the load case injected an increasing rate of requests to identify the maximum throughput and load-scalability of each configuration. This case was applied to all configurations and n variants.

We expected that all the configurations under test would complete the scenario with a latency related to the complexity of the protocol under test (No-FT the lowest and Andros the highest). Given the underlying cached P2P service discovery dissemination, we expected all the configurations to n-scale logarithmically (O(log n)) but load-scale linearly (O(n)).

Table 6.2 provides performance and scalability metrics for all the FT protocols and a baseline (No-FT) for the framework. There was a near 100% completion of the scenario, given a 1.5% margin of error, except for IonianNB (n = 7) and Andros (n = {7, 10}), which halted at a maximum input rate of 3 requests per second. All the configurations n-scaled with O(log n), with the exception of Ionian, which scaled with O(n) at an input of 1 request per second and did not scale at 3 requests per second. We were unable to assert the n-scalability of IonianNB and Andros at the higher input rate because they failed. However, we have shown that the framework n-scaled logarithmically. Figure 6.10 shows a contrast of latencies at the lower input rate. As expected, there was a general trend of increasing latency throughout the configurations.

                  Inject 1 Request/s                  Inject 3 Requests/s
Name        n     Loss %  Latency (ms)  MD Msg/s      Loss %  Latency (ms)  MD Msg/s
No-FT       1     0       177           2.2           0       177           3.2
Elegant     3     0       220           3.5           0.25    150           11.5
Elegant     5     0       270           3.65          0       260           12.75
Elegant     7     0       340           4.25          0.25    310           14.1
Atakos      3     0       625           3.6           0.25    322           13.6
Atakos      5     0       375           3.1           0.5     380           14.5
Atakos      7     0       405           3.25          0.25    430           15.0
Patmos      3     0.5     375           2.8           0.3     350           4.3
Patmos      5     0.5     302           3.3           0       390           6.2
Patmos      7     0       375           4.15          0.31    432           7.5
Ionian      3     0       500           3.5           0       700           9.0
Ionian      5     0       600           5.9           0       750           15.0
Ionian      7     0       1250          8.8           0       15000         27.0
IonianNB    3     0       625           3.0           0       500           5.0
IonianNB    5     0       760           8.0           0       620           12.5
IonianNB    7     0       740           12.0          9.0     4300          37.0
Andros MAC  4     0       1130          5.0           0       1200          5.0
Andros MAC  7     1.2     1170          15.0          85.0    28000         35.0
Andros MAC  10    0       1220          53.0          98.0    20000         10.0
Andros RSA  4     0.8     775           5.0           0.0     1000          6.0
Andros RSA  7     0.8     1050          22.5          75.0    17000         34.0
Andros RSA  10    1.0     1030          22.0          90.0    30000         19

Table 6.2: Normal Operation with Soak Results.

Figure 6.10: Contrasting Latencies at Injection of 1 request/s.

The overall load results are shown in Figure 6.11. All the protocols demonstrated a maximum throughput (4-5 requests per second for Atakos, Elegant and Patmos, 2 requests per second for Ionian/IonianNB and 1 request per second for Andros). Ionian outperformed IonianNB and Andros by continuing throughput even at much higher input rates; this was a further demonstration of its robustness. All the configurations load-scaled with O(n) until their maximum throughput was reached. The No-FT configuration demonstrated that the framework load-scaled linearly.

Figure 6.11: Configurations (n = 7) under Increasing Load.

A comparison of the load scalability of the Andros variants is shown in Figure 6.12. The overall load-scalability results for Andros were generally disappointing when compared to other configurations. When we ran Andros in the smaller configurations (n = 4) it scaled better. This demonstrated the load that Andros placed on the less powerful nodes, which were required for the n = 10 configurations. A comparison of the Andros RSA and MAC variants showed little difference between the two. Conventional wisdom on this matter suggested that MACs would be an order of magnitude faster than RSA digital signatures [Castro02, Reiter95]. In later scenarios we omitted the Andros MAC configuration.

Figure 6.12: Comparing Andros Configurations.

6.4.2

Fail-Silent

The goal of this scenario was to demonstrate that all configurations could tolerate f but not f + 1 fail-silent (undetected) crashes. The scenario emulated the case where services crashed but their host nodes did not, so that crashes went undetected: the fail-silent distributed system model. We created the scenario by starting n FT service instances for each configuration and then, after a time, omitting all messages to f and then f + 1 nodes (supporting the Trading Floor source service) using an omitting dope. We ran all configurations as n = 7 in conjunction with soak injection. We expected that all the configurations would tolerate f but not f + 1 fail-silent crashes.

Figure 6.13 shows the results for f concurrent fail-silent crashes. After the crash event Elegant, Atakos and Patmos demonstrated a good recovery, shown by the loss dropping to near zero. Ionian outperformed our expectations. After one large spike in loss and latency, Ionian reverted to normal operation with a near zero message loss. IonianNB and Andros did not meet our expectations. IonianNB nearly failed, indicated by the loss line tending towards 80, but recovered slightly after 175 seconds with a slight dip to 70. Andros initially seemed to recover but then started to spike in latency and loss towards the end of the scenario. As a result we investigated these configurations with console trace. The problem lay in the heterogeneity of the host nodes. With f nodes already failed, the remaining n − f nodes all needed to reach consensus. Weaker virtual machine nodes showed a tendency to slow under load and in some cases to fail-stop, causing the later spikes with the Andros configuration. Ionian avoided overloading these weaker nodes by regulating proposal rates.

Figure 6.13: Fail-Silent (f) Results.

Figure 6.14 shows the results for f + 1 concurrent fail-silent crashes. As expected all the configurations failed totally, indicated by a logarithmic tendency to a loss percentage of over 80%. Patmos provided the most interesting behaviour by recovering after 60 seconds to a near 0% loss. This was traced to a bug in the Patmos implementation that allowed a potential leader instance to collect multiple promises from another instance. This allowed an election without a majority and gave Patmos a resilience of f = n − 1 rather than the quoted f = (n − 1)/2. The bug allowed the potential of two instances being elected leader of the protocol. However, this fault in Patmos was not observed in any other scenario so the algorithm was left unchanged.

Figure 6.14: Fail-Silent (f + 1) Results.
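The Patmos fault described above amounts to counting promises rather than the distinct instances that sent them. As an illustration only, the following sketch shows an election guard that counts each promising instance once. The function name `has_majority` and the instance identifiers are assumptions made for this example and are not taken from the Patmos implementation.

```python
def has_majority(promise_senders, n):
    """Return True if promises were collected from a strict majority of n instances.

    `promise_senders` holds one sender identifier per promise received.
    Repeated promises from the same sender are counted once (assumed fix).
    """
    distinct_senders = set(promise_senders)
    return len(distinct_senders) > n // 2


# With n = 7 a candidate needs promises from at least 4 distinct instances.
repeated = ["s1", "s1", "s1", "s1"]   # four promises, but only one sender
distinct = ["s1", "s2", "s3", "s4"]   # four promises from four senders
print(has_majority(repeated, 7), has_majority(distinct, 7))
```

Counting the raw number of promises would accept the first case, which is the behaviour that allowed an election without a majority.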

6.4.3 Omission

The goal of this scenario was to evaluate how configurations dealt with arbitrary message omissions. To enact omission we inserted failures stochastically into any message incoming to nodes supporting the Trading Floor source service. Systematic omission would have been equivalent to the fail-silent scenario. The likelihood of an omission was increased over time by periodically increasing the probability parameter of the omit dope. This scenario was executed against all configurations, except No-FT, where n = 7 using soak injection. Note that at this point we started using the message throughput as a clearer metric for reasoning about non-crash failures. We had no set expectations from this scenario because of a lack of experience with stochastic failure injection.

Figure 6.15: Omission Results.

Figure 6.15 shows how the output of configurations dropped whilst the probability of omissions increased. With the active replication protocols this was a slow process (Elegant never failed totally). The SMR protocols succumbed at a very low probability of omission. Andros performed badly by failing totally at an omission probability of 15%. Ionian and IonianNB failed at approximately 20% probability. Atakos, Patmos, Ionian, IonianNB and Andros all gave a substantial recovery in message throughput at the end of the test, coinciding with the end of the scenario and the return to a 0% omission probability.
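The stochastic omission mechanism can be sketched as follows. This is a minimal model under stated assumptions, not the omit dope itself: the class `OmitDope`, its methods and the helper `run` are hypothetical names, and the model only captures raw message delivery at one node, not the reaction of any FT protocol.

```python
import random


class OmitDope:
    """Hypothetical stand-in for the omit dope: drops messages with a set probability."""

    def __init__(self, probability=0.0, seed=None):
        self.probability = probability
        self._rng = random.Random(seed)  # seeded for repeatable runs

    def increase(self, step):
        """Raise the omission probability, capped at 1.0."""
        self.probability = min(1.0, self.probability + step)

    def deliver(self, message):
        """Return the message, or None if it is omitted."""
        if self._rng.random() < self.probability:
            return None
        return message


def run(dope, messages_per_period, periods, step):
    """Return the fraction of messages delivered in each period."""
    throughput = []
    for _ in range(periods):
        delivered = sum(
            dope.deliver(i) is not None for i in range(messages_per_period)
        )
        throughput.append(delivered / messages_per_period)
        dope.increase(step)  # periodic increase of the probability parameter
    return throughput


print(run(OmitDope(seed=1), messages_per_period=100, periods=3, step=0.5))
```

Resetting `probability` to zero at the end of a run corresponds to the recovery period at the end of the scenario.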



6.4.4 Timing

The goal of this scenario was to evaluate how well the configurations tolerated systematic and stochastic timing failures. In a boundary case we tested the configurations against f then f + 1 permanent timing failures with fixed delays. This delay was large enough to ensure that reconfigurations occurred in configurations using the Clockwork synchronisation facility. To investigate the effects on message ordering caused by timing failures we added an alternating f + 1 case. The alternation case used a fixed delay that was in itself not long enough to cause reconfiguration by Clockwork. Lastly, we performed an arbitrary case where failures were introduced to all n nodes but only stochastically with an increasing probability. This last case also introduced an arbitrary delay in the range {0s...20s} emulating real-world delays. We applied the four timing cases to all configurations (with n = 7 where applicable) in conjunction with soak request injection. Failures were only applied to nodes hosting the Trading Floor source service.

We had the following expectations based on our knowledge of FT protocols. All the configurations would survive the f fixed delay case but they would fail in the f + 1 case. We expected that the alternating failure case would cause application message reordering in passive and active replication techniques but that the total order of SMR would prevent this. Again we had no insight into how the configurations would behave in the stochastic case.

Figure 6.16 shows all the FT configurations generally tolerating f fixed delay failures. Both Patmos and Andros showed big troughs in throughput which were attributed to the presence of reconfigurations; however, they both eventually recovered to provide a constant throughput.

In Figure 6.17 we have chosen the latency to best demonstrate the behaviour of the protocols during f + 1 fixed delays. The baseline, passive and active replication configurations demonstrated very unstable operation (large peaks) during this case.
Ionian and IonianNB appear to maintain a low latency but this was because the latency was actually zero. Later they recovered marginally but with huge latencies. Surprisingly, Andros recovered after f + 1 fixed delay failures were introduced, demonstrated by a consistently low (not zero like Ionian) latency. Anecdotal observations showed that two instances of the Andros service had fail-stopped (i.e. their hosts were detected as failed by Eternity), allowing a group membership reconfiguration and lesser values for f and n, which in turn allowed Andros to continue operation on a lesser consensus.

Figure 6.16: Timing (f) Results.

Figure 6.17: Timing (f + 1) Results.

Figure 6.18: Timing Alternate (f + 1) Results.

Figure 6.18 shows the alternated f + 1 case which, as expected, shows a disparity between the passive and active replication techniques and the order-imposing SMR techniques. Ionian(NB) and Andros had zero reorders. The passive and active replication techniques proved unstable in this case with varying reorders over time.

Figure 6.19: Timing Stochastic Results.

The message throughput and reordering for the stochastic timing case are shown in Figure 6.19. All the configurations survived the stochastic case. The throughput for all configurations was variable with Patmos, Ionian, IonianNB and Andros performing well.
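The reordering results above depend on how a reorder is counted. One plausible definition, given here purely as an assumption since the metric is not defined in this section, is to count every message that arrives after a message with a higher sequence number. The function name `count_reorders` is hypothetical.

```python
def count_reorders(sequence_numbers):
    """Count messages that arrive after a message with a higher sequence number."""
    reorders = 0
    highest_seen = None
    for seq in sequence_numbers:
        if highest_seen is not None and seq < highest_seen:
            reorders += 1          # arrived late relative to a newer message
        else:
            highest_seen = seq     # new high-water mark
    return reorders


in_order = [1, 2, 3, 4]        # what a total order would deliver
alternating = [2, 1, 4, 3]     # every second message delayed
print(count_reorders(in_order), count_reorders(alternating))
```

Under this definition a configuration that imposes a total order reports zero reorders, whereas an alternating delay applied without ordering produces one reorder per delayed message.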



6.4.5 Commission

This scenario was intended to show that the Atakos configuration could tolerate f but not f + 1 concurrent commission failures (incorrect responses). To emulate commission failures we perturbed the messages with implausible values such that the Trading Floor screen service could internally detect the error and reject the message. This transformed the commission failure into an omission that would be detected by the message loss metric. The scenario was executed with the Atakos configuration (where n = 7) in conjunction with the soak injection of requests. We expected no message loss from the f case and total message loss from the f + 1 case.

Figure 6.20: Commission Results for f and f + 1 Respectively.

Figure 6.20 shows that Atakos can tolerate f commission failures (the spikes in the chart only equate to a cumulative loss of less than 2 that is later recovered). The chart also shows that the failure for f + 1 commission failures is total (the cumulative effect causes the graph to be logarithmic).
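The conversion of a commission failure into a measurable omission can be sketched as below. The message fields (`id`, `price`), the plausibility bounds and all function names are assumptions chosen for illustration; the sketch models only the rejection at a receiving service and not the voting performed by Atakos.

```python
def perturb(message, implausible_price=-1.0):
    """Emulate a commission failure by overwriting the price with an implausible value."""
    corrupted = dict(message)
    corrupted["price"] = implausible_price
    return corrupted


def accept(message, low=0.0, high=1_000_000.0):
    """Plausibility check: return the message if its price is in range, else None."""
    if low < message["price"] <= high:
        return message
    return None


def loss_percentage(sent, received):
    """Message loss as a percentage of messages sent."""
    return 100.0 * (sent - received) / sent


messages = [{"id": i, "price": 100.0 + i} for i in range(10)]
# Corrupt the first three messages and leave the rest untouched.
stream = [perturb(m) if m["id"] < 3 else m for m in messages]
received = sum(accept(m) is not None for m in stream)
print(loss_percentage(len(stream), received))
```

Because the rejected messages never reach the output, the incorrect values show up in the same message loss metric used for the omission scenarios.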

6.4.6 Denial Of Service Attack

The goal of this scenario was to demonstrate that the configurations could tolerate f but not f + 1 nodes being under a Denial of Service (DoS) attack. A DoS is a malicious attack instigated by a third party to prevent universal access to a service. It works by saturating the service with requests and manifests itself in the timing, omission or fail-silent failure models. We simulated a DoS attack with a third party node that sent arbitrary SOAP messages to nodes within the configuration at an ever-increasing rate. Whilst developing the doping mechanism to perform this task it was observed that the testbed was unable to inject messages at a high enough rate to cause the documented effects of DoS. To compensate we made every node in receipt of a DoS message perform an arbitrary set of calculations called a micropayment. The micropayment was calibrated to cause a processor load of approximately 100% when DoS messages were received at a rate of 50 per second. We applied the DoS attack as f and f + 1 cases to all configurations (except No-FT in the f case since f = 0) in conjunction with soak injection of requests. In both cases we increased the DoS rate from 0 to 50 messages per second. We expected that all the protocols would tolerate the f case even at the highest DoS rate (50 messages/s). We did not know how the configurations would behave in the f + 1 case.

Figure 6.21: DoS Attack (f) Results.

Figure 6.21 shows the f case of the DoS attack using the output throughput metric.


Unexpectedly, all the configurations were strongly affected by the DoS attack even at the low DoS rates. All the configurations showed a drop-off in throughput towards a DoS rate of 20 messages/s. The most surprising result was the drop in the throughput of Elegant, which recovered over time. Investigation revealed that the only node not under attack also happened to be the JXTA gateway peer for a set of six JXTA peers hosted behind a firewall. Any JXTA pipe abstraction bound by these peers would require communications through the gateway. When some of these peers fail-stopped because of load it reduced the request throughput on the gateway. The result was that the output of the Elegant configuration rose sharply towards the end of the scenario (even above the input rate, indicating that requests were being queued by the gateway). This was the only time we reached the performance limitations of the underlying JXTA network. IonianNB provided the best throughput. It is suspected that this was because the primary instance of the IonianNB protocol happened to be the only one not under attack.

Figure 6.22: DoS Attack (f + 1) Results.
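The micropayment calibration described earlier in this section implies a rough per-message cost. The sketch below works through that arithmetic under two assumptions that are not stated in the text: that processor load grows linearly with the DoS rate and that the work is not spread over several processors. The function names are illustrative.

```python
CALIBRATION_RATE = 50.0  # DoS messages/s that give approximately 100% load (from the text)


def micropayment_cost_seconds(calibration_rate=CALIBRATION_RATE):
    """Processor time consumed by one micropayment under the linear assumption."""
    return 1.0 / calibration_rate


def estimated_load_percent(dos_rate, calibration_rate=CALIBRATION_RATE):
    """Estimated processor load for a given DoS rate, capped at 100%."""
    return min(100.0, 100.0 * dos_rate / calibration_rate)


print(micropayment_cost_seconds())     # roughly 0.02 s of processor time per message
print(estimated_load_percent(20.0))    # load estimate at 20 messages/s
```

On this simple model a DoS rate of 20 messages/s corresponds to an estimated load of 40% from micropayments alone, before any protocol or application work is added.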


Figure 6.22 shows the results for all configurations in the f + 1 case. The results of the No-FT configuration supported the evidence of the JXTA gateway being overloaded with pipe requests. With just one node being attacked the pipe gateway traffic was much reduced. This was a much greater factor than the JXTA gateway suffering a DoS attack itself. We concluded that above a DoS rate of 5 messages per second the load placed on the JXTA network affected the results, giving inconsistent and unexpected metrics. Elegant performed better in the f + 1 case, as did Patmos and Ionian. For IonianNB and Andros, f + 1 DoS attacks caused them to fail almost totally above a DoS rate of 10 messages per second.

6.4.7 Divisive Attack

The goal of this scenario was to demonstrate that the consensus-based configurations were able to tolerate f but not f + 1 commission type attacks targeted at control headers rather than application data. We simulated this attack by perturbing the sequence designators used to enforce atomic broadcast. We replaced the sequence number with a randomly generated large number designed to corrupt the flow of sequence. Incoming messages to the nodes housing Trading Floor source services were targeted. This scenario was executed against the Ionian, IonianNB and Andros configurations (where n = 7) in conjunction with soak injection of requests. We expected that all three protocols would tolerate the f but not the f + 1 case. We expected Andros to provide the greatest tolerance in the f case because perturbed headers would fail authentication.

Figure 6.23 uses the message loss to indicate how the protocols reacted to the f case. Ionian and in general Andros tolerated f divisive attacks. IonianNB failed totally after the boundary event; the reason for this was unexplained. Andros surprisingly demonstrated a couple of spikes indicating the presence of unexpected reconfigurations. We later confirmed by observation that this was caused by the overloading and subsequent fail-stop of hosts containing Andros service instances.

Figure 6.23: Divisive Attack (f) Results.

The message loss of Ionian, IonianNB and Andros for the f + 1 case is shown in Figure 6.24. These results clearly showed that all three protocols had failed completely.

Figure 6.24: Divisive Attack (f + 1) Results.
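The expectation that perturbed headers would fail authentication can be illustrated with a small example. The sketch uses an HMAC from the Python standard library as a stand-in; the Andros configuration evaluated here used RSA signatures, and the header encoding, key and function names below are all assumptions made for this example.

```python
import hashlib
import hmac
import secrets


def sign_header(key, sequence_number):
    """Return a MAC over the sequence number (hypothetical header format)."""
    payload = str(sequence_number).encode("ascii")
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify_header(key, sequence_number, tag):
    """Recompute the MAC and compare it in constant time."""
    return hmac.compare_digest(sign_header(key, sequence_number), tag)


def perturb_sequence(sequence_number):
    """Emulate the divisive dope: substitute a large random sequence number.

    An offset is added so that the result always differs from the original.
    """
    return sequence_number + 10**9 + secrets.randbelow(10**9)


key = secrets.token_bytes(32)
tag = sign_header(key, 42)
print(verify_header(key, 42, tag))                     # untouched header
print(verify_header(key, perturb_sequence(42), tag))   # perturbed header
```

A receiver that checks the tag before acting on the sequence number can discard the perturbed message, whereas a protocol without header authentication has to cope with the corrupted sequence itself.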

6.4.8 Byzantine

The goal of our last scenario was to evaluate the ability of configurations to tolerate failures in the Byzantine model. It consisted of f, f + 1 and stochastic cases. The Byzantine failure model was achieved by combining omission, timing, commission, divisive and DoS attacks. Even when f < 5 the different failure types were alternated so that the configurations would experience the full range of failure types. For the stochastic case we introduced an increasing probability of failure (starting at 0). At the end of the stochastic case we reverted the probability to zero for two minutes to observe if the configurations would recover; at the same time we ceased request injection. All three cases were evaluated against all configurations (except No-FT for the f case) where n = 7 with a soak injection of requests. We expected that only the Andros configuration would tolerate f Byzantine failures but that no configuration would tolerate the f + 1 case. A lack of experience with regards to the stochastic case meant that we had no expectations. At the end of the stochastic case we expected that the Patmos and the SMR protocols would recover the message loss back to zero.
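One way to alternate five failure types over fewer than five faulty nodes is a simple rotation. The text does not specify how the alternation was scheduled, so the sketch below is only a possible scheme; the function name `failure_schedule` and the node identifiers are hypothetical.

```python
from itertools import cycle

FAILURE_TYPES = ("omission", "timing", "commission", "divisive", "dos")


def failure_schedule(faulty_nodes, rounds):
    """Assign a failure type to each faulty node in each round, rotating through all types."""
    types = cycle(FAILURE_TYPES)
    schedule = []
    for _ in range(rounds):
        # The rotation carries over between rounds, so a node sees a different type each time.
        schedule.append({node: next(types) for node in faulty_nodes})
    return schedule


for round_assignments in failure_schedule(["n1", "n2", "n3"], rounds=5):
    print(round_assignments)
```

With fewer than five faulty nodes and five rounds, every faulty node in this rotation experiences each of the five failure types once.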

Figure 6.25: Byzantine (f) Results.

Figure 6.25 shows the message loss and latency with the f scenario for all the FT configurations. Elegant and Atakos demonstrated a slow leak of messages, though the latency was relatively constant. Patmos, Ionian and IonianNB ended with high cumulative message loss whilst having large spikes (caused by reconfigurations) which were later followed by corresponding spikes in the latency. Andros suffered from three reconfiguration events but overall message loss was down to 7% when the scenario ended. The last spike in latency (caused by reconfiguration) for the Andros configuration coincided with the fail-stop of two nodes, further demonstrating the load placed on a host by Andros.

Figure 6.26: Byzantine (f + 1) Results.

Figure 6.26 shows the results for all configurations in the f + 1 case. Against expectations, No-FT, Elegant, Patmos and Atakos did not fail totally in the f + 1 case; instead they tended towards a 25-40% message loss. Ionian, IonianNB and Andros failed more decisively by tending towards an 80% message loss. We concluded that, as expected, no configuration was able to tolerate f + 1 concurrent Byzantine failures.

Lastly, Figure 6.27 shows the results for the stochastic case with the subsequent recovery period. As expected No-FT, Elegant and Atakos did not fail totally in the stochastic case. They demonstrated an upward trend of message loss to approximately 20%. Elegant only reached a message loss of 7% when the probability of failure was 100%. In the recovery period the loss fell back slightly because of the cumulative effect of the metric. Patmos performed better than expected. Up to a failure probability of 40% the message loss rose steeply to over 60%. Afterwards there were a couple of periods of high throughput that returned the overall loss to approximately 20%. In the periods of throughput the latency was high for Patmos. The results for the consensus based FT configurations were more definitive. The message losses grew steadily to 55%, 60% and 75% for Ionian, IonianNB and Andros respectively. Ionian demonstrated the best recovery, getting back 40% of the messages lost after the failure probability was returned to zero. IonianNB recovered the cumulative message loss to 23%. Unfortunately, Andros was only able to recover approximately 5% of the messages lost. It was concluded that consensus based protocols provide less tolerance than passive or active replication techniques to arbitrary stochastic attacks.

Figure 6.27: Byzantine Stochastic Results.
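The cumulative effect of the loss metric during the recovery period can be seen in a short calculation. The definition used below (messages not yet received as a percentage of all messages sent so far) is an assumption about the metric, and the interval counts are invented solely to show the shape of the curve; they are not results from the evaluation.

```python
def cumulative_loss_series(sent_per_interval, received_per_interval):
    """Return the cumulative message loss (%) after each interval."""
    series = []
    total_sent = 0
    total_received = 0
    for sent, received in zip(sent_per_interval, received_per_interval):
        total_sent += sent
        total_received += received
        if total_sent == 0:
            series.append(0.0)
        else:
            series.append(100.0 * (total_sent - total_received) / total_sent)
    return series


# Three intervals of injection followed by two recovery intervals with no injection,
# during which previously delayed messages are still being delivered.
sent = [10, 10, 10, 0, 0]
received = [10, 4, 4, 3, 3]
print(cumulative_loss_series(sent, received))
```

Because the denominator includes every message sent since the start of the scenario, late deliveries during recovery lower the figure only gradually and the loss cannot return to zero unless every outstanding message is eventually delivered.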



6.4.9 Evaluation Summary

A summary of our evaluation is shown in Table 6.3. This summary includes results from the test cases in Sections C.1 and C.2 as they form part of the FT protocol evaluation. The FT protocol evaluation has largely followed expectations; for example, all the protocols were able to tolerate f crash failures (fail-stop and fail-silent). The anomalies were where the SMR protocols, in particular IonianNB and Andros, had large deviations in fail-silent, a model they are expected to support. From observation it was noted that large spikes in message loss and subsequent high latency were always related to reconfigurations of the FT protocol (for example leader election in IonianNB). Reconfigurations could be observed from console trace. We found that the main cause of these reconfigurations was the unintentional fail-stop of some of our less powerful hosts; this was particularly prevalent with the Andros configuration. We speculate this was caused by the host overloading and not providing heartbeat messages to the Eternity failure detection scheme.

A feature of the evaluation was the introduction of stochastic variants of the omission, timing and Byzantine scenarios. We had no insight into how and if the configurations would tolerate randomly occurring failures that were steadily increasing in probability. The results showed that active replication protocols (Elegant and Atakos) provided the greater tolerance of stochastic failures. This was explained by the fact that SMR protocols exchange many more messages (as clearly demonstrated in the performance field of Table 2.3). Given that stochastic scenarios targeted all service messages, FT or otherwise (as would be the case in real life), the chance of a failure hit was much higher. The only anomaly is the Andros protocol, which survived the stochastic timing case up to a probability of 95%.
An observation we made during the FT protocol evaluation was that, after the Byzantine stochastic scenario, the SMR configurations performed a recovery, with overall message loss being reduced from 70% to under 15% for the impressive Ionian configuration; IonianNB also performed a substantial recovery. However, Andros only managed to recover 10% of the messages lost. From this we can see the value of blocking the request sequence until the previous message has been delivered by a majority, the scheme employed in Ionian.

Table 6.3: FT Protocol Evaluation Summary (configurations with n = 7).

Columns: No-FT, Elegant, Atakos, Patmos, Ionian, IonianNB, AndrosRSA.
Load, Max (Trans/s): No-FT 12; Elegant 5.1; Atakos 4.0; Patmos 4.5; Ionian 2.0; IonianNB 2.7; AndrosRSA 1.25.
Scenario rows and cases: Runtime Reconfiguration; Group Membership (Sequential, Concurrent); Fail-Stop (Sequential, Concurrent, Primary); Fail-Silent (f, f + 1); Omission (Stochastic %, note 2); Timing (f, f + 1, alternate, Stochastic %); Commission (f, f + 1); Denial of Service (f, f + 1, note 3); Divisive Attack (f, f + 1); Byzantine (f, f + 1, Stochastic %, Post Recovery).
Notes: 1. ∼ indicates a partial success. 2. Probability threshold at which the configuration failed. 3. Maximum DoS rate (in messages/s) tolerated.

6.5 Framework Evaluation


The following test cases assessed whether our framework addressed the research objectives in Chapter 1. The first four test cases were derived from our broad objective of increasing the range of FT protocols available to service oriented systems.

Does the framework simplify FT service deployment? Our framework fulfilled this objective by enabling a service to be developed with one Java class and an associated WSDL description to provide a generic FT protocol implementation, with an additional subclass to apply the FT service to a particular application. To support our assertion, Andros, by far the most complex of our FT services, was implemented in 335 lines of Java (shown in Appendix A). The key to this conciseness was the FT facilities offered by Sandbox, which would otherwise translate to hundreds of extra lines of code.

Does the framework make FT services pluggable and swappable? Our framework fulfilled this objective by allowing different FT services (representing diverse reference FT protocols) to be used in the same Trading Floor application. The swappable aspect of the objective was clearly demonstrated by a runtime reconfiguration scenario as shown in Section C.1. In this scenario the FT services were autonomously swapped by LAMB at runtime. No other framework surveyed in Chapter 4 had this ability.

Have the FT protocols been enhanced to operate with the SoA? This objective was fulfilled by the provision of six demonstrable FT services based on reference protocols. These FT service configurations were demonstrated in operation over a range of scenarios in Section 6.4. During the design phase of the framework it was clear that the reference protocols would not work out of the box with LAMB or any other SoA. We adapted the protocols so that they operated using LAMB bindings in Sandbox, so that protocol instances and application services could communicate. As evidence we refer the reader to Table 6.2 where all the FT services were operated successfully with the application. Though not originally envisioned by the research objectives, our FT services were able to perform group membership operations whereby new instances of an FT service could join, leave or crash at runtime, an ability never offered in the original reference FT protocols. This facility is offered by few other frameworks surveyed in Chapter 4. We refer the reader to Section C.2 for evidence of all the FT services performing group membership.

Have a set of different FT protocols been evaluated and contrasted for reliability and performance? This objective was fulfilled with the evaluation presented in Section 6.4. Our evaluation is, to the best of our knowledge, the first comparison of passive, active, crash SMR (Paxos) and Byzantine SMR (CLBFT) protocols with the same service oriented application (Trading Floor).

The next three test cases were derived from our broad objective to adapt current SoAs to make them less orthogonal to FT.

Has the framework SoA been enhanced by enforcing asynchronous messaging, thereby facilitating the FT process model? Our framework fulfilled this objective by the enforcement of the LAMB SoA for all service interactions throughout the evaluation. LAMB was asynchronous and the Trading Floor application, whether in a FT configuration or not, consisted of services on multiple hosts exchanging SOAP messages and thus clearly modelled the FT process model discussed in Section 2.1. We cannot prove this test case experimentally but it can be asserted by code inspection in Appendix A.

Does the SoA provide autonomous runtime service discovery? This research objective was fulfilled by implementing the broker pattern within the LAMB SoA. During scenario runs, the framework operated autonomously, without any human intervention. Table 6.2 shows that not only did the runtime discovery work but that it was also timely, executing three LAMB service brokering instances in 177 milliseconds for the No-FT configuration during the soak test scenario. To further support the evidence of the strength of our framework in this area, it also adapted to network topology changes (group membership in Section C.2) and even the deployment of new FT services (runtime reconfiguration in Section C.1). A unique property of LAMB not found in any other framework was the ability to create MOM-like routing keys from the XML structure of a SOAP message and the corresponding WSDL description. This property allowed LAMB enabled services to work out of the box without the need to annotate them for service discovery.

Does the SoA allow FT services to be differentiated based on their QoS metrics? This research objective presented the fundamental problem of differentiating FT services from their application counterparts when both provide similar interfaces that match incoming requests. Our simple priority based selection scheme based on QoS metrics allows services (both FT and application) to be differentiated based on whether reliability or performance is desired. The scheme worked demonstrably in two ways. Firstly, all the FT services worked in normal operation as shown in Section 6.4, indicating that both FT and application services were addressed in the correct order to achieve the goals of the Trading Floor application. Secondly, we created a scenario where, during normal operation of the Trading Floor application, increasingly reliable FT services (Elegant, Patmos, Atakos, Ionian, IonianNB and Andros respectively) were deployed without loss of messages or significant increase in end to end message latency. The actual results of the scenario are shown in Section C.1. This demonstrated that LAMB brokers would choose the most reliable FT service available; thus the research objective was achieved.
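A priority based selection of this kind can be sketched in a few lines. The example is illustrative and is not the LAMB broker: the `ServiceOffer` type, the integer ranks and the `select_service` function are assumptions. The reliability ranks follow the deployment order given above, while the performance ranks are placeholders rather than measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceOffer:
    name: str
    reliability: int   # hypothetical rank: higher means more reliable
    performance: int   # hypothetical rank: higher means faster


def select_service(offers, prefer="reliability"):
    """Return the offer with the highest rank for the preferred QoS metric."""
    if prefer not in ("reliability", "performance"):
        raise ValueError("prefer must be 'reliability' or 'performance'")
    if not offers:
        return None
    return max(offers, key=lambda offer: getattr(offer, prefer))


# Offers in the order in which the FT services were deployed in the scenario.
names = ["Elegant", "Patmos", "Atakos", "Ionian", "IonianNB", "Andros"]
offers = [
    ServiceOffer(name, reliability=rank, performance=len(names) - rank + 1)
    for rank, name in enumerate(names, start=1)
]
print(select_service(offers).name)
print(select_service(offers, prefer="performance").name)
```

As each more reliable offer is added to the list, a reliability-preferring selection moves to it, which mirrors the behaviour reported for the runtime reconfiguration scenario.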

The last two test cases are based on the broad objective of addressing the lack of decentralisation in many of the current FT-SoC approaches.

Is the service discovery infrastructure distributed across many nodes? Our framework fulfilled this objective by providing LAMB as a distributed SoA. All the nodes in the network host their own LAMB broker. This assertion was demonstrated by the implementations in Appendix A using a local LAMB directive to address a co-hosted application service, thus proving that the LAMB broker is local to the application service.

Has the SoA been integrated with Peer to Peer (P2P) protocols to remove centralisation? Our final research objective was fulfilled by an integration of JXTA with LAMB and Sandbox to create the WSPBFT framework. The previous test case proved that the LAMB SoA was distributed across all nodes in the network; of course, that network was formed by JXTA. We demonstrated that the framework operated over a scalable network (i.e. it scaled with O(log n)) with the latency metrics for the Elegant configuration shown in Table 6.2.

Table 6.4: Summary of Framework Test Cases.

Test Case | Met? | Supporting Evidence
Does the framework simplify FT service deployment? | Yes | Andros implemented in 335 lines of Java (Appendix A).
Does the framework make FT services pluggable and swappable? | Yes | Pluggable: different FT services used with same Trading Floor application. Swappable: Runtime Reconfiguration Scenario in Section C.1.
Have the FT protocols been enhanced to operate with the SoA? | Yes | Table 6.2 shows FT services operating with Trading Floor application. Group membership: scenario in Section C.2.
Have a set of different FT protocols been evaluated and contrasted for reliability and performance? | Yes | Results of evaluation shown in Section 6.4.
Has the framework SoA been enhanced by enforcing asynchronous messaging, thereby facilitating the FT process model? | Yes | Demonstrable by source code inspection (Appendix A).
Does the SoA provide autonomous runtime service discovery? | Yes | Autonomous nature of scenario runs. Timeliness: Table 6.2, No-FT case, shows 3 LAMB brokering occurrences in 177ms.
Does the SoA allow FT services to be differentiated based on their QoS metrics? | Yes | Runtime Reconfiguration Scenario in Section C.1.
Is the service discovery infrastructure distributed across many nodes? | Yes | Verifiable by code inspection of platform.
Has the SoA been integrated with Peer to Peer (P2P) protocols to remove centralisation? | Yes | Framework scalability (O(log n)) demonstrated in Table 6.2, Elegant case, implies a scalable P2P network. JXTA integration verifiable by code inspection.

We have presented a summary of the test cases in Table 6.4. All the test cases have been successfully met. Wherever possible we have used experimental means of confirming these cases, though some have been implied through other results.

6.5.1 FT-SoC Literature Survey Revisited

We present the summary from the FT-SoC literature survey of Chapter 4 again in Table 6.5. This time we have added our framework, WSPBFT (2009), to contrast its coverage against other frameworks in the same domain. The coverage of our framework is asserted from the test cases above and the evaluation in Section 6.4.

Table 6.5: FT-SoC Literature Survey Revisited.

Framework | Class
Container [Sommerville05] | Intermediary
FAWS [Jayasinghe05] | Intermediary
WS-BPEL FT [Dobson06] | Intermediary
IWSD [Salatge07] | Connector
WS-FTM [Looker05] | Client
MidRWS [Ye05] | Client
FT-SOAP [Fang04] | CORBA
FTWeb [Santos05] | CORBA
TransparentFT [Dialani02] | Recovery
WS-Replication [Salas06] | Group Mem.
RMWS [Osrael07] | Group Mem.
WS-BUS [Erradi05] | MOM
FT-Net [Caituiro-monge07] | P2P
RWSI [Norcross05] | P2P
Thema [Merideth05] | CLBFT
BFT-WS [Zhao07] | CLBFT
WSPBFT [Hall07] | P2P
WSPBFT 2009 | LAMB/P2P

Properties assessed for each framework: PR (a), AR (b), SMR, Crash, Byzantine, Ordering, Async (c), DC (d), GM (e), QoS Matching, Scalable, Diversity, Late Protocol Binding.
Notes: a. Passive Replication. b. Active Replication. c. Asynchronous: no implied timing assumptions. d. Decentralised. e. Group Membership. f. ∼: Implied or Partial Support.

Using this table we can create three more test cases posed by the broad research objectives.

Does the framework increase the range of FT protocols available to service oriented systems? This objective has been met because our framework can support passive, active and SMR replication tolerating crash and Byzantine failures (and all failure models in between). Table 6.3 supports this assertion. No other framework can support this range of FT protocols.

Does the framework adapt current SoAs to make them less orthogonal to FT? Our framework has met this objective by enforcing an asynchronous (MOM-like) broker pattern in LAMB. In addition, our framework supports QoS matching through our simple QoS selection scheme. Finally, facilitating late protocol binding (choosing the appropriate FT service at runtime) is a validation of the QoS matching scheme and also of the highly autonomous nature of the LAMB based service discovery.

Does the framework address the lack of decentralisation in current FT-SoC frameworks? This objective has been met because our framework is not only decentralised but also scalable thanks to the integration with the JXTA P2P protocols.

6.6

Summary

This chapter started by presenting a case study application, Trading Floor, from which the configurations used in the evaluation were derived. We also described the doping mechanism used to drive the scenarios with requests and controlled failures. Metrics were gathered using a special API that allowed the results to be presented as charts. We performed the evaluation in two parts. The first part was a set of scenarios to assess the ability of different FT protocols (implemented as services) to tolerate different failure models. This fulfilled our research objective to evaluate and contrast the reliability and performance of different FT protocols. The second part was a set of test cases that assessed whether our framework fulfilled the research objectives presented in Chapter 1. Finally, we contrasted our framework with those surveyed in Chapter 4 and assessed whether they fulfilled our broad research objectives.


Chapter 7

Conclusions

This chapter concludes our work by first assessing it from the perspective of our top-level research objective. We then discuss some limitations of our work. Next, some possible research directions are described. Finally, we make some final remarks about this work and the future.

7.1

Research Objectives Revisited

Support the reliability characteristics of SoC by improving the provision of FT in conjunction with the SoA.

  - Simplify the deployment of FT protocols as services.
  - Make FT services pluggable and swappable.
  - Enhance FT protocols, where necessary, to operate with the SoA.
  - Evaluate and contrast different FT protocols with regard to reliability and performance.

• Adapt current SoAs to make them less orthogonal to FT.

  - Enhance SoA by enforcing asynchronous messaging, thereby facilitating the FT process model.
  - Provide autonomous runtime service discovery.
  - Allow FT services to be differentiated based on their QoS metrics.

• Address the lack of decentralisation in current FT-SoC approaches.

  - Distribute service discovery infrastructure across many nodes.
  - Integrate SoA with Peer to Peer (P2P) protocols to remove centralisation.

7.2

Assessment

Have we addressed the primary research objective of supporting the reliability characteristics of SoC by improving the provision of FT in conjunction with the SoA? In our evaluation (Section 6.5) we demonstrated that our framework (described in full in Chapter 5) fulfilled the decomposed research objectives presented in Chapter 1. In addition, the FT protocol evaluation in Section 6.4 fulfilled the final research objective. We also compared the framework with the current state of the art in FT-SoC (as surveyed in Chapter 4; see Table 6.5), indicating that our framework addresses the issues not covered comprehensively by other frameworks.

We have improved the provision of FT by enabling six reference protocols to be implemented as services and then evaluated. A hosting environment has been developed to simplify the deployment of FT services. We have adapted current SoA into LAMB, which allows asynchronous and autonomous service discovery enhanced with QoS-based selection. Thus we have addressed our primary research objective.

7.3

Limitations of our Work

In this section we discuss some limitations of our work.

Name based Discovery. A weakness of our framework is that services are brokered by LAMB on the basis that a service consumes a set of uniquely named messages and these messages are listed in the interface section of the WSDL description. This means that it is solely the responsibility of the application developer, at design time, to write client services that consume certain named messages that correspond to well-known services [Alonso02]. This is fine if the same developer is responsible for all the web services in the domain (as was the case in our research). When consuming third party services it becomes essential that, within a certain domain, all providers share the same interface definition for a common service. Currently, there are no schemes available to do this on a global naming basis. However, the Semantic Web is an initiative from the W3C to standardise ontology definitions, so semantic descriptions may subsume name-based matching in the future. We reviewed several initiatives that integrate ontologies into service discovery in Section 3.3.3.
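The limitation of name based discovery can be illustrated with a minimal sketch. It is not LAMB's implementation: the `ServiceDescription` and `NameBasedBroker` classes, the service names and the message names are all hypothetical, and the set of consumed message names merely stands in for the interface section of a WSDL description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical stand-in for the interface section of a WSDL description."""
    name: str
    consumes: frozenset  # uniquely named messages this service consumes


@dataclass
class NameBasedBroker:
    """Illustrative broker that routes a message by its name alone."""
    services: list = field(default_factory=list)

    def register(self, service: ServiceDescription) -> None:
        self.services.append(service)

    def route(self, message_name: str) -> list:
        # Matching is purely syntactic: the names must be identical.
        return [s.name for s in self.services if message_name in s.consumes]


broker = NameBasedBroker()
broker.register(ServiceDescription("TradeService", frozenset({"PlaceOrder", "CancelOrder"})))
broker.register(ServiceDescription("AuditService", frozenset({"PlaceOrder"})))

print(broker.route("PlaceOrder"))  # both services consume this name
print(broker.route("placeOrder"))  # a provider using another spelling is not matched
```

The second lookup shows why providers in a domain would need to agree on message names: a third party that names the same message differently is never discovered.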

QoS Matching Scheme. Our QoS matching scheme is effective in the context of differentiating fault tolerance services within our framework. However, the scheme is essentially a proof of concept that demonstrates that QoS metrics can be used for this purpose. The metrics are two-dimensional (reliability and performance) and limited to a percentage score. In Section 3.3.3 we reviewed a set of service discovery schemes based on QoS [Dobson05, Erradi05, Makris06, Ran03]. We believe that these approaches could be integrated into LAMB. QoS matching becomes more accurate with the expression of more detailed metrics such as MTTF or mean availability [Dobson05]. In our scheme, service metrics are statically defined in the WSDL description. Other schemes allow services to be monitored, enabling their metrics to change at runtime [Makris06]. Finally, QoS matching can be extended to a reputation scheme [Erradi05].
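A two-dimensional percentage scheme of this kind might be sketched as follows. The selection rule (accept services that meet both minimum scores, then prefer the closest match) is an assumption made for illustration, as are the service names and scores; the text above only states that the metrics are reliability and performance percentages defined statically per service.

```python
from typing import NamedTuple, Optional


class QoS(NamedTuple):
    reliability: int  # percentage score, 0-100
    performance: int  # percentage score, 0-100


def select_service(advertised: dict, required: QoS) -> Optional[str]:
    """Return the advertised service that best fits the required scores.

    Assumed rule: a service qualifies if it meets both minimum scores;
    among qualifying services the one with the smallest total surplus wins.
    """
    candidates = {
        name: qos
        for name, qos in advertised.items()
        if qos.reliability >= required.reliability
        and qos.performance >= required.performance
    }
    if not candidates:
        return None

    def surplus(name: str) -> int:
        qos = candidates[name]
        return (qos.reliability - required.reliability) + (
            qos.performance - required.performance
        )

    return min(candidates, key=surplus)


# Hypothetical, statically defined metrics (illustrative numbers only).
advertised = {
    "ft-basic": QoS(reliability=60, performance=90),
    "ft-crash": QoS(reliability=80, performance=70),
    "ft-byzantine": QoS(reliability=95, performance=40),
}

print(select_service(advertised, QoS(75, 50)))
print(select_service(advertised, QoS(90, 30)))
print(select_service(advertised, QoS(99, 99)))
```

Because the scores are static, the result depends only on the advertised descriptions; a monitored scheme would update the `advertised` values at runtime.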

Limitations of JXTA. We adopted JXTA [Brookshier02, Oaks02] as the basis for the P2P overlay to support decentralisation within our framework. However, JXTA has two drawbacks. Firstly, it has noted performance problems arising from its dependence on XML [Antoniu05, Halepovic03], and it is passive, which required us to develop a platform to mediate the JXTA protocols. Secondly, using JXTA limited us to developing in Java because the bindings to other languages have not been fully developed.


7.4


Further Research Directions

We use this section to discuss some further work that could result from this research.

Refinement and standardisation of LAMB. We believe that LAMB can provide a real contribution to service oriented computing. Competing message oriented middleware is oriented towards the message queue abstraction. This is unsurprising given that such middleware models electronic mail, just as the REST paradigm models the WWW. The principle of message signatures identifying services is applicable, we believe, to all service-oriented applications. Other LAMB principles are more specific to the requirements of this thesis. However, these principles may be tailored to the requirements of other application domains. Integration with protocols such as WS-Addressing [Gudgin06] would be an important step towards widespread adoption of LAMB. We feel the strength of LAMB is its simplicity; over-engineering LAMB, for example by adding delivery guarantees, would reduce its efficacy.

Evolving SMR. Our comparative case study has allowed us to compare the relative performance of our different FT services. An artefact of the study is the relative weakness of the IonianNB and Andros consensus-based FT services in several benign failure models such as fail-silent. Anecdotal evidence has shown the problem is related to frequent leader election reconfigurations (Ω or View-Change) caused by the overloading of weaker nodes in our heterogeneous environment. Ionian, the other consensus-based FT service, suffers less from this problem because sequence proposals are rationed rather than flooded into the protocol. We propose that a combination of the rationed proposals of Ionian and the Byzantine FT of Andros would provide an effective FT service for many applications.

A more radical solution to the overheads of leader election reconfigurations is to remove such reconfigurations altogether. To achieve this we propose two innovations. Firstly, instances within a protocol would share multiple sequence proposals (based on message digests) within one message, for example a message may include (prepare digest3 as 2, prepare digest2 as 2, commit digest1 as 1). When a quorum of FT service instances say commit digest1 it is committed by all correct instances. Secondly, when quorums cannot be reached the instance with the nearest key to the digest of a message is the deterministic leader for all operations relating to that message. This concept, borrowed from a P2P ring topology, removes the requirement for an election process. Phokal is a proposed fault tolerance protocol to provide these innovations. However, it is currently hypothetical. This new protocol can be quickly compared with the current state of the art in FT by using the WSPBFT framework and a replication of our case study.

Lastly, current SMR approaches are infinite, that is, they maintain state forever. We suggest that it is sufficient for many applications that state is only maintained for a set of interactions. We call them statelets. At the boundary of such statelets, FT protocols may perform operations such as group membership and fail-stop to change the set of nodes. This approach has been available in enterprise systems for many years in the form of transactions. However, it has yet to be adopted by SMR FT protocols.
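The two proposed innovations can be sketched under stated assumptions. Since the protocol is described as hypothetical, nothing below is its specification: the 32-bit key space, the use of truncated SHA-256 digests as ring keys, the ring distance metric, the tie-break by instance identifier, and the function names are all illustrative choices.

```python
import hashlib
from collections import Counter

KEY_BITS = 32          # assumed size of the ring's key space
RING = 1 << KEY_BITS


def ring_key(data: bytes) -> int:
    """Map bytes onto the ring via a truncated SHA-256 digest (an assumption)."""
    return int.from_bytes(hashlib.sha256(data).digest()[: KEY_BITS // 8], "big")


def ring_distance(a: int, b: int) -> int:
    """Shortest distance between two keys travelling either way round the ring."""
    d = abs(a - b)
    return min(d, RING - d)


def leader_for(message: bytes, instance_ids) -> str:
    """Pick the instance whose key is nearest to the message digest.

    Every instance evaluating this over the same membership obtains the
    same answer, so no election round is needed. Ties fall back to the id.
    """
    target = ring_key(message)
    return min(
        instance_ids,
        key=lambda i: (ring_distance(ring_key(i.encode()), target), i),
    )


def committed_digests(messages: dict, quorum: int) -> set:
    """Return the (digest, sequence) pairs that a quorum of instances commit.

    `messages` maps an instance id to the batch of proposals it shared,
    each proposal being a (phase, digest, sequence) tuple.
    """
    counts = Counter()
    for proposals in messages.values():
        # Count each committed pair at most once per instance.
        for pair in {(d, s) for phase, d, s in proposals if phase == "commit"}:
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= quorum}


messages = {
    "n1": [("prepare", "digest3", 2), ("prepare", "digest2", 2), ("commit", "digest1", 1)],
    "n2": [("commit", "digest1", 1)],
    "n3": [("commit", "digest1", 1), ("prepare", "digest2", 2)],
}
print(committed_digests(messages, quorum=3))
print(leader_for(b"digest2", ["n1", "n2", "n3"]))
```

The quorum size is left as a parameter because it depends on the failure model being tolerated. The leader choice is only deterministic if all correct instances agree on the membership list, which is why membership changes would need to happen at well-defined boundaries.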

7.5

Final Remarks

Ensuring reliability in service-oriented systems is a challenging task. Convincing service developers to use FT techniques in their systems is even more difficult. In our research we have gone beyond simply providing another FT framework by providing full coverage of seminal FT protocols and a set of desirable attributes including asynchrony, decentralisation, diversity, group membership and late protocol binding. Additionally, in our case study we demonstrated that WSPBFT performs well and is scalable. The properties found in our framework have only patchy coverage in other approaches.

During this work we have gained an appreciation of how difficult it is to develop a distributed system. In particular, there is a large gulf between theoretical agreement protocols such as Paxos and real-world implementations such as Ionian. These implementations were beset with problems of concurrency and node overloading. Our evaluation has often demonstrated that simple protocols such as Elegant out-perform more comprehensive ones such as Andros.

We believe service oriented computing will evolve over the coming decades. As trust and semantic horizons are reached, fault tolerant computing will become more important to this distributed paradigm. It is hoped that this work will provide a significant contribution towards this vision.


Bibliography

[Alonso02]

Alonso, G. Myths around web services. IEEE Data Engineering Bulletin, 23(4):3-9, 2002.

[Alves07]

Alves, A., Arkin, A., Askary, S., Barreto, C., Bloch, B., Curbera, F., Ford, M., Goland, Y., Guízar, A., Kartha, N., Liu, C. K., Khalaf, R., König, D., Marin, M., Mehta, V., Thatte, S., van der Rijn, D., Yendluri, P., and Yiu, A. Oasis web services business process execution language (wsbpel) v2.0. Web, 2007. URL http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.pdf.

[Amoretti08]

Amoretti, M., Zanichelli, F., Conte, G., and Bisi, M. Enabling peer-to-peer web service architectures with jxta-soap. In IADIS e-Society. 2008. URL https://soap.dev.java.net/.

[Antoniu05]

Antoniu, G., Hatcher, P., Jan, M., and Noblet, D. Performance evaluation of jxta communication layers. In Cluster Computing and the Grid, 2005. CCGrid 2005. IEEE International Symposium on, volume 1, pages 251-258. 2005.

[Arkin02]

Arkin, A., Askary, S., Fordin, S., Jekeli, W., Kawaguchi, K., Orchard, D., Pogliani, S., Riemer, K., Struble, S., Takacsi-Nagy, P., Trickovic, I., and Zimek, S. Web service choreography interface (wsci) 1.0. Web, 2002. URL http://www.w3.org/TR/wsci.

[Avizienis85]

Avizienis, A. The n-version approach to fault-tolerant software. IEEE Trans. Softw. Eng., 11(12):1491-1501, 1985. ISSN 0098-5589. doi:http://dx.doi.org/10.1109/TSE.1985.231893.

[Avizienis01]

Avizienis, A., Laprie, J., and Randell, B. Fundamental concepts of dependability. Technical report, LAAS-CNRS, 2001.

[Balakrishnan03]

Balakrishnan, H., Kaashoek, M. F., Karger, D., Morris, R., and Stoica, I. Looking up data in p2p systems. Commun. ACM, 46(2):43-48, 2003. ISSN 0001-0782. doi:http://doi.acm.org/10.1145/606272.606299.

[Ballinger01]

Ballinger, K., Brittenham, P., Malhotra, A., Nagy, W. A., and Pharies, S. Web services inspection language (ws-inspection) 1.0. Technical report, IBM, 2001.

[Banavar99]

Banavar, G., Chandra, T. D., Strom, R. E., and Sturman, D. C. A case for message oriented middleware. In Proceedings of the 13th International Symposium on Distributed Computing, pages 1-18. Springer-Verlag, London, UK, 1999. ISBN 3-540-66531-5.

[Barborak93]

Barborak, M., Dahbura, A., and Malek, M. The consensus problem in fault-tolerant computing. ACM Comput. Surv., 25(2):171-220, 1993. ISSN 0360-0300. doi:http://doi.acm.org/10.1145/152610.152612.

[Bechhofer04]

Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D. L., Patel-Schneider, P. F., and Stein, L. A. Owl web ontology language reference. Web, 2004. URL http://www.w3.org/TR/owl-ref/.

[Ben-Or83]

Ben-Or, M. Another advantage of free choice (extended abstract): Completely asynchronous agreement protocols. In PODC '83: Proceedings of the second annual ACM symposium on Principles of distributed computing, pages 27-30. ACM, New York, NY, USA, 1983. ISBN 0-89791-110-5. doi:http://doi.acm.org/10.1145/800221.806707.

[Birman87]

Birman, K. and Joseph, T. Exploiting virtual synchrony in distributed systems. SIGOPS Oper. Syst. Rev., 21(5):123-138, 1987. ISSN 0163-5980. doi:http://doi.acm.org/10.1145/37499.37515.

[Birman99]

Birman, K. P., Hayden, M., Ozkasap, O., Xiao, Z., Budiu, M., and Minsky, Y. Bimodal multicast. ACM Trans. Comput. Syst., 17(2):41-88, 1999. ISSN 0734-2071. doi:http://doi.acm.org/10.1145/312203.312207.

[Bondi00]

Bondi, A. B. Characteristics of scalability and their impact on performance. In WOSP '00: Proceedings of the 2nd international workshop on Software and performance, pages 195-203. ACM, New York, NY, USA, 2000. ISBN 1-58113-195-X. doi:http://doi.acm.org/10.1145/350391.350432.

[Brambilla04]

Brambilla, M., Ceri, S., Passamani, M., and Riccio, A. Managing asynchronous web services interactions. In ICWS '04: Proceedings of the IEEE International Conference on Web Services, page 80. IEEE Computer Society, Washington, DC, USA, 2004. ISBN 0-7695-2167-3. doi:http://dx.doi.org/10.1109/ICWS.2004.73.

[Brookshier02]

Brookshier, D., Govoni, D., Krishnan, N., and Soto, J. C. JXTA: Java P2P Programming. Sams, Indianapolis, IN, USA, 2002. ISBN 0672323664.

[Caituiro-monge07]

Caituiro-monge, H. and Rodriguez-martinez, M. Ft-net traveler: Fault-tolerant and scalable web service broker architecture. Web, 2007. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.3270&rep=rep1&type=pdf.

[Carvalho03]

Carvalho, N. and Rodrigues, N. Implementing Reliable Broadcast Protocols in Appia. Universidade de Lisboa, 2003.

[Castro99]

Castro, M. and Liskov, B. Practical byzantine fault tolerance. In OSDI '99: Proceedings of the third symposium on Operating systems design and implementation, pages 173-186. USENIX Association, Berkeley, CA, USA, 1999. ISBN 1-880446-39-1.

[Castro01]

Castro, M. and Liskov, B. Byzantine fault tolerance can be fast. In DSN '01: Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS), pages 513-518. IEEE Computer Society, Washington, DC, USA, 2001. ISBN 0-7695-1101-5.

[Castro02]

Castro, M. and Liskov, B. Practical byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst., 20(4):398-461, 2002. ISSN 0734-2071. doi:http://doi.acm.org/10.1145/571637.571640.

[Chandra96a]

Chandra, T. D., Hadzilacos, V., and Toueg, S. The weakest failure detector for solving consensus. J. ACM, 43(4):685-722, 1996. ISSN 0004-5411. doi:http://doi.acm.org/10.1145/234533.234549.

[Chandra96b]

Chandra, T. D., Hadzilacos, V., Toueg, S., and Charron-Bost, B. On the impossibility of group membership. In PODC '96: Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, pages 322-330. ACM, New York, NY, USA, 1996. ISBN 0-89791-800-2. doi:http://doi.acm.org/10.1145/248052.248120.

[Chandra96c]

Chandra, T. D. and Toueg, S. Unreliable failure detectors for reliable distributed systems. J. ACM, 43(2):225-267, 1996. ISSN 0004-5411. doi:http://doi.acm.org/10.1145/226643.226647.

[Chandra07]

Chandra, T. D., Griesemer, R., and Redstone, J. Paxos made live: an engineering perspective. In PODC '07: Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, pages 398-407. ACM, New York, NY, USA, 2007. ISBN 978-1-59593-616-5. doi:http://doi.acm.org/10.1145/1281100.1281103.

[Chawathe03]

Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., and Shenker, S. Making gnutella-like p2p systems scalable. In SIGCOMM '03: Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, pages 407-418. ACM, New York, NY, USA, 2003. ISBN 1-58113-735-4. doi:http://doi.acm.org/10.1145/863955.864000.

[Clement04]

Clement, L., Hately, A., von Riegen, C., and Rogers, T. Uddi version 3.0.2 uddi spec. Web, 2004. URL http://uddi.org/pubs/uddi_v3.htm.

[Correia06]

Correia, M., Neves, N., and Veríssimo, P. Byzantine consensus in asynchronous message-passing systems: a survey. In Resilience-building Technologies: State of Knowledge, RESIST Network of Excellence, chapter 1. 2006. URL http://homepages.di.fc.ul.pt/~mpc/pubs/survey-ba-tr-difcul.pdf.

[Cristian91]

Cristian, F. Reaching agreement on processor-group membership in synchronous distributed systems. Distributed Computing, 1991.

[Czajkowski04]

Czajkowski, K., Ferguson, D. F., Foster, I., Frey, J., Graham, S., Sedukhin, I., Snelling, D., Tuecke, S., and Vambenepe, W. The ws-resource framework. Technical report, Global Grid Forum, 2004. URL http://www.globus.org/wsrf/specs/ws-wsrf.pdf.

[Dattatri06]

Dattatri, K., Smith, R., and Stitcher, A. Amqp: A general-purpose middleware standard. Web, 2006. URL http://jira.amqp.org/confluence/download/attachments/720900/amqp.0-10.pdf?version=1.

[Davis07]

Davis, D., Karmarkar, A., Pilz, G., Winkler, S., and Yalçinalp, Ü. Web services reliable messaging (ws-reliablemessaging) version 1.1. Web, 2007. URL http://docs.oasis-open.org/ws-rx/wsrm/200702/wsrm-1.1-spec-os-01.html.

[Défago04]

Défago, X., Schiper, A., and Urbán, P. Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Comput. Surv., 36(4):372-421, 2004. ISSN 0360-0300. doi:http://doi.acm.org/10.1145/1041680.1041682.

[Dialani02]

Dialani, V., Miles, S., Moreau, L., Roure, D. D., and Luck, M. Transparent fault tolerance for web services based architectures. In Euro-Par '02: Proceedings of the 8th International Euro-Par Conference on Parallel Processing, pages 889-898. Springer-Verlag, London, UK, 2002. ISBN 3-540-44049-6.

[Dobson05]

Dobson, G., Lock, R., and Sommerville, I. Qosont: a qos ontology for service-centric systems. In EUROMICRO '05: Proceedings of the 31st EUROMICRO Conference on Software Engineering and Advanced Applications, pages 80-87. IEEE Computer Society, Washington, DC, USA, 2005. ISBN 0-7695-2431-1. doi:http://dx.doi.org/10.1109/EUROMICRO.2005.49.

[Dobson06]

Dobson, G. Using ws-bpel to implement software fault tolerance for web services. In EUROMICRO '06: Proceedings of the 32nd EUROMICRO Conference on Software Engineering and Advanced Applications, pages 126-133. IEEE Computer Society, Washington, DC, USA, 2006. ISBN 0-7695-2594-6. doi:http://dx.doi.org/10.1109/EUROMICRO.2006.63.

[Dobson07]

Dobson, G., Hall, S., and Kotonya, G. A domain-independent ontology for non-functional requirements. In IEEE International Conference on e-Business Engineering (ICEBE'07), pages 563-566. 2007.

[Dolev87]

Dolev, D., Dwork, C., and Stockmeyer, L. On the minimal synchronism needed for distributed consensus. J. ACM, 34(1):77-97, 1987. ISSN 0004-5411. doi:http://doi.acm.org/10.1145/7531.7533.

[Doudou05]

Doudou, A., Garbinato, B., and Guerraoui, R. Tolerating Arbitrary Failures with State Machine Replication. In Dependable Computing Systems, Wiley Series on Parallel and Distributed Computing. John Wiley and Sons, Inc, 2005.

[Dwork88]

Dwork, C., Lynch, N., and Stockmeyer, L. Consensus in the presence of partial synchrony. J. ACM, 35(2):288-323, 1988. ISSN 0004-5411. doi:http://doi.acm.org/10.1145/42282.42283.

[Elnozahy02]

Elnozahy, E. N. M., Alvisi, L., Wang, Y.-M., and Johnson, D. B. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375-408, 2002. ISSN 0360-0300. doi:http://doi.acm.org/10.1145/568522.568525.

[Erradi05]

Erradi, A. and Maheshwari, P. wsbus: a framework for reliable web services interactions. In SAC '05: Proceedings of the 2005 ACM symposium on Applied computing, pages 1739-1740. ACM, New York, NY, USA, 2005. ISBN 1-58113-964-0. doi:http://doi.acm.org/10.1145/1066677.1067070.

[Fang04]

Fang, C.-L., Liang, D., Chen, C., and Lin, P. A redundant nested invocation suppression mechanism for active replication fault-tolerant web service. In EEE '04: Proceedings of the 2004 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE'04), pages 9-16. IEEE Computer Society, Washington, DC, USA, 2004. ISBN 0-7695-2073-1.

[Felber99]

Felber, P., Défago, X., Guerraoui, R., and Oser, P. Failure detectors as first class objects. In DOA '99: Proceedings of the International Symposium on Distributed Objects and Applications, page 132. IEEE Computer Society, Washington, DC, USA, 1999. ISBN 0-7695-0182-6.

[Fetzer01]

Fetzer, C., Raynal, M., and Tronel, F. An adaptive failure detection protocol. In PRDC '01: Proceedings of the 2001 Pacific Rim International Symposium on Dependable Computing, page 146. IEEE Computer Society, Washington, DC, USA, 2001. ISBN 0-7695-1414-6.

[Fielding00]

Fielding, R. T. Architectural Styles and the Design of Network-based Software Architectures. Ph.D. thesis, University of California, 2000. URL http://www.ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf.

[Fischer85]

Fischer, M. J., Lynch, N. A., and Paterson, M. S. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374-382, 1985. ISSN 0004-5411. doi:http://doi.acm.org/10.1145/3149.214121.

[Fox02]

Fox, G. and Pallickara, S. The narada event brokering system: Overview and extensions. In PDPTA '02: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pages 353-359. CSREA Press, 2002. ISBN 1-892512-87-4.

[Friedman96]

Friedman, R. and van Renesse, R. Strong and weak virtual synchrony in horus. srds, 00:140, 1996. ISSN 1060-9857. doi:http://doi.ieeecomputersociety.org/10.1109/RELDIS.1996.559711.

[Garofalakis04]

Garofalakis, J., Panagis, Y., and Sakkopoulos, E. Web service discovery mechanisms: looking for a needle in a haystack. In International Workshop on Web Engineering. 2004.

[Google09]

Google. Gears - improving the web. Web, 2009. URL http://gears.google.com/.

[Gudgin06]

Gudgin, M., Hadley, M., and Rogers, T. Web services addressing 1.0 - core. Web, 2006. URL http://www.w3.org/TR/ws-addr-core.

[Gudgin07a]

Gudgin, M., Hadley, M., Mendelsohn, N., Moreau, J.-J., Nielsen, H. F., Karmarkar, A., and Lafon, Y. Soap version 1.2 part 1: Messaging framework. Web, 2007. URL http://www.w3.org/TR/soap12-part1/.

[Gudgin07b]

Gudgin, M., Hadley, M., Mendelsohn, N., Moreau, J.-J., Nielsen, H. F., Karmarkar, A., and Lafon, Y. Soap version 1.2 part 2: Adjuncts. Web, 2007. URL http://www.w3.org/TR/soap12-part2/.

[Guerraoui06]

Guerraoui, R. and Rodrigues, L. An Introduction to Reliable Distributed Programming. Springer, 2006.

[Gupta00]

Gupta, I., van Renesse, R., and Birman, K. P. A probabilistically correct leader election protocol for large groups. In DISC '00: Proceedings of the 14th International Conference on Distributed Computing, pages 89-103. Springer-Verlag, London, UK, 2000. ISBN 3-540-41143-7.

[Hadzilacos94]

Hadzilacos, V. and Toueg, S. A modular approach to fault-tolerant broadcasts and related problems. Technical report, Cornell University, Ithaca, NY, USA, 1994. URL http://dit.unitn.it/~montreso/ds/syllabus/papers/FaultTolerantBroadcast.pdf.

[Hajamohideen03]

Hajamohideen, S. H. A model for web service discovery and invocation in jxta. Technical report, Technical University Hamburg-Harburg, Department of Telematics, 2003.

[Halepovic03]

Halepovic, E. D. R. Jxta performance study. In Communications, Computers and signal Processing, 2003. PACRIM. 2003 IEEE Pacific Rim Conference on, volume 1, pages 149-154. 2003.

[Hall07]

Hall, S. and Kotonya, G. An adaptable fault-tolerance for soa using a peer-to-peer framework. In ICEBE '07: Proceedings of the IEEE International Conference on e-Business Engineering, pages 520-527. IEEE Computer Society, Washington, DC, USA, 2007. ISBN 0-7695-3003-6. doi:http://dx.doi.org/10.1109/ICEBE.2007.32.

[Hall09]

Hall, S. and Kotonya, G. Eternity failure detector. Technical report, University of Lancaster, 2009.

[Hapner02]

Hapner, M., Burridge, R., Sharma, R., Fialli, J., and Stout, K. Java messaging service. Web, 2002. URL http://java.sun.com/products/jms/docs.html.

[Hayashibara02]

Hayashibara, N., Cherif, A., and Katayama, T. Failure detectors for large-scale distributed systems. In SRDS '02: Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems (SRDS'02), page 404. IEEE Computer Society, Washington, DC, USA, 2002. ISBN 0-7695-1659-9.

[Jayasinghe05]

Jayasinghe, D. Faws for soap-based web services. IBM Developerworks, 2005. URL http://www.ibm.com/developerworks/webservices/library/ws-faws/.

[Keidar01]

Keidar, I. and Rajsbaum, S. On the cost of fault-tolerant consensus when there are no faults: preliminary version. SIGACT News, 32(2):45-63, 2001. ISSN 0163-5700. doi:http://doi.acm.org/10.1145/504192.504195.

[Kursawe02]

Kursawe, K. Optimistic byzantine agreement. In SRDS '02: Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems (SRDS'02), page 352. IEEE Computer Society, Washington, DC, USA, 2002. ISBN 0-7695-1659-9.

[Lamport78a]

Lamport, L. The implementation of reliable distributed multiprocess systems. Computer Networks, (2):95-114, 1978.

[Lamport78b]

Lamport, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558-565, 1978. ISSN 0001-0782. doi:http://doi.acm.org/10.1145/359545.359563.

[Lamport82]

Lamport, L., Shostak, R., and Pease, M. The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382-401, 1982. ISSN 0164-0925. doi:http://doi.acm.org/10.1145/357172.357176.

[Lamport98]

Lamport, L. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133-169, 1998. ISSN 0734-2071. doi:http://doi.acm.org/10.1145/279227.279229.

[Lamport01]

Lamport, L. Paxos made simple. ACM SIGACT News (Distributed Computing Column), 32(4):51-58, 2001.

[Lamport04]

Lamport, L. and Massa, M. Cheap paxos. In DSN '04: Proceedings of the 2004 International Conference on Dependable Systems and Networks, page 307. IEEE Computer Society, Washington, DC, USA, 2004. ISBN 0-7695-2052-9.

[Lamport05]

Lamport, L. Generalized consensus and paxos. Technical report, Microsoft Research, 2005.

[Lamport06]

Lamport, L. Fast paxos. Technical report, Microsoft Research, 2006. Technical Report MSR-TR-2005-112.

[Laprie92]

Laprie, J.-C. Dependability: A unifying concept for reliable, safe, secure computing. In Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture - Information Processing '92, Volume 1, pages 585-593. North-Holland Publishing Co., Amsterdam, The Netherlands, 1992. ISBN 0-444-89747-X.

[Laprie95]

Laprie, J.-C. Dependability of computer systems: concepts, limits, improvements. In IEEE Symposium on Software Reliability Engineering. 1995.

[Li03]

Li, S. Jxta 2: A high-performance, massively scalable p2p network. IBM Developerworks, 2003. URL http://www.ibm.com/developerworks/java/library/j-jxta2/.

[Li04]

Li, Y., Zou, F., Wu, Z., and Ma, F. Pwsd: A scalable web service discovery architecture based on peer-to-peer overlay network. In APWeb, pages 291-300. 2004.

[Li07]

Li, H. C., Clement, A., Aiyer, A. S., and Alvisi, L. The paxos register. In SRDS '07: Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems, pages 114-126. IEEE Computer Society, Washington, DC, USA, 2007. ISBN 0-7695-2995-X.

[Liskov07]

Liskov, B. From viewstamped replication to bft, 2007. URL http://www.inf.unisi.ch/30YearsOfReplication/pps/Liskov.pdf, 30 Years of Replication Lecture Series.

[Looker05]

Looker, N., Munro, M., and Xu, J. Increasing web service dependability through consensus voting. In COMPSAC '05: Proceedings of the 29th Annual International Computer Software and Applications Conference (COMPSAC'05) Volume 2, pages 66-69. IEEE Computer Society, Washington, DC, USA, 2005. ISBN 0-7695-2413-3-02. doi:http://dx.doi.org/10.1109/COMPSAC.2005.88.

[Ma07]

Ma, T. Quality of Service of Crash-Recovery Failure Detectors. Ph.D. thesis, School of Informatics, University of Edinburgh, 2007.

[MacKenzie06]

MacKenzie, C. M., Laskey, K., McCabe, F., Brown, P. F., Metz, R., and Hamilton, B. A. Reference model for service oriented architecture 1.0. Web, 2006. URL http://docs.oasis-open.org/soa-rm/v1.0/, OASIS Standard.

[Maheshwari04]

Maheshwari, P., Tang, H., and Liang, R. Enhancing web services with message-oriented middleware. In ICWS '04: Proceedings of the IEEE International Conference on Web Services, page 524. IEEE Computer Society, Washington, DC, USA, 2004. ISBN 0-7695-2167-3. doi:http://dx.doi.org/10.1109/ICWS.2004.53.

[Makris05]

Makris, C., Sakkopoulos, E., Sioutas, S., Triantafillou, P., Tsakalidis, A., and Vassiliadis, B. Nippers: Network of interpolated peers for web service discovery. In ITCC '05: Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume II, pages 193-198. IEEE Computer Society, Washington, DC, USA, 2005. ISBN 0-7695-2315-3. doi:http://dx.doi.org/10.1109/ITCC.2005.211.

[Makris06]

Makris, C., Panagis, Y., Sakkopoulos, E., and Tsakalidis, A. Efficient and adaptive discovery techniques of web services handling large data sets. J. Syst. Softw., 79(4):480-495, 2006. ISSN 0164-1212. doi:http://dx.doi.org/10.1016/j.jss.2005.06.002.

[Martin06]

Martin, J.-P. Fast byzantine consensus. IEEE Trans. Dependable Secur. Comput., 3(3):202-215, 2006. ISSN 1545-5971. doi:http://dx.doi.org/10.1109/TDSC.2006.35. Senior Member-Lorenzo Alvisi.

[Maymounkov02]

Maymounkov, P. and Mazières, D. Kademlia: A peer-to-peer information system based on the xor metric. In IPTPS '01: Revised Papers from the First International Workshop on Peer-to-Peer Systems, pages 53-65. Springer-Verlag, London, UK, 2002. ISBN 3-540-44179-4.

[Merideth05]

Merideth, M. G., Iyengar, A., Mikalsen, T., Tai, S., Rouvellou, I., and Narasimhan, P. Thema: Byzantine-fault-tolerant middleware for web-service applications. In SRDS '05: Proceedings of the 24th IEEE Symposium on Reliable Distributed Systems, pages 131-142. IEEE Computer Society, Washington, DC, USA, 2005. ISBN 0-7695-2463-X. doi:http://dx.doi.org/10.1109/RELDIS.2005.28.

[Milanovic04]

Milanovic, N. and Malek, M. Current solutions for web service composition. IEEE Internet Computing, 8(6):51-59, 2004. ISSN 1089-7801. doi:http://dx.doi.org/10.1109/MIC.2004.58.

[Mitra07]

Mitra, N. and Lafon, Y. Soap version 1.2 part 0: Primer (second edition). Web, 2007. URL http://www.w3.org/TR/soap12-part0/.

[Muehlen05]

Muehlen, M., Nickerson, J. V., and Swenson, K. D. Developing web services choreography standards - the case of rest vs. soap. Decision Support Systems, 1(40):9-29, 2005.

[Norcross05]

Norcross, S. J., Dearle, A., Kirby, G. N. C., and Walker, S. M. A peer-to-peer infrastructure for resilient web services. In AAA-IDEA '05: Proceedings of the First International Workshop on Advanced Architectures and Algorithms for Internet Delivery and Applications, pages 65-72. IEEE Computer Society, Washington, DC, USA, 2005. ISBN 0-7695-2525-3. doi:http://dx.doi.org/10.1109/AAA-IDEA.2005.16.

[Oaks02]

Oaks, S., Travaset, B., Gong, L., and Traversat, B. JXTA in a Nutshell. O'Reilly. 2002.

[Oki88]

Oki, B. M. and Liskov, B. H. Viewstamped replication: a general primary copy. In PODC '88: Proceedings of the seventh annual ACM Symposium on Principles of distributed computing, pages 8-17. ACM, New York, NY, USA, 1988. ISBN 0-89791-277-2. doi:http://doi.acm.org/10.1145/62546.62549.

[Osrael07]

Osrael, J., Froihofer, L., Weghofer, M., and Goeschka, K. Axis2-based replication middleware for web services. In Web Services, 2007. ICWS 2007. IEEE International Conference on, pages 591-598. 2007. doi:10.1109/ICWS.2007.57.

[Owicki82]

Owicki, S. and Lamport, L. Proving liveness properties of concurrent programs. ACM Trans. Program. Lang. Syst., 4(3):455-495, 1982. ISSN 0164-0925. doi:http://doi.acm.org/10.1145/357172.357178.

[Pallickara03]

BIBLIOGRAPHY

Pallickara, S. and Fox, G.

NaradaBrokering: A Distributed Middleware

Framework and Architecture for Enabling Durable Peer-to-Peer Grids,

chapter Middleware 2003, pages 998999. Springer Berlin / Heidelberg, 2003. [Papazoglou03] [Patil04]

[Peltz03]

[Pullman01] [Qu04]

Papazoglou, M. P. and Georgakopoulos, D. 46(10), 2003.

Service-oriented comput-

ing. Commun. ACM,

Patil, A. A., Oundhakar, S. A., Sheth, A. P., and Verma, K. MeteorIn WWW '04: Proceedings of the 13th international conference on World Wide Web, pages 553562. ACM, New York, NY, USA, 2004. ISBN 1-58113-844-X. doi:http: //doi.acm.org/10.1145/988672.988747.

s web service annotation framework.

Peltz, C.

Web services orchestration -. a review of emerging tech-

nologies, tools, and. standards.

Technical report, Hewlet-Packard, 2003. URL http://devresource.hp.com/drc/technical_white_ papers/WSOrch/WSOrchestration.pdf. Pullman, L. Software Fault Tolerance Boston, Artech House, 2001.

Techniques and Implementa-

tions.

Qu, C.

services.

Interacting the edutella/jxta peer-to-peer network with web

In

In 2004 International Symposium on Applications and the

Internet (SAINT 2004,

2004.

pages 6773. IEEE Computer Society Press,

[Rabin83]

Rabin, M.

[Ran03]

Ran, S. A model for web services discovery with qos. SIGecom Exch., 4(1):110, 2003. ISSN 1551-9031. doi:http://doi.acm.org/10.1145/ 844357.844360.

[Randell75]

Randomized Byzantine generals.

Foundations of Computer Science.

Randell, B.

1983.

In

Proc. Symposium on

System structure for software fault tolerance.

In Propages 437449. ACM, New York, NY, USA, 1975. doi:http://doi.acm.org/ 10.1145/800027.808467. ceedings of the international conference on Reliable software,

[Randell95]

Randell, B. and Xu, J. Software Fault Tolerance, chapter The Evolution of the Recovery Block Concept, pages 122. John Wiley & Sons, 1995.

[Ratnasamy01]

Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Schenker, S. A scalable content-addressable network. In SIGCOMM '01: Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications,

220

pages 161172. ACM, New

BIBLIOGRAPHY

BIBLIOGRAPHY

York, NY, USA, 2001. ISBN 1-58113-411-8. doi:http://doi.acm.org/ 10.1145/383059.383072. [Raynal96]

Raynal, M. and Singhal, M. Logical time: Capturing causality in distributed systems. Computer, 29(2):4956, 1996. ISSN 0018-9162. doi: http://dx.doi.org/10.1109/2.485846.

[Reiter94]

Reiter, M. K.

Secure agreement protocols: reliable and atomic group

multicast in rampart.

In

CCS '94:

Proceedings of the 2nd ACM

Conference on Computer and communications security,

pages 68 80. ACM, New York, NY, USA, 1994. ISBN 0-89791-732-4. doi: http://doi.acm.org/10.1145/191177.191194. [Reiter95]

Reiter, M. K. The rampart toolkit for building high-integrity services. In Theory and Practice in Distributed Systems, volume 938, pages 99 110. Springer-Verlag, Berlin Germany, 1995. URL citeseer.ist.psu. edu/reiter95rampart.html.

[Renesse95]

Renesse, R. V., Birman, K. P., Friedman, R., Hayden, M., and Karr, D. A. A framework for protocol composition in horus. In PODC '95: Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing,

pages 8089. ACM, New York, NY, USA, 1995. ISBN 0-89791-710-3. doi:http://doi.acm.org/10.1145/224964.224974. [Renesse98]

Renesse, R. V., Minsky, Y., and Hayden, M. A gossip-style detection service. Technical report, Ithaca, NY, USA, 1998.

[Rodrigues01]

Rodrigues, R., Castro, M., and Liskov, B. Base: using abstraction to improve fault tolerance. In SOSP '01: Proceedings of the eighteenth ACM symposium on Operating systems principles, pages 1528. ACM, New York, NY, USA, 2001. ISBN 1-58113-389-8. doi:http://doi.acm. org/10.1145/502034.502037.

[Rowstron01]

Rowstron, A. I. T. and Druschel, P.

failure

Pastry: Scalable, decentralized ob-

ject location, and routing for large-scale peer-to-peer systems.

In

Mid-

dleware '01: Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg, pages 329350. SpringerVerlag, London, UK, 2001. ISBN 3-540-42800-3.

[Salas06]

Salas, J., Perez-Sorrosal, F., no Martínez Marta Pati and JiménezPeris, R. Ws-replication: a framework for highly available web services. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 357366. ACM, New York, NY, USA, 2006. ISBN 1-59593-323-9. doi:http://doi.acm.org/10.1145/1135777. 1135831.

221

BIBLIOGRAPHY

[Salatge07]

[Santos05]

BIBLIOGRAPHY

Salatge, N. and Fabre, J.-C. Fault tolerance connectors for unreliable In DSN '07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 5160. IEEE Computer Society, Washington, DC, USA, 2007. ISBN 0-7695-2855-4. doi:http://dx.doi.org/10.1109/DSN.2007.48. web services.

Santos, G. T., Lung, L. C., and Montez, C. In EDOC '05:

infrastructure for web services.

Ftweb: A fault tolerant Proceedings of the Ninth

IEEE International EDOC Enterprise Computing Conference, pages 95105. IEEE Computer Society, Washington, DC, USA, 2005. ISBN 0-7695-2441-9. doi:http://dx.doi.org/10.1109/EDOC.2005.15.

[Schlosser02]

Schlosser, M., Sintek, M., Decker, S., and Nejdl, W.

A scalable and

ontology-based p2p infrastructure for semantic web services.

In in

Pro-

ceedings of the Second International Conference on Peer-to-Peer Computing,

[Schneider90]

[Scott87]

pages 104111. 2002.

Schneider, F. B.

Implementing fault-tolerant services using the state

machine approach: a tutorial.

ACM Comput. Surv.,

22(4):299319, 1990. ISSN 0360-0300. doi:http://doi.acm.org/10.1145/98163.98167. Scott, R. K., Gault, J. W., and McAllister, D. F.

software reliability modeling.

Fault-tolerant

IEEE Trans. Softw. Eng.,

13(5):582 592, 1987. ISSN 0098-5589. doi:http://dx.doi.org/10.1109/TSE.1987. 233463.

[Shirky02]

Shirky, C. Web services and context horizons. Computer, 35(9):98 100, 2002. ISSN 0018-9162. doi:http://dx.doi.org/10.1109/MC.2002. 1033037.

[Sommerville01]

Sommerville, I. Software Engineering. International computer science series. AddisonWesley, Wokingham [u.a.], 6th edition, 2001.

[Sommerville05]

Sommerville, I., Hall, S., and Dobson, G.

A generic mechanism for im-

plementing fault tolerance in service-oriented architectures.

report, University of Lancaster, 2005.

Technical

[Stoica03]

Stoica, I., Morris, R., Liben-Nowell, D., Karger, D. R., Kaashoek, M. F., Dabek, F., and Balakrishnan, H. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Trans. Netw., 11(1):1732, 2003. ISSN 1063-6692. doi:http://dx.doi.org/10.1109/ TNET.2002.808407.

[Sussman00]

Sussman, J., Keidar, I., and Marzullo, K. Optimistic virtual synchrony. In SRDS '00: Proceedings of the 19th IEEE Symposium on Reliable Distributed Systems (SRDS'00), page 42. IEEE Computer Society, Washington, DC, USA, 2000. ISBN 0-7695-0543-0. 222

BIBLIOGRAPHY

BIBLIOGRAPHY

[Torres-Pomales00] Torres-Pomales, W. Software fault tolerance: A tutorial. Technical Report NASA/TM-2000-210616, Langley Research Centre, Langley Research Center, Hampton, Virginia, 2000. URL http://techreports. larc.nasa.gov/ltrs/PDF/2000/tm/NASA-2000-tm210616.pdf. [Toueg84]

Toueg, S.

Randomized byzantine agreements.

In

PODC '84: Proceed-

ings of the third annual ACM symposium on Principles of distributed computing,

pages 163178. ACM, New York, NY, USA, 1984. ISBN 0-89791-143-1. doi:http://doi.acm.org/10.1145/800222.806744. [Turner07]

Turner, B. The paxos family of consensus protocols. Web, 2007. URL http://brturn.googlepages.com/PaxosFamily.pdf.

[Vitenberg99]

Vitenberg, R., Keidar, I., Chockler, G., and Dolev, D. Group communication specications: A comprehensive study. Technical Report CS99-31, Comp. Sci. Inst., The Hebrew University of Jerusalem note =, 1999.

[Wang03]

Wang, J. and Blum, C.

Engineering Persistent Queue System for

a Unied Stock Transaction Platform,

volume Volume 2660/2003 of chapter Computational Science â ICCS 2003, page 715. Springer Berlin / Heidelberg, 2003.

Lecture Notes in Computer Science,

[Wiley03]

Wiley, B. Distributed hash tables. Linux Journal, 2003. URL http: //www.linuxjournal.com/article/6797.

[Xu97]

Xu, J. and Randell, B. Software fault tolerance: t/(n-1)-variant gramming. IEEE Trans. Reliability, 46(1):6068, 1997.

[Ye05]

Ye, X. and Shen, Y.

A middleware for replicated web services.

pro-

In ICWS

'05: Proceedings of the IEEE International Conference on Web Services,

pages 631638. IEEE Computer Society, Washington, DC, USA, 2005. ISBN 0-7695-2409-5. doi:http://dx.doi.org/10.1109/ICWS.2005. 8. [Yu04]

Yu, T. and Lin, K.-J. The design of qos broker algorithms for qoscapable web services. In EEE '04: Proceedings of the 2004 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE'04),

pages 1724. IEEE Computer Society, Washington, DC, USA, 2004. ISBN 0-7695-2073-1. [Zhao01]

Zhao, B. Y., Kubiatowicz, J. D., and Joseph, A. D.

Tapestry:

An infrastructure for fault-tolerant wide-area location and.

Technical report, Berkeley, CA, USA, Berkeley, CA, USA, 2001. URL http://www.ncstrl.org:8900/ncstrl/servlet/search?formname= detail&id=oai%3Ancstrlh%3Aucb%3AUCB%2F%2FCSD-01-1141. 223

BIBLIOGRAPHY

[Zhao07]

BIBLIOGRAPHY

Zhao, W. In

vices.

Bft-ws: A byzantine fault tolerance framework for web serEDOC Conference Workshop, 2007. EDOC '07. Eleventh

International IEEE,

[Zielinski04]

Zielinski, P. 2004.

pages 8996. 2007.

Paxos at war.

224

Technical report, University of Cambridge,

Appendix A

FT Protocol Code Listings

In Appendix A we present the actual Java code implementations of the FT protocols described in Chapter 5.

Listing A.1: Patmos Implemented in Java

package net.wspbft.ft.protocol;
import java.util.List {...};

public abstract class Patmos extends Service implements CrashListener, ClockworkSyncFailListener, ViewChangeListener
{
    @Override public void init(WSDL wsdl, Sandbox sandbox, String serviceURI, String instanceId) throws Exception
    {
        super.init(wsdl, sandbox, serviceURI, instanceId);
        eternity.addCrashListener(this);
        clockwork.setLog(new DLog("out"));
        clockwork.addSyncFailListener(this);
        viewpoint.addListener(this);
    }

    @Override public void destroy() throws Exception
    {
        eternity.removeCrashListener(this);
        clockwork.removeSyncFailListener(this);
        viewpoint.removeListener(this);
        in.destroy();
        promised.destroy();
    }

    public void incoming(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        String correlation = header.getCorrelationId().toString();
        in.add(correlation, message);
        viewpoint.setPopulation(header.getCoServiceParticipants(serviceURI));
        coServiceEndpoints = header.getCausalEndpoints(serviceURI);
        clockwork.syncIn(header.getMessageURI(), correlation);
        lastReceivedMessage = message;
        checkIfAmDistinguishedDeliverer(false);
    }

    protected void deliverMessage(SoapMessage message) throws Exception
    {
        if (message == null) return;
        LambHeader header = message.getLambHeader();
        header.resetAction();
        if (Main.traceMap.get("FtTrace")) System.out.println("Patmos.deliver("+header.getCorrelationId()+")");
        sendFunctionalLocal(message);
        header.setAction(getDeliveredMessageURI());
        sendTo(message, coServiceEndpoints);
    }

    public void Prepare(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        int view = header.getAttributeAsInt("view");
        if (view > currentDeliveringView)
        {
            currentDeliveringView = view;
            header.setAction(getPromiseMessageURI());
            reset();
        }
        else
        {
            header.setAttribute("view", currentDeliveringView);
            header.setAction(getNackMessageURI());
        }
        reply(message);
    }

    public void Nack(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        int view = header.getAttributeAsInt("view");
        if (view > viewpoint.getView()) viewpoint.setView(view);
    }

    public void Promise(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        int view = header.getAttributeAsInt("view");
        synchronized (promiseMutex)
        {
            if (view > lastPromisedView && view == viewpoint.getView())
            {
                promised.add(view, new Object());
                if (isQuorum(promised.size(view), viewpoint.populationSize()))
                {
                    lastPromisedView = view;
                    startDelivering();
                }
            }
        }
    }

    protected void checkIfAmDistinguishedDeliverer(boolean restart) throws Exception
    {
        if (viewpoint.amIPrimary())
        {
            if (!iAmDistinguishedDeliverer) { sendPrepare(); }
            if (iAmDistinguishedDeliverer && restart) { stopDelivering(); sendPrepare(); }
        }
        else if (iAmDistinguishedDeliverer) { stopDelivering(); }
    }

    protected void sendPrepare() throws Exception
    {
        if (lastReceivedMessage == null) return;
        SoapMessage message = lastReceivedMessage;
        LambHeader header = message.getLambHeader();
        header.setAction(getPrepareMessageURI());
        header.setAttribute("view", viewpoint.getView());
        header.setReplyTo(wsdl.getEndpoint());
        sendTo(message, coServiceEndpoints);
    }

    protected void startDelivering() throws Exception
    {
        iAmDistinguishedDeliverer = true;
        if (Main.traceMap.get("FtTrace")) System.out.println("Patmos.startDelivering()");
        if (future != null) future.cancel(true);
        future = executor.scheduleAtFixedRate(new Deliverer(), 0, 20, TimeUnit.MILLISECONDS);
    }

    protected void stopDelivering() throws Exception
    {
        iAmDistinguishedDeliverer = false;
        if (Main.traceMap.get("FtTrace")) System.out.println("Patmos.stopDelivering()");
        if (future != null) { future.cancel(true); future = null; }
    }

    protected void reset() throws Exception
    {
        clockwork.reset();
    }

    protected abstract String getPrepareMessageURI();
    protected abstract String getPromiseMessageURI();
    protected abstract String getNackMessageURI();
    protected abstract String getDeliveredMessageURI();

    public void Delivered(SoapMessage message) throws Exception
    {
        String correlation = message.getLambHeader().getCorrelationId().toString();
        clockwork.syncOut(getDeliveredMessageURI(), correlation);
        in.removeAll(correlation);
    }

    @Override public void viewChanged(int newView) throws Exception
    {
        checkIfAmDistinguishedDeliverer(true);
    }

    @Override public void clockworkSyncFail(String inMessageURI, String outMessageURI, String correlation) throws Exception
    {
        if (Main.traceMap.get("FtTrace")) System.out.println("Clockwork.syncFail("+correlation+")");
        viewpoint.advanceView();
    }

    @Override public void crashed(String peerId) throws Exception
    {
        viewpoint.remove(peerId);
    }

    protected DLog in = new DLog("in");
    protected DLog promised = new DLog("promised");
    protected Object promiseMutex = new Object();
    protected int currentDeliveringView = -1;
    protected int lastPromisedView = -1;
    protected boolean iAmDistinguishedDeliverer = false;
    protected SoapMessage lastReceivedMessage = null;
    protected List coServiceEndpoints = null;
    protected Future future = null;
    protected ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);

    protected class Deliverer implements Runnable
    {
        public void run() { try { safeRun(); } catch (Exception e) { e.printStackTrace(); } }

        protected void safeRun() throws Exception
        {
            if (iAmDistinguishedDeliverer)
            {
                SoapMessage message = in.popMessage();
                if (message != null) deliverMessage(message);
            }
        }
    }
}
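The listings in this appendix call isQuorum(count, population) but do not show its definition. As a minimal sketch only, and assuming the quorum rule is a strict majority of the current population (the framework's actual rule may differ, for example under a Byzantine failure model), such a test could look as follows. The class name QuorumSketch is hypothetical and is not part of the framework.

```java
// Hypothetical helper, not taken from the thesis code: the listings use
// isQuorum(count, population) without defining it.
final class QuorumSketch {
    private QuorumSketch() {}

    /**
     * Assumed rule: a quorum is a strict majority of the known population.
     * count is the number of matching replies logged so far.
     */
    static boolean isQuorum(int count, int population) {
        if (population <= 0) return false; // no known members, so no quorum
        return count > population / 2;     // e.g. 2 of 3, 3 of 4, 3 of 5
    }

    public static void main(String[] args) {
        System.out.println(isQuorum(2, 3)); // majority of three replicas
        System.out.println(isQuorum(2, 4)); // half of four is not a majority
    }
}
```

Under this assumption any two quorums drawn from the same population share at least one member, which is the property the Promise handler relies on before it starts delivering.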

Listing A.2: Atakos Implemented in Java

package net.wspbft.ft.protocol;
import java.util.List {...};

public abstract class Atakos extends Service implements CrashListener
{
    @Override public void init(WSDL wsdl, Sandbox sandbox, String serviceURI, String instanceId) throws Exception
    {
        super.init(wsdl, sandbox, serviceURI, instanceId);
        eternity.addCrashListener(this);
        Main.executor.scheduleAtFixedRate(nCalculator, 0, getRepopulateNCycle(), TimeUnit.MILLISECONDS);
    }

    @Override public void destroy() throws Exception
    {
        eternity.removeCrashListener(this);
        done.destroy();
        in.destroy();
        if (future != null && !future.isCancelled()) future.cancel(true);
    }

    @Override public void crashed(String peerId) throws Exception
    {
        nCalculator.run();
    }

    public void incoming(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        header.resetAction();
        String correlation = header.getCorrelationId().toString();
        boolean canMerge = false;
        in.add(correlation, message);
        synchronized (mutex)
        {
            if (done.size(correlation) == 0)
                if (isQuorum(in.size(correlation), n)) { done.add(correlation, new Object()); canMerge = true; }
        }
        if (canMerge) startMerge(correlation);
    }

    protected void startMerge(String correlation) throws Exception
    {
        Thread.sleep(10);
        sendFunctionalLocal(merge(in.getAllMessages(correlation)));
        in.removeAll(correlation);
    }

    protected abstract long getRepopulateNCycle();
    protected abstract SoapMessage merge(List messages) throws Exception;

    protected abstract String getDiverseServiceURI();
    protected DLog in = new DLog("in");
    protected DLog done = new DLog("done");
    protected Object mutex = new Object();

    protected int n = 0;
    protected Random random = new Random();
    protected ScheduledFuture future = null;
    protected NCalculator nCalculator = new NCalculator();

    protected class NCalculator implements Runnable
    {
        public void run() { try { safeRun(); } catch (Exception e) {} }
        protected void safeRun() throws Exception { n = lamb.discover(getDiverseServiceURI()).size(); }
    }
}
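Atakos declares merge(List messages) as abstract, so the decision logic is left to each concrete service. One possible shape for it, sketched here under the assumption that each reply can be reduced to a comparable string, is a majority voter over the collected replies. The class MajorityVoterSketch and its vote method are hypothetical names introduced for illustration; they operate on plain strings and not on the framework's SoapMessage type.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of a merge step: majority voting over replies.
final class MajorityVoterSketch {
    private MajorityVoterSketch() {}

    /**
     * Returns the first reply value that more than half of the n versions
     * agree on, or empty if no value reaches a strict majority.
     */
    static Optional<String> vote(List<String> replies, int n) {
        Map<String, Integer> tally = new HashMap<>();
        for (String reply : replies) {
            if (reply == null) continue;                      // ignore missing replies
            int votes = tally.merge(reply, 1, Integer::sum);  // count this reply
            if (votes > n / 2) return Optional.of(reply);     // strict majority reached
        }
        return Optional.empty();                              // versions disagree
    }
}
```

A concrete merge could extract the payload of each message, call vote with the current value of n, and forward the message that carries the winning payload. How a failed vote should be handled is a design choice that the listing does not prescribe.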

Listing A.3: Ionian Implemented in Java

package net.wspbft.ft.protocol;
import java.util.List {...};

public abstract class Ionian extends Service implements CrashListener, ClockworkSyncFailListener, ViewChangeListener
{
    @Override public void init(WSDL wsdl, Sandbox sandbox, String serviceURI, String instanceId) throws Exception
    {
        super.init(wsdl, sandbox, serviceURI, instanceId);
        clockwork.setLog(new DLog("out"));
        clockwork.addSyncFailListener(this);
        viewpoint.addListener(this);
        eternity.addCrashListener(this);
    }

    @Override public void destroy() throws Exception
    {
        eternity.removeCrashListener(this);
        clockwork.removeSyncFailListener(this);
        in.destroy();
        promised.destroy();
        accepted.destroy();
        learnt.destroy();
        undelivered.destroy();
        viewpoint.removeListener(this);
    }

    public void incoming(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        String correlation = header.getCorrelationId().toString();
        in.add(correlation, message);
        lastMessageReceived = message;
        viewpoint.setPopulation(header.getCoServiceParticipants(serviceURI));
        coServiceEndpoints = header.getCausalEndpoints(serviceURI);
        checkIfIAmDistinguishedProposer(false);
        clockwork.syncIn(header.getMessageURI(), correlation);
    }

    protected void sendPrepare() throws Exception
    {
        if (lastMessageReceived != null)
        {
            if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.SendPrepare("+viewpoint.getView()+")");
            LambHeader header = lastMessageReceived.getLambHeader();
            header.setAttribute("view", viewpoint.getView());
            header.setAction(getPrepareMessageURI());
            header.setReplyTo(wsdl.getEndpoint());
            sendTo(lastMessageReceived, coServiceEndpoints);
        }
    }

    protected void sendProposal() throws Exception
    {
        SoapMessage message = getNextMessage();
        if (message == null) return;
        allowOneProposal = false;
        int view = viewpoint.getView();
        int sequence = getGlobalSequence(view, proposalInstance++);
        LambHeader header = message.getLambHeader();
        header.setAction(getProposalMessageURI());
        header.setAttribute("view", view);
        header.setAttribute("seq", sequence);
        header.setReplyTo(wsdl.getEndpoint());
        if (Main.traceMap.get("FtTrace"))
            System.out.println("Ionian.SendProposal("+sequence+", "+header.getCorrelationId()+", to="+coServiceEndpoints.size()+")");
        sendTo(message, coServiceEndpoints);
    }

    protected SoapMessage getNextMessage() throws Exception
    {
        SoapMessage message = null;
        if (undelivered.size() > 0) { String correlation = choose(); if (correlation != null) message = in.getFirstMessage(correlation); }
        if (message == null) message = in.popMessage();
        return message;
    }

    public void Nack(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        int view = header.getAttributeAsInt("view");
        if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.Nack(updateTo="+view+", Current="+viewpoint.getView()+")");
        if (view > viewpoint.getView()) viewpoint.setView(view);
    }

    public void Promise(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        synchronized (promiseMutex)
        {
            int view = header.getAttributeAsInt("view");
            if (view > lastPromisedView && view == viewpoint.getView())
            {
                registerChoice(header.getAttribute("ud"));
                promised.add(view, new Object());
                if (isQuorum(promised.size(view), viewpoint.populationSize()))
                {
                    allowOneProposal = true;
                    lastPromisedView = view;
                    startProposing();
                }
            }
        }
    }

    public void Learnt(SoapMessage message) throws Exception
    {
        synchronized (learntMutex)
        {
            LambHeader header = message.getLambHeader();
            int sequence = header.getAttributeAsInt("seq");
            if (iAmDistinguishedProposer && sequence > lastLearntSequence)
            {
                learnt.add(sequence, new Object());
                if (isQuorum(learnt.size(sequence), viewpoint.populationSize()))
                {
                    lastLearntSequence = sequence;
                    allowOneProposal = true;
                }
            }
        }
    }

    public void Prepare(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        int view = header.getAttributeAsInt("view");
        if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.Prepare("+view+")");
        if (view > currentAcceptingView)
        {
            reset();
            currentAcceptingView = view;
            header.setAction(getPromiseMessageURI());
            String correlation = getFirstUndeliveredCorrelation();
            if (correlation != null) header.setAttribute("ud", correlation);
            if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.SendPromise("+view+")");
        }
        else if (view < currentAcceptingView)
        {
            header.setAttribute("view", viewpoint.getView());
            header.setAction(getNackMessageURI());
            if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.SendNack("+view+")");
        }
        reply(message);
    }

    public void Propose(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        clockwork.syncOut(getProposalMessageURI(), header.getCorrelationId().toString());
        if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.Propose("+header.getAttributeAsInt("view")+", "+header.getAttributeAsInt("seq")+")");
        int sequence = header.getAttributeAsInt("seq");
        if (sequence > lastProposedSequence)
        {
            if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.SendAccepted("+header.getAttributeAsInt("view")+", "+header.getAttributeAsInt("seq")+")");
            lastProposedSequence = sequence;
            header.setAction(getAcceptedMessageURI());
            sendTo(message, getCoServiceEndpoints(header));
        }
    }

    public void Accepted(SoapMessage message) throws Exception
    {
        boolean canLearn = false;
        LambHeader header = message.getLambHeader();
        int sequence = header.getAttributeAsInt("seq");
        synchronized (acceptedMutex)
        {
            if (sequence > lastAcceptedSequence)
            {
                accepted.add(sequence, message);
                if (isQuorum(accepted.size(sequence), viewpoint.populationSize()))
                {
                    lastAcceptedSequence = sequence;
                    canLearn = true;
                }
            }
        }
        if (canLearn)
        {
            String correlation = message.getLambHeader().getCorrelationId().toString();
            clockwork.syncOut(getLearntMessageURI(), correlation);
            in.removeAll(correlation);
            accepted.removeAll(sequence);
            header.resetAction();
            if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.delivered(" + sequence + " = " + correlation + ")");
            sendFunctionalLocal(message);
            header.setAction(getLearntMessageURI());
            reply(message);
        }
    }

    @Override public void crashed(String peerId) throws Exception
    {
        viewpoint.remove(peerId);
    }

    @Override public void clockworkSyncFail(String inMessageURI, String outMessageURI, String correlation) throws Exception
    {
        if (Main.traceMap.get("FtTrace")) System.out.println("Clockwork.syncFail("+correlation+")");
        viewpoint.advanceView();
    }

    @Override public void viewChanged(int newView) throws Exception
    {
        checkIfIAmDistinguishedProposer(true);
    }

    protected void reset() throws Exception
    {
        clockwork.reset();
        accepted.removeAll();
    }

    protected String choose()
    {
        String correlation = null;
        for (Object index : undelivered.getIndexes())
            if (isQuorum(undelivered.size(index), viewpoint.populationSize())) correlation = (String) index;
        undelivered.removeAll();
        return correlation;
    }

    protected void registerChoice(String correlation) throws Exception
    {
        if (correlation != null) undelivered.add(correlation, new Object());
    }

    protected String getFirstUndeliveredCorrelation() throws Exception
    {
        String correlation = null;
        SoapMessage message = in.getFirstMessage(null);
        if (message != null) correlation = message.getLambHeader().getCorrelationId().toString();
        return correlation;
    }

    protected void checkIfIAmDistinguishedProposer(boolean restart) throws Exception
    {
        boolean iAmPrimary = viewpoint.amIPrimary();
        if (restart && iAmDistinguishedProposer && iAmPrimary) { stopProposing(); sendPrepare(); }
        else if (!iAmDistinguishedProposer && iAmPrimary) { sendPrepare(); }
        else if (iAmDistinguishedProposer && !iAmPrimary) { stopProposing(); }
    }

    protected void startProposing() throws Exception
    {
        iAmDistinguishedProposer = true;
        if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.startProposing()");
        proposalInstance = 1;
        if (future != null) future.cancel(true);
        future = executor.scheduleAtFixedRate(getProposer(), 0, 100, TimeUnit.MILLISECONDS);
    }

    protected void stopProposing() throws Exception
    {
        iAmDistinguishedProposer = false;
        if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.stopProposing()");
        if (future != null) { future.cancel(true); future = null; }
    }

    protected Runnable getProposer() { return new Proposer(); }

    protected abstract String getPrepareMessageURI();
    protected abstract String getNackMessageURI();
    protected abstract String getPromiseMessageURI();
    protected abstract String getProposalMessageURI();
    protected abstract String getAcceptedMessageURI();
    protected abstract String getLearntMessageURI();

    protected List coServiceEndpoints = null;
    protected DLog in = new DLog("in");
    protected DLog promised = new DLog("promised");
    protected DLog accepted = new DLog("accepted");
    protected DLog learnt = new DLog("learnt");
    protected DLog undelivered = new DLog("undelivered");
    protected SoapMessage lastMessageReceived = null;
    protected int lastPromisedView = -1;
    protected int lastProposedSequence = -1;
    protected int lastAcceptedSequence = -1;
    protected int lastLearntSequence = -1;
    protected int currentAcceptingView = -1;
    protected int proposalInstance = 1;
    protected boolean iAmDistinguishedProposer = false;
    protected boolean allowOneProposal = false;
    protected Object promiseMutex = new Object();
    protected Object acceptedMutex = new Object();
    protected Object learntMutex = new Object();
    protected Future future = null;
    protected ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);

    protected class Proposer implements Runnable
    {
        public void run() { try { safeRun(); } catch (Exception e) { e.
p r i n t S t a c k T r a c e ( ) ; } } 304 protected void safeRun ( ) throws Exception { i f ( allowOneProposal ) sendProposal ( ) ; } 305 } 306 }
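The role transition performed by checkIfIAmDistinguishedProposer in the listing above can be isolated as a pure decision function, which makes the three branches easier to check. The following is a minimal sketch under stated assumptions: the class ProposerRoleSketch, the Action enum and the method decide are hypothetical names introduced here for illustration and are not part of the framework.

```java
// Hypothetical sketch: ProposerRoleSketch, Action and decide() are illustrative
// names only; they mirror the branch structure of checkIfIAmDistinguishedProposer.
final class ProposerRoleSketch {

    enum Action {
        STOP_THEN_PREPARE, // listing: stopProposing(); sendPrepare();
        SEND_PREPARE,      // listing: sendPrepare();
        STOP_PROPOSING,    // listing: stopProposing();
        NONE               // no branch taken
    }

    // Same condition order as the listing: restart case first, then promotion,
    // then demotion.
    static Action decide(boolean restart, boolean iAmDistinguishedProposer, boolean iAmPrimary) {
        if (restart && iAmDistinguishedProposer && iAmPrimary) return Action.STOP_THEN_PREPARE;
        else if (!iAmDistinguishedProposer && iAmPrimary) return Action.SEND_PREPARE;
        else if (iAmDistinguishedProposer && !iAmPrimary) return Action.STOP_PROPOSING;
        return Action.NONE;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        // Enumerate all eight input combinations.
        for (boolean restart : values)
            for (boolean proposer : values)
                for (boolean primary : values)
                    System.out.println("restart=" + restart + " proposer=" + proposer
                            + " primary=" + primary + " -> " + decide(restart, proposer, primary));
    }
}
```

Note that a node which is already the distinguished proposer and remains primary takes no action unless restart is set, which is the case the viewChanged callback triggers.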

Listing A.4: IonianNB Implemented in Java

package net.wspbft.ft.protocol;
import java.util.HashSet {...};

public abstract class IonianNB extends Ionian
{
    @Override protected String getLearntMessageURI() { return null; }

    @Override public void Propose(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        int sequence = header.getAttributeAsInt("seq");
        if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.Propose(" + header.getAttributeAsInt("view") + "," + header.getAttributeAsInt("seq") + ")");
        if (sequence > lastProposedSequence)
        {
            lastProposedSequence = sequence;
            header.setAction(getAcceptedMessageURI());
            sendTo(message, getCoServiceEndpoints(header));
        }
        int view = header.getAttributeAsInt("view");
        if (view > viewpoint.getView()) viewpoint.setView(view);
    }

    @Override public void Accepted(SoapMessage message) throws Exception
    {
        boolean canLearn = false;
        LambHeader header = message.getLambHeader();
        int sequence = header.getAttributeAsInt("seq");
        int view = header.getAttributeAsInt("view");
        synchronized (acceptedMutex)
        {
            if (view == currentAcceptingView && !acceptedSequences.contains(sequence))
            {
                accepted.add(sequence, new Object());
                if (isQuorum(accepted.size(sequence), viewpoint.populationSize()))
                {
                    acceptedSequences.add(sequence);
                    canLearn = true;
                }
            }
        }
        if (canLearn) { accepted.removeAll(sequence); deliverInOrder(getLocalSequence(sequence), message); }
    }

    protected void deliverInOrder(int sequence, SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        if (Main.traceMap.get("FtTrace")) System.out.println("Ionian.Pending(" + header.getCorrelationId() + ")");
        if (sequence > lastDeliveredSequence)
        {
            pendingDelivery.add(sequence, message);
            if (sequence == lastDeliveredSequence + 1 || lastDeliveredSequence < 1)
            {
                while (pendingDelivery.size(sequence) > 0)
                {
                    deliver(pendingDelivery.getFirstMessage(sequence));
                    lastDeliveredSequence = sequence;
                    sequence = sequence + 1;
                }
            }
        }
    }

    protected void deliver(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        String correlation = header.getCorrelationId().toString();
        clockwork.syncOut(getDeliveredMessageURI(), correlation);
        if (Main.traceMap.get("FtTrace")) System.out.println("IonianNB.Delivered(" + header.getAttribute("seq") + " = " + correlation + ")");
        in.removeAll(correlation);
        header.resetAction();
        sendFunctionalLocal(message);
    }

    @Override protected void reset() throws Exception
    {
        lastDeliveredSequence = 0;
        acceptedSequences.clear();
        pendingDelivery.removeAll();
        super.reset();
    }

    @Override protected Runnable getProposer() { return new NBProposer(); }

    protected abstract String getDeliveredMessageURI();

    protected Set acceptedSequences = new HashSet();
    protected int lastDeliveredSequence = 0;
    protected DLog pendingDelivery = new DLog("PendingDelivery");

    protected class NBProposer implements Runnable
    {
        public void run() { try { safeRun(); } catch (Exception e) { e.printStackTrace(); } }
        protected void safeRun() throws Exception { sendProposal(); }
    }
}
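The in-order delivery step used by deliverInOrder above buffers decided messages by sequence number and releases them only when the next expected sequence is available. The following is a minimal, self-contained sketch of that idea under stated assumptions: the class InOrderDeliverySketch is a hypothetical name, a HashMap stands in for the DLog pending-delivery log, and messages are plain strings rather than SoapMessage objects. Unlike the listing, the sketch removes a message from the buffer once it has been delivered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: InOrderDeliverySketch is an illustrative stand-in for the
// DLog-backed pendingDelivery buffer; it is not part of the framework.
final class InOrderDeliverySketch {
    private final Map<Integer, String> pending = new HashMap<>();
    private final List<String> delivered = new ArrayList<>();
    private int lastDeliveredSequence = 0;

    void deliverInOrder(int sequence, String message) {
        // Ignore sequences that have already been delivered.
        if (sequence <= lastDeliveredSequence) return;
        pending.put(sequence, message);
        // Same guard as the listing: deliver when this is the next sequence,
        // or when nothing has been delivered yet.
        if (sequence == lastDeliveredSequence + 1 || lastDeliveredSequence < 1) {
            // Drain every consecutive buffered sequence.
            while (pending.containsKey(sequence)) {
                delivered.add(pending.remove(sequence));
                lastDeliveredSequence = sequence;
                sequence = sequence + 1;
            }
        }
    }

    List<String> delivered() { return delivered; }

    public static void main(String[] args) {
        InOrderDeliverySketch sketch = new InOrderDeliverySketch();
        sketch.deliverInOrder(1, "a");
        sketch.deliverInOrder(3, "c"); // buffered: sequence 2 is still missing
        sketch.deliverInOrder(2, "b"); // releases 2 and then the buffered 3
        System.out.println(sketch.delivered());
    }
}
```

The guard lastDeliveredSequence < 1 means that the first message decided after a reset is delivered whatever its sequence number, which matches the condition in the listing.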

Listing A.5: Andros Implemented in Java

package net.wspbft.ft.protocol;
import java.util.HashSet {...};

public abstract class Andros extends Service implements CrashListener, ClockworkSyncFailListener
{
    @Override public void init(WSDL wsdl, Sandbox sandbox, String serviceURI, String instanceId) throws Exception
    {
        super.init(wsdl, sandbox, serviceURI, instanceId);
        eternity.addCrashListener(this);
        clockwork.setLog(new DLog("out"));
        clockwork.addSyncFailListener(this);
    }

    @Override public void destroy() throws Exception
    {
        eternity.removeCrashListener(this);
        clockwork.removeSyncFailListener(this);
        in.destroy();
        viewChanges.destroy();
        preprepared.destroy();
        prepared.destroy();
        committed.destroy();
        pendingDelivery.destroy();
    }

    public void incoming(SoapMessage message) throws Exception
    {
        if (authenticateMessage(message))
        {
            LambHeader header = message.getLambHeader();
            participants = header.getCoServiceParticipants(serviceURI);
            inspectParticipants(participants);
            coServiceEndpoints = header.getCausalEndpoints(serviceURI);
            if (header.getTimestamp() > lastDeliveredTimestamp)
            {
                String correlation = header.getCorrelationId().toString();
                lastReceivedSoapMessage = message;
                in.add(correlation, message);
                clockwork.syncIn(header.getMessageURI(), correlation);
                checkIfIshouldBePrimary();
            }
        }
    }

    protected void sendPrePrepare(SoapMessage message) throws Exception
    {
        if (message == null) return;
        LambHeader header = message.getLambHeader();
        header.setAttribute("view", view);
        header.setAttribute("seq", primarySequence++);
        header.setAction(getPrePrepareMessageURI());
        gatekeeper.generateDigest(message);
        signMessage(message, participants);
        if (Main.traceMap.get("FtTrace")) System.out.println("Andros.sendPREPREPARE(" + header.getCorrelationId() + ")");
        sendTo(message, coServiceEndpoints);
    }

    public void PrePrepare(SoapMessage message) throws Exception
    {
        GatekeeperHeader gatekeeperHeader = message.getGatekeeperHeader();
        LambHeader header = message.getLambHeader();
        if (Viewpoint.isPrimary(view, participants, gatekeeperHeader.getSigner()) && authenticateMessage(message))
        {
            int receivedView = header.getAttributeAsInt("view");
            if (receivedView == view)
            {
                int sequence = header.getAttributeAsInt("seq");
                int globalSequence = getGlobalSequence(view, sequence);
                String digest = gatekeeperHeader.getDigest();
                if (preprepared.size(globalSequence) == 0 || preprepared.getFirst(globalSequence).equals(digest))
                {
                    preprepared.add(globalSequence, digest);
                    header.setAction(getPrepareMessageURI());
                    signMessage(message, participants);
                    sendTo(message, coServiceEndpoints);
                }
            }
        }
    }

    public void Prepare(SoapMessage message) throws Exception
    {
        if (authenticateMessage(message))
        {
            LambHeader header = message.getLambHeader();
            int receivedView = header.getAttributeAsInt("view");
            if (receivedView == view)
            {
                boolean allowCommit = false;
                synchronized (prepareMutex)
                {
                    String nodeId = message.getGatekeeperHeader().getSigner();
                    int sequence = header.getAttributeAsInt("seq");
                    int globalSequence = getGlobalSequence(view, sequence);
                    if (!locallyPreparedSet.contains(sequence))
                    {
                        if (!prepared.contains(globalSequence, nodeId)) prepared.add(globalSequence, nodeId);
                        if (isPreparedQuorum(prepared.size(globalSequence), participants.size()) && preprepared.size(globalSequence) > 0)
                        {
                            String digest = (String) preprepared.getFirst(globalSequence);
                            if (digest != null && digest.equals(message.getGatekeeperHeader().getDigest()))
                            {
                                locallyPreparedSet.add(sequence);
                                allowCommit = true;
                            }
                        }
                    }
                }
                if (allowCommit)
                {
                    header.setAction(getCommitMessageURI());
                    signMessage(message, participants);
                    sendTo(message, coServiceEndpoints);
                }
            }
        }
    }

    public void Commit(SoapMessage message) throws Exception
    {
        if (authenticateMessage(message))
        {
            LambHeader header = message.getLambHeader();
            int receivedView = header.getAttributeAsInt("view");
            if (receivedView == view)
            {
                boolean allowDelivery = false;
                synchronized (commitMutex)
                {
                    String nodeId = message.getGatekeeperHeader().getSigner();
                    int sequence = header.getAttributeAsInt("seq");
                    int globalSequence = getGlobalSequence(view, sequence);
                    if (!locallyCommittedSet.contains(sequence))
                    {
                        if (!committed.contains(globalSequence, nodeId)) committed.add(globalSequence, nodeId);
                        if (isCommittedQuorum(committed.size(globalSequence), participants.size()))
                        {
                            if (committed.size(globalSequence) == 4)
                            {
                                List peerNames = new Vector();
                                for (Object nid : committed.getAll(globalSequence))
                                    peerNames.add(sandbox.getPeerManager().lookupPeer((String) nid));
                                System.out.println("Good Four: " + peerNames);
                            }
                            locallyCommittedSet.add(sequence);
                            allowDelivery = true;
                        }
                    }
                    if (allowDelivery) deliverInOrder(sequence, message);
                }
            }
        }
    }

    protected void deliverInOrder(int sequence, SoapMessage message) throws Exception
    {
        if (Main.traceMap.get("FtTrace")) System.out.println("Andros.deliverInOrder(" + sequence + "," + message.getLambHeader().getCorrelationId() + ")");
        if (sequence > lastDeliveredSequence)
        {
            pendingDelivery.add(sequence, message);
            if (sequence == lastDeliveredSequence + 1 || lastDeliveredSequence < 1)
            {
                while (pendingDelivery.size(sequence) > 0)
                {
                    deliver(pendingDelivery.getFirstMessage(sequence));
                    lastDeliveredSequence = sequence;
                    sequence = sequence + 1;
                }
            }
        }
    }

    protected void deliver(SoapMessage message) throws Exception
    {
        LambHeader header = message.getLambHeader();
        header.resetAction();
        lastDeliveredTimestamp = header.getTimestamp();
        String correlation = header.getCorrelationId().toString();
        clockwork.syncOut(getDeliveredMessageURI(), correlation);
        in.removeAll(correlation);
        if (Main.traceMap.get("FtTrace")) System.out.println("Andros.deliver(" + correlation + ")");
        sendFunctionalLocal(message);
    }

    protected void sendViewChange() throws Exception
    {
        if (lastReceivedSoapMessage == null || coServiceEndpoints == null) return;
        SoapMessage message = lastReceivedSoapMessage;
        LambHeader header = message.getLambHeader();
        header.setAction(getViewChangeMessageURI());
        header.setAttribute("newView", view + 1);
        if (Main.traceMap.get("FtTrace")) System.out.println("Andros.sendViewChange(" + (view + 1) + ")");
        gatekeeper.signMessage(message);
        sendTo(message, coServiceEndpoints);
    }

    public void ViewChange(SoapMessage message) throws Exception
    {
        if (gatekeeper.authenticateMessage(message))
        {
            int newView = message.getLambHeader().getAttributeAsInt("newView");
            String peerId = message.getGatekeeperHeader().getSigner();
            boolean hasChangedView = false;
            synchronized (viewChangeMutex)
            {
                if (newView > view)
                {
                    if (!viewChanges.contains(newView, peerId)) viewChanges.add(newView, peerId);
                    if (isWeakQuorum(viewChanges.size(newView), participants.size()))
                    {
                        view = newView;
                        hasChangedView = true;
                    }
                }
            }
            if (hasChangedView)
            {
                if (Main.traceMap.get("FtTrace")) System.out.println("Andros.changeView(" + view + ")");
                viewChanges.removeAll(newView);
                reset();
                checkIfIshouldBePrimary();
            }
        }
    }

    protected void reset() throws Exception
    {
        clockwork.reset();
        preprepared.removeAll();
        prepared.removeAll();
        committed.removeAll();
        pendingDelivery.removeAll();
        primarySequence = 1;
        lastDeliveredSequence = 0;
        locallyPreparedSet.clear();
        locallyCommittedSet.clear();
    }

    protected void inspectParticipants(List participants) throws Exception
    {
        if (lastParticipantCount < 0) lastParticipantCount = participants.size();
        else if (participants.size() != lastParticipantCount)
        {
            lastParticipantCount = participants.size();
            sendViewChange();
        }
    }

    @Override public void clockworkSyncFail(String inMessageURI, String outMessageURI, String correlation) throws Exception
    {
        sendViewChange();
    }

    @Override public void crashed(String peerId) throws Exception
    {
        if (participants.remove(peerId)) inspectParticipants(participants);
    }

    protected abstract void signMessage(SoapMessage message, List participants) throws Exception;
    protected abstract boolean authenticateMessage(SoapMessage message) throws Exception;
    protected abstract String getViewChangeMessageURI();
    protected abstract String getPrePrepareMessageURI();
    protected abstract String getPrepareMessageURI();
    protected abstract String getCommitMessageURI();
    protected abstract String getDeliveredMessageURI();

    protected void checkIfIshouldBePrimary() throws Exception
    {
        if (participants == null) return;
        boolean isPrimary = Viewpoint.isPrimary(view, participants, localPeerId);
        if (isPrimary && !iAmPrePreparing) startPrePreparing();
        else if (iAmPrePreparing && !isPrimary) stopPrePreparing();
    }

    protected void startPrePreparing() throws Exception
    {
        iAmPrePreparing = true;
        primarySequence = 1;
        if (future != null) future.cancel(true);
        future = executor.scheduleAtFixedRate(new PrePreparer(), 0, 200, TimeUnit.MILLISECONDS);
    }

    protected void stopPrePreparing() throws Exception
    {
        iAmPrePreparing = false;
        if (future != null) { future.cancel(true); future = null; }
    }

    protected boolean isWeakQuorum(int amount, int n) { int f = (n - 1) / 3; return amount > f; }
    protected boolean isPreparedQuorum(int amount, int n) { int f = (n - 1) / 3; return amount >= 2 * f; }
    protected boolean isCommittedQuorum(int amount, int n) { int f = (n - 1) / 3; return amount > 2 * f; }

    protected DLog in = new DLog("in");
    protected DLog preprepared = new DLog("PrePrepared");
    protected DLog prepared = new DLog("Prepared");
    protected DLog committed = new DLog("Committed");
    protected DLog pendingDelivery = new DLog("PendingDelivery");
    protected DLog viewChanges = new DLog("View Changes");
    protected Set locallyPreparedSet = new HashSet();
    protected Set locallyCommittedSet = new HashSet();
    protected Object prepareMutex = new Object();
    protected Object commitMutex = new Object();
    protected Object viewChangeMutex = new Object();
    protected int view = 1;
    protected int primarySequence = 1;
    protected int lastDeliveredSequence = 0;
    protected long lastDeliveredTimestamp = 0;
    protected int lastAcceptedViewChange = 1;
    protected SoapMessage lastReceivedSoapMessage = null;
    boolean iAmPrePreparing = false;
    protected Future future = null;
    protected ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
    protected List participants = new Vector();
    protected List coServiceEndpoints = null;
    protected int lastParticipantCount = -1;

    protected class PrePreparer implements Runnable
    {
        public void run() { try { safeRun(); } catch (Exception e) { e.printStackTrace(); } }
        protected void safeRun() throws Exception
        {
            SoapMessage message = in.popMessage();
            if (message != null) sendPrePrepare(message);
        }
    }
}
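The three quorum predicates at the end of the Andros listing all derive the tolerated fault count as f = (n − 1) / 3 using integer division. The following is a minimal sketch that reproduces those predicates in isolation so that the thresholds can be checked for a given group size. The class name QuorumSketch and the helper maxFaulty are hypothetical names introduced for illustration; the three predicate bodies follow the listing.

```java
// Hypothetical sketch: QuorumSketch and maxFaulty() are illustrative names.
// The predicate bodies mirror isWeakQuorum, isPreparedQuorum and
// isCommittedQuorum from the Andros listing.
final class QuorumSketch {

    // Integer division, as in the listing: f = (n - 1) / 3.
    static int maxFaulty(int n) { return (n - 1) / 3; }

    // More than f matching messages (at least f + 1).
    static boolean isWeakQuorum(int amount, int n) { return amount > maxFaulty(n); }

    // At least 2f matching messages.
    static boolean isPreparedQuorum(int amount, int n) { return amount >= 2 * maxFaulty(n); }

    // More than 2f matching messages (at least 2f + 1).
    static boolean isCommittedQuorum(int amount, int n) { return amount > 2 * maxFaulty(n); }

    public static void main(String[] args) {
        for (int n : new int[] {4, 7}) {
            int f = maxFaulty(n);
            System.out.println("n=" + n + ": f=" + f
                    + ", weak quorum needs " + (f + 1)
                    + ", prepared quorum needs " + (2 * f)
                    + ", committed quorum needs " + (2 * f + 1));
        }
    }
}
```

For the n = 7 groups used in the scenarios of Appendix C, this gives f = 2, so a view change needs 3 matching messages, the prepared predicate needs 4 and the committed predicate needs 5.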


Appendix B

Screenshots

In this appendix we show some screenshots of the full web-based user interface.

Figure B.1: Doping and Metrics Screenshot


Figure B.2: LAMB Service Discovery Screenshot


Appendix C

Additional Scenarios

This appendix contains the scenarios that overflow from the evaluation in Chapter 6.

C.1

Runtime Reconfiguration Test Case

Figure C.1: Runtime Reconfiguration Results.

Figure C.1 shows a scenario based on normal operation in which new FT services were deployed at set intervals, to demonstrate that LAMB would preferentially choose services with better matching QoS metrics (in this case reliability metrics). It demonstrated that the framework is able to autonomously discover services and to differentiate those services using metrics.

C.2

Group Membership Test Case

This test case consisted of a number of scenarios demonstrating that the framework was capable of FT protocol adaption by allowing instances to join, leave and fail-stop. Collectively, the group membership scenarios proved that our framework supports the group membership abstraction.

C.2.1

Sequential Group Membership

The goal of this scenario was to prove that the configurations allowed new instances of a FT service to join the protocol at runtime. This scenario, in conjunction with sequential fail-stop (Section C.2.3), demonstrated that our platform was able to perform group membership. The scenario was established by starting n = 7 nodes and shutting n − 1 of them down. For the duration of the scenario requests were injected at a constant rate. Nodes were then started sequentially at fixed intervals until n were running. This scenario was not tested against the No-FT configuration. Figure C.2 demonstrated that all the FT services could maintain operation whilst new instances were joining the FT protocol one by one. The large spikes in latency when transitioning between 1 and 2 instances demonstrated the problem of reaching agreement with two instances.

C.2.2

Concurrent Group Membership

This scenario was used to demonstrate that several FT service instances could join a protocol concurrently at runtime. It was started with 1 node and, after a short period, another n − 1 were started concurrently. This scenario was run against all configurations except No-FT with n = 7, in conjunction with soak injection. Figure C.3 showed that all the FT services transitioned between 1 and n instances without message loss, and were therefore able to perform concurrent group membership.

Figure C.2: Sequential Group Membership.

C.2.3

Sequential Fail-Stop

The goal of this scenario was to show that all configurations (except No-FT) could tolerate sequential crashes provided there was one FT service instance still running. This scenario, in conjunction with sequential group membership (Section C.2.1), proved that WSPBFT could provide group membership. The crashes were in the fail-stop distributed system model because their detection was guaranteed by the Eternity failure detector [Hall09]. This scenario was established by starting n nodes (n = 7) and then crashing one periodically until all n had crashed. For the duration of the scenario we soak injected requests. Figure C.4 showed that all the protocols were able to tolerate sequential fail-stops until reaching n = 2. Again this demonstrated the difficulty of reaching agreement with two instances. In addition to Section C.2.1, this proved our framework could perform group membership.

Figure C.3: Concurrent Group Membership Results.

C.2.4

Concurrent Fail-Stop

The goal of this scenario was to demonstrate that the configurations could tolerate n − 1 but not n simultaneous fail-stops. The scenario consisted of the two boundary cases, in which n − 1 and then n instances were crashed concurrently. In both cases the scenario started with n = 7 instances and was executed in conjunction with soak injection. Figure C.5 showed that all the configurations tolerated n − 1 concurrent fail-stops. However, there was a long period of time before the protocols recovered, because Eternity took a long time to detect all the failures. Figure C.6 shows that no protocol could tolerate the fail-stop of n instances.

C.2.5

Primary Fail-Stop

The goal of this scenario was to demonstrate that the leader-oriented protocols (Patmos, Ionian, IonianNB and Andros) could tolerate the fail-stop of a primary FT service instance. This was the only scenario not to be automated by the doping mechanism. Instead, human intervention was required to monitor the console trace, identify a primary and then crash it using the cloud mechanism. Each of the configurations, with n = 5, was executed in conjunction with a soak injection of requests. Figure C.7 showed that all the protocols could tolerate the failure of primaries until n = 2, where again IonianNB and Andros struggled to reach agreement and hence failed.




Figure C.4: Sequential Fail-Stop Results.



Figure C.5: Concurrent (n − 1) Fail-Stop Results.



Figure C.6: Concurrent (n) Fail-Stop Results.




Figure C.7: Primary Fail-Stop Results.
