Novel Generic Middleware Building Blocks for Dependable Modular Avionics Systems

Marc Le Roy(1), Pascal Gula(2), Jean Charles Fabre(3), Gérard Le Lann(4), Eric Bornschlegl(5)

(1) ASTRIUM SAS, 31 avenue des Cosmonautes, 31402 Toulouse Cedex 4, France, +33 (0)5 62 19 78 76, [email protected]

(2) AXLOG Ingéniérie, 19-21 rue du 8 mai 1945, 94110 Arcueil, France, +33 (0)1 41 24 31 38, [email protected]

(3) LAAS-CNRS, 7 avenue du Colonel Roche, 31077 Toulouse, France, +33 (0)5 61 33 62 36, [email protected]

(4) INRIA, Domaine de Voluceau, B.P. 105, 78153 Le Chesnay, France, +33 (0)1 39 63 53 64, [email protected]

(5) European Space Agency, ESTEC, Keplerlaan 1, P.O. Box 299, 2200 AG Noordwijk, The Netherlands, +31 (0)71 565 3487, [email protected]

Abstract

The complexity of satellite and launcher data systems tends to increase rapidly, as well as their performance requirements. The efficiency of their development will depend on the availability and integration of standard software products (e.g. COTS) solving recurrent problems in space vehicle avionics systems. To address these problems, our objective in this project was to define a middleware support for a wide range of space systems and applications. We primarily focused on the core functionalities that this middleware should provide, namely basic building blocks that are highly required for both dependability and real-time requirements. During the first phase of the “Advanced Avionics Architecture and Modules” (A3M) activity (an ESA TRP activity), solutions based on innovative algorithms from INRIA were identified. The objective of the on-going second phase, to end in summer 2003, is to develop a basic middleware solving two application-centric problems and to evaluate the performance of the algorithms on a standard space-borne computer platform. The two selected problems are “distributed consistent processing under active redundancy” and “distributed replicated data consistency, program serialization and program atomicity”.

The A3M basic middleware is decomposed into three main layers: the SpaceWire communication management layer, the Uniform Consensus (UCS) and Uniform Coordination (UCN) protocols layer, and the application-level services layer. The SpaceWire communication management software, developed by Astrium, provides specific functionalities required by the UCS & UCN protocols. It also ensures that their implementation does not depend on a specific communication medium. The implementation of the UCS/UCN protocols layer, developed by Axlog as a library, has been verified through extensive and accurate testing under the real-time OS simulator RTSim prior to integration on the target platform. The main originality of this development was to fully develop the building blocks (in C) and to simulate their behaviour, including in the presence of faults. The highest layer of the middleware uses these protocols to provide the services solving the two selected problems to the application software. In particular, it relies on the customisation of the UCS and UCN protocol modules with application-specific plug-in functions. This type of solution may be applied to very different fields, from high-performance distributed computing to satellite formation flying coordination.


The final prototype implementing this new-generation middleware platform is based on three LEON SPARC microprocessors interconnected with point-to-point SpaceWire links, reusing a system-on-a-chip design developed in the frame of the Spacecraft Controller on a Chip (SCOC) activity for ESA.

1. Introduction and motivations

The complexity of satellite and launcher data management systems tends to increase rapidly, as well as their performance requirements. The efficiency of their development will depend on the availability and integration of standard software products (e.g. COTS) solving recurrent problems in space vehicle avionics systems. To address these problems, our objective in this project was to define a middleware support for a wide range of space systems and applications. Indeed, the evolution of space platforms is today faced with the inherent distribution of resources and must deal with strong real-time and dependability requirements. Handling real-time constraints in the implementation of fault tolerance strategies raises new problems and calls for new approaches, which is very challenging. The A3M (Advanced Avionics Architecture and Modules) project was launched to address these issues. The objective is to develop a new-generation platform for a wide range of infrastructures, providing generic components as basic building blocks for the development of ad-hoc middleware targeting various on-board space applications. The core components of this middleware enable real-time fault-tolerant applications to be developed. The project was organized in two phases: Phase 1 was devoted to the overall analysis of the problem, whereas Phase 2 focused on initial implementation plans for the basic middleware layer of the architecture. This paper describes the main achievements carried out in Phase 2.

The major components of the architecture rely on distributed consensus and coordination protocols developed at INRIA [REF-IEEE ToC]. These are based on an asynchronous model of computation, thus making no assumption about timing issues at design time. In other words, the safety and liveness properties are guaranteed to hold true, always, irrespective of the availability of a global time in the system. The superiority of asynchronous solutions for safety-critical systems – which substantiates the choice made by ESA/ESTEC and Astrium – is examined in [LL 2003] and explored in great detail in [LLS 2003]. These facilities are key features from a dependability viewpoint, as most distributed fault tolerance strategies rely, in one way or another, on this kind of protocol. Their role is essential as far as replicated processing is concerned. For instance, they ensure that all replicas will receive the same inputs in the same order. Assuming that replicas are deterministic, the processing of the same inputs leads to the same results in the absence of faults. In addition, they are used to ensure that distributed scheduling decisions are consistent in a distributed system as a whole. For instance, in systems where tasks cannot be rolled back (our case), tasks must be granted access to shared resources (updatable data in particular) in some total ordering that must be unique system-wide. In other words, concurrent accesses to shared resources can be serialized by resorting to scheduling algorithms based on these protocols. This is a significant benefit, as avoiding uncontrolled multiple accesses enables real-time deadlines to be met despite failures.

The crucial assumption regarding failure modes is that nodes in the system fail by crashing, this being a conventional assumption in distributed dependable systems. Although the proposed solutions may tackle more subtle faults, only this type of fault is considered in the proposed developments.
As a side effect, we assume a very high coverage of this assumption, the means to achieve this being out of the scope of this paper.

The use of middleware layers is clearly the most convenient approach to tackle various application problems. Such a layer provides appropriate generic services for the development of fault tolerant applications, independently from the underlying runtime support, namely a COTS operating system kernel. The core components mentioned earlier constitute the basic layer of this middleware architecture, on top of which standard personalities (e.g. microCORBA or Java) could be developed. The focus in A3M Phase 2 is the development and the validation (incl. real-time characterisation) of this sub-layer. The development of additional services and standard personalities is a long-term objective of the project that will be further discussed at the end of this paper.


The paper is organized as follows. Section 2 sketches the A3M architectural framework defined in Phase 1, which is the basis for the development carried out in Phase 2. In Section 3, we describe in some detail the basic principles of the core components, namely the UCS and UCN protocols and their variants developed in the project. In Section 4 we describe two key problems in space applications and their corresponding solutions based on the protocols described in Section 3. Section 5 focuses on the development strategy that was used and justifies its interest from a validation viewpoint. Section 6 addresses the intermediate and final platforms used for the implementation. Section 7 gives a brief overview of the current status of A3M Phase 2 and Section 8 draws early conclusions and outlines the perspectives of this work.

2. Architectural framework

The architectural framework of the middleware comprises a basic sub-layer implementing core components for the development of fault tolerant and real-time applications. This sub-layer was developed in A3M Phase 2 on top of an off-the-shelf Real-Time Operating System kernel (RTOS) and a protocol stack specific to space platforms (COM), but no particular assumption is made on the selected candidates. The middleware itself has two facets. On top of the basic layer (facet 1) providing key algorithms, some additional services and personalities could be developed (facet 2) for targeting new problems and new application contexts. The work done so far has been to develop the basic layer, i.e. the first facet of the A3M middleware. The overall architecture of the middleware is depicted in Figure 2-A. The basic A3M middleware layer is built from two major protocols, Uniform Consensus (UCS) and Uniform Coordination (UCN), both relying on a Failure Detection Module (FDM). This layer is also populated with some specific services to tackle space-related problems, for instance a Monitoring Process related to the distributed scheduling of concurrent processes and a Guardian Process related to the consistent update of shared objects. This will be further discussed in Section 4.

[Figure 2-A: software architecture overview. The A3M middleware comprises a basic layer (core components: UCS, UCN, FDM) and, above it, additional services and personalities; the whole runs on top of a (COTS) RTOS and the COM protocol stack.]
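As an illustration of this layering, the following minimal C sketch suggests what the programming surface of the basic layer could look like. All names and signatures here (fdm_is_suspected, ucs_propose, ucn_coordinate) are hypothetical stand-ins, not the actual A3M API.

```c
#include <stddef.h>

/* Hypothetical sketch of the basic-layer surface suggested by
 * Figure 2-A; names and signatures are illustrative only. */

/* Failure Detection Module (FDM): local suspicion query. */
int fdm_is_suspected(int processor_name);

/* Uniform Consensus (UCS): submit a Proposal and block until the
 * unique, system-wide Decision is delivered. */
int ucs_propose(const void *proposal, size_t len,
                void *decision_out, size_t decision_cap);

/* Uniform Coordination (UCN): UCS preceded by a Contribution round
 * and a problem-specific filter (see Section 3.2). */
int ucn_coordinate(int instance_id,
                   const void *contribution, size_t len,
                   void *decision_out, size_t decision_cap);
```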

In order to allow the reuse of the basic components to solve different problems, several instances of each algorithm can execute in parallel. Each instance is customized with a “filter function” adapted to the problem.
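A minimal C sketch of how such parallel instances might be parameterised with plug-in filter functions follows; the type and function names (a3m_ucn_instance, ucn_filter_fn, and the stub filters) are assumptions made for illustration, not the project's actual code.

```c
#include <stddef.h>

/* Hypothetical sketch: each instance of the algorithm carries a
 * plug-in "filter function" that reduces the Contributions collected
 * so far to one Proposal. Names are illustrative only. */

typedef struct {
    const void *data;   /* contribution payload              */
    size_t      len;    /* payload length in bytes           */
    int         sender; /* processor name (positive integer) */
} a3m_contribution;

typedef void (*ucn_filter_fn)(const a3m_contribution *contribs,
                              size_t n_contribs, void *proposal_out);

typedef struct {
    int           instance_id; /* several instances may run in parallel */
    ucn_filter_fn filter;      /* problem-specific plug-in              */
} a3m_ucn_instance;

/* Illustrative stub filters; real ones would implement e.g. majority
 * voting (Problem 1) or aggregation of names (Problem 2). */
static void majority_vote(const a3m_contribution *c, size_t n, void *out)
{ (void)c; (void)n; (void)out; /* see the sketch in Section 3.2 */ }

static void aggregate_names(const a3m_contribution *c, size_t n, void *out)
{ (void)c; (void)n; (void)out; }

/* One instance per problem: same algorithm, different filter. */
static a3m_ucn_instance replication_instance = { 1, majority_vote };
static a3m_ucn_instance scheduling_instance  = { 2, aggregate_names };
```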

3. Basic components and protocols

3.1. Uniform Consensus (UCS): basic principles

UCS is the name of the protocol that solves the Uniform Consensus (UC) problem in the presence of processor crashes and arbitrarily variable delays. UCS comprises two algorithms: one consists of a single round of message broadcasting, made accessible to application programs via a UCS primitive; the other, called MiniSeq, runs at a low level in the system (above the physical link protocol level). Both algorithms run in parallel.

Basically, UCS works as follows. Upon the occurrence of some event, every processor runs UCS by (1) broadcasting a message containing its Proposal to every other processor, and (2) invoking MiniSeq. A Proposal may be anything (the name of a processor to be shut down, a list of pending requests that must be scheduled throughout the system, the name of a processor chosen as the “leader”, etc.). Upon termination, UCS delivers a Decision. A Decision must be (1) unique, and (2) some initial Proposal. Uniformity requires that even those processors that are about to crash cannot Decide differently (than correct processors) before crashing. Note that, in practice, Consensus without the Uniformity property is useless.

The MiniSeq election algorithm, which is sequential, is run by processors according to their relative orderings (based upon their names, every processor having a known predecessor and a known successor). It is via MiniSeq that the Decision is made. MiniSeq uses an essential construct, which is a Failure Detector (FD). Every processor is equipped with an FD. An FD periodically broadcasts “I am alive” messages. When processor p has not been heard by processor q “in due time” (analytical formulae are used for setting timers), q puts p on its list of “suspects”. An interesting feature (F) of FDs is that it suffices to have just one correct processor never suspected by any other processor for solving the UC problem (even though lists of suspects are inconsistent). In turn, every processor waits until some condition fires, and then broadcasts an FM-message that contains the name of the processor proposed as the “winner”. That name is the name heard last from its predecessors. Hence, the condition is that every predecessor has been heard of (or is suspected). Thanks to (F), the “winner” is unique system-wide. It is the Proposal sent by the “winner” that is the Decision. There is no a priori ordering of invocations of the UCS primitive (they may be truly concurrent). The above is referred to as the regular UCS invocation mode.

Another invocation mode of UCS, called the Rooted UCS invocation mode, is available. R-UCS serves to make a particular kind of unique Decision, namely whether to abort or to commit a set of updates computed for data variables replicated across a number of processors. The major difference with the regular mode is that instead of having processor number 1 in charge of starting MiniSeq, it is the processor that actually runs the updating task – called the root processor – that is in charge of starting MiniSeq. This serves to preserve data consistency despite processor failures.
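To make the FD and the MiniSeq firing condition concrete, here is a minimal C sketch. All names and the constant timeout are hypothetical: in particular, FD_TIMEOUT stands in for the analytically derived timer settings mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the FD suspect logic on processor q:
 * processor p becomes suspected once its last "I am alive" message
 * is older than a timeout. */

#define MAX_PROCS  8
#define FD_TIMEOUT 50u /* illustrative bound, in milliseconds */

static uint32_t last_heard[MAX_PROCS]; /* timestamp of last heartbeat */
static bool     suspect[MAX_PROCS];    /* q's local list of suspects  */

/* Called when a heartbeat from processor p is received. */
void fd_on_heartbeat(int p, uint32_t now_ms)
{
    last_heard[p] = now_ms;
    suspect[p]    = false; /* a suspicion may be revised */
}

/* Called periodically: refresh the local suspect list. */
void fd_tick(uint32_t now_ms)
{
    for (int p = 0; p < MAX_PROCS; p++)
        if (now_ms - last_heard[p] > FD_TIMEOUT)
            suspect[p] = true;
}

/* MiniSeq-style firing condition for processor q: every predecessor
 * (lower-named processor) has been heard from or is suspected. */
bool predecessors_resolved(int q, const bool heard[MAX_PROCS])
{
    for (int p = 0; p < q; p++)
        if (!heard[p] && !suspect[p])
            return false;
    return true;
}
```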

3.2. Uniform Coordination (UCN): basic principles

UCS delivers some Decision, which can be any Proposal submitted initially. As explained further, there are applications such that some Proposal is “better” than some others. This raises a problem called UCN (Uniform Coordination). Basically, UCN is UCS preceded by a round where every processor broadcasts a message containing a Contribution (to some calculus). At the end of that round, having collected all Contributions sent, some of them possibly missing (crashed processors), every processor runs some computation (called a “filter”, see below). The result of that computation is a Proposal. UCS is then run. This is needed because the “views” of received Contributions may differ between any two processors. Upon completion, one unique result is chosen as the Decision. UCN may serve many purposes. One of the filters used in the A3M project is “majority voting”. Another is “aggregation of names, with some Boolean and/or functions”. The latter serves to do distributed scheduling.
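As an illustration, here is a minimal C sketch of a majority-voting filter. The integer-valued Contributions and the tie-breaking fallback are assumptions made for the example, not the project's actual code.

```c
#include <stddef.h>

/* Hypothetical sketch of a "majority voting" UCN filter: given the
 * Contributions collected in the first round (some possibly missing
 * because of crashes), pick the value backed by a strict majority
 * of the n processors. */

typedef struct {
    int value;
    int present; /* 0 if the sender crashed / was not heard from */
} contribution_t;

int majority_filter(const contribution_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!c[i].present)
            continue;
        size_t votes = 0;
        for (size_t j = 0; j < n; j++)
            if (c[j].present && c[j].value == c[i].value)
                votes++;
        if (2 * votes > n)
            return c[i].value; /* strict majority found */
    }
    /* Fallback when no strict majority exists: first value heard
     * (a real filter would define this case precisely). */
    for (size_t i = 0; i < n; i++)
        if (c[i].present)
            return c[i].value;
    return -1; /* no Contribution at all */
}
```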


4. Case studies

A major objective of the A3M study was to assess the interest of the UCS/UCN algorithms for space applications. For this purpose, two common problems have been selected, and a middleware has been developed to solve them:

- Problem 1: distributed consistent processing under active redundancy,
- Problem 2: distributed replicated data consistency, program serialization, program atomicity.

After a presentation of the notations and hypotheses common to both problems, this section describes each problem and its algorithmic solution based on the UCS and UCN algorithms. The notations and hypotheses common to both problems are the following:

- (F1) Application-level programs are denoted Pk. An instance of every such program is mapped and run, fully, on exactly one processor. Programs are assumed to be deterministic.

- (F2) Application-level programs, or instances of such programs, can be invoked, activated and run at any time. This means that triggering events can be aperiodic, sporadic or periodic, that programs can be suspended and resumed later, and that the speed of execution of a program need not be known in advance (asynchronous computational model).

- (F3) There should be no restriction regarding the programming models. In other words, there is no restriction regarding which variables can be accessed by any given program, nor is there any restriction regarding which variables happen to be shared by which programs.

- (F4) Neither application-level programs nor the system's environment can tolerate the rolling back of a program execution. When a program has begun its execution, it should terminate – in the absence of failure – without replaying any portion of its code.

- (F5) A processor is unambiguously identified by a name, unique for any given processor; names are positive integers; the total ordering of integers induces a total ordering on the set of processors.

- (F6) The system includes n processors. Up to fm processors can fail by stopping (crash, no side effect) while executing some A3M middleware algorithm (UCS, UCN). This number respects the condition: 0