A Fault-Tolerant Software Architecture for Component-Based Systems

Paulo Asterio de C. Guerra¹, Cecilia Mary F. Rubira¹, and Rogerio de Lemos²

¹ Instituto de Computação, Universidade Estadual de Campinas, Brazil
{asterio,cmrubira}@ic.unicamp.br
² Computing Laboratory, University of Kent at Canterbury, UK
r.delemos@ukc.ac.uk

Abstract. Component-based software built from reusable software components is being used in a wide range of applications that have high dependability requirements. In order to achieve the required levels of dependability, it is necessary to incorporate into these complex systems means for coping with software faults. However, the problem is exacerbated if we consider the current trend of integrating off-the-shelf software components, from independent sources, which allow neither code inspection nor changes. To leverage the dependability properties of these systems, we need solutions at the architectural level that are able to guide the structuring of unreliable components into a fault-tolerant architecture. In this paper, we present an approach for structuring fault-tolerant component-based systems based on the C2 architectural style.

1 Introduction

R. de Lemos et al. (Eds.): Architecting Dependable Systems, LNCS 2677, pp. 129-149, 2003. © Springer-Verlag Berlin Heidelberg 2003

Modern computer systems are based on the integration of numerous existing software components that are developed by independent sources [6]. The source code and internal design of these components are generally not available to the organization using the component. Moreover, a component's configuration, environment, and dependencies can be changed at integration or deployment time. Thus, many of the challenges of using component-based software in critical applications cannot be effectively addressed via traditional software assurance technologies [29]. This means that new and improved approaches have to be sought in order to obtain trustworthy systems out of unreliable components that are not built under the control of the system developers. In this scenario, fault tolerance, which is associated with the ability of a system to deliver services according to its specification in spite of the presence of faults [13], is of paramount importance as a means to dependability.

Our aim is to structure, at the architecture level, fault-tolerant component-based systems that use off-the-shelf components. For that, we define an idealised architectural component with structure and behaviour equivalent to the idealised fault-tolerant component concept [1]. This concept provides a means of system structuring which makes it easy to identify what parts of a system have what responsibilities for trying to cope with which sorts of fault. A system is viewed as a set of components interacting under the control of a design. Components receive requests for service and produce responses. When a component cannot satisfy a request for service, it will return an exception. An idealised fault-tolerant component should in general provide both normal and abnormal (i.e. exception) responses in the interface between interacting components, in a framework that minimizes the impact of these provisions on system complexity [19]. Moreover, this idealised architectural component can be used as a building block for a system of design patterns that implement the idealised fault-tolerant component for concurrent distributed systems [5].

For representing software systems at the architectural level, we have chosen the C2 architectural style for its ability to incorporate heterogeneous off-the-shelf components [17]. However, this ability to combine existing components is achieved through rules on topology and communication between the components that complicate the incorporation of fault tolerance mechanisms into C2 software architectures. For example, the basic communication mode between C2 components is based on asynchronous messages broadcast by connectors, which causes difficulties for both error detection and fault containment [7][11].

Research into describing software architectures with respect to their dependability properties has gained attention recently [20][25][26]. Nonetheless, rigorous specification of exception handling models and of exception propagation at the architecture level remains an open issue [12]. The work on exception handling has focused on configuration exceptions, which are exceptional events that have to be handled at the configuration level of software architectures [12].
In terms of software fault tolerance, the traditional principles used for obtaining software diversity have also been employed in the reliable evolution of software systems, specifically, the upgrading of software components. The Hercules framework [8] employs concepts associated with recovery blocks [19]. The notion of multi-versioning connectors (MVC) [18], in the context of C2 architectures, is derived from concepts associated with N-version programming [3].

The architectural solution presented in this paper is distinct from the works referred to above, since its focus is on more fundamental structuring concepts, to be applied to a broader class of exceptional conditions, not only configuration exceptions, and to the structuring of specialized fault tolerance mechanisms, including those for reliable upgrade.

The rest of this paper is structured as follows. Section 2 gives a brief overview of fault tolerance and the C2 architectural style. Section 3 describes the proposed architectural solution of the idealised component, which is then formalised in Section 4. An illustrative case study is presented in Section 5 to demonstrate the feasibility of the proposed approach. Final conclusions are given in Section 6.

2 Background

2.1 Fault Tolerance

The causal relationship between the dependability impairments, that is, faults, errors and failures, is essential to characterise the major activities associated with fault tolerance [13]. A fault is the adjudged or hypothesized cause of an error. An error is the part of the system state that is liable to lead to a subsequent failure. A failure occurs when a system service deviates from the behaviour expected by the user.

The basic strategy to achieve fault tolerance in a system can be divided into two steps [14]. The first step, called error processing, is concerned with the system internal state, aiming to detect errors that are caused by activation of faults, diagnose the erroneous states, and recover to error-free states. The second step, called fault treatment, is concerned with the sources of faults that may affect the system, including fault localization and fault removal.

The idealised fault-tolerant component is a structuring concept for the coherent provision of fault tolerance in a system. Through this concept, we can allocate fault tolerance responsibilities to the various parts of a system in an orderly fashion, and model the system recursively. Each component can itself be considered as a system on its own, which has an internal design containing further sub-components [1].

[Figure 1: a component with a Normal Behaviour part and an Abnormal Behaviour part. Labelled messages: Service Request, Normal Response, Interface Exception, Failure Exception, Internal Exception, Return to Normal.]

Fig. 1. Idealised fault-tolerant component

The communication between idealised fault-tolerant components is only through request/response messages (Figure 1). Upon receiving a service request, an idealised component will react with a normal response if the request is successfully processed, or with an abnormal response otherwise. This abnormal response may be due to an invalid service request, in which case it is called an interface exception, or due to a failure in processing a valid request, in which case it is called a failure exception. The internal structure of an idealised component has two distinct parts: one that implements its normal behaviour, when no exceptions occur, and another that implements its abnormal behaviour, which deals with the exceptional conditions. This separation of concerns, applied recursively to components, subsystems and the overall system, greatly simplifies the structuring of fault-tolerant systems, allowing their complexity to be manageable.

An internal exception is associated with an error detected within the normal behaviour part of a component, switching the control flow of the idealised component to its abnormal behaviour part. This error may be recovered, allowing the operation to be completed successfully (return to normal) or, alternatively, an external exception is propagated to the caller component. An idealised component should provide appropriate handlers for all kinds of exceptions it may raise internally or may receive from a server component.

To draw a more complete view of an idealised fault-tolerant component, we should also consider a few additional messages that are not explicitly represented in Fig. 1. During error recovery, the abnormal behaviour part may also send service requests to other components and receive the corresponding responses, which may be either normal responses or new external exceptions. Moreover, both the normal behaviour and abnormal behaviour parts must have access to the component internal state, either by means of shared memory or additional internal request/response messages, to allow error recovery by the abnormal behaviour part.

The present work is based on the following assumptions: (i) the synchronicity between requests and responses between components; and (ii) a single thread of control for each component, not allowing concurrent requests to be processed by the same component.

2.2 The C2 Architectural Style

A software architecture is an abstract representation of a software system described as a set of connected components in a specific configuration, which ignores implementation details [22]. The basic elements of software architecture are components and connectors. By architectural style we mean a set of design rules that identify the kinds of components and connectors that may be used to compose a system or subsystem, together with local or global constraints on the way the composition is done [23].

In particular, the C2 architectural style is a component-based style directed at supporting large-grain reuse and flexible system composition, emphasizing weak bindings between components [27]. In this style, components of a system may be completely unaware of each other. This may be the case when one integrates various commercial off-the-shelf (COTS) components, possibly with heterogeneous styles and implementation languages. The components communicate only through asynchronous messages mediated by connectors that are responsible for message routing, broadcasting and filtering. Interface and architectural mismatches are dealt with by using wrappers for encapsulating each component [9].

Both components and connectors in the C2 architectural style (Figure 2) have a top interface and a bottom interface. Systems are composed in a layered style. The top interface of a component may be connected to the bottom interface of a single connector. The bottom interface of a component may be connected to the top interface of another single connector. Each side of a connector may be connected to any number of components or connectors. When two connectors are attached to each other, it must be from the bottom of one to the top of the other [27].

[Figure 2: components attached above and below a connector through their top and bottom interfaces; requests flow upwards and notifications flow downwards.]

Fig. 2. C2 style basic elements

There are two types of messages in C2: requests and notifications. By convention, requests flow up through the system's layers and notifications flow down. This C2 convention is the reverse of the usual sense of "upper" and "lower" components in a client/server style, as is the case of the idealised fault-tolerant component shown in Fig. 1. In response to a request, a component may emit a notification back to the components below, through its bottom interface. Upon receiving a notification, a component may react, as if a service was requested, with the implicit invocation of one of its operations.
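The direction rules just described can be illustrated with a few lines of Python. This is only a sketch under simplifying assumptions: messages are delivered by direct synchronous calls rather than asynchronously, the connector broadcasts without filtering, and the rule that a component side attaches to a single connector is not enforced. The names `Element`, `attach`, `emit` and `receive` are hypothetical and do not come from any C2 framework.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    name: str
    payload: dict = field(default_factory=dict)


class Request(Message):
    """Flows upwards, towards the components that provide a service."""


class Notification(Message):
    """Flows downwards, towards the components that asked for a service."""


class Element:
    """A component or connector, with a top and a bottom interface."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.above: List["Element"] = []  # attached to the top interface
        self.below: List["Element"] = []  # attached to the bottom interface

    def emit(self, msg: Message) -> None:
        # Requests leave through the top interface, notifications through the bottom.
        targets = self.above if isinstance(msg, Request) else self.below
        for target in targets:
            target.receive(msg)

    def receive(self, msg: Message) -> None:
        raise NotImplementedError


class Connector(Element):
    """Broadcasts every message it receives, without filtering."""

    def receive(self, msg: Message) -> None:
        self.emit(msg)


class Component(Element):
    """Records what it receives; a real component would react to it."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.inbox: List[Message] = []

    def receive(self, msg: Message) -> None:
        self.inbox.append(msg)


def attach(lower: Element, upper: Element) -> None:
    """Attach the top interface of `lower` to the bottom interface of `upper`."""
    lower.above.append(upper)
    upper.below.append(lower)


server, bus = Component("server"), Connector("bus")
client_a, client_b = Component("client_a"), Component("client_b")
attach(bus, server)
attach(client_a, bus)
attach(client_b, bus)

client_a.emit(Request("get_time"))                      # travels up to the server
server.emit(Notification("time", {"value": "12:00"}))   # broadcast down to both clients
```

In this sketch `client_b` also receives the notification although it never issued the request, which is the kind of broadcast behaviour that makes error detection and fault containment harder in C2.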

2.3 Software Architecture and Fault Tolerance

Dependable architectures demonstrably possess properties such as reliability, availability and security, which are achieved through fault tolerance mechanisms [26]. A fault-tolerant software architecture allows us to reason about whether a specified set of dependability properties is being satisfied, under the assumptions of a fault model. Examples of fault-tolerant software architectures can be found in [21][25][26].

The level of abstraction at which a fault-tolerant software architecture is described plays an important role in its applicability. Lower levels of abstraction simplify the instantiation of concrete configurations with the desired dependability properties, but restrict the range of software systems where they may be applied. The Simplex architecture [21] and the Distributed Component Framework [30] are examples of such lower-level architectures. Higher levels of abstraction, as in the CAATS approach [25], may be applied to a wider range of software systems, at the cost of greater design and implementation effort, which in turn makes it harder to obtain assurance that an implementation conforms to the fault-tolerant abstract architecture. Architecture description hierarchies [26] provide a strategy for refining a fault-tolerant architecture to a more concrete level and down to implementation, using only verified refinement patterns which guarantee that the fault-tolerant properties are preserved.

Existing fault-tolerant software architectures also differ with respect to their underlying fault model. The simplest and most common fault model is restricted to fail-stop failures of nodes in a network, which are tolerated by means of groups of replicated software components residing in different nodes, as in [30]. Software design faults are far more complex to tolerate, requiring added redundancy through specialized exception handlers, as in [21] and [25]. In this work we are concerned mainly with design faults that may exist at the component implementation level.

3 The Proposed Architecture

The objective of this section is to define an idealised C2 component (iC2C), which should be equivalent, in terms of behaviour and structure, to the idealised fault-tolerant component (iFTC) [1].

3.1 Overall Structure of the Idealised C2 Component

The implementation of an iC2C (section 2.1) should be able to use any C2 component with minimum restrictions. It is also desirable to integrate idealised C2 components into any C2 configuration, to allow the interaction of iC2Cs with other idealised and/or regular C2 components.

[Figure 3: C2 Message is specialised into C2 Request (Service Request, Return To Normal, Reset) and C2 Notification (Normal Response, Interface Exception, Failure Exception, Abort, Internal Exception).]

Fig. 3. Message types defined for the iC2C

The first task is to extend the C2 message type hierarchy to represent the different message types defined by the idealised fault-tolerant component concept (iFTC). Service requests and normal responses of an iFTC are directly mapped as requests and notifications in the C2 architecture, respectively. As interface and failure exceptions of an iFTC flow in the same direction as a normal response, they are considered subtypes of notifications in the C2 architecture (Figure 3).
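The hierarchy of Figure 3 can be written down as a small class hierarchy. The sketch below is illustrative only: the class names mirror the message names used in this paper, but the constructor signature and the helper `is_external_exception` are assumptions. The placement of ReturnToNormal, Reset, Abort and InternalException follows the description of those message types given later in this section.

```python
class C2Message:
    """Root of the hierarchy: an operation name plus its parameters or return values."""

    def __init__(self, operation: str, **values) -> None:
        self.operation = operation
        self.values = values


class C2Request(C2Message):
    """Messages that flow upwards in a C2 configuration."""


class C2Notification(C2Message):
    """Messages that flow downwards in a C2 configuration."""


# Request subtypes
class ServiceRequest(C2Request):
    pass


class ReturnToNormal(C2Request):
    pass


class Reset(C2Request):
    pass


# Notification subtypes
class NormalResponse(C2Notification):
    pass


class InterfaceException(C2Notification):
    pass


class FailureException(C2Notification):
    pass


class InternalException(C2Notification):
    pass


class Abort(C2Notification):
    pass


def is_external_exception(msg: C2Message) -> bool:
    """Interface and failure exceptions are the ones that may leave the iC2C."""
    return isinstance(msg, (InterfaceException, FailureException))
```

Because the exception types are ordinary notifications, any C2 connector that routes by the request/notification distinction can carry them without modification.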

In order to minimize the impact of fault tolerance provisions on the system complexity, we model the normal behaviour and abnormal behaviour parts of the iFTC as separate sub-components of the iC2C. This leads to an overall structure for the iC2C that has two components and three connectors, as shown in Fig. 4.
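The separation of normal and abnormal behaviour into two parts can be illustrated without any C2 machinery. The sketch below is a simplification made for illustration, not the structure defined in this paper: plain synchronous Python calls stand in for C2 messages, the three connectors are omitted, and every name (`IdealisedComponent`, `lookup`, `recover`, the two dictionaries) is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


class InvalidRequest(Exception):
    """Raised by the normal part for a request it cannot accept."""


class InternalError(Exception):
    """Raised by the normal part when it detects an error while processing."""


class Unrecoverable(Exception):
    """Raised by the abnormal part when recovery does not succeed."""


@dataclass
class Response:
    kind: str          # "normal", "interface_exception" or "failure_exception"
    value: Any = None


class IdealisedComponent:
    def __init__(self, normal: Callable[[Any], Any],
                 abnormal: Callable[[Any, InternalError], Any]) -> None:
        self.normal = normal      # normal behaviour part
        self.abnormal = abnormal  # abnormal behaviour part

    def handle(self, request: Any) -> Response:
        try:
            return Response("normal", self.normal(request))
        except InvalidRequest as exc:
            # Invalid request: reported to the client as an interface exception.
            return Response("interface_exception", str(exc))
        except InternalError as exc:
            # Internal exception: control switches to the abnormal part.
            try:
                return Response("normal", self.abnormal(request, exc))  # return to normal
            except Unrecoverable as failure:
                return Response("failure_exception", str(failure))


PRIMARY = {"a": 1}
BACKUP = {"a": 1, "b": 2}


def lookup(key):
    if not isinstance(key, str):
        raise InvalidRequest("key must be a string")
    if key not in PRIMARY:
        raise InternalError(f"{key!r} missing from primary store")
    return PRIMARY[key]


def recover(key, error):
    if key in BACKUP:
        return BACKUP[key]
    raise Unrecoverable(str(error))


component = IdealisedComponent(lookup, recover)
```

With these definitions, `component.handle("b")` is answered from the backup store and still yields a normal response, whereas `component.handle("c")` ends in a failure exception and `component.handle(7)` in an interface exception.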

[Figure 4: from top to bottom, the iC2C_top connector, the NormalActivity component, the iC2C_internal connector, the AbnormalActivity component and the iC2C_bottom connector. Labelled messages: Service Request, Normal Response, Interface Exception, Failure Exception, Internal Exception, Return To Normal, Reset, Abort.]

Fig. 4. Idealised C2 component (iC2C)

The NormalActivity component implements the normal behaviour, processing service requests from client components and sending back normal responses. It is also responsible for error detection during this processing, and for the signalling of interface exceptions and internal exceptions. The AbnormalActivity component is responsible for error recovery, and for the signalling of failure exceptions. For consistency, internal exceptions are mapped as a subtype of notification, and ReturnToNormal messages, which flow in the opposite direction, are mapped as requests (Figure 3). During error recovery, the AbnormalActivity component may also send service requests to, and receive the corresponding notifications from, the NormalActivity component or any server components above the iC2C in the architecture.

Two additional message types that may be sent to the NormalActivity component complete the full set of messages within the iC2C: Abort notifications and Reset requests. An Abort notification terminates the NormalActivity component after an external exception is received and before the internal control flow of the iC2C is switched to the AbnormalActivity component. A Reset request switches the control flow back to the NormalActivity component, and is sent by the AbnormalActivity component when it terminates an unsuccessful recovery, after an external exception is sent to the client component.

The connectors of our iC2C shown in Fig. 4 are specialized, reusable C2 connectors with the following roles:

(i) The iC2C_bottom connector connects the iC2C with the lower components of a C2 configuration, and serializes the Service Requests received. Once a request is accepted, this connector queues new requests that are received until completion of the first request. When a request is completed, a notification is sent back, which may be a Normal Response, an Interface Exception or a Failure Exception.

(ii) The iC2C_internal connector controls message flow inside the iC2C, selecting the destination of each message received based on its originator, the message type and the operational state of the iC2C (normal or abnormal state).

(iii) The iC2C_top connector connects the iC2C with other components above it in the C2 configuration. This connector synchronizes each Service Request sent by the iC2C to a server component with its corresponding notification, which may be a Normal Response, an Interface Exception or a Failure Exception.

The overall structure defined for the idealised C2 component makes it fully compliant with the component rules of the C2 architectural style. This allows an iC2C to be integrated into any C2 configuration and to interact with components of a larger system. When this interaction establishes a chain of iC2C components, the external exceptions raised by a component can be handled by a lower-level component (in the C2 sense of "upper" and "lower"), allowing hierarchical structuring of error recovery activities. An iC2C may also interact with a regular C2 component, either requesting or providing services.

3.1.1 Structuring the NormalActivity Component

The NormalActivity component can be built from scratch or by reusing an existing component. In this section, we describe in more detail how the NormalActivity component can be implemented from existing C2 components. As previously mentioned, the NormalActivity component is responsible for the implementation of the normal behaviour of the idealised C2 component. It is also responsible for detecting errors that may affect the normal behaviour. When a NormalActivity component is built from existing C2 components that do not have error detection capabilities, these capabilities have to be added to them.

A possible architectural solution for structuring the NormalActivity component is shown in Fig. 5. The BasicNormal component is an existing C2 component implementing the services provided by the iC2C to its client components. The CollaboratingComponent is another C2 component implementing additional error detection functions that may be required by the iC2C. A pair of special-purpose connectors (normal_top and normal_bottom) wraps these two components, following the pattern of the multi-versioning connector (MVC) [18].
The normal_bottom connector coordinates the collaboration between the BasicNormal and CollaboratingComponent components, providing the NormalActivity component with capabilities for error detection. Errors are detected by checking the pre- and postconditions, and the invariants, associated with each service provided by the BasicNormal component [25]. The proposed approach was inspired by the concepts of coordination contracts [2] and co-operative connectors [16].

The NormalActivity component can also interact with other components outside the scope of the iC2C. In this case, the CollaboratingComponent should be placed higher in the C2 configuration, and the normal_top connector should act as a proxy of the component in the context of the NormalActivity component.
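The idea of adding error detection around an existing component can be sketched as follows. This is an illustration under several assumptions, not the composite of Fig. 5: the two connectors are collapsed into one Python wrapper class, invariants are left out, and the choice to report a violated precondition as an interface exception and a violated postcondition as an internal exception is an assumption made for the example. The `sqrt` service and the class `SqrtChecks` are hypothetical.

```python
class InterfaceException(Exception):
    """Signalled when a request violates the precondition of a service."""


class InternalException(Exception):
    """Signalled when a result violates the postcondition of a service."""


class BasicNormal:
    """Stand-in for an existing component that has no error detection."""

    def sqrt(self, x: float) -> float:
        return x ** 0.5


class SqrtChecks:
    """Stand-in for the CollaboratingComponent: the checks for one service."""

    def precondition(self, x) -> bool:
        return isinstance(x, (int, float)) and x >= 0

    def postcondition(self, x, result) -> bool:
        # The result squared must be close to the argument.
        return abs(result * result - x) <= 1e-9 * max(1.0, x)


class NormalActivity:
    """Wraps the basic component and consults the checks around each call."""

    def __init__(self, basic, checks) -> None:
        self.basic = basic
        self.checks = checks

    def sqrt(self, x):
        if not self.checks.precondition(x):
            raise InterfaceException(f"invalid argument: {x!r}")
        result = self.basic.sqrt(x)
        if not self.checks.postcondition(x, result):
            raise InternalException(f"sqrt({x!r}) returned {result!r}")
        return result


activity = NormalActivity(BasicNormal(), SqrtChecks())
root = activity.sqrt(9)
```

The wrapped component is unchanged, which matters when it is an off-the-shelf component whose code cannot be inspected or modified; a faulty replacement for `BasicNormal` would be caught by the postcondition rather than returning a wrong value to the client.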

Fig. 5. A composite NormalActivity component

3.1.2 Structuring the AbnormalActivity Component

The AbnormalActivity component is responsible for both error diagnosis and error recovery. Depending on the complexity of these tasks, it may be convenient to decompose it into more specialized components for error diagnosis and error handling.

[Figure 6: the iC2C with its iC2C_top, iC2C_internal and iC2C_bottom connectors and its NormalActivity component; the AbnormalActivity component is decomposed into an ErrorDiagnosis component and ErrorHandler components (1) to (n).]

Fig. 6. Decomposition of the AbnormalActivity component

Figure 6 presents a possible architectural solution for structuring the AbnormalActivity component. In this solution, the responsibility for coordinating the exception handling is allocated to the ErrorDiagnosis component, which may propagate a new exceptional notification to a specialized ErrorHandler.

3.2 Integration of the iC2C into C2 Configurations

In this section, we describe how an iC2C can interact with other C2 components in an architecture. We assume that exception notifications share a special type name, which distinguishes them from any possible normal notification. Apart from its unique type name, an exception notification follows the same rules as a normal notification, that is, "a statement of what interface routine was invoked and what its parameters and return values were". The exception type and its description are part of the return values of an exception notification, which are represented as a set of data objects [10].

A regular C2 component (rC2C) is a C2 component that: (i) does not send exception notifications, and (ii) does not recognize an exception notification received as being the signalling of an exception. An exception notification's type name and its return values are meaningless to an rC2C, which will treat an exception as an ordinary notification sent in response to a given interface routine invoked. These assumptions are valid for C2 components built without knowledge of the iC2C concept. We will not consider special cases, such as a C2 component capable of sending a forged exception notification, or one built to recognize exception notifications but not behaving as an iC2C.

Next we analyse the various kinds of interaction between an iC2C and other iC2Cs or rC2Cs:

Case 1. An iC2C requesting services from another iC2C. A request can result either in a normal notification or in an external (interface or failure) exception notification. In the case of an exception notification, the client should handle it in a predictable way.

Case 2. An iC2C requesting services from an rC2C. The iC2C should be able to translate notifications that may be sent by the rC2C, to notify its clients about abnormal conditions, into exception notifications. The iC2C top connector should act as a domain translator of exception notifications. This includes translating notifications sent by an rC2C signalling abnormal conditions into exception notifications understood by the iC2C abnormal activity component, and recognizing exception notifications that can be handled by the normal activity component. The normal activity component of the iC2C is responsible for preventing and detecting errors that may be caused by any other abnormal condition not properly signalled by the rC2C.

Case 3. An rC2C requesting services from an iC2C. We have two distinct cases here:

Case 3.1. The iC2C does not send exceptional notifications. This would be possible if the iC2C were able to mask every possible exception completely. A request will always result in a normal (and fault-free) notification to the rC2C.

Case 3.2. The iC2C may send exceptional notifications. In this case, the iC2C may reply with a normal notification or an exception notification. The problem is that the behaviour of the rC2C cannot be predicted when the request results in an exception notification. To solve this, we may either (i) promote the rC2C to an iC2C and proceed as in case 1, or (ii) extend the fault tolerance capabilities of the iC2C in order to mask every possible exception notification sent by it and proceed as in case 3.1.

For promoting an rC2C to an iC2C, the rC2C may be wrapped into a normal activity component, which adds error detection capabilities to it (Figure 7). This normal activity component is inserted into an iC2C with a new abnormal activity component responsible for handling exceptions raised by the normal activity component or by other iC2Cs providing services to it.
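The domain translation described in Case 2 can be sketched as a rule table applied to each notification arriving from an rC2C. Everything specific in this sketch is an assumption made for illustration: the type name `"ExceptionNotification"`, the `status` return value with its `BAD_ARGUMENT` and `IO_ERROR` codes, and the `translate` function itself are hypothetical. Only the general shape is taken from the text, namely that an exception notification carries a special type name and that the exception type and its description travel among the return values.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

EXCEPTION_TYPE_NAME = "ExceptionNotification"  # assumed special type name


@dataclass
class Notification:
    type_name: str
    operation: str                      # interface routine that was invoked
    values: dict = field(default_factory=dict)  # parameters and return values


# A rule is (predicate, exception type, description).
Rule = Tuple[Callable[[Notification], bool], str, str]


def translate(notification: Notification, rules: List[Rule]) -> Notification:
    """Turn an rC2C notification that signals an abnormal condition
    into an exception notification; pass every other notification through."""
    for matches, exception_type, description in rules:
        if matches(notification):
            values = dict(notification.values,
                          exception_type=exception_type,
                          description=description)
            return Notification(EXCEPTION_TYPE_NAME, notification.operation, values)
    return notification


# Hypothetical conventions of one particular rC2C.
RULES: List[Rule] = [
    (lambda n: n.values.get("status") == "BAD_ARGUMENT",
     "interface", "request rejected by the server"),
    (lambda n: n.values.get("status") == "IO_ERROR",
     "failure", "server could not complete the request"),
]

ok = translate(Notification("Notification", "read", {"status": "OK", "data": "x"}), RULES)
failed = translate(Notification("Notification", "read", {"status": "IO_ERROR"}), RULES)
```

Keeping the rules as data means a separate table can be supplied for each rC2C, since each regular component is likely to have its own convention for signalling abnormal conditions.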

Fig. 7. Promoting an rC2C to an iC2C

The fault tolerance capabilities of an existing iC2C may be extended in two ways:

a) Wrapping the existing iC2C as the normal activity component of a new iC2C (the wrapper) and providing an abnormal activity component responsible for handling some (or all) failure exceptions of that existing iC2C¹ and/or external exceptions not handled by it (Figure 8); or

b) Adapting the bottom interface of the existing iC2C through a new iC2C (the adaptor) responsible for handling some (or all) interface and failure exceptions sent by that existing iC2C (Figure 9). The adaptor's normal activity component acts as a simple redirector, delegating all requests to the existing iC2C.

The extensions described in items (a) and (b) above produce a new interface, provided by the wrapper or adaptor, which is a specialization of the existing iC2C interface, as in a subtype relationship.

¹ The normal activity component may send interface exceptions and failure exceptions. It is the internal connector that classifies the failure exceptions as being "internal" or "external", depending on the port that receives the notification.

Fig. 8. Wrapping an iC2C to improve fault tolerance

Fig. 9. Adapting an iC2C interface to improve fault tolerance

A C2 architecture (or configuration) may itself be treated as a component, allowing us to use the iC2C structure to build an "ideal C2 architecture (iC2A)". In such a configuration, the AbnormalActivity component of this iC2A is responsible for handling exceptions that may propagate through all levels of the architecture, including reconfiguration of the architecture (configuration exceptions) [12].

3.3 From an Ideal Component to an Ideal C2 Connector

In distributed systems, connectors may be as complex and fault-prone as components. In our proposed approach, a fault-tolerant connector should be able to detect and recover from three kinds of errors: (i) errors in the messages received from other connectors or components; (ii) errors in the communication between it and the components connected to it; and (iii) errors in its internal logic.

Figure 10 shows a C2 configuration for an idealised fault-tolerant connector, derived from the iC2C configuration (Figure 4). The TopConnector and BottomConnector are simple, more reliable connectors, providing low-level communication services between the surrounding components and connectors. The NormalActivityConnector provides the more specialized and complex services, such as message filtering and multicasting, and implements the error detection mechanisms. The AbnormalActivityConnector implements the error recovery procedures.

[Figure 10: from top to bottom, the TopConnector, NormalActivityConnector, AbnormalActivityConnector and BottomConnector. Labelled messages: Service Requests, Normal Responses, Interface Exceptions, Failure Exceptions, Local Exceptions, Return to Normal.]

Fig. 10. The idealised C2 connector

4 Formal Representation of the iC2C

In order to demonstrate that C2 architectures can be built using iC2Cs, we have developed a formal model for the iC2C. Our formal model is based on the UPPAAL tool [15], which allows the precise specification and verification of the iC2C protocols. The basis of UPPAAL is the notion of timed automata extended with data variables, such as integer and Boolean variables. The automata consist of a collection of control nodes connected by edges. The control nodes of the automata are decorated with invariants, which are conditions expressing constraints on the clock values. The edges of the automata are decorated with guards that express a condition to be satisfied for the edge to be taken, synchronisation actions that are performed when the edge is taken, and clock resets and assignments to integer variables.

The first model developed (Figure 11) was a high-level representation of the iC2C, consisting of three very simple automata, modelled as processes in UPPAAL.

[Figure 11: UPPAAL processes iC2C, Client and Server, with synchronisation channels r_in, r_out, n_in, n_out, ie_in, ie_out, fe_in and fe_out.]