SOFTWARE – PRACTICE AND EXPERIENCE
Softw. Pract. Exper. (2011)
Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/spe.1044

An approach to optimization of fault tolerant architectures using HiP-HOPS

Masakazu Adachi 1, Yiannis Papadopoulos 2,*, Septavera Sharvia 2, David Parker 2 and Tetsuya Tohdo 3

1 TOYOTA CENTRAL R&D LABS., INC., Aichi, Japan
2 University of Hull, Hull, U.K.
3 DENSO CORPORATION, Aichi, Japan

* Correspondence to: Yiannis Papadopoulos, University of Hull, Hull, U.K. E-mail: [email protected]

SUMMARY

New processes for the design of dependable systems must address both cost and dependability concerns. They should also maximize the potential for automation to address the problem of increasing technological complexity and the potentially immense design spaces that need to be explored. In this paper we show a design process that integrates system modelling, automated dependability analysis and evolutionary optimization techniques to achieve the optimization of designs with respect to dependability and cost from the early stages. Computerized support is provided for difficult aspects of fault tolerant design, such as decision making on the type and location of fault detection and fault tolerant strategies. The process is supported by HiP-HOPS, a scalable automated dependability analysis and optimization tool. The process was applied to a Pre-collision system for vehicles at an early stage of its design. The study shows that HiP-HOPS can overcome the limitations of earlier work based on Reliability Block Diagrams by enabling dependability analysis and optimization of architectures that may have a network topology and exhibit multiple failure modes. Copyright © 2011 John Wiley & Sons, Ltd.

Received 11 July 2010; Revised 29 October 2010; Accepted 4 November 2010

KEY WORDS: dependability analysis; fault tolerance; active safety; multi-objective optimization; genetic algorithms

1. INTRODUCTION

Architectural design has traditionally been a difficult exercise, and it becomes harder as the complexity of systems increases. Over the last two decades, the design of information and computer systems has shifted towards distributed and networked architectures [1, 2]. The design of dependable embedded systems in particular, such as those used in the transport industries, is presently moving from standalone and partitioned systems to functionally integrated architectures, which are characterized by extensive sharing of information and hardware resources. In such architectures, shared processors and communication channels allow a large number of configuration options at design time and a large number of reconfiguration options at runtime. This creates further difficulties in design because, as potential design spaces expand, their exploration for suitable or optimal designs becomes increasingly difficult. Achieving a design that can meet dependability (i.e. reliability, availability and safety) requirements is not easy, and a number of challenges must be addressed. First, requirements need to be captured and communicated correctly through the various stages of development, from creation
of abstract designs, through implementation in code, to assessment via analysis and testing of the system. To enable this, a body of work on model-based development is emerging in which abstract and progressively more refined models of requirements and their expression in design are used to drive the design, development and verification of the system. This work has resulted in methods, such as UML [3] and SysML [4], as well as notations and tools like Matlab Simulink, which enable system modelling that encompasses description of structure, behaviour and allocation of functions to hardware resources. More recently, architecture description languages (ADLs) such as the avionics AADL [5] and the automotive EAST-ADL2 [6] have emerged as potential future standards for model-based design in these sectors of the industry.

However, the transparency and consistency that are achieved in a model-based process are not sufficient to guarantee dependability. Techniques are also required that enable designers to predict whether design models meet the given requirements, and many such techniques have recently emerged. Over the last 15 years, work on model-based dependability analysis has resulted in new approaches that partly automate and simplify the synthesis of dependability evaluation models. A review of these techniques and of their relative merits can be found in Section 5. In this paper, we focus on one of those techniques, called Hierarchically Performed Hazard Origin and Propagation Studies (HiP-HOPS) [7, 8]. HiP-HOPS is a well-established technique in which predictive system failure models, such as fault trees and Failure Modes and Effects Analyses (FMEAs), are constructed from the topology of the system and component failure models using a process of composition.

HiP-HOPS and other modern model-based dependability analyses can help to automate and speed up assessment and, in that sense, enrich a model-driven development process. However, when a number of architectures can potentially deliver the functions of a system, designers are faced with additional complications. For instance, assuming that it is technically and economically possible to fulfil all dependability requirements, they must find the architecture that entails minimal development and other life cycle costs. On the other hand, in cases where it is too costly to meet all dependability requirements but some requirements could be reduced (e.g. availability), they must find those architectures that achieve key requirements like safety and give the best possible tradeoffs among other attributes of dependability and cost. It is widely accepted that the various formulations of the above represent hard, multi-objective optimization problems that can only be approached systematically with the aid of optimization techniques and computerized algorithms that can effectively search for optimal solutions in large potential design spaces. Some work has recently been done in this direction. Classical dependability models like Reliability Block Diagrams (RBDs) [9–12] and, more recently, HiP-HOPS [13] have been combined with meta-heuristics to assist in the automatic evolution of designs so that these designs can be automatically improved and meet dependability and cost requirements. HiP-HOPS in particular has contributed by introducing the concept of automated synthesis of dependability prediction models (fault trees and FMEAs) in the context of architectural optimization. However, although the principle has been sketched in [13], it has so far been applied only to simple redundancy allocation problems [14, 15], in which a fixed architecture is assumed and the optimal level of replication of components in parallel channels is decided by HiP-HOPS.

In this paper, we show an approach in which the architecture optimization capabilities of HiP-HOPS are applied at the early stages of software design, where requirements for fault detection and fault tolerance must be established. The objective of optimization is to augment the architecture of a basic functional model of the system with fault detection and fault tolerance capabilities in a way that achieves optimal or near-optimal tradeoffs between dependability and cost. The optimization is performed by a multi-objective genetic algorithm which progressively improves a Pareto set of non-dominated design solutions that represent optimal tradeoffs among the parameters of the optimization. In this approach, optimization is driven by cost and dependability. Dependability is measured as system risk, which in turn is calculated as a product of failure probability and severity of failure.

The paper makes three contributions. The first is a new approach to optimization of the fault detection and fault tolerance capabilities of a system that is achieved via automatic model transformations. A key innovation is the use of evolutionary algorithms in conjunction with fault
tree and FMEA synthesis algorithms which automate the continuous evaluation and re-evaluation of dependability. The approach supersedes the earlier work on reliability optimization using RBDs in two ways. First, it is applicable to systems that have a network topology, unlike the previous work with RBDs, which addressed only systems that could be represented as series-parallel configurations of components. Second, it moves beyond the classical 'success-failure' model assumed in RBDs by introducing a failure scheme in which components in the model can exhibit more than one failure mode, including loss and commission of functions as well as value and timing failures. In the context of the proposed approach, the paper makes two additional contributions. First, it shows how a set of architectural patterns commonly used for fault detection and fault tolerance (self-protection, self-checking, checkpoint and restart, and process pairs) can be modelled and then used for compositional dependability analysis and optimization, e.g. in the context of HiP-HOPS style analysis. Second, it defines a generic approach to modelling the detectability (or not) of errors propagated among the components of a system. In this approach, non-coherent compositional failure modelling (i.e. modelling that employs NOT gates) is employed for the first time to achieve analysis of the effect of the detectability of errors on the dependability of the system.

In Section 2, we outline the capabilities of HiP-HOPS that underpin the approach presented in this paper. In Section 3, we discuss four standard patterns of fault detection and fault tolerance and their fault modelling in HiP-HOPS. In Section 4, we apply the approach to an early design for a vehicle pre-collision system. We discuss the results of the optimization and highlight the benefits of this approach, which include the simplification of dependability analyses, the automation of complex optimizations and the establishment of a transparent, mathematical basis for achieving successful tradeoffs between dependability and cost in the design of systems. In Section 5, we review the relevant work, and in Section 6 we draw conclusions and highlight further work.
This local failure behaviour is represented as a set of logical failure expressions which show how output failures of the component can be caused by internal malfunctions and deviations of the component's inputs. A variant of Hazard and Operability Studies (HAZOP) is used to identify the plausible output failures and then to determine the local causes of such events as combinations of internal component malfunctions and similar types of input failures. At this stage, analysts are free to define and refine the set of failure classes examined in the course of the analysis, depending on the context and application. Failure classes describe the type of input or output deviation in question. In general, deviations fall into one of several main categories: provision failures, including omissions and commissions; value failures, e.g. too high or too low; and timing failures, e.g. late or early. Components can also transform input deviations of one failure class into output deviations of another; for example, a component that is designed to fail silent in response to input errors may convert an input deviation of the value or timing failure classes into an omission at its output. It is important to note that failure classes are user-defined and that other classification schemes are equally possible. In some cases, a general classification scheme involving omissions and commissions may be sufficient (and may be more transferable to similar applications), whereas in other cases more specific failure types may be required. As long as the failure types received at one end of each connection are generated at the other end, HiP-HOPS will be able to establish how errors propagate through the model.

HiP-HOPS expressions can include Boolean operators, such as conjunction (AND, for which we use the symbol ∧) and disjunction (OR, for which we use the symbol ∨). As an example of a HiP-HOPS annotation, we can describe an omission of output of a simple component using the expression:

Omission-o = Omission-i ∨ InternalFailure

by which we define that a deviation of class omission at output o can be caused either by an internal failure or by an omission at input i. It is also possible to be more precise and specify deviations of parameters on outputs; for example, a hydraulic link can be defined by the flow, pressure and temperature of the material carried through the link. By using parameters, it is possible to restrict a failure class to a specific attribute, e.g. an unexpectedly high value of the parameter 'temperature' at the input or output. Internal events and input/output events are not necessarily failure events. Note that, as part of the process, it is also possible to define normal events (and assign a probability of occurrence to them) and then assess the contribution of those events to fault propagation. It is also worth pointing out that HiP-HOPS supports non-coherent error modelling [16], i.e. the inclusion of NOT gates in failure expressions, which makes it possible to specify and correctly assess the effects of an event X and its complement NOT(X) in the same failure model. Treating an event and its complement as two separate events leads to incorrect qualitative and quantitative results in fault tree analysis and should be avoided. Once a component has been annotated with failure data, it can be stored together with its failure data in a library and reused across applications, where appropriate, to simplify the manual part of the analysis. The rest of the HiP-HOPS process is fully automated.
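To make this concrete, the sketch below (our own illustration in Python, not the HiP-HOPS syntax or implementation; all names are hypothetical) encodes the example annotation above and evaluates it against assumed sets of events:

```python
# Failure expressions as nested tuples: ("or", ...), ("and", ...), ("not", x);
# leaves name basic events ("InternalFailure") or input deviations ("Omission-i").
def evaluate(expr, true_events: set) -> bool:
    """Evaluate a local failure expression given the events assumed to be true."""
    if isinstance(expr, str):
        return expr in true_events
    op, *args = expr
    if op == "or":
        return any(evaluate(a, true_events) for a in args)
    if op == "and":
        return all(evaluate(a, true_events) for a in args)
    if op == "not":  # non-coherent modelling: NOT gates are permitted
        return not evaluate(args[0], true_events)
    raise ValueError(f"unknown operator: {op}")

# Omission-o = Omission-i OR InternalFailure (the example from the text)
omission_o = ("or", "Omission-i", "InternalFailure")

assert evaluate(omission_o, {"Omission-i"})        # input omission propagates
assert evaluate(omission_o, {"InternalFailure"})   # internal failure causes it
assert not evaluate(omission_o, set())             # no events, no output failure
```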
Once failure annotations have been inserted for all components in a system model, the topology of the model is used to automatically determine how the local failures specified in those annotations propagate through connections in the model and cause functional failures at the outputs of the system. This global view of failure is captured deductively, moving from effects towards causes, in a set of fault trees which are automatically constructed by traversing the model and evaluating the local failure expressions encountered during the traversal. The synthesized fault trees are interconnected and form a directed acyclic graph sharing branches and basic events that arise from dependencies in the model, e.g. common inputs which may cause simultaneous dependent failure of hypothetically 'independent' functions or physical components. Classical Boolean reduction techniques are applicable to this graph. Thus, in the final phase of a HiP-HOPS analysis, qualitative or quantitative analysis can be automatically performed on the graph to establish whether the system meets its dependability requirements. The logic contained in the graph can then be automatically translated into a simple table which is equivalent to a multiple failure mode FMEA. The FMEA records, for each component
in the system and for each failure mode of that component, any direct effects on the system and any further effects caused in conjunction with other failure events.

Note that when only abstract logical failure behaviour is specified for components in a model, fault trees and FMEAs can be used to determine the possible causes of system failures (this is known as qualitative analysis). When statistical data about failure are available, it is possible to calculate the probability of the top events of fault trees, a process known as quantitative analysis, and the probability and severity values together can be used as the basis of risk calculations. However, even without such numerical data, a qualitative fault tree analysis and FMEA can reveal useful and valuable information about the behaviour of the system. A qualitative analysis will reveal, for example, which failure modes are single points of failure or even common causes of two or more system failures.

Note that in hierarchical models that record the decomposition of systems, failure annotations may also be inserted at the subsystem level to collectively capture the effect of failure conditions that do not necessarily require examination at the basic component level. If, for example, a subsystem as a whole is susceptible to some environmental disturbance, such as high temperature, flood or electromagnetic interference, then the effects of this condition can be directly specified with a failure annotation at the subsystem level. Such annotations would typically complement other annotations made at the level of the enclosed components to describe the aspects of failure behaviour at that level (e.g. mechanical and electrical failure modes of each component). In general, when examining the causes of failure at an output of a subsystem, the fault tree synthesis algorithm of HiP-HOPS creates a disjunction between any failure logic specified at the subsystem level and logic arising from the enclosed lower levels. Note that the same concept can be used for the analysis of programmable components, e.g. controllers which enclose software architectures. In such cases, annotations that show the effect of hardware failures on outputs of the controller can be automatically combined with annotations that show the effect of failures propagated through the enclosed software on the same outputs.
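As a sketch of the quantitative step (our simplification, assuming independent basic events with constant failure rates; neither the function names nor the combination rule are taken from the HiP-HOPS implementation), the probability of a top event can be estimated from its minimal cut sets:

```python
import math

def event_probability(rate_per_hour: float, hours: float) -> float:
    """Probability that a basic event with a constant failure rate occurs within t hours."""
    return 1.0 - math.exp(-rate_per_hour * hours)

def top_event_probability(cutsets: list, hours: float) -> float:
    """Rare-event estimate of the top-event probability: sum over minimal cut sets,
    where each cut set requires all of its basic events to occur."""
    return sum(
        math.prod(event_probability(r, hours) for r in cutset)
        for cutset in cutsets
    )

# Two single-event cut sets and one two-event cut set (hypothetical rates):
print(top_event_probability([[1e-6], [5e-7], [1e-6, 5e-7]], hours=10_000))
```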
2.2. Dependability optimization

The HiP-HOPS analysis of an architecture may show that dependability and cost requirements have been met, in which case the proposed system design can be realized. In practice, though, this analysis will often indicate that certain requirements cannot be met, in which case the design will need to be revisited. This is a problem commonly encountered in the design of safety critical systems: designers of such systems usually have to achieve certain levels of dependability while working within cost constraints. Design is, of course, a creative exercise; however, it relies not only on the technical skills of the design team but also on experience and successful earlier projects. We believe that further automation can assist decision making on the selection among alternative components or subsystem designs, as well as on the level of replication of components in the model, such that the system ultimately meets its dependability requirements with minimal cost. To address this problem, HiP-HOPS has been extended with multi-objective optimization capabilities. These allow the tool to search the design space, defined by the variability of a design model, for potential design solutions that are optimal, or near optimal, in terms of dependability and cost. In this approach, a variable design model for a system is one in which components and subsystems have alternative user-defined implementations, which can include standard fault tolerant configuration schemes. Each implementation is fully annotated with local component failure data and information about the cost of the component. HiP-HOPS uses a multi-objective genetic algorithm to effectively search the design space defined by the permutations of the design that can arise following resolution of the variability. The genetic algorithm exploits the automated fault tree and FMEA synthesis algorithms of the tool to calculate the fitness of candidate designs. Note that other parameters can be included in this approach as long as their evaluation model is compositional, i.e. the prediction of the parameter at system level can be made from component models.

In the course of the evolutionary process, the optimization algorithm of HiP-HOPS typically generates populations of candidate designs which employ user-defined alternative implementations
for components and subsystems. It can also automatically replicate components and subsystems in parallel channels. For the genetic algorithm to progress towards an optimal solution, a selection process is applied in which the fittest designs survive and their genetic makeup is passed to the next generation of candidate designs. The fitness of each design relies on cost and dependability; to calculate fitness, therefore, we need ways to automatically calculate those two elements. An indication of the cost of a system is calculated as the sum of the costs of its components (although for more accurate calculations, life cycle costs should also be taken into account, such as production, assembly and maintenance costs). However, while the calculation of cost is relatively easy to automate, the automated evaluation of dependability is more difficult, as conventional methods rely on manual construction of the dependability prediction model (e.g. the fault tree, RBD or FMEA). HiP-HOPS, though, automates the development and calculation of such models, and therefore facilitates the evaluation of fitness as a function of dependability. This in turn enables a selection process through which the genetic algorithm can progress towards optimal or near-optimal solutions which can satisfy dependability requirements at minimal cost.

When dealing with multiple conflicting objectives, such as dependability and cost, a single optimum solution is unlikely to exist. The goal of optimization is therefore to generate a Pareto front of non-dominated solutions that represent optimal tradeoffs among the objectives of optimization. For a solution to exist on the Pareto front, there must be no other solution in the set that is equal to or better than it across all objectives. Choosing between the solutions on the Pareto front means making a compromise on at least one of the objectives, e.g. moving towards higher dependability will also mean moving towards higher cost.

HiP-HOPS uses a variant of the Non-Dominated Sorting Genetic Algorithm (NSGA-II) [10] for optimization. The original NSGA-II algorithm allows both non-dominated and dominated solutions to exist in the population (i.e. the current set of design candidates). To help decide which solutions pass on their characteristics to the next generation, they are ranked according to the number of other solutions they dominate; the more dominant solutions are more likely to be used than the less dominant ones. Optionally, HiP-HOPS is also able to discard all but the dominant solutions. This is known as a pure-elitist algorithm (since all but the best solutions are discarded); in our experimentation this provided a significant increase in performance, and the quality of the solution set was not compromised. The pure-elitist approach was used in the case study.

To further enhance the quality of solutions and the speed with which they can be found, a number of other modifications were made to the algorithm implemented in HiP-HOPS. One improvement was to maintain a solution archive similar to those maintained by tabu search and ant colony optimization; this has the benefit of ensuring that good solutions are not accidentally lost during subsequent generations.
Another improvement was to allow constraints to be taken into account during the optimization process, in a way similar to penalty-based optimization: the algorithm is encouraged to maintain solutions within the constraints, and solutions outside them, while permitted, are penalized to a varying degree. In addition, younger solutions, i.e. ones more recently created, are preferred over ones that have been maintained in the population for a longer period; again, this helps to ensure a broader search of the design space by encouraging new solutions to be created rather than reusing existing ones.

In general, when choosing solutions to 'breed', those with a lower rank (i.e. less dominated) are favoured. Where a selection must be made between two solutions with the same rank (as is always the case with pure elitism), they are distinguished by their crowding. As it is desirable to have a wide and evenly spaced Pareto front, solutions that lie in uncrowded regions of the front are favoured, as these represent areas of the search space that have not been explored. Crowding is calculated by ordering all of the solutions in each of the dominance ranks by each of their objective values in turn. The difference between the values of the nearest neighbours on each side is averaged over all objectives. This gives an individual crowding distance.

Once the individuals have been selected for breeding, genetic operators are applied to them to generate child solutions. These operators are mutation and recombination in the form of uniform crossover. The mutation operator makes a random change to the encoding of the solution to promote diversity in the population and resist premature convergence. The crossover operator
combines aspects from two parent solutions into a pair of child solutions. This amounts to a more detailed exploration of a particular region of the search space and encourages convergence.
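The crowding computation described above can be sketched as follows (our reconstruction of the standard NSGA-II crowding distance for one dominance rank, not the HiP-HOPS source; it sums the normalized neighbour gaps per objective, which differs from averaging only by a constant factor):

```python
def crowding_distances(objectives: list) -> list:
    """Crowding distance of each solution in one dominance rank.

    objectives[i] is the tuple of objective values (e.g. cost, risk) of solution i.
    Boundary solutions receive an infinite distance so the extremes of the front
    are always preserved; interior solutions accumulate the gap between their
    nearest neighbours on each side, normalized per objective.
    """
    n, m = len(objectives), len(objectives[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: objectives[i][k])
        span = objectives[order[-1]][k] - objectives[order[0]][k] or 1.0
        dist[order[0]] = dist[order[-1]] = float("inf")
        for j in range(1, n - 1):
            gap = objectives[order[j + 1]][k] - objectives[order[j - 1]][k]
            dist[order[j]] += gap / span
    return dist

# Three candidate designs on a (cost, risk) front -- hypothetical values:
print(crowding_distances([(10.0, 0.05), (12.0, 0.02), (20.0, 0.01)]))
```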

3. HiP-HOPS MODELLING OF FAULT TOLERANT MECHANISMS FOR DEPENDABILITY ANALYSIS AND OPTIMIZATION

When we consider optimization of a software architecture with respect to dependability, it is important to define the types of fault tolerance mechanisms that can potentially be added to locations in the original architecture in order to improve its failure detection, failure handling and overall dependability characteristics. The presence or absence of the various combinations of these mechanisms defines a set of possible design configurations which form the design space to be searched through optimization. In this section, we examine the use of four well-established software fault tolerance techniques as architectural transformation patterns that can be employed in the course of such optimization. For each technique, we specify a pattern of local failure logic in the syntax of HiP-HOPS [7, 17], which defines how input failures are blocked or mitigated by the respective fault tolerant technique, or whether such failures are propagated to outputs.

Fault tolerance techniques have been classified into two major groups: single-version and multi-version software techniques [18, 19]. Single-version techniques are generally easier to implement than multi-version techniques since they are based on simple error detection/recovery mechanisms and redundancy of a single version of the software. Multi-version fault tolerance techniques, on the other hand, use more intricate mechanisms [20]. They typically employ two or more variants of the software, deployed in a structured manner to ensure that failures in one version never impact the others. The validity of this approach rests on the assumption that versions built by different designers using different algorithms and tools are unlikely to fail simultaneously in the same failure mode, an assumption that has been supported in numerous studies [21, 22] but also criticized in [23]. In the aerospace and nuclear domains, expensive multi-version fault tolerance techniques are often applied due to ultra-high dependability requirements which, to a large extent, override concerns about increased development time and cost. Automotive systems, however, are required to be not only safe and reliable but also time and cost effective; multi-version fault tolerance techniques are therefore not practical for the automotive domain. For these reasons, and given that in this paper we are concerned with automotive systems, we focus on the use of single-version fault tolerance techniques for the optimization of fault tolerant architectures.

Before addressing the failure expressions of single-version fault tolerance techniques, we have to consider plausible failure modes of software behaviour in order to explicitly define the causal relationships between input and output deviations. The types of failure modes that should be defined are strongly related to the granularity of the design. In this study, we deal specifically with abstract functional structures, where each component is a composite element regarded as a black box, and thus only qualitative failure modes can be examined. At this level of analysis, four types of qualitative failure modes, or failure classes, are often introduced [7, 24, 25]: omission (O), commission (C), timing (T) and value (V) failures. In general, a commission failure (C) describes situations where a faulty system produces an output when a correctly functioning system would have produced no output at all.
In an omission failure (O), the faulty system does not produce the expected output, while in timing (T) and value (V) failures the output is produced at an incorrect time or with an incorrect value. How the effects of such outputs propagate through a system depends on the protocol of communication between components. In a pool protocol, for example, with a single writer and multiple readers that can access the pool at any time, omission by the writer means a failure to update the pool and a consequent value failure for the readers, who can only access old data. On the other hand, in a channel protocol where writer and reader are synchronized, omission by the reader means that the writer cannot proceed and therefore also fails by omission.
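A minimal sketch of this protocol dependence (our illustration; the pool and channel behaviour follows the text, everything else, including the function name, is hypothetical):

```python
from enum import Enum

class F(Enum):
    OMISSION = "O"
    COMMISSION = "C"
    TIMING = "T"
    VALUE = "V"

def effect_of_writer_omission(protocol: str) -> F:
    """Deviation seen by the reader when the writer omits its output."""
    if protocol == "pool":
        # The pool is never updated, so readers keep consuming stale data:
        # the omission is transformed into a value failure.
        return F.VALUE
    if protocol == "channel":
        # Writer and reader are synchronized, so (per the text) an omission
        # on either side propagates through the connection as an omission.
        return F.OMISSION
    raise ValueError(protocol)

assert effect_of_writer_omission("pool") is F.VALUE
assert effect_of_writer_omission("channel") is F.OMISSION
```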


Figure 1. Self-protection.

Table I. Failure annotation of self-protection component.

Self-protection (SP)
  Internal malfunction: failure, λ (failures per hour)
  Missing failure: miss, μ ∈ [0, 1]
  Output failure modes:
    O-sp out = failure ∨ O-sp in ∨ ¬miss ∧ (C-sp in ∨ T-sp in ∨ V-sp in)
    C-sp out = miss ∧ C-sp in
    T-sp out = miss ∧ T-sp in
    V-sp out = miss ∧ V-sp in

N/A
  Output failure modes:
    O-sp out = O-sp in
    C-sp out = C-sp in
    T-sp out = T-sp in
    V-sp out = V-sp in

For further analysis of the meaning of the failure classes (O, C, T, V), the reader is referred to [26, 27], where extensive discussions can be found.

For the purposes of HiP-HOPS analysis, it is essential to represent formally how each single-version fault tolerance technique responds to these four failure modes propagated by other components. At the abstract functional level, we do not refer to the concrete behaviour (algorithms) of components; once details of implementation are obtained, the qualitative failure modes will be refined hierarchically (ideally in an automatic way) into more fine-grained ones. Consequently, a strict analysis can be performed only at a lower level of design. In this context, the failure expressions given in the following sections are often approximations, or conservative with respect to the quality of analysis. For our purpose, however, what matters is to reduce unnecessary design iterations by identifying potential design flaws and critical functions as early as possible, without the detailed design, and we need not insist on full accuracy of analysis at this point.

3.1. Self-protection

Error detection is a simple but effective fault tolerance technique in single-version software. There are two basic functions which realize error detection: self-protection and self-checking. Self-protection means that a component must be able to protect itself from external disturbances by detecting errors propagated from other components. Self-checking means that a component must be able to detect internal errors and to block or mitigate the propagation of those errors to other components. First, we address a formal failure expression of self-protection. In this paper, a self-protection component is conceptually assumed to be a function added to a target component. Figure 1 depicts a block diagram representation of a self-protection component, while Table I shows the HiP-HOPS analysis for this component. Self-protection is considered to be able to detect all failure modes (O, C, T, V). However, self-protection itself does not have a mechanism to recover from detected failures, and we assume that
the component responds by failing silent in response to detected failures. As a result, through the self-protection component, all detected failures are transformed into an omission failure (O) at the output. The failure expression of self-protection is given by

O-sp out = failure ∨ O-sp in ∨ C-sp in ∨ T-sp in ∨ V-sp in,   (1)
where failure collectively denotes all internal malfunctions of the component. Intuitively, failure is an abstract representation of hardware failures (stuck memory, processor failure, etc.) and software failures (logical defects, etc.). Since, as mentioned before, we will not elaborate the details of failure at this level of analysis, hardware and software failures are not distinguished. We assume that fault tolerant mechanisms are reliable and do not create subtle and difficult-to-detect timing and value errors in outputs as a result of internal failure. This assumption may not hold in practice, but it is introduced here for simplicity. However, we should note that this assumption is not necessary; for example, the modelling of components in the case study includes cases where internal failures lead to timing and value errors at outputs. By assigning a failure rate λ to failure, we can perform quantitative dependability analysis (failure probability calculation). Estimating the failure rate is an inherently difficult problem, especially for software [28–31], but what matters most here is not setting a correct value for λ but establishing the relative effectiveness of adding fault tolerant components across the possible design configurations. In this sense, the failure rate discussed here is a characteristic which measures the relative merits of two components with regard to their reliability. We show how the value is set in the case study presented later.

Note that the failure expression (1) means that, whenever input deviations occur, self-protection always detects them and fails silently with probability 1. In practice, however, there will be cases where self-protection misses some failures and the effects are propagated to other components. In order to describe such situations, we introduce an additional event miss to fault tolerant components, as shown in Figure 1. A variable μ ∈ [0, 1] is associated with miss to represent the probability that a fault tolerant component fails to detect the input deviations. Table I shows the complete failure annotation of self-protection in HiP-HOPS. By introducing miss, the three input failure modes commission, timing and value are propagated to outputs with a constant probability μ. Thus, self-protection fails silently when and only when the errors are successfully detected. In Table I we also define an alternative failure annotation N/A (i.e. Not Applicable) which represents a situation where the self-protection function is not provided. In the context of architectural optimization, this allows HiP-HOPS to choose whether or not to implement the self-protection mechanism. In the case where N/A is chosen, all failure modes propagate through the component. In general, N/A is used in this paper to denote situations in which a fault tolerance mechanism specified in a model is not implemented. In a given model where different fault tolerant mechanisms have been specified, the choice between their implementation and N/A creates variants of the architecture that define the design space to be explored by HiP-HOPS.
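Table I can be exercised directly; the sketch below (our transcription into Python, with hypothetical function names) reproduces the output deviations of the self-protection wrapper for given input deviations and states of the failure and miss events:

```python
def self_protection_outputs(inp: set, failure: bool, miss: bool) -> set:
    """Output deviations of the SP wrapper per Table I (our encoding).

    inp holds the input deviation classes present, drawn from {"O","C","T","V"}.
    """
    out = set()
    # O-sp out = failure OR O-sp in OR (NOT miss AND (C-sp in OR T-sp in OR V-sp in))
    if failure or "O" in inp or (not miss and inp & {"C", "T", "V"}):
        out.add("O")  # detected errors are converted into fail-silence
    # C/T/V-sp out = miss AND C/T/V-sp in: missed errors pass through unchanged
    out |= {f for f in ("C", "T", "V") if miss and f in inp}
    return out

# A value error is blocked (turned into an omission) when detection succeeds...
assert self_protection_outputs({"V"}, failure=False, miss=False) == {"O"}
# ...and propagates unchanged when detection misses it.
assert self_protection_outputs({"V"}, failure=False, miss=True) == {"V"}
```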
3.2. Self-checking

Figure 2. Self-checking.

Self-checking is another mechanism of error detection, one which additionally reacts to internal malfunctions occurring in the target component. This fault tolerance technique is also assumed to be connected to the target component, as shown in Figure 2. Self-checking needs internal information about the target component, and thus ports are required which can transmit the occurrence of internal malfunctions from the target component to the self-checking component. Consequently, the failure expression of the target component is slightly modified. Let the ports be det and rec. When an internal malfunction Target.failure‡ takes place in the target component, this information is sent to the self-checking component through the det port. Then, if the self-checking component successfully
detects the failure, it is reduced or recovered by that function and fed back to the target component through the rec port.

‡ Events and ports from different components may have the same name, such as failure. When we need to identify an element explicitly, the component name is given together with the name of the event or port; the component name can be omitted where it is clear from the context.

First, we show how the failure expression of the target component is modified by adding these ports. Let the original failure expression of the target component be

∗-out = g(failure, ∗-in),   (2)

where g denotes the Boolean function of the failure expression and ∗ ∈ {O, C, T, V} is a failure mode belonging to this set. In the modified target component, the occurrence of Target.failure is indirectly interpreted as an output of det:

F-det = failure,   (3)

where the failure mode F represents a situation in which a failure has actually taken place in the target component. Moreover, through the self-checking component, this information is substituted back into the original failure expression:

∗-out = g(F-rec, ∗-in).   (4)

Owing to the existence of the self-checking component, an internal malfunction Target.failure is not directly propagated to the output but is bypassed via the det and rec ports. Thus, the failure expression of self-checking defines a causal relationship between F-sc in (= F-det) and F-sc out (= F-rec). Table II shows the failure annotation of self-checking.

Table II. Failure annotation of self-checking component.

Self-checking (SC)
  Internal malfunction: failure, λ (failures per hour)
  Missing failure: miss, μ ∈ [0, 1]
  Output failure mode:
    F-sc out = (failure ∨ miss) ∧ F-sc in

N/A
  Output failure mode:
    F-sc out = F-sc in

When self-checking is implemented, the internal malfunction Target.failure (= F-det = F-sc in) is blocked and the target component continues to work appropriately, except in cases where the self-checking component itself fails (Self-checking.failure) or misses the propagated failure (Self-checking.miss).
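The det/rec rewiring of Equations (2)-(4) and Table II can be sketched as follows (our illustration; g is an arbitrary example expression and all function names are hypothetical):

```python
def g(failure: bool, in_dev: bool) -> bool:
    """Example original failure expression (2): *-out = g(failure, *-in)."""
    return failure or in_dev

def self_check(sc_failure: bool, sc_miss: bool, f_in: bool) -> bool:
    """Table II: F-sc out = (failure OR miss) AND F-sc in."""
    return (sc_failure or sc_miss) and f_in

def target_out(target_failure: bool, in_dev: bool,
               sc_failure: bool, sc_miss: bool) -> bool:
    f_det = target_failure                        # (3): F-det = failure
    f_rec = self_check(sc_failure, sc_miss, f_det)
    return g(f_rec, in_dev)                       # (4): *-out = g(F-rec, *-in)

# An internal malfunction is masked by a healthy self-checking component...
assert target_out(True, False, sc_failure=False, sc_miss=False) is False
# ...but re-emerges if the checker fails or misses it.
assert target_out(True, False, sc_failure=False, sc_miss=True) is True
```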

3.3. Checkpoint and restart

Figure 3. Checkpoint-restart: (a) logical architecture [19] and (b) abstract functional structure.

In contrast to error detection techniques, the checkpoint and restart mechanism not only detects failures but recovers from errors by simply restarting the target component. Software faults which are not eliminated in the development process will be exposed as failures at runtime, but it is difficult to predict when these failures will happen. Moreover, software failures are likely to occur only under special, infrequent and non-repeatable circumstances. Given these empirical observations, simply restarting the target component is, in practice, an effective countermeasure against unanticipated runtime failures.

In order to restart the execution, a checkpoint and restart component must be able to store a set of states as snapshots. Depending on how the checkpoints are created, there are two restart strategies: static and dynamic. Static restart has predefined states to which execution can be reset and resumed; because of this reset property, all computations performed before restarting are lost. Dynamic restart, by contrast, creates snapshots adaptively at fixed execution time intervals or at particular checkpoints embedded in the code. The dynamic restart strategy offers a more reliable mechanism because information obtained before the occurrence of failures can be exploited to recover from the error states. The left of Figure 3 shows the logical architecture of the checkpoint and restart technique.

The two restart strategies employ different algorithms, and thus behave differently in the context of failures. This means that the fault tolerance capability of the mechanism depends on how the errors are detected, how the checkpoints are created and how the restart points are determined. However, when we consider the checkpoint and restart technique in the context of our failure expressions, it is difficult to define these aspects concretely, because we focus only on the abstract functional structure and implementation details should be examined only at a lower level of design. Therefore, we simply regard the checkpoint and restart technique as a mechanism which can detect the output deviations (O, C, T, V) of the target component and tries to complete the execution by returning to the states before the failure. The right of Figure 3 shows an abstract functional structure of checkpoint and restart, and its failure annotation is given in Table III. The restart of execution does not essentially change the function or intention of the software. Thus, we presume that recovery from omission and prevention of commission failures are possible as long as they are successfully detected. However, re-execution from a checkpoint in the past will potentially cause a timing failure, and a value failure may also arise as a side effect of a timing failure. Consequently, the checkpoint and restart technique can respond to all failure modes, although timing and value failures may remain as the price of its recovery property.


Table III. Failure expression of checkpoint and restart.

Checkpoint-restart (CR)
  Internal malfunction: failure, λ (failures per hour)
  Missing failure: miss, μ ∈ [0, 1]
  Output failure modes:
    O-cr out = failure ∨ (miss ∧ O-cr in)
    C-cr out = miss ∧ C-cr in
    T-cr out = ¬miss ∧ (O-cr in ∨ C-cr in) ∨ T-cr in ∨ V-cr in
    V-cr out = ¬miss ∧ (O-cr in ∨ C-cr in) ∨ T-cr in ∨ V-cr in

N/A
  Output failure modes:
    O-cr out = O-cr in
    C-cr out = C-cr in
    T-cr out = T-cr in
    V-cr out = V-cr in

3.4. Process pairs

Figure 4. Process-pair: (a) logical architecture [19] and (b) abstract functional structure.

Among single-version software fault tolerance techniques, the process pair is the only one that employs redundancy realized by two identical software components. The technique resembles checkpoint and restart in terms of its error detection and recovery mechanisms. The difference is that, when a failure is detected, the process pair can complete the execution without returning to past checkpoints. In a process pair component, two redundant software components always run concurrently on different processors, and the restart is
immediately done by simply switching the processors from primary to secondary after the occurrence of a failure (see Figure 4). Because of this redundancy of execution, data synchronicity, which is not restored by the checkpoint and restart technique, is conceptually guaranteed. In practice, the process pair is structurally and logically complex, and how it responds to failures depends on the manner of implementation. For the purposes of abstract analysis, we simply regard it as an advanced version of checkpoint and restart at the abstract functional level. In other words, we consider that not only omission and commission failures but also timing and value failures are fully recovered unless failure or miss occurs in the process pair component. We should also note that a process pair does not protect against systematic failures, e.g. failures in the specification; it protects only against random failures that affect the execution of one component. The failure annotation of process pair is shown in Table IV.

Obviously, it is impossible to prepare an ideal fault tolerant component that perfectly handles all types of failures. The motivation in this paper is to search among possible architectural configurations that employ different fault tolerant mechanisms. The focus, therefore, is on the relative advantages among fault tolerance techniques rather than on a detailed description of failure behaviour. For example, we assumed that the self-protection component reacts to failures in a fail-silent way, but this is not always true, depending on the implementation details; as a result, the actual failure behaviour may differ from our definition. In addition, there are also several options for the inclusion of fault tolerance techniques other than those presented here. Overall, it should be noted that failure behaviour can potentially be defined at different levels of abstraction, and we have presented an abstract representation of fault tolerance techniques which is reasonably sufficient for our purpose.

Table IV. Failure annotation of process pair.

Process-pair (PP)
  Internal malfunction: failure, λ (failures per hour)
  Missing failure: miss, μ ∈ [0, 1]
  Output failure modes:
    O-pp out = failure ∨ (miss ∧ O-pp in)
    C-pp out = miss ∧ C-pp in
    T-pp out = miss ∧ T-pp in
    V-pp out = miss ∧ V-pp in

N/A
  Output failure modes:
    O-pp out = O-pp in
    C-pp out = C-pp in
    T-pp out = T-pp in
    V-pp out = V-pp in
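Side by side, Tables III and IV differ only in how a detected deviation is recovered; the sketch below (our transcription, with hypothetical function names) makes the contrast executable:

```python
def cr_outputs(inp: set, failure: bool, miss: bool) -> set:
    """Checkpoint-restart per Table III (our encoding of the expressions)."""
    out = set()
    if failure or (miss and "O" in inp):
        out.add("O")
    if miss and "C" in inp:
        out.add("C")
    # Re-execution from a past checkpoint may itself produce late or stale outputs:
    # T-cr out = V-cr out = NOT miss AND (O-in OR C-in) OR T-in OR V-in.
    if (not miss and inp & {"O", "C"}) or inp & {"T", "V"}:
        out |= {"T", "V"}
    return out

def pp_outputs(inp: set, failure: bool, miss: bool) -> set:
    """Process pair per Table IV: detected deviations are fully recovered."""
    out = set()
    if failure or (miss and "O" in inp):
        out.add("O")
    out |= {f for f in ("C", "T", "V") if miss and f in inp}
    return out

# A detected input omission is fully recovered by a process pair...
assert pp_outputs({"O"}, failure=False, miss=False) == set()
# ...whereas checkpoint-restart recovers it at the price of possible
# residual timing and value failures.
assert cr_outputs({"O"}, failure=False, miss=False) == {"T", "V"}
```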

Figure 5. Possible design configurations using fault tolerance techniques.

3.5. Architecture optimization

Once all the options of fault tolerance techniques and the corresponding failure behaviours are defined, the optimization capabilities of HiP-HOPS can be applied to search for optimal architectures which employ some of these fault tolerant techniques to maximize dependability and minimize cost. In the experiments described in this paper, dependability has been measured in terms of the risk arising from malfunctions of the system, where risk has been calculated as the sum of failure probability multiplied by the severity of each malfunction.

Figure 5 shows a generic conceptual diagram of the possible add-on components that improve the fault detection and fault tolerance capabilities of a target component using the four fault tolerance techniques discussed in this paper. The inclusion or not of one or more of those techniques results in 12 (= 2×2×3) possible design options for a single target component. For each of those design options at the component level, and for the architecture of a system as a whole, the overall failure behaviour can be obtained by HiP-HOPS from Tables I–IV and the failure annotations of the target components. After constructing the overall failure expression, the probability of failure of the fault tolerant architecture is calculated using the failure rates and probabilities assigned to the basic events failure and miss in each component.

Assuming that the 12 options are available for all components in a system, the whole design space to be explored grows as O(12^n), where n denotes the number of target components. It is clear
that this exponential growth of the design space makes it infeasible to find an optimal architecture by hand, and thus an algorithmic approach which efficiently explores the huge solution space is necessary. Furthermore, in general, dependability is not the only quality metric of software, and we have to take account of other metrics at the same time. If dependability were the only objective of optimization, there would be a trivial solution: implement all of the fault tolerance techniques. Such a decision, however, is often meaningless because it is highly inefficient in development time and cost. To successfully achieve tradeoffs between such orthogonal requirements, we need a multi-objective optimization technique. In this context, HiP-HOPS is a promising tool which provides a tight integration of automatic dependability analysis and architecture optimization [13]. In the following section we show a case study of architecture optimization using the failure expressions of fault tolerance techniques with the aid of HiP-HOPS.
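The size of the per-component option set and of the resulting design space is easy to reproduce (a sketch under the assumptions stated above; the component count anticipates the case study, everything else is hypothetical naming):

```python
from itertools import product

# Per-component choices: self-protection on/off, self-checking on/off,
# and one of {none, checkpoint-restart, process pair}: 2 * 2 * 3 = 12.
options = list(product(("SP", "N/A"), ("SC", "N/A"), ("N/A", "CR", "PP")))
assert len(options) == 12

n = 7  # e.g. the seven PCS software components of the case study in Section 4
print(f"design space size: 12**{n} = {12**n:,}")  # 35,831,808 configurations
```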

4. PRE-COLLISION SYSTEM CASE STUDY

A Pre-collision system (PCS) is an emerging automotive safety technology that avoids or reduces the damage caused by a collision. A PCS first emits warnings to the driver when a potential collision threat is identified. Then, even if the driver fails to apply the brakes, a brake assist function will pre-emptively increase the braking force in advance of a collision to reduce the collision injury. Therefore, a PCS can support drivers mainly in two phases, early warnings and emergency braking, according to the environmental information.

Figure 6. Abstract functional structure of PCS.

Figure 6 shows the abstract functional structure of the basic PCS. In Figure 6, the Detection Module processes information gathered by several sensors, such as millimetre-wave radar, to detect obstacles on the road ahead, and then sends signals regarding the distance and relative speed of the obstacle to the kernel of the PCS. The Monitoring Module and Switch Controller monitor external inputs, for example the driver's operation and the system main switches, to supervise the control modes of the PCS. The kernel of the PCS consists of PCS Logic1 and PCS Logic2. These logic blocks compute the imminence of a collision based on the pre-processed sensor data, and judge whether or not the warnings and brake assist should be activated. The Decision Module then coordinates the outputs of PCS Logic1 and PCS Logic2, and determines an appropriate action depending on the priority of events and/or the urgency of situations. Finally, the Actuator Driver sends control inputs to the braking system and the human-machine interface. Except for sensors and internal memories represented by small boxes in Figure 6 (i.e. Radar Sensor, Pedal Sensor, Vehicle Sensor, Switch Sensor and Memory),
all blocks are purely software components and are collectively referred to as PCS software components. The targets of optimization are the PCS software components, and fault tolerance techniques will be applied to them.

4.1. Failure annotation of the PCS

Before optimizing the architecture of the PCS, we have to give the failure expression of each component. As mentioned before, in this type of early analysis it makes sense simply to regard the causes that lead to internal malfunctions as an abstract event that we call failure. In this case study, however, we refine failure into two basic events, hard and soft, for the PCS software components. These two basic events are introduced to describe more specifically the causal relationships between the input and output failure modes (O, C, T, V) and internal malfunctions. The event hard implicitly represents physical breakdowns of hardware equipment, and is strongly related to omission and value failures. We assume that the self-checking technique can detect only the occurrence of hard, that is, Equation (3) is replaced by F-det = hard. The event soft represents malfunctions triggered by defects in the software. As buggy software erroneously processes data and behaves in ways not expected by the designer, commission and timing failures will be caused by soft. Table V shows the failure expressions of the PCS components. These are tentatively defined from the envisioned functions of each component and the data types processed between the components, as a basis for discussion, and therefore do not strictly correspond to the behaviour of real systems. For the sensors and memories that are not PCS software components, the failure rate λ = 2.28×10−7 is assigned to failure. For software components, the failure rates of hard and soft are uniformly set to λ = 5.7×10−7 and λ = 1.14×10−6, respectively. It should be noted that the values of the failure rates are not based on any empirical data, but are hypothetically given to illustrate the proposed approach. Arguably, work on software reliability growth models could be used to estimate such failure rates (see pioneering work in [32] and more recent papers). For simplicity, failure rates here are assumed to be constant, and therefore failure probability follows the exponential distribution over time.

Having annotated the model, we first analysed the failure probability of the PCS without any fault tolerance technique. In the analysis, we identified a hazardous scenario, in which the braking force is lost when there is braking intention, as one of the most critical system-level failures of the PCS§. This failure mode can be formally written as O-Actuator Driver.out. For this hazardous functional failure, also referred to as the top event in the context of fault tree analysis, HiP-HOPS first generates the fault tree by parsing the failure expressions, and then computes the probability of the top event based on minimal cut-set analysis. We cannot show the fault tree due to lack of space, but 13 cut-sets are found for the top event O-Actuator Driver.out, as shown in Table VI. For each cut-set, the probability is calculated from the failure rates of the basic events, taking into account the system operating time (we set it to 10 000 h). Then, the system failure probability can be obtained by adding the probability of each cut-set. For the top event O-Actuator Driver.out, the failure probability of the PCS is evaluated as 6.077×10−2. This result shows that the system is not sufficiently reliable.
Moreover, the derived cut-sets are all of order 1, which means that the occurrence of any single failure immediately leads to the hazard. In this sense, we also know that the original PCS is not fault tolerant. In such circumstances, a typical solution is to improve the system design by enhancing it with fault tolerant mechanisms that can improve the overall dependability. In this particular study, the optimization algorithms of HiP-HOPS have been used to select the optimal location and type of such mechanisms in an enhanced version of the PCS.

§ The identification of the critical hazards of the PCS was based on MISRA-SA (safety analysis) [32]. However, how we derive and assess the risk of hazards through preliminary safety analysis is beyond the scope of this paper, and the details are omitted.

Table V. Failure expressions of PCS components.

Sensors and memories
  Internal malfunction: failure: 2.28×10−7 (failures per hour)
  Output failure modes:
    O-D = failure
    C-D = failure
    T-D = failure
    V-D = failure

Detection Module
  Internal malfunctions: hard: 5.70×10−7, soft: 1.14×10−6 (failures per hour)
  Output failure modes:
    O-out = hard ∨ O-in1 ∨ O-in2
    C-out = soft ∨ C-in1 ∨ C-in2
    T-out = soft ∨ T-in1 ∨ T-in2
    V-out = hard ∨ (V-in1 ∧ V-in2)

Monitoring Module
  Internal malfunctions: hard: 5.70×10−7, soft: 1.14×10−6 (failures per hour)
  Output failure modes:
    O-out = hard ∨ O-in1 ∨ O-in2
    C-out = soft ∨ (T-in1 ∧ T-in2) ∨ (V-in1 ∧ V-in2)
    T-out = soft ∨ C-in1 ∨ C-in2
    V-out = hard

Switch Controller
  Internal malfunctions: hard: 5.70×10−7, soft: 1.14×10−6 (failures per hour)
  Output failure modes:
    O-out = hard ∨ O-in1 ∨ O-in2
    C-out = soft ∨ (T-in1 ∧ T-in2) ∨ (V-in1 ∧ V-in2)
    T-out = soft ∨ C-in1 ∨ C-in2
    V-out = hard

PCS Logic1
  Internal malfunctions: hard: 5.70×10−7, soft: 1.14×10−6 (failures per hour)
  Output failure modes:
    O-out = hard ∨ O-in1 ∨ O-in2 ∨ T-in1
    C-out = soft ∨ C-in1 ∨ C-in2 ∨ T-in2 ∨ V-in2
    T-out = soft
    V-out = hard ∨ (V-in1 ∧ V-in2)

PCS Logic2
  Internal malfunctions: hard: 5.70×10−7, soft: 1.14×10−6 (failures per hour)
  Output failure modes:
    O-out = hard ∨ (O-in1 ∧ O-in2 ∧ O-in3) ∨ T-in1 ∨ T-in2
    C-out = soft ∨ (C-in1 ∧ C-in2 ∧ C-in3)
    T-out = soft ∨ (T-in1 ∧ T-in2 ∧ T-in3)
    V-out = hard ∨ (V-in1 ∧ V-in2 ∧ V-in3)

Decision Module
  Internal malfunctions: hard: 5.70×10−7, soft: 1.14×10−6 (failures per hour)
  Output failure modes:
    O-out = hard ∨ O-in1 ∨ O-in2 ∨ O-in3 ∨ O-in4
    C-out = soft ∨ C-in1 ∨ C-in2 ∨ C-in3 ∨ C-in4 ∨ V-in2 ∨ V-in3 ∨ V-in4
    T-out = soft ∨ T-in1 ∨ T-in2 ∨ T-in3 ∨ T-in4
    V-out = hard ∨ V-in1 ∨ V-in2 ∨ V-in3 ∨ V-in4

Actuator Driver
  Internal malfunctions: hard: 5.70×10−7, soft: 1.14×10−6 (failures per hour)
  Output failure modes:
    O-out = hard ∨ O-in1
    C-out = soft ∨ C-in1 ∨ V-in1
    T-out = soft ∨ T-in1
    V-out = hard ∨ V-in1
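To show how annotations such as those in Table V are read, the sketch below encodes the Actuator Driver's output failure modes as boolean expressions over its internal malfunctions (hard, soft) and input failure modes. This is a hypothetical encoding for exposition only, not the HiP-HOPS annotation syntax.

# Hypothetical encoding of the Table V annotations for the Actuator Driver.
# True means that the named basic event or failure mode is present.

actuator_driver = {
    "O-out": lambda s: s["hard"] or s["O-in1"],
    "C-out": lambda s: s["soft"] or s["C-in1"] or s["V-in1"],
    "T-out": lambda s: s["soft"] or s["T-in1"],
    "V-out": lambda s: s["hard"] or s["V-in1"],
}

# Example: a value failure at the input propagates to both commission and
# value failures at the output, even without an internal malfunction.
state = {"hard": False, "soft": False,
         "O-in1": False, "C-in1": False, "T-in1": False, "V-in1": True}
print({mode: expr(state) for mode, expr in actuator_driver.items()})
# {'O-out': False, 'C-out': True, 'T-out': False, 'V-out': True}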

Table VI. Cut-sets of O-Actuator Driver.out.

Cut-set                   Probability     Cut-set                   Probability
Detection Module.soft     1.134×10−2      Detection Module.hard     5.684×10−3
Monitoring Module.hard    5.684×10−3      Switch Controller.hard    5.684×10−3
PCS Logic1.hard           5.684×10−3      PCS Logic2.hard           5.684×10−3
Decision Module.hard      5.684×10−3      Actuator Driver.hard      5.684×10−3
Radar Sensor.failure      2.277×10−3      Pedal Sensor.failure      2.277×10−3
Vehicle Sensor.failure    2.277×10−3      Switch Sensor.failure     2.277×10−3
Memory.failure            2.277×10−3

4.2. Optimizing the fault tolerant architecture

In each generation of the evolutionary optimization performed by HiP-HOPS, child solutions are created by mutation and crossover operators until the number of children equals the size of the existing population. Once the creation cycle has been completed, the new solutions are added to the population alongside the existing solutions. If the total population size then exceeds the maximum population size, solutions are sorted first by their domination rank and then by their crowding value, and are culled from the bottom of the list (the most dominated, most crowded) until the population is within its limits once more. For the case study, a variable population limit was used: rather than fixing the population limit, the algorithm begins without any limit and runs in this way for a fixed number of generations (usually about 10% of the total number). The population limit is then set to the number of feasible non-dominated solutions in the population, where solutions are feasible if they do not violate any user-set constraints. Finally, to ensure that no non-dominated solutions are lost through population culling, a separate archive population is maintained.
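A minimal sketch of this survivor-selection step is given below; the Solution fields and function names are our own illustration and do not reflect the actual HiP-HOPS implementation.

from dataclasses import dataclass
from typing import List

@dataclass
class Solution:
    rank: int        # domination rank (0 = non-dominated front)
    crowding: float  # crowding distance (larger = less crowded)
    feasible: bool   # True if no user-set constraint is violated

def cull(population: List[Solution], limit: int) -> List[Solution]:
    # Sort best-first (lowest domination rank, then largest crowding
    # distance) and truncate from the bottom of the list: the most
    # dominated, most crowded solutions are dropped first.
    survivors = sorted(population, key=lambda s: (s.rank, -s.crowding))
    return survivors[:limit]

def variable_limit(population: List[Solution]) -> int:
    # After the initial unconstrained generations, the population limit
    # is fixed to the number of feasible non-dominated solutions.
    return sum(1 for s in population if s.feasible and s.rank == 0)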

Figure 7. Possible design configuration of PCS.

Figure 7 shows all the possible design alternatives for each component in the initial, unoptimized architecture. In Figure 7, sp (self-protection), sc (self-checking) and cr pp (checkpoint and restart, and process pairs) correspond to the fault tolerance options presented in Figure 5. As shown, each of these blocks has two alternative 'implementations': one representing the situation in which the fault tolerant component is included in the architecture, and one representing the situation in which it is omitted. The failure expressions that represent each of the two options are embedded in each block, and HiP-HOPS is therefore able to analyse the entire failure propagation in the system for the different configurations in which fault tolerant components have or have not been included.

We should note that the introduction of these fault tolerant components results in the generation of non-coherent fault trees, owing to the inclusion of NOT operators in the failure logic (see the modelling of '¬miss'). Although non-coherent modelling generally increases the complexity of the analysis, in this case it generates more accurate results. We found that treating '¬miss' as a separate independent event (to maintain coherency) results in a significant discrepancy in the calculated failure probability. For example, the probability of O-cr pp7 with and without the use of non-coherent modelling is 0.023 and 0.017, respectively. This can be attributed to the generation of 'hidden' cut-sets (termed 'prime implicants' in non-coherent fault trees) by the Consensus algorithm [33] used in non-coherent analysis. A small illustration of this effect is sketched below.
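The sketch below illustrates the effect on a toy non-coherent expression, top = (A ∧ miss) ∨ (B ∧ ¬miss), with made-up probabilities that are unrelated to the PCS model. Exact analysis respects the fact that miss and ¬miss are mutually exclusive (and the consensus of the two terms exposes the hidden prime implicant A ∧ B), whereas the coherent approximation replaces ¬miss with a fresh independent event and therefore miscounts the overlap.

import itertools

pA, pB, p_miss = 0.3, 0.2, 0.1  # illustrative values only

# Exact probability of top = (A and miss) or (B and not miss), by
# enumerating the eight states of the independent basic events A, B, miss.
exact = 0.0
for a, b, m in itertools.product([False, True], repeat=3):
    if (a and m) or (b and not m):
        exact += ((pA if a else 1 - pA) *
                  (pB if b else 1 - pB) *
                  (p_miss if m else 1 - p_miss))

# Coherent approximation: treat 'not miss' as an independent event N with
# P(N) = 1 - p_miss, and combine the two terms by inclusion-exclusion.
p1, p2 = pA * p_miss, pB * (1 - p_miss)
approx = p1 + p2 - p1 * p2

print(f"exact:    {exact:.4f}")   # 0.2100
print(f"coherent: {approx:.4f}")  # 0.2046 -- a noticeable discrepancy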

Fault tolerant components have several design parameters that strongly affect the result of the optimization performed by HiP-HOPS. Hereafter, we discuss how their values were determined in this case study. A fault tolerant component has three parameters: (1) the failure rate λ, (2) the probability of missing a failure (miss) and (3) the cost.
Table VII. Parameters of fault tolerant components.

Subset  Fault tolerance technique   Component   λ (failures per hour)   Miss probability   Cost
A       Self-protection (SP)        sp1-6       1.14×10−7               0.10               2
A       Self-checking (SC)          sc1-3       1.14×10−7               0.15               4
A       Checkpoint-restart (CR)     cr pp1-3    1.14×10−7               0.20               8
A       Process pair (PP)           cr pp1-3    1.14×10−7               0.30               10
B       Self-protection (SP)        sp7-11      1.14×10−7               0.12               3
B       Self-checking (SC)          sc4-5       1.14×10−7               0.18               6
B       Checkpoint-restart (CR)     cr pp4-5    1.14×10−7               0.24               12
B       Process pair (PP)           cr pp4-5    1.14×10−7               0.36               15
C       Self-protection (SP)        sp12-16     1.14×10−7               0.18               5
C       Self-checking (SC)          sc6-7       1.14×10−7               0.27               10
C       Checkpoint-restart (CR)     cr pp6-7    1.14×10−7               0.36               20
C       Process pair (PP)           cr pp6-7    1.14×10−7               0.54               25

First, regarding the failure rate of fault tolerant components, it was considered reasonable to give a value sufficiently small compared with the failure rates of the original components, because fault tolerant components represent well-tested, mature and therefore reliable components. For this reason, the failure rates of the fault tolerant components are uniformly set to λ = 1.14×10−7. On the other hand, it is not easy to determine the probabilities of missing failures and the development costs, because they depend on the nature of the system. For this problem, we use the degree of dependence of components in the system as a criterion for assigning appropriate values to these parameters. For example, the Decision Module has a high degree of dependence on other components because it requires information processed by modules further upstream. In such components, the implementation of fault tolerance techniques, especially the error detection mechanism, becomes quite intricate, and therefore the probability of missing failures and the development cost are likely to be high. In this case study, we decided the parameters by classifying the PCS software components into three subsets according to their degree of dependence: A: Detection Module, Monitoring Module, Switch Controller; B: PCS Logic1, PCS Logic2; C: Decision Module, Actuator Driver. Table VII shows the parameters of each fault tolerant component. Within a subset having the same degree of dependence, the probability of miss and the cost increase in the order of Self-protection