NASA Technical Memorandum 89009

NASA-TM-89009 19860021844

Towards Automatic Markov Reliability Modeling of Computer Architectures

Carlos A. Liceaga and Daniel P. Siewiorek

August 1986

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23665

Summary

The analysis and evaluation of reliability measures using time-varying Markov models has gained in importance for computer architectures that use standby redundancy or can be repaired. The task of generating these models for arbitrary Processor-Memory-Switch (PMS) interconnection structures, however, is tedious and prone to human error due to the large number of states and transitions involved in any reasonable structure. Existing programs that evaluate these models make the following assumptions:

a) The case analysis of success states of the system has been carried out. Such analysis must be done manually.

b) The input to the program is either an intermediate representation (e.g. Fault Tree), or the state transition matrix (STM).

This is the first attempt to (a) identify and analyze the problems involved in the automatic generation of reliability and availability Markov models for arbitrary interconnection structures at the PMS level, and (b) generate and implement solutions to these problems. This work will automate the task of case analysis and generation of the STM in the computation of the reliability and availability of PMS structures. The advantages of such an approach are (a) utility to a larger class of users, not necessarily expert in reliability analysis, and (b) a lower probability of human error in the computation.

A program named ARM (Automated Reliability Modeling) will be constructed as a research vehicle. ARM will accept as inputs:

a) The interconnection graph of the PMS structure.

b) The behavior of the PMS structure components in terms of their internal communication structure, and their distributions and corresponding parameters of performance and reliability.

c) The groups of redundant components (e.g. processor triads).

d) A succinct statement of the operational requirements on the PMS structure in the form of a modified Boolean expression.

The operational requirements may be, for example, "two processor triads and two memory triads". The communication structures (e.g. buses) in the PMS system will be considered in addition to the explicitly stated requirements to determine how the interconnection structure affects the system reliability or availability in the case of a redundant multiprocessor. The output of the ARM program will be the reliability and availability STM. The STM will be formulated for direct use by evaluation programs.

Acknowledgement

The authors are very grateful for the assistance of Larry D. Lee of NASA-LaRC in defining the various types of time-varying Markov models, calculating the number of repetitions that simulations require, and developing a useful state space reduction technique. His numerous and detailed comments have greatly improved the clarity of this document.

Table of Contents

1. Introduction
   1.1 Background
   1.2 Previous Work
   1.3 Motivation
   1.4 Organization

2. System Description
   2.1 Component Types
   2.2 Redundant Groups
   2.3 System Watchdog Timers
   2.4 PMS Structure
   2.5 Intracomponent Port Connections
   2.6 Intra Component-Type Communication
   2.7 Component Clustering
   2.8 System Requirements
   2.9 Example

3. Automated Reliability Modeling Considerations
   3.1 Detection of Symmetry in the PMS Graph
   3.2 Segmentation of the PMS Graph
   3.3 Identification of Success and Failure States
   3.4 Generation of Models for PMS Graph Segments
   3.5 Merging of Models for PMS Graph Segments
   3.6 Reduction of the State Space

4. Automated Reliability Modeling Examples
   4.1 Cm* Computer Module
   4.2 Effect of the System Requirements
   4.3 Cm* Cluster
   4.4 Effect of the PMS Interconnection

5. Plans for Future Work

6. Conclusion

A. ARM Program Algorithms
   A.1 Symmetry Detection Algorithm
   A.2 Segmentation Algorithm
   A.3 Success and Failure State Identification Algorithm
   A.4 Minimal Subtree Model Generation Algorithm

References

List of Figures

Figure 1-1: Reliability Graph of a Triad with 1 Spare
Figure 1-2: Hierarchy of Time-Varying Markov Models
Figure 2-1: Use of Component Type Information in Reliability Models
Figure 2-2: Reliability Graph of a Triad with a Watchdog
Figure 2-3: Grammar of Requirements
Figure 2-4: PMS Diagram of Multiprocessor Described in Table 2-2
Figure 3-1: Parse tree of requirement expression (3.1)
Figure 4-1: Cm* Architecture
Figure 4-2: Cm* Computer Module
Figure 4-3: Model of Figure 4-2 Cm* Requiring 1 P & 1 M
Figure 4-4: Model of Figure 4-2 Cm* Requiring 1 P & 2 M
Figure 4-5: Cm* Cluster
Figure 4-6: Model of Figure 4-5 Cm* Requiring 2 P & 5 M
Figure 4-7: Nonsymmetrical connection of Figure 4-5 Cm* Cluster
Figure 4-8: Model of Figure 4-7 Cm* Requiring 2 P & 5 M

List of Tables

Table 2-1: Redundancy Technique Specification
Table 2-2: Multiprocessor System Description Example
Table 3-1: Automated Reliability Modeling Steps
Table 3-2: Minimal Subtree Modeling Steps
Table 3-3: Two Model Merging Steps
Table 4-1: Failure Rates of Cm* Modules

1. Introduction

Computer

systems

multiprocessor

and

widespread

use to

growth

being

is

complex

fault-tolerance

design are in

measures reliability

The

to

make

and

analysis

computer

evaluation

is

very

of

assume

an

in

preliminary achieved.

more

in

system Although

of

Thus

the

one of the system

in the literature system

providing

and

reliability

designers

system

reliability

and

to

With

Section

understanding

therefore are

by

tedious

experienced reliability analysts. program, discussed

parameters.

computing

efficieht

more

the importance

has become

of

into This

of successively

design

as

with

tools.

and

systems

task

coming

and reliability.

have been reported

more

evaluation

as

measures

the

are

has increased

reliability

efforts

systems

availability

trend

reliability

and sophistication

performance

the

This

and system

easier

computer

by

blocks.

Several

complexity

higher

assisted

progress

in

distributed

of system

tasks.

growing

achieve

building

computation

are

1.2,

of the

prone the

nature

of

decomposition

and

ADVISER

not

does

error

complex even

for

exception of the ADVISER

existing

reliability

for

software tools usually

analysis

techniques

and

computational

aids

once the

analysis

been

manually

make

combinatorial techniques and is therefore

has

this

assumption it uses

limited in the complexity of

systems and fault types it can analyze.

More advanced techniques are required to analyze computer architectures that use standby redundancy, can be repaired, or are susceptible to transient or intermittent faults. One possibility is a time-varying Markov model. The advantages offered by time-varying Markov models are that they are in widespread use among reliability analysts and several programs, discussed in Section 1.2, have been developed to solve them. However time-varying Markov models can not analyze concurrent events. For example, a fault that arrives while the system is reconfiguring itself around a previous fault would be represented by a transition to a state where two faults are present. This new state would not take into account the time spent reconfiguring from the first fault.

Another possibility is the extended stochastic Petri net (ESPN) described in [Dugan 84]. The advantages offered by the ESPN are that it can analyze concurrent events and that it can model the system at a lower level of detail than time-varying Markov models. The ESPN can analyze concurrent events because its 'tokens' can be simultaneously enabled to move concurrently at independent times. The low level modeling capability is due to mechanisms such as counters and queues that can simulate the process being modeled. To solve an ESPN analytically it must be converted to a time-varying Markov model. This conversion is not possible if tokens are already moving concurrently at independent transition times that are not exponentially distributed, because this makes the process non-Markovian (i.e. the transition probabilities depend on past states). In general an ESPN that can not be converted must be solved by simulation.

Simulations can include any level of detail, and are thus flexible, but many repetitions are needed to ensure accuracy. For example, say the probability of failure P is going to be estimated with a relative error of no more than 10% within a confidence interval of 95%. The relative error E is defined as:

$$E = \frac{|\hat{P} - P|}{P} \qquad (1.1)$$

where $\hat{P}$ is the estimate of P. $\hat{P}$ is defined as:

$$\hat{P} = F / N \qquad (1.2)$$

where F is the number of failures observed and N is the sample size. Then an expression for N must be found such that:

$$\Pr(E \le .1) = .95 \qquad (1.3)$$

Substituting (1.1) into (1.3) and multiplying the inequality by P gives:

$$\Pr(|\hat{P} - P| \le .1P) = .95 \qquad (1.4)$$

Substituting (1.2) into (1.4) and multiplying the inequality by N gives:

$$\Pr(|F - NP| \le .1NP) = .95 \qquad (1.5)$$

Substituting $\mu$ for NP in (1.5) gives:

$$\Pr(|F - \mu| \le .1\mu) = .95 \qquad (1.6)$$

The inequality in (1.6) can be expressed as:

$$\Pr(.9\mu \le F \le 1.1\mu) = .95 \qquad (1.7)$$

If N is large and P is small, F is approximately Poisson distributed with mean $\mu = NP$, and (1.7) can be expressed as:

$$\sum_{i=.9\mu}^{1.1\mu} \frac{\mu^{i} e^{-\mu}}{i!} = .95 \qquad (1.8)$$

Therefore, in life critical applications where a probability of failure in the order of $10^{-9}$ is required, approximately $3.8 \times 10^{11}$ simulation repetitions are necessary! In general those applications require an analytic approach.

It is the intent of this paper to explore the various issues in the automatic generation of reliability and availability Markov models for arbitrary interconnection structures at the Processor-Memory-Switch (PMS) level. The result of this effort will be implemented and experimentally validated in the ARM (Automated Reliability Modeling) program, which will accept the PMS interconnection structure and a simple set of operational requirements on the structure. The program will attempt to efficiently analyze the system states using a divide-and-conquer methodology, based on the interconnection structure and the operational requirements. The output of the ARM program will be a file containing the reliability or availability state transition matrix. The output format will vary depending on the program used to evaluate the state transition matrix. The evaluation programs whose format the user will be able to specify are: SURE, HARP, and ARIES (described in Section 1.2).
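Returning to the simulation estimate in (1.8), the repetition count can be checked numerically. The following is a minimal sketch, not part of the original derivation: it searches for the smallest integer Poisson mean for which the probability of landing within 10% of the mean reaches 95%, then divides by P. The function names, the integer rounding of the interval bounds, and the linear search are all assumptions made for illustration.

```python
import math

def poisson_interval_prob(mu, rel_err_pct=10):
    """Pr(lo <= F <= hi) for F ~ Poisson(mu), where lo and hi are the
    integer bounds of the interval mu*(1 -/+ rel_err_pct/100)."""
    lo = -(-(100 - rel_err_pct) * mu // 100)   # integer ceiling
    hi = (100 + rel_err_pct) * mu // 100       # integer floor
    total = 0.0
    for i in range(lo, hi + 1):
        # Evaluate mu^i e^-mu / i! in log space to avoid overflow.
        total += math.exp(i * math.log(mu) - mu - math.lgamma(i + 1))
    return total

def required_mean(confidence=0.95, rel_err_pct=10):
    """Smallest integer mean mu = N*P that meets the confidence level."""
    mu = 1
    while poisson_interval_prob(mu, rel_err_pct) < confidence:
        mu += 1
    return mu

def required_repetitions(p_fail, confidence=0.95, rel_err_pct=10):
    """Sample size N = mu / P needed to estimate a failure probability P."""
    return required_mean(confidence, rel_err_pct) / p_fail

if __name__ == "__main__":
    print(f"{required_repetitions(1e-9):.2e}")
```

Under these assumptions the required mean is a few hundred expected failures, so the repetition count scales as a few hundred divided by P, which is what makes simulation impractical at failure probabilities near $10^{-9}$.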

The following sections will present a brief background on reliability calculation at the PMS level using time-varying Markov models. Previous work in the generation and evaluation of reliability models is surveyed. The goals for ARM will be stated and compared with those of previous efforts. The final section will present the organization of this paper.

1.1 Background

Present day computer systems can be viewed at varying levels of detail, and therefore so can the process of designing and analyzing them. Four levels were defined by Siewiorek, Bell, and Newell [Siewiorek 82a]. These levels range from the circuit level, through the logic and programming levels, to the PMS level. The PMS level is one view of digital systems where the primitives are processors, memories, switches, transducers, etc. as opposed to the logic level where the primitives may be gates, registers, multiplexers, etc.

Hardware components are susceptible to hard (permanent), transient, and intermittent faults as discussed in [Siewiorek 82b]. A fault is an erroneous state of hardware or software resulting from a physical change in the hardware, or interference from the environment. Permanent faults are continuous and stable, and result from an irreversible physical change. Intermittent faults are occasionally present due to unstable hardware or varying hardware or software states (for example, as a function of load or activity). Transient faults result from temporary environmental conditions.

Fault-tolerant computer systems can be affected by a limited set of faults without interruptions in their operation. Some computer systems achieve fault-tolerance by using redundant groups of components to perform the same operations. The system must determine which is the correct output using diagnostics or majority voting. Siewiorek and Swarz [Siewiorek 82b] discuss various redundancy techniques, the more relevant ones are defined below.

STATIC REDUNDANCY - In static redundancy faults are masked through a majority vote involving a fixed group of redundant components. Thus, when the masking redundancy is exhausted by component faults, any further faults will cause errors at the output.

DYNAMIC REDUNDANCY - In dynamic redundancy faults are not masked but the faulty components are detected, isolated, and reconfigured out of the system. The faulty components may be replaced by spares if available.

HYBRID REDUNDANCY - In hybrid redundancy faults are masked through a majority vote involving a group of redundant components that is reconfigured when spares are available. Thus, when the redundancy is exhausted by component faults, any further faults will cause errors at the output.

ADAPTIVE VOTING - In adaptive voting faults are masked through a majority vote involving a variable group of redundant components without spares. Faulty components are reconfigured out of the system by excluding them from the voting process and adjusting the voter threshold to reflect a smaller number of components. Thus, when the masking redundancy is exhausted by component faults, any further faults will cause errors at the output.

ADAPTIVE HYBRID - In adaptive hybrid faults are masked through a majority vote involving a variable group of redundant components that is reconfigured when spares are available. If spares are not available the faulty components are reconfigured out of the system by excluding them from the voting process and adjusting the voter threshold. Thus, when the masking redundancy is exhausted by component faults, any further faults that occur before a faulty component is replaced by a spare or reconfigured out of the voting process will cause errors at the output.
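The voting behavior described in these definitions can be illustrated with a small sketch. This is only an assumed model of a voter, not one taken from the source: `majority_vote`, the `outputs` mapping, and the `active` set are hypothetical names. Passing a fixed `active` set corresponds to static redundancy, while shrinking it after a fault is detected mimics adaptive voting, since the threshold is recomputed from the number of components still voting.

```python
from collections import Counter

def majority_vote(outputs, active):
    """Return the value produced by a strict majority of the active
    components, or None when no such majority exists.

    outputs: mapping from component name to the value it produced.
    active:  names of the components currently included in the vote.
    """
    ballots = [outputs[name] for name in sorted(active)]
    if not ballots:
        return None
    # The threshold adapts to the number of components still voting.
    threshold = len(ballots) // 2 + 1
    value, count = Counter(ballots).most_common(1)[0]
    return value if count >= threshold else None

if __name__ == "__main__":
    triad = {"P1": 42, "P2": 42, "P3": 7}           # P3 is faulty
    print(majority_vote(triad, {"P1", "P2", "P3"}))  # the fault is masked
```

With a fixed group of three, a second disagreeing component leaves no majority and the function returns `None`, which corresponds to the exhaustion of masking redundancy described above.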

For example, a triad is a group of 3 components that use hybrid redundancy to tolerate at least one fault. If a triad recovers from a fault by replacing the faulty component with a spare it can then tolerate a second fault. Recovery is the process of detecting, isolating, and reconfiguring the faulty component out of the system. The fault coverage of a component is the probability that the system can survive a fault in this component and successfully recover. If the system can always recover it has a "perfect" coverage of 1.

Reliability measures are defined in terms of probabilities because the failure processes in hardware components are non-deterministic. Siewiorek and Swarz [Siewiorek 82b] discuss these various measures, the more relevant ones are defined below.

RELIABILITY - The reliability, R(t), of a system as a function of time t is the conditional probability that the system has survived the interval [0, t] given that it was operational at time zero. It is a non-increasing function whose initial value is one.

MTTF - The MTTF (Mean Time To Failure) is the expected time of the first system failure assuming a new (perfect) system at time zero.

AVAILABILITY - The availability, A(t), of a system as a function of time t is the probability that the system is operational at that instant of time t. If the limit of A(t) exists as t goes to infinity, it expresses the expected fraction of time that the system is available to perform useful computations.

Availability is typically used as a figure of merit in systems in which service can be delayed or denied for short periods to perform preventive maintenance or repair without serious consequences. The system availability is important in the computation of the life-cycle costs.

Reliability is used to describe systems in which repair is typically infeasible, such as aerospace applications. The MTTF can be derived from R(t) as follows:

$$\mathrm{MTTF} = \int_{0}^{\infty} R(t)\,dt$$

The most commonly used reliability function is based on a Poisson process with an exponential distribution. This is called the exponential reliability function, and has the form:

$$R(t) = e^{-\lambda t}$$

where $\lambda$ is the hazard or failure rate. The failure rate is a constant which reflects the reliability of the component, and for highly reliable components is usually expressed in failures per million hours. The exponential reliability function is used when the failure rate is time-independent, such as when components do not age. It is often observed in electronic components that, after a burn-in period, permanent faults follow a relatively constant failure rate. The MTTF of the exponential reliability function for a single component has the form:

$$\mathrm{MTTF} = \frac{1}{\lambda}$$

Many other reliability functions have been formulated. The second most common reliability function is based on the Weibull distribution. This is called the Weibull reliability function, and has the form:

$$R(t) = e^{-(\lambda t)^{\alpha}}$$

where $\lambda$ is the scale parameter and $\alpha$ is the shape parameter (other reparameterized forms are also common). It is equivalent to the exponential function when $\alpha$ is one. The Weibull reliability function is used when the failure rate is time-dependent. Permanent faults for components that age can be described using an increasing failure rate ($\alpha$ greater than one) and in this case the system is not as good as new when repair takes place. Data presented in [McConnel 81] indicates that transient faults follow a decreasing failure rate ($\alpha$ less than one).
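The relationship between the reliability functions and the MTTF can be checked numerically. The sketch below is illustrative only and assumes the Weibull parameterization R(t) = exp(-(λt)^α); the function names, the trapezoidal rule, and the finite upper integration limit standing in for infinity are assumptions, not something the source specifies.

```python
import math

def exponential_reliability(t, lam):
    """R(t) = exp(-lam * t), constant hazard rate lam."""
    return math.exp(-lam * t)

def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t) ** alpha), assumed parameterization with
    scale parameter lam and shape parameter alpha."""
    return math.exp(-((lam * t) ** alpha))

def mttf(reliability, upper, steps=200_000):
    """Trapezoidal approximation of the integral of R(t) over [0, upper].
    'upper' must be large enough that R(upper) is negligible."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h

if __name__ == "__main__":
    lam = 1e-3  # hypothetical failure rate, failures per hour
    print(mttf(lambda t: exponential_reliability(t, lam), 50 / lam))
    print(mttf(lambda t: weibull_reliability(t, lam, 2.0), 50 / lam))
```

For the exponential case the numerical integral should agree with 1/λ. For the assumed Weibull form the closed-form value is Γ(1 + 1/α)/λ, a standard result for this parameterization rather than one stated in the source.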

The failure processes of different components are assumed to be independent of each other. This assumption is not strictly true, such as when electrical, mechanical, or thermal conditions in one component affect other components in its proximity. However it is close enough in practice to be used to simplify the analysis.

The state of a system represents all that must be known to describe the system at any instant. As components fail or are repaired, the system changes its state. These changes of state are called state transitions. If all possible system states are assumed to be known a discrete-state model is used; if this assumption is not made a continuous-state model is used. If the state transition times are assumed to be restricted to some multiple of a given time interval a discrete-time model is used. If it is assumed that state transitions can occur at any time a continuous-time model is used. Most systems can be classified according to their state space and time parameter as:

a) discrete-state and discrete-time

b) discrete-state and continuous-time

c) continuous-state and discrete-time

d) continuous-state and continuous-time

For a discrete-state system model a state transition diagram (STD) may be drawn. The STD is a directed graph. The nodes correspond to system states and the directed arcs to allowable state transitions. Each arc has a label that identifies the conditional probability that the system will go to the destination node of that directed arc given that the system was initially at the originating node. The label used depends on the distribution of the transition from the originating node. For example, the label could be the hazard rate for the exponential distribution, the scale and shape parameters for the Weibull distribution, or the filename of a histogram for more general distributions.

If transitions are allowed from failed states to operational states then the STD is an Availability graph and A(t) may be obtained from it. R(t) may be obtained by specifically disallowing failed to working state transitions from the STD thus making it a Reliability graph.

A Reliability graph of a triad is given in Figure 1-1. In this model it is assumed that the system has a perfect coverage of 1. The horizontal transitions represent fault arrivals. These transitions follow an exponential distribution and consequently $\lambda$ represents the constant hazard rate. The coefficients of $\lambda$ represent the number of working processors that are being actively used in the configuration. The vertical transitions represent recovery from a fault. These transitions follow a general distribution and consequently the label of each vertical transition represents the filename of the histogram defining the distribution. There is a race between the occurrence of a second fault and the removal of the first. If the second fault wins the race, then system failure occurs. If the removal of the first fault wins the race, then the system reconfigures into a simplex (i.e. only uses one of the two working components). Unless otherwise noted in the state descriptions, all working processors are being actively used in the configuration.

Key:

State   Description
1       3 working
2       2 working
3       system failed
4       2 working, uses 1
5       system failed

Figure 1-1: Reliability Graph of a Triad with 1 Spare

The information conveyed by the STD is often summarized in a square matrix called the state transition matrix (STM). The STM element in row i and column j is the label in the arc from state i to state j.

The terminology used in this paper to denote the various types of time-varying Markov models, and the assumptions they are based on, are defined below. The hierarchy of time-varying Markov models is illustrated in Figure 1-2.

TIME-VARYING MARKOV PROCESS - A stochastic process whose future state depends only upon the present state, and not upon the history that led to its present state.

HOMOGENEOUS MARKOV MODEL - A model that uses a pure Markov process whose state transition probabilities are time-independent. For the continuous-time process this implies that the state transition times follow an exponential distribution. This model is discussed in [Chung 67] and [Romanovsky 70].

SEMI-MARKOV MODEL - A model that uses a generalization of the pure Markov process whose state transition probabilities depend upon the local time spent in the present state. For the continuous-time process this implies that the state transition times do not follow an exponential distribution, but they can follow a Weibull or any other distribution. This model is discussed and applied to computer systems in [White 84].

NON-HOMOGENEOUS MARKOV MODEL - A model that uses a generalization of the pure Markov process whose state transition probabilities depend upon the global time. For the continuous-time process this implies that the state transition times do not follow an exponential distribution. Often they are assumed to follow a Weibull distribution. This model is discussed and applied to computer systems in [Trivedi 81].

Figure 1-2: Hierarchy of Time-Varying Markov Models. The root, Time-Varying Markov, branches into Homogeneous (time-independent), Semi-Markov (local time-dependent), and Non-Homogeneous (global time-dependent).

The probability of being in a particular state for a discrete-state and continuous-time Markov model can be expressed with a differential equation. The set of simultaneous differential equations that describe these models are called the continuous-time Chapman-Kolmogorov equations. For homogeneous Markov models these equations can be solved using matrix or Laplace transformations.
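As an illustration of solving a homogeneous model from its state transition matrix, the sketch below computes state probabilities at time t for constant transition rates. It uses uniformization (a series over an embedded discrete-time chain), which is a swapped-in numerical technique chosen for brevity and is not the matrix or Laplace transformation method named above. The three-state triad without spares, the rate value, and all function names are hypothetical.

```python
import math

def transient_probabilities(rates, p0, t, tol=1e-12, max_terms=10_000):
    """State probabilities at time t for a homogeneous Markov model.

    rates[i][j] is the constant transition rate from state i to state j
    (diagonal entries are ignored); p0 is the initial distribution.
    Assumes q*t is moderate so that exp(-q*t) does not underflow.
    """
    n = len(p0)
    exit_rate = [sum(rates[i][j] for j in range(n) if j != i) for i in range(n)]
    q = max(exit_rate) or 1.0
    # One-step matrix of the uniformized discrete-time chain: M = I + Q/q.
    m = [[rates[i][j] / q if i != j else 1.0 - exit_rate[i] / q
          for j in range(n)] for i in range(n)]
    weight = math.exp(-q * t)            # Poisson(q*t) probability of 0 jumps
    vec = list(p0)
    result = [weight * v for v in vec]
    accumulated, k = weight, 0
    while 1.0 - accumulated > tol and k < max_terms:
        k += 1
        vec = [sum(vec[i] * m[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k              # Poisson probability of k jumps
        accumulated += weight
        result = [r + weight * v for r, v in zip(result, vec)]
    return result

def triad_reliability(lam, t):
    """Hypothetical triad with no spares and perfect coverage:
    state 0 = 3 working, state 1 = 2 working, state 2 = failed."""
    stm = [[0.0, 3 * lam, 0.0],
           [0.0, 0.0, 2 * lam],
           [0.0, 0.0, 0.0]]
    p = transient_probabilities(stm, [1.0, 0.0, 0.0], t)
    return p[0] + p[1]

if __name__ == "__main__":
    print(triad_reliability(1e-3, 1000.0))
```

For this small chain the result can be compared against the closed form 3e^{-2λt} - 2e^{-3λt}, which follows from solving the three differential equations directly.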

If the state transition probabilities are time-dependent it may be quite difficult to obtain explicit solutions to the continuous-time Chapman-Kolmogorov equations. To obtain the exact probability of reaching a state through a particular path of transitions requires the solution of a multiple integral, where each integral represents the probability of making one of the transitions in the path. Often the integrals are approximated using numerical integration techniques [Stiffler 79]. An alternative method is to approximate the continuous-time process with discrete-time equivalents [Siewiorek 82b]. The major difficulty with the second method is that many transition rates that are effectively zero in the continuous-time process assume small but nonzero probabilities in a discrete-time process.

1.2 Previous Work

There are several programs that use time-varying Markov models to evaluate the reliability and/or availability of systems that use standby redundancy or can be repaired, and are susceptible to hard, transient, and intermittent faults, such as CARE III, ARIES, SURE, SURF, and HARP. All these programs can evaluate both the reliability and availability of a system, except for CARE III which can only evaluate the reliability. Except for CARE III, they all have the state transition matrix as one of the methods for the system specification.

CARE III (Computer-Aided Reliability Estimation), described in [Bavuso 84], can evaluate the reliability of systems that do not use repair but use reconfiguration to tolerate faulty components. It uses the behavioral decomposition/aggregation solution technique described in [Trivedi 81]. This technique assumes that the fault-occurrence behavior is composed of relatively infrequent events, while the fault-handling behavior is composed of relatively frequent events. The fault-handling behavior is separately analyzed using a fixed semi-Markov model that can use exponential and uniform distributions, and is reflected by parameters in the aggregate fault-occurrence model. The aggregate fault-occurrence behavior is analyzed using a non-homogeneous Markov model that can use exponential and Weibull distributions. Numerical integration techniques are used to solve these time-varying Markov models. The fault-occurrence behavior is specified using fault trees, which are automatically converted to the non-homogeneous Markov model. Therefore state transition matrices can not be accepted as input directly. The fault-handling behavior is specified by providing the parameters of the fixed model. CARE III was developed at Raytheon, it is written in FORTRAN 77, and runs on a Cyber or a VAX.

ARIES (Automated Reliability Interactive Estimation System), described in [Makam 82], is restricted to homogeneous Markov models. The system can be specified using the state transition matrix, or as a series of independent subsystems, each containing identical modules that are either active or serve as spares. It uses a matrix transformation solution technique that assumes distinct eigenvalues for the state transition matrix. It was developed at UCLA and runs on a VAX.

Butler [Butler 84] describes a program named SURE (Semi-Markov Unreliability Range Evaluator) which evaluates the unreliability upper and lower bounds of semi-Markov models. It uses new mathematical theorems proven in [White 84] and [Lee 85]. These theorems provide a means of bounding the probability of traversing a specific path in the model within a specified time. By applying the theorems to every path of the model, the probability of the system reaching any death state can be determined within usually very close bounds. These theorems assume that slow (with respect to the mission time) exponential transitions describe the occurrence of faults, and fast general transitions describe the recovery process. Faults can be modeled as permanent, transient, or intermittent. Its only input method is the state transition matrix. SURE was developed at NASA's Langley Research Center, it is written in VAX-11 Pascal, and runs under VAX/VMS.

SURF, described in [Landrault 78], can solve semi-Markov models that use exponential distributions or non-exponential distributions that are related to the exponential (e.g. Gamma, Erlang, etc.). The method of stages [Cox 68] is used to produce a homogeneous Markov model. Matrix transformations are used to obtain time-independent values, such as MTTF and the limiting availability. The Laplace transform is used to obtain time-dependent values, such as availability and reliability. Written in PL/I, it runs on an IBM System/370 at the IBM research facility in Yorktown Heights, New York. SURF was developed in Toulouse, France.

For

HARP

(Hybrid

[Trlvedi

85],

uniform,

Weibull,

Automated

the state or

HARP

can only evaluate

rates.

HARP

has

behavior

converted

to

behavior

extended

be

the availability additional

specified

models.

stochastic

histogram

transition

(e.g. fault

Petri

_redictor),

probabilities

(i.e.

by providing

the CARE

be provided)

is given with

repair

the fault-

are automatically The fault-handling

the transition models

III model,

by the user

constant

of specifying

model.

The fault-handling net,

list must

all of which

Markov

in

exponential,

of systems methods

described

can have

matrix

trees),

non'homogeneous

can also

of one of several

state

several

occurrence

a

transition

general

distributions. _ If the

Reliability

parameters

available

the ARIES

are:

model,

an and


the SURE model.

It uses the same behavioral decomposition/aggregation solution technique as CARE III, but in a hybrid fashion: the various Markov models are solved analytically, time-varying models are solved using numerical integration techniques, and extended stochastic Petri nets are solved by simulation. It is written in FORTRAN 77 and runs on a VAX. It is still under development at Duke University and Clemson University.

An abstract described

specification

by

Butler

langu_ge

[Butler

specify

(a) the state space

range,

(b)

the

start

85]. by

by

(c) the death states

variables,

and (d) the state

the possible

states

all in terms of

in the SURE

input language.

Section

3.

is written

Kini

[Kini

_nteractive generates

Pascal,

81]

modifications

a

Evaluator

reliability

are:

independent,

(b) the PMS system has

components

(a) all

faults

are not repaired

and

input is the interconnection

program

inputs

types, reliability communicate as

input

clusters, used

the

with components the

requirements

in the form

in ADVISER

of the

of

for

modified

for detecting

of

models

Center,

in it

PMS

structures.

and (c) failed state.

the PMS structure

its

expressions.

symmetries

Its Other

by their

and ability

The program and

It_

and stochastically

the PMS structure.

type.

Boolean

(Advanced

automatically

a non-faulty

system,

PMS graph

has been

to generate

ADVISER

port connections,

same the

Research

coverage,

to

graph of

internal

destination

reliability

which

permanent

a perfect

components

functions,

named

for

returned

primary

describe

are

that

to ARM as described

Reliability)

functions

assumptions

rules

VAX/VMS.

program of

of the state

in ASSIST

Langley

and runs under

the state

This language Markov

to

and their

of

and their

used

at NASA's

describes

_ymbolic symbolic

algorithm

with

variables

expression

rates,

was

has statements

values

variables.

The

models

by a set of if-then

to generate

was developed

in VAX-11

Boolean

their

program

be applicable

ASSIST

a

state

language

initial

transitions

the

in the ASSIST

the model will

by

reliability

the state

the

transitions,

implemented

The

defining

state

variables,

define

for Markov

to

also takes

subsystems

or

The methods

and tree structures

will also be applicable, with modifications, to ARM as described in Section 3.

ADVISER was developed at CMU; it is written in BLISS and runs on a PDP-10.

1.3 Motivation

The goal of this research and development effort is to provide the computer architect a powerful and easy to use software tool that will assume the burden of an advanced reliability analysis that considers intermittent, transient, and permanent faults for computer systems of high complexity and sophistication.

The PMS level of computer system description was selected because (a) it is the highest level view of digital systems and therefore well known to computer architects, and (b) it is easiest to specify.

The time-varying Markov model technique of reliability and availability analysis was selected because (a) it is powerful enough to accurately analyze most situations except for concurrent events, and (b) it is in widespread use among reliability analysts and several evaluation programs have been developed.

Previous efforts have been limited in one of two ways. Most provided a computational aid once the preliminary system decomposition and reliability analysis had been manually achieved. Alternatively, computer systems of less complexity and sophistication were considered without transient and intermittent faults.

1.4 Organization

The system description required to generate a reliability or availability Markov model is described in Section 2. The problems involved in the automatic generation of reliability and availability Markov models are discussed in Section 3. Examples of automatically generated Markov reliability models are presented in Section 4. A summary of the research and a plan for its accomplishment are presented in Section 5. The algorithms used by the ARM program are described in Appendix A.

2. System Description

It is important to have a general system description method that will accommodate new fault-tolerant techniques and system designs. This section presents the system description method currently envisioned for the ARM program. The generality of this method needs to be investigated to correct any deficiencies.

When calculating a reliability measure for an arbitrary system of components, four items of information are necessary, namely:

a) The reliability behavior of the system components (Section 2.1).

b) The fault tolerant function of individual components or groups of components in the system (Sections 2.2 and 2.3).

c) The communication paths that components in the system may use, and which are the components that need to exchange information (Sections 2.4 to 2.6).

d) The operational requirements placed on the system and its subsystems (Sections 2.7 and 2.8).

Item (b) is the only one that is not necessary for some systems. The ARM program will use eight input categories to obtain this information for any arbitrary system. For some systems only three ARM input categories are required to convey the minimum information specified in items (a), (c), and (d). These input categories are: the description of the component types (Section 2.1), the interconnection structure (Section 2.4), and the system requirements (Section 2.8).

The following sections will discuss the purpose and necessity of all the input categories. The final section will provide an example of how a multiprocessor system can be specified using these input categories.

2.1 Component Types

The first input category is a list describing the types of components in the PMS structure. Components of the same type are assumed to be identical in function and reliability. The concept of component types is natural and reduces the system specification burden. The other alternative would be to specify the characteristics of each particular component.

Each type declaration will specify the coverage probability and the rates of the various failure, recovery, and repair processes of the components of that type. Each rate will be specified by the parameters of a probability distribution. A rate may be exponential, Weibull, or general (i.e. a histogram must be provided). A rate can also follow more than one distribution. The function that determines the distribution to be used will be in the form of a modified Boolean expression of the system state, as defined in Section 2.8.

The nine classes of information that a type declaration can contain are defined below. A type declaration must contain at least the first two classes. Figure 2-1 illustrates how the first seven classes are used in reliability models.

TYPE - The first class is the name of the component type.

HARD - The second class is the hard failure rate. Hard faults are assumed to be caused by permanently damaged components that continuously produce errors when exercised.

TRANSIENTS - The third class consists of two rates. One is the transient failure rate. Another is the transient duration rate δ, that is the rate at which the transient stops producing errors. It is assumed transients are not caused by or produce any permanent damage to the components.

INTERMITTENTS - The fourth class consists of three rates. One is the intermittent failure rate. Another is the intermittent benign rate, that is the rate at which an intermittent fault becomes benign or stops producing errors. Last is the active rate ε, that is the rate at which an intermittent that had stopped producing errors becomes active and starts producing errors once more. It is assumed intermittents are caused by permanently damaged components.

COVERAGE - The fifth class is the coverage probability C, expressed as the probability that the system can survive a fault in a component of this type and successfully recover. This probability has a great impact on the reliability of a system. Therefore coverage must be very accurately estimated using one or more of the following: analytic methods, simulation, or fault injection experiments. It defaults to 1.

REPAIR - The sixth class is the repair rate, that is the rate at which components of this type are repaired and returned to service. Only if the repair rate is specified can the availability of the system be modeled.

RECOVERY

- The seventh

at which

the system

components

rate

at which

active

can detect,

the active

- The eighth

component

that

The purpose

can

is performing with

of shadows

a

different

memory

triad.

provide all the

redundant

group

memory A

hot

up and activating

rate p, that is the rate

and reconflgure

shadow

shadow

is to increase

at which a

recovery

isolate,

the exception

this is the rate

that a

the

from faults

(a hot or powered

in

up spare

component).

class is the

the system

components

powering

is

of this type by using

that is imitating

SHADOW

class

shadow or

a

activation

rate o, that is the

shadow.

functions

A shadow

of a redundant

that its output the recovery module can

powered

up

rate.

a cold or unpowered

provided spare

group

is not being

used. of

to shadow

by changing

is imitating,

spare.

of

An example

can be reloaded be

is a spare

a

the

or by

Key:
State  Description
1      no faults
2      hard fault
3      transient fault
4      active intermittent fault
5      benign intermittent fault
6      correct fault detection, isolation, and reconfiguration
7      incorrect fault detection, isolation, and reconfiguration

Figure 2-1: Use of Component Type Information in Reliability Models

DEGRADATION - The ninth class is the degradation rate θ. That is the rate at which the system can gracefully degrade by eliminating one redundant group of components which are all of this type. A group is a set of components performing the same operations such that the correct output can be selected using diagnostics or majority vote. Degradation is necessary when a component of a group fails, there are no spares to replace it, and the number of these groups is above the minimum requirements for the system. This is done because a group with a failed component has a greater probability of failure, and if a group fails (fewer components than it needs to meet its minimum requirements) and there is no watchdog timer (defined in Section 2.3), the system fails.

2.2 Redundant Groups

The second input category is a list that specifies any redundant group in the system. A group is a set of components performing the same operations such that the correct output can be selected using diagnostics or majority vote. Each group declaration will contain the maximum number of components in the group, the type of components in the group, the group name, the requirements, and, if using adaptive voting or adaptive hybrid redundancy, the name of the adapted group and the adaptive rate. The adapted group is the group with the adjusted voting threshold. The adaptive rate corresponds to the time involved in changing the voter threshold.

Currently the redundancy technique used for a component is specified by three things. One is whether it is part of a redundant group or not. The other two are whether its recovery and adaptive rates are zero or not. Table 2-1 shows how each redundancy technique is specified. This method of redundancy technique specification must be extended so systems using new redundancy techniques can be described.

The semantics for this input category are the following. When a component of a redundant group fails, there are no spares to replace it, and the number of these groups is above the minimum requirements for the system, then the system gracefully degrades by eliminating the group. If the number of these groups is not above the minimum requirements for the system and the group uses adaptive hybrid redundancy, then the system reconfigures the faulty component out of the voting process. If there are shadows, then they are assumed to be evenly distributed among the groups.

If a group has to be able to transmit to another group, then each component of the transmitting group has to be able to transmit to all the components of the receiving group. The reason for the latter is so each component of the receiving group can do an independent majority vote on the information from the transmitting group.

TECHNIQUE             REDUNDANT GROUP   RECOVERY RATE   ADAPTIVE RATE
STATIC REDUNDANCY     yes               zero            zero
DYNAMIC REDUNDANCY    no                nonzero         zero
HYBRID REDUNDANCY     yes               nonzero         zero
ADAPTIVE VOTING       yes               zero            nonzero
ADAPTIVE HYBRID       yes               nonzero         nonzero

Table 2-1: Redundancy Technique Specification
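The lookup described by Table 2-1 can be sketched in a few lines of Python. This is only an illustration of the table, not part of the ARM program; the function name and argument names are assumptions introduced here.

```python
def redundancy_technique(in_group: bool, recovery_rate: float, adaptive_rate: float) -> str:
    """Classify a component's redundancy technique following Table 2-1.

    Hypothetical helper: the three arguments mirror the three table columns
    (group membership, recovery rate, adaptive rate).
    """
    table = {
        # (in redundant group, recovery rate nonzero, adaptive rate nonzero)
        (True, False, False): "STATIC REDUNDANCY",
        (False, True, False): "DYNAMIC REDUNDANCY",
        (True, True, False): "HYBRID REDUNDANCY",
        (True, False, True): "ADAPTIVE VOTING",
        (True, True, True): "ADAPTIVE HYBRID",
    }
    key = (in_group, recovery_rate != 0, adaptive_rate != 0)
    if key not in table:
        # Combinations not listed in Table 2-1 are rejected rather than guessed.
        raise ValueError(f"combination {key} is not covered by Table 2-1")
    return table[key]
```

Combinations that the table does not list raise an error, which matches the remark that the specification method must be extended before new techniques can be described.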

2.3 System Watchdog Timers

The third input category is a list that specifies which (if any) component type or group of components acts as a watchdog timer for the system, and the rate at which the watchdog can restart the system. A watchdog is assumed to have a timer that must be reset before it runs out or the watchdog will restart the system. A watchdog decreases the probability that multiple faults in a redundant group of components will cause system failure.

Key:
State  Description
1      3 working
2      2 working
3      system crashed
4      1 working
5      2 working, uses 1
6      system crashed
7      system failed
8      watchdog failed
9      2 working, no watchdog
10     system failed
11     2 working, uses 1, no watchdog
12     system failed

Figure 2-2: Reliability Graph of a Triad with a Watchdog

The semantics for this input category are the following. If there is no watchdog, and any group fails, then the system fails. If the watchdog fails, and any group fails, then the system fails. In other words, there has to be a watchdog for the system to survive a group failure.

Adding a watchdog timer modifies the system model by preventing some states from being failure states and by creating new states. For example, if a watchdog with a given failure rate and system restart rate is added to a triad, the system model changes from the one in Figure 1-1 to the one in Figure 2-2. In this new model it is assumed that the system has a perfect coverage of 1 and the failure of the watchdog will not cause system failure. The watchdog prevents a system crash caused by the failure of two processing elements from causing system failure by restarting the system as a simplex without spares.

2.4 PMS Structure

The fourth input category is an interconnection list of the PMS structure. It is assumed that critical components which are required for the system to be operational must be able to communicate. The main purpose of the interconnection list is to analyze which component failures will prevent communication between critical components and therefore cause system failure.

The interconnection list can also be used to detect which substructures in the PMS graph are symmetrical in their component types and neighboring components. Symmetrical substructures are assumed to be identical in function and reliability. Therefore the reliability models of symmetrical substructures are identical and only have to be generated once. These models can then be duplicated and merged to obtain the reliability model of the system.

Each component will have an interconnection declaration that specifies its type and neighboring components. Since the PMS graph is non-directed it is possible to completely specify an arc by its occurrence in one interconnection declaration. However, it will be noted that each arc must occur on two interconnection declarations. The purpose of this redundancy is twofold. Firstly, inconsistencies can be detected, thus making the system specification less likely to contain errors. Secondly, a reader of a system specification can more easily comprehend the structure if the connection is made quite explicit with two-way links.
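The consistency check made possible by declaring every arc twice can be sketched as follows. This is a minimal illustration in Python, assuming the interconnection list has already been read into a dictionary that maps each component name to the neighbors it declares; the function name and the sample component names are hypothetical.

```python
def one_way_arcs(declarations):
    """Return arcs (a, b) that a declares but b does not declare back.

    `declarations` maps a component name to the collection of neighbors
    listed in its interconnection declaration (assumed input format).
    """
    missing = []
    for component, neighbors in declarations.items():
        for neighbor in neighbors:
            # A neighbor with no declaration at all is treated as declaring nothing.
            if component not in declarations.get(neighbor, ()):
                missing.append((component, neighbor))
    return sorted(missing)
```

An empty result means every arc appears in both of its declarations; any pair returned points at a declaration that should be reviewed.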

Although not within the scope of the current work, the system specification could be further eased by a graphics based user friendly interface. The interconnection list would then be provided by an input interface that would accept a graphic description of the PMS structure. Since this is not part of the current research and ARM could easily accept its input from an interface program, this interface could be generated independently by support personnel using tools such as the Future Net program for the IBM PC. Such a graphics interface already exists for the PERQ personal work stations at CMU.

2.5 Intracomponent Port Connections

The fifth input category is a list specifying the internal port connectivity of some components and/or component types. The purpose of the internal port connectivity is to analyze which component failures will prevent communication between critical components and therefore cause system failure. This information is needed to prevent the reliability modeling program from assuming incorrect communication paths through intermediate components to other components. Not taking this behavior into account would lead to an optimistic evaluation of the system reliability.

This input category is needed because it is impossible for a reliability modeling program to have this knowledge for all the component types that will be designed. Even if the PMS graph were modified to be a directed graph, the program would still need to know if information passed from A to B can be passed from B to C.

If not specified, the default is for every port of a component to be connected bidirectionally to all other ports of the component. If the internal port connectivity is specified, then for that component or component type all port connections and their direction must be made explicit. Each connection declaration contains the following parameters:

VERTEX
The specific components or component type whose port connections are being specified.

TRANSMITTER
A transmitter port of the VERTEX. It is specified by the component or component type connected to it.

RECEIVER
A port that receives from the previous transmitter port. It is specified by the component or component type connected to it.
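The default rule and the explicit declarations can be combined in a small lookup, sketched here in Python. The data layout is an assumption made for illustration: each VERTEX with explicit declarations maps to a set of (TRANSMITTER, RECEIVER) pairs, and the sketch interprets a pair as "information arriving from the component on the transmitter port may leave toward the component on the receiver port". The function name and the sample entries are hypothetical.

```python
def can_forward(port_connections, vertex, source, destination):
    """Decide whether `vertex` can pass information from `source` to `destination`.

    `port_connections` maps a vertex (component or component type) to a set of
    (transmitter, receiver) pairs. Vertices without an entry use the default:
    every port is connected bidirectionally to all other ports.
    """
    if vertex not in port_connections:
        return True  # default: full bidirectional internal connectivity
    # Once a vertex is declared, only the explicitly listed directions exist.
    return (source, destination) in port_connections[vertex]
```

With such a lookup a path search over the PMS graph can refuse to route through an intermediate component whose internal ports do not connect the incoming and outgoing links, which is the optimistic error the text warns about.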


2.6 Intra Component-Type Communication

The majority of components are passive and do not need to communicate with components of like type. Examples of passive components are memories, buses, and input/output transducers. Active or self-talking components need to exchange information amongst each other. Examples of active components are processors, direct-memory-access device controllers, and other "smart" controllers. If not specified, the default is for components to be passive and not communicate with their own type.

The sixth input category is a list specifying the component types for which communication between components of like type is necessary. The purpose of the intra component type communication list is to analyze which component failures will prevent communication between critical components that need to exchange information and therefore cause system failure. This information is needed to prevent the reliability modeling program from requiring communication paths between components of the same type that never exchange information. Not taking this behavior into account would lead to a pessimistic evaluation of the system reliability.

2.7 Component Clustering

The seventh input category is a list specifying which (if any) components form clusters, that is subsystems with their own separate requirements. If the cluster requirements are not met all the cluster components fail, but the system may continue to operate depending on the system requirements. The purpose of clusters is to represent the dependencies that sometimes exist between components. Each cluster declaration will contain the name of the cluster, its components, and its requirements in the form of a modified Boolean expression as defined in Section 2.8.

2.8 System Requirements

The eighth input category is a succinct statement of the minimum set of critical component types and/or component groups which are required for the system to be operational. Together they constitute a minimum critical resource set (MCRS). The set is minimum in the sense that the system may function only if the components of an MCRS are functional (depending on the status of other components in the structure). In other words, the success of an MCRS is a necessary, though not sufficient, condition for system success.

The MCRS will be defined using a modified Boolean expression. The simple grammar of requirements is shown in the traditional Backus-Naur form in Figure 2-3.

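<expression> ::= <term> | <expression> OR <term>

<term> ::= <factor> | <term> AND <factor> | (<expression>)

<factor> ::= <number> OF <component type> | <number> OF <component group>

Figure 2-3: Grammar of Requirements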
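A requirement expression of this form can be evaluated with a short recursive-descent parser, sketched below in Python. Several points are assumptions made for illustration rather than statements about ARM: the function names, the choice to give AND higher precedence than OR, and the reading of "k OF X" as "at least k units of X are working". The `working` dictionary mapping names to counts is likewise a hypothetical input format.

```python
import re

# One token: a parenthesis, a number, or a name (keywords OF/AND/OR are names).
TOKEN = re.compile(r"\s*(\(|\)|\d+|[A-Za-z][\w.]*)")

def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def evaluate(text, working):
    """Evaluate a requirement such as '(1 OF PTriad OR 1 OF PSimplex) AND 1 OF MTriad'."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        pos += 1
        return token

    def expression():
        value = term()
        while peek() == "OR":
            take("OR")
            value = term() or value   # parse the operand first, then combine
        return value

    def term():
        value = factor()
        while peek() == "AND":
            take("AND")
            value = factor() and value
        return value

    def factor():
        if peek() == "(":
            take("(")
            value = expression()
            take(")")
            return value
        count = int(take())
        take("OF")
        name = take()
        # Assumed meaning: satisfied when at least `count` units are working.
        return working.get(name, 0) >= count

    result = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

For example, the system requirement of the multiprocessor in Section 2.9 can be passed as a string together with the number of currently working groups of each name.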


2.9 Example

In this example a multiprocessor system is described using ARM's tabular format in Table 2-2. Failure rates are assumed to be specified in failures per million hours; all other rates are assumed to be on a per hour basis. All rates default to zero and are assumed to follow a single exponential distribution unless otherwise indicated. For the exponential distribution only its constant rate is given. The Weibull distribution is specified with a 'W' followed by the scale and shape parameters. A general distribution is specified with a 'G' followed by the name of the file containing the necessary histogram. A multiple distribution rate is specified with an 'M' followed by the name of the file containing the necessary discriminating function and distribution specifications.

The first component type described in Table 2-2 is a processor P with the following characteristics:

hard failure rate:          200 failures per million hours
transient failure rate:     α = 10000 failures per million hours
transient benign rate:      δ = 3600 per hour
intermittent failure rate:  10000 failures per million hours
intermittent benign rate:   3600 per hour
intermittent active rate:   ε = 360 per hour
coverage probability:       C = 1
repair rate:                Weibull distribution of scale=1 and shape=1.1
recovery rate:              ρ = multiple rates defined in the file RECP
shadow rate:                σ = general distribution defined in the file SHADP
degradation rate:           θ = general distribution defined in the file DEGP
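The unit conventions and defaults stated for this example (failure rates per million hours, other rates per hour, rates defaulting to zero, coverage defaulting to 1) can be captured in a small record type. The Python sketch below is only an illustration; the class name, field names, and helper methods are assumptions, and the non-exponential rates (repair, recovery, shadow, degradation) are left out for brevity.

```python
from dataclasses import dataclass

HOURS_PER_MILLION = 1_000_000  # failure rates are quoted per million hours

@dataclass(frozen=True)
class ComponentType:
    """Hypothetical record for the exponential part of a type declaration."""
    name: str
    hard: float                            # failures per million hours
    transient: tuple = (0.0, 0.0)          # (failures per 10^6 h, benign rate per h)
    intermittent: tuple = (0.0, 0.0, 0.0)  # (failures per 10^6 h, benign per h, active per h)
    coverage: float = 1.0                  # coverage defaults to 1

    def hard_per_hour(self) -> float:
        """Hard failure rate converted to failures per hour."""
        return self.hard / HOURS_PER_MILLION

    def mean_transient_duration_seconds(self) -> float:
        """Mean time a transient keeps producing errors, for an exponential rate."""
        benign_rate_per_hour = self.transient[1]
        return 3600.0 / benign_rate_per_hour

# Processor P, using the values listed in the text above.
P = ComponentType("P", hard=200, transient=(10000, 3600), intermittent=(10000, 3600, 360))
```

With these values the hard failure rate of P is 200 / 1,000,000 = 0.0002 failures per hour, and a transient benign rate of 3600 per hour corresponds to a mean transient duration of one second.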

The PMS diagram of the multiprocessor is shown in Figure 2-4. The multiprocessor has 10 LRU (Line Replaceable Unit) clusters, LRU.1 to LRU.10. LRU.i has a processor P.i, a memory M.i, and a watchdog timer

Component Types (Section 2.1):

TYPE  HARD  TRANSIENT      INTERMITTENT     COVERAGE  REPAIR   RECOVERY  SHADOW    DEGRADATION
P     200   (10000, 3600)  (20, 3600, 360)  1         W 1 1.1  M RECP    G SHADP   G DEGP
M     210   (10500, 3600)  (21, 3600, 360)  1         W 1 1.1  M RECM    G SHADM   G DEGM
WT    50    (2500, 3600)   (5, 3600, 360)   1         W 1 1.1  M RECW    G SHADW
B     10    (500, 3600)    (1, 3600, 360)   1         W 1 1.1  M RECB    G SHADB
WB    10    (500, 3600)    (1, 3600, 360)   1         W 1 1.1  M RECWB   G SHADWB

Redundant Groups (Section 2.2): SIZE GROUPNAME REQUIREMENTS

TYPE

ADOPTS

ADAPTATION

3 I 2 I I I I I

P P M M WT WT B WB

PSimplex

G ADAPTP

MSimplex

G ADAPTM

WSimplex

G ADAPTW

System

PTriad PSimplex MTriad MSimplex WTriad WSimplex BTriad WBTriad Watchdog

PMS Structure COM PONEN T

2 I 2 I 2 I 2 2

Timers

OF OF OF OF OF OF OF OF

3 I 3 I 3 I 3 3

(Section

2.3): WTriad

(Section 2.4): TY PE

P.I-I0 M. I-I0 WT.I-IO B.I-5 WB.I-5

NEIGHBORLCOMPONENTS

P M WT B WB

Intracomponent VERTEX

B.I-5, WB. I-5 B.I-5, WB. I-5 B.I-5, WB. I-5 P.I-I0, M.I-I0, P.I-I0, M.I-I0,

Port Connections TRANSMITER

B B B WB WB

(Section 2.5): RECEI VER

P P M WT WT

Intra Component-Type

M WT P P M Communicators:

(Section 2.6):

Component Clusters CLUSTERNAME

(Section 2.7): COMPONENTS

REQUIREMENTS

LRU.I-IO

P.i, M.i,

I OF M.i

WT.i

Syste m Requirements (Section 2.8): (I OF PTriad OR I OF PSimplex) AND

Table

WT.I-5 WT.I-5

2-2: Multiprocessor

(I OF MTriad

P

OR I OF MSimplex)

System Description

Example

WT.i, and for any of its components to be available M.i must be working properly. Components of the same type are grouped into 2 out of 3 triad subsystems.

The system uses adaptive hybrid redundancy without spares so that if a component other than a bus fails and there is only one triad of that component type, then it reconfigures the two remaining components into a simplex (a single component emulating a triad) with a spare. The system must have a minimum of 1 processor triad or simplex, and 1 memory triad or simplex, to be operational.

Processor and memory triads transmit on a bus triad formed out of 5 buses, B.1 to B.5. A processor triad can transmit to any kind of triad including another processor triad. A memory triad can only transmit to processor triads. The watchdog triad transmits on another bus triad formed out of 5 buses, WB.1 to WB.5. The watchdog triad can only transmit to processor and memory triads.

[Figure: clusters LRU.1 through LRU.10 connected to buses B.1 through B.5 and WB.1 through WB.5]

Figure 2-4: PMS Diagram of Multiprocessor Described in Table 2-2

3. Automated Reliability Modeling Considerations

The ARM program will attempt to efficiently generate the system reliability model based on the interconnection structure and the operational requirements. The divide-and-conquer methodology was selected to increase the computational efficiency and reduce the program development complexity. The steps the ARM program is going to follow in generating reliability models are shown in Table 3-1.

1) Interface with user and obtain system description.
2) Detect symmetries in the PMS graph.
3) Segment the PMS graph.
4) Identify the PMS system success and failure states based on the operational requirements.
5) Generate the models for the PMS graph segments.
6) Merge the models for the PMS graph segments.
7) Reduce the state space of the resulting model.
8) Format and output the state transition matrix of the model.

Table 3-1: Automated Reliability Modeling Steps

Steps 2 and 3 of Table 3-1 have been implemented using algorithms derived from those presented in Kini's dissertation [Kini 81] because they are mature, well documented, and simple. The major research effort will be the identification, analysis, and solution of the fundamental problems in each of steps 1, and 4 through 7 of Table 3-1.

The research will also include the development of efficient algorithms, and methods to theoretically and experimentally validate the algorithms. The feasibility of the algorithms developed depends on their efficiency due to the large number of states and transitions involved in any reasonable structure. The validity of these algorithms is particularly important for life critical applications where a probability of failure on the order of 10^-9 is required.

The following sections will discuss the purpose and necessity of steps 2 through 7. Progress already made in identifying and analyzing the problems involved, and in developing and implementing algorithms to solve them, is also presented.

3.1 Deteotion of Symmetry in the PMS Graph Substructures in the PMS graph G will be considered symmetric if they are isomorphic and the corresponding identical component type assumed to be identical reliability models of purpose of detecting

in

models

symmetrical

substructures willbe

then

be

and

reliability.

substructures

symmetrical

will

Symmetrical

function

substructures

duplication of effort by generating These

of the two graphs have

labels.

vertices

are

Therefore the

is

identical.

The

to avoid needless

their reliability model only once.

duplicated

and

merged

to

obtain

the

reliability model of the system.

The symmetry detection algorithm based on the component type _

the graph. it has.

The degree of

a

ks

labels

shown

and

in

Appendix A.I.

It is

the degree of the vertices in

vertex is the number of neighbor vertices

Two vertices are neighbors if they are interconnected.

The algorithm requires three steps

to

partition the vertex set_of a

labelled graph into equivalence classes whose vertices are symmetrical. In the first step the partition is based on the component type label of each vertex.

For the second step

of each vertex.

The

third

step

the partition is based on the degree attempts

to partition based on the

number of neighbors each vertex has in each equivalence class.

The last step must be repeated until there are no more changes in the equivalence classes. The reason for this is that each partition changes the number of neighbors in each equivalence class, and therefore other partitions may become necessary. In the worst case this repetition will stop when each class has a single element.

Each class is related to the other classes in a connectivity sense because the vertices in each class are symmetrically connected to the vertices in other equivalence classes. These equivalence classes and their connectivity relationships may be viewed as defining another graph G'.

The vertices of G' correspond uniquely to the equivalence classes in G. Unlike the basic non-directed graph without self-loops, which was taken to be the model for G, G' may have vertices which have self-loops. This would be the result of a case in which vertices in the same equivalence class are connected to each other in some symmetric fashion, thus making the equivalence class its own neighbor. Also, the number of links or connection density between two vertices of G' can be greater than one. This would be the result of a case in which multiple vertices in the same equivalence class are connected to one or more vertices in another equivalence class.
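The three partitioning steps can be illustrated with a small sketch. It is not the Appendix A.1 algorithm itself: it is a generic partition refinement written in Python, and the names `symmetry_classes`, `labels` and `edges` are assumptions made for the illustration. Note that a refinement of this kind yields candidate classes; equal neighbor counts are necessary for symmetry but are not a proof of isomorphism.

```python
from collections import Counter

def symmetry_classes(labels, edges):
    """Partition vertices into candidate symmetry classes.

    labels: dict vertex -> component type label (hypothetical input format)
    edges:  iterable of (u, v) pairs of an undirected graph without self-loops
    """
    neighbors = {v: set() for v in labels}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Steps 1 and 2: split by component type label, then by degree.
    key = {v: (labels[v], len(neighbors[v])) for v in labels}
    while True:
        # Number the current classes deterministically.
        class_of = {k: i for i, k in enumerate(sorted(set(key.values()), key=repr))}
        cls = {v: class_of[key[v]] for v in labels}
        # Step 3: count the neighbors each vertex has in each class.
        new_key = {
            v: (cls[v], tuple(sorted(Counter(cls[n] for n in neighbors[v]).items())))
            for v in labels
        }
        if len(set(new_key.values())) == len(class_of):
            break  # no class was split, so the partition is stable
        key = new_key

    classes = {}
    for v in labels:
        classes.setdefault(cls[v], []).append(v)
    return sorted(sorted(members) for members in classes.values())

# A local switch connected to one processor and three memories.
module_labels = {"S": "S", "P": "P", "M1": "M", "M2": "M", "M3": "M"}
module_edges = [("S", "P"), ("S", "M1"), ("S", "M2"), ("S", "M3")]
module_classes = symmetry_classes(module_labels, module_edges)
```

For this module the three memories end up in one class, while the processor and the local switch each form a class of their own.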

3.2 Segmentation of the PMS Graph

The purpose of segmenting the PMS graph is to follow the divide-and-conquer methodology. The segmenting proceeds by searching for what are termed Pendant Tree Subgraphs (PTS). These are maximal trees, that is they are not part of another tree. In these tree subgraphs the simple path between any pair of vertices is the only path between those vertices in the overall graph, in other words there are no cycles. It is common to find PTS's in most PMS structures. In particular input/output subsystems typically assume this character.

If the PMS interconnection graph G is not a PTS and all its PTS's, excluding their roots, are removed then the remaining vertices and arcs form a subgraph of G that is not tree-connected. This will be referred to as the Kernel. The root of each PTS has dual status as member of the PTS as well as the Kernel.

The PTS's along with the Kernel form a natural set of segments on the basis of which the reliability computation task may be divided.

The segmentation algorithm is shown in Appendix A.2. It discovers all the PTS's in a given PMS structure by collecting those leaf vertices of G' which represent leaf vertices of G. These are termed the "germinal trees" (step 1). These trees are then "grown" upward towards their roots by adding neighboring vertices (step 2), and merging trees that overlap at their roots (step 3). Steps 2 and 3 continue until no more adding or merging is possible. At this point a set of trees on G' has been generated. Depending on the number of vertices of G represented by the root, each of these trees in G' may represent one PTS of G or a set of PTS's. In the latter instance all PTS's in the set will be symmetric.

There are three "stopping conditions" under which a tree is not capable of further growth, due to the fact that cycles would be formed and it would no longer be a tree. They depend on whether the root of the tree still has a single neighbor which is not already in that tree, on the connection density between the root and that neighbor being greater than one, and on whether the root is a neighbor of another tree that meets one of the previous conditions.

3.3 Identification of Success and Failure States

Depending on whether or not the system is operational, the states of the reliability model are termed success or failure states. The identification of success and failure states is essential during the generation of the reliability model because the two kinds of states have a very different form. Success states must be able to reach a failure state through some sequence of transitions, and therefore must have transitions to other states. Failure states can not have any transitions to other states because they are trapping states.

The identification of failure states is also essential to prevent the generation of unnecessary states. Some failure states are not needed because the only way the system can arrive at them is by being in another failure state. For example, consider a system that requires 2 out of 3 processors to be operational. The failure state where all three processors have failed does not have to be generated. The reason for this is that the only way the system can arrive at that state is by being in another failure state, where two processors have failed. Failure states are also assumed to include those states in which the system can not satisfy the requirements because some components are not operational due to the failure of other components in their communication paths.

An algorithm to identify success and failure states has already been developed and implemented. It is shown in Appendix A.3. It traverses the parse tree of the requirements expression searching for some way the system can satisfy the requirements. The Boolean expression of requirements is assumed to have been transformed into sum-of-products form so that it does not contain any parentheses.

The parse tree of a sum-of-products expression only has three levels. The bottom level represents atomic requirements, such as "2 of 3 processors". The intermediate level represents pure conjunctive requirements, that is an AND expression of atomic requirements. The top level represents the system requirements, that is an OR expression of pure conjunctive requirements.
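The three levels map naturally onto nested any/all tests. The sketch below is an illustration in Python, not the Appendix A.3 algorithm; the function name `meets_requirements` and the representation of a state as a count of operational components per type are assumptions. The sample requirement is a system that can operate with one processor, one disk and one memory, or with one processor and two memories.

```python
def meets_requirements(requirements, working):
    """Return True if the counts in `working` satisfy the requirements.

    requirements: list of conjunctions; each conjunction is a list of
                  (count, component_type) atomic requirements
    working:      dict component_type -> number of operational components
    """
    def atomic(count, ctype):
        # Bottom level: "count of ctype" must be operational.
        return working.get(ctype, 0) >= count

    def conjunction(atoms):
        # Intermediate level: AND of atomic requirements.
        return all(atomic(count, ctype) for count, ctype in atoms)

    # Top level: OR of pure conjunctive requirements.
    return any(conjunction(atoms) for atoms in requirements)

# (1 of P AND 1 of D AND 1 of M) OR (1 of P AND 2 of M)
REQUIREMENTS = [[(1, "P"), (1, "D"), (1, "M")], [(1, "P"), (2, "M")]]
```

A state with one processor and two memories is a success state even without a disk, because the second conjunction is met; with one processor and only one memory, and no disk, neither conjunction is met.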

For example, consider a system that can operate with either one processor, one disk, and one memory, or with one processor and two memories. For readability the symbol _(N,X) will represent the atomic requirement "N of X". The sum-of-products expression of the system requirements is

    _(1,P) AND _(1,D) AND _(1,M) OR _(1,P) AND _(2,M)        (3.1)

The parse tree of such an expression is shown in Figure 3-1.

                          OR
                     /          \
                AND                AND
             /   |   \            /    \
       _(1,P) _(1,D) _(1,M)   _(1,P) _(2,M)

Figure 3-1: Parse tree of requirement expression (3.1)

The algorithm is a Boolean function that takes a state as an argument and returns true if it is a success state. The algorithm works in three levels that correspond to the levels of the parse tree. At the first level it will return true if any conjunctive requirement is met, in which case the system requirements are met. At the second level it will return true if all atomic requirements in a conjunction are met. The third level determines if an atomic requirement is met.

3.4 Generation of Models for PMS Graph Segments

For the purpose of following the divide-and-conquer methodology, the reliability models of different segments of the equivalence class graph G' will be separately generated and then merged to produce the system reliability model. The generation of the states and transitions in the model follows the algorithms presented in [Butler 85].

An algorithm for the generation of models of what are termed minimal subtrees of PTS's has already been developed and implemented. It is shown in Appendix A.4. In a minimal subtree none of the subtrees that are below the root can meet the system requirements by itself, and the system will fail when the root fails because the other nodes in that minimal subtree become isolated. This algorithm is limited to hard faults in non-redundant and non-repairable minimal subtrees. It must be extended to generate models for minimal subtrees that are repairable and are susceptible to transient and intermittent faults. The minimal subtree model generation algorithm follows the steps shown in Table 3-2.

Two more algorithms must be developed. The first algorithm will generate a model of a PTS by generating the models of the minimal subtrees in that PTS and merging them to produce the PTS model. The second algorithm will generate a model for the kernel and merge it with those of the PTS's to produce the system reliability model.

1) Initialize the set of new states, the New Set, to the start state.
2) While the New Set is not empty, get a state out of the New Set until a success state is found.
3) For every equivalence class node in the minimal subtree, if more components of this class can fail then generate the transitions out of the success state.
4) For every transition generated:
   a) If the destination state is new then add it to the New Set.
   b) Add the transition to the model by obtaining the two factors whose product is the transition's rate: the number of working components in the class whose failure is described by the transition, and their failure rate.

Table 3-2: Minimal Subtree Modeling Steps
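The steps of Table 3-2 can be sketched as a small worklist loop. This is an illustration only, not the Appendix A.4 algorithm: the function `generate_model`, the state encoding (a tuple of failed counts per equivalence class) and the caller-supplied success test are assumptions, and the sketch does not collapse failure states or treat isolated components, so its state numbering will not match the models ARM produces. The failure rates are taken from Table 4-1.

```python
def generate_model(classes, is_success):
    """Generate states and transitions for hard, non-repairable faults.

    classes:    dict class_name -> (number_of_components, failure_rate)
    is_success: function mapping a state (tuple of failed counts, in the
                order of `classes`) to True for success states
    Returns (states, transitions); transitions maps (src, dst) -> rate.
    """
    names = list(classes)
    start = tuple(0 for _ in names)
    states = {start}
    transitions = {}
    new_set = [start]                      # step 1: New Set holds the start state
    while new_set:                         # step 2: take states until none remain
        state = new_set.pop()
        if not is_success(state):
            continue                       # failure states are trapping states
        for i, name in enumerate(names):   # step 3: one class at a time
            total, rate = classes[name]
            working = total - state[i]
            if working == 0:
                continue                   # no more components of this class can fail
            dest = state[:i] + (state[i] + 1,) + state[i + 1:]
            if dest not in states:         # step 4a: remember new destinations
                states.add(dest)
                new_set.append(dest)
            # step 4b: rate = working components times their failure rate
            transitions[(state, dest)] = working * rate
    return states, transitions

# One local switch, one processor and three memories; the module needs the
# switch, the processor and at least one memory (an assumed success test).
CM_MODULE = {"S": (1, 24.059e-6), "P": (1, 29.893e-6), "M": (3, 46.278e-6)}

def module_ok(state):
    failed_s, failed_p, failed_m = state
    return failed_s == 0 and failed_p == 0 and failed_m <= 2

states, transitions = generate_model(CM_MODULE, module_ok)
```

Each transition rate is the product named in step 4b, so the first memory failure leaves the start state at three times the memory failure rate.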

3.5 Merging of Models for PMS Graph Segments

For the purpose of following the divide-and-conquer methodology, the models of the segments of the equivalence class graph G' must be merged to generate the PTS models and also the system reliability model. When two models with N and M states are merged the resulting model has at most NM states. All the original states, and their incoming transitions, appear in the resulting model along with new ones. Table 3-3 shows the steps to merge two models. Algorithms that follow these steps must be developed and implemented.

1) Retain only one of the two identical start states.
2) Retain all the other original states, which amount to N + M - 2 states.
3) Produce at most NM - N - M + 1 new states by combining each original state in one model with all the original states in the other model, except for the start states.

Table 3-3: Two Model Merging Steps
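The state counts in the three merging steps add up to the NM bound, which a short sketch can confirm. The pair representation of merged states and the name `merge_states` are assumptions for the illustration; the sketch enumerates the upper bound and does not decide which combined states are actually reachable.

```python
from itertools import product

def merge_states(states_a, start_a, states_b, start_b):
    """Return the states of the merged model as (a, b) pairs."""
    # Step 1: keep a single start state.
    merged = [(start_a, start_b)]
    # Step 2: keep every other original state (N + M - 2 of them).
    merged += [(a, start_b) for a in states_a if a != start_a]
    merged += [(start_a, b) for b in states_b if b != start_b]
    # Step 3: combine non-start states of both models (NM - N - M + 1 of them).
    merged += [(a, b) for a, b in product(states_a, states_b)
               if a != start_a and b != start_b]
    return merged

model_a = ["a0", "a1", "a2"]          # N = 3 states, start state "a0"
model_b = ["b0", "b1", "b2", "b3"]    # M = 4 states, start state "b0"
merged = merge_states(model_a, "a0", model_b, "b0")
```

With N = 3 and M = 4 the steps contribute 1, 5 and 6 states, for a total of 12 = NM.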

3.6 Reduction of the State Space

The use of time-varying Markov models to analyze the reliability of complex computer systems runs into three problems. First, the models become intractable to the human, but this can be alleviated by the use of computer aided modeling and evaluation tools such as the ARM program. Second, the computational cost of evaluating the model may become prohibitive. Third, the evaluation may become impossible, due to the memory limitations of the computer, when the state space becomes extremely large. For the purpose of alleviating the last two problems the user of the ARM program will have the option of applying a state space reduction technique to the system reliability model.

The number of states can be reduced by merging them into subsets and computing the equivalent transition rates between the subsets [Singh 72]. The equivalent transition rate between two subsets A and B is given by equation (3.2):

$$\lambda_{AB} = \sum_{j \in B} \lambda_{ij} \quad \text{for any } i \in A \qquad (3.2)$$

where $\lambda_{ij}$ is the transition rate from state i in subset A to state j in subset B. State merging must follow two conditions; in particular, the equivalent transition rate given by equation (3.2) must be the same for any state i in subset A.

The number of states can also be reduced by deleting states with relatively low probability. Two techniques that can be used for this purpose are called state space truncation [Singh 72] and sequential truncation [Singh 75].

State space truncation must follow two conditions, which require that the states with the smallest probability be deleted and the states with the biggest probability be retained. In systems of N identical components these two conditions are not hard to achieve. The state transition diagram may be divided into N + 1 subsets, each subset consisting of the states having a certain level of coincident failures. At first an arbitrary level of truncation should be selected. The computation can then be repeated by including the next subset, with one more level of coincident failures. If the new values are not significantly different from the previous ones, the process of truncation can be stopped, otherwise the next subset of states should be included and the computations repeated. After each truncation the remaining state space should be examined to see if the process has generated any new absorbing states. This technique should be extensible to systems with different types of components that are susceptible to transient and intermittent faults.

In sequential truncation the state probabilities are calculated every time a new state is generated, and states with probabilities less than a reference value are deleted. This method consumes more computation time than state space truncation but does not have to be repeated to ensure the accuracy of the approximation.

The state space truncation technique can be extended to produce a conservative estimate. The states reached with a certain level of coincident failures can be made failure states. This eliminates the outgoing transitions of the new failure states, and truncates those states that could only be reached through the new failure states. This will produce a conservative estimate because the truncated states will also be analyzed as though they were failure states.
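A small numerical sketch may help show why the extended truncation is conservative. It assumes N identical, independent, non-repairable components with exponentially distributed lifetimes, so the probability of having reached a given level of coincident failures has a closed binomial form; the helper names and the example of six memories of which only one is required are assumptions made for the illustration.

```python
from math import comb, exp

def failed_at_least(k, n, rate, hours):
    """Probability that at least k of n identical, independent,
    non-repairable components have failed by time `hours`."""
    q = 1.0 - exp(-rate * hours)   # probability that one component has failed
    return sum(comb(n, j) * q**j * (1 - q)**(n - j) for j in range(k, n + 1))

def truncated_estimate(level, n, rate, hours):
    """Treat every state with `level` coincident failures as a failure state."""
    return failed_at_least(level, n, rate, hours)

MEMORY_RATE = 46.278e-6   # memory failure rate from Table 4-1
exact = failed_at_least(6, 6, MEMORY_RATE, 10.0)         # all six must fail
conservative = truncated_estimate(2, 6, MEMORY_RATE, 10.0)
```

Truncating at two coincident failures overestimates the probability of failure, because every state beyond that level is counted as failed.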

Only the extended state space truncation technique is applicable to the ARM program. The reason for this is that the ARM program will not be evaluating the reliability model. State space truncation will be attempted during the generation of the models for the PMS graph segments. Algorithms that follow this technique must be developed and implemented.

4. Automated Reliability Modeling Examples

Currently the ARM program is limited to minimal subtrees, and to systems that are non-redundant and non-repairable and are not susceptible to transient or intermittent faults. Only three input categories have been implemented. These categories are: the interconnection structure, the hard failure rate of the component types, and the system minimum requirements. Only the output format for the SURE program has been implemented. The Cm* multiprocessor architecture [Swan 77] will be used to illustrate the current capabilities of the ARM program.

The Cm* multiprocessor architecture is based on the LSI-11 microcomputer. Figure 4-1 shows the PMS structure of one possible version of the architecture. Each computer module (Cm) is composed of one processor and one or more memory modules, connected via an interface module, the local switch (Slocal). Each cluster is composed of several computer modules connected to a mapping controller (Kmap). The Slocal passes external references (i.e. to memory elsewhere in the cluster or in a different cluster) to the Kmap. The Kmaps collectively realize the shared virtual address space and allow Cms to access memories in other clusters via the Intercluster Buses (B in the figure). The Kmaps are connected to the Intercluster Buses by interfaces (marked L in the figure).

In the following sections the automatic generation of reliability models of several versions of Cm* is demonstrated. The sensitivity of the models generated to the system requirements and the interconnection graph will also be illustrated. To validate the models generated they will be evaluated using the SURE program, and the results compared to those obtained from manually derived equations. The failure rates of the components used to evaluate the models were obtained from [Siewiorek 78] and are reproduced in Table 4-1.

Key:
B        Intercluster Bus
L        Intercluster Bus Interface
Kmap     Mapping Controller
Slocal   Local Switch
P        Processor
M        Memory

Figure 4-1: Cm* Architecture

Module                        Failure Rate (per hour)
Processor                      29.893E-6
Memory                         46.278E-6
Local Switch                   24.059E-6
Mapping Controller            130.935E-6
Intercluster Bus Interface     34.836E-6
Intercluster Bus                0.000E-6

Table 4-1: Failure Rates of Cm* Modules

4.1 Cm* Computer Module

The Cm* computer module to be modeled is composed of one processor, connected via an interface (Slocal) module to three memory modules. The PMS diagram of the Cm* computer module is shown in Figure 4-2.

            Slocal
         /  /    \  \
        P  M      M  M

Figure 4-2: Cm* Computer Module

The model ARM automatically generated when the Cm* computer module in Figure 4-2 requires one processor and one memory to perform its function is shown in Figure 4-3. Only states 1, 4, and 6 are not failure states. The computer module will fail if three memories fail, or if any single component other than a memory fails. The system starts in state 1 with all its components working. The probability of failure, during the first ten hours of operation, obtained from this model is 5.39375E-4.

Key:
State   Failed components
1       None
2       1 S
3       1 P
4       1 M
5       1 M & 1 P
6       2 M
7       2 M & 1 P
8       3 M

Figure 4-3: Model of Figure 4-2 Cm* Requiring 1 P & 1 M

The equation for the probability of failure Pf is

$$P_f = 1 - R_s R_p \left( R_m^3 + 3 R_m^2 (1 - R_m) + 3 R_m (1 - R_m)^2 \right) \qquad (4.1)$$

where $R_s$, $R_p$, and $R_m$ are the reliability functions of a local switch, a processor, and a memory. The $R_m^3$ term corresponds to the state in which all three memories function. The $3R_m^2(1 - R_m)$ term corresponds to the state in which one memory failed and two are functional. The $3R_m(1 - R_m)^2$ term corresponds to the state in which two memories failed and one is functional. The probability of failure, during the first ten hours of operation, obtained from this equation is 5.39375E-4. This is the same result obtained from the model in Figure 4-3.

4.2 Effect of the System Requirements

The number of components, N, required to function affects both the number of states in the model and the probability of failure. The number of states is a non-increasing function of N. The probability of failure is a non-decreasing function of N. For example, the model ARM automatically generated when the Cm* computer module in Figure 4-2 requires 1 processor and 2 memories to perform its function is shown in Figure 4-4. Only states 1 and 4 are not failure states. The computer module will fail if two memories fail, or if any single component other than a memory fails. The system starts in state 1 with all its components working. Comparing this model to the one shown in Figure 4-3, the number of states decreased to six and the probability of failure, during the first ten hours of operation, increased to 5.40016E-4.

The equation for the probability of failure Pf is

$$P_f = 1 - R_s R_p \left( R_m^3 + 3 R_m^2 (1 - R_m) \right) \qquad (4.2)$$

where $R_s$, $R_p$, and $R_m$ are the reliability functions of a local switch, a processor, and a memory. The $R_m^3$ term corresponds to the state in which all three memories function. The $3R_m^2(1 - R_m)$ term corresponds to the state in which one memory failed and two are functional. The probability of failure, during the first ten hours of operation, obtained from this equation is 5.40016E-4. This is the same result obtained from the model in Figure 4-4.
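Equations (4.1) and (4.2) can be checked numerically. The sketch below assumes that the failure rates of Table 4-1 are per hour and that each reliability function is exponential, R = exp(-rate * t), with t = 10 hours; the helper names are illustrative and are not part of ARM or SURE.

```python
from math import comb, exp

HOURS = 10.0
# Failure rates from Table 4-1 (assumed to be per hour).
RATE = {"P": 29.893e-6, "M": 46.278e-6, "S": 24.059e-6}

def reliability(kind, hours=HOURS):
    """Exponential reliability function of one component (an assumption)."""
    return exp(-RATE[kind] * hours)

def at_least(k, n, r):
    """Probability that at least k of n independent components work."""
    return sum(comb(n, j) * r**j * (1 - r)**(n - j) for j in range(k, n + 1))

def module_failure(memories_required):
    """Pf of a module with one switch, one processor and three memories."""
    rs, rp, rm = reliability("S"), reliability("P"), reliability("M")
    return 1.0 - rs * rp * at_least(memories_required, 3, rm)

pf_one_memory = module_failure(1)    # equation (4.1)
pf_two_memories = module_failure(2)  # equation (4.2)
```

Under these assumptions both values agree with the figures quoted in the text, 5.39375E-4 and 5.40016E-4, to the printed precision.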

Key:
State   Failed components
1       None
2       1 S
3       1 P
4       1 M
5       1 M & 1 P
6       2 M

Figure 4-4: Model of Figure 4-2 Cm* Requiring 1 P & 2 M

4.3 Cm* Cluster

The Cm* cluster to be modeled is composed of three computer modules connected via a cluster controller module (Kmap). Each computer module is composed of one processor connected via an interface module (Slocal) to two memory modules. The PMS diagram of the Cm* cluster is shown in Figure 4-5.

                     Kmap
               /      |      \
         Slocal    Slocal    Slocal
         / | \     / | \     / | \
        P  M  M   P  M  M   P  M  M

Figure 4-5: Cm* Cluster

The model ARM automatically generated when the Cm* cluster in Figure 4-5 requires 2 processors and 5 memories to perform its function is shown in Figure 4-6. All the failure states have been collapsed into state 2. The cluster will fail if two memories or two processors fail. The system starts in state 1 with all its components working. In state 5 one processor and one memory in the same computer module have failed, therefore if their local switch fails no other components will be affected. In state 6 one processor and one memory in a different computer module have failed, therefore if a local switch fails other components will also be affected. The probability of failure, during the first ten hours of operation, obtained from this model is 2.03253E-3.

Key:
State   Failed components
1       None
2       1 K or 1 S or 2 P or 2 M
3       1 P
4       1 M
5       1 M & 1 P in the same Cm
6       1 M & 1 P in a different Cm

Figure 4-6: Model of Figure 4-5 Cm* Requiring 2 P & 5 M

The equation for the probability of failure Pf is

$$P_f = 1 - R_k R_s^3 \left( R_p^3 + 3 R_p^2 (1 - R_p) \right) \left( R_m^6 + 6 R_m^5 (1 - R_m) \right) \qquad (4.3)$$

where $R_k$, $R_s$, $R_p$, and $R_m$ are the reliability functions of a mapping controller, a local switch, a processor, and a memory. The $R_p^3$ term corresponds to the state in which all three processors function. The $3R_p^2(1 - R_p)$ term corresponds to the state in which one processor failed and two are functional. The $R_m^6$ term corresponds to the state in which all six memories function. The $6R_m^5(1 - R_m)$ term corresponds to the state in which one memory failed and five are functional. The probability of failure, during the first ten hours of operation, obtained from this equation is 2.03253E-3. This is the same result obtained from the model in Figure 4-6.

4.4 Effect of the PMS Interconnection

The PMS interconnection affects both the number of states in the model and the probability of failure. For example, let us change the PMS interconnection of the Cm* cluster in Figure 4-5 so that two computer modules are composed of one processor and one memory, and the third is composed of one processor and four memories. The resulting PMS diagram of the Cm* cluster is shown in Figure 4-7.

                        Kmap
                /        |         \
         Slocal1     Slocal1       Slocal2
          /   \       /   \      /  |  |  |  \
         P1   M1     P1   M1    P2  M2 M2 M2 M2

Figure 4-7: Nonsymmetrical connection of Figure 4-5 Cm* Cluster

The model ARM automatically generated when the Cm* cluster in Figure 4-7 requires 2 processors and 5 memories to perform its function is shown in Figure 4-8. All the failure states have been collapsed into state 2. The cluster will fail if two memories or two processors fail. The system starts in state 1 with all its components working. Comparing this model to the one shown in Figure 4-6, the number of states increased to twelve and the probability of failure, during the first ten hours of operation, decreased to 1.55366E-3.

The equation for the probability of failure Pf is

$$P_f = 1 - R_k R_s^3 \left( R_p^3 + 3 R_p^2 (1 - R_p) \right) \left( R_m^6 + 6 R_m^5 (1 - R_m) \right) - 2 R_k R_s^2 (1 - R_s) R_p^2 R_m^5 \qquad (4.4)$$

where $R_k$, $R_s$, $R_p$, and $R_m$ are the reliability functions of a mapping controller, a local switch, a processor, and a memory. The only difference between this equation and equation (4.3) is the addition of the $2 R_k R_s^2 (1 - R_s) R_p^2 R_m^5$ term. This term corresponds to the state in which one local switch S1 failed and the other two local switches are functional. The probability of failure, during the first ten hours of operation, obtained from this equation is 1.55366E-3. This is the same result obtained from the model in Figure 4-8.
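The two cluster results can be reproduced with a short calculation. As before, this is a sketch under stated assumptions: failure rates from Table 4-1 taken as per hour, exponential reliability functions over ten hours, and illustrative helper names. The extra term for the nonsymmetrical cluster follows the description in the text (one S1 failed, the other two local switches, the two remaining processors and the five remaining memories functional).

```python
from math import comb, exp

HOURS = 10.0
# Failure rates from Table 4-1 (assumed to be per hour).
RATE = {"P": 29.893e-6, "M": 46.278e-6, "S": 24.059e-6, "K": 130.935e-6}

def reliability(kind, hours=HOURS):
    """Exponential reliability function of one component (an assumption)."""
    return exp(-RATE[kind] * hours)

def at_least(k, n, r):
    """Probability that at least k of n independent components work."""
    return sum(comb(n, j) * r**j * (1 - r)**(n - j) for j in range(k, n + 1))

RK, RS, RP, RM = (reliability(kind) for kind in ("K", "S", "P", "M"))

def symmetric_cluster_failure():
    """Cluster requiring 2 of 3 processors and 5 of 6 memories."""
    return 1.0 - RK * RS**3 * at_least(2, 3, RP) * at_least(5, 6, RM)

def nonsymmetric_cluster_failure():
    """Same requirements with the nonsymmetrical interconnection."""
    # One S1 failed; the remaining two processors and five memories work.
    extra = 2.0 * RK * RS**2 * (1.0 - RS) * RP**2 * RM**5
    return symmetric_cluster_failure() - extra

pf_symmetric = symmetric_cluster_failure()
pf_nonsymmetric = nonsymmetric_cluster_failure()
```

Under these assumptions the values agree with 2.03253E-3 and 1.55366E-3 to the printed precision.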

Key:
State   Failed components
1       None
2       1 K, 1 S2, 2 P, or 2 M
3       1 S1 & 1 P1 & 1 M1
4       1 P1
5       1 M1
6       1 P2
7       1 M2
8       1 P1 & 1 M1 in the same Cm
9       1 P1 & 1 M1 in a different Cm
10      1 P1 & 1 M2
11      1 P2 & 1 M1
12      1 P2 & 1 M2

Figure 4-8: Model of Figure 4-7 Cm* Requiring 2 P & 5 M

5. Plans for Future Work

The architectures and fault types the research will address will increase in complexity in phases as described below. The reason for breaking the research work into phases is to keep the complexity of the problem being addressed at a manageable level. The results of each phase of the research will be theoretically and experimentally validated before proceeding to the next phase. The ARM program will be used as part of the experimental validation of each phase. Next the performance and range of applications of the ARM program must be evaluated. Based on the results of the validation and evaluation the approach will be reformulated as necessary.

The first phase of the research will address hard faults, and non-redundant and non-repairable PMS tree structures that require their root to be operational. This phase will only involve research into steps 1, 4, and 5 of Table 3-1. All subsequent phases will involve research into steps 1, and 4 through 7 of Table 3-1.

Phase two of the research will address hard faults and non-redundant general structures with no repair. The third phase of the research will address hard faults, and dynamically redundant architectures that can have imperfect coverage, and repair. Examples of such architectures are a multiprocessor at CMU, named Cm* [Swan 77], and the Electronic Switching Systems (ESS) used in the Bell System [Toy 78]. Cm* and ESS will be used in the experimental validation of this phase.

Phase four of the research will address hybrid redundant architectures but only for hard faults. The fifth and last phase of the research will address transient, intermittent, and hard faults. For the last two phases the architectures used for experimental validation will be the Fault-Tolerant Multiprocessor (FTMP) at NASA's Langley Research Center [Lala 83], an Intel 432 [Siewiorek 82] based multiprocessor, Cm*, and ESS.

6. Conclusion

The previous sections presented an approach for automatic Markov reliability and availability modeling of computer architectures. The approach consists of eight steps. These steps were summarized in Table 3-1. The Automated Reliability Modeling (ARM) program is being developed to implement these steps. Section 2 described the input categories of the ARM program, which are capable of describing the Processor-Memory-Switch (PMS) interconnection graph, the behavior of the PMS components, the fault-tolerant strategies, and the operational requirements of the system.

Section 3 discussed the method currently envisioned to generate the Markov model from this input description. Progress already made in identifying and analyzing the problems involved, and in developing and implementing algorithms to solve them, was also presented. Section 4 presented examples of the current capabilities of the ARM program. The sensitivity of the models generated to the system requirements and the PMS interconnection graph was also demonstrated. Section 5 presented the current plans for extending the capabilities of the ARM program to include all of the steps in Table 3-1.

A. ARM Program Algorithms

A.1 Symmetry Detection Algorithm

Function definitions:

Split_Class(R, C, L) - If relation R is not satisfied it partitions class C and creates a new class after the last class L. Returns the number of equivalence classes.

Size(C) - Returns the number of elements in the equivalence class C.

Element(E, C) - Returns the vertex element E of the equivalence class C.

Equivalent(E, C, R) - True if element E of class C is equivalent in terms of relation R to the preceding class elements.

Equal_Degree(E, C) - True if element E of class C has the same degree as the preceding elements of class C.

Equal_Neighbor_Classes(E, C) - True if element E of class C has the same number of neighbors in each class as the preceding elements of class C.

procedure Symmetry;

  function Equivalent(Current_Element, This_Class, Relation);
  begin
    if Relation = Degree then
      return Equal_Degree(Current_Element, This_Class)
    else
      return Equal_Neighbor_Classes(Current_Element, This_Class);
  end;

  function Split_Class(Relation, This_Class, Last_Class);
  begin
    Split := false;
    for I := 2 to Size(This_Class) do
    begin
      Current_Element := Element(I, This_Class);
      if not Equivalent(Current_Element, This_Class, Relation) then
      begin
        if not Split then
        begin
          Split := true;
          Last_Class := Last_Class + 1;
          { Create a new Last_Class with the degree and neighbor
            attributes of the Current_Element of This_Class. };
        end;
        { Move the Current_Element to the Last_Class. };
      end;
    end;
    return Last_Class;
  end; { Split_Class }

begin { Symmetry }

{ Step

I: Split

based on equal

Last Class :: Last Type; for _ :: I to Last Class ( Add elements

of

}

do

on equa±

I :: I; while I