NASA Technical Memorandum 89009
NASA-TM-89009 19860021844
Towards Automatic Markov Reliability Modeling of Computer Architectures
Carlos A. Liceaga and Daniel P. Siewiorek
August 1986

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23665
Summary
The analysis and evaluation of reliability measures using time-varying Markov models has gained in importance for computer architectures that use standby redundancy or can be repaired. The task of generating these models for arbitrary Processor-Memory-Switch (PMS) interconnection structures, however, is tedious and prone to human error due to the large number of states and transitions involved in any reasonable structure. Existing programs that evaluate these models make the following assumptions:

a) The case analysis of success states of the system has been carried out. Such analysis must be done manually.
b) The input to the program is either an intermediate representation (e.g. Fault Tree), or the state transition matrix (STM).
This is the first attempt to (a) identify and analyze the problems involved in the automatic generation of reliability and availability Markov models for arbitrary interconnection structures at the PMS level, and (b) generate and implement solutions to these problems. This work will automate the task of case analysis and generation of the STM in the computation of the reliability and availability of PMS structures. The advantages of such an approach are (a) utility to a larger class of users, not necessarily expert in reliability analysis, and (b) a lower probability of human error in the computation.

A program named ARM (Automated Reliability Modeling) will be constructed as a research vehicle. ARM will accept as inputs:

a) The interconnection graph of the PMS structure.
b) The behavior of the PMS structure components in terms of their internal communication structure, and their distributions and corresponding parameters of performance and reliability.
c) The groups of redundant components (e.g. processor triads).
d) A succinct statement of the operational requirements on the PMS structure in the form of a modified Boolean expression.
The operational requirements may be, for example, "two processor triads and two memory triads" in the case of a redundant multiprocessor. The communication structures (e.g. buses) in the PMS system will be considered in addition to the explicitly stated requirements to determine how the interconnection structure affects the system reliability and availability.

The output of the ARM program will be the reliability or availability STM. The STM will be formulated for direct use by evaluation programs.
Acknowledgements

The authors are very grateful for the assistance of Larry D. Lee of NASA-LaRC in defining the various types of time-varying Markov models, calculating the number of simulation repetitions required, and developing a useful state space reduction technique. His numerous and detailed comments have greatly improved the clarity of this document.
Table of Contents

1. Introduction
   1.1 Background
   1.2 Previous Work
   1.3 Motivation
   1.4 Organization
2. System Description
   2.1 Component Types
   2.2 Redundant Groups
   2.3 System Watchdog Timers
   2.4 PMS Structure
   2.5 Intracomponent Port Connections
   2.6 Intra Component-Type Communication
   2.7 Component Clustering
   2.8 System Requirements
   2.9 Example
3. Automated Reliability Modeling Considerations
   3.1 Detection of Symmetry in the PMS Graph
   3.2 Segmentation of the PMS Graph
   3.3 Identification of Success and Failure States
   3.4 Generation of Models for PMS Graph Segments
   3.5 Merging of Models for PMS Graph Segments
   3.6 Reduction of the State Space
4. Automated Reliability Modeling Examples
   4.1 Cm* Computer Module
   4.2 Effect of the System Requirements
   4.3 Cm* Cluster
   4.4 Effect of the PMS Interconnection
5. Plans for Future Work
6. Conclusion
A. ARM Program Algorithms
   A.1 Symmetry Detection Algorithm
   A.2 Segmentation Algorithm
   A.3 Success and Failure State Identification Algorithm
   A.4 Minimal Subtree Model Generation Algorithm
References
List of Figures

Figure 1-1: Reliability Graph of a Triad with 1 Spare
Figure 1-2: Hierarchy of Time-Varying Markov Models
Figure 2-1: Use of Component Type Information in Reliability Models
Figure 2-2: Reliability Graph of a Triad with a Watchdog
Figure 2-3: Grammar of Requirements
Figure 2-4: PMS Diagram of Multiprocessor Described in Table 2-2
Figure 3-1: Parse tree of requirement expression (3.1)
Figure 4-1: Cm* Architecture
Figure 4-2: Cm* Computer Module
Figure 4-3: Model of Figure 4-2 Cm* Requiring 1 P & 1 M
Figure 4-4: Model of Figure 4-2 Cm* Requiring 1 P & 2 M
Figure 4-5: Cm* Cluster
Figure 4-6: Model of Figure 4-5 Cm* Requiring 2 P & 5 M
Figure 4-7: Nonsymmetrical connection of Figure 4-5 Cm* Cluster
Figure 4-8: Model of Figure 4-7 Cm* Requiring 2 P & 5 M
List of Tables

Table 2-1: Redundancy Technique Specification
Table 2-2: Multiprocessor System Description Example
Table 3-1: Automated Reliability Modeling Steps
Table 3-2: Minimal Subtree Modeling Steps
Table 3-3: Two Model Merging Steps
Table 4-1: Failure Rates of Cm* Modules
1. Introduction
Computer
systems
multiprocessor
and
widespread
use to
growth
being
is
complex
fault-tolerance
design are in
measures reliability
The
to
make
and
analysis
computer
evaluation
is
very
of
assume
an
in
preliminary achieved.
more
in
system Although
of
Thus
the
one of the system
in the literature system
providing
and
reliability
designers
system
reliability
and
to
With
Section
understanding
therefore are
by
tedious
experienced reliability analysts. program, discussed
parameters.
computing
efficieht
more
the importance
has become
of
into This
of successively
design
as
with
tools.
and
systems
task
coming
and reliability.
have been reported
more
evaluation
as
measures
the
are
has increased
reliability
efforts
systems
availability
trend
reliability
and sophistication
performance
the
This
and system
easier
computer
by
blocks.
Several
complexity
higher
assisted
progress
in
distributed
of system
tasks.
growing
achieve
building
computation
are
1.2,
of the
prone the
nature
of
decomposition
and
ADVISER
not
does
error
complex even
for
exception of the ADVISER
existing
reliability
for
software tools usually
analysis
techniques
and
computational
aids
once the
analysis
been
manually
make
combinatorial techniques and is therefore
has
this
assumption it uses
limited in the complexity of
systems and fault types it can analyze.
More
advanced
architectures susceptible
that use
Markov
models
analysts
are that
concurrent
system
is
represented
redundancy,
The
they
are
programs,
to solve them.
analyze
required
or intermittent
model.
and several
developed
are
standby
to transient
time-varylng Markov
techniques
events.
reconfiguring by a transition
faults.
discussed
itself a
in
state
a fault a
and are
One possibility
Markov
is a
by time-varying
use among
reliability
1.2, have models
that arrives
previous
where
computer
be repaired,
Section
time-varying
around
analyze
offered
widespread
For example,
to
can
advantages in
However
to
fault
two faults
been
can not
while would
the be
are present.
This new state would spent reconfiguring
Another described
from
possibility in [Dugan
can analyze detail
not take
the first
is
its 'tokens'
at independent
algorithm
of the process be converted
is not
possible times
makes
process
the
depend
on
past
models.
The
a
and
tha% are
are
not
states).
modeling
Markov
an ESPN
it of
is the
analytically
model.
This conversion
concurrently
at independent
(i.e.
transition
the
general
level
that can simulate
distributed,
In
is that
capability
exponentially
non-Markovian
(ESPN)
to move concurrently
To solve
moving
net
at a lower
enabled
counters
time-varying
tokens
already
ESPN can be concurrent
The low level
queues
Petri
by the ESPN
systems
being modeled.
to
if
transition
offered
model
times. as
thesystem
stochastic
can be simultaneously
such
it must
and
Markov
transition
due to mechanisms
extended
events
the time
fault.
The advantages
than time-varying
because
the
84].
concurrent
into account
an
ESPN
because
this
probabilities
must
be
solved
by
simulation.
Simulations can include any level of detail, and are thus flexible, but many repetitions of the simulation are needed to ensure accuracy. For example, say the probability of failure P is going to be estimated with a relative error of no more than 10% within a confidence interval of 95%. The relative error E is defined as:

$$E = \frac{|\hat{P} - P|}{P} \qquad (1.1)$$

where $\hat{P}$ is the estimate of P. $\hat{P}$ is defined as:

$$\hat{P} = F / N \qquad (1.2)$$

where F is the number of failures observed and N is the sample size. Then an expression for N must be found such that:

$$\Pr(E \le .1) = .95 \qquad (1.3)$$

Substituting (1.1) into (1.3) and multiplying the inequality by P gives:

$$\Pr(|\hat{P} - P| \le .1P) = .95 \qquad (1.4)$$

Substituting (1.2) into (1.4) and multiplying the inequality by N gives:

$$\Pr(|F - NP| \le .1NP) = .95 \qquad (1.5)$$

Substituting $\mu$ for NP in (1.5) gives:

$$\Pr(|F - \mu| \le .1\mu) = .95 \qquad (1.6)$$

The inequality in (1.6) can be expressed as:

$$\Pr(.9\mu \le F \le 1.1\mu) = .95 \qquad (1.7)$$

If N is large and P is small, F is approximately Poisson distributed with mean $\mu = NP$ and (1.7) can be expressed as:

$$\sum_{i=.9\mu}^{1.1\mu} \frac{\mu^{i} e^{-\mu}}{i!} = .95 \qquad (1.8)$$

Therefore in life critical applications where a probability of failure in the order of $10^{-9}$ is required, approximately $3.8 \times 10^{11}$ simulation repetitions are necessary!
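The repetition count quoted for a failure probability on the order of 10^-9 can be checked with a short calculation. The sketch below does not evaluate the Poisson sum directly; as an assumption made here for illustration, it replaces the Poisson distribution by a normal distribution with mean and variance both equal to the expected number of failures, which is reasonable when that mean is in the hundreds. The function name and its parameters are hypothetical.

```python
from statistics import NormalDist


def required_repetitions(p_failure, rel_error=0.1, confidence=0.95):
    """Approximate N such that Pr(|F - mu| <= rel_error * mu) = confidence,
    where mu = N * p_failure is the expected number of observed failures."""
    # Two-sided normal quantile (about 1.96 for 95% confidence).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Normal approximation to the Poisson count: standard deviation sqrt(mu),
    # so rel_error * mu = z * sqrt(mu), which gives mu = (z / rel_error) ** 2.
    mu = (z / rel_error) ** 2
    return mu / p_failure


n = required_repetitions(1e-9)
print(f"{n:.2e}")  # on the order of 3.8e11 repetitions
```

Under this approximation about 384 observed failures are needed, so the number of repetitions scales as 384 divided by the failure probability.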
in life
in the order repetitions analytic
critical
10-9
is
required,
generation
a probability
approximately general
of this paper
of reliability
and
interconnection
structures
The result
this
validated
of
those
in the
ARM
operational
requirements
efficiently
operational
PMS
wi!l
be
applications
states
Markov
require
an
based
in the automatic
models
implemented
the on
level.
and experimentally
and
a
The program
interconnection
program
which
simple
set of
will attempt
divide-and-conquer the
for arbitrary (PMS)
Modeling)
structure
on the structure. using
the issues
_eliability
interconnection
analyze,
system
methodology, structure
to the
and the
requirements.
output
of
the
ARM
program
will
be
a
file
reliability or availability state transition matrix. will vary depending on matrix.
of failure
3.8 x 1011 simulation
the Processor-Memory-Switch
(_utomated
the
to explore
availability
at
effort
will accept
The
In
where
approach.
It is the intent
various
applications
are necessary!
(I .8)
i!
the
program
The evaluation programs whose
to
containing the
The output format
evaluate the state transition format the user will be able to
specify are: SURE, HARP, and ARIES (described in Section 1.2).
The following sections will present a brief background on reliability calculation at the PMS level using time-varying Markov models. Previous work in the generation and evaluation of reliability models is surveyed. The goals for ARM will be stated and compared with those of previous efforts. The final section will present the organization of this paper.
1.1 Background
Present day computer systems can be viewed at varying levels of detail, and therefore so can the process of designing and analyzing them. Four levels were defined by Siewiorek, Bell, and Newell [Siewiorek 82a]. These levels range from the circuit level, through the logic and programming levels, to the PMS level. The PMS level is one view of digital systems where the primitives are processors, memories, switches, transducers, etc. as opposed to the logic level where the primitives may be gates, registers, multiplexers, etc.

Hardware components are susceptible to permanent, transient, and intermittent faults as discussed in [Siewiorek 82b]. A fault is an erroneous state of hardware or software resulting from a physical change in the hardware, or interference from the environment. Permanent or hard faults are continuous and stable, and result from an irreversible physical change. Intermittent faults are occasionally present due to unstable hardware, or varying hardware or software states (for example, as a function of load or activity). Transient faults result from temporary environmental conditions.

Fault-tolerant computer systems can be affected by a limited set of faults without interruptions in their operation. Some computer systems achieve fault-tolerance by using redundant groups of components which perform the same operations. The system must determine which is the correct output using diagnostics or majority voting. Siewiorek and Swarz [Siewiorek 82b] discuss the various redundancy techniques; the more relevant ones are defined below.
STATIC REDUNDANCY - In static redundancy faults are masked through a majority vote involving a fixed group of redundant components. Thus, when the masking redundancy is exhausted by component faults, any further faults will cause errors at the output.

DYNAMIC REDUNDANCY - In dynamic redundancy faults are not masked but detected; the faulty components are isolated and reconfigured out of the system. The faulty components may be replaced by spares if available.

HYBRID REDUNDANCY - In hybrid redundancy faults are masked through a majority vote involving a group of redundant components that is reconfigured when spares are available. Thus, when the redundancy is exhausted by component faults, any further faults will cause errors at the output.

ADAPTIVE VOTING - In adaptive voting faults are masked through a majority vote involving a variable group of redundant components without spares. Faulty components are reconfigured out of the system by excluding them from the voting process, and the voter threshold is adjusted to reflect a smaller number of components. Thus, when the redundancy is exhausted by component faults, any further faults that occur will cause errors at the output.

ADAPTIVE HYBRID - In adaptive hybrid faults are masked through a majority vote involving a variable group of redundant components that is reconfigured when spares are available. If spares are not available, faulty components are reconfigured out of the system by excluding them from the voting process and adjusting the voter threshold. Thus, when the masking redundancy is exhausted by component faults, any further faults that occur before a faulty component is replaced by a spare or reconfigured out of the voting process will cause errors at the output.
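The voting-based techniques above share one mechanism: a majority vote whose threshold follows the number of components still taking part. The sketch below is only an illustration of that idea and is not taken from the report; the function name and the use of `None` to signal that no majority exists are assumptions made here.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the active
    components, or None when no value reaches the threshold."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # The threshold adapts to the number of components still voting,
    # as in adaptive voting after a faulty component is excluded.
    threshold = len(outputs) // 2 + 1
    return value if count >= threshold else None


# A triad masks one faulty output.
print(majority_vote([7, 7, 9]))
# After the faulty component is excluded, the two survivors still agree.
print(majority_vote([7, 7]))
# A further fault can no longer be masked: there is no majority.
print(majority_vote([7, 9]))
```

With a fixed group (static redundancy) the list passed to the voter never shrinks; with adaptive voting the caller removes components that have been diagnosed as faulty before the next vote.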
For example, a triad is a group of 3 components that use hybrid redundancy to tolerate at least one fault. If a triad recovers from a fault by replacing the faulty component with a spare it can then tolerate a second fault. Recovery is the process of detecting, isolating, and reconfiguring the faulty component out of the system. The fault coverage of a component is the probability that the system can survive a fault in this component and successfully recover. If the system can always recover it has a "perfect" coverage of 1.
Reliability measures are defined in terms of probabilities because the failure processes in hardware components are non-deterministic. Siewiorek and Swarz [Siewiorek 82b] discuss these various measures; the more relevant ones are defined below.

RELIABILITY - The reliability, R(t), of a system as a function of time t is the conditional probability that the system has survived the interval [0, t] given that it was operational at time zero. It is a non-increasing function whose initial value is one.
MTTF - The MTTF (Mean Time To Failure) is the expected time of the first system failure assuming a new (perfect) system at time zero.

AVAILABILITY - The availability, A(t), of a system as a function of time t is the probability that the system is operational at that instant of time. If the limit of A(t) exists as t goes to infinity, it expresses the expected fraction of time that the system is available to perform useful computations.

Availability is typically used as a figure of merit in systems in which service can be delayed or denied for short periods to perform preventive maintenance or repair without serious consequences. The system availability is important in the computation of life-cycle costs. Reliability is used to describe systems in which repair is typically infeasible such as aerospace applications. The MTTF can be derived from R(t) as follows:

$$MTTF = \int_{0}^{\infty} R(t)\,dt$$
The most commonly used reliability function for a single component is based on a Poisson process with an exponential distribution. This function is called the exponential reliability function, and has the form:

$$R(t) = e^{-\lambda t}$$

where $\lambda$ is the hazard or failure rate. The failure rate is a constant which reflects the reliability of the component and for highly reliable components is usually expressed in failures per million hours. The exponential reliability function is used when the failure rate is time-independent, such as when components do not age. It is often observed that, after a burn-in period, permanent faults in electronic components follow a relatively constant failure rate. The MTTF for the exponential reliability function has the form:

$$MTTF = \frac{1}{\lambda}$$

Many other reliability functions have been formulated. The second most common reliability function is based on the Weibull distribution. This function is called the Weibull reliability function, and has the form:

$$R(t) = e^{-(\lambda t)^{\alpha}}$$

where $\lambda$ is the scale parameter and $\alpha$ is the shape parameter (other reparameterized forms are also common). It is equivalent to the exponential function when $\alpha$ is one. The Weibull reliability function is used when the failure rate is time-dependent. Permanent faults for components that age can be described using an increasing failure rate ($\alpha$ greater than one) and in this case the system is not as good as new when repair takes place. Data presented in [McConnel 81] indicates that transient faults follow a decreasing failure rate ($\alpha$ less than one).
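The relationship between the two reliability functions and the MTTF integral can be checked numerically. The sketch below assumes the Weibull parameterization R(t) = exp(-(λt)^α), which is one of several common forms; the failure rate value and the function names are hypothetical and chosen only for illustration.

```python
import math


def r_exponential(t, lam):
    """Exponential reliability function with constant failure rate lam."""
    return math.exp(-lam * t)


def r_weibull(t, lam, alpha):
    """Weibull reliability function, assuming the form exp(-(lam*t)**alpha)."""
    return math.exp(-((lam * t) ** alpha))


def mttf_numeric(reliability, t_max, steps=200_000):
    """Trapezoidal estimate of the integral of R(t) from 0 to t_max."""
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h


lam = 1e-4  # hypothetical failure rate, in failures per hour
# Integrate far enough out that the neglected tail is negligible.
mttf = mttf_numeric(lambda t: r_exponential(t, lam), t_max=50.0 / lam)
print(mttf, 1.0 / lam)  # the two values should agree closely
```

With a shape parameter of one, `r_weibull` returns the same values as `r_exponential`, which matches the equivalence stated in the text.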
The failure independent as when affect
processes
of each
of different
other.
electrical,
mechanical,
other components
in practice
This
at
any
components
fail
state are called to be known
is not made a transition
times are assumed
give n time
interval
system
their state
system
does
to
be
systems
b) discrete-state
and continuous-time and discrete-time
d) continuous-state
and continuous-time
The
correspond state
to system
transitions.
distribution from
transition
states
and
Each
arc
is
exponential Weibull
For example, distribution,
the has
distribution,
distributions.
the node.
the label the
is
used.
of a
If it is
classified
aceordlng
to
graph.
The nodes
directed a
scale
or the filename
arcs
label
history
distribution.
If the state
directed
a
the previous
originating
if this assumption
(STD) may be
destination
at the
are assumed
diagram
node
initially
of
transition
the originating
of
states
changes
to some multiple
be
probability
to the
These
at any time a continuous-time
conditional
given
of the
state
diagram
to describe such as when
used.
model
can
c) continuous'state
drawn.
enough
as
and discrete-time
a
is
restricted
a) discrete-state
system
is used;
system
space and time parameter
For a discrete-state
be known
changes,
model
cart occur
Most
such
in one component
its state.
model
system
a discrete-time
is used.
to be
true,
it is close
If all possible
system
that state transitions model
the
so
continuous-state
However
that must
state transitions.
a discrete-state
conditions
all
'As
or are repaired,
be assumed
the analysis.
represents
instant.
will
is not strictly
proximity.
to be used to simplify
the system
assumed
assumption or' thermal
in its
The state of a system
components
system
and of
that
that
allowable
identifies
the
the system will
go
node of that directed
arc
and
was
The _ label could
indicate
that
the system
used depends
be the hazard shape
rate
parameters
a histogram
on the for the _ for the
for more general
If transitions are allowed from failed states to operational states then the STD is an Availability graph and A(t) may be obtained from it. R(t) may be obtained by specifically disallowing failed to working state transitions from the STD thus making it a Reliability graph.
A Reliability graph of a triad is given in Figure 1-1. In this model it is assumed that the system has a perfect coverage of 1. The horizontal transitions represent fault arrivals. These transitions follow an exponential distribution and consequently λ represents the constant hazard rate. The coefficients of λ represent the number of working processors being actively used in the configuration. The vertical transitions represent recovery from a fault. These transitions follow a general distribution and consequently their label represents the filename of the histogram defining the distribution. There is a race between the occurrence of a second fault and the removal of the first. If the second fault wins the race, then system failure occurs. If the removal of the first fault wins the race, then the system reconfigures into a simplex (i.e. only uses one of the two working components). Unless otherwise noted in the state descriptions, all working processors are being actively used in the configuration.

Key:
State  Description
1      3 working
2      2 working
3      system failed
4      2 working, uses 1
5      system failed

Figure 1-1: Reliability Graph of a Triad with 1 Spare
The information conveyed by the STD is often summarized in a square matrix called the state transition matrix (STM). The STM element in row i and column j is the label in the arc from state i to state j.

The terminology used in this paper to denote the various types of time-varying Markov models, and the assumptions they are based on, are defined below. The hierarchy of time-varying Markov models is illustrated in Figure 1-2.
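As an illustration of how the arcs of a state transition diagram map onto a state transition matrix, the sketch below builds a small matrix of arc labels. The five states, the arc labels, and the histogram filename are hypothetical; they are loosely patterned on a triad model with fault arrival and recovery transitions and are not copied from the report's figure.

```python
# Arcs of a small state transition diagram: (from_state, to_state) -> label.
# All labels below are illustrative assumptions.
arcs = {
    (1, 2): "3*lambda",      # first fault arrives while 3 components are in use
    (2, 3): "2*lambda",      # second fault arrives before recovery completes
    (2, 4): "recovery.hst",  # hypothetical histogram file for the recovery time
    (4, 5): "lambda",        # fault in the single component still in use
}


def build_stm(arcs, n_states):
    """Return a square matrix whose element [i-1][j-1] is the label of the
    arc from state i to state j, or "0" when no such arc exists."""
    stm = [["0"] * n_states for _ in range(n_states)]
    for (i, j), label in arcs.items():
        stm[i - 1][j - 1] = label
    return stm


stm = build_stm(arcs, 5)
for row in stm:
    print(row)
```

Rows that contain only "0" entries correspond to states with no outgoing arcs, which in a reliability graph are the failed states.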
TIME-VARYING MARKOV PROCESS - A stochastic process whose future state depends only upon the present state, and not upon the history that led to its present state.

HOMOGENEOUS MARKOV MODEL - A model that uses a pure Markov process whose state transition probabilities are time-independent. For the continuous-time process this implies that the state transition times follow an exponential distribution. This model is discussed in [Chung 67] and [Romanovsky 70].

NON-HOMOGENEOUS MARKOV MODEL - A model that uses a generalization of the pure Markov process whose state transition probabilities depend upon the global time. For the continuous-time process this implies that the state transition times do not follow an exponential distribution; they might follow a Weibull or any other distribution. This model is discussed and applied to computer systems in [Trivedi 81].

SEMI-MARKOV MODEL - A model that uses a generalization of the pure Markov process whose state transition probabilities depend upon the local time spent in the present state. For the continuous-time process this implies that the state transition times do not follow an exponential distribution. Often they are assumed to follow a Weibull distribution, but they can follow any other distribution. This model is discussed and applied to computer systems in [White 84].
Time-Varying Markov
    Homogeneous (time-independent)
    Semi-Markov (local time-dependent)
    Non-Homogeneous (global time-dependent)

Figure 1-2: Hierarchy of Time-Varying Markov Models
The probability of being in a particular state for a discrete-state and continuous-time Markov model can be expressed with a differential equation. The set of simultaneous differential equations that describe these models are called the continuous-time Chapman-Kolmogorov equations. For homogeneous Markov models these equations can be solved using matrix or Laplace transformations.

If the state transition probabilities are time-dependent it may be quite difficult to obtain explicit solutions to the continuous-time Chapman-Kolmogorov equations. To obtain the exact probability of reaching a state through a particular path of transitions requires the solution of a multiple integral, where each integral represents the probability of making one of the transitions in the path. Often the integrals are approximated using numerical integration techniques [Stiffler 79]. An alternative method is to approximate the continuous-time process with discrete-time equivalents [Siewiorek 82b]. The major difficulty with the second method is that many transition rates that are effectively zero in the continuous-time process assume small but nonzero probabilities in a discrete-time process.
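For a homogeneous model the Chapman-Kolmogorov equations form a linear system, dP/dt = P Q, where P is the row vector of state probabilities and Q holds the constant transition rates. The sketch below integrates that system with a classical fourth-order Runge-Kutta scheme, chosen here only to keep the example self-contained; it is not one of the solution methods named in the text. The three-state model (three working, two working, failed), the rate value, and the function names are illustrative assumptions.

```python
import math


def derivative(p, q):
    """Right-hand side of dP/dt = P Q for row vector p and rate matrix q."""
    n = len(p)
    return [sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]


def solve(p0, q, t_end, steps=1000):
    """Integrate the state probabilities from 0 to t_end with RK4."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = derivative(p, q)
        k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)], q)
        k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)], q)
        k4 = derivative([x + h * k for x, k in zip(p, k3)], q)
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


lam = 1e-3  # hypothetical constant failure rate per hour
# States: 0 = three working, 1 = two working, 2 = failed (no repair).
q = [[-3 * lam, 3 * lam, 0.0],
     [0.0, -2 * lam, 2 * lam],
     [0.0, 0.0, 0.0]]
t = 1000.0
p = solve([1.0, 0.0, 0.0], q, t)
reliability = p[0] + p[1]
# Closed-form reliability of a two-out-of-three system without repair.
closed_form = 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)
print(reliability, closed_form)
```

The numerical result can be compared with the closed-form expression, which is what a matrix or Laplace transform solution would yield for this small model.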
1.2 Previous Work
There are several programs that use time-varying Markov models to evaluate the reliability and/or availability of systems that use standby redundancy or can be repaired, and are susceptible to hard, transient, and intermittent faults, such as CARE III, ARIES, SURE, SURF, and HARP. All of these programs can evaluate both the reliability and availability of a system, except for CARE III which can only evaluate the reliability.
the system
CARE
specification
III
[Bavuso
84],
can
that
technique
described
events
while
frequent using
the
Weibull
in the
models.
fault trees, model.
The
the transition transition
to
of
matrices
that
can not be
at Raytheon,
analyzed
and uniform using
use exponential
model.
specified
by
Numerical
using
Markov extended
to the non-homogeneous
is specified
semi-Markov
accepted
an and
is reflected
these time-varying
is
it is written
assumes
of relatively
analyzed
Markov
converted
fixed
is
behavior
behavior
the
the
infrequent
is separately
can
solve
behavior
fault-handling
technique
composed
handling
automatically
parameters
III was developed
by providing
model.
Therefore
as input directly.
in FORTRAN
CARE
77, and runs on
or a VAX.
ARIES
(Automated
described
Reliability
in [Makam 82],
The system series
used
fault-occurrence which are
is
non-homogeneous
are
use
not repair
of relatively
behavior
model
fault
aggregate
techniques The
a Cyber
The
that
can use exponentlal
occurrence Markov
distributions,
integration
state
fault
non-homogeneous
parameters
Markov
The
This
behavior
that
do
in
decomposition/aggregatlon
is composed
fault-handling
described
systems
but
81].
behavior
model
of
faults
as one of
matrix.
Estimation),
[Trivedi
behavior
semi-Markov
distributions. aggregate
The
transition
behavioral
fault-handling
events.
a fixed
a
in
they all have
reliability
component
uses
the fault-occurrence
III,
the state
the
to tolerate It
for CARE
Reliability
evaluate
components.
solution
methods
(Computer-_ided
reconfiguration faulty
Except
can be specified
of independent
are either solution
Butler
active
technique
transition
is
matrix.
[Butler
_nreliability
restricted using
subsystems
or serve that
_nteractive
as
a
assumes
describes
Range Evaluator)
transition
each containing It uses
distinct
It was developed
84]
to homogeneous state
spares.
a
which
Estimation
at UCLA
_ystem),
Markov matrix,
identical a matrix
models. or as a
modules
that
transformation
eigenvalues
for the state
and runs on a VAX.
Butler [Butler 84] describes a program named SURE (Semi-Markov Unreliability Range Evaluator) which evaluates the unreliability upper and lower bounds of semi-Markov models. It uses new mathematical theorems proven in [White 84] and [Lee 85]. These theorems provide a means of bounding the probability of traversing a specific path in the model within a specified time. By applying the theorems to every path of the model, the probability of the system reaching any death state can be determined within usually very close bounds. These theorems assume that slow (with respect to the mission time) exponential transitions describe the occurrence of faults, and fast general transitions describe the recovery process. Faults can be modeled as permanent, transient, or intermittent. Its only input method is the state transition matrix. SURE was developed at NASA's Langley Research Center, it is written in VAX-11 Pascal, and runs under VAX/VMS.
SURF, described in [Landrault 78], can solve semi-Markov models that use exponential distributions or non-exponential distributions that are related to the exponential (e.g. Gamma, Erlang, etc.). The method of stages [Cox 68] is used to produce a homogeneous Markov model. Matrix transformations are used to obtain time-independent values, such as MTTF and the limiting availability. The Laplace transform is used to obtain time-dependent values, such as availability and reliability. Written in PL/I, it runs on an IBM System/370 at the IBM research facility in Yorktown Heights, New York. SURF was developed in Toulouse, France.
For
HARP
(Hybrid
[Trlvedi
85],
uniform,
Weibull,
Automated
the state or
HARP
can only evaluate
rates.
HARP
has
behavior
converted
to
behavior
extended
be
the availability additional
specified
models.
stochastic
histogram
transition
(e.g. fault
Petri
_redictor),
probabilities
(i.e.
by providing
the CARE
be provided)
is given with
repair
the fault-
are automatically The fault-handling
the transition models
III model,
by the user
constant
of specifying
model.
The fault-handling net,
list must
all of which
Markov
in
exponential,
of systems methods
described
can have
matrix
trees),
non'homogeneous
can also
of one of several
state
several
occurrence
a
transition
general
distributions. _ If the
Reliability
parameters
available
the ARIES
are:
model,
an and
the SURE model. It uses the same behavioral decomposition/aggregation solution technique as CARE III, but the various models are solved in a hybrid fashion. Time-varying Markov models are solved analytically using numerical integration techniques, and extended stochastic Petri nets are solved by simulation. It is written in FORTRAN 77 and runs on a VAX. It is still under development at Duke University and Clemson University.
An abstract specification language for Markov reliability models has been described by Butler [Butler 85]. This language has statements to specify (a) the state space by defining the state variables and their range, (b) the start state by the initial values of the state variables, (c) the death states by a Boolean expression of the state variables, and (d) the state transitions, their destination state and their rates, by a set of if-then rules, all in terms of the state variables. A program named ASSIST was implemented to generate the possible states and their transitions in the SURE input language. The algorithm used in ASSIST to generate the model will be applicable with modifications to ARM as described in Section 3. ASSIST was developed at NASA's Langley Research Center, it is written in VAX-11 Pascal, and runs under VAX/VMS.

Kini [Kini 81] describes a program named ADVISER (Advanced Interactive Symbolic Evaluator of Reliability) which automatically generates symbolic reliability functions for PMS structures. Its assumptions are: (a) all faults are permanent and stochastically independent, (b) the PMS system has a perfect coverage, and (c) failed components are not repaired and returned to a non-faulty state. Its primary input is the interconnection graph of the PMS structure. Other inputs describe the components by their types, reliability functions, internal port connections, and ability to communicate with components of the same type. The program also takes as input the requirements on the PMS structure in the form of modified Boolean expressions. The methods used in ADVISER for detecting symmetries, subsystems or clusters, and tree structures in the PMS graph will also be applicable with modifications to ARM as described in Section 3. ADVISER was developed at CMU, it is written in BLISS, and runs on a PDP-10.
1.3 Motivation
The goal of this research and development effort is to provide the computer architect a powerful and easy to use software tool that will assume the burden of an advanced reliability analysis that considers intermittent, transient, and permanent faults for computer systems of high complexity and sophistication. The PMS level of computer system description was selected because (a) it is easiest to specify, and (b) it is the highest level view of digital systems and therefore well known to computer architects. The time-varying Markov model technique of reliability and availability analysis was selected because (a) it is powerful enough to accurately analyze most situations except for concurrent events, and (b) it is in widespread use among reliability analysts and several evaluation programs have been developed.

Previous efforts have been limited in one of two ways. Most provided a computational aid once the preliminary system decomposition and reliability analysis had been manually achieved. Alternatively, computer systems of less complexity and sophistication were considered without transient and intermittent faults.
1.4 Organization
The system description required to generate a reliability or availability Markov model is described in Section 2. The problems involved in the automatic generation of reliability and availability Markov models are discussed in Section 3. Examples of automatically generated Markov reliability models are presented in Section 4. A summary of the research and a plan for its accomplishment are presented in Section 5. The algorithms used by the ARM program are described in Appendix A.
2. System Description

It is important to have a general system description method that will accommodate new fault-tolerant techniques and system designs. This section presents the system description method currently envisioned for the ARM program. The generality of this method needs to be investigated to correct any deficiencies.

When calculating a reliability measure for an arbitrary system of components, four items of information are necessary, namely:
a) The reliability behavior of the system components (Section 2.1).
b) The fault tolerant function of individual components or groups of components in the system (Sections 2.2 and 2.3).
c) The communication paths that components in the system may use, and which are the components that need to exchange information (Sections 2.4 to 2.6).
d) The operational requirements placed on the system and its subsystems (Sections 2.7 and 2.8).
Item (b) is the only one that is not necessary for some systems. The ARM program will use eight input categories to obtain this information for any arbitrary system. For some systems only three input categories are required to give the ARM program the minimum information on the system, that of items (a), (c), and (d). These categories are: the description of the component types (Section 2.1), the interconnection structure (Section 2.4), and the system requirements (Section 2.8). The following sections will discuss the purpose and necessity of these input categories. The final section will provide an example of how a multiprocessor system can be specified using all the input categories.
2.1 Component Types

The first input category is a list describing the types of components in the PMS structure. Components of the same type are assumed to be identical in function and reliability. The concept of component types is natural and reduces the system specification burden. The other alternative would be to specify the characteristics of each particular component.
Each type declaration will specify the various rates of failure, recovery, and repair processes, and the coverage probability, of the components of that type. Rates will be specified by a probability distribution and the parameters of that distribution. A rate may be an exponential, Weibull, or general distribution (i.e. a histogram must be provided). A rate can follow more than one distribution as a function of the system state. The function that determines the distribution of a rate will be in the form of a modified Boolean expression of the system state as defined in Section 2.8. A type declaration can contain information for nine classes. Each type declaration must contain at least the first two classes. The nine classes are defined below. Figure 2-1 illustrates how the first seven classes are used in reliability models.

TYPE - The first class is the name of the component type.

HARD - The second class is the hard failure rate. Hard faults are assumed to be caused by permanently damaged components that continuously produce errors when exercised.

TRANSIENTS - The third class consists of two rates. One is the transient failure rate α. Another is the transient duration rate δ, that is the rate at which the transient stops producing errors. It is assumed transients are not caused by or produce any permanent damage to the components.
INTERMITTENTS - The fourth class consists of three rates. One is the intermittent failure rate. Another is the intermittent benign rate, that is the rate at which an intermittent fault becomes benign or stops producing errors. Last is the intermittent active rate ε, that is the rate at which an intermittent that had stopped producing errors becomes active and starts producing errors once more. It is assumed intermittents are caused by permanently damaged components.

COVERAGE - The fifth class is the coverage probability C, which is the probability that the system can survive a fault in a component of this type and successfully recover. This probability has a great impact on the reliability of a system. Therefore it must be very accurately estimated using one or more of the following: analytic methods, simulation, or fault injection experiments. Coverage defaults to 1.
REPAIR - The sixth class is the repair rate, that is the rate at which components of this type are repaired and returned to service. Only if the repair rate is specified can the availability of the system be modeled.
RECOVERY
- The seventh
at which
the system
components
rate
at which
active
can detect,
the active
- The eighth
component
that
The purpose
can
is performing with
of shadows
a
different
memory
triad.
provide all the
redundant
group
memory A
hot
up and activating
rate p, that is the rate
and reconflgure
shadow
shadow
is to increase
at which a
recovery
isolate,
the exception
this is the rate
that a
the
from faults
(a hot or powered
in
up spare
component).
class is the
the system
components
powering
is
of this type by using
that is imitating
SHADOW
class
shadow or
a
activation
rate o, that is the
shadow.
functions
A shadow
of a redundant
that its output the recovery module can
powered
up
rate.
a cold or unpowered
provided spare
group
is not being
used. of
to shadow
by changing
is imitating,
spare.
of
An example
can be reloaded be
is a spare
a
the
or by
Key: State  Description
1  no faults
2  hard fault
3  transient fault
4  active intermittent
5  benign intermittent
6  correct fault detection, isolation, and reconfiguration
7  incorrect fault detection, isolation, and reconfiguration

Figure 2-1: Use of Component Type Information in Reliability Models

DEGRADATION - The ninth class is the degradation rate θ. That is the rate at which the system can gracefully degrade by eliminating one
redundant
group of components
set of components output
performing
can be selected
is necessary
when
a
replace
and
the
it,
requirements failed
the
using group
components
need
rcquirements),
a
to
of
fail
This
is
done
is
is a
of
group
not
fails
and
there
Degradation
are no spares above
because
the
the system
vote.
there
groups
probability
for
group
2.3)
fails,
A group
such that the correct
or majority
these
greater
and if a
in Section
same operations
component
number
has
are all of this type.
diagnostics
for the system.
component
(defined
which
to
the minimum
a group
failure
to
with
a
(fewer group
meet
its minimum
is no watchdog
timer
fails.
2.2 Redundant Groups
The second group
category
of components
performing selected will
input
in the
the same
operations
the maximum
the requirements, adaptive
voting
The adapted adaptive
or
A
that
the
correct
vote.
Each
of this
components
in
adapted with
to the
output
group
type,
group
can be
declaration
the group
the group,
name,
and if uslng
and the adaptive
the adjusted
time
any redundant
is a set of components
of groups
the
is the group
that specifies group
majority
of
the name of
rate corresponds
list
such
number
the type
group
a
system.
using diagnostics
contain
is
voter
involved
rate.
threshold.
in changing
The
the voting
threshold.
Currently
the redundancy
by three things. not. zero
One is
The other two or
not.
specified. extended
Table
so systems
component
the
with
in a group
minimum
used
whether
it
whether
its
2-I of
shows
using
hybrid
to replace requirements
each
it,
redundancy
are
for
rates
or are
technique must
is be
can be described.
the hybrid
the number the
group
specification
techniques
or adaptive and
and adaptive
technique
category
is specified
part of a redundant
recovery
how
redundancy
input
for a component
is
new redundancy
for this
there are no spares above
are
This method
The semantics
technique
system,
following. redundancy of
When
fails,
these groups
then
the
a
is
system
!
gracefully degrades by eliminating the group. If the number of these groups is not above the minimum requirements for the system and the group uses adaptive hybrid redundancy, then the system reconfigures the faulty component out of the voting process. If there are shadows, then they are assumed to be evenly distributed among the groups. If a group has to be able to transmit to another group, then each component of the transmitting group has to be able to transmit to all the components of the receiving group. The reason for the latter is so each component of the receiving group can do an independent majority vote on the information from the transmitting group.
                      REDUNDANT GROUP   RECOVERY RATE   ADAPTIVE RATE
STATIC REDUNDANCY     yes               zero            zero
DYNAMIC REDUNDANCY    no                nonzero         zero
HYBRID REDUNDANCY     yes               nonzero         zero
ADAPTIVE VOTING       yes               zero            nonzero
ADAPTIVE HYBRID       yes               nonzero         nonzero

Table 2-1: Redundancy Technique Specification
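The lookup in Table 2-1 can be sketched as a small function. This is only an illustration, not part of ARM: the function name and argument names are hypothetical, and the first column of the table is read here as "the component is part of a redundant group".

```python
def redundancy_technique(in_redundant_group, recovery_rate, adaptive_rate):
    """Classify a component's redundancy technique as in Table 2-1.

    All names are illustrative assumptions; only the five rows of the
    table are encoded, and any other combination is rejected.
    """
    key = (bool(in_redundant_group), recovery_rate != 0, adaptive_rate != 0)
    table = {
        (True, False, False): "STATIC REDUNDANCY",
        (False, True, False): "DYNAMIC REDUNDANCY",
        (True, True, False): "HYBRID REDUNDANCY",
        (True, False, True): "ADAPTIVE VOTING",
        (True, True, True): "ADAPTIVE HYBRID",
    }
    if key not in table:
        raise ValueError("combination not listed in Table 2-1")
    return table[key]
```

A combination such as "not in a group, both rates zero" raises an error here because the table does not list it; how ARM itself would treat such input is not stated in the text.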
2.3 System Watchdog Timers

The third input category is a list that specifies which (if any) component type or group of components acts as a watchdog timer for the system, and the rate at which the watchdog can restart the system. A watchdog is assumed to have a timer that must be reset before it runs out or the watchdog will restart the system. A watchdog decreases the probability that multiple faults in a redundant group of components will cause system failure.
Key: State  Description
1   3 working
2   2 working
3   system crashed
4   1 working
5   2 working, uses 1
6   system crashed
7   system failed
8   watchdog failed
9   2 working, no watchdog
10  system failed
11  2 working, uses 1, no watchdog
12  system failed

Figure 2-2: Reliability Graph of a Triad with a Watchdog
The semantics for this input category are the following. If there is no watchdog, and any group fails, then the system fails. If the watchdog fails, and any group fails, then the system fails. In other words, there has to be a watchdog for the system to survive a group failure.

Adding a watchdog timer modifies the system model by preventing some states from being failure states and by creating new states. For example, if a watchdog with its own failure rate and system restart rate is added to a triad the system model changes from the one in Figure 1-1 to the one in Figure 2-2. In this new model it is assumed that the system has a perfect coverage of 1 and the failure of the watchdog will not cause system failure. The watchdog prevents a system crash caused by the failure of two processing elements from causing system failure by restarting the system as a simplex without spares.
2.4 PMS Structure

The fourth input category is an interconnection list of the PMS structure. It is assumed that critical components which are required for the system to be operational must be able to communicate. The main purpose of the interconnection list is to analyze which component failures will prevent communication between critical components and therefore cause system failure.

The interconnection list can also be used to detect which substructures in the PMS graph are symmetrical in their component types and neighboring components. Symmetrical substructures are assumed to be identical in function and reliability. Therefore the reliability models of symmetrical substructures are identical and only have to be generated once. These models can then be duplicated and merged to obtain the reliability model of the system.

Each component will have an interconnection declaration that specifies its type and neighboring components. Since the PMS graph is non-directed it is possible to completely specify an arc by its occurrence in one interconnection declaration. However, it will be noted that each arc must occur on two interconnection declarations. The purpose of this redundancy is twofold. Firstly, inconsistencies can be detected thus making the system specification less likely to contain errors. Secondly, a reader of a system specification can more easily comprehend the system structure if the connection is made quite explicit with two-way links.

Although not within the scope of the current work, the system specification could be further eased by a graphics based user friendly interface. The interconnection list would then be provided by an input interface that would accept a graphic description of the PMS structure. Since this is not part of the current research and ARM could easily accept its input from an interface program, this interface could be generated independently by support personnel using tools such as the Future Net program for the IBM PC. Such a graphics interface already exists for the PERQ personal work stations at CMU.
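The consistency check that the two-way declarations make possible can be sketched as follows. The data layout (a dictionary from component name to a type and a set of neighbor names) and the function name are assumptions made for this illustration; the text does not specify ARM's internal representation.

```python
def check_interconnections(declarations):
    """Return a list of inconsistencies in the interconnection declarations.

    declarations: dict mapping a component name to a pair
    (component type, set of neighbor names). This layout is hypothetical.
    Every arc must appear in the declarations of both of its end points.
    """
    problems = []
    for comp, (_ctype, neighbors) in declarations.items():
        for n in sorted(neighbors):
            if n not in declarations:
                problems.append(f"{comp} lists undeclared neighbor {n}")
            elif comp not in declarations[n][1]:
                problems.append(
                    f"arc {comp}-{n} is missing from the declaration of {n}")
    return problems


# Example: the declaration of M.1 forgets its link to B.1.
example = {
    "P.1": ("P", {"B.1"}),
    "B.1": ("B", {"P.1", "M.1"}),
    "M.1": ("M", set()),
}
print(check_interconnections(example))
```

An empty list means every arc was declared on both sides, which is the property the redundant specification is meant to guarantee.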
2.5 Intracomponent Port Connections

The fifth input category is a list specifying the internal port connectivity of some components and/or component types. The purpose of the internal port connectivity is to analyze which component failures will prevent communication between critical components and therefore cause system failure. This information is needed to prevent the reliability modeling program from assuming incorrect communication paths through intermediate components to other components. Not taking this behavior into account would lead to an optimistic evaluation of the system reliability.

This input category is needed because it is impossible for a reliability modeling program to have this knowledge for all the component types that will be designed. Even if the PMS graph were modified to be a directed graph, the program would still need to know if information passed from A to B can be passed from B to C.

If not specified, the default is for every port of a component to be connected bidirectionally to all other ports of the component. If the internal port connectivity is specified, then for that component or component type all port connections and their direction must be made explicit. Each connection declaration contains the following parameters:

VERTEX - The specific components or component type whose port connections are being specified.

TRANSMITTER - A transmitter port of the VERTEX. It is specified by the component or component type connected to it.

RECEIVER - A port that receives from the previous transmitter port. It is specified by the component or component type connected to it.
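The question of whether information passed from A to B can be passed on from B to C can be sketched as a lookup with the stated default. The rule format below, a set of (source neighbor, destination neighbor) pairs per vertex, is an assumed reading of the VERTEX, TRANSMITTER, and RECEIVER parameters, and the function name is hypothetical.

```python
def can_relay(port_rules, vertex, source, destination):
    """Can information arriving at `vertex` from `source` leave to `destination`?

    port_rules: dict mapping a vertex to a set of (source, destination)
    pairs that are explicitly connected inside it (assumed layout).
    A vertex without an entry uses the default of the text: every port
    is connected bidirectionally to all other ports.
    """
    rules = port_rules.get(vertex)
    if rules is None:
        return True
    return (source, destination) in rules


# B relays from A to C, but not from C back to A.
rules = {"B": {("A", "C")}}
print(can_relay(rules, "B", "A", "C"), can_relay(rules, "B", "C", "A"))
```

Ignoring such one-way restrictions would let a path search count routes that do not exist, which is the optimistic bias the section warns about.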
2.6 Intra Component-Type Communication

The majority of components of like type are passive and do not need to communicate. Examples of passive components are memories, buses, and input/output transducers. Active or self-talking components need to exchange information amongst each other. Examples of active components are processors, direct-memory-access device controllers, and other "smart" controllers. If not specified the default is for components to be passive and not communicate with their own type.

The sixth input category is a list specifying the component types for which communication between components of like type is necessary. The purpose of the intra component type communication list is to analyze which component failures will prevent communication between critical components that need to exchange information and therefore cause system failure. This information is needed to prevent the reliability modeling program from requiring communication paths between components of the same type that never exchange information. Not taking this behavior into account would lead to a pessimistic evaluation of the system reliability.
2.7 Component Clustering

The seventh input category is a list specifying which (if any) components form clusters, that is subsystems with their own separate requirements. If the cluster requirements are not met all the cluster components fail but the system may continue to operate depending on the system requirements. The purpose of clusters is to represent the dependencies that sometimes exist between components. Each cluster declaration will contain the name of the cluster, its components, and its requirements in the form of a modified Boolean expression as defined in Section 2.8.
2.8 System Requirements

The eighth input category is a succinct statement of the minimum set of critical component types and/or component groups which are required for the system to be operational. Together they constitute a minimum critical resource set (MCRS). The set is minimum in the sense that the system may only function if the components of an MCRS are functional (depending on the status of other components in the structure). In other words, the success of a MCRS is a necessary, though not sufficient, condition for system success.

The MCRS will be defined using a modified Boolean expression. The simple grammar of requirements is shown in the traditional Backus-Naur form in Figure 2-3.
::=
i OR
::= i I AND I ()
::= OF I OF
Figure 2-3: Grammar of Requirements
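A requirements expression such as "(1 OF PTriad OR 1 OF PSimplex) AND (1 OF MTriad OR 1 OF MSimplex)" can be checked against a count of working resources with a small recursive-descent evaluator. This is a sketch under stated assumptions: the nonterminal structure (OR of AND terms of "N OF name" atoms, with parentheses), the precedence of AND over OR, and every function name below are assumed for illustration and are not taken from Figure 2-3 or from ARM.

```python
import re

_TOKEN = re.compile(r"\s*(\(|\)|[A-Za-z_][\w.]*|\d+)")


def tokenize(text):
    """Split a requirements expression into parentheses, words and numbers."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def satisfied(text, working):
    """Evaluate the expression; `working` maps a name to its working count."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r}, found {tok!r}")
        pos += 1
        return tok

    def expr():                      # term { OR term }
        value = term()
        while peek() == "OR":
            take("OR")
            rhs = term()             # always parse the right-hand side
            value = value or rhs
        return value

    def term():                      # factor { AND factor }
        value = factor()
        while peek() == "AND":
            take("AND")
            rhs = factor()
            value = value and rhs
        return value

    def factor():                    # ( expr )  |  N OF name
        if peek() == "(":
            take("(")
            value = expr()
            take(")")
            return value
        count = int(take())
        take("OF")
        name = take()
        return working.get(name, 0) >= count

    result = expr()
    if peek() is not None:
        raise ValueError(f"trailing input: {peek()!r}")
    return result
```

For example, `satisfied("(1 OF PTriad OR 1 OF PSimplex) AND (1 OF MTriad OR 1 OF MSimplex)", {"PTriad": 1, "MSimplex": 1})` evaluates the expression for a state with one working processor triad and one working memory simplex.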
2.9 Example

In this example a multiprocessor system is described using ARM's tabular format in Table 2-2. Failure rates are assumed to be specified in failures per million hours, all other rates are assumed to be on a per hour basis. All rates default to zero and are assumed to follow a single exponential distribution unless otherwise indicated. For the exponential distribution only its constant rate is given. A Weibull distribution is specified with a 'W' followed by the scale and shape parameters. A general distribution is specified with a 'G' followed by the name of the file containing the necessary histogram. A multiple distribution rate is specified with a 'M' followed by the name of the file containing the necessary discriminating function and distribution specifications.
The first component type described in Table 2-2 is a processor P with the following characteristics:
hard failure rate: 200 failures per million hours
transient failure rate: α = 10000 failures per million hours
transient benign rate: δ = 3600 per hour
intermittent failure rate: 10000 failures per million hours
intermittent benign rate: 3600 per hour
intermittent active rate: ε = 360 per hour
coverage probability: C = 1
repair rate: Weibull distribution of scale=1 and shape=1.1
recovery rate: ρ = multiple rates defined in the file RECP
shadow rate: σ = general distribution defined in the file SHADP
degradation rate: θ = general distribution defined in the file DEGP

The PMS diagram of the multiprocessor is shown in Figure 2-4. The multiprocessor has 10 LRU (Line Replaceable Units) clusters, LRU.1 to LRU.10. LRU.i has a processor P.i, a memory M.i, and a watch dog timer
Component Types (Section 2.1): TYPE HARD TRANSIENT INTERMITTENT
P M WT B WB
200 210 50 10 I0
(10000, 3600) (10500, 3600) (2500, 3600) (500, 3600) (500, 3600)
COVERAGE
(20, 3600, 360) (21, 3600, 360) (5, 3600, 360) (I, 3600, 360) (I, 3600, 360)
I I I I I
REPAIR
RECOVERY
SHADOW
W W W W W
M M M M M
G G G G G
1 I 1 1 1
1.1 1.1 1.1 1.1 1.1
RECP RECM RECW RECB RECWB
DEGRADATION
SHADP O DEGP SHADM G DEGM SHADW SHADB SHADWB
Redundant Groups (Section 2.2): SIZE GROUPNAME REQUIREMENTS
TYPE
ADOPTS
ADAPTATION
3 I 2 I I I I I
P P M M WT WT B WB
PSimplex
G ADAPTP
MSimplex
G ADAPTM
WSimplex
G ADAPTW
System
PTriad PSimplex MTriad MSimplex WTriad WSimplex BTriad WBTriad Watchdog
PMS Structure COM PONEN T
2 I 2 I 2 I 2 2
Timers
OF OF OF OF OF OF OF OF
3 I 3 I 3 I 3 3
(Section
2.3): WTriad
(Section 2.4): TY PE
P.I-I0 M. I-I0 WT.I-IO B.I-5 WB.I-5
NEIGHBORLCOMPONENTS
P M WT B WB
Intracomponent VERTEX
B.I-5, WB. I-5 B.I-5, WB. I-5 B.I-5, WB. I-5 P.I-I0, M.I-I0, P.I-I0, M.I-I0,
Port Connections TRANSMITER
B B B WB WB
(Section 2.5): RECEI VER
P P M WT WT
Intra Component-Type
M WT P P M Communicators:
(Section 2.6):
Component Clusters CLUSTERNAME
(Section 2.7): COMPONENTS
REQUIREMENTS
LRU.I-IO
P.i, M.i,
I OF M.i
WT.i
Syste m Requirements (Section 2.8): (I OF PTriad OR I OF PSimplex) AND
Table
WT.I-5 WT.I-5
2-2: Multiprocessor
(I OF MTriad
P
OR I OF MSimplex)
System Description
Example
29
WT.i, and for any of its components to be available M.i must be working properly. Components of the same type are grouped into 2 out of 3 triad subsystems. The system uses adaptive hybrid redundancy so that if a component other than a bus fails and there is only one triad of that component type without spares, then it reconfigures the two remaining components into a simplex with a spare (a single component emulating a triad). The system must have a minimum of 1 processor triad or simplex, and 1 memory triad or simplex, to be operational.

Processor and memory triads transmit on a bus triad formed out of 5 buses, B.1 to B.5. A processor triad can transmit to any kind of triad including another processor triad. A memory triad can only transmit to processor triads. The watchdog triad transmits on another bus triad formed out of 5 buses, WB.1 to WB.5. The watchdog triad can only transmit to processor and memory triads.

Figure 2-4: PMS Diagram of Multiprocessor Described in Table 2-2
3. Automated Reliability Modeling Considerations

The ARM program will attempt to efficiently generate the system reliability model based on the interconnection structure and the operational requirements. The divide-and-conquer methodology was selected to increase the computational efficiency and reduce the program development complexity. The steps the ARM program is going to follow in generating reliability models are shown in Table 3-1.

1) Interface with user and obtain system description.
2) Detect symmetries in the PMS graph.
3) Segment the PMS graph.
4) Identify the PMS system success and failure states based on the operational requirements.
5) Generate the models for the PMS graph segments.
6) Merge the models for the PMS graph segments.
7) Reduce the state space of the resulting model.
8) Format and output the state transition matrix of the model.

Table 3-1: Automated Reliability Modeling Steps

Steps 2 and 3 of Table 3-1 have been implemented using algorithms derived from those presented in Kini's dissertation [Kini 81] because they are mature, well documented, and simple. The major research effort will be the identification, analysis, and solution of the fundamental problems in each of steps 1, and 4 through 7 of Table 3-1. The research will also include the development of efficient algorithms, and methods to theoretically and experimentally validate the algorithms. The feasibility of the algorithms developed depends on their efficiency due to the large number of states and transitions involved in any reasonable structure. The validity of these algorithms is particularly important for life critical applications where a probability of failure in the order 10-W is required.

The following sections will discuss the purpose and necessity of steps 2 through 7. Progress already made in identifying and analyzing the problems involved, and developing and implementing algorithms to solve them is also presented.
3.1 Detection of Symmetry in the PMS Graph

Substructures in the PMS graph G will be considered symmetric if they are isomorphic and the corresponding vertices of the two graphs have identical component type labels. Symmetrical substructures are assumed to be identical in function and reliability. Therefore the reliability models of symmetrical substructures will be identical. The purpose of detecting symmetrical substructures is to avoid needless duplication of effort by generating their reliability model only once. These models will then be duplicated and merged to obtain the reliability model of the system.

The symmetry detection algorithm is shown in Appendix A.1. It is based on the component type labels and the degree of the vertices in the graph. The degree of a vertex is the number of neighbor vertices it has. Two vertices are neighbors if they are interconnected.

The algorithm requires three steps to partition the vertex set of a labelled graph into equivalence classes whose vertices are symmetrical. In the first step the partition is based on the component type label of each vertex. For the second step the partition is based on the degree of each vertex. The third step attempts to partition based on the number of neighbors each vertex has in each equivalence class. The last step must be repeated until there are no more changes in the equivalence classes. The reason for this is that each partition changes the number of neighbors in each equivalence class, and therefore other partitions may become necessary. In the worst case this repetition will stop when each equivalence class has a single element.

Each class is related to the other classes in a connectivity sense because the vertices in each class are symmetrically connected to the vertices in other classes. These equivalence classes and their connectivity relationships may be viewed as defining another graph G'. The vertices of G' correspond uniquely to the equivalence classes in G. Unlike the basic non-directed graph without self-loops, which was taken to be the model for G, G' may have vertices which have self-loops. This would be the result of a case in which vertices in the same equivalence class are connected to each other in some symmetric fashion, thus making the equivalence class its own neighbor. Also, the number of links or connection density between two vertices of G' can be greater than one. This would be the result of a case in which multiple vertices in the same equivalence class are connected to one or more vertices in another equivalence class.
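The three partitioning steps can be sketched as an iterative refinement. This is not the Appendix A.1 algorithm itself, only a minimal illustration of the steps as described in this section; the function name and the dictionary-based graph layout are assumptions.

```python
from collections import defaultdict


def symmetry_classes(types, adj):
    """Partition the vertices of a labelled graph into equivalence classes.

    types: dict vertex -> component type label (assumed layout)
    adj:   dict vertex -> set of neighbor vertices (non-directed graph)
    Returns the classes as a sorted list of sorted vertex lists.
    """
    def relabel(signature):
        # Give each distinct signature a small integer class number.
        ids = {s: i for i, s in enumerate(sorted(set(signature.values())))}
        return {v: ids[signature[v]] for v in signature}

    # Steps 1 and 2: partition by component type label and by degree.
    label = relabel({v: (types[v], len(adj[v])) for v in adj})

    # Step 3, repeated: split classes by the number of neighbors that
    # each vertex has in every current class, until nothing changes.
    while True:
        signature = {}
        for v in adj:
            counts = defaultdict(int)
            for u in adj[v]:
                counts[label[u]] += 1
            signature[v] = (label[v], tuple(sorted(counts.items())))
        refined = relabel(signature)
        if len(set(refined.values())) == len(set(label.values())):
            break
        label = refined

    classes = defaultdict(list)
    for v, c in label.items():
        classes[c].append(v)
    return sorted(sorted(members) for members in classes.values())
```

Because each new signature contains the old class number, a round can only split classes, so an unchanged class count means the partition is stable. The stable classes are the vertices of the graph G' described above.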
3.2 Segmentation of the PMS Graph
The purpose of segmenting the PMS graph is to follow the divide-and-conquer methodology. The segmenting proceeds by searching for what are termed Pendant Tree Subgraphs (PTS). These are maximal trees, that is they are not part of another tree. In these tree subgraphs the simple path between any pair of vertices is the only path between those vertices in the overall graph, in other words there are no cycles. It is common to find PTS's in most PMS structures. In particular input/output subsystems typically assume this character.
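One simple way to locate pendant trees is to prune leaf vertices repeatedly; whatever cannot be pruned is the part of the graph that is not tree-connected. The sketch below uses that leaf-pruning technique on the plain graph G, which is a swapped-in illustration and not the Appendix A.2 algorithm. It assumes a connected non-directed graph without self-loops, and the function name and data layout are hypothetical.

```python
from collections import defaultdict


def segment(adj):
    """Split a connected graph into its non-tree part and its pendant trees.

    adj: dict vertex -> set of neighbor vertices (assumed layout).
    Returns (kernel, trees): the set of vertices left after pruning, and a
    dict mapping each root vertex to the vertices of the tree hanging
    from it (root included). If the whole graph is a tree, the kernel is
    empty and no roots are reported.
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    pruned = set()
    stack = [v for v, d in degree.items() if d <= 1]
    while stack:                       # repeatedly remove current leaves
        v = stack.pop()
        if v in pruned:
            continue
        pruned.add(v)
        for u in adj[v]:
            if u not in pruned:
                degree[u] -= 1
                if degree[u] == 1:     # u has just become a leaf
                    stack.append(u)

    kernel = set(adj) - pruned
    trees = defaultdict(set)
    seen = set()
    for start in pruned:               # group pruned vertices by their root
        if start in seen:
            continue
        component, roots, todo = set(), set(), [start]
        while todo:
            v = todo.pop()
            if v in component:
                continue
            component.add(v)
            for u in adj[v]:
                if u in pruned:
                    todo.append(u)
                else:
                    roots.add(u)       # unpruned neighbor: the tree's root
        seen |= component
        for root in roots:
            trees[root] |= component | {root}
    return kernel, dict(trees)
```

Branches that hang from the same root are merged into one tree, in keeping with the requirement that the trees be maximal.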
If the PMS interconnection graph G is not a PTS and all its PTS's, excluding their roots, are removed then the remaining vertices and arcs form a subgraph of G that is not tree-connected. This will be referred to as the Kernel. The root of each PTS has dual status as member of the PTS as well as the Kernel.
The PTS's along with the Kernel form a
33
natural
set of segments
computation
the PTS's
algorithm
in a given
G' which represent trees"
neighboring trees
G
on
is
are then
vertices
of
"grown"
of these
at their
of
point
a
set
on the number
upward
roots
this
tree
instance
"stopping
of further
growth,
of the tree
no longer
or
of the tree has a
tree, with is when
a connection
the previous
respectively.
assume
states
(step I).
the root
merging of
Steps
These
by adding
on
the germinal
2 and 3 continue
of trees G'
of
is possible.
have
At
been generated. by the root,
each
of G or a set of PTS's.
In
in the set will be symmetric.
conditions"
to
which
a
tree is not
that cycles
would
The first
condition
is when
The second
single
neighbor,
density
greater
merged
under
the fact
itself.
with
the system
model The
during
are
termed
the generation
because
some sequence and therefore
is
identification
a very different
other states, through
G
vertices
which
condition
The
formed
the root
is when
is not already
than one. another
be
the
in that
third condition
tree that meets
one of
of Success and Failure States
on whether
the reliability
essential
It discovers
conditions.
3.3Identification
Depending
3).
one PTS
due to
the tree has been
of
of G represented
be a tree.
is a neighbor
A.2.
(step 2), and merging
of vertices
are three
and it would
root
the reliability
those leaf
towards
subgraphs
all PTS's
Appendix
(step
of these trees in G' may represent the latter
in
vertices
leaves
of vertices
capable
of which
by collecting
leaf
until no more adding
There
basis
shown
PMS structure
classes
that overlap
Depending
the
task may be divided.
The segmentation
"germinal
of
form.
success of
of
must
transitions.
can not have
or not the states
states
success
and
the reliability
Success
the system of
operational
states must
or failure failure model have
be able to reach Failure
any transitions
states
states
because
a failure
to other
is
they
transitions
states
in
to
state
are trapping states.
34
The
identification
unnecessary states
of
generation
of
are not needed
them is by being
For example,
is
consider For
have
that
being
failure
developed
in which
state
to identify
the system
a system
is assumed
state
to include
to
where
components.
The Boolean into
prevent
some
some failure can arrive
at
2 out of 3 processors
to
state
can
arrive
algorithm parse
all three
The reason
at that state
has already
been
is shown
in Appendix
A.3.
tree searching
due
to
for some way
The system
the failure
of requirements
sum-of-products
is by
states
that are not operational
paths,
for
have failed.
the requirements.
components
where
be generated.
two processors
expression a
reason
and failure
This
communication
The
failure
system
can satisfy
have
been transformed
the
success
those
also
way the system
that requires
requirements
they do not
contain
the
may
state.
system
and implemented.
It traverses
only
does not have
only way
An algorithm
the
a system
this is thatthe in another
states.
failure
that
failed
states
failure
in another
be operational. processors
failure
form
is assumed
so
that
state
because of other to have
it does not
any parenthesis.
The parse tree of a sum-of-products expression has only three levels. The bottom level represents atomic requirements, such as "2 of processors". The intermediate level represents pure conjunctive requirements, that is, an AND expression of atomic requirements. The top level represents the system requirements, that is, an OR expression of pure conjunctive requirements.

For example, consider a system that can operate with either one processor, one disk, and one memory, or with one processor and two memories. For readability the symbol _(N,X) will represent the atomic requirement "N of X". The sum-of-products requirements expression of such a system is

    _(1,P) AND _(1,D) AND _(1,M) OR _(1,P) AND _(2,M)          (3.1)

The parse tree of such an expression is shown in Figure 3-1.

The algorithm is a Boolean function that takes a state as an argument and returns true if it is a success state. The algorithm works in three levels that correspond to the levels of the parse tree. At the first level it will return true if any conjunctive requirement is met. At the second level it will return true if all the atomic requirements in a conjunction are met. At the third level it determines whether the system state meets an atomic requirement.

                           OR
                      /          \
                  AND              AND
               /   |   \          /   \
        _(1,P)  _(1,D)  _(1,M)  _(1,P)  _(2,M)

Figure 3-1: Parse tree of requirement expression (3.1)
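As a minimal sketch of the three-level evaluation described above, the following Python fragment encodes a sum-of-products requirement as a list of conjunctions of (N, X) pairs and tests a state against it. The data layout, the function names, and the representation of a state as the number of working components per class are assumptions made for illustration; they are not taken from the ARM program.

```python
from typing import Dict, List, Tuple

Atomic = Tuple[int, str]            # (N, X) stands for the atomic requirement "N of X"
Conjunction = List[Atomic]          # AND of atomic requirements
Requirements = List[Conjunction]    # OR of conjunctions (sum-of-products form)


def atomic_met(requirement: Atomic, working: Dict[str, int]) -> bool:
    """Third level: does the state have at least N working components of class X?"""
    needed, component_class = requirement
    return working.get(component_class, 0) >= needed


def conjunction_met(conjunction: Conjunction, working: Dict[str, int]) -> bool:
    """Second level: true only if every atomic requirement in the conjunction is met."""
    return all(atomic_met(atomic, working) for atomic in conjunction)


def is_success_state(requirements: Requirements, working: Dict[str, int]) -> bool:
    """First level: true if any conjunctive requirement is met."""
    return any(conjunction_met(conj, working) for conj in requirements)


# Expression (3.1): (1 of P AND 1 of D AND 1 of M) OR (1 of P AND 2 of M)
EXPRESSION_3_1: Requirements = [
    [(1, "P"), (1, "D"), (1, "M")],
    [(1, "P"), (2, "M")],
]
```

With this encoding, a state with one working processor, no disk, and two working memories is a success state through the second conjunction, while the same state with only one working memory is a failure state.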
of Models
The generation
of
divide-and-conquer transitions
for PMS Graph
the system
corresponding
For
to the
reliability
in the model model
will
generation
An algorithm already Appendix
model.
be
algorithm
for
what
been developed A.4.
Minimal
and
that
using
subtrees
and
of
then merged
algorithms
states
and
to produce
and transitions derived
subtrees
This are
the
from the
85].
minimal
PTS's
the
follow
of the equivalence
of states
in [Butler
implemented.
will also
purpose,
The generation
termed
(3 I)
..
segments
generated
presented
are
expression
model
different
implemented
_(I ,M)
Segments
reliability
methodology.
\
_(I ,D)
tree of requirement
class graph G', will be separately the system
I
_IJ(1 ,P)
3-I: Parse
3.4 Generation
/ I \
/
algorithm
those
of PTS's has is shown
that are below
in the
minimum
system
minimal
subtree
fai! because other
requirements of a PTS
fails
This algorithm repairable
minimal
generating
the minimal
The minimal
subtree
model
repairable
transient
and
developed
to
generate
algorithm
will
generate
in a minimal produce
the kernel
and merge
reliability
model.
I) Initialize 2) While until
models
minimal
the
system for
and merge
it
The it
second
with
those with
the set of new states
the New Set is not empty, a success state is found.
transition
a) If the destination
are susceptible
to
algorithms
will
models
The
subtree
generate
models
Table
to for
the system
state.
out of the New Set
subtree, if more the transitions
generated: state
is new then add it to the New D Set.
b) Add the transition to the model by obtaining the two factors whose product is the transition's rate: the number of working components in the class whose failure is described by the i transition, and their failure rate.
:
be
f_rst
a model
to produce
New Set to the start get a state
must
of a PTS that are not
3) For every equivalence class node in the minimal components of this class can fail then generate out of the success state. 4) For every
to
the minimal
algorithm PTS
must be extended
3-2.
model.
nodes
and nonin
in Table
more
from
follows
which
Two
subtree
by itself.
algorithm
algorithm
of a
isolated
in non-redundant
reliability
the
become
requirements
are shown
faults.
the root
in that minimal
the
subtrees
a model
the PTS model.
steps
generation
intermittent
subtree,
faults
The
When
it, which
the system
hard
subtrees.
subtree
and
to
exactly.
the nodes
within
can meet
is limited
them
all
none of the subtrees
nodes in the graph,
redundant
or meet
3-2: Minimal
Subtree
Modeling
Steps
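The steps of Table 3-2 can be sketched as a worklist loop. The Python fragment below is an illustration under stated assumptions, not the ARM implementation: a state is assumed to be a tuple of failed-component counts per equivalence class, the success test is supplied by the caller, failure states are kept as trapping states and are never expanded, and the names generate_model, counts, and rates are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Tuple

State = Tuple[int, ...]   # failed components per class, in sorted class order


def generate_model(counts: Dict[str, int],
                   rates: Dict[str, float],
                   is_success: Callable[[Dict[str, int]], bool]):
    """Generate states and transitions for a non-repairable segment."""
    classes = sorted(counts)
    start: State = tuple(0 for _ in classes)
    states = {start}
    transitions: Dict[Tuple[State, State], float] = {}
    new_set = deque([start])                      # step 1: New Set holds the start state
    while new_set:                                # step 2: take states until none remain
        state = new_set.popleft()
        working = {c: counts[c] - f for c, f in zip(classes, state)}
        if not is_success(working):
            continue                              # failure states are trapping states
        for i, cls in enumerate(classes):         # step 3: classes that can still fail
            if working[cls] == 0:
                continue
            dest = state[:i] + (state[i] + 1,) + state[i + 1:]
            if dest not in states:                # step 4a: remember new destinations
                states.add(dest)
                new_set.append(dest)
            # step 4b: rate = working components in the class * their failure rate
            transitions[(state, dest)] = working[cls] * rates[cls]
    return classes, states, transitions
```

For a segment with one switch, one processor, and three memories that needs the switch, the processor, and at least one memory, this sketch expands three success states and records nine transitions. It does not merge equivalent failure states, so its state count is an upper bound on what a generator that collapses failure states would produce.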
3.5 Merging of Models for PMS Graph Segments
For the purpose of following the divide-and-conquer methodology, the models of the segments of the equivalence class graph G' (the PTS models) must be merged to generate the system reliability model.

When two models with N and M states are merged, the resulting model has at most NM states. All the original states, and also their transitions, appear in the resulting model along with new ones. Table 3-3 shows the steps that must be followed to merge two models. Algorithms that follow these steps must be developed and implemented.

1) Retain only one of the two identical start states.
2) Retain all the other states, which amount to N + M - 2.
3) Produce at most NM - N - M + 1 new states by combining each original state in one model with all the original states in the other model, except for the start states.

Table 3-3: Two Model Merging Steps

3.6 Reduction of the State Space
The use of time-varying into
three
First,
problems
the models
alleviated
when
such as the ARM computational
of
Third,
the evaluation
become
impossible
program
will
technique
state
space
the
to the system
to
others
option
two of
reliability
a
iarge. can be
and evaluation
tools
Second,
computer
the state
the
prohibitive. systems
may
For
the
limitations.
problems
runs
this
may become
space
model.
extremely
discussed.
certain
applying
systems
human, but
model
memory
last
any
modeling
using
complex
becomes
already
the
model
their
the
for
aided
evaluating
due
of alleviating have
and
of the
purpose
to analyze
computer
program
cost
the
models
intractable
by the use of
Steps
Space
Markov
become
Merging
user space
of the ARM reduction
The number computing 72].
of states
can be reduced
the equivalent
The equivalent
transition
transition
IAB =
where
lij is the transition
subset
B.
State merging
equivalent
transition
any state
i in subset
merged
into a subset
The number relatively purpose
[Singh
space
biggest
the states examined
the states
with space
between
truncation
have been
state
i in subset
be reduced
A to state
First,
j in
the
(3.2) must be the same
for
of all the states
by deleting
states
that can be used
truncation
with
a
for this
[Singh 72] and sequential
must
follow
two conditions. state
in t_e remaining
truncated,
space
the states
truncation
has generated
these
may be divided
consisting
conditions
truncation
should
including
the next
different
from the previous
otherwise
one more
failures.
be selected. subset.
subset
should
states
At first
be included
should
be
or
states
components The state states
an arbitrary
level
of a of
can then be repeated
by
are not significantly can be stopped,
and the computations
to systems
to transient
after
be deleted
subset having
the computation
be extensible
are susceptible
should
of N identical
each
that
any new
this new absorbing
If the new values
should
diagram
hasgenerated
The computation
ones,
the
Second,
are not hard to achieve.
into N + I subsets,
level of coincident
space.
transition
of truncation
In systems
First,
should be less
state
whose
which
(3 2)
the probabilities
the new absorbing
components
A and B is
two conditions.
Either
This
[Singh
for any icA
states.
repeated.
subsets
by equation
in the truncated
probability
two states
the subsets
and
75].
be retalned.
certain
between
Two techniques
to see if the process
absorbing
should
can also
state space
probability
the smallest
follow
Second,
low probability.
truncation
State
rate from
them into subsets
must be equal.
of states
are called
rate
rate given A.
rates
_ I.. j_B ij
must
by merging
with
different
and intermittent
types
of
faults.
In sequential truncation the state probabilities are calculated every time a new state is generated, and states with probabilities less than a reference value are deleted. This method consumes more computation time than state space truncation but does not have to be repeated to ensure the accuracy of the approximation.

The state space truncation technique can be extended to produce a conservative estimate. The states with a certain level of coincident failures can be made failure states. This eliminates the outgoing transitions of the new failure states, and truncates those states that could only be reached through the new failure states. This will produce a conservative estimate because the truncated states will also be analyzed as though they were failure states.

Only the extended state space truncation technique is applicable to the ARM program. The reason for this is that the ARM program will not be evaluating the reliability model. State space truncation will be attempted during the generation of the models for the PMS graph segments. Algorithms that follow this technique must be developed and implemented.
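The conservative nature of the extended state space truncation can be illustrated with a small sketch. It assumes a segment of n identical, independent, non-repairable components with exponential failure times, so that a state is fully described by its level of coincident failures; the function names and the binomial shortcut are illustrative assumptions and are not part of the ARM program.

```python
import math
from typing import Optional


def prob_at_least(n: int, failures: int, q: float) -> float:
    """Probability that at least `failures` of n independent components have failed."""
    return sum(math.comb(n, k) * q**k * (1.0 - q)**(n - k)
               for k in range(failures, n + 1))


def unreliability(n: int, required: int, rate: float, hours: float,
                  truncation_level: Optional[int] = None) -> float:
    """Failure probability of a system needing `required` of n components.

    If truncation_level is given, every state with that many coincident
    failures is treated as a failure state, even if the system could
    still operate in it.
    """
    q = 1.0 - math.exp(-rate * hours)     # probability one component has failed
    fatal_level = n - required + 1        # failures that really bring the system down
    if truncation_level is not None:
        fatal_level = min(fatal_level, truncation_level)
    return prob_at_least(n, fatal_level, q)
```

For six components of which two are required, truncating at two coincident failures leaves three levels (0, 1, and the new failure level) instead of six, and the estimate it returns is never smaller than the exact value.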
4. Automated Reliability Modeling Examples
Currently redundant
the
program
is
and non-repairable
to transient categories failure
ARM
minimal
or intermittent have been
implemented.
requirements.
has been implemented.
Only
These
types,
Only
to
to
that
input
minimum
categories
output
format
illustrate
ARM
structure,
for the SURE
architecture
the current
input
are: the hard
the interconnection
the
are non C
and are not susceptible
the three
The Cm* multiprocessor
[Swan 77] will be used
systems
subtrees,
faults.
rate of the component
the system
limited
and
program
described
capabilities
in
of the
ARM program.
The
Cm*
multiprocessor
microcomputer.
Figure
architecture. module
connected
modules.
address
composed
the Kmap.
shared
the figure
models
architecture.
by
PMS
validate
the models
program,
and compared equations.
the models
where
Table
4-I.
access
obtained
and
in
from
will
which
figure).
the Kmaps
of
they
exponential
more
external
allow
the automatic
generated
graph be
of manually
to
processors
in
clusters
via
marked
L in
[Siewiorek
78]
Buses.
generation of
the
of Cm*
to the system
using
derived
rates
The
references
is demonstrated.
evaluated
failure
Cms.
is
cluster)
versions
models
will
results
or
to the Intercluster
the
the
Each cluster
or in other
several
the
realize
The components
illustrate
of
or more memory
or in a different
interconnection
from
two
the cluster
modeling
one
passes
controllers
with the The
and
LSI-11
of one processor
processors.
mapping
generated
of failure
the
to
the
collectively
cluster
The sensitivity and the
(Slocal)
on
version
is composed
(Kmap)
(B in the
sections
(Cm)
in the
elsewhere
are interfaces
reliability
requirements
memory
Buses
following
by
based
possible
structure
controller
are
memory
the
is one
interface
space
elsewhere
the Intercluster
module
in
The Kmaps
Cms to access
shows
memories
local
(i.e. to memory
The
an
of a cluster
controls
4-I
computer
via
The
virtual
Slocal
Each
architecture
To
the SURE
probability
used to evaluate
and are reproduced
in
[PMS diagram of the Cm* architecture]

Key:
    B        Intercluster Bus
    L        Intercluster Bus Interface
    Kmap     Mapping Controller
    Slocal   Local Switch
    P        Processor
    M        Memory

Figure 4-1: Cm* Architecture

    Module                        Failure Rate (per hour)
    Processor                     29.893E-6
    Memory                        46.278E-6
    Local Switch                  24.059E-6
    Mapping Controller            130.935E-6
    Intercluster Bus Interface    34.836E-6
    Intercluster Bus              0.000E-6

Table 4-1: Failure Rates of Cm* Modules

4.1 Cm* Computer Module

The Cm* computer module to be modeled is composed of one processor connected via an interface (Slocal) module to three memory modules. The PMS diagram of the Cm* computer module is shown in Figure 4-2.

            Slocal
          /  / | \
         P  M  M  M

Figure 4-2: Cm* Computer Module
The model ARM automatically generated when the Cm* computer module in Figure 4-2 requires one processor and one memory to perform its function is shown in Figure 4-3. The computer module will fail if three memories fail, or if any single component other than a memory fails. The system starts in state 1 with all its components working. Only states 1, 4, and 6 are not failure states. The probability of failure, during the first ten hours of operation, obtained from this model is 5.39375E-4.

Key:
    State   Failed components
    1       None
    2       1 S
    3       1 P
    4       1 M
    5       1 M & 1 P
    6       2 M
    7       2 M & 1 P
    8       3 M

Figure 4-3: Model of Figure 4-2 Cm* Requiring 1 P & 1 M

The equation for the probability of failure $P_f$ is

$$P_f = 1 - R_s R_p \left( R_m^3 + 3R_m^2(1 - R_m) + 3R_m(1 - R_m)^2 \right) \tag{4.1}$$

where $R_s$, $R_p$, and $R_m$ are the reliability functions of a local switch, a processor, and a memory. The $R_m^3$ term corresponds to the state in which all three memories function. The $3R_m^2(1 - R_m)$ term corresponds to the state in which one memory failed and two are functional. The $3R_m(1 - R_m)^2$ term corresponds to the state in which two memories failed and one is functional. The probability of failure, during the first ten hours of operation, obtained from this equation is 5.39375E-4. This is the same result obtained from the model in Figure 4-3.
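The agreement between the model and equation (4.1) can be checked with a short sketch. It assumes exponential reliability functions R(t) = exp(-rate * t) with the per-hour rates of Table 4-1, and it uses a reduced four-state version of the model (0, 1, or 2 failed memories plus a single absorbing failure state) rather than the eight states of Figure 4-3. The function names and the truncated Taylor series used for the transient solution are illustrative choices, not the method used by SURE or ARM.

```python
import math

# Failure rates per hour, from Table 4-1.
RATE_P, RATE_M, RATE_S = 29.893e-6, 46.278e-6, 24.059e-6
FAIL = 3  # index of the single absorbing failure state


def generator():
    """Transition rate matrix: states 0..2 count failed memories, state 3 is failure."""
    other = RATE_S + RATE_P               # any switch or processor failure is fatal
    q = [[0.0] * 4 for _ in range(4)]
    for failed in range(3):
        memory_rate = (3 - failed) * RATE_M
        if failed < 2:
            q[failed][failed + 1] = memory_rate
            q[failed][FAIL] = other
        else:                              # the third memory failure is also fatal
            q[failed][FAIL] = other + memory_rate
        q[failed][failed] = -(other + memory_rate)
    return q


def transient(q, hours, terms=30):
    """State probabilities at time `hours`, starting in state 0, via a Taylor series."""
    n = len(q)
    term = [1.0] + [0.0] * (n - 1)
    total = term[:]
    for k in range(1, terms):
        term = [sum(term[i] * q[i][j] for i in range(n)) * hours / k for j in range(n)]
        total = [a + b for a, b in zip(total, term)]
    return total


def closed_form(hours):
    """Equation (4.1) with exponential reliability functions."""
    rs, rp, rm = (math.exp(-rate * hours) for rate in (RATE_S, RATE_P, RATE_M))
    return 1.0 - rs * rp * (rm**3 + 3 * rm**2 * (1 - rm) + 3 * rm * (1 - rm)**2)
```

Evaluating both `transient(generator(), 10.0)[FAIL]` and `closed_form(10.0)` gives values that agree with each other and round to the 5.39375E-4 reported above.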
4.2 Effect of the System Requirements
The number of components N required by the system requirements affects both the number of states and the probability of failure. The number of states is a non-increasing function of N. The probability of failure is a non-decreasing function of N. For example, the model ARM automatically generated when the Cm* computer module in Figure 4-2 requires 1 processor and 2 memories to perform its function is shown in Figure 4-4. The computer module will fail if two memories fail, or if any single component other than a memory fails. The system starts in state 1 with all its components working. Only states 1 and 4 are not failure states. Comparing this model to the one shown in Figure 4-3, the number of states decreased to six and the probability of failure, during the first ten hours of operation, increased to 5.40016E-4.

The equation for the probability of failure $P_f$ is

$$P_f = 1 - R_s R_p \left( R_m^3 + 3R_m^2(1 - R_m) \right) \tag{4.2}$$

where $R_s$, $R_p$, and $R_m$ are the reliability functions of a local switch, a processor, and a memory. The $R_m^3$ term corresponds to the state in which all three memories function. The $3R_m^2(1 - R_m)$ term corresponds to the state in which one memory failed and two are functional. The probability of failure, during the first ten hours of operation, obtained from this equation is 5.40016E-4. This is the same result obtained from the model in Figure 4-4.

Key:
    State   Failed components
    1       None
    2       1 S
    3       1 P
    4       1 M
    5       1 M & 1 P
    6       2 M

Figure 4-4: Model of Figure 4-2 Cm* Requiring 1 P & 2 M
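The effect of the required number of memories can be tabulated with a small sketch. It assumes exponential reliability functions with the per-hour rates of Table 4-1 and generalizes equations (4.1) and (4.2) to "N of 3 memories"; the helper names are hypothetical. The state count below includes only success states, whereas Figures 4-3 and 4-4 also draw the failure states.

```python
import math

# Failure rates per hour, from Table 4-1.
RATE_P, RATE_M, RATE_S = 29.893e-6, 46.278e-6, 24.059e-6
MEMORIES = 3


def failure_probability(required_memories: int, hours: float = 10.0) -> float:
    """P_f for one switch, one processor, and `required_memories` of 3 memories."""
    rs, rp, rm = (math.exp(-rate * hours) for rate in (RATE_S, RATE_P, RATE_M))
    # Probability that at least `required_memories` memories are still working.
    enough_memories = sum(
        math.comb(MEMORIES, w) * rm**w * (1 - rm)**(MEMORIES - w)
        for w in range(required_memories, MEMORIES + 1)
    )
    return 1.0 - rs * rp * enough_memories


def success_state_count(required_memories: int) -> int:
    """Success states are those with 0 .. (3 - N) failed memories."""
    return MEMORIES - required_memories + 1


def sweep():
    """(N, success states, P_f) for N = 1, 2, 3 required memories."""
    return [(n, success_state_count(n), failure_probability(n))
            for n in range(1, MEMORIES + 1)]
```

Calling `sweep()` shows the success-state count falling from three to one while the failure probability rises, which matches the monotonic behavior stated in the text.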
4.3 Cm* Cluster
The Cm* cluster to be modeled is composed of three computer modules connected via a cluster controller module (Kmap). Each computer module is composed of one processor connected via an interface (Slocal) to two memory modules. The PMS diagram of the Cm* cluster is shown in Figure 4-5.

                      Kmap
             /          |          \
        Slocal       Slocal       Slocal
        / | \        / | \        / | \
       P  M  M      P  M  M      P  M  M

Figure 4-5: Cm* Cluster

The model ARM automatically generated when the Cm* cluster in Figure 4-5 requires 2 processors and 5 memories to perform its function is shown in Figure 4-6. All the failure states have been collapsed into state 2. The cluster will fail if two memories or two processors fail. The system starts in state 1 with all its components working. In state 5 one processor and one memory in the same computer module have failed, therefore if their local switch fails no other components will be affected. In state 6 one processor and one memory in a different computer module have failed, therefore if a local switch fails other components will also be affected. The probability of failure, during the first ten hours of operation, obtained from this model is 2.03253E-3.

Key:
    State   Failed components
    1       None
    2       1 K or 1 S or 2 P or 2 M
    3       1 P
    4       1 M
    5       1 M & 1 P in the same Cm
    6       1 M & 1 P in a different Cm

Figure 4-6: Model of Figure 4-5 Cm* Requiring 2 P & 5 M
The equation for the probability of failure $P_f$ is

$$P_f = 1 - R_k R_s^3 \left( R_p^3 + 3R_p^2(1 - R_p) \right) \left( R_m^6 + 6R_m^5(1 - R_m) \right) \tag{4.3}$$

where $R_k$, $R_s$, $R_p$, and $R_m$ are the reliability functions of a mapping controller, a local switch, a processor, and a memory. The $R_p^3$ term corresponds to the state in which all three processors function. The $3R_p^2(1 - R_p)$ term corresponds to the state in which one processor failed and two are functional. The $R_m^6$ term corresponds to the state in which all six memories function. The $6R_m^5(1 - R_m)$ term corresponds to the state in which one memory failed and five are functional. The probability of failure, during the first ten hours of operation, obtained from this equation is 2.03253E-3. This is the same result obtained from the model in Figure 4-6.

4.4 Effect of the PMS Interconnection

The PMS interconnection affects both the number of states and the probability of failure. For example, let us change the PMS interconnection of the Cm* cluster in Figure 4-5 so that two computer modules are composed of one processor and one memory, and the third module is composed of one processor and four memories. The resulting PMS diagram of the Cm* cluster is shown in Figure 4-7.
                        Kmap
              /           |            \
         Slocal1       Slocal1        Slocal2
          /  \          /  \        / /  |  \ \
        P1    M1      P1    M1    P2 M2 M2 M2 M2

Figure 4-7: Nonsymmetrical connection of Figure 4-5 Cm* Cluster

The model ARM automatically generated when the Cm* cluster in Figure 4-7 requires 2 processors and 5 memories to perform its function is shown in Figure 4-8. All the failure states have been collapsed into state 2. The cluster will fail if two memories or two processors fail. The system starts in state 1 with all its components working. Comparing this model to the one shown in Figure 4-6, the number of states increased to twelve and the probability of failure, during the first ten hours of operation, decreased to 1.55366E-3.
The equation for the probability of failure $P_f$ is

$$P_f = 1 - R_k R_s^3 \left( R_p^3 + 3R_p^2(1 - R_p) \right) \left( R_m^6 + 6R_m^5(1 - R_m) \right) - 2R_k R_s^2 (1 - R_s) R_p^2 R_m^5 \tag{4.4}$$

where $R_k$, $R_s$, $R_p$, and $R_m$ are the reliability functions of a mapping controller, a local switch, a processor, and a memory. The only difference between this equation and equation (4.3) is the addition of the $2R_k R_s^2 (1 - R_s) R_p^2 R_m^5$ term to the probability of success. This term corresponds to the state in which one local switch S1 failed and the other two local switches are functional. The probability of failure, during the first ten hours of operation, obtained from this equation is 1.55366E-3. This is the same result obtained from the model in Figure 4-8.
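The two cluster results can be reproduced with a short sketch. It assumes exponential reliability functions with the per-hour rates of Table 4-1. For the nonsymmetrical cluster it assumes that the only additional success term is 2 Rk Rs^2 (1 - Rs) Rp^2 Rm^5, that is, one S1 switch failed while the mapping controller, the other two switches, the two remaining processors, and the five remaining memories all work. That reading of the extra term is a reconstruction, and the function names are hypothetical.

```python
import math

# Failure rates per hour, from Table 4-1.
RATES = {"K": 130.935e-6, "S": 24.059e-6, "P": 29.893e-6, "M": 46.278e-6}


def reliability(kind: str, hours: float) -> float:
    """Exponential reliability function R(t) = exp(-rate * t)."""
    return math.exp(-RATES[kind] * hours)


def pf_symmetric(hours: float = 10.0) -> float:
    """Equation (4.3): three identical modules, 2 P and 5 M required."""
    rk, rs, rp, rm = (reliability(kind, hours) for kind in "KSPM")
    processors = rp**3 + 3 * rp**2 * (1 - rp)     # at most one processor failed
    memories = rm**6 + 6 * rm**5 * (1 - rm)       # at most one memory failed
    return 1.0 - rk * rs**3 * processors * memories


def pf_nonsymmetric(hours: float = 10.0) -> float:
    """Symmetric result minus the extra success term for one failed S1 switch."""
    rk, rs, rp, rm = (reliability(kind, hours) for kind in "KSPM")
    one_s1_failed = 2 * rk * rs**2 * (1 - rs) * rp**2 * rm**5
    return pf_symmetric(hours) - one_s1_failed
```

Under these assumptions `pf_symmetric()` rounds to 2.03253E-3 and `pf_nonsymmetric()` rounds to 1.55366E-3, the values quoted for Figures 4-6 and 4-8. The decrease comes entirely from the S1 switch no longer being a single point of failure.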
Key:
    State   Failed components
    1       None
    2       1 K, 1 S2, 2 P, or 2 M
    3       1 S1 & 1 P1 & 1 M1
    4       1 P1
    5       1 M1
    6       1 P2
    7       1 M2
    8       1 P1 & 1 M1 in the same Cm
    9       1 P1 & 1 M1 in a different Cm
    10      1 P1 & 1 M2
    11      1 P2 & 1 M1
    12      1 P2 & 1 M2

Figure 4-8: Model of Figure 4-7 Cm* Requiring 2 P & 5 M
5. Plans for Future Work
The architectures and fault types the research will address will increase in complexity in phases as described below. The reason for breaking the research work into phases is to keep the complexity of the problem being addressed at a manageable level. The results of each phase of the research will be theoretically and experimentally validated before proceeding to the next phase. The ARM program will be used as part of the experimental validation of each phase. Next the performance and range of applications of the ARM program must be evaluated. Based on the results of the validation and evaluation the approach will be reformulated as necessary.

The first phase of the research will address hard faults, and non-redundant and non-repairable PMS tree structures that require their root to be operational. This phase will only involve research into steps 1, 4, and 5 of Table 3-1. All subsequent phases will involve research into steps 1, and 4 through 7 of Table 3-1.

Phase two of the research will address hard faults and non-redundant general structures with no repair. The third phase of the research will address hard faults, and dynamically redundant architectures that can have imperfect coverage, and repair. Examples of such architectures are a multiprocessor at CMU, named Cm* [Swan 77], and the Electronic Switching Systems (ESS) used in the Bell System [Toy 78]. Cm* and ESS will be used in the experimental validation of this phase.

Phase four of the research will address hybrid redundant architectures but only for hard faults. The fifth and last phase of the research will address transient, intermittent, and hard faults. For the last two phases the architectures used for experimental validation will be the Fault-Tolerant Multiprocessor (FTMP) at NASA's Langley Research Center [Lala 83], an Intel 432 [Siewiorek 82] based multiprocessor, Cm*, and ESS.
6. Conclusion
The previous reliability approach
sections
and availability
consists
The Automated implement
these
system
The
also
program
the
graph,
the behavior and
the
input
method
steps Section
the
most
already
and developing
made
is being to
obtain
Markov the
computer
model
purpose
in identifying
and implementing
system
tch
(PMS)
the faultSection
2
for the
input categories architectures.
from
this
and necessity
and analyzing algorithms
to
a
envisioned
These
3-I.
developed
requirements.
program.
This
in Table
the PMS components,
of the current
the
summarized
currently
ARM
Markov
architectures.
Processor-Memory-Swi of
discussed
for automatic
is
operational
generate 3
step
categories
of
of describing
Progress
involved,
(ARM)
of
eight
descripLion.
were
consisting
the
other
which
first
description
steps.
Modeling
steps.
Strategies,
are capable
steps
approach of computer
The
interconnection
described
an
modeling
of eight
Reliability
description
tolerant
presented
System of these
the problems
to solve
them was
presented.
Section 4 presented examples of the current capabilities of the ARM program. The sensitivity of the models generated to the system requirements and the PMS interconnection graph was also demonstrated. Section 5 presented the plans for extending the current capabilities of the ARM program to include all of the steps in Table 3-1.
A. ARM Program Algorithms

A.1 Symmetry Detection Algorithm

Function definitions:

Split_Class(R, C, L) - If relation R is not satisfied then it partitions class C and creates a new class after the last class L. Returns the number of equivalence classes.

Size(C) - Returns the number of elements in the vertex equivalence class C.

Element(E, C) - Returns element E of the vertex equivalence class C.

Equivalent(E, C, R) - True if element E of class C is equivalent in terms of relation R to the preceding class elements.

Equal_Degree(E, C) - True if element E of class C has the same degree as the preceding elements of class C.

Equal_Neighbor_Classes(E, C) - True if element E of class C has the same number of neighbors in each class as the preceding elements of class C.

procedure Symmetry;

  function Equivalent(Current_Element, This_Class, Relation);
  begin
    if Relation = Degree then
      return Equal_Degree(Current_Element, This_Class)
    else
      return Equal_Neighbor_Classes(Current_Element, This_Class);
  end;

  function Split_Class(Relation, This_Class, Last_Class);
  begin
    Split := false;
    for I := 2 to Size(This_Class) do
    begin
      Current_Element := Element(I, This_Class);
      if not Equivalent(Current_Element, This_Class, Relation) then
      begin
        if not Split then
        begin
          Split := true;
          Last_Class := Last_Class + 1;
          { Create a new Last_Class with the degree and neighbor
            attributes of the Current_Element of This_Class. }
        end;
        { Move the Current_Element to the Last_Class. }
      end;
    end;
    return Last_Class;
  end; { Split_Class }

begin { Symmetry }
  { Step 1: Split based on equal
Last Class :: Last Type; for _ :: I to Last Class ( Add elements
of
}
do
on equa±
I :: I; while I