Graph-Theoretic Fault Tolerance for Spacecraft Bus Avionics (footnote 1)

Laurence E. LaForge (footnotes 2, 3)
Kirk F. Korver (footnote 3)

The Right Stuff of Tahoe, Incorporated
3341 Adler Court, Reno, NV 89503-1263 USA
775-322-5186

Abstract – Introducing new analytic results, we minimize the cost of point-to-point fault tolerant avionics architectures. Refining the graph model of Hayes [23], we formulate the worst-case feasibility of configuration as: What (f+1)-connected n-vertex graphs with fewest edges minimize the maximum radius or diameter (footnote 4) of subgraphs (i.e., quorums) induced by deleting up to f of the n vertices? We solve this problem by proving: i) K-cubes (cubes based on cliques) can tolerate a greater proportion of faults than can traditional C-cubes (cubes based on cycles); ii) quorums formed from K-cubes have a diameter that is asymptotically equal to the Moore bound, while under no conditions of scaling can the Moore bound be attained by C-cubes whose radix exceeds 4. Thus, for fault tolerance logarithmic in n, K-cubes are optimal, whereas C-cubes are suboptimal. Our exposition also corrects and generalizes a mistaken claim by Armstrong and Gray [19] concerning binary cubes.

TABLE OF CONTENTS
1. INTRODUCTION: SPACECRAFT BUS AVIONICS
2. GRAPH-THEORETIC FAULT TOLERANCE
3. TAXONOMY OF GRAPH ARCHITECTURES
4. CONCLUSION: APPLICATION OF RESULTS
APPENDIX A. DETAILED RESULTS FOR K-CUBES
APPENDIX B. DETAILED RESULTS FOR C-CUBES
ACKNOWLEDGMENTS
REFERENCES
BIOGRAPHICAL SKETCHES

1. INTRODUCTION: SPACECRAFT BUS AVIONICS

An important challenge, and the focus of this paper, is to minimize the cost of fault tolerance in point-to-point architectures, such as those relying on an IEEE 1394 Firewire bus [12]. Refer to Figure 1. As originally designed at the Jet Propulsion Laboratory, for example, X2000 avionics accommodate increasing demand for onboard processing power by endowing each bus node with a computer (footnote 5).

As is the case with the 1394 Firewire bus, we consider configuration to be successful only if bus nodes that have not failed remain capable of communicating among themselves. In this case the target architecture is known as a quorum. We impose additional requirements on the quorum, such as graph diameter or graph radius (footnote 4). For illustration, 1394 Firewire mandates that a quorum must contain a tree whose diameter (i.e., maximum number of node-to-node hops) is at most 16. In the interest of performance, moreover, we seek to minimize the maximum number of node-to-node hops in the tree configured [17] (footnote 6).

We model bus nodes and interconnections as the vertices and edges of a graph. For the sake of exposition, we confine our fault model to node failures that partition the bus. Our main contribution is an arsenal of theorems that maximize fault tolerance while simultaneously minimizing pincount, wiring, and network latency. Practically speaking, we show how these theorems may be embodied in GRAFT, a computer program that synthesizes netlists for optimal designs. Refer to Figures 2 through 6. Over a range of benefits (fault tolerance, latency) and costs (wires or ports per node), GRAFT recommends alternative architectures, each with value improved over the original, handcrafted design.

Footnotes:
1. 0-7803-5846-5/00/$10.00 © 2000 IEEE. In Proceedings, 2000 IEEE Aerospace Conference. Big Sky, Montana, March 18-25, 2000.
2. Portions of this work performed when assistant professor of computer science and mathematics at Embry-Riddle Aeronautical University. Supported by a fellowship at the Jet Propulsion Laboratory, sponsored by NASA and by the American Society for Engineering Education.
3. Supported in part by the NASA Institute for Advanced Concepts, grant number 07600-026.
4. The diameter and radius of a graph are its maximum resp. minimum eccentricities. A vertex's eccentricity is the maximum distance to some other vertex. The (graph) distance between two vertices is the length of the shortest path connecting them. The length of a path is the number of edges it contains. Depending on the graph, the diameter ranges between the radius and twice the radius ([14], Thm 2.4).
5. Due in part to changes in mission requirements and budgets, the X2000 design departs somewhat from the description here.
6. Actually, the 1394 quorum configured must be a tree [17]. A tree is a maximally connected cycle-free graph [14].

[Figure 1 diagram: a "Rigid-flex" Embedded Network Backplane carrying slices labeled Flight Computer A, B, and C; SFC-1A, SFC-1B, SFC-1C; SLM-1A, SLM-1B, SLM-1C; NVM-1 through NVM-4 (A, B, and C); PDS-01, PDS-02, PDS-03; ISC-1A, ISC-1B; SFG-1A, SFG-1B; ACS I/F; Science IF; and logic I/f.]

Figure 1: Proposed packaging for X2000 computational avionics [9]. A typical node in this design comprises a 4" x 4" multichip module (MCM) slice containing a microcontroller and eight application specific integrated circuits (ASICs) [3].

[Figure 2 diagram: sixteen bus nodes, labeled 00 through 07 and 10 through 17 (SFC-2A, CDG01, SFG-1A, ISC-1A, SFC-1A, SFC-1C, STM-1A, INST1, SFC-2B, SFC-2C, SFG-1B, ISC-1B, SFC-1B, EPA-1A, STM-1B, INST2), each with 1394 controllers A and B and ports 1 through 3, shown alongside the (multi)graph corresponding to the netlist. 5 ≤ max quorum radius ≤ max quorum diameter ≤ 8.

Legend: CDG = Command Data Ground support equipment; EPA = OpComm Electronics & Processor Assembly; INST = Instrument Interface; ISC = IMU/Sun Sensor Controller; SFC = System Flight Computer; SFG = Stellar Frame Grabber; STM = Spacecraft Transponding Modem.]

Figure 2: Sixteen node version of 1394 bus originally proposed for X2000: 2-fault-tolerant, 36 wires/node (6 wires/port) [9].

[Figure 3 diagram: the same sixteen bus nodes, relabeled 00 through 33 (two radix-4 digits), each with 1394 controllers A and B and ports 1 through 3. Netlist recommended by GRAFT corresponds to a 2-dimensional 4-ary K-cube. 2 ≤ maximum quorum radius ≤ 3; 2 ≤ maximum quorum diameter ≤ 3.]

Figure 3: Same cost (36 wires/node, 6 ports/node) as original design, but improved fault tolerance (5) and latency (radius, diameter). As the channel routing suggests, our graph model should also be complemented by a layout model.

[Figure 4 diagram: the same sixteen bus nodes, labeled 00 through 33, each with 1394 controllers and ports 1 through 3. Netlist recommended by GRAFT corresponds to a 1-dimensional 4-ary K-cube-connected cycle, with four nodes in each of 4 cycles. Maximum quorum radius = 3; 3 ≤ maximum quorum diameter ≤ 4.]

Figure 4: Less cost (30 wires/node, 5 ports/node) than original design, but improved fault tolerance (4) and latency.

[Figure 5 diagram: the same sixteen bus nodes, labeled with four-bit strings 0000 through 1111, each with 1394 controllers and ports 1 through 3. Netlist recommended by GRAFT corresponds to a 4-dimensional binary hypercube. 4 ≤ maximum quorum radius ≤ 5; 4 ≤ maximum quorum diameter ≤ 5.]

Figure 5: Yet another tradeoff, with value improved over that of the original design: 24 wires (4 ports) per node, 3-fault-tolerant.

[Figure 6 diagram: the same sixteen bus nodes, labeled 00 through 07 and 10 through 17, each with a single 1394 controller and ports 1 through 3. Netlist recommended by GRAFT corresponds to a 1-dimensional binary K-cube-connected cycle, with eight nodes in each of two cycles. Maximum quorum radius = 5; 5 ≤ maximum quorum diameter ≤ 8.]

Figure 6: Same fault tolerance (2) and latency as original design, but at half the cost (18 wires/node, 3 ports/node).

Table 1: Notation.

| Symbol | Significance |
|---|---|
| ⌈x⌉; ⌊x⌋ | Ceiling (least integer no less than x); floor (greatest integer no greater than x) |
| ; | Graph distance between vertices u and v; length of path P |
| O(g(n)); Ω(g(n)) | Set of functions no greater resp. no less than c⋅g(n), for n > k, constants c, k |
| o(g(n)); ω(g(n)) | Set of functions h(n) such that lim n → ∞ h/g = 0 resp. lim n → ∞ g/h = 0 |
| Θ(g(n)) | Intersection of O(g(n)) and Ω(g(n)) |
| $C_n$ | n-vertex cycle |
| $C_j^d$ | d-dimensional j-ary C-cube |
| e | Size (number of edges) of a graph |
| f, $f_{frac}$ | Number, fraction f/n of faulty elements (deleted vertices) that can be tolerated |
| G | Graph, often one that represents the configuration architecture |
| $G^+_{n,f,k}$ | Set of minimum size (f+1)-connected graphs of order n whose quorums, induced by deletion of up to f vertices, have radii at most k |
| $G_{n,f}$, $G_{n,f,k}$ | Set $G^+_{n,f,k}$ that minimizes the maximum radius k |
| H; T | Quorum induced by deleting vertices from G; tree, often one that spans H |
| $K_n = K_n^1$; $K_j^d$ | n-vertex clique; d-dimensional j-ary K-cube |
| $K_j^d(n)$; $K_{m \cdot j^d}$ | d-dimensional j-ary K-cube-connected cycle on n resp. m⋅$j^d$ vertices |
| n; $n_C(d,j)$ | Order (number of vertices) of a graph; of a $K_j^d$; of a $C_j^d$ |
| ρ(n, f) | Maximum radius among quorums induced by f or fewer faults |
| $P_n$ | n-vertex path |
| $S_n$ | n-vertex star |

2. GRAPH-THEORETIC FAULT TOLERANCE

A graph is a collection of vertices, any two of which may be related by an edge. In general, we refer to simple graphs, i.e., ones in which no vertex is joined to itself. The degree of a vertex, or node, is the number of edges to which it belongs. Two flavors of hardware cost models have emerged: those which emphasize VLSI area and wirelength (e.g., [30]), and those which stress connectivity (e.g., [22]). By the discussion in Section 1, the latter is more appropriate to point-to-point bus avionics. The cost of bus or network fault tolerance is perhaps best captured by Hayes [23], who proposes and analyzes graph architectures for one-dimensional arrays, simple cycles, and balanced trees (footnote 7).

The connectivity of a graph G is the minimum number of vertices whose removal from G results in a disconnected graph or a lone vertex. To tolerate f partitioning faults, therefore, we seek architectures whose corresponding graph is (f+1)-connected. Since our primary cost function is the number of point-to-point interconnections, we furthermore focus our attention on (f+1)-connected graphs with the minimum number of edges. A lower bound on this number is readily seen by noting that the connectivity of a graph is at most the minimum degree of a vertex in the graph (footnote 8). In consequence, the degree of every vertex in an (f+1)-connected graph is at least f+1. If we sum the degrees of all the vertices then we have counted every edge twice. The number of edges in any (f+1)-connected n-vertex graph is therefore at least n(f+1)/2.

For any positive integers n > f > 0, moreover, Hayes [23] achieves this bound with constructions from which we can configure a one-dimensional array (which, of course, is a tree). These constructions are chordal graphs of order n and size n(f+1)/2 from which we can remove i vertices, 0 ≤ i ≤ f, and still have n-i vertices connected together as a path $P_{n-i}$. Unfortunately, the diameter of $P_{n-i}$ equals n-i-1 and is maximum over all quorums. In general, that is, chordal constructions that achieve a $P_{n-i}$ depart from our objective. Moreover, the maximum quorum radius of chordal graphs exceeds that of secant graphs, a special case of K-cube-connected cycles characterized by LaForge [8].

Footnotes:
7. A graph T of order n is a tree if and only if T is connected and cycle-free; equivalently, T is connected and has minimum size n-1 ([14], Chapter 3). T is said to span H if T and H have the same vertices and every edge of T is an edge of H. A (sub)graph H is connected if every pair of vertices is connected by at least one path; alternatively, H contains a spanning tree. The size e and order n of a graph are the number of edges resp. number of vertices it contains.
8. [14], Thm 5.1: vertex connectivity ≤ edge connectivity ≤ min degree.
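The degree-counting bound n(f+1)/2 can be checked numerically on a small case. The sketch below is not Hayes' construction from [23]; it assumes, purely for illustration, a circulant (Harary-style) graph in which vertex v is joined to v±1, ..., v±(f+1)/2 modulo n, a family known to be (f+1)-connected when f+1 is even. All function names are hypothetical, and the brute-force check is only practical for small n and f.

```python
from itertools import combinations
from math import ceil

def min_edges(n, f):
    """Degree-sum lower bound on the edges of an (f+1)-connected n-vertex graph."""
    return ceil(n * (f + 1) / 2)

def circulant(n, f):
    """Adjacency sets of a circulant graph of degree f+1 (f+1 must be even).

    Assumption: a Harary-style stand-in, not the construction of [23].
    """
    if (f + 1) % 2 or f + 1 >= n:
        raise ValueError("sketch requires even f+1 < n")
    half = (f + 1) // 2
    return {v: {(v + s) % n for s in range(-half, half + 1) if s} for v in range(n)}

def is_connected(adj, alive):
    """Depth-first search restricted to the surviving vertices."""
    alive = set(alive)
    start = next(iter(alive))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()] & alive:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == alive

def tolerates(adj, f):
    """True if every deletion of up to f vertices leaves a connected quorum."""
    vertices = list(adj)
    return all(
        is_connected(adj, set(vertices) - set(dead))
        for i in range(f + 1)
        for dead in combinations(vertices, i)
    )

n, f = 10, 3
graph = circulant(n, f)
edges = sum(len(nbrs) for nbrs in graph.values()) // 2
print(edges, min_edges(n, f), tolerates(graph, f))
```

In this example the graph meets the edge bound exactly and survives every set of three deletions, while four deletions (the four neighbors of any one vertex) are enough to partition it.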

Table 2: Principal results for maximum quorum radius of n-vertex graph architectures.

| Fault tolerance f | Graph architectures | Maximum of quorum radii ρ(n, i), 0 ≤ i ≤ f: at least | At most | Maximum quorum radius divided by lower bound on ρ (Thm 6) | References |
|---|---|---|---|---|---|
| 0 | $G_{n,0}$: uniquely the set of n-vertex stars $S_n$ | 1 | 1 | Exactly best possible | [8], Table 7 |
| 1 | $G_{n,1}$: uniquely the set of n-vertex cycles $C_n$ | ⌊n/2⌋ | ⌊n/2⌋ | Exactly best possible | [8], Table 7 |
| 2 | $G^+_{n,2}$ includes 1-dimensional binary K-cube-connected cycles $K_2^1(n = 2m+1)$, m ≥ 2 | 1 if n = 5, else 1 + n/2 /2 | 1 + n/2/2 | Don't know | [8], Table 11, Thm 6, discussion on p. 40 |
| 2⋅[log_j n] - 1 = 2d - 1 | $G^+_{n,2d-1}$ includes d-dimensional j-ary C-cubes $C_j^d$, j ≥ 5 | j/2⋅log_j n | j/2⋅(log_j n) + j/2 - 1 | Definitely not best possible: ratio diverges to ∞ as n → ∞ | Table 4; [8], Thms 6, 34, 35, Cor 35.1 |
| [(j-1)⋅log_j n] - 1 = (j-1)⋅d - 1 | $G^+_{n,(j-1)d-1}$ includes d-dimensional j-ary K-cubes $K_j^d$ | log_j n | 1 + log_j n | See note (a) | Table 3; [8], Tables 11 and 14, Cor 33.1, Thm 6 |
| (j-1)⋅log_j (n/2) = (j-1)⋅d | $G^+_{n,(j-1)d}$ includes d-dimensional j-ary K-cube-connected edges $K_{2 \cdot j^d}$, j ≥ 3 | 1 + log_j (n/2) | 2 + log_j (n/2) | See note (a) | Table 3; [8], Tables 11 and 14, Cor 33.1, Thm 6 |
| 1 + (j-1)⋅log_j (n/m) = (j-1)⋅d + 1 | $G^+_{n,(j-1)d+1}$ includes d-dimensional j-ary K-cube-connected cycles $K_{m \cdot j^d}$, m ≥ 3 | 1 + m/2 if d = 1; otherwise m/2 + log_j (n/m) | 1 + m/2 if d = 1; otherwise 1 + m/2 + log_j (n/m) | See note (a) | Table 3; [8], Tables 11 and 14, Cor 33.1, Thm 6 |
| n-2, n-1 | $G_{n,n-2,1}$, $G_{n,n-1,1}$: uniquely the set of n-vertex cliques $K_n$ | 1 | 1 | Exactly best possible | [8], Table 7, Theorem 6 |

Note (a), for the K-cube family: As n → ∞, approaches best possible whenever d ∈ o(j) and m ∈ o(d), or d and m bounded. Within 1+q+qr+r of best possible whenever m/2 + 1 ≤ qd and ln d ≤ r ln j, for least upper bounds q, r.

The tree architectures considered by Hayes [23] are tolerant to at most one fault, and trees configured from these are balanced. By comparison to the problem we wish to consider, this is both underconstrained (we wish to tolerate more than one fault) and overconstrained (our trees need not be balanced). There are as well differences with a number of other works. Kwan and Toida [25] consider tolerance to one and two faults for balanced trees whose every level represents a potentially different type of processor. Dutt and Hayes [22] use vertex covering to design balanced j-ary trees that are optimal when f < j. Still other works treat configuration of balanced trees in either a probabilistic context, or with respect to VLSI layout area and maximum wirelength (e.g., [21]). Thus, we are challenged with finding a solution to the bivariate optimization problem stated in the abstract.

We should point out that our analysis makes three simplifying assumptions: i) faults have been correctly diagnosed; ii) the outcome of this diagnosis is passed to an algorithm that effects the configuration; iii) only vertices (and not edges) may be deleted. Items (i) and (ii) are major issues, and are addressed (largely by reference to other works) in [8]. Since edge connectivity is no less than vertex connectivity, item (iii) does not materially affect our analysis; however, allowing the deletion of edges can change the sharpness of our results for radius and diameter.

Within the page limit of this submission, it is impractical to rigorously establish all of the results of Table 2 (footnote 9). Rather, we explain our results, and complement our explanations with further details (Appendices A and B). Appendix A also corrects and generalizes a mistaken claim by Armstrong and Gray [19] concerning disjoint paths in binary cubes.

3. TAXONOMY OF GRAPH ARCHITECTURES

Suppose that an n-vertex graph G is (f+1)-connected and, for 0 ≤ i ≤ f, denote by H an arbitrary quorum induced by deleting i vertices of G. For our purposes it will often be more convenient to formulate the problem in terms of graph radius than in terms of diameter (footnote 6). This is largely a consequence of Theorem 1 of [8]: any tree has at most two central vertices (footnote 10), and they (it) always lie(s) at the intersection of maximum length path(s). An immediate corollary is that the diameter of a tree is either twice its radius, or twice its radius minus one.

The relative convenience of radius over diameter is bolstered by a theorem of [15]: for every vertex u of a connected graph H, there exists a spanning tree T of H that is distance-preserving from u. Moreover, we can compute, on a Turing machine equivalent and in time O(n(n+e)), a spanning tree having minimum radius ([8], Theorem 36). Here n and e are the order and size of the quorum of interest. Taken together, these results free us from having to distinguish the radius of the induced quorum H from the radius of a tree spanning H.

Refer again to Table 2. Our candidates for configuration architectures are members of the set $G^+_{n,f,k}$ of minimum size (f+1)-connected graphs of order n whose quorums, induced by deletion of up to f vertices, have radii at most k. For given n and f, we naturally wish to assure that k is the exact minimum, in which case we write $G_{n,f}$, perhaps with an extra subscript k. We denote the corresponding radius by ρ(n, f). Although the general solution to this problem appears to be unknown (footnote 11), we can enumerate $G_{n,0,k=1}$, $G_{n,1,k=\lfloor n/2 \rfloor}$, and $G_{n,n-2,k=1}$; that is, ρ(n, 0) = 1, ρ(n, 1) = ⌊n/2⌋, and ρ(n, n-2) = 1. For other values of f, we provide upper and lower bounds on ρ(n, f), and give sets $G^+_{n,f,k}$ whose induced quorums have radii that are logarithmic in n.

Footnotes:
9. Full details available in [8], posted at two sites on the World Wide Web.
10. A vertex of G is central if its eccentricity equals the radius of G.
11. The closest body of work seems to be related to the function ϕ(n, d0, d, f), introduced by [33]. Here ϕ counts the minimum number of edges in an n-vertex graph with diameter at most d0, such that deletion of any f of the vertices induces a graph of diameter at most d. Even for this relatively well-studied problem, results are confined primarily to the cases d ≤ 4, f = 1 or d0 = 2 ([13], Chapter IV, Sections 2 and 3). Moreover, our formulation differs in that we fix the number of edges at (f+1)n/2, and then ask for the minimum diameter or radius achievable in a tree that spans the induced quorum.
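The O(n(n+e)) running time quoted in this section is what one gets from running a breadth-first search from every vertex of the quorum. The sketch below is a minimal illustration of that idea, not the algorithm of [8], Theorem 36: each search costs O(n+e), the BFS tree is distance-preserving from its root, and the root of smallest eccentricity yields a spanning tree whose radius equals the radius of the quorum. The function names and the adjacency-set representation are assumptions made for this example.

```python
from collections import deque

def bfs_tree(adj, root):
    """Return (parent map, eccentricity of root) for a BFS tree of a connected graph."""
    parent, dist = {root: None}, {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                parent[w] = u
                queue.append(w)
    if len(dist) != len(adj):
        raise ValueError("quorum is not connected")
    return parent, max(dist.values())

def min_radius_spanning_tree(adj):
    """Try every root: n searches of O(n + e) each."""
    best = None
    for root in adj:
        parent, ecc = bfs_tree(adj, root)
        if best is None or ecc < best[2]:
            best = (root, parent, ecc)
    return best

# Example quorum: the cycle on six vertices.
cycle6 = {v: {(v - 1) % 6, (v + 1) % 6} for v in range(6)}
root, parent, radius = min_radius_spanning_tree(cycle6)
tree_edges = [(child, par) for child, par in parent.items() if par is not None]
print(root, radius, len(tree_edges))
```

The returned parent map lists the n-1 edges of the spanning tree, which is the form a configuration algorithm would need in order to wire the quorum as a tree.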

By way of overview, we can break our analysis into four stages: i) bound the maximum radius ρ(n, i) of any quorum, as a function of the number i of vertices deleted; ii) find the maximum among these maxima ρ(n, i), for 0 ≤ i ≤ f; iii) convert to bounds on the diameter; iv) compare the corresponding results for different structures to each other, as well as to a general lower bound on the radius. The latter,

$$\rho(n, f) \ge \log_f \frac{n(f-1) + 3}{f + 2}, \qquad 1 < f < n-2 \tag{1}$$

is obtained by maximizing an inequality that takes into account i faults. As derived by LaForge ([8], Thm 6), that is, ρ(n, i) is at least

$$\log_f \frac{(n-i-1)(f-1) + f + 1 + [n(f+1) \bmod 2]}{f + 1 + [n(f+1) \bmod 2]} \tag{2}$$
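To get a feel for inequalities (1) and (2), they can be evaluated directly. The sketch below assumes the reading of both bounds as a base-f logarithm of a single fraction, with the bracketed term in (2) taken as the parity bit n(f+1) mod 2; since a radius is an integer, the ceiling of either value is also a valid lower bound. The function names are hypothetical.

```python
from math import ceil, log

def radius_bound_faults(n, f, i):
    """Inequality (2): lower bound on rho(n, i), under the reading described above."""
    parity = (n * (f + 1)) % 2
    numerator = (n - i - 1) * (f - 1) + f + 1 + parity
    return log(numerator / (f + 1 + parity), f)

def radius_bound(n, f):
    """Inequality (1), stated for 1 < f < n-2."""
    if not 1 < f < n - 2:
        raise ValueError("inequality (1) requires 1 < f < n-2")
    return log((n * (f - 1) + 3) / (f + 2), f)

n, f = 64, 8
print(ceil(radius_bound(n, f)))
for i in range(f + 1):
    print(i, round(radius_bound_faults(n, f, i), 3))
```

Under this reading, setting i = 0 and the parity bit to 1 in (2) reproduces (1) term by term, and the bound in (2) shrinks as the number i of deleted vertices grows.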

For graphs with maximum degree (as opposed to connectivity) f+1, the independently obtained (1) is equivalent to the bound attributed to Moore [35], and we will continue this custom (footnote 12). In particular, any minimum edgecount (f+1)-tolerant graph that achieves equality in (1) is optimal.

Refer to the last row of Table 2. An n-vertex clique $K_n$ (that is, a graph of order n and maximal size n(n-1)/2) is tolerant to f = n-2 or f = n-1 faults, has minimum wirecount, and delivers a quorum radius or diameter that is at most one. Substituting the latter gives equality in (1), hence $K_n$ matches the Moore bound and is optimal. Furthermore, with respect to our cost criteria, $K_n$ is the unique optimum graph that is tolerant to f = n-2 or f = n-1 faults ([8], Table 7, Thm 6).

A weaker but still strong criterion asserts the optimality of a family of graphs if, as n approaches infinity, the maximum quorum radius is within a constant factor of the Moore bound. Referring to the next-to-last three rows of Table 2, we see that such ratioed asymptotic optimality is indeed achieved by members of the K-cube family. The mainstay of this family, a d-dimensional Gray-coded j-ary K-cube $K_j^d$, is recursively constructed as follows. $K_j^0$ is a lone vertex labeled with the null string. For $K_j^d$ we i) make j copies of $K_j^{d-1}$; ii) join with an edge vertices u and v (from different copies of $K_j^{d-1}$) if and only if u and v have identical labels; iii) prepend i to the label of each vertex of the ith copy of $K_j^{d-1}$. Note that $K_j^1$ is just the clique $K_j$ whose vertices have been labeled from 0 to j-1. Figure 7 illustrates binary and ternary K-cubes in 3 resp. 2 dimensions. Very few graphs are known to match the Moore bound [35], and our results for the K-cube family appear to be new. From a practical standpoint, the K-cube family delivers asymptotically minimum quorum radii whenever the fault tolerance is on the order of $n^{1/d} \log n$.

Where f = 1, inequality (1) does not apply. However, a similar derivation (with common ratio 1 in the geometric series) leads to a lower bound of (n-1)/2 on the maximum quorum radius. Moreover, this bound is tight for n-vertex cycles $C_n$. Furthermore, and as proved in Sec. 3.1 of [8], $C_n$ is the unique 1-tolerant graph architecture with minimum edgecount and optimum quorum radius. Analogously, the star $S_n$ comprising a single vertex with n-1 leaves is the unique 0-tolerant graph architecture with least cost. The upper rows of Table 2 synopsize these results.

Footnote:
12. In contrast to inequality (1), Moore's bound is concerned with the maximum order of a graph with bounded diameter and degree, in the absence of faults. Both results make use of arguments that minimize the height of a spanning tree, with application of the formula for summing a geometric series.
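The recursive K-cube construction unrolls to a simple adjacency rule: two vertices of $K_j^d$ are joined exactly when their d-digit radix-j labels differ in one digit. The sketch below builds a small K-cube that way and brute-forces the maximum quorum radius over every deletion of up to f = (j-1)d - 1 vertices. It is an illustrative check, not GRAFT; labels here are plain digit tuples rather than the Gray-coded labels of Figure 7, the function names are hypothetical, and the exhaustive search is only feasible for very small cubes.

```python
from collections import deque
from itertools import combinations, product

def k_cube(j, d):
    """Adjacency sets of K_j^d: adjacent iff labels differ in exactly one digit."""
    adj = {}
    for label in product(range(j), repeat=d):
        nbrs = set()
        for pos in range(d):
            for digit in range(j):
                if digit != label[pos]:
                    nbrs.add(label[:pos] + (digit,) + label[pos + 1:])
        adj[label] = nbrs
    return adj

def radius(adj, alive):
    """Radius of the quorum induced on `alive` (raises if it is disconnected)."""
    best = None
    for root in alive:
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w in alive and w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) != len(alive):
            raise ValueError("quorum is disconnected")
        ecc = max(dist.values())
        best = ecc if best is None else min(best, ecc)
    return best

def max_quorum_radius(adj, f):
    """Largest radius over all quorums left by deleting up to f vertices."""
    vertices = set(adj)
    return max(
        radius(adj, vertices - set(dead))
        for i in range(f + 1)
        for dead in combinations(adj, i)
    )

j, d = 3, 2
cube = k_cube(j, d)
n = len(cube)
f = (j - 1) * d - 1
edges = sum(len(nbrs) for nbrs in cube.values()) // 2
worst = max_quorum_radius(cube, f)
print(n, f, edges, worst)
```

For the ternary two-dimensional cube the edge count equals n(f+1)/2, and the worst-case radius can be compared against the Table 2 range of log_j n to 1 + log_j n.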

[Figure 7 diagram: $K_2^3$ with vertices labeled 000, 001, 011, 010, 110, 111, 101, 100, and $K_3^2$ with vertices labeled 00, 01, 02, 10, 11, 12, 20, 21, 22.]

Figure 7: Gray-code labeling of a three-dimensional $K_2$-cube and a two-dimensional $K_3$-cube.

[Figure 8 diagram: $C_4^2 = K_2^4$, whose vertices (labeled 00 through 33) correspond to ordered pairs of integers in the nonnegative quadrant, and $C_3^3 = K_3^3$, whose vertices (labeled 000 through 222) correspond to ordered triples of integers in the nonnegative octant.]

Figure 8: Labeling and connectivity for a $C_4$-cube and a $C_3$-cube ($C_3^3 = K_3^3$) in two resp. three dimensions.

[Figure 9 plot: "Performability: Why Prefer Clique-Based K-cubes Over Cycle-Based C-cubes". Horizontal axis: faults tolerated (0 to 11). Vertical axis: quorum radius, diameter (less is better; 0 to 16). The plot compares 11-fault-tolerant radix 5 cubes, each costing the optimum 12 edges/node: a C-cube of dimension 6 ($5^6$ = 15625 nodes), with the lower bound on radius and upper bound on diameter from Table 4, and a K-cube of dimension 3 ($5^3$ = 125 nodes), with the lower bound on radius and upper bound on diameter from Table 3. Also shown are the lower bounds on C-cube radius and K-cube radius from inequality (2). Annotations: "You are here" and "You are headed here".]

Figure 9: Performability measures the combination of fault tolerance and performance [24], [32]. The fractional fault tolerance of K-cubes (and their relatives, K-cube-connected edges and cycles) is superior to that of traditional C-cubes. Moreover, for given fault tolerance, and at minimum cost of wires and pins per node, the diameter of a K-cube is less than the radius of the corresponding C-cube. Furthermore, the radii of K-cubes approach lower bound (1), whereas C-cube radii diverge from (1).

Finally, we wonder whether there are, in fact, well-studied (f+1)-tolerant graph architectures which are mistakenly believed to deliver optimal, or near optimal, quorum radii. As indicated by Table 2's results for C-cubes, the somewhat surprising answer is "yes".

Often referred to in the literature as a "hypercube" or simply a "cube", a labeled d-dimensional j-ary C-cube $C_j^d$ is constructed as follows. For j = 2: $C_2^d$ is a d-dimensional binary K-cube $K_2^d$ (equivalently, a (d-1)-dimensional binary K-cube-connected edge $K_{2 \cdot 2^{d-1}}$); for j = 4: $C_4^d$ is a $K_2^{2d}$ (proof by induction); binary cubes are characterized in Section 3.3 of [8]. For j > 2: $C_j^0$ is a single unlabeled vertex. $C_j^1$ is a cycle on j vertices, numbered circularly from 0 to j-1; two vertices are joined by an edge if and only if the modulo j difference in their labels equals ±1. Note that a one-dimensional j-ary C-cube $C_j^1$ is the same as a j-vertex zero-dimensional j-ary K-cube-connected cycle $K_{j \cdot j^0}$. In general, to construct $C_j^d$ we i) make j copies of $C_j^{d-1}$; ii) prepend i to the label of each vertex of the ith copy of $C_j^{d-1}$; iii) connect with an edge vertices u and v (from different copies of $C_j^{d-1}$) if and only if the modulo j difference in the high order digits of the labels on u and v equals ±1, and the low order d-1 digits are identical. Alternatively, we can reserve d digits for the label on each vertex, thus giving rise to a construction that is independent of the order in which dimensions are populated. Figure 8 illustrates 4-ary and ternary C-cubes in 2 resp. 3 dimensions. Note that, since a cycle on three vertices is also a three-vertex clique, $C_3^d = K_3^d$ (equivalently, a (d-1)-dimensional ternary K-cube-connected cycle $K_{3 \cdot 3^{d-1}}$); $K_3^d$'s are characterized by Section 3.3 of [8]. It suffices therefore to consider dimensions d ≥ 2 and radices j ≥ 5, and such is the focus of our comparison.

The volume of literature concerning C-cubes exceeds perhaps that of any other structure studied in fault tolerance or networks. For this reason, it is especially surprising that K-cubes and their relatives are preferred to C-cubes. Quantitatively, this is due to: 1) the radius of a C-cube quorum exceeding the diameter of the comparable K-cube having identical fault tolerance ([8], Thm 34); 2) there being no relation such that, as $n_C = j^d$ → ∞, the ratio of the C-cube quorum radius to the Moore bound does not diverge; i.e., this ratio must approach infinity. With respect to both criteria, that is, C-cubes are sub-optimal. Moreover, when scaling is such that K-cubes match the Moore bound, C-cubes diverge from the optimal quorum radius ([8], Cor 35.1). In this ratioed asymptotic sense, K-cubes are optimal, whereas C-cubes are sub-optimal.

Figure 9 illustrates these observations for the lowest radix (5) where C-cubes and K-cubes differ. For j > 4 it is impossible to find a C-cube whose fault tolerance and number of nodes equals that of a K-cube, and so we have compared the respective structures having identical fault tolerance. The fractional fault tolerance of C-cubes is less than that of K-cubes, and so the performability comparison is conservative.
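The C-cube construction can likewise be unrolled into a label rule: two vertices of $C_j^d$ are adjacent when their labels agree in all but one digit and differ by ±1 modulo j in that digit. The sketch below, which covers only the fault-free case and uses hypothetical function names, builds a radix-5 C-cube and measures its radius by breadth-first search. Because the graph is vertex-transitive, the eccentricity of any one vertex equals the radius. The fault-free radius of the K-cube of the same radix and dimension is simply d, since each differing digit is corrected in a single hop.

```python
from collections import deque
from itertools import product

def c_cube(j, d):
    """Adjacency sets of C_j^d for radix j >= 3 (a d-dimensional torus)."""
    if j < 3:
        raise ValueError("sketch covers radix j >= 3")
    adj = {}
    for label in product(range(j), repeat=d):
        nbrs = set()
        for pos in range(d):
            for step in (-1, 1):
                digit = (label[pos] + step) % j
                nbrs.add(label[:pos] + (digit,) + label[pos + 1:])
        adj[label] = nbrs
    return adj

def eccentricity(adj, root):
    """Greatest breadth-first distance from root to any other vertex."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

j, d = 5, 2
torus = c_cube(j, d)
c_radius = eccentricity(torus, (0,) * d)
k_radius = d  # fault-free radius of K_j^d
print(len(torus), c_radius, k_radius)
```

Even without faults, the radix-5 C-cube needs twice as many hops as the K-cube of the same order, which is the gap that Figure 9 shows widening with dimension.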

[Figure 10 worksheet: GRAFT: GRaph Architecture Fault Tolerance Calculator, Version 2.2. Computes n-node f-fault tolerant graph architectures having minimum number of point-to-point connections, bounded radius ρ and diameter. Instructions: fill in and adjust values for n and f. Determine whether the radius ρ(n) of spanning trees falls within design tolerances. If so then wire avionics according to the adjacency of the recommended graph architecture. Notes and caveats below.

Input: n = number of nodes: 64; f = maximum number of partitioning faults: 8.
e = minimum number of point-to-point connections: 288. Average number of point-to-point connections per node (number of ports per node): 9.00.

| Feasible graph architecture(s) with minimum number of point-to-point connections | Graph radius ρ(n, f) of quorum and of tree spanning the quorum: at least | At most |
|---|---|---|
| Recommended: 3-dimensional 4-ary K-cube | 3 | 4 |
| Feasible, but not recommended: 1-dimensional 8-ary K-cube-connected cycle with 8 cycles, each containing 8 vertices | 5 | 5 |
]

Figure 10: GRAFT's main worksheet summarizes properties of feasible graph architectures.
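The lookup summarized in Figure 10 can be imitated in a few lines. The sketch below is not GRAFT: it is a hypothetical enumeration that, for given n and f, lists the K-cubes ($j^d$ = n, f = (j-1)d - 1) and K-cube-connected cycles (m⋅$j^d$ = n, m ≥ 3, f = (j-1)d + 1) that fit, with radius bounds taken from one reading of Table 2. That reading assumes m/2 is rounded down and that the d = 1 case has matching lower and upper bounds of 1 + m/2; K-cube-connected edges and C-cubes are left out for brevity.

```python
def k_family_candidates(n, f):
    """List (description, radius at least, radius at most) for K-cube family members.

    Bounds follow an assumed reading of Table 2; they are not GRAFT output.
    """
    found = []
    for j in range(2, n + 1):
        d, size = 1, j
        while size <= n:
            # K-cube: exactly j**d vertices, tolerating (j-1)d - 1 faults.
            if size == n and (j - 1) * d - 1 == f:
                found.append((f"{d}-dimensional {j}-ary K-cube", d, d + 1))
            # K-cube-connected cycle: m * j**d vertices, tolerating (j-1)d + 1 faults.
            m, rem = divmod(n, size)
            if rem == 0 and m >= 3 and (j - 1) * d + 1 == f:
                low = 1 + m // 2 if d == 1 else m // 2 + d
                high = 1 + m // 2 if d == 1 else 1 + m // 2 + d
                found.append(
                    (f"{d}-dimensional {j}-ary K-cube-connected cycle, m={m}", low, high)
                )
            d, size = d + 1, size * j
    return found

n, f = 64, 8
edges = n * (f + 1) // 2
ports_per_node = 2 * edges / n
for name, low, high in k_family_candidates(n, f):
    print(name, low, high)
print(edges, ports_per_node)
```

For n = 64 and f = 8 this reproduces the two architectures and the radius bounds shown in the worksheet, along with the 288 connections and 9 ports per node.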

4. CONCLUSION: APPLICATION OF RESULTS Figure 10 illustrates how we have embodied, in executable form, the theorems and corollaries summarized in this paper. GRAFT (GRaph Architecture Fault Tolerance calculator) is a Microsoft Excel workbook. The main worksheet summarizes the quorum radius by taking the maximum of our lower and upper bounds on ρ(n,i), as the number i of faults ranges between 0 and f. Underlying worksheets detail results for stars, cycles, cliques, K-cubes, K-cube-connected cycles, Kcube-connected edges, and C-cubes. As a function of the

number of vertices deleted, the underlying worksheets give bounds on quorum radius. The underlying worksheets also detail lower and upper bounds on quorum diameter, as well as the minimum diameter of a tree spanning the quorum. Returning to the example of X2000, note that the Europa mission must meet space shuttle requirements for worst-case tolerance to two faults. For f = 2 GRAFT is able to construct a K21(n) for all n > 2, and furthermore tells us that the diameter remains within our limit of 16 as long as n ≤ 30. GRAFT’s upper bound on a maximum diameter equals 16

for n = 30, 29, 28, and 27, but decreases to 14 at n = 26. For f = 3 the upper bound 16 on diameter, as computed by GRAFT, attains the limits imposed by the 1394 bus at n = 44, 40, 39, and 36. Within the range 3 < n ≤ 44, GRAFT is able to find only Km⋅jd(n)’s whose dimension d equals 1 or 2, and whose radix j equals 2 or 3.

From the preceding we conclude that an economical alternative to the original X2000 design (Figure 2) is a K_2^1(n) (Figure 6). In this case we halve the number of 1394 ports originally proposed for X2000. Doing this recovers 16 input/output pins per node, and at the same time maintains the fault tolerance at 2. The percentage of pin resources used by bus communications drops from 35% to 16%. Alternatively (Figure 3), we can keep six ports per node. In this case, and as computed by GRAFT, we can tolerate five faults in as many as 96 nodes, all the while staying within the 16-hop limit imposed by the 1394 bus. Between these points in the design space (same benefit at reduced cost, respectively same cost at increased benefit), we may have other options, each of whose benefit/cost ratio is superior to that of the original, handcrafted design. Figures 4 and 5 illustrate this for a bus with sixteen nodes.

Figure 10 also illustrates how the utility of Table 1 extends to high performance systems that are highly fault tolerant. We suggest that analytic results, such as those illustrated in this paper, properly characterize such performability, in novel terms that depart from standard Markov models [24], [32]. We leave as future work the integration of a comprehensive set of algorithms for diagnosis and configuration of autonomous spacecraft.
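The port counts quoted in this section are consistent with a simple counting bound, sketched below in Python. This is an illustration under assumptions rather than part of GRAFT: the function names are hypothetical, and each port is assumed to terminate exactly one point-to-point link. Tolerating f node faults requires an (f+1)-connected graph, so every node needs at least f+1 ports, and summing degrees gives a lower bound on the number of links.

```python
def max_faults_tolerated(ports_per_node):
    """Upper bound on f: (f+1)-connectivity forces every degree to be at least f+1."""
    return ports_per_node - 1

def min_links(n, f):
    """Lower bound on edges of an n-vertex graph whose minimum degree is f+1."""
    return (n * (f + 1) + 1) // 2  # ceiling of n*(f+1)/2, from the degree sum

print(max_faults_tolerated(6))  # 5
print(max_faults_tolerated(3))  # 2
print(min_links(16, 2))         # 24
```

Six ports per node therefore cannot tolerate more than five faults, and three ports per node cannot tolerate more than two, matching the two design points discussed above. The bound says nothing about diameter, which is where the theorems summarized in this paper come in.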

APPENDIX A. DETAILED RESULTS FOR K-CUBES

Table 3: Characteristics of quorums induced by deleting vertices of d-dimensional j-ary K-cubes K_j^d. K_j^d is constructible if and only if the maximum number of faults f equals [(j−1)⋅log_j n] − 1 and d = log_j n. Theorems are from [8].

Radius

| Radix j of K-cube | Number i of vertices deleted, 0 ≤ i ≤ f, f = [(j−1)⋅log_j n] − 1 | At least | At most |
|---|---|---|---|
| 2 | 0 | log_2 n (Theorem 7) | log_2 n (Theorem 7) |
| 2 | from 1 to [log_2 n] − 2 | [log_2 n] − 1 (Theorem 11) | [log_2 n] (Theorem 9) |
| 2 | [log_2 n] − 1 | [log_2 n] − 1 (Theorem 11) | [log_2 n] + 1 (Theorem 9) |
| ≥ 3 | from 0 to [log_j n] − 1 | log_j n (Theorems 8, 11) | log_j n (Theorems 8, 11) |
| ≥ 3 | from [log_j n] to [(j−1)⋅log_j n] − 1 | log_j n (Theorem 11) | [log_j n] + 1 (Theorem 8) |

Diameter

| Radix j of K-cube | Number i of vertices deleted, 0 ≤ i ≤ f, f = [(j−1)⋅log_j n] − 1 | At least | At most |
|---|---|---|---|
| 2 | from 0 to [log_2 n] − 2 | log_2 n (Theorem 10) | log_2 n (Theorems 9, 10) |
| 2 | [log_2 n] − 1 | [log_2 n] + 1 (Theorem 9) | [log_2 n] + 1 (Theorem 9) |
| ≥ 3 | from 0 to [log_j n] − 1 | log_j n (Theorem 10) | log_j n (Theorems 8, 10) |
| ≥ 3 | from [log_j n] to [(j−1)⋅log_j n] − 1 | [log_j n] + 1 (Theorem 8) | [log_j n] + 1 (Theorem 8) |

In addition to the results cited in Table 3, let us correct a 1981 claim of Armstrong and Gray: Between any two vertices in K_2^d, d > 1, there are d interior-disjoint paths of length at most d [19]. Rather, and again for d > 1:

(LaForge, Thm 9 [8]) Between any two vertices in K_2^d:
a) there are d interior-disjoint paths of length at most d+1;
b) at least d−1 of these paths have length at most d.

The following counterexample shows both that the claim of Armstrong and Gray is false and that Theorem 9 of [8] is best possible (see footnote 13).

Consider an arbitrary vertex u′ in copy K′ of K_2^{d−1} ⊂ K_2^d. Form a quorum H by deleting the d−1 neighbors of u′ that lie in K′, leaving undeleted the one neighbor u″ ∈ K″ of u′. The label of u″ is the same as that of u′, except for the high-order bit. Let v′ be the vertex in K′ whose low-order d−1 bits all differ from those of u′. In H any shortest path from u′ to v′ necessarily enters K″ via the edge (u′, u″), follows a path to some vertex z″ ∈ K″, re-enters K′ via the edge (z″, z′), and follows a path from z′ to v′. Writing dist for hop count, the length of this path is at least 1 + dist(u″, z″) + 1 + dist(z′, v′) ≥ 2 + (d−1) = d+1, since the low-order labels of u′ and v′ differ in d−1 bits. Since K_2^d contains 2^d vertices, this construction holds for each of the 2^d values that u′ can take on. That is, of the (2^d choose d−1) quorums formed by deletion of d−1 vertices from a Gray-coded d-dimensional binary cube, at least 2^d have diameter d+1.

13. We do not include a proof of Theorem 9. As remarked near the end of Section 3, at radix j = 2 C-cubes and K-cubes are equivalent. It is therefore arbitrary whether to include this counterexample here or in Appendix B. For j > 2 the counterexample is generalized on page 22 of [8].
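The counterexample above, and the small-case behavior summarized in Table 3, can be cross-checked by exhaustive search on tiny cubes. The Python below is a brute-force sketch, not the method of [8]: all function names are hypothetical, vertices are labeled by digit tuples with position 0 standing in for the high-order bit, and the search is practical only for a handful of vertices. The counterexample check assumes d ≥ 3, so that v′ is not itself one of the deleted neighbors of u′.

```python
from collections import deque
from itertools import combinations, product

def k_cube(j, d):
    """Adjacency of the d-dimensional j-ary K-cube: labels differing in exactly one digit are adjacent."""
    return {v: [v[:p] + (x,) + v[p + 1:]
                for p in range(d) for x in range(j) if x != v[p]]
            for v in product(range(j), repeat=d)}

def hops(adj, alive, source):
    """Breadth-first hop counts from source inside the quorum induced by alive."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in alive and w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def worst_case(j, d, i):
    """Largest quorum (radius, diameter) over every deletion of i vertices.

    The caller must keep i <= f = (j-1)*d - 1, so that no deletion disconnects the cube.
    """
    adj = k_cube(j, d)
    worst_radius = worst_diameter = 0
    for faulty in combinations(adj, i):
        alive = set(adj) - set(faulty)
        ecc = [max(hops(adj, alive, s).values()) for s in alive]
        worst_radius = max(worst_radius, min(ecc))
        worst_diameter = max(worst_diameter, max(ecc))
    return worst_radius, worst_diameter

def counterexample_distance(d):
    """Distance from u' to v' after deleting the d-1 neighbors of u' inside K' (d >= 3)."""
    adj = k_cube(2, d)
    u = (0,) * d                      # u' in copy K' (high-order digit 0)
    v = (0,) + (1,) * (d - 1)         # v': same copy, all low-order digits flipped
    faulty = {u[:p] + (1,) + u[p + 1:] for p in range(1, d)}
    return hops(adj, set(adj) - faulty, u)[v]

print(worst_case(2, 3, 2))         # (3, 4)
print(counterexample_distance(4))  # 5
```

For the binary 3-cube with d−1 = 2 deletions, the search returns a worst-case diameter of 4 = d+1, in agreement with the counterexample, and a worst-case radius of 3, which lies inside the corresponding radius bounds of Table 3.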

APPENDIX B. DETAILED RESULTS FOR C-CUBES

Table 4: Properties of quorums induced by deleting vertices from C-cubes C_j^d, j ≥ 5, d ≥ 2. Theorems and corollaries are from [8].

| Number i of vertices deleted, 0 ≤ i ≤ f, f = 2⋅[log_j n] − 1 | Radius: at least | Radius: at most | Diameter: at least | Diameter: at most |
|---|---|---|---|---|
| 0 | j/2⋅log_j n (Corollary 29.1) | j/2⋅log_j n (Corollary 29.1) | j/2⋅log_j n (Theorem 30) | j/2⋅log_j n (Theorem 30) |
| from 1 to [log_j n] − 1 | if j is odd then ½⋅(j−1)⋅log_j n, else ½⋅j⋅[log_j n] − 1 (Theorem 31) | j/2⋅log_j n (Theorem 30, Corollary 32.2) | j/2⋅log_j n (Theorem 30) | j/2⋅log_j n (Theorem 30) |
| from [log_j n] to 2⋅[log_j n] − 2 | if j is odd then ½⋅(j−1)⋅log_j n, else ½⋅j⋅[log_j n] − 1 (Theorem 31) | j/2⋅([log_j n] − 1) + j/2 (Corollary 32.2) | j/2⋅([log_j n] − 1) + j/2 (Corollary 32.2) | j/2⋅([log_j n] − 1) + j/2 (Corollary 32.2) |
| 2⋅[log_j n] − 1 | if j is odd then ½⋅(j−1)⋅log_j n, else ½⋅j⋅[log_j n] − 1 (Theorem 31) | j/2⋅(log_j n) + j/2 − 1 (Corollary 32.2) | j/2⋅(log_j n) + j/2 − 1 (Corollary 32.2) | j/2⋅(log_j n) + j/2 − 1 (Corollary 32.2) |

ACKNOWLEDGEMENTS

We are indebted to the staff at the Jet Propulsion Laboratory, in particular, to personnel working on the X2000 project and in the Center for Integrated Systems Microelectronics (CISM): Leon Alkalai, Robert Barry, Savio Chau, Daniel Dvorak, Daniel Erickson, Donald Hunter, John Lai, Al Nikora, Robert Rasmussen, Glenn Reeves, Raphael Some, and Carl Steiner.

REFERENCES

Related NASA Documents

[1] R. C. Barry, "X2000 Fault Protection Approach, Plan". Viewgraph presentation. Pasadena, CA: Jet Propulsion Laboratory, February 26, 1998.
[2] S. Chau, "X2000 Core Avionics Peer Review: Fault Protection". Viewgraph presentation. Pasadena, CA: Jet Propulsion Laboratory, April 17, 1998. Online at http://knowledge.jpl.nasa.gov/adssdlib/.
[3] S. Chau and E. Holmberg, "X2000 Core Avionics Peer Review: CDH Slices Description". Viewgraph presentation. Pasadena, CA: Jet Propulsion Laboratory, April 17, 1998. Online at http://knowledge.jpl.nasa.gov/adssdlib/.
[4] W. Charlan, S. Chau, J. Donaldson, D. Geer, C. Guiar, H. Luong, N. Palmer, V. Randolph, R. Rasmussen, C. Steiner, and S. Woods, "X2000 Core Avionics Peer Review: Bus Tiger Team". Viewgraph presentation. Pasadena, CA: Jet Propulsion Laboratory, June 11, 1998.
[5] S. Chau, "Backup Upstream Connection Failure, X2000 Preliminary Design Review". Viewgraph presentation. Pasadena, CA: Jet Propulsion Laboratory, August 18–20, 1998.

[6] D. Geer, "X2000 Core Avionics Peer Review: Flight Computer Status". Viewgraph presentation. Pasadena, CA: Jet Propulsion Laboratory, June 11, 1998.
[7] D. J. Hunter, "X2000 Design, Implementation, and Cost Review: First Delivery Project: Electronic Packaging". Viewgraph presentation. Pasadena, CA: Jet Propulsion Laboratory, March 11–13, 1997. Online at http://knowledge.jpl.nasa.gov/adssdlib/.
[8] L. E. LaForge, "Fault Tolerant Physical Interconnection of X2000 Computational Avionics". Pasadena, CA: Jet Propulsion Laboratory, November 27, 1998. Document JPL-D-16485, online at http://knowledge.jpl.nasa.gov/adssdlib/ and at http://www.ec.erau.edu/cce/centers/faculty/laforgel/Current-and-Recent-Research/NASA-ASEE/.
[9] C. Steiner, "X2000 Design, Implementation, and Cost Review: First Delivery Project: Avionics System Engineering". Viewgraph presentation. Pasadena, CA: Jet Propulsion Laboratory, March 11–13, 1997. Online at http://knowledge.jpl.nasa.gov/adssdlib/.
[10] C. Steiner, "X2000 Core Avionics Peer Review: Avionics Overview". Viewgraph presentation. Pasadena, CA: Jet Propulsion Laboratory, April 17, 1998. Online at http://knowledge.jpl.nasa.gov/adssdlib/.
[11] D. Woerner, A. Spear, and G. Parker, "X2000 First Delivery Implementation Plan". Pasadena, CA: Jet Propulsion Laboratory, August 7, 1998. Document JPL-D-15438, online at http://knowledge.jpl.nasa.gov/adssdlib/.

Related Books and Dissertations

[12] D. Anderson, Firewire System Architecture, Reading, MA: Addison Wesley, 1998.
[13] B. Bollobás, Extremal Graph Theory, London: Academic Press, 1978.
[14] G. Chartrand and L. Lesniak, Graphs and Digraphs, Belmont, CA: Wadsworth, Inc, 1986. 2nd ed.
[15] O. Ore, Theory of Graphs, Providence: American Mathematical Society Publications, 1962.
[16] P1394: Standard for a High Performance Serial Bus, New York: Institute of Electrical and Electronics Engineers, Inc. Draft 8.0v2, July, 1995.
[17] D. Paret and C. Fenger, The I2C Bus: From Theory to Practice, New York: John Wiley and Sons, 1997.
[18] G. B. Thomas, Calculus and Analytic Geometry, Reading, MA: Addison Wesley, 1969. 4th ed.

Related Papers and Articles

[19] J. R. Armstrong and F. G. Gray, "Fault Diagnosis in a Boolean n Cube of Microprocessors", IEEE Transactions on Computers, C-30 (8), 587–590, August, 1981.
[20] P. Banerjee, S-Y Kuo, and W. K. Fuchs, "Reconfigurable Cube-Connected Cycles Architectures", Proceedings, 16th International Symposium on Fault Tolerant Computing, 286–291, July, 1986.
[21] Y-Y Chen and S. J. Upadhyaya, "Reliability, Reconfiguration, and Spare Allocation Issues in Binary Tree Architectures Based on Multiple-level Redundancy", IEEE Transactions on Computers, 42 (6), 713–723, June, 1993.
[22] S. Dutt and J. P. Hayes, "On Designing and Reconfiguring k-Fault-Tolerant Tree Architectures", IEEE Transactions on Computers, 39 (4), 490–503, April, 1990.
[23] J. P. Hayes, "A Graph Model for Fault Tolerant Computing Systems", IEEE Transactions on Computers, C-25 (9), 875–884, September, 1976.
[24] B. R. Haverkort and I. G. Niemegeers, "Performability Modelling Tools and Techniques", Performance Evaluation, 25, 17–40, 1996.
[25] C.-L. Kwan and S. Toida, "Optimal Fault-Tolerant Realizations of Some Classes of Hierarchical Tree Systems", Proceedings, 11th Annual Symposium on Fault-Tolerant Computing, 176–178, June 24–26, 1981.
[26] L. E. LaForge, K. Huang, and V. K. Agarwal, "Almost Sure Diagnosis of Almost Every Good Element", IEEE Transactions on Computers, 43 (3), 295–305, March, 1994.
[27] L. E. LaForge, "Feasible Regions Quantify the Configuration Power of Arrays with Multiple Fault Types", Proceedings, European Dependable Computing Conference 1, K. Echtle, D. Hammer, and D. Powell, eds., 453–469, Berlin: Springer-Verlag, 1994.
[28] L. E. LaForge, "Configuration for Fault Tolerance", in Reconfigurable Architectures: High Performance by Configware (a.k.a. Proceedings, 1997 International Workshop on Reconfigurable Architectures), R. W. Hartenstein and V. K. Prasanna, eds., 117–120. Bruchsal, Germany: ITpress Verlag, March, 1997. Online portable document format (PDF) at http://ec.db.erau.edu/cce/centers/faculty/laforgel/Refereed/.
[29] L. E. LaForge, "Configuration of Locally Spared Arrays in the Presence of Multiple Fault Types", IEEE Transactions on Computers, 48 (4), 398–416, April, 1999.
[30] T. Leighton and C. E. Leiserson, "Wafer-Scale Integration of Systolic Arrays", IEEE Transactions on Computers, C-34 (5), 448–461, May, 1985.
[31] J. McDermid and N. Talbert, "The Cost of COTS", Computer, 46–52, June, 1998.
[32] H. Nabli and B. Sericola, "Performability Analysis: a New Algorithm", IEEE Transactions on Computers, 45 (4), 491–494, April, 1996.
[33] U. S. R. Murty and K. Vijayan, "On Accessibility in Graphs", Sankhyā Ser. A, 26, 299–302, 1964.
[34] F. P. Preparata and J. Vuillemin, "The Cube-Connected Cycles, a Versatile Network for Parallel Computation", Communications of the ACM, 30–39, May, 1981.
[35] M. Sampels, "Large Networks with Small Diameter", Proceedings, Graph-Theoretic Concepts in Computer Science, 23rd International Workshop, R. H. Möhring, ed., 288–302, Berlin: October, 1997.
[36] A. K. Somani and V. K. Agarwal, "Distributed Diagnosis Algorithms for Regular Interconnected Structures", IEEE Transactions on Computers, 41 (7), 899–906, July, 1992.

BIOGRAPHICAL SKETCHES

Laurence E. LaForge is President and founder of The Right Stuff of Tahoe. Previously an assistant professor of Computer Science and Mathematics at Embry-Riddle Aeronautical University, he took adjunct status in August of 1998 in order to devote full-time energies to The Right Stuff. Bidding on behalf of The Right Stuff, Dr. LaForge won a Phase 1 award from the NASA Institute for Advanced Concepts (NIAC) for his 1999 proposal, "Architectures and Algorithms for Self-Healing Autonomous Spacecraft". As a 1998 NASA/ASEE Summer Faculty Fellow, he worked on bus fault tolerance and fault tolerant middleware for the Jet Propulsion Laboratory's X2000 project. Recipient of a 1997 NASA/ASEE Summer Faculty Fellowship, he worked at JPL on software fault tolerance for Deep Space 1, the first probe propelled by an ion engine. In 1998 he was guest editor of the IEEE Transactions on Components, Packaging, and Manufacturing Technology, and in 1997 was program chair for the IEEE International Conference on Innovative Systems in Silicon.

Kirk F. Korver is Director of Product Development for and cofounder of The Right Stuff of Tahoe, Incorporated. He holds his MS in Electrical Engineering from the University of Nevada, Reno, where he designed the CMOS array processor Tau-3. From 1996 to 1999 he was a firmware engineer and team leader at International Game Technology. Mr. Korver is senior technical researcher for a Phase 1 award from the NASA Institute for Advanced Concepts (NIAC): "Architectures and Algorithms for Self-Healing Autonomous Spacecraft".