An Adaptive Routing Algorithm for WK-Recursive Topologies

Lorenzo Verdoscia, Roberto Vaccaro
Istituto per la Ricerca sui Sistemi Informatici Paralleli - CNR
Via P. Castellino, 111
80131 Napoli - Italy
email: [email protected]

Abstract

This paper presents an easy and straightforward routing algorithm for WK-Recursive topologies. The algorithm, based on adaptive routing, takes advantage of the geometric properties of such topologies. Once a source node S and a destination node D have been determined for a message communication, they characterize, at some level l, two virtual nodes $hl\_vn(S_D)$ and $hl\_vn(D_S)$ that respectively contain S but not D and D but not S. These virtual nodes determine $N_d - 2$ other virtual nodes $hl\_vn(I_{SD})$ of the same level (where $N_d$ is the node degree for a fixed topology) that contain neither S nor D. Consequently, it is possible to locate $N_d - 2$ triangles whose vertices are these virtual nodes and which all share the same path, called the self-routing path, directly connecting $hl\_vn(S_D)$ to $hl\_vn(D_S)$.

When the self-routing path is unavailable to transmit a message from S to D because of deadlock, fault, or congestion conditions, the routing strategy can follow what we call the triangle rule to deliver it. The proposed communication scheme has the advantage that 1) it is the same for all three conditions; 2) each node of a WK-Recursive network, to transmit messages, does not require any information about their presence or location. Furthermore, this routing algorithm is able to tolerate up to $\left[\frac{N_d(N_d-3)}{2}+1\right]\frac{N_d^{L}-1}{N_d-1}$ faulty links.

1 Introduction

As large-scale massively parallel systems have been deployed in the field, it has become increasingly clear that fault-tolerance, freedom from deadlock, and automatic load-balancing are critical factors for communication mechanisms in the design of machines, and particularly of multiprocessor networks. To meet these three requirements, it is essential that the hardware be designed so that low-level software can exploit it without high overheads. Unfortunately, on most interconnection network topologies these requirements call for complex routing control strategies that make the hardware design of routers difficult and considerably increase the communication overhead. Most existing adaptive routing algorithms guarantee freedom from deadlock by defining a set of virtual networks implemented with virtual channels. Cycles can be prevented statically [7], [13], [2] or dynamically [8]. Su and Shin [16] have proposed a solution for meshes and hypercubes that adopts only two virtual channels, but each node is assumed to know the state of its neighbors. All of these approaches require virtual channels to provide deadlock-free routing in torus networks. Unfortunately, virtual channels can be expensive because they complicate routing decisions and channel control, significantly increasing router node delay [3]. Consequently, to achieve adaptive routing at speeds comparable to those of dimension-order routers, more efficient adaptive routing algorithms must be developed. Some exceptions are the Turn Model [14], which prevents deadlock by prohibiting certain turns, and Compressionless Routing [12], which prevents deadlock by using the fine-grained flow control and backpressure of wormhole routing.
However, a new class of topologies, called WK-Recursive [4], whose topological properties are described in [9], [1] (a comparison with mesh and hypercube topologies can be found in [10]), seems well tailored to managing faults and deadlocks efficiently with a simple routing control strategy. Furthermore, this strategy redistributes load automatically when part of the network is overloaded. For this purpose, we have designed an integrated framework that exploits, in a simple way, the natural synergy between adaptive routing and fault-tolerance, deadlock prevention, and automatic load-balancing.

2 WK-Recursive networks

A WK-Recursive family of network topologies, denoted as WK($N_d$, $L$), can be described by two parameters [4], [9]: the node degree $N_d$ and the expansion level $L$. A WK network is recursively described as:

i) for $L = 1$ there are $N_d$ nodes configured as a fully connected undirected graph (also called an amplitude) that holds $N_d$ free links, and this constitutes what we call a first-level virtual node (vn);

ii) for $L > 1$ there are $N_d$ virtual nodes, each one being a WK($N_d$, $L - 1$), forming a virtual node of $L$-th level and amplitude $N_d$ that still holds $N_d$ free links.

For this family of topologies, starting from a WK($N_d$, 1) and recursively arriving at an expansion level $L$, the following relations hold:

$$n = N_d^{L}, \qquad P = \frac{N_d\,(N_d^{L} - 1)}{2}, \qquad D = 2^{L} - 1$$

where $n$ is the total number of real nodes, $P$ is the total number of links, and $D$ is the diameter of WK($N_d$, $L$). We point out that the diameter depends only on $L$ and not on $N_d$. Figure 1 shows a WK network with $N_d = 4$ and $L = 3$, for which we have $n = 64$, $P = 126$, and $D = 7$.
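As a quick check of these relations, the following minimal sketch evaluates them for a given degree and expansion level. The function name `wk_size` is an illustrative choice and does not come from the paper.

```python
def wk_size(nd: int, level: int) -> tuple[int, int, int]:
    """Return (n, P, D) for WK(nd, level) from the closed-form relations."""
    n = nd ** level                     # total number of real nodes
    p = nd * (nd ** level - 1) // 2     # total number of links (always an integer)
    d = 2 ** level - 1                  # diameter, independent of nd
    return n, p, d


if __name__ == "__main__":
    # WK(4, 3), the network of Figure 1
    print(wk_size(4, 3))
```

For the network of Figure 1 the call returns the triple quoted in the text, n = 64, P = 126, and D = 7.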

Three important properties of massively parallel multiprocessor systems are expandability, regularity, and equal degree. A WK network has all of these properties. Thus the network is scalable to any expansion level without changing the number of communication links connecting virtual nodes. While this can be considered a drawback in terms of bisection width, it is an advantage in terms of VLSI scalability [15]. In a WK network the indexing scheme is based on the rule used to build higher-level virtual nodes starting from a single real node having $N_d$ bidirectional links numbered $(0, 1, \ldots, N_d - 1)$.

Figure 1. A WK network with $N_d = 4$ and $L = 3$.

Indexing scheme for nodes

Once an origin and an orientation have been arbitrarily fixed, a real node N within a WK network is recursively numbered as follows:

i) for $L = 1$ by an index $n_1 \in \{0, 1, \ldots, N_d - 1\}$;

ii) for $L > 1$ by an $L$-tuple of indices $n_L \ldots n_i \ldots n_1$, where $n_i \in \{0, 1, \ldots, N_d - 1\}$.

We note that the index $n_i$ identifies the virtual node vn at level $i$ that includes that real node. Therefore, it is possible to address a vn of generic level $1 \le l \le L$ by removing the $l - 1$ least significant indices.

For example, in Figure 1 the real node A is numbered 022 ($a_3 = 0$, $a_2 = 2$, $a_1 = 2$); the virtual node that includes A is numbered 02 ($\omega_3 = 0$, $\omega_2 = 2$); and the larger virtual node that includes both A and the virtual node 02 is numbered 0 (its index in position 3 is 0).
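The addressing rule amounts to truncating the index tuple. A minimal sketch, assuming a node is written as a Python tuple with the most significant index first and that the caller states how many least significant indices to remove (the name `vn_address` and the parameter `dropped` are hypothetical):

```python
def vn_address(node: tuple[int, ...], dropped: int) -> tuple[int, ...]:
    """Address of the virtual node obtained by removing the `dropped`
    least significant indices from the address of a real node."""
    if not 0 <= dropped < len(node):
        raise ValueError("dropped must satisfy 0 <= dropped < len(node)")
    return node[: len(node) - dropped]


if __name__ == "__main__":
    a = (0, 2, 2)                # real node A = 022 of Figure 1
    print(vn_address(a, 1))      # virtual node 02
    print(vn_address(a, 2))      # virtual node 0
```

Applied to A = 022, removing one index gives the virtual node 02 and removing two gives the virtual node 0, as in the example above.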

Indexing scheme for links

The indexing of communication links is local to each node. Given $N_d$ nodes of amplitude $N_d$, a pair of links connecting two of them forms a channel; any channel has two associated indices, each equal to the index of the node it connects to, while the $N_d$ unconnected links remain free and have the same index as the nodes they belong to. Because these free links can be used to build virtual nodes of higher level $l$, we assign them a weight $w = l$. Obviously, any real node of a WK network has $N_d - 1$ links whose weight is $w = 1$.

For example, in Figure 1 the link connecting the virtual node 0 to the virtual node 2 is numbered 2, according to the number of the node it connects to; its weight is 3, and it belongs to the real node A. The other links of A have $w = 1$. The free link numbered 0, instead, has weight $w = 4$ because it can be used to build a virtual node of level 4 ($L + 1$).

3 Adaptive Fault-Tolerant, Deadlock-Free, and Self-Load-Balanceable Routing Algorithm

A very simple routing algorithm for WK networks was proposed by Della Vecchia and Sanges [5]. This algorithm, which we call self-routing, uses deterministic routing to deliver a message, so the only information it needs is the destination node address. However, despite its simplicity, the self-routing algorithm is unable to deliver a message in the case of a fault. To make message delivery more robust, Fernandes et al. proposed another solution for WK networks based on the Multi Path Graph technique [11]. This technique is able to tolerate up to $N_d - 2$ faults, but has the drawback of increasing network traffic because, for each message, it sends $N_d - 1$ copies.

Our proposal, as we will show in the following, consists of combining these two techniques while sending only one message over a possible minimal path. Furthermore, the proposed communication scheme has the advantage that each node of a WK-Recursive network, to transmit messages, requires neither virtual channels nor any information about the presence or location of deadlock, fault, or congestion conditions.

For a WK network the following assumptions are made about the communication model.

• Each link of the network is full duplex, i.e., two messages can simultaneously travel on the link in opposite directions.
• Communications are based on a wormhole transport mechanism [6]. A message is broken into flits. The header flit governs the route, and the remaining flits follow in pipeline fashion.
• A link is available for a communication if its associated channel (buffer) is available to accept the header of a message, and if that link can physically transmit it.
• Each node can concurrently communicate flits on all of its ports.
• All messages have the same length.
• A message reaching its destination will eventually be consumed.

Under these assumptions, because no constraint has been imposed on the network, the delivery of a message from a node n to a node m may be blocked by one of the following situations:

• deadlock generated by cycles;
• fault of a link connecting two nodes;
• overload (congestion) of some link on which that message is supposed to transit.

An important property of WK networks is that, if we consider two distinct nodes M and N, there exist $N_d - 1$ paths connecting M and N that constitute the shortest Edge-Disjoint Hamiltonian path (sEDH-path) set [11]. The self-routing path is one of them.

Given two real nodes A and B $\in$ WK($N_d$, $L$), we denote by $hl\_vn(A_B)$ the virtual node of the highest level that contains A but not B, and by $hl\_vn(B_A)$ the virtual node of the highest level that contains B but not A. In a WK network of degree $N_d$ there are also $N_d - 2$ virtual nodes of the same level $l$ as $hl\_vn(A_B)$ and $hl\_vn(B_A)$, which we denote by $hl\_vn(I_{AB})$, where $0 \le I \le N_d - 1$, that contain neither A nor B, such that $i_{l+1} \ne a_{l+1}, b_{l+1}$ and $i_{l+2} = a_{l+2} = b_{l+2}$. Here $a_{l+1}$, $b_{l+1}$, and $i_{l+1}$ are the indices of these virtual nodes in the WK network of amplitude $N_d$ at level $l + 1$. For example, in Figure 2 the two real nodes A = 300 and B = 331 determine the four virtual nodes of level $l = 1$ respectively numbered $hl\_vn(A_B) = 30$, $hl\_vn(B_A) = 33$, $hl\_vn(1_{AB}) = 31$, and $hl\_vn(2_{AB}) = 32$.

Figure 2. The 4 virtual nodes ($hl\_vn$s) of level 1 defined by the two real nodes A = 300 and B = 331.

Let us consider two nodes S, D $\in$ WK($N_d$, $L$), where S is the source node and D is the destination node for a message M. Let us consider the WK network of amplitude $N_d$ at level $l + 1$ such that $hl\_vn(S_D)$, $hl\_vn(D_S)$, and $hl\_vn(I_{SD})$ belong to WK($N_d$, $l + 1$). If we identify $hl\_vn(S_D)$ as the source node ($S^{l+1}$), $hl\_vn(D_S)$ as the destination node ($D^{l+1}$) for the message, and $hl\_vn(I_{SD})$ as the generic node ($I^{l+1}$), the minimal path between $S^{l+1}$ and $D^{l+1}$ is the self-routing path because it directly connects these two (virtual) nodes. We point out that in a WK($N_d$, $L$) with $L > 2$ the self-routing path between two real nodes M and N is not always the minimal one. Alternative minimal paths, when the self-routing one cannot be used, are those that connect $S^{l+1}$ and $D^{l+1}$ by means of $I^{l+1}$, and they belong to the sEDH-path set.
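The virtual nodes determined by a pair of real nodes follow directly from the position of the most significant index in which their addresses differ. The sketch below is one way to compute them, assuming addresses are tuples written most significant index first; the function name `hl_vn` mirrors the notation of the text, but its signature is an illustrative choice.

```python
def hl_vn(a: tuple[int, ...], b: tuple[int, ...], nd: int):
    """Return (hl_vn(A_B), hl_vn(B_A), [hl_vn(I_AB), ...]) for real nodes a, b."""
    if len(a) != len(b) or a == b:
        raise ValueError("a and b must be distinct nodes of the same network")
    # Position of the most significant index in which a and b differ.
    split = next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)
    prefix = a[:split]                    # indices shared by both nodes
    vn_ab = prefix + (a[split],)          # contains a but not b
    vn_ba = prefix + (b[split],)          # contains b but not a
    others = [
        prefix + (i,)                     # contain neither a nor b
        for i in range(nd)
        if i not in (a[split], b[split])
    ]
    return vn_ab, vn_ba, others


if __name__ == "__main__":
    print(hl_vn((3, 0, 0), (3, 3, 1), nd=4))
```

For A = 300 and B = 331 in a network with degree 4 this yields the virtual nodes 30 and 33 and the two intermediate virtual nodes 31 and 32, matching Figure 2.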

From a geometrical point of view, as shown in Figure 3.a, $S^{l+1}$, $D^{l+1}$, and each $I^{l+1}$ can be seen as the vertices of $N_d - 2$ triangles. All these triangles share the self-routing path as a common side, while the union of the other two sides, that is, the path connecting $S^{l+1}$ to $I^{l+1}$ and $I^{l+1}$ to $D^{l+1}$, is a path in the sEDH-path set. Furthermore, as shown in Figure 3.b, each side is a directed link that is numbered $i_{l+1}$ on the node $S^{l+1}$ and $d_{l+1}$ on the node $I^{l+1}$. As the self-routing path has only one side, the directed link on $S^{l+1}$ is just $d_{l+1}$. Consequently, we can say that a necessary condition for a message M to be routed along an sEDH-path is that $S^{l+1}$ must send M through a (directed) link $lk \ne s_{l+1}$ and $D^{l+1}$ must receive it from a (directed) link $lk = d_{l+1}$.

Then, the routing strategy from $S^{l+1}$ to $D^{l+1}$ can follow what we call the triangle rule algorithm, so named because it tries to deliver a message by always routing it along the sides of a triangle. For example, if the link $d_{l+1}$ is unable to transmit M, a new possible link $i_{l+1}$ will be chosen. If this link is also unavailable, the algorithm will be repeated until a link is found that can transmit M. If no link is available, we say that M cannot be delivered.

Figure 3. a) Triangles that share the self-routing path between the two nodes $S^{l+1}$ and $D^{l+1}$ in a WK with $N_d = 6$; b) directed link numbering for a communication from $S^{l+1}$ to $D^{l+1}$.

Figure 4. A possible livelock condition.

Then, the routing algorithm will be:

    counter = 0
    route M through the link d_{l+1}
    while (counter < N_d - 1) and (link-failure)
        counter = counter + 1
        select a new link i_{l+1} ≠ s_{l+1} and route for it
    if counter = N_d - 1
        M cannot be delivered
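The loop above can be read as a search over at most $N_d - 1$ candidate links. The following is a minimal sketch of a single routing decision at one level, not the authors' implementation: the order in which alternative links are tried (ascending index) is an assumption, since the text only says that "a new link" is selected, and the names `choose_link` and `is_available` are hypothetical.

```python
from typing import Callable, Optional


def choose_link(s: int, d: int, nd: int,
                is_available: Callable[[int], bool]) -> Optional[int]:
    """Pick the outgoing link for a message at (virtual) node index s
    destined for sibling index d, following the triangle rule.

    The self-routing link d is tried first; the remaining links i != s
    are then tried in ascending order (an assumed order).
    Returns None when all nd - 1 candidates fail.
    """
    if s == d:
        raise ValueError("source and destination indices must differ")
    candidates = [d] + [i for i in range(nd) if i not in (s, d)]
    for link in candidates:
        if is_available(link):
            return link
    return None  # M cannot be delivered at this level


if __name__ == "__main__":
    usable = {0, 3}                            # links 0 and 3 are free
    print(choose_link(2, 1, 4, usable.__contains__))
```

In the usage line the self-routing link 1 is unavailable, so the first usable alternative, link 0, is selected.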

Once M reaches an $I^{l+1}$ because of a routing failure on the self-routing path, the algorithm starts again, and the source now becomes the node $I^{l+1}$. However, if we do not modify the algorithm that forwards the message from $I^{l+1}$, livelock cycles can be generated when we apply the triangle rule. In fact, when the header of a message reaches an intermediate node, we can behave as though the message had been generated inside that node and could be routed to the destination node via the self-routing path. If the link numbered $d_{l+1}$ on this node is unavailable, the triangle rule is applied again, but now the message could be sent back to where it came from. For example, in Figure 4, if the connection from node 2 to node 4 is unavailable, one of the possible paths could be 2-3-4. Suppose that the message is routed to node 3. Once the header reaches this node, if the link connecting 3 to 4 is not available, node 3 could send the message back to node 2 instead of routing it along any of the other dotted links.

As WK networks are recursive, this algorithm can be repeated at any level of the network. The only thing we want to point out is that, when the link $d_{l+1}$ of weight $l + 1$ connecting, e.g., $S^{l+1}$ to $D^{l+1}$ is unavailable, so that a new link $i_{l+1}$ of the same weight is assigned to forward M, inside $S^{l+1}$ we can apply the triangle rule to the virtual nodes of lower level $l$; but every time we change virtual node at this level, the routing link becomes $i_{l+1}$ again. This link acts as a kind of attractor for M until we verify its unavailability when its weight becomes $l + 1$.

Figure 5. The only possible path connecting S and D in the presence of 10 unavailable links when the triangle rule is applied: $2 \to 1 \to 0 \to 5 \to 3 \to 4$.

Observing the WK network shown in Figure 5, we can say that for a generic WK($N_d$, 1) the minimum number of links that ensures network communication is $N_d - 1$. So, it is able to tolerate up to $\frac{N_d(N_d-3)}{2} + 1$ unavailable links, provided each node has at least one link on which the message can advance. Recursively applying this result to each virtual node up to level $L$, the entire network can tolerate up to

$$\left[\frac{N_d(N_d-3)}{2}+1\right]\frac{N_d^{L}-1}{N_d-1}$$

unavailable links.
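The bound follows from simple counting: a fully connected amplitude has $N_d(N_d-1)/2$ links, of which $N_d - 1$ must survive, and $N_d(N_d-1)/2 - (N_d-1) = N_d(N_d-3)/2 + 1$. The second factor, $(N_d^{L}-1)/(N_d-1) = 1 + N_d + \cdots + N_d^{L-1}$, counts the amplitudes over all levels. A small sketch that evaluates the expression (the function name `tolerable_links` is illustrative):

```python
def tolerable_links(nd: int, level: int) -> int:
    """Evaluate [nd(nd-3)/2 + 1] * (nd**level - 1) / (nd - 1)."""
    if nd < 2 or level < 1:
        raise ValueError("need nd >= 2 and level >= 1")
    per_amplitude = nd * (nd - 3) // 2 + 1          # nd(nd-3) is always even
    amplitudes = (nd ** level - 1) // (nd - 1)      # 1 + nd + ... + nd**(level-1)
    return per_amplitude * amplitudes


if __name__ == "__main__":
    print(tolerable_links(6, 1))   # the amplitude of Figure 5
    print(tolerable_links(4, 3))   # the network of Figure 1
```

For the amplitude of Figure 5 ($N_d = 6$, $L = 1$) the expression gives 10, the number of unavailable links quoted in the caption.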

Because no assumption has been made about the origin of a link’s unavailability, nothing changes in the routing strategy whether it was caused by a fault, an overload, or a deadlock. We point out that at the amplitude level no deadlock can happen because the network is a completely connected system.

At this point we have all the elements needed to define the routing algorithm. If we allow a message to wait on a channel for a time $T_M$, which can be tuned according to the length of M, then after this time we can route it on another channel without caring about the nature of the problem. The benefits that we obtain are:

• automatic load-balancing in case of overload on some link;
• automatic cutting of cycles in case of deadlock;
• automatic bypassing of a link in case of fault.
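The waiting time $T_M$ is what lets a single mechanism cover all three conditions: a link that does not become available within $T_M$ is simply abandoned, whatever the reason. A minimal sketch of this idea, under the assumption that each link can be described by the time it remains unavailable (infinite for a faulty link); the names `pick_link_with_timeout` and `wait_time` are hypothetical:

```python
import math
from typing import Callable, Optional, Sequence


def pick_link_with_timeout(candidates: Sequence[int],
                           wait_time: Callable[[int], float],
                           t_m: float) -> tuple[Optional[int], float]:
    """Try links in preference order, giving up on each one after t_m.

    candidates: link indices, self-routing link first.
    wait_time:  how long each link stays unavailable (math.inf if faulty).
    Returns (chosen link or None, total time spent waiting).
    """
    elapsed = 0.0
    for link in candidates:
        delay = wait_time(link)
        if delay <= t_m:
            return link, elapsed + delay   # link freed up in time
        elapsed += t_m                     # timed out: fault, congestion, or deadlock
    return None, elapsed


if __name__ == "__main__":
    delays = {1: math.inf, 0: 5.0, 3: 0.0}   # faulty, congested, free
    print(pick_link_with_timeout([1, 0, 3], delays.__getitem__, t_m=4.0))
```

In the usage example the faulty link 1 and the congested link 0 both time out, so the message leaves on link 3 after waiting two full periods; the code never needs to know which condition caused each timeout.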

Furthermore, no special prevention mechanism and no knowledge about the state of the network are needed to route a message, but only the destination node address and some extra information on the history of the message path. This result not only greatly simplifies the routing control strategy, because the algorithm handles the three critical conditions for a network in the same way, but also makes the hardware implementation of the algorithm on a router chip straightforward.

3.1 Example

Let us consider, for example, a WK(4,3) whose traffic condition is as shown in Figure 6. The network presents some faulty links, denoted by an x; the directed link 0 of weight 3, connecting the virtual node 2 to the virtual node 0, is congested; and there is a deadlock condition inside the virtual node 3. The node D = 231 wants to send a message to the node E = 102, while, inside the virtual node 3, the node A = 301 tries to send a message to the node B = 312, the node 310 tries to send a message to the node 321, and so on.

Inside the virtual node 2, the node D, to send the message to the node E, tries to use the link 1 of weight 3 because this is the self-routing path connecting $hl\_vn(D_E)$ to $hl\_vn(E_D)$. Because the real node of $hl\_vn(D_E)$ that physically connects this virtual node to $hl\_vn(E_D)$ belongs to the self-routing path, D must reach the node whose index is 211. D and 211 determine the new virtual nodes 23 and 21 through which the message must transit. The message, after waiting for a time $T_M$ for the availability of the link 1 of weight 2, tries to reach the virtual node 21 through the virtual node 22, applying the triangle rule to the virtual nodes of level 1. When the message reaches the node 211, it cannot advance because its link 1, whose weight is 3, is faulty. The triangle rule is now applied to the virtual nodes of level 2. Thus, a new link, for example the link 0, is assigned to leave the virtual node 2. Now the message tries to reach the real node 200, but the link 0 of weight 1 of the node 211 is faulty. The triangle rule is applied again, but this time to the nodes of the virtual node 21. Once the message has reached the node 200, because of the congestion of its link of weight 3, the triangle rule routes the message for E through the virtual node 3, assigning the link 3 as the new link. When the message reaches the node 322, the routing link toward the virtual node 1 becomes the link 1 again.

Inside the virtual node 3, the header of the message generated, e.g., by A reaches the node 310 but cannot advance on the link 2 (of weight 1) to be forwarded to its destination node B because of the deadlock condition. After waiting for a time $T_M$, the triangle rule is applied. The header is then forwarded to the node C = 313 to reach its destination node B. At this point all other messages can advance and the deadlock is automatically removed.

We point out that, because of the relation between $T_M$ and M, previous messages generated by deadlocked nodes have no influence. In fact, suppose that node 310 previously sent two messages $M_1$ and $M_2$ destined for node 321 and that the link from 310 to 312 was busy. If it was busy due to transmitting a message for node B = 312, after a time $T < T_M$ it will be available for $M_1$. If it was busy due to transmitting a message to node 321 because of a deadlock condition, we are in the above situation and we saw that deadlock can be automatically removed. The message $M_2$, since it cannot be transmitted before a time $T_M$ from the transmission of $M_1$, does not influence the communication conditions.
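The intermediate targets named in this example (211 to leave virtual node 2 toward virtual node 1, 200 to leave it toward virtual node 0) follow a simple pattern: the real node that holds the link toward a sibling virtual node is the one whose remaining indices all equal the sibling's index. The sketch below encodes that pattern; it is inferred from the cases in the example rather than stated as a rule in the text, and the name `gateway_node` is hypothetical.

```python
def gateway_node(vn: tuple[int, ...], target: int, level: int) -> tuple[int, ...]:
    """Real node of virtual node `vn` that holds the link toward the
    sibling virtual node whose last index is `target`.

    vn:     address of the virtual node, most significant index first.
    level:  expansion level L of the whole network (length of a real address).
    """
    if not 1 <= len(vn) <= level:
        raise ValueError("vn must have between 1 and level indices")
    if target == vn[-1]:
        raise ValueError("target must be a different sibling")
    # Pad the address with copies of the sibling's index.
    return vn + (target,) * (level - len(vn))


if __name__ == "__main__":
    print(gateway_node((2,), 1, 3))     # leave virtual node 2 toward 1
    print(gateway_node((2,), 0, 3))     # leave virtual node 2 toward 0
    print(gateway_node((2, 3), 1, 3))   # leave virtual node 23 toward 21
```

The three calls give 211, 200, and 231: the last one shows that D itself holds the weight-2 link from virtual node 23 toward virtual node 21, which is the link the message first waits on.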

4 Performance

We have evaluated the behavior of a WK network on a prototype network built at the IRSIP (formerly the Hybrid Computing Research Center) using a subdomain with 64 IMS T800 nodes of the CS1 Meiko parallel machine; the routing processes have been coded in the Occam programming language. We have considered a WK network with $N_d = 4$ at different expansion levels $L$. We have generated random traffic on the network to disturb the delivery of one thousand different messages between two nodes whose distance was the diameter $D$. For this purpose we have assumed that each link of any node presents, for different values, a homogeneous percentage of unavailability (faults, deadlocks, and congestion of some links) to transmit a message. We have also considered a wormhole transmission with a message length of $2^{L-1}$ flits. In our model, a message is dropped by the network if its header is not able to advance after a waiting time $T_M = 2^{L-1} t_f$, where $t_f$ is the time for a flit transmission between two adjacent real nodes. Furthermore, we have supposed that, if the header of a message reaches a node, the remaining part of the message will reach that node too; that is, no fault can arise along the path crossed by the header until the tail reaches that node.

Figure 6. A traffic condition in a WK(4, 3).

With this routing strategy, we have measured two parameters other than latency and throughput. The first, shown in Figure 7, is the percentage of delivered messages; the second, shown in Figure 8, is the average percentage increment of the path length with respect to that of the self-routing path.

Figure 7. Delivered messages when the distance between two real nodes S and D is the diameter $D$.

Figure 8. Increment of the length of the message path between two real nodes S and D with respect to the self-routing path length.

Observing Figure 7 we see that at 1% of link unavailability our routing algorithm is able to deliver all messages up to an expansion level $L = 4$ ($n = 256$), and 99% of messages for $L = 5$ ($n = 1024$). At 10% of link unavailability our routing algorithm is still able to deliver, up to an expansion level $L = 5$, a high percentage of messages (80%). Doubling the percentage of link unavailability, for $3 \le L \le 5$ we have a drastic performance degradation, confirming that message locality should be preferred in massively parallel systems. Anyway, this routing algorithm still offers good performance (80% of delivered messages) for $L = 3$ ($n = 64$) with 17% of link unavailability. So, for $L > 3$ most of the traffic should occur inside virtual nodes of third level.

Observing Figure 8 we can see that most of the traffic of a WK network is delivered by means of the self-routing path. In fact, because among the paths belonging to the sEDH-path set there exists at most one shorter than the self-routing path, the other ones have a percentage of path increment, with respect to the self-routing path, bigger than 50%. Given the simplicity of the self-routing path, this result supports our choice of preferring it to the shortest one.

5 Conclusion

In this work we have presented an adaptive routing algorithm for WK-Recursive topologies. The algorithm takes advantage of the geometrical properties of such topologies because the source node S and the destination node D characterize, at some level l, two virtual nodes $hl\_vn(S_D)$ and $hl\_vn(D_S)$ that respectively contain S but not D and D but not S, and $N_d - 2$ other virtual nodes $hl\_vn(I_{SD})$ of the same level as $hl\_vn(S_D)$ and $hl\_vn(D_S)$ that contain neither S nor D. Consequently, it is possible to locate $N_d - 2$ triangles whose vertices are these virtual nodes. Moreover, these triangles share the minimal path, called the self-routing path, directly connecting $hl\_vn(S_D)$ to $hl\_vn(D_S)$.

Because the other sides of these triangles become minimal when the self-routing path is unavailable, the routing strategy can follow what we call the triangle rule to deliver a message. The proposed communication scheme, even though based on the wormhole technique, requires neither virtual channels nor any information about the presence or location of deadlock, fault, and congestion conditions. Furthermore, the network is able to tolerate up to $\left[\frac{N_d(N_d-3)}{2}+1\right]\frac{N_d^{L}-1}{N_d-1}$ unavailable links. The routing algorithm is the same for all three conditions and can be easily implemented in hardware because of its simplicity. Results obtained by simulating a WK network with $N_d = 4$ confirm the effectiveness of this solution, especially when local communications predominate over remote ones.

Acknowledgements

The authors wish to thank Prof. Domenico Ferrari for his helpful suggestions to improve this paper.

References

[1] G.H. Chen and D.R. Duh. Topological Properties, Communication, and Computation on WK-Recursive Networks. Networks, Vol. 24, pp. 303-317, 1994.
[2] A.A. Chien and J.H. Kim. Planar-adaptive routing: Low-cost adaptive networks for multiprocessors. In Proceedings of the International Symposium on Computer Architecture, pp. 268-77, May 1992.
[3] A.A. Chien. A cost and performance model for k-ary n-cube wormhole routers. In Proceedings of Hot Interconnects Workshop, August 1993.
[4] G. Della Vecchia and C. Sanges. Recursively Scalable Networks for Message Passing Architectures. Proceedings of Int. Conf. Parallel Processing and Applications, 1987.
[5] G. Della Vecchia and C. Sanges. An Optimized Broadcasting Technique for WK-Recursive Topologies. Future Generation Computer Systems, Vol. 5, no. 4, 1990.
[6] W. Dally and C. Seitz. The torus routing chip. Journal of Distributed Computing, Vol. 1, no. 3, 1986.
[7] W. Dally and C. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Transactions on Computers, C-36(5), 1987.
[8] J. Duato. On the design of deadlock-free adaptive routing algorithms for multicomputers: design methodologies. In Proceedings of Parallel Architectures and Languages Europe, 1991.
[9] D.R. Duh and G.H. Chen. Topological Properties of WK-Recursive Networks. Journal of Parallel and Distributed Computing, pp. 468-474, 1994.
[10] R. Fernandes. Recursive Interconnection Networks for Multicomputer Networks. In Proceedings of Int’l Conference on Parallel Processing, 1992.
[11] R. Fernandes, D.K. Friesen, and A. Kanevsky. Efficient Routing and Broadcasting in Recursive Interconnection Networks. In Proceedings of Int’l Conference on Parallel Processing, 1994.
[12] J.H. Kim, Z. Liu, and A.A. Chien. Compressionless Routing. In Proceedings of ISCA ’94, April 1994.
[13] D. Linder and J. Harden. An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes. IEEE Transactions on Computers, C-40(1), 1991.
[14] L. Ni and C. Glass. The turn model for adaptive routing. In Proceedings of the International Symposium on Computer Architecture, May 1992.
[15] M.T. Raghunath. Interconnection Network Design Based on Packaging Considerations. PhD thesis, University of California, Berkeley, 1993. Computer Science Division, UCB//CSD-93-782.
[16] C.C. Su and K.G. Shin. Adaptive Fault-tolerant Deadlock-Free Routing in Meshes and Hypercubes. IEEE Transactions on Computers, 1995.