On Detailed Routing for a Hierarchical Scalable Reconfigurable Array With Constrained Switching Capability

CS-270 Course Project

Eylon Caspi, Randy Huang, and Christoforos Kozyrakis

April 8, 1998

Abstract

In modern FPGA CAD flow, netlist routing on a particular routing architecture is solved in two steps: global routing based on the wire bandwidth constraints of the architecture, and subsequent detailed routing based on the finer switching constraints of the architecture. Detailed routing is difficult and provably NP-complete in popular 2-D mesh architectures such as the Xilinx 4000 series FPGAs [16]. Certain tree-based routing architectures, which are desirable for scalability and area universality [11], have known polynomial algorithms and guarantees for detailed routing. In this paper we study approaches to detailed routing for a fat-tree routing architecture with restricted switching topology, in which routing bandwidth scales according to Rent's Rule. We discuss several formulations and frameworks for solving the detailed routing problem, using such techniques as graph coloring, multicommodity flow, and integer linear programming.

1 Introduction

A field-programmable gate array (FPGA) is a programmable semiconductor chip containing an array of configurable logic blocks and interconnect. An arbitrary logic design can be mapped to run on an FPGA in seconds to hours, and physically loaded into the array within tens of milliseconds, thereby bypassing the months-long turnaround time typically associated with custom chip manufacturing. Because of their fast development times and low cost for low-volume production, FPGAs have gained popularity in application-specific integrated circuit (ASIC) designs. For example, the flow control logic in the popular Cisco network router is implemented in a Xilinx XC4K FPGA.

Figure 1: A generic 2-D mesh routing architecture. "L" denotes a logic block (LUT); "C" denotes a connection-box (C-box) connecting a LUT to a neighboring routing channel; "S" denotes a switch-box (S-box) connecting vertical and horizontal channels.

The most widely used FPGA (Xilinx XC4K FPGA) is a look-up-table (LUT) and SRAM based 2-D mesh routing architecture. A LUT implements a logic function by using its input signals to address a block of SRAM which contains a look-up table for that function. For example, a 2-input LUT can be configured to implement an AND gate by programming the 4 SRAM bits to be {0,0,0,1}; a 3-input LUT can be configured to implement a 3-way OR gate by programming the 8 SRAM bits to be {0,1,1,1,1,1,1,1}. In general, an FPGA logic block contains more than just a LUT, but we will use the two terms interchangeably.

A generic 2-D routing architecture is shown in figure 1. We adopt the terminology of [16] for a discussion of the architecture. Logic blocks (LUTs, marked "L" in the figure) communicate through a 2-D mesh of routing channels. Each channel contains several parallel wires or tracks. The I/O pins of each LUT connect to a neighboring routing channel through a switch termed a connection-box, or C-box (marked "C"), shown in figure 2. Vertical and horizontal channels are connected by switches termed switch-boxes, or S-boxes (marked "S"). The flexibility of a C-box or S-box refers to the number of pins or perpendicular tracks a track can connect to through the box. Note that the flexibility of S-boxes and C-boxes need not be the same. The number of tracks in a channel is termed the channel capacity W, and the number of tracks in a channel reaching a C-box is termed the base channel capacity c. Note that W = c in this architecture, though this need not be true in general. Track domains are maximal sets of tracks mutually reachable through S-boxes. Track domains partition the set of all tracks in a manner determined by the topology of S-boxes.
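As an illustration of the definition of track domains, the following is a minimal sketch that groups tracks into domains with a union-find structure. The track naming `(channel, index)`, the function name `track_domains`, and the small three-channel example are hypothetical; the example also assumes that a diagonal S-box joins track i of one channel only to track i of the neighboring channel.

```python
from collections import defaultdict


def track_domains(tracks, switch_connections):
    """Group tracks into domains: maximal sets mutually reachable through S-boxes.

    tracks: iterable of hashable track ids, here (channel, index) pairs.
    switch_connections: iterable of (track_a, track_b) pairs an S-box can join.
    """
    parent = {t: t for t in tracks}

    def find(t):
        # Follow parent links to the representative, halving paths on the way.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in switch_connections:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = defaultdict(set)
    for t in parent:
        groups[find(t)].add(t)
    return list(groups.values())


# Hypothetical example: three channels of W = 3 tracks joined by two S-boxes.
# Assumption: a diagonal S-box connects track i only to track i.
W = 3
channels = ["h0", "v0", "h1"]
tracks = [(ch, i) for ch in channels for i in range(W)]
connections = [(("h0", i), ("v0", i)) for i in range(W)]
connections += [(("v0", i), ("h1", i)) for i in range(W)]
domains = track_domains(tracks, connections)
```

Under these assumptions the tracks fall into W domains, one per track index, which matches the statement that the domains partition the set of all tracks.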
A LUT-to-LUT signal, or net, must therefore be routed completely inside a track domain.

Figure 2: A Connection-box (C-box). LUT I/O pins (vertical wires) can connect to certain routing tracks (horizontal wires) according to an arbitrary switching topology.

Modern ASIC design and FPGA mapping processes are typically done in several phases. A design is first specified in a high-level language such as VHDL or Verilog, or in a schematic-capture method which combines functional modules, and then synthesized into a netlist (a specification of logic gates and interconnections between them). A subsequent placement phase assigns logic functions into logic blocks at particular locations on the array. Finally, a routing phase assigns nets into particular routing tracks. The placement of logic blocks in order to minimize total wire length (signal distance) is in general NP-complete, requiring such heuristics as min-cut bi-partitioning and simulated annealing.

The routing problem is typically solved in two phases. An initial global routing phase chooses for each net a sequence of channels which form a path between the net's endpoints, constrained only by available channel capacities along the path. A subsequent detailed routing phase assigns for each net a sequence of particular tracks, constrained by the precise switching topology of S- and C-boxes along the signal path. The maximum number of signals assigned to any channel by the global routing is termed the global channel density $W_g$, whereas the maximum number of tracks needed in any channel after detailed routing is termed the detailed channel density $W_d$. In general, $W_d$ may be larger than $W_g$, because restrictive switching schemes may not allow arbitrary packing of nets through tracks and S-boxes. We would like to guarantee that the mapping ratio $W_d/W_g$ be bounded for any design feasible in a particular routing architecture, so that any design global-routable in $W_g$ tracks is detailed-routable in only slightly more tracks. In a routing architecture with channel capacity W and a bounded mapping ratio M, a design would be guaranteed to route if it successfully global-routes using $W_g \le W/M$ tracks. It has been shown that 2-D mesh architectures with less than full switch flexibility do not have a bounded mapping ratio and that their detailed routing problem is NP-complete [16].
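The guarantee $W_g \le W/M$ can be made concrete with a short sketch. The function names and the toy global routing below are hypothetical, and a global route is assumed to be given simply as the list of channels each net passes through.

```python
from collections import Counter


def global_channel_density(global_routes):
    """W_g: the largest number of nets that global routing puts in any one channel.

    global_routes: dict mapping net name -> list of channel ids on its path.
    """
    usage = Counter(ch for path in global_routes.values() for ch in set(path))
    return max(usage.values(), default=0)


def guaranteed_detailed_routable(w_g, channel_capacity, mapping_ratio):
    """True if W_g <= W / M, so a bounded mapping ratio M guarantees a detailed route."""
    return w_g * mapping_ratio <= channel_capacity


# Hypothetical global routing of three nets over channels "a", "b", "c".
routes = {"n1": ["a", "b"], "n2": ["b", "c"], "n3": ["b"]}
w_g = global_channel_density(routes)  # channel "b" carries all three nets
ok = guaranteed_detailed_routable(w_g, channel_capacity=6, mapping_ratio=1.5)
```

A result of False only means that the bound gives no guarantee; the design may still happen to be detailed-routable.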
While 2-D mesh architectures with full switch flexibility have a perfect mapping ratio $W_d/W_g = 1$ and trivial detailed routing, full-flexibility switch-boxes are in practice prohibitively expensive, requiring area and power that grow with the square of the number of LUTs.

Figure 3: A fat-tree routing architecture: (a) routing graph with several global nets, (b) corresponding colored intersection graph.

Figure 4: Two S-box topologies for fat trees, showing possible connections of incident tracks: (top) using T-switches, (bottom) using π-switches.

Fat-tree routing architectures have several properties which make them attractive over 2-D meshes. Because they are hierarchical, fat-trees are naturally scalable. Specifically, the channel capacity in and out of a subtree can be made to scale according to Rent's Rule, an empirical finding that the number of I/O connections to a computational block scales sub-linearly with the size of the block.¹ Fat trees are provably area-optimal, in that they can implement a logic design in the same area as its optimal implementation with bounded (poly-logarithmic) slow-down [11]. Because global routing on a fat-tree is deterministic given a placement, the placement and routing phases are collectively replaced by a partitioning phase. While balanced partitioning based on min-cut bi-partitioning is NP-complete, there are known efficient approximation algorithms such as Fiduccia-Mattheyses (local search) and spectral partitioning [14].

Detailed routing for some tree architectures is provably easier than for 2-D meshes. Wu et al. have demonstrated an H-tree architecture using diagonal S-boxes (figure 4(top)) which has a bounded mapping ratio $W_d/W_g = 1.5$ as well as a polynomial-time detailed routing algorithm. This architecture, however, does not scale channel capacity with subtree size, and thus becomes bandwidth limited for non-local signals in large logic designs.

U.C.Berkeley's HSRA (Hierarchical Scalable Reconfigurable Array) uses a fat-tree routing architecture with two types of S-boxes in order to scale routing resources in accordance with an arbitrary Rent parameter p. By using an arbitrary schedule of S-boxes based on T-switches or π-switches (figures 4(top),(bottom)) at different heights of the tree, the tree can be made to resemble anything from the H-tree of Wu et al. [16] (using only T-switches, p = 0) to a butterfly network with channel capacity linear in the number of logic blocks (using only π-switches, p = 1). The fat-tree routing resources are completely described by the following parameters:

- N: the number of logic blocks in the array
- k: the number of I/O pins connected to each logic block
- c: the base channel capacity (a C-box connects k pins to c channels, as in figure 2)
- growth schedule: the sequencing of S-box types (T-switch or π-switch) at each height of the tree

In this project, we study the detailed routing problem for a fat-tree routing architecture as in the U.C.Berkeley HSRA. In section 2, we survey the findings and methodologies of several related works on detailed routing. In section 3, we formulate the detailed routing problem in several graph-theoretic frameworks and work towards efficient routing algorithms. Our findings thus far do not shed light on whether a bounded mapping ratio exists.

¹ Rent's Rule can be stated formally as follows: the number of signal connections in and out of a group of N logic units is $\#IOs = cN^p$, where the Rent parameter p ($0 \le p \le 1$) and scaling factor c are properties of the logic design in question. In the case of FPGA routing architectures, with N being the number of LUTs, c is the base channel capacity, and p is typically 0.5.

Figure 5: An instance of the Berkeley HSRA (Hierarchical Scalable Reconfigurable Array) with N = 64 LUTs, k = 3 pins per LUT, base channel capacity c = 3, and an alternating T-switch / π-switch S-box schedule to obtain a Rent parameter p = 0.5.
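The relation between the S-box growth schedule and the Rent parameter can be checked numerically for the instance of figure 5. The sketch below rests on two assumptions that the text implies but does not state as a formula: a T-switch level leaves the channel capacity unchanged, while a π-switch level doubles it. The schedule encoding (`"T"` / `"P"`), the choice to start the alternation with a T-switch at the lowest level, and the function names are hypothetical.

```python
def channel_capacities(c, schedule):
    """Capacity of the channel above each tree height, starting at the leaves.

    c: base channel capacity at the leaves.
    schedule: string of 'T' / 'P' (pi) switch types, lowest S-box level first.
    Assumption: a T level keeps the capacity, a pi level doubles it.
    """
    caps = [c]
    for kind in schedule:
        if kind == "T":
            caps.append(caps[-1])
        elif kind == "P":
            caps.append(2 * caps[-1])
        else:
            raise ValueError(f"unknown switch type: {kind!r}")
    return caps


def rent_ios(c, n, p):
    """Rent's Rule: number of I/O connections of a block of n logic units."""
    return c * n ** p


# Instance of figure 5: N = 64 LUTs (6 tree levels), c = 3, alternating schedule.
N, c = 64, 3
levels = N.bit_length() - 1          # log2(N) for a power of two
alternating = "TP" * (levels // 2)   # assumed to start with a T-switch level
caps = channel_capacities(c, alternating)
```

With these assumptions the root channel capacity of the alternating schedule equals $cN^{0.5}$, while all-T and all-π schedules reproduce the p = 0 and p = 1 extremes.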

2 Related Work

2.1 2-D Mesh Architectures

Wu, Tsukiyama, and Marek-Sadowska prove in [16] that detailed routing on a 2-D mesh architecture (as in the Xilinx 4000-series FPGAs) is NP-complete and has no bounded mapping ratio. Their analysis considers 2-D Manhattan-style routing with diagonal switch-boxes, where track domains are 1 channel wide, and channel width is the same everywhere and equal to the number of track domains. Figure 1 shows a high-level view of this routing model. Routing channels connect to LUTs via full-flexibility connection boxes, so that each LUT I/O pin can connect to any track (or to multiple tracks). This feature allows the analysis to largely ignore the programming of C-boxes. Detailed routing is formulated as the packing of rectilinear global nets disjointly into the physical routing resources ("2-D interval packing"). The assignment of nets into tracks is formulated as a graph coloring problem on the intersection graph of global nets.

The intersection graph is defined to contain a node for each global-routed net and an edge connecting nets which pass through a common S-box (or equivalently, through a common C-box), with node colors representing track domains. Using coloring on intersection graphs, detailed routing for such a 2-D architecture is shown to be NP-complete (for 3 or more track domains) by reducing the coloring problem for an arbitrary graph G into an instance of the detailed routing problem. Because the instance is constructed to have fixed global channel density $W_g = 2$, while the detailed channel density $W_d$ is equal to the chromatic number of G and is thus unbounded, the mapping ratio $W_d/W_g$ is not bounded. In [17], Wu and Chang prove that the results of NP-completeness and unbounded mapping ratio (for 3 or more track domains) hold with any fixed switch-box topology that has less than full flexibility. This is to say that adding more switching flexibility to a 2-D mesh routing architecture does not make its detailed routing any easier (short of expanding to full flexibility).
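The intersection graph described above can be sketched in a few lines. This is an illustrative construction only: the representation of a global net as the set of S-box ids it passes through, and the names `intersection_graph` and `is_valid_track_assignment`, are assumptions made for the example.

```python
from itertools import combinations


def intersection_graph(net_sboxes):
    """Build the intersection graph of global nets.

    net_sboxes: dict mapping net -> set of S-box ids on its global route.
    Returns an adjacency dict: two nets are adjacent if they share an S-box.
    """
    adj = {net: set() for net in net_sboxes}
    for a, b in combinations(net_sboxes, 2):
        if net_sboxes[a] & net_sboxes[b]:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def is_valid_track_assignment(adj, domain_of):
    """A track-domain assignment is a proper coloring: adjacent nets differ."""
    return all(domain_of[u] != domain_of[v] for u in adj for v in adj[u])


# Hypothetical global routes: n1 and n2 share S-box s2, n2 and n3 share s3.
nets = {"n1": {"s1", "s2"}, "n2": {"s2", "s3"}, "n3": {"s3"}}
graph = intersection_graph(nets)
assignment = {"n1": 0, "n2": 1, "n3": 0}
valid = is_valid_track_assignment(graph, assignment)
```

The number of distinct colors in a valid assignment corresponds to the number of track domains used, and hence to the detailed channel density in this one-track-per-domain model.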

2.2 Tree Architectures

Tree-based routing architectures arrange channels into an H-tree or fat-tree topology, with S-boxes as internal nodes (connecting channel triplets) and LUT/C-box pairs as leaves. Leighton's tree-of-meshes architecture [10] represents the case where S-boxes have full flexibility, allowing tracks of the 3 incident channels to be connected in arbitrary permutations. Detailed routing is trivial in this architecture, and a perfect mapping ratio $W_d/W_g = 1$ can be achieved. In practice, implementing full-flexibility S-boxes is prohibitively expensive in area as well as power, since the number of switches required grows with the square of the number of LUTs.

Some good routing results exist for lower-flexibility S-boxes, specifically for diagonal S-boxes (figure 4(top)), which allow the network's switch requirements to grow linearly with the number of LUTs. Using the same correspondence between detailed routing and graph coloring, Wu et al. show in [16] that detailed routing on an H-tree architecture with diagonal S-boxes can be done in polynomial time with a bounded mapping ratio $W_d/W_g \le 3/2$. The H-tree topology studied, shown in figure 3(a), uses 3-way T-switches in its S-boxes, as shown in figure 4(top). As in the 2-D architecture, track domains are 1 channel wide, and channel capacity is the same everywhere. Note that this architecture corresponds to a Rent parameter p = 0, where channel widths do not scale for larger trees. The existence of an efficient routing algorithm derives from the fact that the global nets, being subtrees of an undirected tree, have a chordal intersection graph [3] which can be colored in $O(|V| + |E|)$ time [5]. Figures 3(a),(b) illustrate the correspondence between an H-tree's global routing graph (a) and chordal intersection graph (b).
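To make the chordal-coloring remark concrete, here is a minimal sketch of one standard technique: visit vertices by maximum cardinality search and color greedily in visit order, which yields an optimal coloring when the graph is chordal. This is not the top-down algorithm of [16], and the simple vertex selection below runs in quadratic rather than linear time. The function names and the example graph are hypothetical.

```python
def mcs_order(adj):
    """Maximum cardinality search: always visit next the unvisited vertex
    that has the most already-visited neighbours."""
    weight = {v: 0 for v in adj}
    unvisited = set(adj)
    order = []
    while unvisited:
        v = max(unvisited, key=lambda u: weight[u])
        order.append(v)
        unvisited.remove(v)
        for w in adj[v]:
            if w in unvisited:
                weight[w] += 1
    return order


def color_chordal(adj):
    """Greedy coloring in MCS order; optimal if the graph is chordal."""
    color = {}
    for v in mcs_order(adj):
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# Hypothetical chordal intersection graph: triangle a-b-c, d attached to b and c,
# e attached to d. Its largest clique has 3 nodes.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
coloring = color_chordal(graph)
```

Because chordal graphs are perfect, the number of colors used equals the maximum clique size, which here is 3.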
The chordal property of subtree intersection graphs also indicates the applicability of efficient algorithms for other difficult problems, including maximum clique, minimum cover by cliques, and maximum independent set [4], which may be useful in other approaches to detailed routing. A specific top-down² detailed routing algorithm requiring O(#nets + #LUTs) time is given in [16].

DeHon (U.C.Berkeley) has spent some time studying a fat-tree routing architecture whose diagonal switch-boxes use π-switches (figure 4(bottom)), to date finding no guarantees for detailed routing. Unlike the T-switch, a π-switch does not merge tracks for nets heading up or down the tree. Rather, it permutes incident track pairs, allowing channel capacity to grow at higher levels of the tree. Specifically, the root channel capacity (and track domain capacity) of a subtree grows linearly with the number of LUTs in the subtree, corresponding to a Rent parameter of 1. The intent of this routing scheme is to provide more resources and fewer constraints for non-local nets that go far up the tree.

Several approaches have been considered for detailed routing in the π-switch based fat-tree architecture. A naive, top-down track assignment is not sufficient in this case, since it is possible to assign two nets into a track domain in such a way as to make them conflict farther down the tree, having to merge through the two legs of a π-switch into the same arm. A naive, bottom-up track assignment is not sufficient because it is possible to exhaust, at a low height, track domains that will be needed farther up the tree. Hence attempts to assign tracks in a top-down or bottom-up fashion have thus far required expensive backtracking to reassign tracks. The formulation of detailed routing as graph coloring is not directly applicable in this architecture, because track domains are wider than single tracks. Since the channel capacity of a track domain through a given S-box grows with the height of the S-box, several nets passing through the S-box may be packed into the same track domain, corresponding to the legal identical-coloring of adjacent nodes in the intersection graph.
Finally, a naive formulation of detailed routing as edge-coloring of the original netlist is insufficient. Although it captures the routing constraint that tracks meeting at a common C-box need different colors (track domains), it fails to capture the constraints of S-boxes farther up the tree.

A more general fat-tree routing architecture is presently being studied in the U.C.Berkeley BRASS³ group's HSRA project (Hierarchical Scalable Reconfigurable Array), using a combination of T- and π-switches at different tree heights to control channel capacity and to achieve a desired Rent parameter [2]. In the current formulation, a strictly-alternating switch schedule, shown in figure 5, is used to achieve a Rent parameter of 0.5. The analysis of detailed routing for this architecture suffers from most of the same problems as the all-π-switch fat-tree described above. A naive bottom-up approach to track assignment faces the additional complication that nets in the same track domain may conflict farther up the tree where tracks merge in a T-switch. While a heuristic detailed routing algorithm has been implemented, it can make no guarantee on the detailed routability of a successfully global-routed network.

In addition to the fat-tree topology, the HSRA provides short-cut channels between subtrees, similar to those of Greenberg's fat-pyramid architecture [6]. The shortcuts exist between physically neighboring LUTs which are logically distant in the tree. Greenberg used shortcuts to extend Leiserson's notion of area-universality for fat-trees under the realistic assumption that network delay is a function of distance through the network. In this paper, we ignore such short-cut paths in the analysis of HSRA-like fat-trees.

² By a top-down or bottom-up algorithm we mean a coloring (track-assignment) algorithm for a tree-based routing architecture which seeks to color global nets in order of their respective highest point in the routing tree (cross-over point), without resorting to backtracking. A top-down algorithm scans the routing tree for cross-over points from the root towards the leaves, whereas a bottom-up algorithm scans for cross-over points from the leaves towards the root.

³ Berkeley Reconfigurable Architectures, Systems, and Software

3 Formulations of The Detailed Routing Problem On Fat-Trees

We have considered several approaches to the problem of detailed routing in a fat-tree architecture using alternating T- and π-switch based diagonal switch-boxes, as in the U.C.Berkeley HSRA [2]. In each case, we seek an efficient detailed routing algorithm, or to prove that detailed routing is NP-complete in that formulation, and to prove whether a bounded mapping ratio $W_d/W_g$ exists. Our work for each approach remains preliminary, and conclusive results have yet to be found.

Our analyses make several simplifying assumptions about the routing resources. First, we assume that LUTs are connected to routing channels via full-flexibility C-boxes, so that a LUT I/O pin can connect to any track. In practice, we can get this effect without full flexibility, so long as a C-box covers all k pins and c channels incident on it, and pin signal assignments can be permuted by modifying a LUT's logic function. We further assume that all global nets are paths between a single source and single destination. This is equivalent to assuming that switch-boxes do not allow signals to fan out. To support the embedding of arbitrary netlists with logical fan-out in the fat-tree, we can take other measures, such as allowing a LUT I/O pin to connect to multiple tracks ("dog-legging").

3.1 Graph Coloring

We would like to extend the results of [16] to use graph coloring for detailed routing. This approach is attractive because it is based on chordal subtree intersection graphs for which there exist numerous efficient algorithms. In the H-tree architecture of [16] with T-switch based diagonal S-boxes, each track domain is one channel wide everywhere, so node-coloring the intersection graph of nets to assign nets into track domains completely solves the detailed routing problem. In a fat-tree routing architecture, on the other hand, the channel capacity of track domains is made to grow according to Rent's rule with the height of the tree. Hence nets which share a common S-box need not belong to different track domains, and adjacent nodes in the intersection graph need not be uniquely colored. A modified coloring problem which permits some cliques of uniform color may still be used to assign nets into track domains. One possible modification is to attempt to merge adjacent nodes in the intersection graph (i.e. assign intersecting nets into a common track domain) so as to reduce its chromatic number.

The utility of coloring for track-domain assignment hinges on whether detailed routing can be decomposed into an initial phase of assigning nets into track domains and a subsequent phase of assigning tracks within each track domain. This decomposition is analogous to the separation of network routing into global routing and detailed routing, and we seek the same guarantee that the second phase will succeed if the first has succeeded (possibly with some bounded increase in the required routing resources). Like global routing, the assignment of nets into track domains is constrained only by track domain channel capacity at each S-box, and we suspect that it can be done efficiently. It is not clear, however, that subsequent track assignment within domains can be guaranteed. If all nets in a track domain cross over at the root of a fat-tree, then the track domain's routing resources act much like a Benes network connecting the leaves of the left and right subtrees with arbitrary permutations between left-right leaf pairs [1] (a fat-tree using only π-switches is exactly a Benes network connecting the leaves of its two subtrees). If some nets cross over below the root, they effectively occupy some internal resources of the Benes-like network, and it is not clear that an arbitrary permutation among the nets that cross at the root remains possible.
We have yet to explore the properties of a Benes network with missing edges, and the possibility that modifying leaf (LUT) placement to make local nets occupy carefully-chosen routing resources may recover permutability in unoccupied parts of the Benes-like network.

If detailed routing can indeed be decomposed into a track-domain assignment (coloring) phase and a track assignment phase within domains, then we can consider several approaches for each phase. The coloring phase can be formulated as a modified coloring problem on the intersection graph of global nets such that some adjacent nets are allowed to have the same color. Because subtree intersection graphs are chordal, and hence perfect with maximum clique size equal to the chromatic number, we can equivalently use clique-based analyses. Note that a clique in the intersection graph corresponds to a set of nets which all share a particular common S-box.⁴

⁴ It is possible to find a common S-box shared by a clique of X nets by scanning from the root of the routing tree towards the leaves. Suppose that $X_1 \le X$ nets cross over at the root S-box. If $X_1 = X$ then this is the common S-box. Otherwise, $X - X_1$ nets cross over below the root. Because each of these nets intersects with every other net, they must all be localized to the same subtree below the root. Now repeat the cross-over check for this subtree: suppose that $X_2 \le X - X_1$ nets cross over at the root S-box of this subtree. If $X_1 + X_2 = X$ then this is the common S-box. Otherwise all remaining $X - X_1 - X_2$ nets are localized to the same subtree below this one, and the scan continues down the tree. At the very latest, this process will find some nets localized to a subtree containing only 2 LUTs, in whose root S-box all X nets must intersect.

One formulation of the modified coloring problem is: given an X-colorable chordal graph $G_i$, is it possible to merge certain adjacent nodes (in accordance with track domain channel capacity constraints) so as to obtain a Y-colorable chordal graph $G'_i$? The initial graph $G_i$ is the intersection graph of global nets with global channel density $W_g = X$ (because the chromatic number X is also the maximum clique size, hence also the maximum number of tracks through any channel or S-box). The merging of adjacent nodes represents assignment of intersecting nets into common track domains. The resulting graph $G'_i$ will then be the intersection graph of domain-merged nets, colorable with $Y \le X$ track domains.

If we consider the merging of adjacent nodes as a sequence of pairwise mergings (edge collapsings), it is easy to see that the chordal property of the intersection graph is retained across pairwise mergings. Hence the resulting graph and all intermediate graphs are valid subtree intersection graphs. They are also all perfect graphs, so that their respective maximum clique size indicates their current chromatic number. In order to reduce the chromatic number, it suffices to merge pairs of nodes chosen from a maximum clique at each step, so as to reduce the maximum clique size. Because only so many nets can be packed into any color domain, we must take care not to merge more nodes than a track domain's channel capacity allows. This bandwidth constraint must be satisfied at every S-box at every height in the tree, so we must record the minimum height at which each pair of nets intersect (where domain capacity is most constrained). If merged nodes are marked with the number of original nodes they represent, and edges are marked by the domain capacity of the lowest-height S-box at which their endpoint nets intersect, then we can know not to merge adjacent nodes whose sum of represented nodes exceeds the capacity of their connecting edge.

A greedy algorithm which chooses pairwise mergings one at a time would only need to consider $O(N^3)$ mergings (N being the number of nets) and could conceivably be made to work in polynomial time. The runtime estimate comes from the fact that with N nodes in the intersection graph, there are $N^2$ possible pairwise mergings, and up to N successive mergings will be needed. It is conceivable that a dynamic programming algorithm which explores all valid sequences of pairwise node mergings under domain capacity constraints might also be made to work in polynomial time. Efficient algorithms for coloring and finding all maximal cliques of the intermediate chordal graphs can be found in [4] and [5].

An equivalent way to think of the modified coloring problem is to choose cliques in the intersection graph to color uniformly and color adjacent cliques, rather than partially collapse the graph and graph-color it. The colored cliques must partition the intersection graph, i.e. cover it disjointly, and be consistent with the track-domain channel capacity constraints at every S-box at every height of the routing tree. An algorithm to partition a chordal graph into a minimally-colored set of cliques is not known to the authors at this time, much less under capacity constraints. An efficient algorithm for finding the minimum covering by cliques for a chordal graph is given in [4], though it is not clear that such a covering is colored minimally. In addition, that covering need not be disjoint or consistent with domain capacity constraints, so cliques may need to be trimmed and/or supplemented in order to form an acceptable clique partition.

While these modified coloring algorithms may be a good step towards a detailed routing algorithm, their complexity, along with the lack of guarantees for a subsequent track assignment phase within domains, makes it impossible to analyze a bound for the mapping ratio.
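The bookkeeping for capacity-constrained pairwise merging, with node weights and edge capacities as described in this section, might be sketched as follows. This is only an illustration of the merge test, not a complete merging or coloring algorithm. The class name `MergeGraph` and its methods are hypothetical, and the rule for combining edge capacities after a merge (keep the smaller one) is an assumption made here, not something the analysis above establishes.

```python
class MergeGraph:
    """Intersection graph whose nodes stand for groups of nets in one track domain."""

    def __init__(self):
        self.size = {}  # node -> number of original nets it represents
        self.cap = {}   # node -> {neighbour: domain capacity at lowest shared S-box}

    def add_net(self, net):
        self.size[net] = 1
        self.cap[net] = {}

    def add_intersection(self, u, v, capacity):
        self.cap[u][v] = capacity
        self.cap[v][u] = capacity

    def can_merge(self, u, v):
        """Adjacent nodes may merge only if their combined net count fits the edge."""
        return v in self.cap[u] and self.size[u] + self.size[v] <= self.cap[u][v]

    def merge(self, u, v):
        """Collapse edge (u, v) into u. Returns False and changes nothing if disallowed."""
        if not self.can_merge(u, v):
            return False
        self.size[u] += self.size.pop(v)
        for w, c in self.cap.pop(v).items():
            del self.cap[w][v]
            if w == u:
                continue
            # Assumption: when both u and v touched w, keep the tighter capacity.
            c = min(c, self.cap[u].get(w, c))
            self.cap[u][w] = c
            self.cap[w][u] = c
        return True


# Hypothetical clique of three nets; nets b and c meet only at a 1-track-wide domain.
g = MergeGraph()
for net in ("a", "b", "c"):
    g.add_net(net)
g.add_intersection("a", "b", 2)
g.add_intersection("a", "c", 2)
g.add_intersection("b", "c", 1)
merged = g.merge("a", "b")
```

After merging a and b, the merged node represents two nets and inherits the tighter capacity towards c, so a further merge with c is rejected.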

3.2 Multi-Commodity Flow

The detailed routing problem can be seen as a multi-commodity flow problem. Given a graph model of the routing network, where wires are represented with edges of unit capacity, we wish to find some feasible way to route all nets without exceeding edge capacities. This approach has the advantage that, given a specific reconfigurable network and set of nets from global routing, one can potentially find a detailed routing. On the other hand, this approach cannot give us a bound on the mapping ratio. In other words, a successful solution using this approach is of little use when setting parameters of the reconfigurable network, but is of great use in routing once the network has been fixed.

A single-commodity flow model with multiple sources and destinations is not sufficient, since this model maximizes flow without distinguishing a pairing of sources and destinations. Hence there is no way to retain the identity of nets as source-to-destination signals.

3.2.1 Graph Model for Reconfigurable Network

The first step is to model the routing network as a directed graph. This includes modeling LUTs, C-boxes and S-boxes. Since each wire in the network can carry a single signal, all edges presented here have capacity 1.

Because pin assignments in each LUT are permutable, we need not consider the programming of C-boxes. Specifically, a LUT/C-box pair can be considered as a point source or destination in our graph routing model, connecting to the c adjacent base channels, as in figure 6. Note that these edges are not directed for the moment.

Figure 6: Graph model for C-boxes and LUT pairs: single node and c edges.

The modeling of T-switch and π-switch S-boxes to capture graph connectivity is presented in figure 7. A T-switch (a) can be modeled as a single node with 3 incident edges. A π-switch (b) can be modeled by a 4-node gadget with edges allowing permutations between the π's arms and legs. A horizontal edge between the left and right sides is not needed, since the lower nodes are reachable through a diagonal edge. This reachability also captures the fact that the two vertical edges cannot be used to push flow when a cross-over route is in place.

Figure 7: Graph model for T/π-switches. Each connection of triples of wires in T-switches is modeled as in (a), while each connection of quadruples of wires in π-switches is modeled as in (b).

Using the above mentioned models for C-boxes and T- and π-switch S-boxes, we can transform the routing network into an undirected graph. In order to apply known algorithms for multi-commodity flow, we need a directed graph. The naive modeling of an undirected edge (figure 8(a)) by two directed edges (figure 8(b)) is not correct in our case, since a signal can flow through a given track in only one direction. To capture this notion we transform an undirected edge into the gadget of figure 8(c). The directed edge between C and D allows only one of the two directions to be available for flow between A and B. This transformation introduces 4 directed edges and 2 additional nodes per undirected edge.
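The undirected-to-directed transformation can be sketched as below. The exact wiring of the gadget is not spelled out in the text, so the sketch assumes the common construction A→C, B→C, C→D, D→A, D→B with unit capacities: five directed edges in place of one undirected edge (a net increase of four) and two new nodes C and D. The function name and node labels are hypothetical.

```python
def direct_edges(undirected_edges):
    """Replace each undirected unit-capacity edge {A, B} by a directed gadget.

    Returns a list of (tail, head, capacity) directed edges.
    Assumed gadget wiring: A->C, B->C, C->D, D->A, D->B, all of capacity 1,
    so any flow between A and B in either direction must cross C->D.
    """
    directed = []
    for idx, (a, b) in enumerate(undirected_edges):
        c_node = ("C", idx)  # fresh helper nodes, one pair per original edge
        d_node = ("D", idx)
        directed += [
            (a, c_node, 1),
            (b, c_node, 1),
            (c_node, d_node, 1),
            (d_node, a, 1),
            (d_node, b, 1),
        ]
    return directed


# One undirected wire between hypothetical nodes "A" and "B".
gadget = direct_edges([("A", "B")])
```

Since both the A-to-B path and the B-to-A path use the single unit-capacity edge C→D, at most one signal can traverse the wire, in one direction only.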

3.2.2 Node, Edge and Commodity Calculation

Assume that we are given a routing network with N LUTs and base channel capacity c. The number of levels in the H-tree is $D = \log_2 N$. In this section we will calculate the number of nodes, edges and commodities in the undirected graph. The corresponding numbers can be easily found for the directed case by adding 4 edges and 2 extra nodes per undirected edge.

Figure 8: Transformation of an undirected edge (a) into directed edges, using the naive method (b) or the gadget (c). Method (b) does not preserve the property of physical wires that they propagate a signal in only one direction. The capacity of each edge is shown (1 throughout).

Node Calculation

Each LUT/C-box pair introduces one node. At the i-th level of the tree ($1 \le i \le D$), we have $2^{i-1}$ switches. In the simple case where all the switches are T-switches, each switch has c triplets to connect. So the total number of nodes is:

$$N + \sum_{i=1}^{D} 2^{i-1} c \simeq N + cN = (c + 1)N$$

If all switches are Π-switches, then each switch at level i has $2^{D-i} c$ quadruples to connect, each modeled by a 4-node gadget. So the total number of nodes is:

$$N + \sum_{i=1}^{D} 2^{i-1} \, 2^{D-i} \, 4c \simeq N + 2cN\log_2 N$$

Similarly, if we alternate T- and Π-switches at successive levels of the tree, we get $N + c\sqrt{N} + \frac{1}{4} cN\log_2 N$. So in general, in the worst case, the number of nodes in our model is $O(cN\log_2 N)$.

Edge Calculation

A similar analysis can be performed for the edges. We present only the worst case, that of a network with Π-switches only. Each LUT introduces c edges. A Π-switch at the i-th level introduces $2^{D-i-1} c$ edges that go to the upper level and $4 \cdot 2^{D-i} c$ internal edges. So the total number of edges is:

$$cN + \sum_{i=1}^{D} \left( 2^{i-1} \, 2^{D-i-1} c + 2^{i-1} \, 2^{D-i} \, 4c \right) = cN + \frac{9}{4} cN\log_2 N = O(cN\log_2 N)$$

Commodity Calculation

The number of commodities is equal to the number of nets specified by the global routing. Since each LUT has at most c inputs and outputs, this number is bounded by $\frac{cN}{2}$ (actually, by $\frac{kN}{2}$, but we are running out of letters!). In this case we assume that we do not use fan-out in the network.

It is clear that the transformation to the directed graph does not change the order of magnitude of the node and edge counts, and they both remain $O(cN\log_2 N)$.
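The counts in this section can be checked numerically by summing level by level. The sketch below is an illustration only; the function name and the returned dictionary keys are hypothetical, N is assumed to be a power of two with at least 4 LUTs, and the commodity bound assumes two-point nets (no fan-out), as in the text.

```python
def model_size(num_luts, c):
    """Count nodes and edges of the undirected routing graph, level by level.

    Level i (1 <= i <= D) holds 2**(i-1) switches. An all-T network has c
    triplets per switch; an all-Pi network has 2**(D-i) * c quadruples per
    switch, each modeled by a 4-node gadget with 4 internal edges.
    """
    depth = num_luts.bit_length() - 1
    assert num_luts >= 4 and 2 ** depth == num_luts, "N must be a power of two >= 4"

    t_nodes = num_luts            # one node per LUT/C-box pair
    pi_nodes = num_luts
    pi_edges = c * num_luts       # c edges per LUT
    for i in range(1, depth + 1):
        switches = 2 ** (i - 1)
        quads = 2 ** (depth - i) * c
        t_nodes += switches * c
        pi_nodes += switches * quads * 4
        upward = switches * quads // 2     # 2**(D-i-1) * c edges per switch
        internal = switches * quads * 4    # 4 * 2**(D-i) * c edges per switch
        pi_edges += upward + internal
    return {
        "t_nodes": t_nodes,
        "pi_nodes": pi_nodes,
        "pi_edges": pi_edges,
        "max_commodities": c * num_luts // 2,
    }
```

For example, `model_size(16, 4)` gives 76 nodes for the all-T network, 528 nodes and 640 edges for the all-Π network, and at most 32 commodities.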

3.2.3 Linear Program Formulation

Multi-commodity flow problems are usually stated as linear programming problems. Assume we are given the directed graph model for a routing network with $n = O(cN\log_2 N)$ nodes, $m = O(cN\log_2 N)$ edges and $k = O(cN)$ commodities (see footnote 5). Each edge has a total capacity of 1 and each commodity has a total demand of 1. In other words, we have a unit-capacity, unit-demand integer multi-commodity problem to solve.

The multi-commodity flow problem can be expressed as a linear programming problem as follows. For all adjacent nodes i and j and each commodity l, we wish to calculate the quantity $x^l_{ij}$, which is 1 if commodity l is routed through edge (i, j) and 0 otherwise. Let $b^l_i$ be 1 if commodity l has node i as a source, -1 if it has node i as a destination, and 0 otherwise. We wish to minimize the routing cost

$$\sum_{l} \sum_{(i,j)} x^l_{ij}$$

(assuming unit cost for routing flow through an edge), under the constraints

$$\sum_{l} x^l_{ij} \le 1, \quad \forall i, j;$$

$$\sum_{j} x^l_{ij} - \sum_{j} x^l_{ji} = b^l_i, \quad \forall i, l;$$

$$x^l_{ij} \in \{0, 1\}, \quad \forall i, j, l.$$

Integer multi-commodity flow is in general an NP-complete problem [18], even when unit capacities and unit demands are used.
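To make the constraints concrete, the following sketch checks whether a candidate 0/1 assignment satisfies the capacity and flow-conservation constraints and, if so, returns its routing cost. It is a verifier, not a solver, and it is only an illustration: the function name and the data encoding (a set of `(l, i, j)` triples standing for the variables equal to 1) are assumptions made here, and each commodity is assumed to have distinct source and destination nodes.

```python
from collections import defaultdict


def check_routing(edges, commodities, x):
    """Return the routing cost if x is feasible, otherwise None.

    edges:       iterable of directed edges (i, j), each of capacity 1
    commodities: dict mapping commodity l -> (source, destination), demand 1
    x:           set of (l, i, j) triples, meaning x^l_ij = 1 (all others 0)
    """
    edges = set(edges)
    load = defaultdict(int)     # sum over l of x^l_ij, per edge
    balance = defaultdict(int)  # outflow minus inflow, per (commodity, node)
    for l, i, j in x:
        if (i, j) not in edges or l not in commodities:
            return None
        load[(i, j)] += 1
        balance[(l, i)] += 1
        balance[(l, j)] -= 1

    # Capacity constraint: at most one commodity per edge.
    if any(total > 1 for total in load.values()):
        return None

    # Conservation constraint: outflow - inflow = b^l_i at every node.
    nodes = {u for edge in edges for u in edge}
    for l, (src, dst) in commodities.items():
        for i in nodes:
            b = 1 if i == src else -1 if i == dst else 0
            if balance[(l, i)] != b:
                return None
    return len(x)  # unit cost per used edge
```

With edges s→a, a→t and s→t and two commodities from s to t, routing one commodity over s→a→t and the other over s→t is feasible with cost 3, whereas sending both over s→t violates the capacity constraint.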

Footnote 5: Note that we use n, m and k following the standard terminology of the multi-commodity flow literature. In this section, k represents the number of commodities, not the number of LUT I/O pins.

3.2.4 Approximation Algorithms

A large number of approximation algorithms have been designed for the multi-commodity flow problem and can be used in this case. A very common method is to solve instead the concurrent flow optimization problem, where we wish to know the maximum percentage z such that at least z percent of each demand can be shipped without violating the capacity constraints. To be more specific, all the approximation algorithms discussed here, given a positive $\epsilon$, return a routing that is feasible if the capacity of each edge is increased by a factor of at most $(1 + \epsilon)$ (edge congestion less than $(1 + \epsilon)$).

Klein, Plotkin, Stein and Tardos present a randomized approximation algorithm for the unit capacity flow problem with expected running time $O((k + m) m \log m)$, where the constant depends on $\epsilon$ [19]. This translates to a complexity of $O((k + cN\log N) \, cN\log N \, \log(cN\log N)) = O(c^2 N^2 (\log N)^3)$ in our case. Based on this result, Leighton et al. present an algorithm that solves the general concurrent flow problem (unit capacities not required) in expected running time $O(knm \log^4 n)$ [20]. This is $O(c^3 N^3 (\log N)^6)$ in our case. Apart from providing a $(1+\epsilon)$-approximation to the optimal routing when a routing exists, this algorithm can also provide a proof in the case that no feasible solution exists.

The algorithms mentioned in the previous paragraph use linear programming, interior-point techniques and/or combinatorial approximations to solve the problem. Recent data indicate that these algorithms can produce highly accurate results ($\epsilon$ at the percent level) several orders of magnitude faster than older approaches and are practical even for large problems [21].

Using these algorithms for detailed routing can be of great value in the absence of other direct solutions to the detailed routing problem. Leighton's algorithm, for example, can be used to check whether a detailed routing exists at all. In the case that a detailed routing exists and can be approximated with high accuracy ($\epsilon \simeq 1\%$), we can restrict the global router not to use an $O(\epsilon)$ percentage of the routing resources in order to guarantee detailed routing. Thus Leighton's algorithm can be used to find the optimal routing with very high probability.

One potential way that these algorithms can be improved for detailed routing is to augment them to take advantage of the structure of the routing graph, which is a set of interconnected fat-trees, and of the fact that for each commodity not all edges are candidates for routing (only edges along paths from a net's source to its destination).
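The first running-time bound in this section can be checked by direct substitution. The short derivation below is only a sanity check; it uses $k = O(cN)$ and $m = O(cN\log N)$ from section 3.2.3 and additionally assumes that c grows at most polynomially in N, so that $\log(cN\log N) = O(\log N)$.

```latex
\begin{align*}
(k + m)\, m \log m
  &= O\big( (cN + cN\log N) \cdot cN\log N \cdot \log(cN\log N) \big) \\
  &= O\big( c^2 N^2 (\log N)^2 \cdot (\log c + \log N + \log\log N) \big) \\
  &= O\big( c^2 N^2 (\log N)^3 \big).
\end{align*}
```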

3.3 Integer Linear Programming

The detailed routing problem can be modeled as an integer linear programming (ILP) problem directly, without the use of multi-commodity flow. The key observation is that the main task of detailed routing is to assign each global net to a specific track ID at each level of the tree, with the constraint that tracks at neighboring levels are connected through an S-box of a particular switching topology.

For example, assume that the tracks connected to the C-box of a LUT are numbered 1 to c. A net starting or ending at this LUT, e.g. net i, can be assigned to any track ID $F_i$. Assume now a second net j starting or ending at the same LUT. The track ID $F_j$ to which net j is assigned must differ from $F_i$ by at least one, since the two nets cannot both be routed on the same wire. Consider net k starting at the LUT that has a common parent in the H-tree with that of nets i and j. Since the three nets will meet at the same T- or Π-switch, every pair from $F_i$, $F_j$ and $F_k$ must differ by at least one, since tracks with the same ID end at the same T- or Π-like connection of the switch and cannot be routed independently. We see that we already have a set of inequalities on the tracks to which the nets are assigned.

We examined several schemes for expressing the whole detailed routing problem as an ILP problem. The first scheme tried to express the inequalities by using as unknowns the track IDs assigned at the lowest level of the H-tree (as in the example given above). This was not enough, since it did not capture the possible track ID change at a Π-switch. The second approach used as unknowns the track IDs assigned at the highest level of the H-tree. This approach takes advantage of the fact that, given the track a net uses at its highest level (its cross-over point), there is a unique path to both the destination and the source, regardless of the type of switches used along that path. The problem with this scheme is that certain nets may not go all the way up to the highest level of the H-tree (the root), and their existence significantly complicates the track ID assignment and the associated inequalities.

The approach that we finally chose for the ILP formulation is the following. We use as unknowns the track IDs a net is assigned at every level of the tree through which the net passes, as well as the routing decision at each Π-switch. Assume i and j are nets, while k specifies a level of the tree. Let $F^k(i)$ and $G^k(i)$ be the track IDs that net i uses at each level k of the source and destination subtrees, respectively. In addition, let $H^k(i)$ and $J^k(i)$ be the routing decisions taken for the net at level k, if a Π-switch exists at that level. This decision is 0 if the left output of the Π-like connection is used, or 1 if the right output is used. Given these variables, the equalities and inequalities are written as follows.

For a single net i: At the highest level m the net reaches, we must have $F^m(i) = G^m(i)$, so that the two paths to destination and source connect. If at level k we have a Π-switch, then $F^{k+1}(i) = H^k(i)(2F^k(i) + 1) + (1 - H^k(i))(2F^k(i))$. If a T-switch is used, then $F^{k+1}(i) = F^k(i)$. Similar equations hold for $G^{k+1}(i)$ (use $G^k(i)$ instead of $F^k(i)$ and $J^k(i)$ instead of $H^k(i)$). In addition, at any level k, $F^k(i)$ and $G^k(i)$ are bounded by the number of tracks (wires) $W_k$ at that level ($1 \le F^k(i) \le W_k$). $W_k$ is a constant for a given routing architecture.

For a pair of nets i, j: We check whether their paths from the source/destination to the highest level they reach overlap at any T- or Π-switch. If not, no inequality needs to be written for these nets. Suppose, for instance, that their source paths meet at level k. Then we require that $F^k(i)$ and $F^k(j)$ differ by at least one, since the nets cannot share the same wire. Thus we have the inequality $|F^k(i) - F^k(j)| \ge 1$. Similar inequalities can be expressed if their paths to source or destination meet at any level. The absolute values can be eliminated from the inequalities if we use modulo arithmetic.

The resulting system of equalities and inequalities is an ILP problem where the real unknowns are $F^0(i)$, $G^0(i)$, $H^k(i)$ and $J^k(i)$, for all nets i and all levels k at which Π-switches are used. Future work for this approach is to examine whether such ILP systems can be solved in polynomial (or otherwise reasonable) time. This approach has the potential of giving a detailed routing in a given routing network, though it does not seem applicable to deriving bounds on the mapping ratio.
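The per-net equalities and the pairwise inequalities of this section can be illustrated with a small sketch that propagates a track ID up the tree and tests two nets for conflicts. Note that the Π-switch update expands to $H^k(i)(2F^k(i) + 1) + (1 - H^k(i))(2F^k(i)) = 2F^k(i) + H^k(i)$, which is linear in the unknowns. The function names, the `"T"`/`"PI"` encoding of switch types, and the use of 0-based track IDs (so that $2F$ and $2F + 1$ stay in range when the channel width doubles across a Π-switch) are assumptions made for this illustration and are not part of the formulation above.

```python
def next_track(track, switch, decision=0):
    """Track ID one level up: F^(k+1) = F^k for a T-switch,
    F^(k+1) = 2 * F^k + H^k for a Pi-switch (H^k is 0 for left, 1 for right)."""
    if switch == "T":
        return track
    if switch == "PI":
        if decision not in (0, 1):
            raise ValueError("routing decision must be 0 or 1")
        return 2 * track + decision
    raise ValueError("unknown switch type: %r" % (switch,))


def track_ids(base_track, switches, decisions):
    """Track IDs F^0, F^1, ..., F^m of one net along its path up the tree.

    switches[k] is the switch type between levels k and k+1, and
    decisions[k] is the routing decision there (ignored for T-switches).
    """
    ids = [base_track]
    for switch, decision in zip(switches, decisions):
        ids.append(next_track(ids[-1], switch, decision))
    return ids


def conflict_free(ids_i, ids_j, shared_levels):
    """True if |F^k(i) - F^k(j)| >= 1 at every level k where the paths meet."""
    return all(abs(ids_i[k] - ids_j[k]) >= 1 for k in shared_levels)
```

For instance, a net starting on track 1 that passes a Π-switch (taking the right output), a T-switch, and another Π-switch (taking the left output) uses tracks 1, 3, 3 and 6 at successive levels.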

3.4 Other Approaches

In this final section we briefly discuss additional approaches to the detailed routing problem which we have not had time to consider at length.

3.4.1 Graph Embedding

The problem of routing signal paths through an arbitrary graph, studied above using multi-commodity flow, is studied in network circles as the problem of embedding one graph in another. There are numerous results in the literature about embedding graphs in meshes and stars, two popular network topologies. Few concrete results are found for our case of fat-tree routing, as few commercial or research architectures have been based on H-tree routing. An analysis of embedding should consider specific properties of the fat-tree routing graph, for instance its small degree or cycle structure, as well as possible restrictions on the embedded netlist, such as bounded fan-in and fan-out.

The desire to bound a mapping ratio $W_d/W_g$ in detailed routing in effect seeks to bound the area penalty for embedding an arbitrary netlist into a given routing architecture with respect to an implementation with optimal area. Leiserson showed in [11] that it is conversely possible to bound the time penalty for embedding an arbitrary netlist into a fat-tree of area equal to that of the area-optimal implementation. This result, however, is not directly applicable to our study of detailed routing.

3.4.2 Constrained Edge Coloring

We saw above (section 2.2) that formulating the detailed routing problem as edge-coloring a netlist captures routing constraints at C-boxes but does not capture routing constraints at S-boxes. It may be possible to extend the edge-coloring formulation by adding artificial edges or nodes that express S-box constraints and prevent certain edges from being assigned identical colors, perhaps using non-trivial gadgets.

The primary appeal of edge-coloring is the strong bound on the chromatic index due to Vizing, namely that any graph of degree $\Delta$ can be edge-colored using $\Delta$ or $\Delta + 1$ colors [15], [9]. The degree of a netlist can be reduced and bounded by transforming high-fan-out logic into a tree-structured design with bounded fan-out [7]. Hence it would be possible to arbitrarily reduce the number of track domains (colors) needed to detail-route a design at the cost of adding some LUTs. Such a trade-off is very appropriate in modern VLSI technology, where wires dominate chip area and additional logic or LUTs are cheap.
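As a point of comparison for the bound above, the sketch below edge-colors a small netlist graph greedily, reading nodes as LUTs, edges as two-point nets and colors as track domains. It is not Vizing's constructive procedure: a greedy pass may need up to $2\Delta - 1$ colors rather than $\Delta + 1$. The function names are hypothetical, and the graph is assumed to be simple (no parallel edges).

```python
from collections import defaultdict


def greedy_edge_coloring(edges):
    """Give each edge the smallest color not yet used at either endpoint."""
    used = defaultdict(set)  # node -> colors already on its incident edges
    coloring = {}
    for u, v in edges:
        color = 0
        while color in used[u] or color in used[v]:
            color += 1
        coloring[(u, v)] = color
        used[u].add(color)
        used[v].add(color)
    return coloring


def max_degree(edges):
    """Maximum node degree (Delta) of the graph given as an edge list."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return max(degree.values())
```

On the four-edge graph a-b, b-c, c-a, a-d, which has maximum degree 3, the greedy pass uses 3 colors.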

References

[1] V. E. Benes. Mathematical Theory of Connecting Networks and Telephone Traffic. New York: Academic, 1965.
[2] Andre DeHon, "BRASS RC array for CS254, fall 1997," U. C. Berkeley graduate course material, September 1997.
[3] Fanica Gavril, "The intersection graphs of subtrees in trees are exactly the chordal graphs," J. Combinatorial Theory, vol. B16, pp. 47-56, 1974.
[4] Fanica Gavril, "Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph," SIAM J. Comput., vol. 1, no. 2, pp. 180-187, June 1972.
[5] M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. New York: Academic, 1980.
[6] Ronald I. Greenberg, "The fat-pyramid and universal parallel computation independent of wire delay," IEEE Trans. on Computers, vol. 43, no. 12, pp. 1358-1364, December 1994.
[7] H. J. Hoover, M. M. Klawe, N. J. Pippenger, "Bounding fan-out in logical networks," J. ACM, vol. 31, no. 1, pp. 13-18, January 1984.
[8] Stefan Hougardy, "Inclusions between classes of perfect graphs," http://www.informatik.hu-berlin.de/~hougardy/paper/classes.html.
[9] Tommy R. Jensen, Bjarne Toft. Graph Coloring Problems. New York: John Wiley & Sons, 1995.
[10] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, Inc., 1992.
[11] Charles E. Leiserson, "Fat-trees: universal networks for hardware-efficient supercomputing," IEEE Trans. on Computers, vol. C-34, no. 10, pp. 892-901, October 1985.
[12] Charles E. Leiserson, "VLSI theory and parallel supercomputing," presented at Caltech Decennial VLSI Conference, May 1989.
[13] B. S. Panda, "New linear time algorithms for generating perfect elimination orderings of chordal graphs," Information Processing Letters, vol. 58, pp. 111-115, 1996.
[14] Sadiq M. Sait and Habib Youssef. VLSI Physical Design Automation, Theory and Practice. New York: IEEE Press, McGraw-Hill Intl. Limited, 1995.
[15] V. G. Vizing, "The chromatic class of a multigraph" (in Russian), Kibernetika (Kiev), no. 3, pp. 29-39, 1965. English translation in Cybernetics, vol. 1, pp. 32-41.
[16] Yu-Liang Wu, Shuji Tsukiyama, Malgorzata Marek-Sadowska, "Graph based analysis of 2-D FPGA routing," IEEE Trans. Computer Aided Design, vol. 15, no. 1, pp. 33-44, January 1996.
[17] Yu-Liang Wu and Douglas Chang, "On the NP-completeness of regular 2-D FPGA routing architectures and a novel solution," Proc. IEEE/ACM International Conference on Computer-Aided Design, San Jose, California, November 6-10, 1994, pp. 362-366.
[18] S. Skiena. The Algorithm Design Manual. New York: Springer-Verlag, 1998.
[19] P. Klein, S. Plotkin, C. Stein, E. Tardos, "Fast approximation algorithms for the unit capacity concurrent flow problem with applications to routing and finding sparse cuts," in Proceedings of the 22nd Annual Symposium on Theory of Computing, 1990.
[20] T. Leighton, F. Makedon, S. Plotkin, C. Stein, E. Tardos, S. Tragoudas, "Fast approximation algorithms for multi-commodity flow problems," Journal of Computer and System Sciences, April 1995.
[21] A. Goldberg et al., "An implementation of a combinatorial approximation algorithm for minimum-cost multi-commodity flow," Integer Programming and Combinatorial Optimization, 1998.