High Performance Dynamic Resource Allocation for Guaranteed ...

1 downloads 0 Views 12MB Size Report
improved success rate due to parallel multi-slot multi-path search mechanism; (3) .... good scalability, but the drawback is the lack of the global knowledge, e.g. if ...
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2017.2765825, IEEE Transactions on Emerging Topics in Computing 1

High Performance Dynamic Resource Allocation for Guaranteed Service in Network-on-Chips Yong Chen, (Student Member, IEEE), Emil Matus, Sadia Moriam, Gerhard P. Fettweis, (Fellow, IEEE) Technische Universität Dresden, Vodafone Chair for Mobile Communications Systems

Abstract—This paper proposes a dedicated connection allocation unit - the NoCManager - implementing the connection allocation functionality in circuit-switched network-on-chip (NoC) based on time-division-multiplexing (TDM). The NoCManager employs a novel trellis-search-algorithm (TESSA) that solves the allocation optimization problem by making use of dynamic programming approach. This enables to explore all possible paths between source-destination node pairs in order to determine the shortest available path. Three different trellis structures are proposed and analyzed for the purpose of different application scenarios. In contrast to previous TDM allocation approaches, the proposed method offers the following advantages: (1) hardware supported fast and high-throughput allocation mechanism; (2) improved success rate due to parallel multi-slot multi-path search mechanism; (3) selection of the contention-free shortest path with a guaranteed low latency; (4) general mathematical formulation allowing a variety of optimization ideas. The proposed method is compared to the state of the art centralized and distributed techniques under uniformly distributed random traffic as well as real-application traffic. The experimental results demonstrate two orders of magnitude improvement in allocation speed and tens of times higher success rate against the centralized software solutions, and 5% to 10% higher success rate against the centralized hardware solution. Moreover, it achieves up to 8x higher allocation speed and up to 29% higher success rate against recently proposed distributed solution. Index Terms—Circuit Switching; Time-Division Multiplexing Network-on-Chip; Guaranteed Services; Connection Allocation; Hardware Accelerator

I. I NTRODUCTION As the Multiprocessor System-on-Chip (MPSoC) becomes more and more complex, traditional bus-based interconnects become limited in terms of efficiency and performance. The Network-on-Chip (NoC) has emerged as a promising scalable solution to the interconnection problem [1]–[3]. The flit, or flow control digit, is the elementary unit of information delivery in NoC. In modern complex SoCs, many applications may have specific performance requirements, such as a minimum throughput (for real-time streaming data) or bounded latency (e.g. for interrupts, process synchronization, etc). Therefore, providing Guaranteed Services (GS) in terms of bounded latency and bandwidth is crucial for design of predictable systems [4]. Circuit Switching (CS) is the frequently adopted technique enabling GS, which first allocates exclusively channels to form a circuit i.e. the connection from source to destination, followed by sending data along the established connection, as e.g. proposed in MANGO [5]. However, since the resource is exclusively occupied during the entire lifetime of the connection, it may lead to considerable system inefficiencies due to the blocking of resource for other traffic flows.

Slot table 1 slot o2 0 1 i3 2 i0 3 i0 Slot table 0 slot o1 0 i0 1 2 3

O1 i1 i0

b O0

R1

O1 i1 O2

i0

i2

O0

i3 O3

a

O0

R0 i3 O3

R2

i2

i3 O3

O1 i1 i0

O2

O1 i1

O2

i0

i2

O0

O2

R3

Slot table 2 slot o2 o3 0 i0 1 2 i0 3 i0

R: Router O: Output port i: input port

i2

i3 O3

Fig. 1. Contention-free TDM CS routing. Each flow has a dedicated connection. The two color allocations in slot table are assigned to two flows specifically as: red to flow a, purple to flow b.

In order to improve the resource utilization, two extensions of CS have been introduced: i) Time-Division-Multiplexing (TDM) and ii) Space-Division-Multiplexing (SDM). In TDM CS, the link capacity is split into multiple time slots, and a subset of link time slots is allocated to a specific connection according to the bandwidth requirements as e.g. adopted in parallel probe search [9], AEthereal [10], [11], Nostrum [12], etc. Analogically, the link is composed of multiple physical wires in SDM [6]–[8], and subset of the link wires is exclusively allocated to a specific connection. In the rest of this paper, the focus is primarily on GS for TDM NoCs, however, the proposed methods can be easily extended to SDM NoCs. In TDM-CS NoCs, the slot allocation information of each router’s output port associated with a output link is stored in a slot allocation table of size S (for the rest of this paper, S is used as the slot table size). Slot table contains slot assignment to a specific input port that represents the switching rule of router. The allocation tables along the specific connection are synchronized such that a flow occupying slot s on the input port of a specific router gets a slot (s+1) mod S at output port as illustrated in Fig.1. In this example, the network contains four routers R0, R1, R2, and R3, and TDM assumes four time slots per link i.e. S = 4. In addition, the figure defines two flows labeled a and b and associated routing configuration stored in the slot allocation tables along the connection. The slot table 0 switches flow a from input port i0 of R0 to output port O1 at slot 0. The connection of flow a is established along path R0 → R1 → R2 with slot sequence {0, 1, 2}. In terms of the bandwidth guarantee, for example, if flow b requests half of the link bandwidth, if the slot table size is four, two slots are assigned to flow b. Limited amount of TDM NoC resources (i.e. TDM slots) poses a challenge to their efficient usage while ensuring the

2168-6750 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2017.2765825, IEEE Transactions on Emerging Topics in Computing

provisioning of requested capacity for all flows. This becomes particularly critical for highly loaded networks. In addition to this, the resource allocation problem has exponential complexity with the path length, which makes it difficult or even impossible for large-scale networks or dynamically reconfigurable application scenarios. This calls for powerful and lowcomplexity connection allocation methods that comprise i) fast and efficient selection of a contention-free path between source-destination pair and ii) allocation of the appropriate slots and corresponding resources on the path. However, recently proposed allocation approaches either suffer from long allocation time (software search approach, unidirectional search) or limited allocation success rate (supports only minimal path search) or limited scalability. This paper tackles this problem by proposing a dedicated connection allocation unit - the NoCManager (NoCM). The NoCM employs a novel trellis-search-algorithm (TESSA) that solves the allocation optimization problem with linear complexity by making use of dynamic programming approach [14]. This enables to explore all possible paths between sourcedestination node pairs within a guaranteed low latency, i.e. at most 2H clock cycles for H hops. The found path is ensured to be the contention-free shortest available path. Three different trellis structures are proposed and analyzed for the purpose of different application scenarios: unfolded, folded and bidirectional. Moreover, the novel search approach in conjunction with efficient hardware implementation of NoCManager reasonably improves the overall performance against recently proposed methods. Furthermore, the mathematical formulation for the system is proposed, which allows a variety of criteria to be applied to obtain the global optimal results. 
In addition to this, in order to mitigate the scalability issue, the partitioned architecture is proposed that divides the system into multiple partitions served by multiple local NoCMs. The paper is organized as follows. The next section provides overview of related work. In Section III a brief overview of the system model is presented. In section IV three different NoCManager architectures are explained in details. Section V presents and compares synthesis results of different NoCManagers and section VI presents experimental results against previous centralized and distributed approaches. Section VII proposes the partitioned architecture to address the scalability issue. Finally, section VIII concludes the paper. II. BACKGROUND AND RELATED WORK The allocation techniques can be grouped into two categories: i) static (design-time) allocation [15], [18]–[20] and ii) dynamic (run-time) allocation. Since the static allocation is done at the design time and cannot be changed according to the applications’ requirements during run time, they are not well suited for dynamic systems. The dynamic connection allocation techniques can be divided into two categories: i) centralized [14], [16], [21]–[28] and ii) distributed allocation [9], [10], [29], [30]. The state of the art of distributed allocation is parallel probe search [9], in which the source node sends a setup flit for searching path that traverses through the NoC along

minimal paths to try to reach target node. The disadvantage of this approach is that several trials for success might be needed since this method investigates only single slot at a time. Moreover, the problem of multi-slot allocation was not addressed in recent work. The distributed allocation has good scalability, but the drawback is the lack of the global knowledge, e.g. if there are several searches at the same time, they might block each other. Furthermore, distributed approaches are usually constrained to search minimal paths, thus the path search diversity is limited. In the centralized allocation scheme, a central manager is responsible for the connection allocation. Since the central manager has the global knowledge of the system, it could achieve global optimal results. The centralized allocations are typically based on software solution. The authors in [16], [17] e.g. utilize Microblaze processor while ARM processor is employed in [24], [25]. Software solutions provide excellent flexibility, however, they might suffer from excessively long allocation time to support real-time systems. For instance, single path exhaustive path-search in [16], [17] tries to add links to the current path if the link provides sufficient slots and is closer to destination. If all links of current node fails, it falls back to the previous node and tries another direction. Due to investigation of a single link at a time and allocation of all required slots on a single path, thousands of processor cycles are required for single allocation. In [26], the GS monitoring and management is performed at the operating system level (microkernel). Depending on the NoC performance, two run-time GS techniques can be dynamically adopted: 1) flow priority adaptation or 2) establishment of CS. A flow requesting a soft latency guarantee will be assigned to higher priority. If it requests for hard latency and throughput guarantee, a CS connection will be allocated. 
Because of software implementation, thousands clock cycles are necessary for single GS request processing. To increase the allocation speed, HArdware Graph ARray (HAGAR) approaches were proposed in [27], [28], in which a dedicated hardware connection allocator is used to speedup the allocation by two orders of magnitude against software methods. In HAGAR, the connection allocation problem is solved as a shortest path problem in a graph representation of the NoC. However, HAGAR is employed for basic CS, and does not support link sharing techniques such as TDM. A hardware root unit which uses breadth-first searching algorithm to search path is proposed in [22], which has excellent path search performance. To improve the resource utilization, it combines the circuit-switched and the packet-switched networks, by using unused circuit resources to transfer packet-switched data. However, it is restricted to searching only minimal paths, so that there is no possibility to detour when there is no minimal path available. Moreover, in both HAGAR and breadth-first search approaches, the search always starts from source node, and thereby a bidirectional search that starts simultaneously at both source and destination nodes (explained in section IV) is not employed. Although the centralized system has the advantage of global knowledge and high performance, as the network size grows with higher rates of allocation requests, the central unit may

2 See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 2168-6750 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2017.2765825, IEEE Transactions on Emerging Topics in Computing

3. Connection

Des

4. Connection release

Src

1. Connection request

NoCM

2. Al 2 Allocation info Allo nf

(a) System Model of the NoC Src

NoCM

Des

Connection request

Path search Allocation info

GS data GS connection NoC

NoC

(b) Request processing procedure Incoming req

Request queue

Path search

Allocation info

Link state memory

(c) Block diagram of the request processing procedure in NoCM. ‘Link state memory’ stores the real-time state of links. Fig. 2. Proposed system model of the NoCM based NoC platform

become the bottleneck. In this paper, we propose a dedicated allocator, NoCManager employing the TrElliS Search based Allocation (TESSA) algorithm for TDM CS NoCs. The trellis search can search non-minimal as well as minimal paths, and can find the desired shortest path within a guaranteed low latency. The bidirectional search is proposed to halve the search time without additional resource cost, and the partitioned architecture is proposed to address the scalability issue of large NoC size. Finally, an algebraic formulation for the TESSA system is proposed to allow possible optimizations. III. S YSTEM M ODEL The system model of a dedicated allocator (i.e. NoCManager) based NoC architecture is illustrated in Fig. 2. The NoCManager (NoCM) attempts to allocate the appropriate connections when it receives connection requests. We assume source routing, i.e. the hop by hop route to the destination is embedded in the packet header by the source node. Three bits are needed for this to indicate the 5 output port directions (east, west, north, south and local). As soon as the source node receives the allocation information from NoCM, the data transmission can start. The packets are

inserted into the network only at specific time slots, which is regulated by the source node, same as in Aelite [31]. There are three possible schemes for the communication between NoCM and source nodes: over a dedicated network separate from the NoC or via dedicated wires or using the existing NoC. In order to achieve high allocation speed, we assumed that the source node sends the connection request to NoCM via dedicated wires. In mesh network, for each source node, log(M ) bits wires are needed to indicate which node is the destination, where M is the number of nodes in the NoC. With the introduction of the partitioned architecture concept that divides the large system into multiple small logic partitions each having an own manager (explained in section VII), each NoCM only manages and connects a limited number of nodes in its local region, so the overhead of dedicated wires is greatly reduced. The allocation information from NoCM to the source is delivered via existing NoC as a GS packet to guarantee delivery delay. The complete procedure for connection allocation is as follows: 1) The source node sends the connection request to the NoCM. These requests are buffered in a request queue in NoCM. 2) NoCM processes requests within the request queue, and attempts to allocate the appropriate path and slots. 3) NoCM sends the allocation information to source node in the case of success or retries later if it fails. A failed request is retried only when the total time has not exceeded certain timeout, otherwise it is discarded. 4) After receiving the allocation information, source node starts to transmit data along the allocated connection. 5) After the data transfer is finished, the source node deletes the corresponding allocation information and informs the NoCM to free the corresponding allocated slots. 
The allocation information indicates the specific time slots at which the data at the source should be injected into the NoC, and the information for each hop to indicate which output port to go. Step 2 is the main work of this paper, and is explained in details in the next section. IV. N O CM ANAGER A RCHITECTURE The NoCManager solves the shortest path problem in a trellis graph description of the NoC in order to find the shortest free path and allocate slots between source-destination nodes. A block diagram of the NoCM is shown in Fig. 3 and comprises a trellis path search module and link state memory. NoCM collects and processes the incoming connection requests within the request queue. The resulting allocation parameters are sent through the NoC to respective source node. A. Formalizing Trellis Graph Structure The aforementioned shortest path search problem has exponential complexity with the length of the paths, and the exact complexity function depends on the network topology. For a mesh network, if non-minimal paths are allowed, at each hop we have the choice of going in 4 directions. Assume l is the distance between source and destination and d is the number

3 See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 2168-6750 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2017.2765825, IEEE Transactions on Emerging Topics in Computing

To NoC

From NoC

Incoming req

Retry queue

2

n-1

n

0

0

0

0

1

1

1

1

Incoming release

GS request queue

Path deallocate Free Link & Slot

Discard Yes Retry again

P2,n-1

2 Retry deadline?

No

stage 1

Path search failed

Trellis Path search

Succeed

stage n mod S

B2,3,n

2 Path metric

Allocation info

3

3

3

3

Fig. 5. n stages of a trellis graph of the example network. Assume the branch from state 2 to state 3 at stage n produces the minimal path metric.

Fig. 3. Block diagram of the NoCManager

2

2

P3,n= P2,n-1+B2,3,n

Deactivate

TESSA Unit

0

2

link state memory

Time slot index (n+1) mod S

0

0

1

1

2

2

3

3

1

3

(a) Example network graph. (b) A trellis graph represents the A node can reach itself network graph (curve arrow). Fig. 4. Network graph represented by trellis graph

of allowed detours, then the approximate complexity function would be O(4l+d ). This problem can be efficiently solved by dynamic programming - an optimization approach that breaks down the complex problem into a sequence of simpler problems which are solved stage by stage, thus reducing the computation complexity to linear. This work is motivated by the Viterbi algorithm, which uses the dynamic programming approach for sequence estimation in the communication domain. We adapt the principles of Viterbi algorithm for efficient path search in NoC, called trellis path search algorithm. The successive path traversal from the source to the destination during the path search is represented by trellis graph. Thus the trellis graph is a time indexed version of the NoC graph (Fig. 4). 1) General Model: There are five most important characteristics of the trellis graph, which are discussed below: 1) Stages: The network traversal mapped on to trellis graph is structured into multiple stages (Fig. 5), which is solved sequentially stage by stage. The stage (i.e. the column) of the trellis graph contains all the nodes of the network and represents a single hop traversal through the network. We refer to the “decision stages”, referring to the number of stages which have to be traversed to

make a decision, excluding the first stage, since the first or starting stage does not require any decision making. By default, the number of the “decision stages” is 2N −2 for N ·N mesh network, which is equal to the longest minimal path (i.e. the longest possible path of all minimal paths) in the network. The stages have time implications to represent different time slots associated with different hops. 2) States: Each node of the trellis graph is called a state, and summarizes the knowledge in order to make the current decisions. At each stage, the decision in a particular state is determined simply by choosing one and only one of the active incoming branches as survivor path. 3) State transitions: The forward progress from one allowable state at a stage to another allowable state at the next stage in one unit of time (i.e. a time slot), is called a state transition. In the trellis graph, this is represented by a directed edge (or branch) connecting the two states. The state transition acts as the link between two respective routers corresponding to a forward hop in the network towards the destination. 4) Branch metric: The branch metric (B) is a measure of the transition that reflects the value (importance) of the branch. It is a function of several variables, such as the available slots of the branch (a), the number of requested slots (r), and the weight (w) that reflects the priority of the branch, etc. The information of available number of slots and the requested number of slots can be used to balance network load. The fewer the available slots the branch can provide and the more the number of slots are requested, the larger the branch metric will be, indicating that the inferior the branch is. When the number of requested slots exceeds the number of available slots that branch can provide, the branch metric becomes infinity, indicating that branch cannot satisfy the request and will be discarded. 
The function of the branch metric is as follows: { f (r, a, w), r 6 a B= +∞, r>a

4 See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 2168-6750 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2017.2765825, IEEE Transactions on Emerging Topics in Computing

An example of branch metric function might be:   1 r·w, r 6 a B= a  +∞, r>a 5) Path metric: The path metric is the minimal accumulated branch metric over the shortest path from the initial state to the current state. For each state, the incoming branch that produces minimal path metric is selected as the survivor path. The branch metric for a transition from state i to state j at stage n is defined as: Bi,j,n Pj,n defined as the path metric for state j at stage n, and Sj is the set of states that have transitions to state j, then: Pj,n = min [Pi,n−1 + Bi,j,n ] i∈Sj

For the trellis graph shown in Fig. 5, P3,n = min{P1,n−1 + B1,3,n , P3,n−1 + B3,3,n , P2,n−1 + B2,3,n } If the branch from state 2 produces the minimal path metric, the path metric of state 3 at stage n will be P3,n = P2,n−1 + B2,3,n The goal of the shortest path problem is to find the path between source and destination node with the minimal path metric. At the last stage, the survivor path with the minimal path metric is the desired path. 2) Path Search Model Simplification: The previous section presents the general formalized model of trellis graph. However, the branch metric can be simplified. For the rest of the paper, we only consider the simple case: the branch metric can only have two possible values, either 1 or infinity: { 1, r6a B= +∞, r>a Therefore, the accumulation operation of branch metric in path metric can be omitted. As long as the branch metric is infinity, that branch will be discarded; if the branch metric is 1, that branch might be selected. In our system, every slot at the initial stage has such a representation of trellis graph, i.e. if the slot table size is S, there are S representations of trellis graph. Thus every slot from the initial stage has its own trellis graph and can search its own path in parallel with the others. Consequently, during a search, at each stage we only need to know the branch state at a specific slot. Since the branch at each slot only has two possible status, either free or unavailable (i.e. already allocated), the branch metric of each slot can be simplified as: { 1, branch is free B= +∞, branch is unavailable

It is worth to mention that although in this paper we restrict the analysis of TESSA to 2D-mesh topology, it is applicable to any other topology. Generally, TDM NoCs require global synchronization i.e that all routers share the global time notion. However, it can also be applied to mesochronous or even asynchronous networks, as in aelite [35]. In mesochronous network, the routers have the same nominal frequency but different phase relationships. We can put bi-synchronous FIFOs between neighboring routers, and also allocate a time slot for the link traversal. The FIFO adjusts the differences in phase between the writing and reading clock. In asynchronous network, a synchronization token can be used for handshaking. In our system, due to the partitioned architecture that divides the large system into small partitions (section VII), we can adopt synchronized clock inside each small partition. Between neighboring partitions, we can put the bi-synchronous FIFOs to adjust the clock phase difference. B. Three Trellis Path Search Structures In this section, three different trellis structures are presented and investigated: unfolded trellis [21], folded trellis [14] and bidirectional trellis. In general, the shortest path search in trellis by the NoCM comprises two steps: 1) Forward search i.e. traverse the NoC from source node to find the best free path to the destination node. 2) Backtracking i.e. sort out the saved survivor path from destination to backtrack the shortest path and collect the associated path and slot allocation parameters. 1) Unfolded Trellis Search: Since the NoC topology and size is known at design time, the respective trellis graph can be constructed as e.g. for 2x2 NoC illustrated in Fig. 6a the associated trellis graph is shown in Fig. 6b. Assume node 0 is the source (Src) and node 3 is destination (Des). The search signals propagate from Src at the first stage through the trellis to try to activate its connected neighbors at next stage. 
Src activates its connected neighbors node 1 and node 2 at the second stage, and then node 1 and node 2 try to activate their connected neighbors at the third stage. Assume the edge N 1 → N 3 at second stage is already occupied, so node 1 cannot activate node 3. During the forward search, if a node is activated by several nodes at the same stage, only one is remembered as its predecessor. When the Des is active, backtracking is started from Des to backtrack predecessors in order to collect the path information. Node 3 backtracks to its predecessor node 2 at the second stage, and node 2 backtracks to node 0 at the first stage. Now the path from Src to Des is obtained as N 0 → N 2 → N 3. Assume the starting slot at Src is n, then we can obtain the slot sequence along the path as {n, (n + 1) mod S, (n + 2) mod S}. 2) Folded Trellis Search: The proposed unfolded trellis path search algorithm exhibits regular structure and, hence, can be efficiently mapped on to a folded architecture. In such case, the hardware resources are reused by all partitions of folded algorithm. The folded architecture requires additional output registers in order to hold the values of intermediate results to be used as input values in the next iteration. The folded

5 See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 2168-6750 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2017.2765825, IEEE Transactions on Emerging Topics in Computing

stage

Src

n mod S

Time slot index

(n+1) mod S

(n+2) mod S

0

0

0

Forward search

1

1

1

Backtracking Link unavailable, search failed

1

0

2

2

2

Des

3

2

3

3

3

(a) Example NoC graph

(b) Unfolded Trellis Search Search meeting check

Src

Src

reg

0

0

1

1

2

2

0

0

0

1

1

1

2

2

2 Des

Des

3 3

3

3

3

(c) Folded Trellis Search

(d) Bidirectional Trellis Search

Fig. 6. a)2x2 2D-mesh example NoC; b)schematic structure of the unfolded trellis Search for the example NoC; c)schematic structure of the folded trellis Search; d)schematic structure of the bidirectional trellis Search

path search algorithm of example in Fig. 6b is illustrated in Fig. 6c. There is a register at each node to store which predecessor activates it, and it only stores the predecessor that activates it first. When a node is active, in next cycle it will forward the search signal to its first stage node, and does the propagation search again. Note now the total search costs multiple cycles, i.e. one cycle per iteration. The search can be stopped in two cases: i) either the target node has been activated with sufficient bandwidth or ii) there are no new nodes being activated during the search any more. Hence, in this manner livelock is avoided. In the example shown in Fig. 6(c) the search signals start from Src and activate node 2. In the next cycle, node 2 continues to activate node 3. The backtrack (shown in red) starts from destination node 3 and sorts out nodes in sequence N 3 → N 2 → N 0. Therefore, the path from Src to Des is acquired as N 0 → N 2 → N 3. 3) Bidirectional Trellis Search: The path search presented in the previous sections is started at one side, i.e. from source node to target node. However, it is possible to start the path search from both the source and destination sides simultaneously. If the two searches meet in the middle, then the path search has been successful. The bidirectional path search algorithm of example in Fig. 6b is illustrated in Fig. 6d. The search from Src activates node 2, and the search from

Des also activates node 2. At the middle stage, the search signals from Src and Des meet at node 2, which means the search is successful. The backtrack starts from node 2 to Src and Des simultaneously. The path from Src to Des is obtained as N 0 → N 2 → N 3. In bidirectional search, the critical path is halved while the area stays almost the same. In the proposed approaches, each slot searches its own path in parallel and can set up multiple paths. So the communication flow is split over multiple paths, which increases the success rate significantly [23]. C. Trellis Path Search Implementation This section presents the implementation details of the unfolded trellis and the folded trellis. The Detect-Select-Shift (DSS) Unit is the core module that implements the function of state in the trellis graph, which evaluates the propagated search signals from previous stages and generates bit-vector flags representing slot availability on specific links. The knowledge about the actual link allocation state is stored in the ‘Link State’ register. When a link is allocated at a specific slot, its corresponding state register is set to ‘0’, thereby excluding it from future searches. Correspondingly, state register is set to ‘1’ when the link is released.

6 See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 2168-6750 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2017.2765825, IEEE Transactions on Emerging Topics in Computing

Fig. 7. Implementation details of an example DSS unit of node 3

Fig. 8. Implementation schematic of the unfolded trellis for the example NoC

Fig. 9. Implementation schematic of the folded trellis for the example NoC

The details of the DSS unit of node 3 in the example NoC with a slot table size of 2 are shown in Fig. 7. Since the implementation is identical for every slot, we take one slot as an example. The working flow for each slot in the DSS unit is as follows:
1) Detect the available slot: The search signal from N1 → N3 and its corresponding 'Link State' register bit are connected to an AND gate. If both the search signal and the corresponding link are valid, the current node is activated by this search.
2) Select an active link as the survivor path: The detect signals from the predecessor nodes (i.e. the outputs of the AND gates) belonging to the same slot are connected to an OR gate. If the node can be activated (i.e. the OR gate outputs '1'), one of the active incoming branches is saved in a register as the survivor path.
3) Cyclically shift slots: The slots are cyclically shifted to synchronize with the next hop: the search signal at slot s arrives, after shifting, at slot (s + 1) mod S. The cyclic shift is realized purely by wire connections.
As Fig. 7 shows, the critical path of the DSS unit is only an AND gate followed by an OR gate, which is very short. The DSS unit implementation is the same for both the unfolded and the folded trellis.
1) Unfolded Trellis Implementation: The implementation schematic of the unfolded trellis is shown in Fig. 8. The search begins at the source node by setting all of its slots to logic '1'; the search signals then propagate forward along the edges to the connected neighbors via the DSS units, which check slot availability. The still-valid signals continue to propagate in this way until the end of the trellis, where a register stores which of the nodes could be reached through the NoC.
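The three DSS steps above (detect via AND gates, select via an OR gate plus a survivor register, and the wired cyclic shift) can be modeled per node with bit-vectors, one bit per slot. This is a behavioral Python sketch of Fig. 7, not the RTL; the dictionary-based signal encoding is our own illustration.

```python
def dss_step(incoming, link_state, S):
    """One Detect-Select-Shift step for a trellis node (bit s = slot s).

    incoming:   {pred_name: S-bit search vector arriving from that predecessor}
    link_state: {pred_name: S-bit vector, bit = 1 when the link is free in that slot}
    Returns (outgoing S-bit search vector, survivor predecessor per slot).
    """
    mask = (1 << S) - 1
    # 1) Detect: AND each incoming search signal with its link-state bits.
    detect = {p: incoming[p] & link_state[p] for p in incoming}
    # 2) Select: OR across predecessors; store one active branch per slot.
    active = 0
    for v in detect.values():
        active |= v
    survivor = {}                  # per slot, the first predecessor that fired
    for s in range(S):
        for p, v in detect.items():
            if (v >> s) & 1:
                survivor[s] = p
                break
    # 3) Shift: slot s of the active vector drives slot (s + 1) mod S.
    outgoing = ((active << 1) | (active >> (S - 1))) & mask
    return outgoing, survivor

# Node 3 of the example with slot table size S = 2: the search arrives from
# N1 in slot 0 only, and the N1 -> N3 link is free in both slots.
out, surv = dss_step({'N1': 0b01}, {'N1': 0b11}, S=2)
print(bin(out), surv)   # 0b10 {0: 'N1'}
```

Bit s of the outgoing vector drives slot (s + 1) mod S, matching the wire-only cyclic shift described above.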

In the next cycle, the backtrack starts: if the intended target is active, each selected slot backtracks its own path simultaneously. The path is selected by reading the stored predecessors, starting at the destination node.
2) Folded Trellis Implementation: The search begins at the source node and propagates forward via the DSS unit until it reaches the destination. If the destination node is activated, backtracking starts at the destination node (Des) by reading the stored predecessors. The node read out in the last cycle requests its own predecessor via the multiplexer BK_MUX (Fig. 9). The predecessors are read out in this manner until the source node is reached.

V. SYNTHESIS RESULTS

The NoCManager was implemented in synthesizable Verilog HDL and can be generated from an XML description for different NoC sizes. Using Synopsys Design Compiler, the NoCM was synthesized in TSMC 65 nm technology for mesh NoC sizes from 4x4 to 10x10. In the synthesis, the critical path of folded TESSA was constrained to 1 ns. For unfolded TESSA and unfolded bidirectional TESSA, the critical path constraints were gradually relaxed with increasing NoC size; for example, in unfolded bidirectional TESSA, the critical path was constrained to 1.11 ns for the 6x6 mesh and to 2 ns for the 10x10 mesh. The folded and unfolded architectures can also be combined by folding several stages instead of one, providing a suitable tradeoff between area and performance. We therefore implemented a TESSA variant that folds half of the stages, called half-folded TESSA: in an N×N mesh, we implement N−1 stages, so that the forward search finishes in at most two cycles. The synthesis results of the four architectures are presented and compared in this section.
The performance of the four architectures is compared in terms of area, average area-time (AT) product per allocation, and average energy consumption per allocation. The allocation time per allocation differs between the architectures. In an N×N mesh, the average allocation time per allocation is as follows:
• In unfolded TESSA, two cycles;



• In bidirectional TESSA, two cycles;
• In folded TESSA, 2·(N−1) cycles, because the average path length is N−1;
• In half-folded TESSA, three cycles on average: the search finishes either in the first or in the second iteration, i.e. the allocation takes two or four cycles.
Depending on the hardware structures, the results indicate that the complexity of unfolded and bidirectional TESSA grows with O(S·M·√M) in a 2D mesh (M = #routers, S = slot table size), while the complexity of folded TESSA grows with O(S·M); the factor √M stems from the number of trellis stages, 2·(√M − 1). The synthesis results are shown in Fig. 10, Fig. 11 and Fig. 12. The area results in Fig. 10 show that unfolded and bidirectional TESSA have the highest, and almost identical, area consumption, while the area cost of folded TESSA is the lowest, since it reuses hardware. The AT results in Fig. 11 show that bidirectional TESSA is the best when area and time are considered together, owing to its halved critical path, whereas unfolded TESSA has the worst AT cost. The energy consumption per allocation is shown in Fig. 12; here, too, bidirectional TESSA is the most energy efficient. Folded TESSA is the least energy efficient, since it requires additional registers to store the intermediate results and its decision unit (which determines whether the search continues or stops) must be executed at every hop.

Fig. 10. Area of the different TESSA variants for different NoC sizes and slot table sizes

Fig. 11. Average AT cost per allocation of the different TESSA variants for different NoC sizes and slot table sizes

Fig. 12. Average energy consumption per allocation of the different TESSA variants for different NoC sizes and slot table sizes

VI. EXPERIMENTAL RESULTS

To evaluate the influence of TESSA's multi-path allocation on the allocation success rate, we also realized a single path approach that employs the trellis search algorithm for the path search but allocates all required slots (bandwidth) on a single path. This approach is referred to as the single path trellis. Its trellis structure is the same as TESSA's, except that the required bandwidth is allocated on a single path instead of multiple paths. The allocation speed and success rate of the TESSA NoCManagers are compared to previous centralized and distributed allocation techniques for different NoC sizes and slot table sizes under synthetic uniform random traffic as well as the real-application Splash-2 benchmarks. These results are explained in the following sections. For the evaluation, several performance metrics were considered:
• success rate: the ratio of successful requests that established paths with sufficient bandwidth to the total number of requests.
Metrics for the comparison with the centralized technique:
• background traffic: a certain percentage of slots that are already randomly marked as occupied, excluding them from the path search, as in [16].
• allocation time: the time the algorithms need to find a solution or to determine that the allocation is not possible.¹
Metrics for the comparison with the distributed technique:
• total allocation time: the time (in nanoseconds) the algorithms need to find the solution, plus the time to send the allocation information to the source node.
• GS connection rate: the portion of clock cycles in which each master is actively sending GS data.² It is computed per master as
GS connection rate = (#GS connections requested · #flits sent per connection) / (#simulation cycles).

¹ The original allocation time in the comparable references is measured in clock cycles; it is converted into nanoseconds in this paper.
² The metric GS connection rate in this paper is the same as the offered load in [9].
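The per-master GS connection rate metric is a one-line computation; the sketch below uses illustrative traffic numbers of our own choosing, not measurements from the paper.

```python
def gs_connection_rate(connections_requested, flits_per_connection, sim_cycles):
    """GS connection rate per master: the fraction of clock cycles in which
    the master is actively sending GS data (the 'offered load' of [9])."""
    return connections_requested * flits_per_connection / sim_cycles

# Illustrative numbers only: 1500 GS connections of 200 flits each,
# observed over a 1,000,000-cycle simulation window.
print(gs_connection_rate(1500, 200, 1_000_000))   # 0.3
```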



Fig. 13. Allocation speed compared to [16] with different background traffic in a 4x4 NoC with a slot table size of 16. The hop count for [16] is adapted to the router-to-router distance.

Fig. 14. Success rate compared to the single path solutions in a 4x4 NoC under different background traffic with a slot table size of 16.

A. Comparison with centralized exhaustive path-search under synthetic traffic

In this section, folded TESSA is compared to the single path, software-based exhaustive path-search running on a MicroBlaze processor (@288 MHz) [16], [17].
1) Comparison of allocation speed: We compare the allocation speed with that of the exhaustive path-search under 0%, 10% and 20% random background traffic (bk) in a 4x4 mesh network. The allocation speed of the exhaustive path-search varies with the background traffic, whereas the allocation speed of TESSA is always the same, regardless of the background traffic. From Fig. 13 we can see that the allocation time of the exhaustive path-search increases linearly with the path length without background traffic (0% bk) and exponentially with background traffic (10% and 20% bk). In comparison, the allocation time of TESSA always increases linearly with the path length. TESSA is hundreds to a thousand times faster than the exhaustive path-search. For 6 hops, TESSA needs 12 cycles (12 ns @ 1 GHz), i.e. 6 cycles for the forward search and 6 cycles for backtracking, while the exhaustive path search requires 8848 ns with 10% background traffic. Hence, TESSA is up to 737 times faster than the exhaustive path-search approach.
2) Comparison of Success Rate: The requests sent to the NoCM are generated as follows: an allocation is requested for every feasible source-destination pair combination under a certain percentage of background traffic. In our experiments, we produce 1000 samples at each background traffic percentage. We simulate 4x4 meshes with requested slots from 1 to 16 under background traffic from 10% to 50%. The multipath folded TESSA, the single path trellis, and the software-based exhaustive path-search [16], [17] are compared in Fig. 14. The search algorithms of the single path trellis and the single path software approach are similar in that the required bandwidth is allocated over a single path, so in scenarios where the software method's results are not provided, they can be expected to resemble those of the single path trellis.³ From Fig. 14 we can see that the success rate of folded TESSA can be several to a hundred times higher than that of the single path trellis, and it is even higher than that of the exhaustive path-search. Under heavy background traffic with high requested bandwidth, TESSA becomes far superior to the two single path solutions. For example, in a 4x4 mesh with 16 requested slots under 20% background traffic, the success rate of folded TESSA is 32x higher than the single path trellis and 49x higher than the exhaustive path-search; this rises to 103x higher than the single path trellis under 50% background traffic.

B. Comparison with distributed parallel probe search under synthetic traffic
In this section, bidirectional unfolded TESSA is compared to the state-of-the-art distributed parallel probe search [9]. All data points shown in the figures are obtained from simulations over 1 million cycles. The masters issue connection requests to the NoCM as uniform random traffic meeting the target GS connection rate. The first and last 100,000 simulation cycles were excluded to avoid transient effects. The connection lifetime, i.e. the number of flits each connection delivers, is set to 100, 200 or 500 flits. During simulation, half of the nodes are assumed to be masters and half slaves; the master nodes are uniformly randomly distributed over the system.
1) Comparison of allocation speed: In parallel probe search, several trials might be needed before success is achieved, because a single slot is investigated at a time. In contrast, bidirectional TESSA searches all slots in parallel, completing the search in two clock cycles independently of the number of slots. Although our design needs additional time to send the allocation information to the source node, the path from the NoCM to the source is itself found in two cycles by the NoCM as a GS path. If the allocation of the GS path from the NoCM to the source node fails, the allocation information is sent to the source as best-effort packets. Because the GS connection rate in this simulation setting is not high (below 0.145 in the 16x16 mesh and below 0.415 in the 6x6 mesh), the corresponding allocation success rate is above 0.999, so the influence of allocation failures on the path from the NoCM to the source is negligible.
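The allocation-time figures used in this comparison (150 probe time slots at 0.5 ns per slot, versus 12 time slots plus 4 clock cycles at a 1.4 ns critical path for TESSA, for the 8x8 mesh with a slot table size of 16 at GS connection rate 0.3) can be checked with a few lines; the constant names are ours.

```python
# Reproducing the allocation-time comparison for the 8x8 mesh with a slot
# table size of 16 at GS connection rate 0.3 (figures from the text).
SLOT_NS = 0.5                    # one TDM time slot; 0.5 ns as in [9]

probe_ns = 150 * SLOT_NS         # parallel probe: 150 time slots on average
tessa_ns = (4 + 8) * SLOT_NS + 4 * 1.4   # 12 time slots + 4 cycles @ 1.4 ns

print(probe_ns, tessa_ns)        # 75.0 11.6
# Relative speedup in percent, truncated:
print(int((probe_ns - tessa_ns) / tessa_ns * 100))   # 546
```

This reproduces the "546% faster" figure quoted for this operating point.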
For example, in an 8x8 mesh with a slot table size of 16 at GS connection rate 0.3, the average total allocation time for a single slot in probe search is 150 time slots (0.5 · 150 = 75 ns).⁴ In TESSA, however, only 2 clock cycles each are needed for finding the requested GS path and for finding the path from the NoCM to the source. Since the NoCM is connected to the center router of the NoC, the NoCM needs an average of 4 time slots to send the allocation information to the source. With a slot table size of 16, there is on average a waiting time of 8 time slots at the NoCM until it gets its turn in the TDM scheme. So in total, on average 12 time slots (= 4 + 8) plus 4 cycles (5.6 ns @ critical path 1.4 ns) are needed, which is 11.6 ns (= 12 · 0.5 ns + 5.6 ns). This is 546% faster than the parallel probe search approach. When n slots are requested, the allocation time for probe search can increase by a factor of n, whereas the allocation time for TESSA stays the same, since it is independent of the number of requested slots. Hence, when more slots are requested, our solution performs even better relative to the parallel probe search approach. Fig. 15 shows the average total allocation time for a single-slot request for different network sizes. When the GS connection rate is low, the allocation speed of TESSA is similar to that of the parallel probe approach. As the connection rate increases, however, TESSA becomes much faster than parallel probe: our approach provides up to 710% higher speed in the 6x6 mesh (@connection rate 0.4), up to 647% higher speed in the 8x8 mesh (@connection rate 0.3), and up to 650% higher speed in the 16x16 mesh (@connection rate 0.14). In parallel probe, beyond the saturation point of the network (connection rate 0.14 in the 16x16 mesh and 0.41 in the 6x6 mesh), the allocation time increases dramatically, and consequently our approach becomes far superior to parallel probe.
2) Comparison of Success Rate: For comparison, we re-implemented the distributed parallel probe search according to Liu's work [9], with a retry deadline of 200 cycles.⁵ In parallel probe search, when several connections are requested simultaneously, the concurrent searches might block each other. It searches only the minimal paths and cannot explore non-minimal paths as TESSA does. It employs a retry-before-deadline policy, which can declare the search a failure before all slots have been investigated, even though available paths might still exist. Owing to these factors, our system achieves a much higher success rate than parallel probe search. Fig. 16 and Fig. 17 show the success rate comparison. From the simulation results we can see that the success rate of our method is higher than that of probe search. For example, in a 6x6 NoC with connection rates between 0.6 and 1.0 and a slot table size of 16, our solution offers up to 26% higher success rate (@connection lifetime of 100 flits). Moreover, our solution offers up to 29% and 24% higher success rate in 8x8 and 16x16 NoCs, respectively. From Fig. 17 we can also see that with more slots (16 versus 8), the success rate increases, owing to the increased path diversity.

Fig. 15. Allocation speed of bidirectional TESSA compared to probe search in different networks at different GS connection rates with a slot table size of 16.

Fig. 16. Success rate of bidirectional TESSA compared to probe search in different networks. Each connection delivers 200 flits.

Fig. 17. Success rate compared to probe search in different networks with 8 or 16 slots. Each connection delivers 100 or 500 flits.

³ The success rate of the single path trellis is higher than that of the software method because the trellis can detour when no minimal path exists, whereas [16] searches only minimal paths.
⁴ A time slot is the routing time per router; in [9] one slot is 0.5 ns. We assume the time slots in both systems are equal.
⁵ In parallel probe, each node has an attached buffer to store incoming requests, with a buffer size equal to the slot table size. The source node keeps retrying a request until it succeeds, the deadline is exceeded, or the buffer is full.

C. Comparison under Benchmarks

In this section, bidirectional TESSA is compared to the centralized breadth-first search [22] and the distributed parallel probe search [9] using the Splash-2 benchmarks [36]. The Splash-2 application suite is a popular set of parallel programs for evaluating architectural ideas for centralized and distributed


shared-address-space multiprocessors. In our evaluation, the Splash programs run on a 7x7 NoC with a slot table size of 10. Data requests and control messages use the packet-switching network, while response messages use the circuit-switched connections. We consider a non-uniform memory access (NUMA) model in which the memory is distributed among the nodes. We perform the evaluation with six Splash applications. From the simulation results in Fig. 18, we can see that our approach provides 5% (Ocean, Radix, LU and WaterSp) to 10% (FFT) higher success rate than the centralized breadth-first search, and 21% (FFT) to 30% (WaterSp) higher success rate than the distributed probe search. The higher success rate stems from our approach's higher allocation speed and from its ability to detour when no minimal path is available.

Fig. 18. Success rate of bidirectional TESSA compared to centralized breadth-first search and parallel probe search under Splash benchmarks.

VII. ADDRESSING THE SCALABILITY ISSUE

Though the centralized NoCManager has many advantages, its main disadvantage is that the centralized manager becomes a bottleneck as the network size grows. To solve this problem, we proposed a partitioned architecture that divides the large system into multiple small partitions, each partition having its own local manager [37]. Each local manager keeps only the status of the nodes in its region and is responsible for searching paths in its local region. The NoCMs are connected with each other in a 2D-mesh network via dedicated links. These local NoCMs can work simultaneously and need to exchange information only when requested connections cross partitions, i.e. when the source and destination nodes are not in the same partition. Since the managers work simultaneously, the computation capacity is enhanced; and since the NoC nodes communicate only with their local managers, the communication overhead is mitigated. If the source and destination nodes of a request are both inside the same partition, the request is handled by the local manager only. Otherwise, as illustrated in Fig. 19, the local NoCM (NoCM_A) first starts the path search from the source node to reach the border nodes (nodes A, B and C), and then forwards the search message to its neighbor NoCMs (NoCM_B and NoCM_C), continuing until the destination's manager (NoCM_D) has been reached. When the destination node is reached, backtracking starts from the destination to select the survivor path. The search among NoCMs is a flood-based minimal path search: the source NoCM sends the search to all of its productive neighboring NoCMs (those that lead closer to the destination); each reached neighbor performs a trellis search in its partition and then forwards the message to its own productive neighbors. Hence, the search is forwarded to the destination along all possible minimal paths.

Fig. 19. The system is divided into four partitions with four local NoCMs. Green arrow: forward search; purple arrow: backtrack.

VIII. CONCLUSION AND FUTURE WORK

In this paper we presented a dedicated connection allocator for TDM CS NoCs. To meet the requirements of diverse scenarios, three different trellis structures were presented and analyzed. Compared to previous centralized and distributed allocation approaches in terms of allocation speed and success rate, our method achieves highly superior performance. There are four main reasons for this:
• The path search problem is solved step by step as dynamic programming, which reduces computation complexity and ensures path optimality (shortest path);
• We allocate slots over multiple paths and search in all directions simultaneously, flooding-style, which makes both the success rate and the search speed much higher than in previous methods;
• The hardware architecture of the NoCManager is efficient, and the critical path of each stage is only an OR gate and an AND gate;
• An algebraic formulation of the system is proposed, which allows complex branch selection criteria to balance the network load and achieve globally optimal results.
Although in this paper the bidirectional architecture is applied only to unfolded TESSA, it can equally be applied to folded TESSA as well as half-folded TESSA. We believe

11 See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 2168-6750 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2017.2765825, IEEE Transactions on Emerging Topics in Computing

the bidirectional half-folded architecture will have the best AT performance. Since the TESSA architecture is adapted from the Viterbi algorithm, optimization ideas for the Viterbi algorithm can potentially be applied to the TESSA approach as well; this is left for future work.

REFERENCES

[1] Dally W J, Towles B P. Principles and Practices of Interconnection Networks. Elsevier, 2004.
[2] Coskun A K, Gu A, Jin W, et al. Cross-layer floorplan optimization for silicon photonic NoCs in many-core systems. DATE, IEEE, 2016.
[3] Chen C, Meng J, Coskun A K, et al. Express virtual channels with taps (EVC-T): A flow control technique for network-on-chip (NoC) in many-core systems. HOTI, IEEE, 2011.
[4] Stefan R A, et al. dAElite: A TDM NoC supporting QoS, multicast, and fast connection set-up. IEEE Transactions on Computers, 2014.
[5] Bjerregaard T, Sparso J. A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip. DATE, 2005.
[6] Lusala A K, Legat J D. Combining SDM-based circuit switching with packet switching in a NoC for real-time applications. ISCAS, 2011.
[7] Lusala A K, Legat J D. Combining SDM-based circuit switching with packet switching in a router for on-chip networks. International Journal of Reconfigurable Computing, 2012.
[8] Liu S, et al. A fair and maximal allocator for single-cycle on-chip homogeneous resource allocation. IEEE Transactions on VLSI Systems, 2014.
[9] Liu S, Jantsch A, Lu Z. Parallel probe based dynamic connection setup in TDM NoCs. DATE, 2014.
[10] Goossens K, et al. AEthereal network on chip: concepts, architectures, and implementations. IEEE Design & Test of Computers, 2005.
[11] Goossens K, Hansson A. The Aethereal network on chip after ten years: Goals, evolution, lessons, and future. DAC, IEEE, 2010.
[12] Millberg M, et al. Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip. DATE, 2004.
[13] Gebali F, Elmiligi H, El-Kharashi M W, eds. Networks-on-Chips: Theory and Practice. CRC Press, 2011.
[14] Chen Y, Matus E, Fettweis G P. Trellis-search based dynamic multi-path connection allocation for TDM-NoCs. GLSVLSI, ACM, 2016: 323-328.
[15] Lu Z, Jantsch A. TDM virtual-circuit configuration for network-on-chip. IEEE Transactions on VLSI Systems, 2008.
[16] Stefan R, Nejad A B, Goossens K. Online allocation for contention-free-routing NoCs. Interconnection Network Architecture: On-Chip, Multi-Chip Workshop, ACM, 2012.
[17] Stefan R. Resource Allocation in Time-Division-Multiplexed Networks on Chip. PhD thesis, Delft University of Technology, 2012.
[18] Schoeberl M, et al. A statically scheduled time-division-multiplexed network-on-chip for real-time systems. NOCS, IEEE, 2012.
[19] Kasapaki E, Sparso J. Argo: A time-elastic time-division-multiplexed NoC using asynchronous routers. ASYNC, IEEE, 2014.
[20] Sparso J, Kasapaki E, Schoeberl M. An area-efficient network interface for a TDM-based network-on-chip. DATE, 2013: 1044-1047.
[21] Chen Y, Matus E, Fettweis G. Centralized parallel multi-path multi-slot allocation approach for TDM NoCs. CCECE, IEEE, 2016.
[22] Pakdaman F, Mazloumi A, Modarressi M. Integrated circuit-packet switching NoC with efficient circuit setup mechanism. The Journal of Supercomputing, 2015, 71(8): 2787-2807.
[23] Stefan R, Goossens K. A TDM slot allocation flow based on multipath routing in NoCs. Microprocessors and Microsystems, 2011.
[24] Moreira O, Mol J J D, Bekooij M. Online resource management in a multiprocessor with a network-on-chip. SAC, ACM, 2007.
[25] Marescaux T, et al. Dynamic time-slot allocation for QoS enabled networks on chip. Embedded Systems for Real-Time Multimedia, IEEE, 2005.
[26] Ruaro M, et al. Runtime adaptive circuit switching and flow priority in NoC-based MPSoCs. IEEE Transactions on VLSI Systems, 2015, 23(6).
[27] Winter M, Fettweis G P. A network-on-chip channel allocator for run-time task scheduling in multi-processor system-on-chips. DSD, 2008.
[28] Winter M, Fettweis G P. Guaranteed service virtual channel allocation in NoCs for run-time task scheduling. DATE, IEEE, 2011: 1-6.
[29] Heisswolf J. A Scalable and Adaptive Network on Chip for Many-Core Architectures. PhD thesis, KIT, 2014.

[30] Liu S, Jantsch A, Lu Z. Parallel probing: Dynamic and constant time setup procedure in circuit switching NoC. DATE, 2012.
[31] Hansson A, Goossens K. On-Chip Interconnect with Aelite: Composable and Predictable Systems. Springer Science+Business Media, 2010.
[32] Bradley S, et al. Applied Mathematical Programming. 1977.
[33] Lou H L. Implementing the Viterbi algorithm. IEEE Signal Processing Magazine, 1995.
[34] Lin S, et al. Trellises and Trellis-Based Decoding Algorithms for Linear Block Codes. Springer Science+Business Media, 2012.
[35] Hansson A, et al. aelite: A flit-synchronous network on chip with composable and predictable services. DATE, 2009.
[36] Woo S C, Ohara M, Torrie E, et al. The SPLASH-2 programs: Characterization and methodological considerations. ACM SIGARCH Computer Architecture News, ACM, 23(2): 24-36.
[37] Chen Y, Matus E, Fettweis G P. Combined centralized and distributed connection allocation in large TDM circuit switching NoCs. GLSVLSI, ACM, 2017.

Yong Chen received his bachelor's and M.Sc. degrees in Electrical Engineering from Northeastern University (China) in 2011 and 2013, respectively. He received his PhD degree at the Vodafone Chair, Technische Universität Dresden, Germany. His research interests include Networks-on-Chip (NoC), Multiprocessor Systems-on-Chip (MPSoC), VLSI, Quality of Service (QoS), and computer architecture.

Dr. Emil Matus is a senior scientist at the Vodafone Chair for Mobile Communications Systems. He received his MS and PhD degrees in Electrical Engineering from the University of Technology in Kosice. Prior to joining the Vodafone Chair in 2003, he was a research associate at the University of Technology in Kosice, focusing on wavelet transforms and image compression. His current research includes hardware/software architectures for communications signal processing and methods for system management and optimization.

Sadia Moriam is a PhD student at the Vodafone Chair, Technische Universität Dresden, Germany. Her research interests include Networks-on-Chip (NoC), resilience, and simulation models.

Gerhard P. Fettweis earned his Ph.D. from RWTH Aachen in 1990. Thereafter he was at IBM Research and TCSI Inc., California. Since 1994 he has been Vodafone Chair Professor at TU Dresden, Germany, with 20 companies from Asia, Europe, and the US sponsoring his research on wireless transmission and chip design. He is an IEEE Fellow and holds an honorary doctorate from TU Tampere. He coordinates two DFG centers at TU Dresden, namely cfaed and HAEC, as well as the 5G Lab Germany.
