Delay/Phase Regeneration Circuits

Crescenzo D’Alessandro

Andrey Mokhov

Alex Bystrov

Alex Yakovlev

Microelectronics System Design Group, School of EECE Newcastle University, UK {crescenzo.d’alessandro, andrey.mokhov, a.bystrov, alex.yakovlev}@ncl.ac.uk

Abstract

Designs which require a phase relationship between two signals to be maintained along a link benefit from the use of repeaters which actively regenerate this relationship. This paper discusses some implementations of phase-regeneration circuits and attempts to introduce the reader to the issues encountered in the design of such circuitry. The paper proposes various design solutions for the dual-rail case, extending the work to the multiple-rail case. A novel device which is able to reconstruct a sequence of events is also presented, the Transition Sequence Encoder. Simulation results are provided with discussion on the relative performance.

1. Introduction

The design of reliable interconnects for on-chip block-to-block communication is becoming a crucial point for the development of new on-chip architectures, so much so that designers now talk about communication-centric design. In some cases a given time relationship between two signals is available and needs to be maintained. One such case is the phase-encoding protocol initially proposed by D’Alessandro et al. in [3]. In that work the authors propose the use of two out-of-phase signals to carry information; the binary phase relationship (+/-) encodes the bit of information being sent across the communication link. The work was then extended in [4] to a multiple-rail implementation, where the link is composed of several communication lines and the symbols are encoded in the order of occurrence of transitions on the lines. It is shown there that the phase corruption introduced by the link, mainly through cross-talk between the lines [11], is significant and can greatly affect the correct recovery of symbols. The assumption that the phase of the signals is kept along a path without additional circuitry still holds for short paths, as the cross-talk induced by each line on the neighbouring one is limited, thanks to the limited coupling between the wires.

© IEEE 2007

Devices which are able to recover the initial timing relationship between two incoming signals are therefore required. These make sure that two edges travelling along a path maintain the given phase throughout the path: given as input two identical signals delayed by a variable amount, such a device outputs the same signals with a time delay that is always ±δ. In particular, in this work we focus on the type of repeater that only performs the delay correction if the input delay is less than the nominal delay δ.

Interesting parallels can be drawn between this work and the work described in [7], where the SURFING interconnect technique is introduced together with some circuit designs for a soft latch and an edge-to-pulse converter. Unlike the present work, the main requirement for the SURFING interconnect technique is to keep the travelling edges together; in the case described here the edges must be kept separated by a given amount. Also, in [7] the edges may or may not occur, while in the phase-encoding method all edges always occur when a correct symbol is sent.

The concept of wave pipelining was introduced by Cotten [2] in 1969; it was then developed further by, for example, Wong et al. in [12]. The main idea consists in having multiple waves travelling on a line between two computational blocks: if the waves are appropriately spaced, the link does not need any additional pipeline logic. This concept can be readily applied to phase-encoding; delay/phase regeneration circuitry can be used to dynamically adjust the behaviour of the link.

For long interconnect links, a serial solution has been proposed by Dobkin et al. in [6]. The authors propose the use of LEDR [5] for long-distance communication. In this scheme, two lines are employed; a change of state on one of the lines results in either a 1 or a 0 being received, according to which wire has switched.
The schemes proposed in this paper could be used to preserve the order of transitions on the two lines. The contribution of this paper consists in providing a range of design solutions for the problem of preserving the delay/phase relationship between long communication lines. The paper offers simulation results for the design solutions and extends the analysis from dual-rail links to multiple-rail links, employing a Transition Sequence Encoder (TSE) to reconstruct the delay/phase relationship between multiple wires. Finally, power consumption figures for the proposed solutions are shown and discussed.

2. Classification and Experimental Setup

Several implementations are possible for a repeater; the following types of circuit can be distinguished:

• latch-based designs, where the outputs are controlled by latches which are on the “data path”
• Mutual-Exclusion (ME)-based designs where the MEs are on the data path
• ME-based designs where the MEs are not on the data path

The designs can be implemented both at transistor level and at gate level. Some examples of these design styles are analysed in the remainder of this paper.

We introduce some measures to rank different designs. The capture range κ of the repeater is the set κ = [δmin, δmax] of time separations at the input of the repeater that the device is able to stretch back to the nominal value δ. The linearity of the circuit refers to the linearity of its response: given a range of time differences between the two inputs, it describes the difference between the output delay between the signals and the input delay. The latency λ of the device is, intuitively, the time between the first input signal reaching the device and the corresponding output leaving the device. Finally, the response time ζ is the time between the first input reaching the device and the time limit before which the second event would not be delayed: if the first event occurs at time t0 and the second event occurs at a time t with t0 + ζ < t < t0 + δ, then the output time separation is restored to δ; otherwise, if the second event occurs at time t ≤ t0 + ζ, the time separation is not regenerated. Note that in some cases, if the time separation is close to ζ, the device can enter metastability or an equivalent state, according to the type of device.

For phase-locked loops (PLLs) it is customary to give the capture range (or “lock-in” range) as the difference between the maximum and the minimum frequency the PLL will lock onto, as the set is centred around the natural frequency of the loop. Following the same custom, we define κ = δmax − ζ (as δmin = ζ).
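The definitions of ζ, δ and κ above can be illustrated with a small idealised model. The sketch below is not taken from any of the circuits in this paper: the function names are hypothetical, metastability near ζ is ignored, and it assumes that separations at or above the nominal δ pass through unchanged.

```python
def output_separation(delta_in, zeta, delta):
    """Idealised output time separation of a repeater (all times in FO4).

    delta_in: input time separation between the two events
    zeta:     response time of the device
    delta:    nominal output time separation
    """
    if delta_in <= zeta:
        # Second event arrives too early: the separation is not regenerated.
        return delta_in
    if delta_in < delta:
        # Inside the working region: stretched back to the nominal value.
        return delta
    # Assumption: separations at or above the nominal value are left as they are.
    return delta_in


def capture_range(delta_max, zeta):
    """Capture range expressed as a width, kappa = delta_max - zeta."""
    return delta_max - zeta


if __name__ == "__main__":
    for d_in in (0.5, 3.0, 7.0):
        print(d_in, output_separation(d_in, zeta=1.0, delta=5.0))
```

A real device departs from this model close to ζ, where the text notes that metastability or an equivalent state may occur.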
Several implementations of the device are possible, each with different issues to be addressed. In particular, we can distinguish between “analogue” and “digital” implementations of the repeater; this distinction is based on the generation of the pulse which stops the late-arriving signal from propagating to the output. The analogue implementations rely on analogue differentiators to generate the pulse; these differentiators are built using series capacitors. In the digital implementations the differentiators are implemented using gates. In this article we focus in particular on some digital implementations of the repeaters.

Figure 1. Conceptual latch-based design

A further distinction can be made between “early-propagating” devices and “merging” devices. The first type allows events to propagate through regardless of the occurrence of a later event; “merging” designs, on the other hand, wait for all the events to have occurred before the outputs are enabled. The first type can be employed in situations where the presence of all the events is not necessary, while the “merging” solutions can be used in systems like phase-encoding, where all the events will occur if a correct transmission is performed. Importantly, the latency of “early-propagating” designs is independent of the input time separation of the events, while the latency of “merging” designs contains the input time separation.

In the remainder of the paper, all the simulation results are normalised to the FO4 delay. Note that the results show the response of the circuits for both rising and falling input transitions; this is necessary as, due to imbalances in the p- and n-type transistor ratio in the gates used, the latency and response of the circuits can vary. In the case of MEs this difference is marked: for the falling transitions NOR-based MEs are used, which perform worse than the equivalent NAND-based MEs. The simulations and power analyses were performed using UMC 0.18 µm technology. The wires were modelled using a distributed π-model, with a load at the receiver of comparable size to the sender’s drivers.
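Since all results are reported in FO4 units, a reader reproducing the measurements would need to normalise raw simulator times. The helper below is a minimal sketch of that step; the function name and the picosecond unit are assumptions, and the FO4 delay itself (`fo4_ps`) is not given in this paper and would have to be measured for the technology in use.

```python
def to_fo4(times_ps, fo4_ps):
    """Normalise a list of times (in ps) to multiples of the FO4 delay.

    fo4_ps is the measured fan-out-of-4 inverter delay in ps; it is a
    parameter here because the paper does not state its value.
    """
    if fo4_ps <= 0:
        raise ValueError("FO4 delay must be positive")
    return [t / fo4_ps for t in times_ps]


if __name__ == "__main__":
    # Placeholder numbers, not measurements from the paper.
    print(to_fo4([90.0, 45.0], fo4_ps=90.0))
```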

3. Latch-based design

A conceptual idea of the device is given in Figure 1. The two input signals are fed into a pair of latches; the first input to propagate through one of the latches causes a pulse at the “enable” input of the two latches, which in turn prevents the other signal from propagating through its latch for a given time. As can be seen from this simple example, two issues are of primary importance in the design of such devices: first, the latch of the signal arriving “late” should not enter metastability; second, and related to the first, the time between an event triggering the pulse and the pulse reaching the “enable” input of the latches should be minimised in order to capture as many events as possible. If the latter condition does not hold, the late-arriving signal will propagate through before the latch has been able to block it.

The following description of the system is based on Figure 1, but can be adapted to other designs. Define dsetup as the setup delay of the latches and dq as the d → q delay of the latch. In the figure, the letters a–d represent the delays of the indicated paths; in particular, d represents the delay between the latch enable being released and the output being generated. Assume that two edges are travelling on the link, were generated with delay δ between each other, and arrive at the input of the circuit with delay δin < δ. The latency of the device is trivially λ = dq. If

$$\delta_{in} < \lambda + a + b \qquad (1)$$

the second edge will reach the output of the latch before the pulse is generated and therefore the circuit will have no effect. If

$$\lambda + a + b < \delta_{in} < \lambda + a + b + d_{setup} \qquad (2)$$

the latch may enter metastability, as the second edge will reach the latch during the time the pulse is being generated and will not respect the restriction imposed by the setup time of the latch. Finally, when

$$\delta_{in} > \lambda + a + b + d_{setup} \qquad (3)$$

the latch will prevent the second edge from propagating until the pulse is completed. The response time ζ will therefore be:

$$\zeta = \lambda + a + b + d_{setup} \qquad (4)$$

The nominal value of δ at the output will be equal to

$$\delta = a + \tau + c + d \qquad (5)$$

The upper bound beyond which the device does not influence the path delays is δmax = δ + λ; in fact, if δin > δ + λ, by the time the second event occurs the latch has been released. Note that if the input time separation is between δ and δ + λ, the output will still be δ. The capture range κ will therefore be:

$$\kappa = \delta_{max} - \zeta = \tau + c + d - b - d_{setup} \qquad (6)$$

Figure 2. Latch-based implementations: (a) single pulse implementation; (b) dual pulse implementation

Figure 3. Simulation results for the circuit in Figure 2 (a) (plot of output time separation and latency against input time separation, in FO4; curves: Fall, Rise, Latency (fall), Latency (rise))

These equations can be used to determine the value of τ given δ and a set of requirements. This first analysis shows that the value of δ must be chosen in such a way that, even when the phase is maximally corrupted, the condition expressed in equation 3 is always respected. This can turn out to be a significant limitation on the speed of the circuit.

Based on Figure 1, two possible implementations can be obtained, as in Figure 2 (a) and (b), according to the way the pulses which control the latch clock are generated. Figures 3 and 4 show the results obtained with these designs. Although the two designs have very similar responses, they also show some significant differences. For the XOR-based implementation (Figure 2 a) the expected output δ was around 10 FO4, as this turned out to be the minimum time separation obtainable using a gate-level pulse generator and the available latches; in the case of Figure 2 b) the value decreases to around 7 FO4 thanks to the more light-weight system used to generate the control pulses. The devices respond correctly, although they exhibit a marked difference between the rising- and falling-transition responses (less so in the case of Figure 2 b): in the case of 2 a), when the input time separation drops below 10 FO4 for a falling input transition and 8 FO4 for a rising input transition, the device pulls the edges apart so that the time separation is preserved at the output. The design fails when the input time separation drops below a minimum value (equation 1). The latency of the device is about 2 FO4 for falling edges and 3 FO4 for rising edges. The fact that the rising and falling edges have different responses is due to the imbalance of the P- and N-type transistors in the gates available in the technology employed to obtain the results. The circuit of 2 b) behaves more similarly for rising and falling transitions; in both cases the separation is preserved at around 6 FO4 when the input time separation drops below this value. The latency of the design is similar to that previously described.

Figure 4. Simulation results for the circuit in Figure 2 (b) (plot “Response of latch-based design (2)”: output time separation and latency against input time separation, in FO4; curves: Fall, Rise, Latency (fall), Latency (rise))

The main advantage of this type of design is its relative simplicity of implementation. The gate count is relatively small and allows the repeater to be easily tuned to the required specifications. It would also be relatively easy to scale up the design for multiple-rail implementations, simply by OR-ing all the XOR gates between pairs of wires and controlling a single pulse for all the latches on all wires (this is described in more detail in Section 7). The latency of the design is also relatively low, so that several repeaters can be placed along a line. Finally, both designs exhibit good linearity and the output curves are “flat” over the expected range. However, the capture range of the device is limited: the implementations stop working around 6 and 4 FO4; around that point the risk of metastability in the latches increases until the devices fail. This type of design is therefore appropriate if the nominal δ of the link is of the order of ten FO4. A faster design relies on transistor-level implementation; this is described in Section 5.
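As a rough numerical companion to equations (1) to (6) of this section, the sketch below evaluates the response time, nominal δ, δmax and capture range from the path delays of Figure 1, classifies an input separation into the regimes discussed above, and solves equation (5) for τ. The class name, method names and the example numbers are hypothetical; only the formulas come from the text.

```python
from dataclasses import dataclass


@dataclass
class LatchRepeater:
    """Path delays of the conceptual latch-based repeater (all in FO4)."""
    a: float
    b: float
    c: float
    d: float
    tau: float
    d_q: float       # d -> q delay of the latch
    d_setup: float   # setup delay of the latch

    @property
    def latency(self):
        return self.d_q                                       # lambda = d_q

    @property
    def response_time(self):
        return self.latency + self.a + self.b + self.d_setup  # eq. (4)

    @property
    def nominal_delta(self):
        return self.a + self.tau + self.c + self.d            # eq. (5)

    @property
    def delta_max(self):
        return self.nominal_delta + self.latency

    @property
    def capture_range(self):
        return self.delta_max - self.response_time            # eq. (6)

    def regime(self, delta_in):
        """Classify an input separation; boundary cases are a modelling choice."""
        if delta_in < self.latency + self.a + self.b:
            return "too_close"          # eq. (1): circuit has no effect
        if delta_in < self.response_time:
            return "metastable_risk"    # eq. (2)
        if delta_in <= self.delta_max:
            return "regenerated"        # eq. (3), up to delta_max
        return "released"               # latch already released


def tau_for(delta, a, c, d):
    """Solve eq. (5) for tau given a target nominal delta."""
    return delta - a - c - d


if __name__ == "__main__":
    r = LatchRepeater(a=1.0, b=1.0, c=1.0, d=1.0, tau=6.0, d_q=2.0, d_setup=0.5)
    print(r.response_time, r.nominal_delta, r.delta_max, r.capture_range)
```

The capture range computed this way equals τ + c + d − b − dsetup, as in equation (6), which gives a quick consistency check when tuning τ.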

4. ME-based designs

ME-based designs can in turn be divided into several types. An ME can be used to identify the first-occurring event and then, once both events have arrived, the signals are sent on with the correct time separation; alternatively, the first output can be generated immediately after the first event arrives and the second after the second event, introducing some time separation if necessary; finally, the output of the ME can be fed to wrapper logic to generate the outputs.

An example of an “early-propagating” design is shown in Figure 5. This design employs a cross-coupled pair of complex gates. The cross-coupled structure only regenerates the delay for rising input transitions, but not for falling transitions, in which case the δ is preserved. Two devices in series allow the delay to be regenerated for both transitions. In Figure 5 the intermediate and final high-gain buffers at the output of the complex gates are used to restore the digital levels. A transistor-level metastability filter, as described in [10], could have been employed.

Figure 5. ME-based “early-propagating” design

Figure 6 shows the response of this design. Although the latency of the design is of the order of 7 FO4 (due to the presence of complex gates), the capture range of the design is good: the output δ was chosen to be as small as possible and was around 1 FO4. However, the output time separation does not always match the expected value; in the figure it is clear that the response is not flat, although it is better for the rising transitions.

Figure 6. ME-based design results for design in Figure 5 (plot “Response of modified-ME design”: output time separation and latency against input time separation, in FO4; curves: Fall, Rise, Latency (fall), Latency (rise))

A different example of an “early-propagating” design is shown in Figure 7. In this case two MEs are used for the rising (NAND) and falling (NOR) input transitions. The outputs of the MEs are fed to control logic which generates the outputs given the ME outputs and the input signals. This control logic also resets the MEs after the output has been generated; therefore the next input must be generated only after the previous output has been produced. This introduces a limit on the bandwidth.

Figure 7. Automatic synthesis design

Figure 8 shows the output response of the circuit. The latency of the design depends on the transition direction and is around 8 FO4 for falling transitions and 6 FO4 for rising transitions (due to the difference in speed of the MEs); it increases slightly when the input time difference becomes small, due to the behaviour of the MEs (see [3]). The output δ was set to 3 FO4; the response is remarkably flat below 5 FO4, although the design has a slightly different response for rising and falling input transitions. The design stops working below 0.04 FO4 when the NOR-based ME is used (“fall” curve in the graph); it works all the way down to 0.01 FO4 (the simulation limit) when the NAND-based ME is used.

Figure 8. ME-based design results for design in Figure 7 (plot “Response of PETRIFY-generated design”: output time separation and latency against input time separation, in FO4; curves: Fall, Rise, Latency (fall), Latency (rise))

One interesting feature of this design is that if the input δ is greater than the nominal value, the output time separation is less than the input: this is due to the inherent structure of the circuit. The first event propagates through the circuit after having been through the ME; the second event, however, propagates through the circuit bypassing the arbitration stage. This does not affect the output if the two inputs arrive very close to each other, as the second event is prevented from propagating to the output by the circuit. For the NOR-based ME branch this is particularly effective, as the response is relatively flat between 5.5 and 0.04 FO4; for the other branch the limits are 5 and 0.01 FO4. Above the higher limit the design reduces the time separation, but not to the nominal value. The reduction, however, could allow the next stage to regenerate the nominal time separation.

An important point for this design is that the control logic which takes the outputs of the MEs and the inputs and generates the outputs was obtained using automatic synthesis. The initial specification was described using a Signal Transition Graph (STG); this was then adjusted to take into account timing assumptions. The design was finally synthesised automatically using PETRIFY [1].

Figure 9. ME-based implementation (merging)

Finally, an example of a “merging” design is shown in Figure 9. In this design two MEs are employed as in the previous case, but the output is generated when both inputs have arrived. This allows the circuit to introduce the expected delay regardless of the input time separation. Figure 10 shows the response of the device. It is remarkably “flat”, in the sense that the time separation of the output signals is independent of the time separation of the input signals. Note that the capture range κ is infinite, as the upper bound δmax is infinite; however, the response of the device is obviously limited by the metastability effect of the MEs.
It can also be seen that the output time separation can be made very small: in this case it was only around 2 FO4, although it could have been made arbitrarily smaller or larger. At very small delays (δin < 0.02 FO4) the design failed to regenerate the delay, as the output value was different from the input. This is due to minute imbalances in the load of the input signal and of the output of the MEs, which lead to errors. This is the case only for the NOR-based ME: the NAND-based ME continues working throughout the required range, down to 0.01 FO4.

Figure 10. ME-based design results for design in Figure 9 (plot “Response of "merge" design”: output time separation and latency against input time separation, in FO4; curves: Fall, Rise, Latency (fall), Latency (rise))

Another unwanted behaviour of this design is that the latency of the device increases with the input time difference. This is due to the “merging” characteristics of this design: as the production of the output depends on the presence of both input transitions, the latency is equal to:

$$\lambda = \delta_{in} + d_C \qquad (7)$$

where dC is the propagation delay of the logic which generates the “reference” signal. An important note concerns the change in behaviour of the latency for falling transitions, which occurs around 3.5 FO4 input time separation. This is due to the slow response of the NOR-based ME; below 2 FO4 the ME delay dominates, so that the reduction in input delay is counteracted by an increase in ME resolution time, which causes the output delay to flatten out below this point.
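Equation (7) can be tabulated directly to see how the latency of a “merging” design grows with the input time separation. The sketch below is only the first-order relation: the function names and the value used for dC are hypothetical, and the ME resolution time, which the text reports as dominant at small separations for the NOR-based ME, is deliberately left out.

```python
def merging_latency(delta_in, d_c):
    """Latency of a merging design, eq. (7): lambda = delta_in + d_C (FO4)."""
    return delta_in + d_c


def latency_sweep(separations, d_c):
    """Pairs (input separation, latency) for a list of input separations."""
    return [(s, merging_latency(s, d_c)) for s in separations]


if __name__ == "__main__":
    # d_c = 1.0 FO4 is a placeholder, not a value reported in the paper.
    for s, lat in latency_sweep([0.0, 2.0, 4.0, 8.0], d_c=1.0):
        print(f"delta_in = {s:4.1f} FO4 -> latency = {lat:4.1f} FO4")
```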

5. Transistor-level design techniques

The latch-based design is improved if the response time of the circuit is reduced. This can be achieved using transistor-level implementations of the various parts of the repeater. One such solution is shown in Figure 11. The circuit shows a smaller response time than the designs previously analysed, and also a smaller number of gates, resulting in less area. Figure 12 shows the simulation results for the device; in a) the results are shown for the case where no keepers are used, while in b) the simulated circuit is enhanced by the use of output keepers to prevent the outputs from being left floating. Note that the presence of keepers does not significantly affect the performance of the design, although the power consumption is slightly increased (see Section 8).

Figure 11. Latch-based transistor-level implementation

The response of the design depends on the direction of the input transitions, due to the difference in speed between the P- and N-type transistors. Taking into account both transitions, the latency of the device is around 1 FO4 if the input and output buffers (not shown) are not taken into account; the response time ζ is less than 0.5 FO4. Intuitively, when N-type transistors dominate the response time is reduced: in this case to 0.01 FO4. The capture range κ is around 3 FO4; however, τ was chosen to be only 2 FO4 and could have been larger, in turn increasing the capture range. Note that the latency of the device increases exponentially when the input time separation drops, due to metastability of the device. However, in that region the device is already outputting a δ which is less than the nominal value. In this example, the operating area of the device would have been down to 0.5 FO4 for the keeper-enhanced design, and slightly less for the design without the keepers.

6. Passive and Layout Techniques

Various “passive” and layout techniques can also be used to reduce the capacitive cross-talk between two adjacent lines. By “passive” we refer to techniques that do not involve employing modified buffers; rather, they achieve cross-talk cancellation using either the same coupling capacitance which would cause cross-talk in the first place, or carefully placed capacitors to achieve the same effect. Ho et al. in [8] propose two such techniques. In the first, repeaters are staggered along the lines so that an inverter on one line is placed between two repeaters on the other line. This makes sure that the wires are always switching in the opposite direction at the maximum distance from the inverters, thus minimising cross-talk. This technique is useful in the case of phase-encoding, as the adjacent wires always switch in the same direction.

Figure 12. Transistor-level design and simulation results. a) no keepers, b) keepers added for stable outputs (plots “Response of transistor-based design (no keepers)” and “Response of transistor-based design (with keepers)”: output time separation and latency against input time separation, in FO4; curves: Fall, Rise, Latency (fall), Latency (rise))

A different approach consists in employing a simple charge-compensation circuit, shown in Figure 13. This circuit operates as follows: a rising transition on wire i1 induces a cross-talk effect on wire i2. However, the inverter connected to i1 causes an opposite transition to appear at one end of the connected capacitor; this acts as a differentiator, generating a pulse in the opposite direction to the edge on i1. This eventually cancels out the initial induced cross-talk.

Figure 13. Charge-compensation circuit

Simulation results for this device are shown in Figure 14. The device operates in a way similar to the circuits shown previously. The response time ζ is around 1 FO4; as there are no buffers the latency of the device is not considered. The response of this device is not entirely satisfactory, as can be seen from Figure 14. Additionally, the circuit creates voltage spikes during normal operation, which may not be acceptable. The circuit can be improved by adding extra capacitors to increase the time separation at the output; however, this will result in a greater power requirement.

Figure 14. Simulation results of charge-compensation circuitry (plot “Response of passive device”: output time separation against input time separation, in FO4; curves: Fall, Rise)

7. Multiple-rail approach

In order to use phase-regeneration circuitry for multiple-rail communication links it is necessary to extend the proposed circuits to the required number of wires.

7.1. Latch-based Approach

The most straightforward way to achieve this is simply to use the technique described in Section 3, extended to the number of wires required. This scheme, shown in Figure 15, allows a simple extension of the principles described previously. Of course, this “rough” approach has strong limitations: the pulse generated by the outputs of the latches must be able to stop all the other wires; this implies a large gate or even a tree of gates, resulting in a relatively large propagation delay between an event occurring on a line and the latches being blocked. We recall that the equation for the response time is:

$$\zeta = \delta_{min} = \lambda + a + b + d_{setup} \qquad (8)$$

The value of a increases linearly with the number of wires and imposes a limit on the value of δ.
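The effect of the wire count on equation (8) can be sketched numerically. The linear model a = a_base + a_per_wire · n used below is an assumption made for illustration (the text only states that a grows linearly with the number of wires), and the function name and all delay values are hypothetical.

```python
def multirail_response_time(n_wires, d_q, a_base, a_per_wire, b, d_setup):
    """Response time zeta of eq. (8) with a assumed linear in the wire count.

    All delays are in FO4. a_base and a_per_wire are illustrative
    parameters, not quantities reported in the paper.
    """
    a = a_base + a_per_wire * n_wires
    return d_q + a + b + d_setup


if __name__ == "__main__":
    for n in (2, 4, 8):
        z = multirail_response_time(n, d_q=2.0, a_base=1.0,
                                    a_per_wire=0.5, b=1.0, d_setup=0.5)
        print(f"{n} wires -> zeta = {z} FO4")
```

Since the input separation must stay above ζ for the delay to be regenerated, a growing ζ directly raises the smallest usable δ.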

Figure 15. Multiple-rail latch-based phase-regeneration circuit

Figure 16. TSE gate-level implementation

Figure 17. Dual-rail TSE Phase Regeneration

7.2. Transition Sequence Encoder

In Figure 9 an example is given of a circuit that first recovers the phase information between two edges and then sends a new symbol across; this allows the system to regenerate the phase relationship whether it is greater or smaller than the nominal value. It is easy to see that a multiple-rail implementation of such a device might become prohibitive in terms of latency and area: as described in [4], the repeater would need to implement a full receiver followed by a full sender. The decoding and encoding parts would be large, composed of layers of AND and OR gates, and therefore impractical. What is needed instead is a device that is able to recover the order of the transitions and send them across in the same order with a given time separation between them. This can be achieved using a more generalised approach, described in [9], from which the following definitions are taken.

Let E = {1, ..., n} be a set of n events and let a matrix R : E × E → {true, false} of order relations between each pair of events be provided, such that R(i, j) = true if event i ∈ E occurs before event j ∈ E. The Transition Sequence Encoder (TSE) is a circuit which generates request signals (events) req[k], k ∈ E, in the order specified by the matrix R. Following the steps described in [9], the equation for req[k] is:

$$\mathit{req}[k] = \bigwedge_{1 \le j \le n,\ j \ne k} \left( R[k,j] + \mathit{ack}[j] \right)$$

The solution can be mapped to a gate-level implementation (Figure 16). The signal go is a general “ready” signal which prompts the circuit to start generating requests. The TSE can be used as a phase regeneration circuit for multiple wires. The order matrix can be generated with a set of mutual exclusion elements, and the signals ack[k] are simply delayed versions of req[k].

In its simplest form the TSE can be used for dual-rail links. In this case, a simple cross-coupled structure following the ME is able to regenerate the signal, as shown in Figure 17; this structure is shown for rising edges only. The go signal is simply generated using a C-Element (CE) from the input wires and signals the presence of a symbol in the circuit. Note that the ME outputs do not need completion detection circuitry, as is needed in other designs, but only a metastability resolver. In fact, consider the case where a symbol has been received; the CE will generate a signal concurrently with the ME. However, until the ME has resolved, no output is generated; when the ME has resolved, the order is correctly regenerated at the output. Therefore the race condition between the ME and the CE does not constitute a problem for the circuit timing.

In order to employ both rising and falling edges the circuit would have to be duplicated, using for the falling edge the complementary version of Figure 17, driven by a NOR-based ME; an alternative solution consists in using a transistor-level implementation of the building blocks. The resulting circuit is shown in Figure 18. The pull-up and pull-down parts of the gate are not complementary; a keeper structure is therefore used to keep the output stable. Figure 19 shows the response of this circuit, both for the gate-level and the transistor-level implementation. Note the significant latency of the circuits: as the device is “merging”, the latency includes the input time separation.
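The behaviour of the TSE equation can be checked with a small behavioural model. The sketch below is an assumption-laden illustration, not the gate-level circuit of Figure 16: the order matrix is built by comparing arrival times (standing in for the MEs), ack[k] is set as soon as req[k] has fired (standing in for the delay), indices are 0-based, and the outputs are assumed to be spaced by exactly δ starting at the time `t_go` when go is asserted. All function names are hypothetical.

```python
def order_matrix(arrival_times):
    """R[i][j] is True when event i arrives strictly before event j."""
    n = len(arrival_times)
    return [[i != j and arrival_times[i] < arrival_times[j] for j in range(n)]
            for i in range(n)]


def tse_order(R):
    """Order in which the requests req[k] fire for the order matrix R."""
    n = len(R)
    ack = [False] * n   # ack[k]: delayed copy of req[k]
    fired = []
    while len(fired) < n:
        # req[k] = AND over j != k of (R[k][j] OR ack[j]); events that
        # have already been acknowledged are not considered again.
        ready = [k for k in range(n)
                 if not ack[k]
                 and all(R[k][j] or ack[j] for j in range(n) if j != k)]
        if len(ready) != 1:
            raise ValueError("R does not describe a strict total order")
        k = ready[0]
        fired.append(k)
        ack[k] = True   # the acknowledgement enables the next request
    return fired


def regenerate(arrival_times, delta, t_go=0.0):
    """Output times with the input order preserved and a spacing of delta."""
    out = [0.0] * len(arrival_times)
    for rank, k in enumerate(tse_order(order_matrix(arrival_times))):
        out[k] = t_go + rank * delta
    return out


if __name__ == "__main__":
    # Three wires whose edges arrived closer together than intended.
    print(regenerate([0.3, 0.1, 0.2], delta=2.0))
```

For a strict total order exactly one request is enabled at each step, which is why the model raises an error on ties; in hardware such ties would be resolved by the MEs.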

[Figure 19(a). Response of TSE design, gate-level design. Horizontal axis: Input Time Separation (FO4); vertical axis: Output Time Separation/Latency (FO4). Curves: Fall, Rise, Latency (fall), Latency (rise).]

The circuit proposed in Figure 17 can be extended to a multiple-rail implementation by stacking more gates and employing more MEs. As in Figure 17, in the multiple-rail implementation the ack[k] signals are delayed versions of the corresponding req[k] signals. Figure 20 shows a gate-level implementation of such a device, which correctly regenerates the phase relationship of the input signals (only the rising transitions are affected by this circuit). The go signal is generated using a CE (not shown). Figure 21 shows the transistor-level implementation of the same device, which handles both rising and falling transitions. In this case two matrices, R1 and R2, are used for rising and falling transitions respectively, both generated by the MEs. The figure only shows the structure necessary for a single wire (wire o1). The go signal is again generated using a CE over the set of wires. The circuit in Figure 21 is drawn for 4 wires in order to show how the system can be extended to several rails: as the transistor stack cannot be arbitrarily deep, the circuit is extended “horizontally”, using CEs to act as keepers.
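How a set of MEs can produce the order matrix for several rails may be sketched as follows. The sketch assumes one ME per pair of wires, so that n wires need n(n-1)/2 MEs; the text only states that "more MEs" are employed, so this count is an assumption. Each ME is modelled as a plain comparison of arrival times, with metastability ignored. The function name `order_matrix_from_arrivals` is hypothetical and indices are 0-based.

```python
from itertools import combinations
from typing import List, Tuple


def order_matrix_from_arrivals(
    arrivals: List[float],
) -> Tuple[List[List[bool]], int]:
    """Build R from edge arrival times using one modelled ME per wire pair.

    Returns the order matrix and the number of MEs used. R[i][j] is True
    when the edge on wire i arrives before the edge on wire j.
    """
    n = len(arrivals)
    R = [[False] * n for _ in range(n)]
    me_count = 0
    for i, j in combinations(range(n), 2):
        me_count += 1  # assumed: one ME arbitrates each pair of wires
        if arrivals[i] == arrivals[j]:
            raise ValueError("simultaneous edges: ME resolution not modelled")
        if arrivals[i] < arrivals[j]:
            R[i][j] = True
        else:
            R[j][i] = True
    return R, me_count


# Three rails: wire 1 switches first, then wire 2, then wire 0.
R3, mes = order_matrix_from_arrivals([2.0, 0.5, 1.0])
print(R3, mes)
```

Under the one-ME-per-pair assumption the arbitration cost grows quadratically with the number of rails, which is one reason the multiple-rail case needs a separate analysis.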

8. Notes on Area and Power Consumption

The power consumption of the repeater circuits depends on the type of design, both in magnitude and in how it varies with the input time separation. In general, as the input time separation decreases and approaches the response time, the power consumption of the devices increases, as the device gets close to metastability. If the time separation is below the response time, the power consumption is lower, as no arbitration takes place; this is also the case when the input time separation is greater than δmax. Finally, throughout the capture range the power consumption is largely independent of the input δ.

Table 1 shows the average energy per bit obtained during simulation of the devices. The test-bench consisted of a 100 MHz reference signal which was modulated onto a dual-rail phase-encoded link with progressively decreasing time separations. The energy shown refers only to the part of the simulation where the devices do not exhibit metastability, as during metastability (or equivalent) the power consumption increases. All circuits were implemented to work in a non-return-to-zero (NRZ) fashion; the energy shown is the average of the energy requirements for rising and falling transitions.

The table shows that the charge-compensation circuit is the best in terms of power consumption. However, its regeneration performance was not entirely satisfactory, as described in Section 6. The latch-based transistor-level design, by contrast, shows good power consumption and very good performance, making it the best design choice for dual-rail applications. The multiple-rail case requires a more detailed analysis, depending on the complexity of the design.

Figure 18. Transistor-level implementation of dual-rail TSE

[Figure 19(b). Response of TSE design (transistor level). Horizontal axis: Input Time Separation (FO4); vertical axis: Output Time Separation/Latency (FO4). Curves: Fall, Rise, Latency (fall), Latency (rise).]

Figure 19. Simulation results for dual-rail TSE

9. Conclusions

An introduction to the problem of designing phase-alignment circuitry within the context of phase encoding has been presented. A number of designs and a categorisation of these have been described, and simulation results presented. The availability of different solutions with different dynamic behaviour allows the designer to identify the best design style according to the particular requirements of the work at hand. A novel approach, the Transition Sequence Encoder, has also been presented; it is able to regenerate the order of a number of transitions and offers scope for a number of research topics. Silicon results, preceded by a more thorough analysis of the layout information, are part of the future work, together with a method to explore the design space in an interactive manner.

Table 1. Comparison of Area and Energy/bit requirements of circuits

Design                          Figure  Area¹   Energy/bit (pJ)
Latch-based                     2(a)    58      0.82
Latch-based                     2(b)    68      0.59
Modified ME                     5       88      1.17
Automatic Synthesis             7       94      0.9
ME-based merging                9       110     0.98
Latch-based transistor level    11      28/32²  0.43/0.47²
Charge-compensation³            13      24      0.22
TSE dual-rail gate-level        17      74      0.78
TSE dual-rail transistor-level  18      52      0.79

¹ transistor count, ² using keeper structure, ³ estimating capacitors size

Figure 20. 3-rail phase regeneration circuit

Figure 21. Multiple-rail TSE (transistor-level implementation)

Acknowledgments

The authors would like to thank Prof. D. Kinniment for the helpful discussions. This work is supported by the EPSRC grant EP/C512812/1.

References

[1] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev. Petrify: a tool for manipulating concurrent specifications and synthesis of asynchronous controllers. In XI Conference on Design of Integrated Circuits and Systems, Barcelona, November 1996.
[2] L. Cotten. Maximum rate pipelined systems. In Proc. AFIPS Spring Joint Computer Conference, 1969.
[3] C. D’Alessandro, D. Shang, A. Bystrov, and A. Yakovlev. PSK Signalling on SoC Buses. In Proc. PATMOS 2005. Springer, 2005.
[4] C. D’Alessandro, D. Shang, A. Bystrov, A. Yakovlev, and O. Maevsky. Multiple-Rail Phase-Encoding for NoC. In Proc. 12th ASYNC, pages 107–116, March 2006.
[5] M. E. Dean, T. E. Williams, and D. L. Dill. Efficient self-timing with level-encoded 2-phase dual-rail (LEDR). In Proc. of the 1991 University of California/Santa Cruz Conference on Advanced Research in VLSI, pages 55–70, Cambridge, MA, USA, 1991. MIT Press.
[6] R. Dobkin, R. Ginosar, and A. Kolodny. Fast asynchronous shift register for bit-serial communication. In Proc. 12th ASYNC, pages 117–126, March 2006.
[7] M. Greenstreet and J. Ren. Surfing Interconnect. In Proc. 12th ASYNC, pages 98–106, March 2006.
[8] R. Ho, K. Mai, and M. Horowitz. Managing wire scaling: a circuit perspective. In Proc. IEEE Interconnect Technology Conference, pages 177–179, 2003.
[9] A. Mokhov and A. Yakovlev. Transition Sequence Encoder. Technical Report NCL-EECE-MSD-TR-2006-117, Newcastle University (UK), 2006.
[10] C. Molnar and I. Jones. Simple circuits that work for complicated reasons. In Proc. 6th ASYNC, volume 1, pages 138–149. IEEE CS, April 2000.
[11] D. Pamunuwa, L. Zheng, and H. Tenhunen. Maximizing throughput over parallel wire structures in the deep submicrometer regime. IEEE Trans. Very Large Scale Integr. Syst., 11(2):224–243, 2003.
[12] D. C. Wong, G. De Micheli, and M. J. Flynn. Designing high-performance digital circuits using wave pipelining: Algorithms and practical experiences. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(1):25–46, Jan. 1993.