Building-blocks for Designing DI Circuits


Priyadarsan Patra and Donald Fussell
Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712-1188, USA
Nov 1, 1993

Abstract

We show universality and minimality of sets of primitives for constructing delay-insensitive (DI) networks, which provide flexibility in synthesis and efficiency in silicon implementation. Several approaches that improve on recent designs in the literature of a modulo-N counter are explored. Rudiments of a general design methodology for such asynchronous circuits are provided via a few case studies. We obtain low constant latency, low constant response time, constant power consumption and asymptotically optimal area complexity for the counter. For moderately large N, the area complexity tends to match any possible design under a synchronous (clocked) discipline and improves over existing delay-insensitive implementations. Additionally, an elegant mathematical analysis of this design problem gives a family of solutions to choose from. We also demonstrate the powerful property of timing-independent composability of DI circuits to obtain a very area-efficient modulo counter circuit. Finally, we find efficient DI decompositions as well as transistor (switch-level) implementations of many important circuit modules, e.g., Join devices.

1 Introduction

The past few decades have seen periods of waxing and waning interest in the design of asynchronous circuits, which have no global clock synchronizing the operation of the various portions of the circuit. In the absence of the temporal simplicity obtained from discretization of time by a global clock, it can be much more difficult to deal with the correct synchronization of events in the circuit. Designing correct asynchronous circuits thus requires a much more rigorous approach to timing issues than has been the case for synchronous systems. As a result, asynchronous designs have rarely been used in practice. However, interest in asynchronous circuits has been maintained in spite of these difficulties because of the potential advantages of such circuits over synchronous circuits ([Sut89, BM91, Bru91b]). These advantages include the robustness of fully asynchronous circuits in the face of technological changes such as scaling of integrated-circuit feature sizes and large variations in operating conditions such as temperature, higher power efficiency, higher speeds, and greater resilience to metastability. It is clear that a great deal of power can be saved by eliminating a global clock, which causes transitions at every storage element in a circuit every time a clock transition occurs. However, without satisfactory theoretical or practical demonstrations of alternative asynchronous
circuits, it remains debatable whether or not the overhead of asynchronous circuit designs counterbalances this potential power savings. In fact, several researchers in the area claim that asynchronous circuits consume higher energy on average because data paths that are designed to be delay-insensitive are almost always designed "dual-rail" [MBL+89, vB92]. This means double the number of wires and twice the transitions (hence, twice the energy consumption) to transmit data between adjacent modules, in comparison to synchronous circuits. Additionally, communication disciplines based on 4-phase handshaking only add to the energy burden of this type of asynchronous circuit because of the "reset" transitions.

Similarly, it is clear that, since only those portions of an asynchronous circuit actually used in performing a particular operation contribute to the delay in completing that operation, the average speed of a complex computation need not be limited by the speed of the slowest path in the circuit, as it is in synchronous systems. However, once again, in the absence of effective demonstrations to the contrary, it can be argued that the added circuit complexity of asynchronous systems will negate this potential advantage. Moreover, there is still much contention about the relative area efficiency of synchronous and asynchronous circuits. In [Mar92] a "delay-insensitive" parallel adder design using a 4-phase protocol is described which is claimed to compete well with its synchronous cousins in area efficiency. However, many practitioners and even researchers in asynchronous circuit design continue to believe that asynchronous circuits entail a prohibitively high area penalty when compared to synchronous circuits and that they are therefore impractical to use in most cases.

Perhaps surprisingly, there is little existing work on systematic methods for optimizing the area, speed and energy requirements of asynchronous circuits. For instance, Keller [Kel74] gave one of the first and best attempts to characterize the class of delay-insensitive (DI) circuits, and also provided a universal set of circuit primitives such that any circuit in this class is realizable as a delay-insensitive network of the primitives, i.e. a DI decomposition into primitives exists. However, no even moderately efficient implementation strategy or methodology was provided for accomplishing this. More recently, [Udd84] formalized classes of delay-insensitive circuits in Trace theory by defining closure properties of Trace structures representing delay-insensitive (DI) computations, but did not deal with the synthesis of efficient circuits using these elements. [Dil89] and [JU91] have developed composition operators and algebras to model speed-independence and to verify equivalence of DI specifications, respectively. [Ebe89] has developed grammars to specify DI circuits that induce a syntax-directed translation into a basic set of primitives. But all these methods seem unsuitable for the simultaneous need for easy, efficient specification and decomposition of DI modules (while making automation of synthesis practicable). On the other hand, [Chu87, Men88, Lav92, Now93] have devised more practical synthesis techniques, yet they impose several restrictions on the specification unrelated to delay-insensitivity or speed-independence and provide very limited means for composing and decomposing DI modules.
Such restrictions not only limit the potential utility of these methods in designing large asynchronous systems, but they also limit the efficiency of the resulting systems in ways that are not required for correct delay-insensitive operation.

We believe that previous research leads to several conclusions regarding the requirements for maximizing the potential benefits of asynchronous circuit design in practice. First, we believe that delay-insensitive (DI) circuits are the most desirable asynchronous circuits. They make the weakest assumptions about the timing relationships of elements in the system, and thus are most robust to conditions which may change timing relationships, such as temperature variations. They also most effectively eliminate the need to explicitly reason about time when designing circuits and thus make the problem of correct design simpler than for other types of asynchronous circuits. However, there is reason to be concerned that they require the greatest overhead of all asynchronous designs.
One of our goals is to mitigate this latter concern by demonstrating that the overhead can be kept quite small. Second, we believe that the use of formal specification techniques for designing asynchronous circuits is imperative. Otherwise, it will remain prohibitively difficult to make asynchronous circuits work in practice. Finally, we believe that the most flexible basis for DI circuit design should be used, since imposing unnecessary restrictions on the design style for the purpose of simplifying the design process is likely to lead to inefficiencies in the circuits designed.

Thus, our approach is to define a small set of efficient basic circuit primitives and their manipulations, and to build on these primitives a methodology for composing efficient, large DI circuits out of other DI components. The methodology we are developing is based on formal specification techniques, and thus facilitates the creation of simultaneously correct and efficient DI circuits, which helps realize the full potential of asynchronous circuits as practical alternatives to today's synchronous systems.

In this paper we take two basic steps toward this goal. First, we define a small set of basic DI circuit primitives which are efficient in the sense that their area and power requirements are small with no sacrifice in speed. We then show that this set of primitives is universal in that any possible DI circuit (belonging to a large class) can be constructed from such primitives. Finally, we provide case studies in the use of these primitives to design several DI circuits which are comparable in area with the best synchronous circuits and simultaneously better in terms of their speed and power requirements.

1.1 A short taxonomy of asynchronous circuits

Different assumptions about input timing and about delays in components and wires lead to different classes of asynchronous circuits (networks of components and interconnects):

Fundamental Mode: These are the conventional asynchronous circuits, also called Huffman circuits. A circuit is stable if each component and wire in the circuit has consistent inputs and outputs. A circuit becomes unstable immediately upon receipt of an input. A circuit in this class has two associated time intervals t1 and t2, t1 <= t2, such that an input is allowed only when either the circuit is stable or within t1 of its becoming unstable. The circuit is guaranteed to become stable if no new inputs arrive for t2. There are further divisions based on whether single or multiple input changes are allowed within t1, and whether pure/inertial, integer/real, and/or bounded/unbounded delays are assumed. Synthesis methods involve informal flowtable descriptions and hazard analysis. Usually the conventional gates (Or, And, Nand, Not, etc.) and specific delay elements are used in the implementation. Timing simulation, assuming certain bounds on all delays in the circuit, is done to ensure confidence in the correct behavior.

Input-Output Mode: Here, further inputs are allowed to occur once the outputs for the previous inputs have appeared. This permits a more realistic and efficient operation of the circuit, in contrast to fundamental-mode operation. The various subclasses, based on differing assumptions about delays, are:

speed-independent: These circuits, pioneered by Muller, make the overly optimistic assumption that there is no delay in interconnection wires, but allow arbitrary component delays. If a wire is constrained to connect exactly two components, the resulting subclass coincides with the class of delay-insensitive circuits defined below.

delay-insensitive (DI): The functional correctness of this class of circuits is independent of any (finite, non-negative) delays in the constituent components and wires. This class
is by far the most robust but also the most pessimistic in its assumptions about delay values. We find this class most promising for control as well as for bit-serial data circuits. To see that the class of speed-independent circuits properly includes the DI class, consider Figure 1 and its description below.

self-timed: This assumes that a circuit can be divided into "equi-potential" regions [Sei80] such that the delay in any wire within an equi-potential region is assumed to be negligible, while delays in wires between two different equi-potential regions are assumed to be arbitrary.

quasi-delay-insensitive: This type of circuit allows arbitrary component and wire delays but may contain isochronic forks. A symmetric isochronic fork is one in which the difference in the delays between its fanout branches is negligible compared to the delay in the components it connects to ([MBL+89]). The fork is asymmetric if one branch is assumed to have smaller delay than the other ([Bru91a]).

Bounded-delay I/O-mode circuits: This class of circuits assumes lower and upper bounds on the various internal delays for correct and hazard-free behavior (e.g., [Lav92]).

Locally synchronous, globally asynchronous: These circuits employ local clocks and closely follow the synchronous design style for smaller modules that communicate asynchronously externally. Synchronizers are used at the module interfaces. Under liberal conditions, such designs may be more prone to metastability problems than totally synchronous systems occasionally receiving external (asynchronous) inputs.

[Figure 1: Speed-independence vs. delay-insensitivity.]

Suppose there are no wire delays (as assumed for speed-independent circuits), but gates have arbitrary, bounded (inertial) delays. Consider the global stable state (A=1, B=0, C=1, D=1) for the gate circuit. Suppose the only circuit input, A, transits to the binary value 0, making gates B and C unstable concurrently. The gates B and C are said to be in a race. The outcome of the race (i.e. which one completes first) will determine the final state of the circuit, as we shall see. If the delay in gate B is sufficiently large, gate C "wins" and transits to 0, which in turn makes gate D produce a 0. After gate B reacts, C eventually outputs a 1. At this point, D is already in the stable state 0 and the circuit is said to have settled in the state (0, 1, 1, 0). Note that if B wins the race, the output of D will remain at 1 and the final (stable) state of the circuit will be (0, 1, 1, 1). This illustrates that the circuit is not speed-independent with respect to the input transition 1 to 0 in initial state (1, 0, 1, 1). Next, assume the initial stable state to be (0, 1, 1, 1) and let the input transit to a 1. The reader will see that the only final (stable) state reachable is (1, 0, 1, 1) no matter what the delays in the gates are, and hence the circuit is speed-independent with respect to this transition. However, suppose the wires also have arbitrary delays; in particular, suppose the wire from input terminal A to gate C has a sufficiently larger delay than the combined delay of the inverter B and the wire from B to gate C. In this case, it is possible for gate D to detect the transient transition of gate C to 0, and consequently gate D could stabilize to the value 0, the circuit thereby ending up in the stable state (1, 0, 1, 0). This points out that wire delays can change the functional behavior of a circuit, and hence pure delay-insensitivity is a stronger concept than speed-independence.
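
The race can be replayed concretely. The following illustrative sketch assumes gate functions consistent with the state sequences just described (B computing not A, C computing A or B, and D a self-holding AND of C and its own output), which need not be the exact gates of Figure 1; wires are taken to be instantaneous, so only the speed-independence half of the argument is modeled.

    def settle(state, priority):
        """Repeatedly fire the fastest unstable gate (priority lists gates fastest first)."""
        funcs = {
            "B": lambda s: int(not s["A"]),         # assumed: B is an inverter on A
            "C": lambda s: int(s["A"] or s["B"]),   # assumed: C = A or B
            "D": lambda s: int(s["C"] and s["D"]),  # assumed: D latches 0 once C drops
        }
        for _ in range(20):                         # plenty of steps for this tiny circuit
            unstable = [g for g in priority if funcs[g](state) != state[g]]
            if not unstable:
                return tuple(state[x] for x in "ABCD")
            state[unstable[0]] = funcs[unstable[0]](state)
        return tuple(state[x] for x in "ABCD")

    after_input = {"A": 0, "B": 0, "C": 1, "D": 1}      # input A has just dropped from 1 to 0
    print(settle(dict(after_input), ["C", "D", "B"]))   # B slow: settles at (0, 1, 1, 0)
    print(settle(dict(after_input), ["B", "C", "D"]))   # B fast: settles at (0, 1, 1, 1)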

Actual hardware gates have differing activation thresholds. Therefore a voltage signal that may be viewed as a logical 1 by one gate may not yet be a 1 to another. This necessitates the assumption of 'instantaneous output transitions' in speed-independence theory, because otherwise two gates reading two different branches of a fork might see two inconsistent values at the same time, making the assumptions of speed-independence theory invalid.

2 A formal model and a language of specification

The first step in the functional design of a circuit is to obtain, from the problem requirements, a formal specification. A specification is the set of all admissible and required interface behaviors of a system. A behavior is an interleaving of input and output events. The specification models a circuit module with the help of a theory: a formal language (or syntax) and a formal interpretation (or semantics). We chose Trace Theory ([Udd84, vdS85, Hoa85]) for the exposition of our methods and for describing our primitive modules (gates).[1]

In trace theory, symbols represent events and traces represent behaviors. Lower-case letters (subscripted or not) serve as symbolic names for communication events at similarly named communication ports. An event is an occurrence of the corresponding action. In circuit physics, signal (voltage) transitions are arguably the simplest and most natural events of communication. There is a one-to-one mapping between actions and ports of a module. The sets of input and output actions (equivalently, their symbol sets) in a specification are implicit and disjoint: we append '?' or '!' to a symbol name to denote that the name stands for an input or output action, respectively. We may leave these suffixes out for internal signals or when no confusion can arise. Symbol names can also be viewed as atomic trace structures representing simple specifications. A more complex specification is built recursively from the primitive specifications by applying the following operations: 'pref' is prefix-closure, ';' is sequential[2] composition, '||' is parallel composition, '|' is non-deterministic choice, and '*' is the Kleene closure or repetition. A symbol s raised to a (natural) power N indicates the sequential composition of N such symbols.

The 1-place pref operation, when applied to a set of behaviors, yields all prefixes of those behaviors. All module behaviors are prefix-closed; this formally legitimizes the common-sense observation that all partial interface behaviors (i.e. communications) that lead to an admissible behavior are themselves admissible. The sequential composition of u and v, i.e. uv, denotes that behavior v necessarily follows behavior u. The 1-place '*' operation denotes all finite concatenations (i.e. sequential compositions) of the behaviors in its argument set. The 2-place operation '||' is more complex and denotes concurrency[3] between the two argument sets. The parallel composition of two argument sets of behaviors is the set of all behaviors satisfying the following: 1) a behavior contains only symbols that appear in the implicit input or output sets of either argument; 2) when it is restricted (or projected) to the set of implicit input and output symbols of either argument behavior set, the result must be a behavior in that argument set.

[1] Various other languages based on Temporal Logics exist that capture safety, progress as well as fairness properties of general systems. Later we will briefly introduce a less intuitive, but more concise, algebra based on a variant of Trace Theory, called Receptive Process Theory [JU91], to verify a counter circuit's implementation.
[2] Very often we will use mere juxtaposition to denote sequential composition in order to avoid clutter. For a good introduction to trace theory for specifying circuits, see [Ebe91].
[3] Concurrency is used to capture the effect of the arbitrary delays allowed in a DI model of communication wires carrying possibly simultaneous events. Therefore, all causally unrelated events are concurrent, and there is no notion of simultaneity, unlike in many synchronous systems.

The 2-place '|' operation denotes the union of the two argument sets of behaviors: e.g., S = a (b | c), where all the symbols are either inputs or outputs, specifies that after a, either b or c, but not both, is allowed; thus, the complete set of valid traces is {a b, a c}.[4]

[4] This trace structure is not delay-insensitive [Udd84], because it does not contain the trace ba although ab is a valid trace and both symbols are inputs or outputs (hence, not causally related). The prefix-closure of S, a requirement for DI trace structures, is {ε, a, a b, a c}.
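
Because parallel composition is defined purely by projection, it can be prototyped in a few lines. The following sketch is only illustrative (it is not part of the trace-theoretic formalism itself): it enumerates, by brute force over short traces, exactly those interleavings whose projections onto each component alphabet are traces of that component; '|' is then just set union.

    def project(trace, alphabet):
        return tuple(s for s in trace if s in alphabet)

    def parallel(t1, a1, t2, a2, max_len=4):
        """Traces over a1 | a2, up to max_len, consistent with both components."""
        alphabet, result = a1 | a2, set()

        def extend(prefix):
            if project(prefix, a1) in t1 and project(prefix, a2) in t2:
                result.add(prefix)
            if len(prefix) < max_len:
                for s in alphabet:
                    extend(prefix + (s,))

        extend(())
        return result

    # a? || b? : each component performs a single event; the composition is the
    # set of both interleavings (together with all their prefixes).
    t_a, t_b = {(), ("a",)}, {(), ("b",)}
    print(sorted(parallel(t_a, {"a"}, t_b, {"b"})))
    # -> [(), ('a',), ('a', 'b'), ('b',), ('b', 'a')]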

2.1 Primitive modules and notation

We need a repertoire of basic devices, or primitives, to build general circuits; our choice is given in Table 1, with each primitive's specification shown next to it. An m × n-Join is operationally described as follows: it has m row inputs, n column inputs, and a matrix of m × n outputs, one for each pair of row and column inputs. The device and its environment repeat the following behavior: the device waits to receive exactly one row input and exactly one column input; upon receiving the two inputs, it makes a transition on the output corresponding to that input pair.

Example: We illustrate trace-theory notation using, as an example, a 1 × 2-Join, whose set of traces is given by pref(((a? || b0?) c0!) | ((a? || b1?) c1!))*. The symbols with the suffix '?' represent transition events at the three input ports of this module, namely a, b0, and b1. Similarly, the output symbols c0 and c1 represent the two output actions (and their occurrences). Hence, the input set is ⟨a, b0, b1⟩ and the output set is ⟨c0, c1⟩. Examples of valid partial behaviors are: a, a b0, a b0 c0, a b0 c0 b0, b0 a c0, b1 a c1 a b1 c1, a b1 c1 b0 a c0. Some invalid traces are: (1) a b0 b1, (2) a a, (3) b0 c0, (4) a b0 c1. The first two traces represent errors in the environment, while the last two represent errors in the module (refer to the operational description of a Join): (1) the environment cannot send both column inputs b1 and b0 without an intermediate output from the Join; (2) a cannot immediately follow itself, for the same reason as above; (3) output c0 is produced too early; (4) c1 is the wrong output, c0 should have been produced instead. Note that any extension of an invalid trace by concatenating symbols on the right is also an invalid trace.

A Fork, represented by branched lines, accepts one input and then produces two outputs before repeating itself. A Merge accepts exactly one input, which is followed by an output transition, before it repeats itself. A Toggle device distributes its input transitions between its two outputs alternately.
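
For concreteness, the following illustrative sketch replays the traces listed above against the 1 × 2-Join specification, modeling the device as a cycle that collects a and exactly one of b0/b1 and must then emit the matching output; the function and variable names are ours and carry no special meaning.

    def admissible(trace):
        got_a, got_b = False, None              # inputs collected in the current cycle
        for e in trace:
            if e == "a":
                if got_a:
                    return False                # a second a? before an output: environment error
                got_a = True
            elif e in ("b0", "b1"):
                if got_b is not None:
                    return False                # both column inputs in one cycle: environment error
                got_b = e
            elif e in ("c0", "c1"):
                expected = {"b0": "c0", "b1": "c1"}.get(got_b)
                if not got_a or e != expected:
                    return False                # output too early or on the wrong port: module error
                got_a, got_b = False, None      # cycle complete
            else:
                return False
        return True                             # every prefix seen so far was admissible

    for t in (["a", "b0", "c0", "b0"], ["b1", "a", "c1", "a", "b1", "c1"],
              ["a", "b0", "b1"], ["a", "a"], ["b0", "c0"], ["a", "b0", "c1"]):
        print(" ".join(t), "->", "valid" if admissible(t) else "invalid")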

2.1.1 Additional conventions

A bubble at an input terminal of a device denotes an initialization in which a transition at that terminal is assumed to have been received already. For instance, a C-element with a bubble at the a-input behaves like a 1 × 1-Join after receiving a transition on a first, i.e. pref(b? c! ((a? || b?) c!)*). A heavy dot near a terminal of a device denotes an initialization where only the indicated terminal may produce the first output of the device. In our figures, we occasionally use a circle labeled 'P' as a short-hand for a tree of Merges.

Name                      Specification
----                      -------------
Fork                      pref(a? (b! || c!))*
Merge                     pref((a? | b?) c!)*
Toggle                    pref(a? c! a? d!)*
2 × 1-Join                pref(((a? || b0?) c0!) | ((a? || b1?) c1!))*
Ljoin                     pref(((a0? || b0?) p!) | ((a0? || b1?) q!) | ((a1? || b0?) r!))*
Tria                      pref(((a? || b?) p!) | ((a? || c?) q!) | ((b? || c?) r!))*
2 × 2-Join                pref(((a? || c?) p!) | ((a? || d?) q!) | ((b? || c?) r!) | ((b? || d?) s!))*
Sequencer                 pref((r0? g0!)* || (r1? g1!)* || ((g0! | g1!) c?)*)
2-way resource arbiter    pref((r0? a d0!)* || (r1? b d1!)* || ((a | b) r! g?)* || ((a d0!) | (b d1!))*)

Table 1: A (non-minimal) set of DI primitives


The response time of a circuit module, defined by the function T(·), is the longest time the module's environment must wait to receive the response signal(s) after supplying a set of necessary input signals to the circuit. A(·) computes the area of a module's implementation in an arbitrary but fixed unit. Similarly, P(·) gives the power consumption of a circuit.

3 Characterizing delay-insensitivity

Definition 1: Following [Kel74], let N = (Q, q0, Σ, Δ, f, g) be a deterministic finite-state sequential machine (DFSM), where 1) Q is a finite set of internal states, 2) q0 ∈ Q is the initial state, 3) Σ is a finite input alphabet, 4) Δ is a finite output alphabet, 5) f : Q × Σ → Q is a partial state-transition function, and 6) g : Q × Σ → Δ is a partial output function. For an input and a state, if either the next state or the output is undefined, the input is considered unsafe.

Definition 2: A module is a 6-tuple M = (I, O, N, Imap, Omap, A), where 1) I is a set of input ports, 2) O is a set of output ports, 3) N is a DFSM, 4) Imap is a one-to-one function from Σ to I, 5) Omap is a one-to-one function from Δ to 2^O, and 6) A : Q → 2^(2^Σ) is a function specifying, for each state, all the combinations of input symbols that may occur concurrently, such that any order of assimilation of concurrent inputs is safe.

Keller identifies 10 conditions that characterize the class of delay-insensitive (DI) modules realizable as a network of other modules. We call it Keller's class.
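
Definitions 1 and 2 translate directly into data structures. The rendering below is only illustrative (field and class names are not taken from [Kel74]); the partial functions f and g are dictionaries, and a 1 × 1-Join (Muller C-element) is encoded as a small example, with a silent output symbol 'eps' that Omap would send to the empty port set.

    from dataclasses import dataclass, field
    from typing import Dict, FrozenSet, Set, Tuple

    @dataclass
    class DFSM:
        Q: Set[str]
        q0: str
        Sigma: Set[str]                              # input alphabet
        Delta: Set[str]                              # output alphabet
        f: Dict[Tuple[str, str], str]                # partial state-transition function
        g: Dict[Tuple[str, str], str]                # partial output function

        def safe(self, q: str, a: str) -> bool:
            return (q, a) in self.f and (q, a) in self.g

    @dataclass
    class Module:
        I: Set[str]
        O: Set[str]
        N: DFSM
        Imap: Dict[str, str]                         # Sigma -> I, one-to-one
        Omap: Dict[str, FrozenSet]                   # Delta -> subsets of O
        A: Dict[str, Set[FrozenSet]] = field(default_factory=dict)

    c_element = DFSM(
        Q={"idle", "gotA", "gotB"}, q0="idle",
        Sigma={"a", "b"}, Delta={"c", "eps"},
        f={("idle", "a"): "gotA", ("idle", "b"): "gotB",
           ("gotA", "b"): "idle", ("gotB", "a"): "idle"},
        g={("idle", "a"): "eps", ("idle", "b"): "eps",
           ("gotA", "b"): "c", ("gotB", "a"): "c"},
    )
    print(c_element.safe("idle", "a"), c_element.safe("gotA", "a"))   # True False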

3.1 A minimal, universal set of primitives

We show that any module in Keller's class is realizable as a DI composition/network of Fork, Merge, Sequencer, and Ljoin (or Tria) primitive modules. We believe that not only do these primitives have more efficient implementations under the robust transition-signalling convention, but they also facilitate more efficient decompositions of a very large class of DI modules. Furthermore, finding such sets has deeper significance in parallel language design for asynchronous communicating processes and dataflow networks, in regard to the necessity and sufficiency of language operators for classes of computations. To show universality, we exhibit DI realizations of the modules in Keller's universal sets, namely Fork, Merge, Select, ATS and AC, as networks of our primitives. (See the appendix for a direct proof.) First, we show some needed constructions before we give the necessary decompositions. Figure 2 shows a realization of a 2 × 2-Join as a network of Forks and Ljoins on the left, while on the right, Merges and Trias replace the Ljoins. Then, observe that by hiding appropriate inputs and outputs of a 2 × 2-Join, we can obtain a 1 × 1-Join (Muller's C-element, shown as a circle around 'C') and a 2 × 1-Join. Next, we show that an arbitrary M × N-Join can be decomposed, with asymptotic optimality, into a set of Forks, Merges, 2 × 2-Joins, 2 × 1-Joins and 1 × 1-Joins.

3.1.1 Decomposing larger Joins

Figure 3 illustrates a decomposition method that Josephs [Jos] has developed. The idea is to determine to which 'quadrant' of the M × N matrix the output corresponding to a pair (row, column) of inputs belongs. Four parity trees, as in Figure 3, feed a 2 × 2-Join which in turn steers the inputs of the M × N-Join to the appropriate quadrant. There is one more intermediate step: the p × 2-Join, where p is ⌊N/2⌋, ⌊M/2⌋, ⌈N/2⌉, or ⌈M/2⌉, works like a 'switch-box', switching its inputs to one of two quadrants. Each quadrant is a smaller Join that decomposes recursively just as the bigger one does.

[Figure 2: Decomposition of a 2 × 2-Join into Ljoins and into Trias.]

The recursion 'terminates' when the Join to be decomposed is a constant Join, namely a 1 × 1, 1 × 2, or 2 × 2. Temporally, when an output from an M × N-Join occurs, all the constituent constant Joins will have become 'quiescent' and ready to accept another pair of inputs according to their specification. Note that when M or N is not a power of 2, we will have a situation at some level of decomposition where one needs to determine one out of two halves, because two of the horizontal or vertical quadrants are non-existent. In such a case, a 2 × 1-Join replaces the 2 × 2. The response time of a p × 2-Join can be seen to be O(log p), and we have log(max(M, N)) sequential levels of signal flow before an output is produced. At each level i, an (N/2^i) × 2-Join responds in (lg N − i) units of time, making the time complexity of this decomposition Θ((log max(M, N))^2). We assume that the response time of each constant-sized device is a constant.[5]

[5] We assume routing and layout are done properly to minimize the dependence on M and N of the interconnect lengths between logically adjacent components, so we can realistically discount the effects of interconnects.

Our decomposition method, indicated in Figure 4, composes a row and a column parity tree (balanced, for efficiency) with an M × N-Tjoin device. Each Tjoin device is then recursively decomposed as shown, until the recursion terminates as in the previous method. A Tjoin is like a Join except that it receives all the internal signals (Cl, Cll, Rul, Rlu, etc.) of the two parity trees for the row and column inputs. This helps retain the 'information' that a parity-tree output loses, which is used in selecting the appropriate quadrants in the matrix. Furthermore, in the decomposition of a Tjoin, the counterpart of the switch-box in Figure 3 is a simpler module, called a Tree-Mux. The Tree-Mux is decomposed into 2 × 2-Joins as shown. Intuitively, one may think of the transitions in a Tjoin as progressing like a wave. In Figure 4, we have shown a shaded region to indicate the Joins that are computing if the inputs of the M × N-Join are in the lower ranges of the numbering scheme shown. By the time the center Join has its row and column inputs and then computes its output, the Joins on its logical left and below should have their column and row inputs, respectively. The output from the center Join is forked to the other two 2 × 1-Joins, which thereafter compute concurrently, and their outputs go on to activate the ⌊N/2⌋ × ⌊M/2⌋-Tjoin. This parallelism between the Tree-Mux and a Tjoin quadrant's computations helps reduce the response-time complexity to Θ(log max(M, N)), which is optimal. Moreover, since neither a Tree-Mux nor a Tjoin uses Merges in its decomposition, our method improves on the area complexity of the decomposition presented earlier by a constant. The power consumption P(N × N-Join) is proportional to the number of internal and external signal transitions needed to produce an output and is, therefore, O((log N)^2). We conjecture that this is also optimal.
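
The two response-time recurrences can be compared numerically. The following back-of-the-envelope sketch sets every primitive delay to one unit (an assumption, not a measured cost): in the first method each of the roughly log N recursion levels pays for a p × 2-Join whose parity tree costs about log p, while in the Tjoin method the Tree-Mux work overlaps the quadrant's computation and each level adds only a constant.

    import math

    def t_first_method(n, c=1.0):
        if n <= 2:
            return c
        return t_first_method(n // 2, c) + c * math.log2(n)   # switch-box parity tree at this level

    def t_second_method(n, c=1.0):
        if n <= 2:
            return c
        return t_second_method(n // 2, c) + c                 # Tree-Mux overlapped with the quadrant

    for n in (8, 64, 1024, 2 ** 20):
        print(f"N={n:>8}  first ~ {t_first_method(n):7.1f}  "
              f"(log^2 N = {math.log2(n) ** 2:6.1f})   "
              f"second ~ {t_second_method(n):5.1f}  (log N = {math.log2(n):4.1f})")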

[Figure 3: A decomposition of the M × N-Join module.]

3.1.2 Universality

Now we are ready to show the decompositions of Keller's SELECT and ATS modules, as in Figure 5. The decompositions simulate the finite state machines associated with the Select and ATS modules directly. Very often the number of I/O ports is proportional to the complexity of interaction between a module and its environment, and hence inversely proportional to ease of use. For proofs of correctness concerning such practical issues as testability, it is far easier to deal with a smaller set of primitives.

Definition 3: The I/O-modularity of a set of modules is the maximum number of I/O lines on any module in the set.

[Figure 4: Optimal decomposition of the M × N-Join module.]

Definition 4: A module is called serial if its specification prohibits concurrent inputs or outputs, i.e. each output event has exactly one unique input event that causes it. A module is called parallel if its specification allows concurrent inputs and concurrent outputs.

It is important, for the sake of energy efficiency and for speed, to be able to design circuits that do not have signals racing continuously. For example, a Sequencer can be implemented from two ATS modules as in Figure 6, but a signal races around continuously after c? is received, until an arbitration is made.

Definition 5: A module's decomposition is without busy-waiting if the internal wires and component modules have no transitions imminent when no inputs are supplied to the module for a sufficiently long time. A set of modules is wbw-universal for a class if any module in the class is decomposable without busy-waiting.

We show in Figure 7(a) a decomposition of a 2-way resource arbiter; in Figure 7(b) we show how to build larger resource arbiters recursively; and finally, in Figure 7(c) a recursive composition constructing an (m+n)-way Sequencer without busy-waiting is depicted. For the sake of completeness, we observe that an n-input Merge can be realized from 2-input Merges as a 'parity tree', which we often pictorially denote as a circle with a 'P' on it.

[Figure 5: Decomposing SELECT and ATS modules.]

[Figure 6: A busy-waiting Sequencer built from two ATS modules.]

Theorem 1: The set {Fork, Merge, Ljoin, Sequencer} has I/O-modularity 7 and cardinality 4, and is wbw-universal for Keller's class of DI modules.

Theorem 2: The set {Fork, Merge, Tria, Sequencer} has I/O-modularity 6 and cardinality 4, and is wbw-universal for Keller's class of DI modules.

[Kel74] shows that any parallel module M may be realized by serializing the possibly concurrent inputs and then feeding them to an appropriate serial module M', the details of which are not important for the following discussion and for which the reader is directed to that paper. To avoid busy-waiting in the serializer, Keller used an Arbitrating Call (AC) module, which is exactly what we call a 2-way resource arbiter. But we have demonstrated a construction of a multi-way Sequencer and a multi-way resource arbiter in Figure 7 using a 2 × 1-Join. Moreover, note that a Sequencer implements a 2 × 1-Join directly, if we interpret the column input of the 2 × 1-Join as the c? of the Sequencer.[6] Hence,

Theorem 3: The set {Fork, Merge, Select, Sequencer} is wbw-universal for Keller's class of DI modules.

Keller exhibits a decomposition of Select into the set {K, G, Merge}, where the new modules K and G, which we do not define here, have I/O-modularity 5. Therefore,

Corollary 1: All modules of Keller's class can be decomposed into primitive modules with no more than 5 I/O ports, i.e. {Fork, Merge, Sequencer, K, G} is wbw-universal.

To the extent that low I/O-modularity and low cardinality of a universal set signify simplicity, we have partially answered some pertinent philosophical questions raised in [Kel74], by discovering wbw-universal sets which have a lower I/O-modularity than Keller's wbw-universal sets of the same cardinality.

[6] This points out that an implementation can be 'better', by being able to 'tolerate' a worse environment, than its specification requires. But bear in mind that the constructions in this section are not necessarily what we will use in practice. A methodology for efficiently decomposing a module into primitives with efficient transistor implementations is currently being researched.

[Figure 7: Realizing larger arbiters and sequencers.]

Finally we observe that a 2 × 2-Join can be decomposed, albeit inefficiently, into many copies of Select, ATS, Merge and Fork, and a Tria into three Ljoins and Forks.

Theorem 4: All the wbw-universal sets, both ours and Keller's, are equivalent in that they are realizable by each other.

3.1.3 Minimality

Definition 6: A set of modules is minimal if none of them can be realized by the rest. In other words, a set S is minimal if no proper subset of S is universal with respect to S.

Definition 7: An event e causally precedes event f if in all valid behaviors f occurs after e. e immediately precedes f if e precedes f and there exists no event g such that g precedes f and e precedes g. The predecessor event set of an event f is the unique set of all events that immediately precede f. An action a precedes an action p if an occurrence of a precedes an occurrence of p. We similarly define a predecessor action set (PAS) of an action; this set is no longer unique, because different occurrences (events) of an action may have different predecessors. An action p has a determining predecessor action (DPA) a if a is in each PAS of p, and a precedes only p.

We will discuss minimality of the universal set {Fork, Merge, Sequencer, Tria} only. No module above other than Fork produces more signals than it assimilates; hence Fork is non-redundant. When two different inputs require a module to generate the same output, some signals need to be 'merged' to obtain a single output signal; Merge is the only module that produces the same output for two different inputs, hence Merge is non-redundant. The Sequencer clearly is the only component for arbitration or mutual exclusion.

Lemma 1: Each of the outputs of each of the first three modules above has a determining predecessor action.

Therefore a Tria, which is an acceptable module with DI behavior and whose outputs have no DPA, cannot be realized by any composition of the first three kinds of modules. A Tria is also non-redundant: Fork and Merge are combinational elements; the Sequencer is not 'sequential' either, in the sense that it cannot be composed with other Sequencers, Forks and Merges to store a state and implement a sequential function. A Tria is the only truly state-holding module and hence is not redundant; Figure 2 shows how to construct a 2 × 2-Join, and hence any deterministic sequential machine, using Trias.

It is strongly believed that no serial module can be realized by primitive modules of I/O-modularity less than 5. Moreover, a serial module such as a Select is conjectured not to be realizable by Fork, Merge, Sequencer and any other primitive module with a maximum of 5 input and output ports. Note that a Sequencer can simulate a 1 × 2-Join. This leads us to believe that our set of primitives is minimal in a strong sense. In the following, we show the power and flexibility of these primitives in achieving efficiency along multiple axes.

4 Case Studies

4.1 Elastic buffer: an example

The consumer-producer paradigm in its simplest and most common form requires a linear buffer (a.k.a. queue or FIFO) between two computational blocks. We highlight some of our design philosophy by designing a queue for illustration. A shift register of the synchronous design discipline corresponds to an elastic queue in the 'asynchronous' world. Sutherland's Micropipelines use such elastic queues to increase the throughput of 'pipelined' systems. All three design aspects, namely specification, implementation and verification, of this classic asynchronous circuit have been widely studied under various models and methods. Most of the studies have assumed a dichotomy of "data" and "control" parts of the queue. Some ([DNS92]) have designed only the control part, due to certain limitations and specializations of their methodology, while other methods have generated much more complex circuits [Haz92] (page 41). Here, as well as elsewhere, we have observed that it is often convenient and appropriate for asynchronous designs to relax the first, and not to maintain the second, of the following two traditional 'separations' of conventional design approaches:

- separation of control and data
- separation of storage and combinational logic

We also think that purely delay-insensitive circuits are well suited to bit-serial datapaths ([JP, Pat]), in comparison to parallel datapaths, which require extensive implicit synchronization between 'bits', thereby "running against the grain" of a fully asynchronous methodology. Below is a simple, fairly efficient asynchronous implementation of a one-bit-wide, four-place elastic FIFO (Figure 8) without these dichotomies; it illustrates the power of a pair of the circuit primitives described above.

An operational description: initially the queue is empty and the consumer is ready to receive data. The producer 'loads' data from the left and the consumer removes data from the right, bit by bit. Depending on the viewpoint, the queue holds at most 4 or 5 bits of serial data. Each stage (represented by the dotted subcircuit) 'latches' its input iff new data has arrived from the left and the right neighbor has indicated readiness for new input (by acknowledging the previous input). At the FIFO level, input from the producer is acknowledged by an Ack! transition, while a Req? transition
is interpreted as a request from the consumer for new data. The arrival of new data is signalled by a transition a0? or a1? (various dual-rail encodings for transmitting 1-bit data exist). Merge devices serve to generate the acknowledge signals.

[Figure 8: A 1-bit-wide 4-place elastic FIFO.]
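
The behavior just described can be prototyped at an abstract level. The following sketch is not a gate-level model of Figure 8; stages are modeled as one-place buffers that shift right whenever the right neighbor is empty, which is enough to exhibit the elastic (variable-occupancy) behavior.

    class ElasticFifo:
        def __init__(self, places=4):
            self.stages = [None] * places          # None = empty stage, 0/1 = stored bit

        def can_accept(self):                      # producer's Ack! is still pending otherwise
            return self.stages[0] is None

        def put(self, bit):                        # an a0?/a1? transition from the producer
            assert self.can_accept()
            self.stages[0] = bit

        def step(self):
            """Every stage with an empty right neighbour passes its bit along."""
            for i in reversed(range(len(self.stages) - 1)):
                if self.stages[i] is not None and self.stages[i + 1] is None:
                    self.stages[i + 1], self.stages[i] = self.stages[i], None

        def get(self):                             # a Req? transition from the consumer
            self.step()
            bit, self.stages[-1] = self.stages[-1], None
            return bit

    fifo = ElasticFifo()
    sent = [1, 0, 1, 1]
    for b in sent:
        fifo.step()
        fifo.put(b)
    print(sent, [fifo.get() for _ in range(4)])    # the bits come out in FIFO order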

4.2 A mod-N counter

Subcircuits in a circuit need to transfer information among themselves, and this often requires some form of synchronization. If needed, data as well as control information may be transferred within a synchronization event. For conventional clocked circuits a 'global' clock serves to synchronize, while asynchronous communication requires some form of handshake for synchronization among the communicating modules. Therefore it is sometimes convenient, though not necessary, to think of circuits as communicating along handshake channels [vB92] as a way of 'modularizing' the interface, i.e. a logical communication link is encapsulated by defining the hardware components that control and realize it. But we will not insist on 'channels' for internal communication, to avoid inefficiency. Accordingly, we first specify our mod-N counter as having two channels a and b and behaving as follows: it repeats the following behavior indefinitely: it participates in N sequential synchronizations along channel a followed by one along b. An application of such a circuit is where a task needs to be done once every N executions of some other task, or every N clock 'ticks'. When the tasks are simple synchronizations, the circuit is called a modulo counter. Separating a and b into two different handshake channels makes it possible for two physically independent modules to interact through the 'services' of the counter, provided via the two channels. Such separation of control leads to very efficient, structured and modular 'pipelined' designs in our methodology. Hence, our first trace-theoretic behavioral specification of the counter is S0 = pref((a^N ; b)*).

5 Design and analysis of the counter

5.1 Refinement based on choice of handshake protocol

Under the requirement of delay-insensitivity, a transition on a port must be acknowledged before another can follow, to avoid transmission interference [Udd84]. This makes a handshake protocol necessary for synchronization. Each input need not be individually acknowledged, but in our case we have two independent interfaces and hence two independent handshake protocols. Usually two physical wires, logically directed opposite to each other, are used to implement such a channel. As a matter of convention, a channel name subscripted with 0 names the wire that undergoes the first transition in a handshake cycle (i.e. the REQUEST line); the other wire (the ACK line), which undergoes the second transition, is subscripted with 1. We choose a 2-phase protocol for our first implementation because of its simplicity and economy, compared to a 4-phase protocol, which is intuitively likely to consume twice as much power and take twice as much time because of its 'reset' transitions. Later we will present ample evidence to justify this 'suspicion'. In 2-phase handshaking, a request signal from one communication partner is followed by an acknowledge signal from the other to complete a synchronization (handshake) on a channel. Such a communication involves no transfer of data, but just serves to synchronize the partners.

Having decided on a handshake protocol, we have broken down each asynchronous communication (or, for our mod-N counter, each synchronization) into smaller atomic events (signal transitions at input or output ports).

[Figure 9: A mod-N counter implemented by an N × 1-Join and an N-input parity tree.]

The logical next step is to decide: (1) who, circuit or environment, initiates a communication? (2) when is a synchronization attempt deemed 'committed' from the circuit's standpoint? Our answer to question (1) is that it can be arbitrarily 'agreed upon' between the circuit's implementor and the circuit's environment. As a rule of thumb, we suggest that the circuit (environment) should initiate a synchronization if the corresponding channel is an input (output) of the circuit. The answer to question (2) is: when the circuit receives a handshake signal from the environment. That is, the circuit is assured of a synchronization only upon receiving an input signal in accordance with an agreed-upon protocol as in answer (1). Hence, assuming a 2-phase handshake, a straightforward refinement of S0 at the level of transitions on connection wires is

S1 = pref(((a0? a1!)^N b0! b1?)*)

where, without loss of generality, we assume that the environment sends requests on channel a while the counter is the 'requester' for b. But S1 is not a delay-insensitive specification, as may be checked against Udding's conditions for delay-insensitivity [Udd84], because the two input signals b1? and a0? are allowed in only one sequential order. (Informally, because of possible arbitrary delays in the interface, it cannot be guaranteed that b1? is received before a0? if they are causally unrelated.) Considering our answer to question (2) above, and noting the need to preserve the original synchronization semantics of the counter's specification, we refine further to get

S2 = pref(((a0? a1!)^(N-1) a0? b0! b1? a1!)*)

where a suitable order between b1? and a0? is enforced by 'enveloping' the handshake on b within the Nth handshake on a. At this point, we have precisely stated the DI interface behavior of the circuit and its environment. Next we consider further refinements to derive an implementation with 'good performance' characteristics.
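
One quick way to see what S2 demands is to unroll a cycle of it. The following sketch simply generates the canonical trace of one cycle for a given N and checks that it contains N handshakes on a with the single b handshake enveloped by the last of them.

    def s2_cycle(n):
        trace = []
        for _ in range(n - 1):
            trace += ["a0?", "a1!"]
        trace += ["a0?", "b0!", "b1?", "a1!"]      # the Nth a-handshake envelops b
        return trace

    for n in (1, 3, 5):
        t = s2_cycle(n)
        assert t.count("a0?") == n == t.count("a1!")
        assert t.count("b0!") == 1 == t.count("b1?")
        assert t.index("b0!") > t.index("a0?") and t[-1] == "a1!"
        print(n, " ".join(t))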

5.2 Ad hoc Design 1

It is easy to employ an N × 1-Join device, which can be decomposed efficiently into 2 × 1-Joins and Merges, both O(N) in number, to simulate the 'sequential state machine' suggested by S2 (see Figure 9). But the area and response-time costs of such a monolithic circuit are O(N) and O(log N), respectively. From the theory of algorithm design, we know that recursion-induction is a powerful tool for solving and analyzing a problem indexed by an arbitrary integer. We will apply the same tool to decompose this family of state machines (mod-N counters) into circuits that are economical in time, area, and switching energy. In the following, we recursively apply the well-known technique of divide & conquer to build 'bigger' counters from 'smaller' ones.

First, we attempt to design a 'glue cell' that can control two subcounters, say of sizes R and S. In one such arrangement (Figure 10(a)) the glue cell distributes a? inputs first to the mod-R counter. After the mod-R counter finishes counting up to R (indicated by its request signal on m), the glue activates the mod-S counter to receive the subsequent inputs.

[Figure 10: First decomposition of the counter using divide & conquer: (a) N = R+S+1, (b) N = 2R+1, (c) N = 2R+2, (d) base cases N = 1 and N = 2.]

Note that the mod-S counter has an unusual initial state, i.e. its specification is pref(q1? p1! (p0? p1!)^(S-1) p0? q0!)*. When the mod-S counter finishes, the mod-R counter is activated, and the behavior repeats. Notice that an input a? is acknowledged as soon as the glue cell distributes it to a subcounter, which is motivated by our desire to minimize response time. When the mod-S counter finishes, the glue cell is ready to produce a b! output if and only if it also receives an a?. The acknowledgement b? from the environment prompts another a! output through the 3-input Merge gate, as per specification S2. This makes the network of subcounters and glue cell a mod-(R+S+1) counter.

How much area does this design take? The area complexity is given by A(N) = A(3 × 1-Join) + A(3-input Merge) + A(connection wires) + A(mod-S counter) + A(mod-R counter), which is O(N), as the first three terms on the right-hand side can be assumed constant. We try to improve this by 'time-multiplexing' one subcounter. That is, instead of a separate mod-S subcounter, we reuse the mod-R subcounter. But we then have a problem, because the glue cell treated the request signals on m and q differently. This can be taken care of by using a Toggle to differentiate them, as shown in Figure 10(b). The 3 × 1-Join reduces to a 2 × 1-Join by virtue of the multiplexing. The counter in Figure 10(b) counts modulo 2R+1. This design, which is for odd N, can fortunately be modified easily to obtain one for even N (see Figure 10(c)) by making both outputs of the Toggle synchronize (and hence 'count') with a?: the Toggle is 'pushed through the 2 × 1-Join'. Since we are describing the design recursively, we provide the base-case designs of counters for N = 2 and N = 1 in Figure 10(d). We call the glue (all the logic excluding the subcounter) a counter-element.

5.2.1 Area complexity

The area complexity of both the even and odd counter designs is given by A(N) = A(Toggle) + A(2 × 1-Join) + 2A(Merge) + A(⌊N/2⌋), where the first three terms on the right of the equality are constants. Therefore, the area complexity is O(log N).
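
The recurrence can be evaluated directly. The following sketch assumes a unit cost per primitive (so one glue cell of a Toggle, a 2 × 1-Join and two Merges costs 4 units) and compares the result against the monolithic O(N) design of Figure 9.

    def area_recursive(n, cell_cost=4):
        return cell_cost if n <= 2 else cell_cost + area_recursive(n // 2, cell_cost)

    def area_monolithic(n, cost_per_input=2):   # N x 1-Join plus an N-input parity tree
        return cost_per_input * n

    for n in (15, 255, 4095, 65535):
        print(n, area_recursive(n), area_monolithic(n))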

5.2.2 Worst-case response-time calculation

We ignore wire delays in our (worst-case) response-time calculations, since wire delays depend on the placement and routing of the various primitives. More importantly, our designs are linear arrays in which connections exist only between neighboring elements, so each wire segment can be assumed to be a small constant; and there are no timing requirements on wire lengths (as this is a DI decomposition). Therefore, we do not consider interconnect length as a 'variable' in our circuit-cost calculations. Below, we calculate an upper bound on the mod-(2R+1) counter's response time. Let T(mod-R), T(2 × 1-Join) and T(env) denote the response times of the mod-R subcounter, the 2 × 1-Join, and the environment, respectively. The response time of the counter is the maximum of the logic delay between the arrival of an input on a0 and the generation of the corresponding response signal a1! or b0!, and the delay between the reception of b1? and the resultant output of a1!. Note that: 1) we do not show the subscripts indicating request/ack wires in the figures, to avoid clutter; 2) the 2 × 1-Join in the glue synchronizes the external input a with input l1 or m0 from the subcounter.



T (Mod-2R +1 Counter)

/* for R > 0 */

MAX (T (Merge),

/* delay from b1? to the subsequent a1! */

T (Merge) + T (2  1-Join) + /* delay between a? and the subsequent a! */ +MAX ( T (mod-R)+T (Toggle)+T (Merge)+ T (m?-to-l!) ? (T (Merge) + T (env)), 0))

/* subcounter and Toggle could only add to response time*/

In the above, T (m?-to-l!) is the time between when the mod-R counter receives m? and then responds with l!; it is equal to T (Merge). But, note that response-time of a mod-R subcounter is at least T (mod-R). Therefore, the response-time of the subcounters is bounded by the constant 2T (Merge) + T (2  1-Join) + T (Toggle). But, this implies that the overall counter mod-N counter is bounded by the constant T (Merge)+T (2  1-Join)+(T (mod-R)+ T (Toggle)+T (Merge)) | assuming the counter's environment has no delay. If the environment has a positive delay then the circuit's response-time will look even better, by de nition. A similar conclusion can be drawn for a mod-2R + 2 counter.

5.2.3 Power Consumption

The energy consumption per external communication of a circuit is de ned as the average number of input and output signal transitions within the circuit per communication with the environment. This de nition has a bene cial characteristic in that the energy eciency of a circuit is necessarily tied down to its interface speci cation, making comparisons between implementations (their internal energy consumption) sensible. An easy inspection shows that a counter-element produces an external output to its right neighbor for every input it receives from the left, as in Figure 10(b,c). Each input event causes at most a constant number of transitions within a glue-cell (counter-element) itself. Since there are O (log N ) elements in a chain, each external a? input causes O (log N ) communications between adjacent elements. This means P (Counter) is O (log N ).

5.3 Adhoc Design 2

We remain unsatis ed with the energy eciency of the previous circuit. We attempt to explore another possibility of distribution of inputs by a glue-cell: suppose, now it alternates between the

6 RUDIMENTS OF A DESIGN METHODOLOGY a?

b?

a?

l! Mod-R l? Counter m! m? N = 2R+1

19

b?

a!

p?

l!

p! Mod-R Counter q! q?

l? Mod-R Counter m! m?

b!

b!

b?

a!

l! l? Mod-R Counter m! m?

N = 2R+1 b!

(b)

(a)

a?

N = 2R a!

(c)

[Figure 11: Second decomposition using divide & conquer: (a) and (b) N = 2R+1, (c) N = 2R.]

For better or worse, this does not allow R and S (as in the previous design) to be very different! What we get is shown in Figure 11(a). Here we notice that the right-hand subcounter is not very useful and hence can be replaced by a Wire, as in Figure 11(b), to obtain a mod-(2R+1) counter. A little manipulation of the structure allows us to realize an even, mod-2R counter, shown in Figure 11(c).

5.3.1 Analysis

The area measure of this design is seen to be about (A(3 × 1-Join) + 2A(Merge)) log N, by an analysis like that of the previous section. The response time is bounded by a constant, as seen from the following inequality:

T(mod-(2R+1) counter)
  ≤ MAX( T(3-input Merge),                                        /* delay from b? to the subsequent a! */
         T(3-input Merge) + T(3 × 1-Join)                         /* delay from a? to the subsequent a! */
         + MAX( 0, T(sub) − (T(3-input Merge) + T(env)) ) )

The non-first counter-elements have response time bounded from above by T(3-input Merge) + T(3 × 1-Join), while the first element has response time about T(3 × 1-Join) higher, by the same reasoning as in the previous section. The energy consumption for a^N b is some constant multiple of the total number of internal and external communications, which is about

E_N = Σ_{i=0}^{n} (N div 2^i + 1)

where n = ⌊log N⌋. Since there are about 2N external communications with the environment, the power consumption of the circuit is P(counter) = E_N / 2N = O(1), a constant [vB93]. All these possibilities for manipulating the structure of an implementation make one wonder whether we can systematically explore the options and arrive at a good, efficient design. This is our next topic of discussion.
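
The contrast with Design 1 is easy to tabulate. The following sketch counts one communication per element visited per external input (a unit-cost simplification): in Design 1 every a? walks the whole chain, whereas here the element at level i is exercised only about N/2^i times per a^N b cycle, so E_N stays within a small constant multiple of the 2N external communications.

    import math

    def energy_design1(n):
        inputs_per_cycle = n + 1                       # N a-handshakes plus one on b
        chain_length = max(1, math.ceil(math.log2(n))) # every input traverses the whole chain
        return inputs_per_cycle * chain_length

    def energy_design2(n):
        levels = int(math.log2(n))
        return sum(n // 2 ** i + 1 for i in range(levels + 1))

    for n in (16, 256, 4096, 65536):
        e1, e2 = energy_design1(n), energy_design2(n)
        print(f"N={n:>6}  design1 per communication: {e1 / (2 * n):5.2f}  "
              f"design2 per communication: {e2 / (2 * n):5.2f}")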

6 Rudiments of a design methodology

Although the logical separation between channels a and b is an important specification principle for flexibility of the counter's composition with other DI modules, for simplicity we will use the specification pref(((a? a!)^(N-1) a? b!)*) in this section.

[Figure 12: Mod-N counters with area and power linear in N but constant response time: (a) N = K+1, (b) N = 2, (c) N = 1, (d) N = 2.]

The states of a module induce an equivalence relation on its trace set such that all traces that can be extended in the same ways belong to the same partition or, equivalently, represent a state. A quiescent state is a set of equivalent traces that cannot be extended by an output symbol, representing a state in which the module is waiting for inputs before it can generate any further output. Any circuit for a mod-N counter, N > 1, will have at least two different quiescent states, as the specification implies at least two different responses of the counter to the reception of a? under different histories: namely, an a! or a b!. Consequently, we employ a 2 × 1-Join to construct a 2-state machine, which derives its state signals from a 'cooperating' mod-(N-1) counter on its right. A Merge is used to initialize the 2-state machine when b! is output, as shown in Figure 12. The 2 × 1-Join simulates a state machine whose column input carries the inputs to the machine and whose row inputs define the state. This design entails a constant response time (= 2 T(2 × 1-Join) + T(Merge)), but unfortunately uses O(N) area and O(N) power!

6.1 A more complex counter-element

We now add a little more sophistication to each individual counter-element, so that it can determine its next state in a greater number of situations by itself. That is, a counter-element processes a little more of the information locally, and passes the rest to its 'neighbor' for processing. Specifically, in the case of the modulo counter, the counter-element can be designed to determine its next state for every other state, so that it does not have to communicate with its neighbor for its next output. This amounts to integrating another quiescent state into the counter-element's state machine, and motivates the designs in Figure 13(a),(b) for odd and even N, respectively. Analysis as in the previous sections (see also the next section, where we analyze a similar design) shows that the area, time and power complexities of this circuit are Θ(log N), Θ(1) and Θ(1), respectively. A 3 × 1-Join can be further decomposed as shown in Figure 13(c). Figure 13(d) shows the counter with this decomposition. Figure 13(e) is arrived at by observing that p? is followed by a p! or q! in the subcounter, and also that a? is followed by a! or b!. A nice property of this optimization is that the top 2 × 1-Join in Figure 13(e) is really the simpler Toggle device, and the figure is redrawn accordingly in (f). A circuit somewhat similar to the above has been independently developed in [EP92]. But this 'decoupling' of the two 2 × 1-Joins provides another powerful way to optimize area by trading off time, as a result of the important property of composability or modularity of asynchronous circuits, as will be seen next.

Figure 13: Mod-N counters: adding more logic to each counter-element. (Panels: (a) N = 2K+1; (b) N = 2K; (c) decomposition of a 3×1-Join; (d), (e), (f) successively optimized designs for N = 2K.)

6.2 Final design

Usually a circuit should generate an output as soon as it is possible and permissible, so that the receiving circuit can start processing it earlier. Accordingly, the subcounter is sent p! immediately after p? is received from it. The necessary manipulation results in the designs shown in Figure 14.

6.3 A sketch of the circuit's verification

[JU91] provides a nice algebra for delay-insensitive circuits, which we will use to `derive' a circuit for the counter. First, we specify a counter C_N (of size N) in this algebra, with a, b renamed as p, q:

C_N = if     p0? then p1!; C_(N-1)
      elseif q1? then ⊥

C_1 = if     p0? then q0!; if     q1? then p1!; C_N
                           elseif p0? then ⊥
      elseif q1? then ⊥

Here, external choice is expressed by if ... then ... elseif, sequentiality by `;', and `chaos' or undesirable inputs by ⊥. Note that the algebra is based on receptive process theory, and specifications explicitly represent the conditions under which particular outputs of the circuit are permitted and the conditions under which particular inputs from the environment are prohibited. Inspired by the expression expr(N), we attempt to build a larger counter from a counter-element E and C_N.

Figure 14: Energy-efficient DI decomposition of a mod-N counter. (Panels (a)–(e); the designs use subcounters of sizes (N−2)/2, N/2, (N−1)/2, 2, and 1.)

We choose M to be 2, as we already know a `good' implementation for that value:

E = if     a0? then a1!; F
    elseif p1? then ⊥
    elseif q0? then ⊥
    elseif b1? then ⊥

F = if     a0? then a1!; G
    elseif b1? then ⊥

G = if a0? then if     p1? then a1!; F
                elseif q0? then b0!; q1!; H
                elseif b1? then ⊥

H = if     b1? then a1!; F
    elseif a0? then ⊥
    elseif q0? then ⊥

In this notation, many equivalent states are concisely represented. For example, any one ordering of consecutive inputs (or outputs) represents the whole set of possible orderings arising from arbitrary input (or output) wire delays. Also, only the input transitions that are specifically prohibited at a quiescent state are explicitly represented in that state. In the specification of Counter-element E, there are 3 distinct quiescent states, namely B, C, D, and state E = C/p1?, where `/' is the after operator, i.e., state E is equivalent to the next state of C upon receiving p1?. A theory based on quiescent states has many desirable characteristics for proving progress properties of a system [CM88]. Moreover, these states form a basis for decomposition into circuits via state-machine implementations. It is straightforward to show that C_(2N+1) = E ∥ C_N. A 3×1-Join device and two Merge devices can be composed to implement E. Instead, we introduce an internal symbol l to find


a more efficient decomposition, with F and G redefined as follows:

F = if     a0? then a1!; a0?; l!; G
    elseif b1? then ⊥

G = if l? then if     p1? then a1!; F
               elseif q0? then b0!; q1!; H
               elseif b1? then ⊥

Now, quiescent state F clearly suggests a Toggle, while G has the structure of the specification of a 2×1-Join. All similarly named signals in different states are combined through Merge(s). Two signals occurring consecutively in a state are combined into a wire fork. The parallel composition of the algebraic representations of the devices in this Counter-element can be manipulated to verify counter-element ⊑ E, where ⊑ is CSP's [Hoa85] refinement relation. The details of the algebraic manipulation for equivalence checking are not shown here.

6.3.1 Response time

Below, we calculate an upper bound on the response time of the mod-N counter of Figure 14(c):

T(Counter) ≤ MAX( T(Toggle) + 2T(Merge),
                  T(Toggle) + T(Merge) + T(2×1-Join) + MAX(0, T(sub) − (2T(Toggle) + 3T(Merge) + T(env))) )

From the above expression, if T(env) ≥ T(2×1-Join), then T(counter) is a constant which is at most MAX(T(Toggle) + 2T(Merge), T(Toggle) + T(Merge) + T(2×1-Join)). Each stage (consisting of a Toggle, a 2×1-Join, and two Merges) serves as the environment of the subcounter to its right, and therefore T(env) for each subcounter is at least T(2×1-Join). Now, it is easy to see that if T(Toggle) + T(Merge) + T(2×1-Join) ≥ 2T(Toggle) + 3T(Merge), then T(counter) ≤ MAX(T(Toggle) + 2T(Merge), T(Toggle) + T(Merge) + T(2×1-Join)). The response times of the circuits in Figure 14(a,b) are less than half that of the circuit in Figure 13(e).
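The fixed point of this bound can be checked numerically. The sketch below (ours; the delay values are arbitrary illustrative assumptions) iterates T_k = max(A, B + max(0, T_(k-1) − C)) over many stages and shows that it settles at the constant max(A, B) when T(env) ≥ T(2×1-Join):

    # Hypothetical component delays (arbitrary units) -- assumptions for illustration only.
    T_TOGGLE, T_MERGE, T_JOIN = 3.0, 2.0, 4.0
    T_ENV = 5.0                        # environment no faster than a 2x1-Join, as assumed above

    A = T_TOGGLE + 2 * T_MERGE                 # first term of the MAX
    B = T_TOGGLE + T_MERGE + T_JOIN            # second term of the MAX
    C = 2 * T_TOGGLE + 3 * T_MERGE + T_ENV     # slack available to the subcounter

    T = 0.0                                    # innermost (tiny) subcounter
    for _ in range(20):                        # add 20 stages, one per iteration
        T = max(A, B + max(0.0, T - C))
    print(T, max(A, B))                        # both print 9.0: the bound is a constant
    assert T == max(A, B)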

6.4 Energy optimality

That the energy consumption is O(1) per external communication follows from the fact that each Counter-element generates one communication with its neighboring subcounter for every two communications with its `left' neighbor. Therefore, only O(N) endogenous signal transitions between neighbors are generated within a mod-N counter for every N handshakes on channel a.
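The O(1)-per-tick claim is essentially a geometric series: stage i sees roughly N/2^i handshakes for every N external ticks. A quick numeric check (our own illustration):

    def internal_handshakes(n, stages):
        """Roughly N/2, N/4, ... handshakes propagate to successive subcounters."""
        total, ticks = 0, n
        for _ in range(stages):
            ticks //= 2          # each element forwards one handshake per two it receives
            total += ticks
        return total

    n = 1024
    print(internal_handshakes(n, stages=10))   # 1023 internal handshakes for 1024 external ticks
    # So the energy per external communication stays bounded by a constant.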

6.4.1 Other variations

It is straightforward to modify the network of Figure 14(a) to implement the alternative specification with L = 2 (i.e., N = 2K+2, labeled `alt-spec' in Figure 15); the resulting network is shown in Figure 15(b). Furthermore, a 4-phase-to-2-phase converter for channel a, as shown in Figure 15(a), may be composed in to implement a 4-phase a channel (this can improve the response time by an additive constant).

Figure 15: 4-phase-to-2-phase converter; alternative specification. (Panels: (a) a 4-phase-to-2-phase converter for channel a; (b) the `alt-spec' counter, N = 2K+2, using a mod-K subcounter.)

6.5 A `parameterized' family of designs

In the abstract, a positive natural number N can be thought of as a product of smaller positive factors. If counters designed for the respective factors are composed together to form a linear asynchronous pipeline, we will have implemented a mod-N counter, with the accompanying area and energy efficiency. Figures 12(b) and 12(c) show implementations for N = 2 and N = 1. Moreover, in general, it is possible to modify (without much penalty) a given mod-M counter (which has the specification pref((a^M b)*)) into a circuit with behavior pref(a^L b (a^M b)*), for all L ≤ M. Now, consider an expansion of a natural number N as follows:

    expr(N) = N                            if N ≤ 1
    expr(N) = L + M · expr((N − L)/M)      otherwise,

where L, M and (N − L)/M are naturals, L ≤ M, and M > 1. This expansion gives us a clue to a particularly nice decomposition, where the outermost level of expr(N) is `implemented' by a counter-element which is composed, in parallel, with a subcounter `implementing' the inner levels. We have discussed above the designs with M = 2 and L = 1, 2.
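A small sketch (ours) of this expansion for the choice M = 2 and L in {1, 2} used in the designs above; expand returns the L chosen at each level (outermost first) together with the size of the innermost base counter, and rebuild confirms the expansion:

    def expand(n, m=2):
        """Expand N as expr(N) = L + M * expr((N - L)/M) with L in {1..M}, M = 2.
        Returns the list of L values (outermost level first) and the base-counter size."""
        levels = []
        while n > m:
            l = m if n % m == 0 else n % m     # choose L so that (N - L) is divisible by M
            levels.append(l)
            n = (n - l) // m
        return levels, n                        # n <= m is the innermost counter

    def rebuild(levels, base, m=2):
        n = base
        for l in reversed(levels):
            n = l + m * n
        return n

    levels, base = expand(37)
    print(levels, base)                  # [1, 2, 2, 1] with a mod-1 base counter
    assert rebuild(levels, base) == 37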

7 A ripple counter and its use

Here, we intend to build a mod-N counter by composing together two counters that count √N each, so as to `multiply their computation.' To motivate this, consider N = 2^k expressed as a product of k 2's. The corresponding implementation is a so-called "ripple" counter as in Figure 16(a), where the 2×1-Join is configured to work like a Toggle. The response time is O(log N), as is the area; the energy consumption is constant. For arbitrary N, consider the following scheme to implement a delay-insensitive ripple counter. Express N as a binary number B such that 2^|B| ≥ N. Let B̄ be the 2's complement of B. Mark the successive Toggles of a |B|-long cascade with the successive bits of B̄, such that the first Toggle (from the A-interface) is marked with the most significant bit. The "initial terminals" of the Toggles marked 1 are fed, along with the b? input, into a parity tree to form the a! terminal of the counter, while the rest (marked 0) are merged with the input a? to form another parity tree. The non-initial terminal of the last Toggle is the output b!. Figure 16(b) illustrates this scheme for a mod-10 counter. In (c), we show a design that `skips useless states' of the underlying mod-16 counter. Note that a parity tree can be `balanced' or `linearized', without fear of hazards, to strike a suitable compromise between the reduction of response time from fewer logic levels and the reduction in area from a simpler layout.

Figure 16: Design of ripple counters and stages. (Panels: (a) mod-2^k counter built from a cascade of Toggles; (b) mod-10 ripple counter; (c) pipelined mod-10 counter stage; (d) improved mod-10 ripple counter.)

A proof sketch of the scheme's correctness follows. Proof Sketch: N is expressible as a derivation in the following grammar: M → 1 | (2M) | (2M − 1), where juxtaposition is ordinary multiplication. The number of levels of nested parentheses in such a derived expression equals the number of bits in its 2's complement B̄. Moreover, for a 1 bit in B̄, there is a "−1" in the corresponding level of the expansion, with the outermost level matching up with the least significant bit. For example, 10 = (2(2(2(2) − 1) − 1)), while the 2's complement of 10 is binary 0110. When k Toggles, one for each level, are cascaded together as in Figure 16(a), we get a mod-2^k ripple counter. But to achieve the effect of `subtraction by 1' at a parenthetic level, we merge the output from the initial terminal of that level's Toggle with the counter input a?. Thus the counter counts in an appropriate number (2^|B| − N) of endogenously generated "ticks" in every cycle, so it can produce b! after receiving only N exogenous ticks. (End of Proof Sketch) A surprising use of this design is provided next.
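To make the marking scheme concrete, here is a small sketch (our own illustration): given N, it computes |B|, the 2's-complement pattern B̄, and hence which Toggles are marked 1 (fed to the a!-side parity tree) and which are marked 0 (merged with a?), most significant bit first:

    def ripple_marking(n):
        """Return the per-Toggle marks (MSB first) for a mod-n ripple counter:
        the |B|-bit 2's complement of n, where 2**|B| >= n."""
        width = max(1, (n - 1).bit_length())        # smallest |B| with 2**|B| >= n
        complement = (2 ** width - n) % (2 ** width)
        return [int(bit) for bit in format(complement, f"0{width}b")]

    # Mod-10 example from the text: |B| = 4 and the 2's complement of 10 is 0110,
    # so the two middle Toggles are marked 1 and inject the extra "ticks" per cycle.
    print(ripple_marking(10))   # [0, 1, 1, 0]
    assert sum(m * 2 ** i for i, m in enumerate(reversed(ripple_marking(10)))) == 16 - 10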

7.1 Area-time tradeoff

We notice that T(env) for a counter-element nearly doubles from one element to the one immediately to its right, since an element generates one handshake to the right for every pair of handshakes it does on the left. Therefore, as is evident from the response-time calculations above, a counter-element can `tolerate' a 100% degradation in the response time of the subcounter on its right. So, we imagine a different value of m in each different level of parenthesis in expr(N), where the response time of a mod-P subcounter is proportional to the value of m at the outermost level of the subexpression expr(P). We use ripple-counter stages to implement counter-elements of increasing m value. Figure 16(c) shows how to build a counter-element (or stage) from an ordinary mod-10 ripple counter by using a 2×1-Join as a buffer between stages. Buffers of this type give a pipelined structure to the counter's implementation. This way, the considerable area efficiency of ripple counters is brought to bear without sacrificing the response time of the overall mod-N counter. All this works because any delay-insensitive component/module in a circuit can be replaced by another with the same behavioral specification without causing delay faults, whereas in a synchronous discipline large-scale timing analysis and redesign is usually needed to allow safe replacement of `parts'. It is also easy to realize a ripple counter with L < M, by appropriately choosing the `initial terminals' of the Toggles in a counter-element, based on the scheme in Section 7; e.g., Figure 16(c) shows a counter-element for L = 10, M = 16. Since m_(i+1) can be as large as 2^(m_i), the number of 2×1-Joins in the mod-N counter circuit is asymptotically log* N. It also follows that the total area needed for a mod-N counter is nearly log N · (A(Toggle) + A(Merge)) for all sufficiently large N. This is about (16 + 6) · log N + 24 · log* N transistors in total. (See the next section for some robust, yet efficient, switch-level implementations of our primitives.) Note that any design of a synchronous counter will use at least log N storage devices and some combinational logic, amounting to about as many transistors as in our asynchronous design. See [vB93] for a design using a 4-phase protocol starting from a CSP-based language, Tangram; the optimized implementation therein uses about 26 + 120⌊log N⌋ transistors. The latency in that implementation is O(log N), but the response time afterwards is a constant. The power consumption is O(1) per external communication, as in our design. Our designs considerably improve upon [vB93] by reducing the initial response time to a constant while maintaining all other complexity bounds, and even tightening them by multiplicative constants; e.g., our 2-phase signalling reduces the time to count N by a factor of 2, compared to 4-phase communication.
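Taking the logarithms above to be base 2 (our assumption), the two transistor-count estimates can be compared numerically with a short sketch (ours):

    import math

    def log_star(n):
        """Iterated logarithm (base 2): how often log2 can be applied before n <= 1."""
        count = 0
        while n > 1:
            n = math.log2(n)
            count += 1
        return count

    def ours(n):        # ~ (16 + 6) * log2(N) + 24 * log*(N) transistors (estimate from the text)
        return (16 + 6) * math.log2(n) + 24 * log_star(n)

    def van_berkel(n):  # ~ 26 + 120 * floor(log2(N)) transistors, per [vB93]
        return 26 + 120 * math.floor(math.log2(n))

    for n in (64, 1024, 10 ** 6):
        print(n, round(ours(n)), van_berkel(n))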

8 Switch-level design

[Kel74, Ebe89] have delay-sensitive implementations of the 2×2-Join that are inefficient in transistor count and yet require the insertion of several delay elements for safe operation. In the following we show significantly more efficient and yet robust transistor-level design schemes for the 2×1-Join and 2×2-Join in Figure 17; these can be extended to other types of Joins. The design in Figure 17(b) shows how the output signal p of a 1×2-Join (it is symmetric with respect to both outputs) is derived using a weak inverter to `hold' state. In (c), a circuit technique from [vB91] is adapted to generate p!, but care must be taken to ensure that the internal generation of the a·q signal is faster than the Join's environment can respond after q!, to avoid a glitch on p. Because of its two current paths, this circuit technique allows faster response times than the previous (`trickle-inverter') design for similarly sized transistors. Although both designs exhibit early thresholds ([vB91]), the absence of (global) isochronic forks in our design style turns early thresholds into an asset, by virtue of their earlier, hence faster, response to input transitions, without the risk of hazards that they could create in isochronic forks. (For an example of such a malfunction, see [Mar89].) Figure 17(d) shows how the a·q signal may be derived. In (f) is an implementation of the output p of a 2×2-Join; it may be improved by the use of ratioed logic as shown in Figure 17(g). The design in (g) uses fewer transistors and is more robust. Simulation results for these implementations are being compiled. A switch-level design for a Toggle, which is a specialized 2×1-Join, is given in Figure 18. If the transition of the input signal a is too slow, there is the risk of an undesirable race; as a remedy, we suggest using an inverter with a sharp transition response to generate the ¬a signal internally. We observe that the outputs of a Tria can be derived in much the same way as the outputs of a 2×2-Join.
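As a functional yardstick for simulating such switch-level cells, the following transition-level models (ours; they capture only the abstract behavior relied on elsewhere in the paper, not any electrical detail, and the choice of which Toggle output plays the role of the `initial terminal' is just a convention here) may be useful:

    class Toggle:
        """Transition-level model: input transitions are steered alternately
        to outputs p and q, starting with p (taken here as the `initial terminal')."""
        def __init__(self):
            self.to_p = True
        def a(self):
            out = "p" if self.to_p else "q"
            self.to_p = not self.to_p
            return out

    class Merge:
        """Transition-level model: a transition on either input produces one
        transition on the single output."""
        def input(self, _which):
            return "out"

    t = Toggle()
    print([t.a() for _ in range(5)])   # ['p', 'q', 'p', 'q', 'p']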

9 Concluding Remarks

We have found and demonstrated a small set of primitives that have low I/O-modularity, yet are wbw-universal with respect to a large class of DI circuits. We have shown the power of a few simple delay-insensitive primitives and illustrated several circuit-design optimizations through manipulation of the primitives' initial states and composition structures; these help achieve many circuit efficiencies simultaneously, while preserving the absence of hazards, and hence correctness, in asynchronous circuits. Having small DI primitives as building blocks permits a greater degree of freedom to tune the circuit, at least grossly, for various efficiencies at a more abstract level.

Figure 17: Efficient implementation of Joins. (Panels include: (b) staticizer or `trickle-inverter' implementation of one cell; (c) complementary CMOS implementation of one cell; (f) complementary CMOS implementation of one cell; (g) ratioed logic implementing one of the four cells in a 2×2-Join.)

Figure 18: Ratioed-logic implementation of a Toggle.

Enhancement of these primitives by transistor sizing or by a technological change preserves correctness while improving the intended performance. For power efficiency, it is very important to use circuits that have no `busy-waiting'; our primitives make that possible in general. We have also pointed out, through an example, a source of waste in computational energy. We exploit the composability property of DI modules and show how to obtain high area efficiency, with an example of using `ripple' counter stages within a constant-response modulo-N counter circuit. We have presented the rudiments of a powerful method to systematically derive efficient DI circuits, and we have given several designs for modulo counters and methodically optimized them along various axes of performance to improve significantly over existing designs. We believe that a small set of DI primitives in a module's construction makes for easy testability. Efficient and robust implementations of several primitives are also given, which allay the fear (partly based on the existing, inefficient transistor-level implementations of DI primitives) that asynchronous circuits are highly area-inefficient. As far as area complexity and transistor count are concerned, it is our hunch that 2-phase transition signalling is eminently suitable for modular and efficient clock-less circuits, while a 4-phase protocol might be preferred for complex combinational, datapath logic. For off-chip communication, a 2-phase protocol offers significant savings in time and energy consumption.

References

[BM91] Steven M. Burns and Alain J. Martin. Performance analysis and optimization of asynchronous circuits. In Carlo H. Sequin, editor, Advanced Research in VLSI: Proceedings of the 1991 UC Santa Cruz Conference, pages 71-86. MIT Press, 1991.

[Bru91a] Erik Brunvand. A cell set for self-timed design using Actel FPGAs. Technical Report UUCS-91-013, Dept. of Computer Science, Univ. of Utah, Salt Lake City, August 1991.

[Bru91b] Erik Brunvand. Translating Concurrent Communicating Programs into Asynchronous Circuits. PhD thesis, Carnegie Mellon University, 1991.

[Chu87] Tam-Anh Chu. Synthesis of Self-Timed VLSI Circuits from Graph-Theoretic Specifications. PhD thesis, MIT Laboratory for Computer Science, June 1987.

[CM88] K. M. Chandy and J. Misra. Parallel Program Design. Addison-Wesley, 1988.

[Dil89] David L. Dill. Trace Theory for Automatic Hierarchical Verification of Speed-Independent Circuits. ACM Distinguished Dissertations. MIT Press, 1989.

[DNS92] David L. Dill, Steven M. Nowick, and Robert F. Sproull. Specification and automatic verification of self-timed queues. Formal Methods in System Design, 1(1):29-60, July 1992.

[Ebe89] Jo C. Ebergen. Translating Programs into Delay-Insensitive Circuits, volume 56 of CWI Tract. Centre for Mathematics and Computer Science, 1989.

[Ebe91] Jo C. Ebergen. A formal approach to designing delay-insensitive circuits. Distributed Computing, 5(3):107-119, 1991.

[EP92] Jo C. Ebergen and Ad M. G. Peeters. Modulo-N counters: Design and analysis of delay-insensitive circuits. In Jørgen Staunstrup and Robin Sharp, editors, 2nd Workshop on Designing Correct Circuits, Lyngby, pages 27-46. Elsevier Science Publishers, 1992.

[Haz92] Pieter J. Hazewindus. Testing Delay-Insensitive Circuits. PhD thesis, California Institute of Technology, 1992.

[Hoa85] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.

[Jos] M. Josephs. Private communication.

[JP] M. B. Josephs and P. Patra. An asynchronous bit-serial adder and its delay-insensitive decomposition. Under preparation.

[JU91] Mark B. Josephs and Jan Tijmen Udding. An algebra for delay-insensitive circuits. In Robert Kurshan and Edmund M. Clarke, editors, Workshop on Computer-Aided Verification, volume 3 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 147-175. AMS-ACM, 1991.

[Kel74] Robert M. Keller. Towards a theory of universal speed-independent modules. IEEE Transactions on Computers, C-23(1):21-33, January 1974.

[Lav92] Luciano Lavagno. Synthesis and Testing of Bounded Wire Delay Asynchronous Circuits from Signal Transition Graphs. PhD thesis, UC Berkeley, 1992.

[Mar89] Alain J. Martin. The design of a delay-insensitive microprocessor: An example of circuit synthesis by program transformation. In M. Leeser and G. Brown, editors, Hardware Specification, Verification and Synthesis: Mathematical Aspects, volume 408 of Lecture Notes in Computer Science, pages 244-259. Springer-Verlag, 1989.

[Mar92] Alain J. Martin. Asynchronous datapaths and the design of an asynchronous adder. Formal Methods in System Design, 1(1):119-137, July 1992.

[MBL+89] Alain J. Martin, Steven M. Burns, T. K. Lee, Drazen Borkovic, and Pieter J. Hazewindus. The design of an asynchronous microprocessor. In Charles L. Seitz, editor, Advanced Research in VLSI: Proceedings of the Decennial Caltech Conference on VLSI, pages 351-373. MIT Press, 1989.

[Men88] Teresa H.-Y. Meng. Asynchronous Design for Digital Signal Processing Architectures. PhD thesis, UC Berkeley, 1988.

[Now93] Steven M. Nowick. Automatic Synthesis of Burst-Mode Asynchronous Controllers. PhD thesis, Stanford University, Department of Computer Science, 1993.

[Pat] P. Patra. A delay-insensitive bit-serial multiplier. Under preparation.

[Sei80] Charles L. Seitz. System timing. In Carver A. Mead and Lynn A. Conway, editors, Introduction to VLSI Systems, chapter 7. Addison-Wesley, 1980.

[Sut89] Ivan E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720-738, June 1989.

[Udd84] Jan Tijmen Udding. Classification and Composition of Delay-Insensitive Circuits. PhD thesis, Dept. of Math. and C.S., Eindhoven Univ. of Technology, 1984.

[vB91] C. H. van Berkel. Beware the isochronic fork. Nat. Lab. Unclassified Report UR 003/91, Philips Research Lab., Eindhoven, The Netherlands, 1991.

[vB92] Kees van Berkel. Handshake Circuits: An Intermediary between Communicating Processes and VLSI. PhD thesis, Eindhoven University of Technology, 1992.

[vB93] Kees van Berkel. VLSI programming of a modulo-N counter with constant response time and constant power. In S. Furber and M. Edwards, editors, Proceedings of the IFIP Working Conference on Asynchronous Design Methodologies, pages 1-11, Manchester, UK, 31 March - 2 April 1993. Elsevier Science Publishers.

[vdS85] Jan L. A. van de Snepscheut. Trace Theory and VLSI Design, volume 200 of Lecture Notes in Computer Science. Springer-Verlag, 1985.