
An Asynchronous Wrapper with Novel Handshake Circuits for GALS Systems¹

Shengxian Zhuang², Weidong Li, Jonas Carlsson, Kent Palmkvist, and Lars Wanhammar
Division of Electronics Systems, Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden

Abstract: In this paper, we propose an asynchronous wrapper with novel handshake circuits for data communication in GALS systems. The handshake circuits include two communication ports and a local clock controller. We present two approaches to implementing the communication ports: one with pure standard cells and the other with Müller C elements. The detailed design methodology is given, and the circuits are validated with VHDL and circuit simulation in a standard CMOS technology.

Keywords: GALS, asynchronous wrapper, handshake circuit.

1. Introduction

Recent research shows that state-of-the-art globally synchronous design is hard to apply to high-frequency, high-performance ULSI circuits that also require low power. Problems such as clock skew, EMI, and power consumption caused by global clock distribution become more significant in SoCs. Asynchronous circuits, which avoid these problems by not using clocks, show advantages in many respects over globally synchronous designs. However, fully asynchronous design has many drawbacks, such as large overhead, design complexity, and a lack of familiar synthesis CAD tools. All of these make it still unacceptable as a substitute for current VLSI design methods.

The GALS (Globally-Asynchronous Locally-Synchronous) [1] approach to building large deep-submicron systems on chip has recently been viewed as a promising way to handle the problems associated with global clocks. By adding asynchronous interfaces to locally synchronous (LS) modules, the interior of each module is encapsulated behind its interfaces, and each module can use its own clock and power supply voltage, as well as low-swing buses, or even power down its synchronous part without affecting any other part of the chip [7][8]. Hence, GALS systems have many advantages: they mitigate the clock distribution problems of large chips, the power consumed in clock distribution, and clock skew, and they also simplify the reuse of modules, since the modules do not need to share the same clock signal.

1. Supported by the Swedish Strategy Research Foundation.
2. Corresponding author, E-mail: [email protected]

However, compared with fully asynchronous design, GALS systems have attracted less research interest, partly because designing a unified, reliable, and robust asynchronous wrapper is difficult. The main challenges in the design of GALS circuits are that the interface circuits must be easy to attach to various synchronous paradigms, must generate a reliable stretchable clock, and must handle fast data communication. At the same time, they should occupy a small chip area and have low latency. In this paper, we discuss the design of a high-performance asynchronous wrapper with new handshake circuits and a stretchable clock controller. Section 2 introduces the basic concept of handshake circuits and proposes communication ports built with standard cells and with Müller C elements, as well as a stretchable local clock generator. Section 3 presents design methods for GALS systems. Finally, simulation and performance are discussed in Section 4.

2. The handshake circuits for an asynchronous wrapper

2.1. Handshake circuits

Handshake circuits generate signals in response to their input signals (from the environment, or so-called dummy signals); together these signals form a specific protocol that implements asynchronous data communication between two modules. Generally speaking, the design of the handshake circuits depends on the data communication protocol, the structure of the modules, and the organization of the system. Two- or four-phase protocols and various data encoding schemes are used for asynchronous data communication. Because the control circuits for 2-phase protocols are complex, and dual-rail data transfer incurs a large overhead to generate the completion signal, four-phase bundled-data communication is employed in most fully asynchronous systems. The fundamental circuit implementing the four-phase handshake protocol between asynchronous blocks can be composed of Müller C elements, as shown in Fig. 1. For a successful data transfer, the handshake circuits go through the following signal transitions, viewed from outside the interface circuits: Rin+ → Aout+ → Rout+ → Ain+ → Rin- → Aout- → Rout- → Ain-, by which data communications in the whole system are consecutively coupled with the behavior of subsequent modules. This scheme is suitable for pipelined asynchronous computations implemented with special circuit styles such as DCVSL [2].

Fig. 1. Interface circuit with Müller C elements

In GALS systems, the interface circuit with Müller C elements cannot be applied directly to data communication between two locally synchronous modules, because, unlike in fully asynchronous systems, the activation of data input and output must be synchronized with the local clocks. Hence, to complete data transfers between locally synchronous modules, GALS systems need special ports that implement the handshaking and the stretching of the local clocks. LS modules equipped with such an asynchronous wrapper can be used as IP blocks and connected with ease. Detailed discussions of data-port design and of communication channel models are found in [4][5]. To simplify the design of the asynchronous wrapper without loss of generality, we assume that the handshake circuits in a GALS system work in the following mode:

a) A data communication request is always initiated by a data output interface circuit, the W-port, which is attached to a master LS module. The data input interface circuit, the R-port, which is attached to a slave LS module, is always passive and accepts the data.
b) When the W-port activates data output, it may stop its internal clock and wait for the acknowledge from the corresponding R-port. Likewise, when the R-port initiates reading data, it must hold its state until the W-port sends a request. This means that every activation of a port completes one effective data transmission.
c) Both the W-port and the R-port are independently enabled by internal requests from their own LS modules.

This is the case for data communication in many GALS systems.

2.2. Communication ports using standard cells

Based on the assumptions and constraints above, the regular signal transitions on the W-port and the R-port are shown in Fig. 2, where STRETCH1 and STRETCH2 are the requests for stretching the local clocks. This signal transition graph (STG) is safe for asynchronous communication in GALS systems, but time-consuming, since no signal transitions occur concurrently. For the W-port, STRETCH1+ and REQ+ are generated successively, immediately after a demand signal for data output, i.e., WR+, is issued. STRETCH1+ should be valid before the next rising edge of the local clock in order to reliably stretch the clock's low phase. REQ+ is sent to the R-port and is combined with the R-port's clock stretch request for reading data, i.e., STRETCH2+, to generate ACK+, which is connected to the W-port. ACK+ on the R-port leads to REQ-, which in turn resets ACK (ACK-). The data should be latched before both REQ and ACK return to their initial states. The stretching of the clock is cancelled by STRETCH1-, which may make WR+ go to WR- (not included in the handshake circuits).

Fig. 2. Signal transitions on the W-port and the R-port

For the regular STG, all of the signal transitions can easily be generated using standard cells such as D-FFs and latches with set and reset functions, except for STRETCH1-: because REQ and ACK each pass through both states (high and low) before STRETCH1-, they cannot be used directly to generate it. To keep the synthesized circuits as simple as possible, we let STRETCH1- be generated soon after REQ-. As ACK- is generated immediately by REQ-, correct handshaking is still guaranteed with this small modification of the regular STG. Thus, ACK+ and REQ- can be used to produce STRETCH1-. The synthesized circuit for the W-port is shown in Fig. 3. In the same manner, we can synthesize the handshake circuit for the R-port, which is shown in Fig. 4.

Fig. 3. The W-port circuit with standard cells

The interface circuits above have several useful properties owing to the use of standard cells. Firstly, the interface circuits are not only easily matched with LS modules, but are also reliable and robust to their environments. Secondly, a high average communication speed can be achieved, because the signal transitions are obtained directly from the outputs or resets of D-FFs and latches. Finally, the handshake circuits can easily be captured in high-level hardware description languages and synthesized using current CAD tools, except for the Müller C element in the local clock controller, which is discussed in Section 2.4.
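The signal ordering described in Section 2.2 can be replayed with a small dependency-driven model. The following Python sketch is only an illustration, not the synthesized circuit: the table STG and the function replay are hypothetical names, each transition is assumed to fire once per transfer as soon as all of its causes have fired, and the cause of STRETCH2- (taken here to be REQ-, mirroring STRETCH1-) is an assumption that the text does not state explicitly.

```python
# Causes of each transition for one data transfer (sketch of the modified STG).
# Transitions with an empty cause list are inputs fired by the LS modules.
STG = {
    "WR+": [],
    "RD+": [],
    "STRETCH1+": ["WR+"],
    "REQ+": ["STRETCH1+"],
    "STRETCH2+": ["RD+"],
    "ACK+": ["REQ+", "STRETCH2+"],  # R-port joins REQ+ with its own stretch request
    "REQ-": ["ACK+"],
    "STRETCH1-": ["REQ-"],          # modified STG: released soon after REQ-
    "STRETCH2-": ["REQ-"],          # assumption: R-port mirrors the W-port
    "ACK-": ["REQ-"],
}


def replay(inputs, stg=STG):
    """Fire the module requests in the given order and return the full trace."""
    fired = []

    def settle():
        # Keep firing internal transitions until none is newly enabled.
        progress = True
        while progress:
            progress = False
            for transition, causes in stg.items():
                enabled = causes and all(c in fired for c in causes)
                if enabled and transition not in fired:
                    fired.append(transition)
                    progress = True

    for transition in inputs:
        fired.append(transition)
        settle()
    return fired


if __name__ == "__main__":
    print(" ".join(replay(["WR+", "RD+"])))  # writer requests first
    print(" ".join(replay(["RD+", "WR+"])))  # reader requests first
```

In both orderings ACK+ appears only after REQ+ and STRETCH2+ have fired, which is the property that lets whichever port arrives first keep its clock stretched until its partner is ready.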

Fig. 4. The R-port circuit using standard cells

2.3. Communication ports using C-elements

Although the data communication ports discussed above are highly reliable and robust for 4-phase bundled-data asynchronous communication, they are still slow, because standard cells such as D-FFs and latches have longer delays than basic gates. We therefore propose alternative interface circuits for the W-port and the R-port using Müller C elements, shown in Fig. 5 and Fig. 6, respectively.

Fig. 5. The W-port using Müller C elements

For the W-port, we have the following signal transitions when the LS module sends out data by firing WR+: WR+ → STRETCH1+ → REQ+ (RD+) → ACK+ → REQ- → STRETCH1-. The signals on the R-port go through similar transitions when it reads data by firing RD+: RD+ → STRETCH2+ → (REQ+) ACK+ → (REQ-) STRETCH2- → ACK-. However, there is a slight difference in the requirement on WR and RD in these circuits. Because their low-to-high transitions generate a low-to-high (+) transition on the outputs of the AND gates, they need not be held high until STRETCH1 or STRETCH2 goes low. This may speed up the data communication and ease the interface with LS modules. For the R-port, if the transition time from REQ- to ACK- needs to be as short as possible, this can be achieved by adding a reset to the Müller C element that outputs ACK.

Fig. 6. The R-port using Müller C elements

2.4. Stretchable clock controller

A stretchable clock controller is a key component in a GALS system. By stretching the low phase of the clock and adjusting the number of inverters that make up the ring oscillator, we can provide the LS module with a highly flexible clock and avoid synchronization failure. The pausible clock controller (PCC) is the first scheme applied to GALS systems [3]. Its main drawback is that metastability can occur when the request and the rising edge of the clock arrive simultaneously, because it effectively "tosses a coin" to determine which one passes the ME (mutual exclusion) circuit. Another stretchable clock [6] has a simple structure using only two basic gates, but it can be unreliable because the output cannot be held for some input states. Our stretchable clock controller, shown in Fig. 7, has an architecture similar to that of [6].

Fig. 7. Stretchable clock generation

The generated local clock signal is fed back to both input terminals of the Müller C element, in each case with inversion but with different delays. According to the properties of the Müller C element, if STRETCH is not asserted (held low), the output and inputs of the Müller C element follow the signal transitions in Fig. 8. If STRETCH is asserted high, the input Xa is set low, and the output of the Müller C element could at that moment be either low or high. However, the output will eventually be held at the low level. Thus, the next rising edge is postponed by STRETCH+.

Fig. 8. STG of Stretch clock controller (free-running cycle: Xout+, Xa-, Xb-, Xout-, Xa+, Xb+; with STRETCH+: Xout+, Xa-, Xb-, Xout-, Xb+)

Multiple requests for stretching the local clock can be handled by connecting all the signals STRETCH1 to STRETCHi to an OR gate. Whenever a request to stretch the clock arrives, STRETCH is asserted at the clock controller.
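The behavior of the stretchable clock controller can be illustrated with a unit-delay model. The Python sketch below is an assumption-laden illustration rather than the circuit of Fig. 7: the names c_element and simulate_clock are hypothetical, the gating Xa = NOT(Xout) AND NOT(STRETCH) is one reading of the statement that STRETCH sets Xa low, and the two different feedback delays are collapsed into a single time step.

```python
def c_element(a, b, prev):
    """Müller C element: the output follows the inputs when they agree, else holds."""
    return a if a == b else prev


def simulate_clock(stretch_requests, xout=0):
    """Return the Xout (lclk) level after each time step.

    stretch_requests holds one tuple of STRETCHi levels per time step;
    the levels are ORed, as described for multiple stretch requests.
    """
    wave = []
    for requests in stretch_requests:
        stretch = any(requests)                # OR gate over all STRETCHi
        xa = (not xout) and (not stretch)      # assumed gating: STRETCH forces Xa low
        xb = not xout                          # plain inverted feedback
        xout = int(c_element(xa, xb, bool(xout)))
        wave.append(xout)
    return wave


if __name__ == "__main__":
    # Four free-running steps, three steps with STRETCH1 asserted, three free steps.
    pattern = [(0, 0)] * 4 + [(1, 0)] * 3 + [(0, 0)] * 3
    print(simulate_clock(pattern))
```

While any stretch request is high, Xa and Xb disagree as soon as Xout is low, so the C element holds the low level and the next rising edge is postponed until the request is released.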

3. Configuration of GALS systems with the asynchronous wrapper

3.1. Point-to-point communication

A GALS system may include many locally synchronous modules connected through asynchronous wrappers. Although the LS modules differ in function and structure, they can be grouped into three classes according to the data flow through their communication ports: source LS modules, with only data output and control ports; sink LS modules, with only data input and control ports; and intermediate LS modules, with both data input/output and control ports. With each LS module encapsulated in the asynchronous wrapper presented above, a typical GALS system with point-to-point data communication is configured as in Fig. 9. If the source LS wants to send data to the intermediate LS, it places the data on the data bus and activates the W-port with WR+. The intermediate LS receives the data by acknowledging with RD+. Both WR+ and RD+ trigger STRETCH+, so the internal clocks are stretched if the data communication is not finished before the next lclk+. Data communication between the intermediate LS module and the sink LS module proceeds in the same way. The handshake circuits in the W-port and the R-port can be interfaced directly with an asynchronous FIFO. If the LS modules have different clocks, a FIFO can be added between the W-port and the R-port to improve the transmission rate.
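How long each local clock is stretched in a point-to-point transfer can be estimated with a simple rendezvous model. The Python sketch below is a rough illustration under stated assumptions: PortRequest, point_to_point, and the handshake latency parameter are hypothetical, the handshake is assumed to start only once both WR+ and RD+ have fired, and all times in the example are made-up values rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class PortRequest:
    request_ns: float    # time at which WR+ (W-port) or RD+ (R-port) fires
    next_edge_ns: float  # nominal time of the module's next lclk+


def point_to_point(w, r, handshake_ns):
    """Return (completion time, W-side stretch, R-side stretch) in ns."""
    # The transfer can only start when both ports have issued their request.
    done = max(w.request_ns, r.request_ns) + handshake_ns
    # A clock is stretched only if the transfer is unfinished at its next lclk+.
    stretch_w = max(0.0, done - w.next_edge_ns)
    stretch_r = max(0.0, done - r.next_edge_ns)
    return done, stretch_w, stretch_r


if __name__ == "__main__":
    writer = PortRequest(request_ns=5.0, next_edge_ns=10.0)
    reader = PortRequest(request_ns=12.0, next_edge_ns=18.0)
    print(point_to_point(writer, reader, handshake_ns=2.0))
```

In this example the writer asks early and has to wait for the reader, so only the writer's clock is stretched; the reader finishes the handshake before its own next rising edge.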

Fig. 9. Basic structure of point-to-point GALS system

3.2. Multiple-points communication

In terms of the interfacing methods most frequently used in GALS systems, there are two possible forms of multi-point data communication: one data out-port driving multiple data in-ports, or multiple data out-ports driving one data in-port. In other words, either a multi-output LS module (M-LS) interfaces with several single-input LS modules (S-LS), or several single-output LS modules (S-LS) interface with a multiple-input LS module (M-LS). Figure 10 shows the basic structure of the multiple-points communication network.

For the data input of the M-LS module, a simple approach is to consider that there are independent data paths between each S-LS module and the M-LS module. If the M-LS needs to read data from the preceding S-LS modules, it activates RD+ and stretches its internal clock while waiting for the S-LS modules to send their requests. After receiving the requests from all S-LS modules, it sends the acknowledge to all senders, and the data are latched by ACK+. For the data output of the M-LS module, we consider only one communication mode, in which the data is broadcast simultaneously to all S-LS modules. The M-LS module sends the request to all of them and must then wait for the acknowledgements from all receiving LS modules. Such a configuration of the M-LS is fully compatible with the interface circuits for the W-port and the R-port. Only an AND gate is required to synchronize the data communication between the sending LS modules and the receiving LS modules.

The main problem is that this configuration lacks the flexibility to support independent data communication between the M-LS module and a single S-LS module. If two data communications must be processed independently, an arbiter is needed to let only one request pass to the receiving LS module. However, the contention cannot be resolved if two requests enter the arbiter at the same time. Additionally, the receiving LS module cannot tell where a request comes from unless the order in which the requests occur is arranged in advance. It is also difficult for the M-LS module to give an independent acknowledgement to each sending S-LS module. Thus, special effort is needed to design an M-LS module that adapts to the different communication modes.

Fig. 10. Multi-port data communication

4. Simulation and evaluation

A GALS LS module is configured as in Fig. 11 by connecting the handshake circuits and the stretchable clock controller to a simple synchronous computation block. To simulate the performance of the asynchronous wrapper, a 4-bit accumulator is used. We assume that there is one data exchange per clock cycle. Because the times at which the output data is accepted and the next input data is ready are unknown, a constraint is applied to the wrapper to avoid overwriting data: the R-port is not allowed to send the acknowledge that latches the input data before the output data has been accepted by the next module.

Fig. 11. A basic GALS module

We assign the tasks to different clock phases: computation is done during the high phase of the clock (clock+), and data communication occurs during the low phase (clock-). This is not efficient for the system, as computations must be completed within the clock+ period. For computations that take multiple clock cycles, the communication time becomes negligible compared with the computation time, and the efficiency improves. With this simple module, we have carried out both VHDL and circuit simulations in a standard 0.35 µm CMOS technology with Spectre in Cadence. Figure 12 shows the clock signals produced by the local ring oscillator.

Fig. 12. Stretch control and clock signals

Fig. 13. Simulation results of the handshake circuits

A 100 MHz clock is generated with inverter chains in the ring oscillator. In the target LS module, the computation time is less than 4 ns, so the clock+ period is long enough to complete a computation. Figure 13 shows the analog traces of the control signals when the two data-ports work together; correct operation is guaranteed by the delay of the acknowledge signal. The simulation results show that the interface circuits with Müller C elements can operate at a higher frequency than those with standard cells.
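The timing budget quoted in this section can be checked with a short calculation. The Python sketch below is only a worked example: the function name clock_high_phase_ns is hypothetical, and the 50% duty cycle of the free-running clock is an assumption, since the text gives only the 100 MHz frequency and the 4 ns bound on the computation time.

```python
def clock_high_phase_ns(freq_mhz, duty_cycle=0.5):
    """Duration of the clock+ phase in ns for an unstretched clock."""
    period_ns = 1000.0 / freq_mhz   # 100 MHz corresponds to a 10 ns period
    return period_ns * duty_cycle   # assumed 50% duty cycle


high_ns = clock_high_phase_ns(100.0)
slack_ns = high_ns - 4.0            # 4 ns is the stated upper bound on computation

if __name__ == "__main__":
    print(f"clock+ lasts {high_ns} ns, leaving at least {slack_ns} ns of slack")
```

Because stretching only lengthens the low phase, the high phase available for computation stays the same whether or not a data transfer delays the next rising edge.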

5. Conclusion

In this paper, we have presented an asynchronous wrapper with novel handshake circuits, comprising two data-ports and a stretchable clock controller. Its performance has been simulated and verified with a basic GALS module.

References

[1] D.M. Chapiro, Globally-Asynchronous Locally-Synchronous Circuits, Ph.D. diss., Stanford University, U.S.A., Oct. 1984.
[2] T.H.-Y. Meng, R.W. Brodersen, and D.G. Messerschmitt, "Automatic Synthesis of Asynchronous Circuits from High-Level Specifications," IEEE Trans. Computer-Aided Design, Vol. 8, No. 11, pp. 1185-1205, 1989.
[3] K.Y. Yun and R.P. Donohue, "Pausible Clocking: A First Step Toward Heterogeneous Systems," in Proc. of Int. Conf. on Computer Design (ICCD), Texas, USA, pp. 118-123, Oct. 7-9, 1996.
[4] A.M.G. Peeters, Single-Rail Handshake Circuits, Ph.D. diss., Eindhoven Univ. of Technology, Eindhoven, The Netherlands, June 1996.
[5] J. Muttersbach, T. Villiger, and W. Fichtner, "Practical Design of Globally-Asynchronous Locally-Synchronous Systems," in Proc. of Int. Symp. on Advanced Research in Asynchronous Circuits and Systems (ASYNC), Eilat, Israel, pp. 52-59, April 4-6, 2000.
[6] D.S. Bormann and P.Y.K. Cheung, "Asynchronous Wrapper for Heterogeneous Systems," in Proc. of Int. Conf. on Computer Design (ICCD), Texas, USA, pp. 307-314, Oct. 12-15, 1997.
[7] H. Zhang and J. Rabaey, "Low-Swing Interconnect Interface Circuits," in Proc. of Int. Symp. on Low Power Electronics and Design, California, USA, pp. 161-166, Aug. 10-12, 1998.
[8] T. Njølstad, O. Tjore, K. Svarstad, et al., "Towards a Universal Socket Interface for Globally-Asynchronous Locally-Synchronous Systems using Multiple Supply Voltages for Rate-Adaptive Energy Saving," in Proc. of 14th IEEE Int. ASIC/SOC Conf., Washington D.C., USA, pp. 110-116, Sept. 12-15, 2001.