A Low-Power Globally Synchronous Locally Asynchronous FFT Processor

Yong Li, Zhiying Wang, Jian Ruan, and Kui Dai

National University of Defense Technology, School of Computer, Changsha, Hunan 410073, P.R. China
[email protected], {zywang,ruanjian,daikui}@nudt.edu.cn

Abstract. Low-power design has become crucial with the widespread use of embedded systems, where a small battery has to last for a long period. Embedded processors need to be efficient in order to meet real-time requirements with low power consumption for specific algorithms. Transport Triggered Architecture (TTA) offers a cost-effective trade-off between the size and performance of ASICs and the programmability of general-purpose processors. The main advantages of TTA are its simplicity and flexibility. In TTA processors, special function units (SFUs) can be utilized to increase performance or reduce power dissipation. This paper presents a low-power globally synchronous locally asynchronous TTA processor that uses both asynchronous and synchronous function units. We solve the problem of using asynchronous circuits in the TTA framework, which is a synchronous design environment. The processor is customized for a 1024-point FFT application. Compared to other reported implementations with reasonable performance, our design shows a significant improvement in energy efficiency.

1 Introduction

In recent years, special-purpose embedded systems have become a very important area of the processor market. Digital signal processors (DSPs) offer flexibility and low development costs, but they have limited performance and typically high power dissipation. Field programmable gate arrays (FPGAs) combine flexibility with speed approaching that of application-specific integrated circuits (ASICs), but they cannot compete with the energy efficiency of ASIC implementations. For a specific application, TTA can provide both flexibility and configurability during the Application Specific Instruction Processor (ASIP) design process. In a TTA processor, special function units can be utilized to increase performance or reduce power dissipation.

The Fast Fourier Transform (FFT) is one of the most important tools in the field of digital signal processing. Devices operating in an embedded environment are usually sensitive to power dissipation, so an energy-efficient FFT implementation is needed. Many FFT implementations and low-power architectures have been described, e.g., in [1], [2], [3], [4] and [5]. They all emphasize the importance of low power consumption in embedded FFT applications. In CMOS circuits, dynamic power dissipation is proportional to the square of the supply voltage [6]. Reducing the supply voltage can achieve good energy efficiency, but it degrades circuit performance [7].

R. Perrott et al. (Eds.): HPCC 2007, LNCS 4782, pp. 168–179, 2007. © Springer-Verlag Berlin Heidelberg 2007


The problem for designers is how to achieve low power dissipation without performance loss. Since the early days, asynchronous circuits have been used in many interesting applications. Successful examples of asynchronous processors are described in [8], [9], [10], [11] and [12]. In these works, asynchronous circuits offer low power dissipation and high performance, which makes them well suited to systems that are sensitive to both power consumption and performance. In order to combine the flexibility and configurability of TTA with the low power consumption of asynchronous circuits, we use asynchronous circuits in TTA. In this paper, we solve the problem of using asynchronous circuits in TTA, which is a synchronous design environment. The resulting architecture may be viewed as a globally synchronous locally asynchronous (GSLA) implementation. Based on this architecture, a low-power TTA processor is customized for a 1024-point FFT application. Asynchronous function units are utilized to obtain low power dissipation. The performance and power dissipation are compared against a synchronous version. The results show that a GSLA implementation based on TTA can offer a low-power solution for some application-specific embedded systems.

This paper is organized as follows. Section 2 briefly describes the radix-4 FFT algorithm. Section 3 describes the Transport Triggered Architecture. Section 4 describes the implementation of the GSLA TTA processor. Section 5 presents the power and performance analysis results. Section 6 gives the conclusion.

2 Radix-4 FFT Algorithm

There are several FFT algorithms; the most popular are the Cooley-Tukey algorithms [13]. It has been shown that decimation-in-time (DIT) algorithms provide a better signal-to-noise ratio than decimation-in-frequency algorithms when a finite word length is used. In this paper, a radix-4 DIT FFT approach is used since it offers lower arithmetic complexity than radix-2 algorithms. The $N$-point DFT of a finite-duration sequence $x(n)$ is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \tag{1}$$

where $W_N = e^{-j(2\pi/N)}$ is known as the twiddle factor and $k = 0, 1, \ldots, N-1$. A direct implementation of the DFT has a complexity of $O(N^2)$. Using the FFT, the complexity can be reduced to $O(N \log_2 N)$. For example, one may formulate the radix-4 representation of the FFT in the following manner. Let $N = 4T$, $k = s + Tt$ and $n = l + 4m$, where $l, t \in \{0, 1, 2, 3\}$ and $m, s \in \{0, 1, \ldots, T-1\}$. Substituting these values into equation (1) and simplifying gives

$$X(s + Tt) = \sum_{l=0}^{3} W_4^{lt}\, W_{4T}^{sl} \sum_{m=0}^{T-1} x(l + 4m)\, W_T^{sm} \tag{2}$$


For $N = 1024$ and $T = 256$, the 1024-point FFT can be expressed from equation (2) as

$$X(s + 256t) = \sum_{l=0}^{3} W_4^{lt}\, W_{1024}^{sl} \sum_{m=0}^{255} x(l + 4m)\, W_{256}^{sm} \tag{3}$$

Equation (3) suggests that the 1024-point DFT can be computed by first computing a 4-point DFT on the appropriate data slot, then multiplying the results by 765 non-trivial complex twiddle factors, and finally computing the 4-point DFT on the resulting data with appropriate data reordering. Because the architecture is determined by the characteristics of the application set, the first step of the architecture design is to analyze the application [14]. From the analysis of the FFT application, we find that its major operations are addition and multiplication, among others. According to the type of the major operations, the designer can quickly decide which function units to implement; similarly, the number of function units of each type is decided by the proportion of the corresponding operations. Furthermore, we customize special function units for these operations in order to achieve good power efficiency.
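The decomposition in equations (2) and (3) can be checked numerically. The following Python sketch is only an illustration and is not part of the design flow of this paper; the function names are hypothetical, and the index ranges are assumed to be $l, t \in \{0, \ldots, 3\}$ and $m, s \in \{0, \ldots, T-1\}$. It compares the split form against the direct definition (1) and counts the twiddle factors $W_N^{sl}$ for which both $s$ and $l$ are non-zero.

```python
import cmath


def twiddle(n_points, exponent):
    """Return W_n^exponent = exp(-j * 2 * pi * exponent / n)."""
    return cmath.exp(-2j * cmath.pi * exponent / n_points)


def dft_direct(x):
    """Direct O(N^2) evaluation of equation (1)."""
    n = len(x)
    return [sum(x[i] * twiddle(n, i * k) for i in range(n)) for k in range(n)]


def dft_radix4_split(x):
    """Evaluate equation (2): four T-point inner sums, then 4-point outer sums."""
    n = len(x)
    assert n % 4 == 0, "length must be a multiple of 4"
    t_len = n // 4
    # inner[l][s] = sum over m of x(l + 4m) * W_T^(s*m)
    inner = [
        [sum(x[l + 4 * m] * twiddle(t_len, s * m) for m in range(t_len))
         for s in range(t_len)]
        for l in range(4)
    ]
    out = [0j] * n
    for s in range(t_len):
        for t in range(4):
            out[s + t_len * t] = sum(
                twiddle(4, l * t) * twiddle(n, s * l) * inner[l][s]
                for l in range(4)
            )
    return out


def count_nontrivial_twiddles(n):
    """Count pairs (s, l) for which W_N^(s*l) is not trivially equal to 1."""
    return sum(1 for s in range(n // 4) for l in range(4) if s * l != 0)
```

For $N = 1024$ the count is $255 \times 3 = 765$, which agrees with the number of non-trivial twiddle factors quoted in the text.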

3 Transport Triggered Architecture

Compared to conventional processor architectures, one of the main features of a TTA [15] processor is that all actions are side effects of transporting data from one place to another. Thus, the processor has only one instruction, move, which moves data. Programming is oriented towards communication between function units (FUs); therefore, the interconnection network is visible at the software level when developing TTA applications. A TTA processor consists of a set of function units and register files connected to an interconnection network, which connects their input and output ports. The architecture is flexible: new FUs, buses, and registers can be added without any restrictions. Naturally, the approach is well suited for application-specific processors. In addition, application-specific support is provided by user-defined function units customized for a given application. Further advantages of TTA processors are a short cycle time and fast application-specific processor design.

The MOVE software toolset [16] enables an exhaustive design-space exploration. The entire hardware-software co-design flow is presented in Fig. 1. We modified the MOVE tools so that a library of special asynchronous function units can also be included in the MOVE design flow. Applications can be described in high-level languages (HLLs) such as C/C++. In our case, hardware design starts with a C language description of the FFT algorithm. Using the modified MOVE tools and the library of special asynchronous function units, we are able to generate descriptions of power-efficient processors. The processor description file can then be converted into a hardware description language (HDL) representation of the processor core by the MOVEGen tool. The automatically generated HDL code for the processor core, together with predesigned HDL components (instruction and data memories and other peripherals), can be used by the ASIC synthesis tools. However, the HDL code of our special asynchronous function units cannot be used directly by the synchronous synthesis tools. These units should be synthesized and implemented according to the asynchronous circuit design flow. As predesigned components, the special asynchronous units can then be used by the IC design platform in order to obtain the layout of the target processor.

[Figure: blocks Application (C/C++), Library of FUs, Library of SFUs, MOVE Tools, Processor Description, MOVEGen, Processor (HDL Code), Logic Synthesis, Place & Route]

Fig. 1. Design flow with special function units, from high-level language code to hardware implementation
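The transport-triggered programming model described in this section can be illustrated with a toy interpreter. The sketch below is a simplification under stated assumptions and does not reflect the MOVE toolset: the class `FunctionUnit`, the function `run_moves` and the port names `o` (operand), `t` (trigger) and `r` (result) are hypothetical, and timing, buses and latency are ignored. The only point it makes is that a program is a list of moves, and that an operation starts as a side effect of writing to a trigger port.

```python
class FunctionUnit:
    """Toy function unit with operand (o), trigger (t) and result (r) ports."""

    def __init__(self, operation):
        self.operation = operation
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "o":
            # Writing the operand port only stores the value.
            self.operand = value
        elif port == "t":
            # Writing the trigger port starts the operation as a side effect.
            self.result = self.operation(self.operand, value)
        else:
            raise ValueError("unknown input port: " + port)


def run_moves(moves, units, registers):
    """Execute (source, destination) moves; integer sources are immediates."""

    def read(source):
        if isinstance(source, int):
            return source
        if "." in source:
            name, port = source.split(".")
            if port != "r":
                raise ValueError("only result ports can be read: " + source)
            return units[name].result
        return registers[source]

    for source, destination in moves:
        value = read(source)
        if "." in destination:
            name, port = destination.split(".")
            units[name].write(port, value)
        else:
            registers[destination] = value
    return registers
```

As a usage example, with an adder unit `add` and a multiplier unit `mul`, the move sequence `2 -> add.o`, `3 -> add.t`, `add.r -> mul.o`, `4 -> mul.t`, `mul.r -> r0` leaves $(2 + 3) \times 4$ in register `r0`.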

4 Implementation of FFT Processor

4.1 Special Function Units

We propose a 32-bit wide ASIP architecture (the buses and the ports of the FUs are 32 bits wide). The ASIP architecture is optimized by implementing several special function units. The design of the SFUs started with the most frequent operations, i.e., ADD and MUL. The asynchronous adder unit (AADD) is one of the designed SFUs. Another SFU, the asynchronous multiplier unit, supports arithmetic operations with sub-word parallelism (for example, a sub-word multiply operation between two 32-bit numbers represents two parallel multiply operations on packed 16-bit operands). These SFUs help to achieve high energy efficiency: by implementing them we are able to significantly reduce the power dissipation and to optimize the overall architecture design.

Asynchronous Adder Unit. In this work, we implemented a 32-bit asynchronous adder. Pipelining is a standard way of decomposing an operation into concurrently operating stages to increase throughput at a moderate increase in area. The adder employs four pipeline stages, and its architecture is illustrated in Fig. 2.

[Figure: Data Input and Data Output connected through four latches (clocked by Lt1 to Lt4) separated by combinatory logic blocks; each latch has a latch controller with signals Rin, Rout, Ain, Aout, and delay elements lie between Rout and Rin of adjacent stages]

Fig. 2. The pipeline architecture of asynchronous adder unit that has four pipeline stages

As shown in Fig. 2, the adder unit is composed of a traditional combinational circuit and a control circuit. The traditional combinational circuit is the same as in a synchronous circuit. In a synchronous circuit, the communication of data between pipeline stages is regulated by the global clock. It is assumed that each stage takes no longer than the clock period, and data is transferred between consecutive stages simultaneously. In an asynchronous pipeline, the communication of data is regulated by local communication between stages; the global clock is therefore replaced by a local communication protocol. The communication protocol employed in the pipeline can use either 2-phase or 4-phase signaling. The control circuit, called the handshake circuit, realizes the communication protocol locally between adjacent stages of the pipeline. When one stage has data to send to a neighboring stage, it sends a request (Rout) to that stage. If that stage can accept new data, it accepts the data and returns an acknowledgment (Ain). The request and acknowledgment signals thus govern the transfer of data between stages. This pipeline circuit is similar to Sutherland's Micropipeline [17], except that it uses latches and relies on a simple set of timing constraints for correct operation. Between the signals Rout and Rin, a matching delay element provides a constant delay that matches the worst-case latency of the combinational logic in each stage. The latch controllers generate the local clocks Lt1, Lt2, Lt3 and Lt4 that control the operation of the pipeline circuit. Based on our asynchronous design flow in [18], an asynchronous function unit can be designed quickly.

We employ a ripple-carry adder, whose performance is limited but which satisfies the requirements and saves layout area. The combinational circuit can be described in VHDL/Verilog and synthesized by Synopsys Design Compiler. The control circuit can be described as signal transition graphs (STG) [19] and synthesized by Petrify [20]. The Verilog description produced by Petrify can then be further synthesized by Design Compiler. The logic gates of the control circuit, synthesized by Petrify and implemented with C-elements [21], are illustrated in Fig. 3. The control circuit incurs extra cost in terms of area. In addition to the combinational circuit itself, the delay element represents a design challenge: to a first order the delay element will track delay variations that are due to the fabrication process spread as well as variations in temperature and supply voltage. On the other hand, wire delays can be significant and they are often beyond the designer's control. So some design policy for matched delays is obviously needed.
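The request/acknowledge discipline of the asynchronous pipeline can be sketched at the token level. The model below is a behavioral abstraction under stated assumptions, not the latch controller of this paper: signal phases, matched delays and timing are omitted, and the names `run_handshake_pipeline` and `EMPTY` are hypothetical. A latch forwards its data through the following logic block only when the next latch is empty, which plays the role of a request that is acknowledged.

```python
from collections import deque

EMPTY = object()  # marker for a latch that currently holds no data


def run_handshake_pipeline(items, logic_blocks):
    """Push items through latches separated by logic blocks, one local transfer at a time."""
    pending = deque(items)
    latches = [EMPTY] * (len(logic_blocks) + 1)
    outputs = []
    while pending or any(value is not EMPTY for value in latches):
        # The environment acknowledges and removes data from the last latch.
        if latches[-1] is not EMPTY:
            outputs.append(latches[-1])
            latches[-1] = EMPTY
        # A full latch requests a transfer; an empty successor acknowledges it.
        for i in range(len(logic_blocks) - 1, -1, -1):
            if latches[i] is not EMPTY and latches[i + 1] is EMPTY:
                latches[i + 1] = logic_blocks[i](latches[i])
                latches[i] = EMPTY
        # New input is accepted only when the first latch is free.
        if pending and latches[0] is EMPTY:
            latches[0] = pending.popleft()
    return outputs
```

With three logic blocks the model has four latches, as in Fig. 2. Data items leave the pipeline in the order in which they entered, and no global clock is involved in any transfer.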


[Figure: control circuit built from four C-elements (C), with signals Rin, Ain, Rout, Aout and Lt]

Fig. 3. The control circuit of one pipeline stage
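The C-element [21] used in the control circuit is a state-holding gate: its output becomes 1 when all inputs are 1, becomes 0 when all inputs are 0, and otherwise keeps its previous value. A minimal behavioral sketch of a two-input C-element follows; the class name `CElement` is illustrative, and the sketch does not reproduce the gate-level controller of Fig. 3.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        # The output follows the inputs only when they agree; otherwise it holds.
        if a == b:
            self.out = a
        return self.out
```

This hold behavior is what lets a handshake controller wait until both a request and an acknowledgment have arrived before it changes state.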

In our procedure, post-layout timing analysis was performed and custom delay cells were implemented to provide the constant delay.

Asynchronous Multiplier Unit. Any multiplier can be divided into three stages: the partial product generation stage, the partial product addition stage, and the final addition stage. The second stage is the most important, as it is the most complicated and determines the speed of the overall multiplier. In high-speed designs, the Wallace tree construction method is usually used to add the partial products in a tree-like fashion in order to produce two rows of partial products that can be added in the last stage. Following these stages, our multiplier is divided into a three-stage execution pipeline. The multiplier supports arithmetic operations with sub-word parallelism. Like the asynchronous adder unit, the multiplier employs a pipeline architecture (here with three stages) and matching delay elements.

4.2 Address Generation

In the radix-4 FFT operation, the address of any operand can be defined as follows [22]:

$$A = 4 \times \sum_{i=1}^{r-1} 4^{i-1} \times A(i) + A(0) \tag{4}$$

where $A(i) \in \{0, 1, 2, 3\}$ is the $i$-th base-4 digit of $A$ and $r$ is the number of digits. Let

$$B = 4 \times \sum_{i=1}^{r-1} 4^{i-1} \times A(i) + \left( \sum_{i=0}^{r-1} A(i) \right) \bmod 4 \tag{5}$$

For any address $A$ ($0 \le A < N$), $A$ and $B$ are in one-to-one correspondence. If $m = \sum_{i=1}^{r-1} 4^{i-1} \times A(i)$ and $b = \left( \sum_{i=0}^{r-1} A(i) \right) \bmod 4$, then

$$B = 4m + b \tag{6}$$

Based on equation (6), the address $A$ can be mapped to the memory. The memory is divided into four memory banks: $b$ is the bank number and $m$ is the address within the bank.
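The mapping of equations (4) to (6) can be sketched in a few lines. The code below is an illustration under stated assumptions: addresses are treated as $r$ base-4 digits, the bank number is taken as the digit sum modulo 4 including $A(0)$, and the function names are hypothetical. It also shows why the scheme is useful for a radix-4 butterfly: four addresses that differ in a single base-4 digit always fall into four different banks.

```python
def base4_digits(address, num_digits):
    """Return the base-4 digits A(0), A(1), ..., A(r-1) of an address."""
    return [(address >> (2 * i)) & 3 for i in range(num_digits)]


def map_address(address, num_digits):
    """Map address A to (m, b): inner address m and bank number b, as in equation (6)."""
    inner = address >> 2  # m = sum over i >= 1 of 4^(i-1) * A(i)
    bank = sum(base4_digits(address, num_digits)) % 4  # b = digit sum mod 4
    return inner, bank


def operand_banks(address, digit_index, num_digits):
    """Banks of the four addresses obtained by varying one base-4 digit of the address."""
    shift = 2 * digit_index
    cleared = address & ~(3 << shift)
    return [map_address(cleared | (d << shift), num_digits)[1] for d in range(4)]
```

For example, `map_address(27, 5)` returns `(6, 2)`: the digits of 27 are 3, 2, 1, 0, 0, so $m = 27 \gg 2 = 6$ and $b = 6 \bmod 4 = 2$.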


At stage $p$ of the FFT, the inner addresses $m$ of the four operands required by the calculation can be described as follows:

$$\begin{cases} m_0 = \sum_{i=p+1}^{r-1} 4^{i-1} \times A(i) + 4^{p-1} \times 0 + \sum_{i=1}^{p-1} 4^{i-1} \times A(i) \\ m_1 = m_0 + 4^{p-1} \\ m_2 = m_0 + 2 \times 4^{p-1} \\ m_3 = m_0 + 3 \times 4^{p-1} \end{cases} \tag{7}$$

Such an operation can easily be implemented with the aid of a counter, which represents $\sum_{i=p}^{r-2} 4^{i-1} \times A(i+1) + \sum_{i=1}^{p-1} 4^{i-1} \times A(i)$. The addresses $m$ can then be produced.

4.3 Globally Synchronous Locally Asynchronous Wrapper

The GSLA methodology aims to combine the advantages of asynchronous design with the convenience of standard synchronous design methodologies.

[Figure: Globally Synchronous Locally Asynchronous Wrapper Interface around a Locally Asynchronous Island, with signals Data, O1load, T1load, Rin, Ain, Local Clock and Global Clock]

Fig. 4. Function unit with locally asynchronous island surrounded by the GSLA wrapper

Fig. 4 depicts a block-level schematic of a GSLA module with its wrapper surrounding the locally asynchronous island. The wrapper contains an arbitrary number of ports, a local asynchronous unit, and a port controller. Each function unit in TTA has one or more operand registers (O), exactly one trigger register (T) and one or more result registers (R). In the GSLA wrapper, the operand and trigger registers are both controlled by the global clock. When the signal O1load is high, an operand is sent to the operand register. When the signal T1load is high, the other operand is sent to the trigger register and the function unit is triggered to start working. The locally asynchronous island is driven by the local clock. The control circuit uses a two-phase or four-phase handshaking protocol to control the data flow between adjacent pipeline stages. In our modified MOVE framework, the latency of a special asynchronous function unit is converted to cycles according to the global clock period. For example, if the latency of our asynchronous multiplier unit is 14 ns and the global clock period is 5 ns, the latency of this unit is defined as 3 cycles. This means that the asynchronous function unit completes one operation, and other units can read its result register, after three clock periods. This method solves the problem of how to use asynchronous function units in the synchronous TTA environment. In fact, the latency of the asynchronous unit is only 14 ns, which is less than three clock periods (15 ns). The latency conversion may cause some performance loss, but it is a worthwhile trade-off between performance and power for systems that are sensitive to power dissipation.

4.4 General Organization

The processor is composed of nine separate function units and a total of eight register files (RF) containing 32 general-purpose registers. The function units include one asynchronous multiplier unit (AMUL), two asynchronous adder units (AADD), one comparator unit (COMP), one data address generator (AG), one I/O unit (I/O), one shift unit (SH) and two load/store units (LSU). These function units and register files are fully connected by an interconnection network consisting of six 32-bit buses, which are used to transport data. In addition, the processor contains instruction and data memories. The general organization of the proposed GSLA TTA processor tailored for FFT (GSLAFFT) is presented in Fig. 5.

[Figure: function units AMUL, AADD (two), COMP, AG, SH, I/O and LSU (two), and register files (RF)]

Fig. 5. Architecture of the proposed GSLAFFT processor core

The structural description of the FFT processor core was obtained with the aid of the hardware subsystem of the modified MOVE framework, which generated the Verilog description. The structures of the address generator unit, comparator unit, shift unit, I/O unit, and load/store unit were described manually in Verilog. The library of predesigned asynchronous function units, including Verilog descriptions and layouts, was implemented according to our asynchronous circuit design flow.
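The latency conversion of Section 4.3, in which the latency of an asynchronous function unit is expressed as a whole number of global clock cycles, amounts to a ceiling division. The sketch below assumes integer times in picoseconds to avoid floating-point rounding; the function names are hypothetical and are not part of the MOVE framework.

```python
def latency_to_cycles(latency_ps, clock_period_ps):
    """Smallest number of whole clock periods that covers the unit latency."""
    if latency_ps <= 0 or clock_period_ps <= 0:
        raise ValueError("latency and clock period must be positive")
    return -(-latency_ps // clock_period_ps)  # integer ceiling division


def conversion_slack_ps(latency_ps, clock_period_ps):
    """Time lost by rounding the latency up to whole clock periods."""
    cycles = latency_to_cycles(latency_ps, clock_period_ps)
    return cycles * clock_period_ps - latency_ps
```

For the multiplier example in the text, a latency of 14 ns with a 5 ns clock gives 3 cycles, and the rounding costs 1 ns per operation.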

5 Simulation Results

In this work, we also implemented a synchronous multiplier unit and adder unit using the same data path as their asynchronous versions. The synchronous FFT processor core (SFFT) replaces the asynchronous function units of GSLAFFT with their synchronous versions. SFFT and GSLAFFT were implemented in a 0.18 µm 1P6M CMOS standard cell ASIC technology. Both layouts were implemented in a standard cell automatic place and route environment. Mentor Graphics Calibre was used for LPE (Layout Parasitic Extraction) and Synopsys Nanosim was used for performance and power analysis. In the performance and power analysis, the simulation supply voltage was 1.8 V, the temperature was 25°C, and the device parameters used the typical values provided by the foundry. The clock frequency of both processors was 200 MHz. The obtained results are listed in Table 1. It should be noted that the power dissipation of the instruction and data memories is not taken into account.

Table 1. Characteristics of 1024-point FFT on SFFT and GSLAFFT

Design    Clock Frequency [MHz]  Execution Time [µs]  Area [mm²]  Power [mW]  Energy [µJ]
SFFT      200                    26.09                0.8543      51.24       1.34
GSLAFFT   200                    26.09                0.9001      43.41       1.13
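The energy column of Table 1 follows from power multiplied by execution time, and the same figures give the relative power saving and area overhead of GSLAFFT. The short sketch below only recomputes values from Table 1; the function names are illustrative.

```python
def energy_uj(power_mw, time_us):
    """Energy in microjoules from power in milliwatts and time in microseconds."""
    return power_mw * time_us / 1000.0


def relative_change(reference, value):
    """Change of value relative to reference, as a fraction of the reference."""
    return (value - reference) / reference


sfft_energy = energy_uj(51.24, 26.09)        # SFFT row of Table 1
gslafft_energy = energy_uj(43.41, 26.09)     # GSLAFFT row of Table 1
power_saving = -relative_change(51.24, 43.41)
area_overhead = relative_change(0.8543, 0.9001)
```

Under these numbers GSLAFFT dissipates roughly 15% less power than SFFT at the same execution time, for an area increase of roughly 5%.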

Owing to the fine-grain clock gating and zero standby power consumption that are characteristic of asynchronous circuits, the total power of GSLAFFT is lower than that of SFFT. This shows that a GSLA implementation based on TTA can offer a low-power solution for some application-specific embedded systems. Because of the area cost of the control circuits and of the independent power ring layout, the area of an asynchronous function unit is larger than that of its synchronous version. The area cost is a disadvantage of an asynchronous circuit implementation without any area optimization. Designers should seek a trade-off between power consumption, performance and area. In different applications, optimization techniques for different targets should be used [23]. Taking into account the improvement in performance and power dissipation, the design and application of asynchronous circuits deserve attention.

Table 2 presents how many 1024-point FFT transforms can be performed with an energy of 1 mJ. The results are presented for several different implementations of the 1024-point FFT. The 1024-point FFT with the radix-4 algorithm can be computed in 6002 cycles on the TI C6416 when using 32-bit complex words [24]. The Stratix is an FPGA solution with dedicated embedded FFT logic using an Altera Megacore function [25]. The MIT FFT uses subthreshold circuit techniques [26]. The FFTTA is a low-power application-specific processor for FFT [1].

Compared to the other FFT implementations, the proposed GSLAFFT processor shows significant energy efficiency. The MIT FFT outperforms the GSLAFFT. However, due to its long execution time, the MIT FFT is not suitable for high-performance designs. The FFTTA also shows a significant improvement in energy efficiency. It should be noted that the instruction and data memories account for 40% of the total power consumption of FFTTA. If the power consumption of the instruction and data memories is not included, the FFTs per mJ of FFTTA would be 1076, which outperforms our GSLAFFT. However, the performance of our processor can be scaled, i.e., the execution time can be halved by doubling the resources. On the other hand, if we use a more advanced process technology, such as 130 nm, the clock frequency will increase, the supply voltage will decrease, and the power consumption will be reduced, so the FFTs per energy unit will increase. As a simple estimate, if the clock frequency of GSLAFFT in 130 nm technology can be increased to 250 MHz, the FFTs per mJ will increase to 1104 even if the power consumption is not reduced.

Table 2. Statistics of some 1024-point FFTs

Design     Technology [nm]  Clock Frequency [MHz]  Supply Voltage [V]  Execution Time [µs]  FFT/mJ
GSLAFFT    180              200                    1.8                 26.09                884
TI C6416   130              720                    1.2                 8.34                 100
TI C6416   130              600                    1.2                 10.0                 167
TI C6416   130              300                    1.2                 21.7                 250
Stratix    130              275                    1.3                 4.7                  241
Stratix    130              133                    1.3                 9.7                  173
Stratix    130              100                    1.3                 12.9                 149
MITFFT     180              0.01                   0.35                250000               6452
MITFFT     180              6                      0.9                 430.6                1428
FFTTA      130              250                    1.5                 20.9                 645
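The FFT/mJ figures and the scaling estimates quoted for GSLAFFT and FFTTA can be reproduced approximately with simple arithmetic. The sketch below is a back-of-the-envelope check under stated assumptions: power is assumed constant when the clock frequency changes, execution time is assumed to scale inversely with clock frequency, and the function names are illustrative. Small differences from the quoted values come from rounding of the table entries.

```python
def ffts_per_mj(power_mw, time_us):
    """Number of transforms per millijoule for a given power and execution time."""
    energy_uj = power_mw * time_us / 1000.0
    return 1000.0 / energy_uj


def scaled_time_us(time_us, old_clock_mhz, new_clock_mhz):
    """Execution time after a clock change, assuming an unchanged cycle count."""
    return time_us * old_clock_mhz / new_clock_mhz


# GSLAFFT at 200 MHz and at a hypothetical 250 MHz with unchanged power.
gslafft_200 = ffts_per_mj(43.41, 26.09)
gslafft_250 = ffts_per_mj(43.41, scaled_time_us(26.09, 200, 250))

# FFTTA without memories, which are stated to take 40% of its total power.
fftta_without_memories = 645 / (1 - 0.40)
```

The results are close to the 884, 1104 and 1076 FFTs per mJ given in the text.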

6 Conclusion

TTA is very suitable for embedded systems because of its flexibility and configurability. Supported by special low-power function units, the architecture is easy to modify for different embedded applications that are sensitive to power. In this paper, a low-power application-specific globally synchronous locally asynchronous processor for FFT computation has been described. The resources of the processor have been tailored to the needs of the application. Asynchronous circuits have been used to reduce the power consumption of the processor. We implemented a GSLA wrapper, which makes it possible to use asynchronous function units in TTA. The processor was implemented in a 180 nm 1P6M CMOS standard cell ASIC technology. The results show that a GSLA implementation based on TTA can offer a low-power solution for some application-specific embedded systems. Although there is some area cost, the design and application of asynchronous circuits deserve attention in embedded systems that are sensitive to power dissipation. The described processor has limited performance, but the purpose of our experiment was to prove the feasibility and potential of the proposed approach. The performance can be improved by introducing additional function units and optimizing the code of the application.

Acknowledgment

This work has been supported by the National Natural Science Foundation of China (90407022). We thank Andrea Cilio and his colleagues at Delft University of Technology, the Netherlands, for their great help and support of our research work.


References

1. Pitkanen, T., Makinen, R., Heikkinen, J., Partanen, T., Takala, J.: Low-power, high-performance TTA processor for 1024-point fast Fourier transform. In: Vassiliadis, S., Wong, S., Hämäläinen, T.D. (eds.) SAMOS 2006. LNCS, vol. 4017, pp. 227–236. Springer, Heidelberg (2006)
2. Wang, A., Chandrakasan, A.P.: Energy-aware architectures for a real-valued FFT implementation. In: Proceedings of the 2003 International Symposium on Low Power Electronics and Design, pp. 360–365 (2003)
3. Lee, J.S., Sunwoo, M.H.: Design of new DSP instructions and their hardware architecture for high-speed FFT. VLSI Signal Process 33(3), 247–254 (2003)
4. Takala, J., Punkka, K.: Scalable FFT processors and pipelined butterfly units. VLSI Signal Process 43(2-3), 113–123 (2006)
5. Zhou, Y., Noras, J.M., Shepherd, S.J.: Novel design of multiplier-less FFT processors. Signal Process 87(6), 1402–1407 (2007)
6. Weste, N.H.E., Eshraghian, K.: Principles of CMOS VLSI Design: A Systems Perspective. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1985)
7. Chandrakasan, A., Sheng, S., Brodersen, R.: Low-power CMOS digital design. IEEE Journal of Solid-State Circuits 27(4), 473–484 (1992)
8. Werner, T., Akella, V.: Asynchronous processor survey. Computer 30(11), 67–76 (1997)
9. Furber, S.B., Garside, J.D., Temple, S., Liu, J., Day, P., Paver, N.C.: AMULET2e: An asynchronous embedded controller. In: Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 290–299 (1997)
10. Garside, J.D., Bainbridge, W.J., Bardsley, A., Clark, D.M., Edwards, D.A., Furber, S.B., Lloyd, D.W., Mohammadi, S., Pepper, J.S., Temple, S., Woods, J.V., Liu, J., Petlin, O.: AMULET3i - an asynchronous System-on-Chip. In: Proceedings of the 6th International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 162–175 (2000)
11. Furber, S.B., Edwards, D.A., Garside, J.D.: AMULET3: a 100 MIPS asynchronous embedded processor. In: Proceedings of the 2000 IEEE International Conference on Computer Design, pp. 329–334. IEEE Computer Society Press, Los Alamitos (2000)
12. Kawokgy, M., Andre, C., Salama, T.: Low-power asynchronous Viterbi decoder for wireless applications. In: Proceedings of the 2004 International Symposium on Low Power Electronics and Design, pp. 286–289 (2004)
13. Cooley, J., Tukey, J.: An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19, 297–301 (1965)
14. Jain, M.K., Balakrishnan, M., Kumar, A.: ASIP design methodologies: Survey and issues. In: Proceedings of the 14th International Conference on VLSI Design (VLSID '01), pp. 76–81 (2001)
15. Corporaal, H.: Microprocessor Architecture: from VLIW to TTA. John Wiley & Sons Ltd., Chichester (1998)
16. Corporaal, H., Arnold, M.: Using Transport Triggered Architectures for embedded processor design. Integrated Computer-Aided Engineering 5(1), 19–37 (1998)
17. Sutherland, I.E.: Micropipelines. Communications of the ACM 32(6), 720–738 (1989)
18. Gong, R., Wang, L., Li, Y., Dai, K., Wang, Z.Y.: A de-synchronous circuit design flow using hybrid cell library. In: 8th International Conference on Solid-State and Integrated Circuit Technology, pp. 1860–1863 (2006)
19. Piguet, C., Zahnd, J.: STG-based synthesis of speed-independent CMOS cells. In: Workshop on Exploitation of STG-Based Design Technology (1998)
20. Cortadella, J., Kishinevsky, M., Kondratyev, A., Lavagno, L., Yakovlev, A.: Petrify: a tool for manipulating concurrent specifications and synthesis of asynchronous controllers. IEICE Transactions on Information and Systems E80-D(3), 315–325 (1997)
21. Shams, M., Ebergen, J., Elmasry, M.: A comparison of CMOS implementations of an asynchronous circuits primitive: the C-element. In: International Symposium on Low Power Electronics and Design, vol. 12(14), pp. 93–96 (1996)
22. Xie, Y., Fu, B.: Design and implementation of high throughput FFT processor. Computer Research and Development 41(6), 1022–1029 (2004)
23. Zhou, Y., Sokolov, D., Yakovlev, A.: Cost-aware synthesis of asynchronous circuits based on partial acknowledgement. In: Proceedings of the 2006 IEEE/ACM International Conference on Computer-Aided Design, pp. 158–163. ACM Press, New York (2006)
24. Texas Instruments: TMS320C64x DSP Library Programmer Reference (2002)
25. Lim, S., Crosland, A.: Implementing FFT in an FPGA co-processor. In: The International Embedded Solutions Event (GSPx), pp. 27–30 (2004)
26. Wang, A., Chandrakasan, A.: A 180-mV subthreshold FFT processor using a minimum energy design methodology. IEEE Journal of Solid-State Circuits 40, 310–319 (2005)