Automated Design Space Exploration for DSP Applications


Ramsey Hourani, Ravi Jenkal, W. Rhett Davis, Winser Alexander
Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695, USA
(919) 513-2839
{rhouran, rsjenkal, rhett davis, winser}@ncsu.edu

Abstract— We present a performance analysis framework that efficiently generates and analyzes hardware architectures for computationally intensive signal processing applications. Our framework synthesizes designs from a high level of abstraction into well-constructed and recognizable hardware structures that perform well in terms of area, throughput and power dissipation. Cost functions provided by our framework allow the user to reduce the design space to a set of efficient hardware architectures that meet performance constraints. We utilize our framework to estimate hardware performance using a set of pre-synthesized mathematical cores which expedites the synthesis process by approximately 14 fold. This reduces the architectural generation and hardware synthesis process from days to several hours for complex designs. Our work aims at performing hardware optimizations at the architectural and arithmetic levels, relieving the user from manually describing the designs at the RTL and iteratively varying the hardware architectures. We illustrate the efficiency and accuracy of our framework by generating finite impulse response (FIR) filter structures used in several signal processing applications such as adaptive equalizers and quadrature mirror filters. The results show that hardware filter structures generated by our framework can achieve, on average, a 3 fold increase in power efficiency when compared to manually constructed designs.

I. INTRODUCTION

The accelerated advances in very large scale integration (VLSI) technology have contributed to reduced integrated circuit (IC) area and increased clock frequency. Designers are focused on the difficult task of optimizing the overall area and throughput for signal processing hardware implementations. The requirement to minimize power dissipation in these designs adds to the complexity of developing hardware architectures that perform well. Current electronic design automation (EDA) tools assist designers in modeling and characterizing hardware architectures that are described using various levels of design abstraction other than the register transfer level (RTL) [1] [2]. This permits designers to apply design optimizations and explore the behavior of alternative hardware architectures. However, the task of manually varying the architectural representations often degrades design productivity. DSP algorithm developers, as well as hardware designers, concentrate their efforts on establishing methodologies that efficiently refine a DSP algorithm to a synthesizable hardware design.

Architectural diversities expand the design space for a signal processing algorithm to tens of realizable hardware architectures. Varying technology size and supply voltages further extends the design space to hundreds of possible design permutations, each with a unique set of area, throughput and power metrics. This leads to performance trade-offs in the design space with no accepted way of quantifying performance efficiency using a single cost function that combines area, throughput and power. Designers are therefore forced to exhaustively search the design space looking for hardware designs that perform well and meet final design specifications, a process which also prolongs design time and degrades productivity.

Various methodologies have emerged to synthesize designs described using high-level languages such as C++ and MATLAB [2] [3]. A number of optimization techniques are not included in these tools, such as cut-set retiming, which efficiently pipelines a structured signal processing architecture. Since synthesis tools traditionally preserve the behavior of the hardware architecture specified by the designer, they are unable to suggest a more efficient architecture, with a different behavior, that performs better. Therefore, most of the design effort is spent coding the various behaviors of an algorithm and re-synthesizing to assess the performance. The large gap between the DSP algorithm and the efficient synthesizable hardware design further hinders productivity. The goal is therefore to develop a framework that efficiently generates hardware architectures for DSP applications while simultaneously minimizing area, delay and power for a given set of design constraints.

In this paper, we evaluate the effects of architectural optimizations on data-intensive hardware designs using our performance analysis framework. The goal of this work is to generate recognizable, yet efficient, hardware structures for common DSP functions.
Currently, our framework generates hardware architectures for digital filters, and algorithms that utilize filters such as adaptive equalizers and quadrature mirror filters. We provide a MATLAB graphical user interface (GUI) that automatically generates various architectures for DSP functions and guides the designer in selecting hardware structures that perform well for given constraints. Our framework reduces the design space to a smaller set of hardware architectures that meet design specifications, making the architectural selection process more efficient. We introduce

a methodology for synthesizing large hardware architectures in seconds versus tens of minutes per design, which expedites the process of accurately evaluating area, throughput and power dissipation. This allows the user to generate and explore the design space in hours versus weeks or months if these tasks were performed manually. We illustrate the efficiency of our design exploration methodology by using our framework to generate and analyze finite impulse response (FIR) filters used in DSP applications with specific design constraints.

This paper is organized as follows. Section II discusses relevant tools and methodologies developed over the years for modeling DSP algorithms in hardware. Section III presents the hardware performance metrics and figures of merit we use in our work to assess the quality of hardware designs. Section IV introduces our performance analysis framework for generating hardware architectures and evaluating design performances. Section V highlights the details of our work using the FIR filter design space as a fundamental DSP algorithm. Section VI illustrates how a designer can utilize our framework to design and evaluate high-performance architectures for DSP applications requiring FIR filters, and Section VII concludes the paper.

II. RELATED WORK

Several tools and methodologies exist that help DSP designers refine their algorithms into realizable hardware architectures. In cases where an exhaustive search of the design space is employed, the hardware structures generated by such tools, while functional, may be unrecognizable to a hardware designer. This tends to deter the hardware designer from further refining or altering the architectures generated by the automated tools. Synthesis tools are adapted to refining a behavioral DSP algorithm to a low-level gate netlist. However, the synthesis times involved in efficiently laying out the standard-cell gates contribute to prolonged design times.
Additionally, tools that explore different architectural layouts for hardware designs often require the user to manually measure the performance of the hardware design, an iterative and complicated process which also degrades design productivity. The following is a brief survey of tools, some of which have made it into mainstream commercial use, that indicate several concerns in improving DSP algorithm refinement.

HYPER is an automated high-level synthesis system that applies optimizations at the architectural level to minimize power dissipation while maintaining acceptable throughput and area measurements [1] [4]. The user provides the initial signal processing architecture, which is represented using signal flow graphs. At that point, the tool explores the design space using flow-graph transformations that alter the structure of the initial design while preserving the input/output behavior. We adopt a similar design space exploration methodology of expediting the hardware generation process. This method proved useful in improving productivity that would otherwise require the designer to manually optimize the hardware designs and analyze the performance. However, the tool is constrained to optimizing the throughput of designs at the architectural level and may not provide alternate design permutations at the arithmetic, logic, or circuit levels. Additionally, the hardware structures generated by the tools may deviate considerably from the initial structure, making it difficult for the designer to recognize the final hardware design.

Providing a set of parameterized IP cores for basic DSP functions is an alternative scheme for reducing the overall design time and increasing productivity. Synthesis tool makers, such as Synopsys [5], support parameterized DesignWare IP cores for common DSP applications such as FIR and Infinite Impulse Response (IIR) filters. The IP cores include parameters for filter order and signal bit-widths, and can therefore be used for general-purpose applications without having to describe the IP cores at lower levels of detail. However, the Synopsys IP cores lack optimization parameters that improve the performance of the design for certain classes of applications. For example, the DesignWare FIR filters are implemented using a basic transpose form structure. While this is a high-throughput structure, the hardware designer cannot vary the filter at the architectural level to include additional optimizations such as pipelining or parallelism. This would require the designer to manually construct the basic filter structure, such as that required by the HYPER tools, and then proceed to iteratively modify the structure to implement design optimizations. Additionally, the structure of the Synopsys DesignWare FIR filter limits its use in DSP systems with varying parameters such as adaptive equalizers. For such applications, designers would need to either manually construct the application-specific hardware structure, or resort to purchasing the specific DSP block from a hardware IP vendor. As with any IP block, the designer is constrained to the structure of the IP design and must accept the performance of the hardware block, since altering the structure is virtually impossible.
Our work addresses the need to construct application-specific hardware structures that perform well by including additional design parameters that control the behavior of the hardware blocks.

Optimization tools that work in combination with synthesis tools, such as Synopsys' Module Compiler, are used to invoke logic-level timing variations to improve the critical path delay in datapath-intensive designs [6]. Module Compiler builds the design into optimized gates using a hardware description of the DSP architecture. Module Compiler is also used to simultaneously generate area, delay, power, critical path and latency reports which the designer uses to evaluate the design. Any architectural improvements would require the designer to manually re-describe the architecture and proceed to evaluate the performance of the design once more. While this interactive design tool promptly returns synthesis results to the designer, experienced hardware designers are still required to manually describe the overall architecture at the RTL and sift through the design space searching for the optimal design. The hardware designer remains detached from the DSP algorithm and is unaware of how architectural variations may affect the functionality of the DSP algorithm.

MATLAB has been used in previous works as the gateway for generating and analyzing DSP hardware designs. Researchers at Northwestern University developed tools for synthesizing hardware code described using MATLAB [2] [7]. Their work founded the Xilinx tool "AccelDSP Synthesis Tool" [8] (formerly "AccelChip"). Such tools were geared toward developing and integrating verification flows for automatically mapping MATLAB algorithms onto field programmable gate arrays (FPGA) and application specific integrated circuits (ASIC). Additionally, the synthesis tools were able to explore hardware designs using metrics such as area, delay and quantization error, which aimed at improving hardware utilization for the target FPGAs. Similar to the Synopsys synthesis tools, AccelDSP includes a library of IP blocks suitable for DSP applications. The hardware architectures for the IP blocks limited the user from applying additional low-level optimization techniques catered towards application-specific designs. Any modifications to the hardware structures required hardware designers to manually re-describe the design using MATLAB as a hardware description language in order for their synthesis compiler to work properly. Using MATLAB as a low-level hardware description language may not be obvious to most hardware designers. Additionally, the hardware designs generated by the compilers, while functional, were difficult to recognize. This made it difficult for the designers to modify or further refine the hardware architectures.

Our framework bridges the gap between the high-level DSP algorithm and the low-level hardware architecture by automating much of the algorithm refinement process. While other tools provide one or two IP blocks for general purpose DSP applications, we provide a system that generates application specific, and therefore efficient, hardware blocks. Hardware optimization techniques included in our design environment ensure increased efficiency in the performance of the final DSP architecture. Our framework includes multiple performance metrics and user-defined cost functions, discussed in the next section, that guide the designer in selecting hardware architectures that perform well.
The hardware architectures generated by our framework stay within the set of well-known DSP structures, which makes it easier for the hardware designer to recognize them and to further refine the hardware architecture.

III. DESIGN SPACE EXPLORATION

In this section, we briefly present examples of performance metrics and figures of merit we use to assess the quality of hardware architectures. Designers are accustomed to using cost functions for simplifying the process of searching the design space for hardware structures that perform well [9]. The quality of a design can be measured by its area, throughput and power metrics, all of which depend on technology size, architectural variations and supply voltage. This information is used to guide the designer through the three-dimensional design space in striking a balance between all three performance metrics. For example, throughput is a vital design constraint in ASICs, whereas hardware resource utilization is the driving force behind minimizing design area for FPGAs. Minimizing power dissipation for a given throughput constraint is imperative for heat-sensitive chip realizations and battery-operated devices. In the case of digital circuits, trying to optimize one performance metric often results in a degradation of the others.

Our work addresses the needs of both DSP algorithm developers and hardware designers by providing a design environment that allows both algorithmic and architectural exploration at different levels of design abstraction. Our framework interactively assesses the quality of the DSP algorithm by measuring pertinent performance metrics such as design latency and computational precision. Additionally, our framework refines each DSP algorithm to several realizable hardware architectures, while conforming to the correct functionality and constraints of the algorithm. Hardware performance metrics provided by our framework allow the designer to analyze the quality of the hardware design in terms of area, throughput and power dissipation.

A. Levels of Design Abstraction

Gemmeke divided the design space for DSP algorithms into several levels of abstraction which facilitate optimizing hardware designs [10]. We adopt similar levels of design abstraction in describing our methods for improving the performance of the hardware architectures considered in our work. The algorithmic level can be as simple as a few lines of code that equate the output to a function of the input. Any variations in the construction of the DSP algorithm can impact the final hardware representation. The arithmetic level provides a means of exploration with varying roundoff methods and overflow modes, which are vital when considering fixed point implementations. The next level of abstraction is the architectural level, where signal flow graphs (SFG) are used to show the transition of signals through a set of registers and mathematical blocks, an example of which is shown in Section V. The flexibility in design variations is reduced at the lower levels of abstraction. For example, at the logic level, hardware designers can choose from a limited set of mathematical operations for implementing the DSP algorithm.
The overall performance for the DSP design is affected by the hardware architecture of the arithmetic blocks [11]. The lowest level of abstraction is the circuit level, which determines the depth of the combinational logic between sequential circuits. Area, delay and power dissipation are all affected by variations in the algorithm and hardware architecture at all levels of abstraction.

B. Hardware Performance Metrics

VLSI designs targeted for ASICs can be uniquely characterized by their area, delay and power metrics, which in turn depend on several design factors [12] [13]. The total area, A, for a DSP hardware design depends on the number of basic computational elements used to implement the signal processing algorithm, and the technology size λ for which the hardware designs are synthesized. The critical path delay Td depends on both λ and the supply voltage VDD used for circuit operation. The operating frequency fop is governed by the largest critical path delay Td,max, where fop = 1/Td,max. Pipelining reduces the critical path delay by placing registers in the path of heavily clustered combinational circuits [13] [14]. This process, however, may increase the latency L, the number of clock cycles required to generate the first output of the circuit. Throughput R is a function of both latency and operating frequency and depends on the signal processing algorithm under investigation. In this paper, we define throughput as the number of output samples generated by the DSP hardware architecture in one second (samp/sec). Dynamic power Pdynamic is the main source of power dissipation in CMOS circuits for feature sizes greater than 65 nm, and can be computed as

$$ P_{dynamic} = \alpha \, C_{switch} \, V_{DD}^{2} \, f_{op} \tag{1} $$

where α is the switching activity factor and Cswitch is the switching capacitance [15]. In most cases, hardware designers are burdened with the iterative task of varying design parameters in an attempt to generate efficient hardware architectures, which complicates the design process.
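As a minimal sketch of how the quantities in this subsection relate, the following Python snippet evaluates fop = 1/Td,max and Equation 1. The function names and the numeric values (path delays, activity factor, capacitance, supply voltage) are illustrative assumptions, not measurements from the framework.

```python
from typing import Sequence


def operating_frequency(path_delays_s: Sequence[float]) -> float:
    """fop = 1 / Td,max: the slowest register-to-register path sets the clock."""
    return 1.0 / max(path_delays_s)


def dynamic_power(alpha: float, c_switch_f: float, vdd_v: float, f_op_hz: float) -> float:
    """Equation 1: Pdynamic = alpha * Cswitch * VDD^2 * fop (result in watts)."""
    return alpha * c_switch_f * vdd_v ** 2 * f_op_hz


if __name__ == "__main__":
    # Hypothetical path delays (seconds) for one design; placeholders only.
    f_op = operating_frequency([2e-9, 5e-9, 4e-9])
    # Hypothetical alpha, Cswitch (farads) and VDD (volts).
    p = dynamic_power(alpha=0.1, c_switch_f=1e-9, vdd_v=1.2, f_op_hz=f_op)
    print(f"fop = {f_op:.3e} Hz, Pdynamic = {p:.3e} W")
```

Because VDD enters Equation 1 squared, doubling the supply voltage at a fixed operating frequency quadruples the dynamic power in this model.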

C. Figures of Merit

Designers have come to utilize two-dimensional cost functions in an effort to balance area, throughput and power. Power density, P/A, which measures power dissipation per unit area [16], is a two-dimensional cost function used to measure heat dissipation and battery life in portable and implantable devices. Power efficiency, PT, could be used to select a design with equally important throughput and power constraints. Similarly, area efficiency, AT, balances between design area and throughput. Each two-dimensional cost function is crucial for determining the best architecture that meets constrained design specifications. Figure 1 illustrates the trends in trade-off curves between pairs of performance metrics for a hardware design space [17] [18].

Fig. 1: Trade-off curves in hardware designs.

A closer look at the power-delay curve reveals an interesting and important design trend in digital circuits. Two common techniques exist for improving both throughput and power dissipation for digital circuits [13]. Pipelining may be used to reduce the critical path delay within a design. While this process improves design throughput, other performance metrics such as area, dynamic power and latency may suffer as a consequence of pipelining, which are undesirable side effects. An alternative method of increasing the throughput without affecting area or latency is increasing the supply voltage. It is known that the delay of a circuit is proportional to [13] [19]

$$ T_d \propto \frac{V_{DD}}{(V_{DD} - V_{th})^{2}} \tag{2} $$

where Vth is the threshold voltage of the CMOS circuits and is assumed constant for a given technology node. Therefore, when VDD is well above Vth, increasing VDD decreases the critical path delay roughly in proportion to 1/VDD, but quadratically increases the dynamic power dissipation as Equation 1 suggests. Figure 2 illustrates how VDD scaling can be combined with pipelining techniques to achieve desirable throughput and power dissipation measurements. In most cases, designers need to manually explore the best combination of pipelining and VDD scaling to generate high-throughput, low-power designs. This process contributes to the degradation in overall design productivity.

Fig. 2: Power-Delay trade-off curve.

IV. PERFORMANCE ANALYSIS FRAMEWORK

We develop a performance analysis framework that is catered to both the high-level DSP algorithm developer as well as the expert hardware designer. Our framework improves overall design productivity by generating recognizable hardware structures that perform well for specific design constraints defined by the user. Additionally, our work aims at significantly reducing synthesis times, allowing the designer to explore the hardware design space in hours versus weeks. This allows the designer to investigate alternate hardware architectures in a relatively short time, should the designers change or enhance the system specifications. We provide the designer several user-defined cost functions which can be used to evaluate trade-offs between area, throughput, power and latency. The final output of our framework is a reduced set of synthesizable hardware architectures with improved performance metrics compared to general purpose, “commercial off the shelf” hardware designs. The structure of the designs generated by our framework allows proficient hardware designers to apply additional optimizations at multiple levels of design abstraction; a key advantage for further improving the design performance. The user remains an integral part of making architectural decisions without swamping him or her with hundreds of design options.
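The reduction of the design space by constraints and a user-selected cost function can be sketched as follows. This is only an illustration in Python, not the framework's MATLAB implementation: the `Design` record, the function names, and the reading of T in AT and PT as the time per output sample (1/throughput) are assumptions made here, and the paper does not state how many designs are returned.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Design:
    name: str
    area: float        # e.g. square micrometres
    throughput: float  # output samples per second
    power: float       # watts

    @property
    def period(self) -> float:
        """T, assumed here to be the time per output sample."""
        return 1.0 / self.throughput


# Two-parameter cost functions named in Section III; lower is better.
def area_efficiency(d: Design) -> float:
    return d.area * d.period        # AT


def power_efficiency(d: Design) -> float:
    return d.power * d.period       # PT


def power_density(d: Design) -> float:
    return d.power / d.area         # P/A


def reduce_design_space(
    designs: Sequence[Design],
    cost: Callable[[Design], float],
    min_throughput: float = 0.0,
    max_power: float = float("inf"),
    max_area: float = float("inf"),
    keep: int = 3,
) -> List[Design]:
    """Drop designs that violate the constraints, then return the `keep`
    best remaining designs under the chosen cost function."""
    feasible = [
        d for d in designs
        if d.throughput >= min_throughput
        and d.power <= max_power
        and d.area <= max_area
    ]
    return sorted(feasible, key=cost)[:keep]
```

Returning a short ranked list rather than a single winner mirrors the stated intent of keeping the user involved in the final architectural decision.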

A. EDA Tools

We combine well-known design tools and scripting languages accepted by both DSP algorithm developers and hardware designers as the engines behind our work. Our purpose in selecting common EDA tools is to facilitate the use of our framework, as well as allow designers to expand the set of hardware designs within the framework's library. We chose MATLAB as the interface between the user and the hardware designs generated by our framework since algorithm developers are more comfortable using MATLAB to efficiently and accurately model their signal processing algorithms. The data generated by the MATLAB algorithmic models are used as reference data for validating the functionality of the lower-level designs. Additionally, MATLAB is well suited to numerically and graphically analyzing the design space in search of hardware architectures that perform well. This is accomplished through the use of cost functions that assess the quality of a hardware design using one or more performance metrics. Our framework provides the designers the freedom to define the cost functions they need to assist them in searching the design space for hardware structures that meet design specifications.

SystemC is an additional design language we utilize in our work that extends the capabilities of C/C++ by providing a set of C++ class libraries capable of modeling designs at different levels of abstraction, from the algorithmic to the logic level [20] [21] [22]. The main advantage of using SystemC in our work is modeling the DSP algorithms at the RTL, while using high-level data-types and communication interfaces that expedite the simulation process of the hardware design [23]. Our choice of Verilog to model the hardware designs at the synthesizable RTL allows designers to use our framework as the front-end tool to their synthesis CAD tools. We chose to use Synopsys Design Compiler, a common CAD tool used both academically and commercially, for our synthesis environment.

B. Hardware Design and Synthesis Process

The designer initially selects the type of DSP function within our framework to be modeled in hardware. Examples of DSP functions currently included in our framework are FIR, IIR, quadrature mirror and adaptive filters. The user then specifies high-level parameters, such as filter order and word sizes. Our framework proceeds to generate each hardware design starting with common and well recognized structures. We employ a bottom-up, modular design methodology for each design where scripts in our framework construct the basic computational blocks using basic mathematical cores. Design parameters within the scripts guide our framework in connecting the computational blocks using computational cells to implement the hardware architecture for the DSP function [24]. Figure 3 illustrates the process for constructing a single hardware architecture for a DSP algorithm using basic arithmetic hardware blocks and computational cells. The variations in hardware performance for the DSP functions depend primarily on the construction and placement of the computational cells [14].
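The bottom-up composition described above (cores, then computational cells, then a full architecture) can be pictured with a small Python sketch. Everything here is hypothetical: the class names, the single multiply-add-register cell, and the area numbers are placeholders chosen for illustration, and the paper's scripts are not shown in the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Core:
    """A basic mathematical core with a pre-characterized area."""
    name: str
    area: float  # arbitrary units; placeholder values only


@dataclass(frozen=True)
class Cell:
    """A computational cell assembled from basic cores."""
    name: str
    cores: Tuple[Core, ...]

    @property
    def area(self) -> float:
        return sum(core.area for core in self.cores)


def build_architecture(cell: Cell, count: int) -> List[Cell]:
    """Replicate one cell type, e.g. one cell per filter tap."""
    return [cell] * count


def core_usage(cells: List[Cell]) -> Dict[str, int]:
    """Count how many instances of each core the architecture uses."""
    return dict(Counter(core.name for cell in cells for core in cell.cores))


def estimate_area(cells: List[Cell]) -> float:
    """First-order area estimate: the sum of the constituent core areas."""
    return sum(cell.area for cell in cells)


# Placeholder numbers for illustration, not synthesis results.
MULT = Core("multiplier", 100.0)
ADD = Core("adder", 20.0)
REG = Core("register", 5.0)
TAP = Cell("mac_tap", (MULT, ADD, REG))
fir_8tap = build_architecture(TAP, 8)
```

Changing which cores a cell contains, or how cells are replicated, changes the estimate without touching the cores themselves, which is the appeal of a modular description.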

Fig. 3: Bottom-up modular design using computational cells.

Figure 4 illustrates the underlying process for our work in generating and synthesizing efficient hardware designs for filters, a fundamental DSP function. Our framework starts with recognizable hardware structures for the filter, such as the direct and transpose forms, and proceeds to generate architectural variations of the basic structures k = 1, ..., K where K is the total number of filter architectures in the design space. The details of the filter architectures generated by our framework are discussed in the next section. Each filter is modeled in hardware using SystemC and Verilog RTL code. Scripts within our framework generate a C++ testbench for each SystemC filter, where a set of simulations are executed in order to verify the structural precision of the filter designs. Our previous experiments showed that simulations performed at the arithmetic level using the SystemC models were approximately five times faster than simulations performed using the Verilog models for designs generated by our framework [23]. We utilize scripts and compilers that translate each SystemC filter structure to its Verilog RTL equivalent [25].

Each filter structure is generated using basic arithmetic cores such as multipliers, adders and delay units. These blocks are used for functional verification of the filter architectures and, therefore, are not synthesized. Scripts within our framework replace the basic arithmetic cores with pre-synthesized and optimized Synopsys DesignWare equivalent arithmetic cores, relieving the user from having to manually optimize each arithmetic block at the gate level. We utilize a limited set of arithmetic blocks for the signal processing algorithms considered in this paper. We use optimized Booth-recoded Wallace-tree multipliers and carry-look-ahead adders available in the Synopsys IP library. The same methodology can be used in cases where the user selects alternative architectures for the mathematical blocks. The estimated performance results for the hardware designs obtained using the pre-synthesized cores are comparable to synthesizing the entire design [26]. Using the pre-synthesized cores significantly reduces synthesis times from tens of minutes to seconds for each design, depending

on the design's complexity. The synthesis scripts promptly return area, timing and power reports back to the user, who then applies cost functions for searching the design space for efficient hardware architectures.

Fig. 4: DSP algorithm refinement and architectural selection process.

C. Performance Values and Cost Functions

Our framework characterizes each filter structure for different pipelining techniques and VDD values using similar scripts and techniques described in [27] [28]. This provides the user with alternative design options for improving both throughput and power dissipation. Our framework repeats the generation and synthesis process for each filter structure in the design space. The performance values for each design are relayed back to MATLAB, and temporarily saved for detailed architectural analysis using cost functions. Our framework provides the designers the freedom to select the type of cost function they wish to use in assessing the quality of the hardware designs. For the sake of simplicity, we will focus on one- and two-parameter cost functions commonly used to measure hardware performance. Examples of two-parameter cost functions are area efficiency (AT), power efficiency (PT), and power density (P/A) previously discussed in Section III. Our framework returns several closely matched designs with similar performance values rather than returning a single architecture that meets the design constraints. This allows the user to make final architectural selections, or possibly relax the performance constraints, which may provide a better design option. The end product of our flow is a reduced set of designs with specific details of the hardware architecture, the supply voltage required to operate the design, and the performance and efficiency values.

V. FIR FILTER DESIGN SPACE

We consider the FIR filter design space to illustrate the variations in hardware designs employed by our framework.

Digital FIR filters are one of the most basic, yet important, functions in signal processing systems. Applications for FIR filters include removing unwanted parts of a signal, such as random noise, or extracting useful parts of a signal, such as the components lying within a certain frequency range. The basic computational components found in FIR filter designs, such as adders and multipliers, are common examples of the type of mathematical blocks found in other DSP hardware designs. Therefore, the type of design optimizations applied to FIR filters can be extended to other DSP algorithms of similar computational intensity. The convolution sum for an Nth order FIR filter is given as

$$ y(n) = \sum_{k=0}^{N} h(k)\, x(n-k), \qquad n \ge 0 \tag{3} $$

where h(k) is the impulse response of the filter, x(n) is the input sequence, and y(n) is the filter output. The direct form structure, shown in Figure 5(a), is a direct implementation of the convolution sum, whereas the transpose form structure, shown in Figure 5(b), is a high-throughput implementation with equivalent algorithmic functionality. The choice of architecture for implementing an FIR filter depends on the required hardware complexity, desired throughput and constrained power dissipation. Other variations of the direct and transpose form structures exist (Figures 5(c) - 5(e)), which provide trade-offs between area, throughput, latency and power.

The right choice for hardware implementation of the FIR filter depends heavily on the size and number of multipliers and adders, and on the layout of the computational blocks. The delay through the N-tap direct form structure is Tmult + N × Tadd + Treg, where Tmult, Tadd, and Treg are the delays through the multiplier, adder and delay units respectively. By comparison, the delay through the transpose form structure is Tmult + Tadd + Treg. While it seems that the transpose form is a more desirable structure than the direct form in terms of throughput, there are several disadvantages to the former. The transpose form requires a larger area due to the double-sized registers that store the outputs of the adders, versus storing the smaller word sizes for the input in the direct form structure. Additionally, the large fan-out at the input x(n) for the transpose form structure requires a larger supply voltage to drive the input signal to all the multipliers, which increases the power dissipation.

A. Throughput and Power Optimization using Pipelining and VDD Scaling

The pitfalls of the large delay path for the direct form structure and the large fan-out of the transpose form can be overcome using pipelining and VDD scaling [13]. Equation 2 suggests that the delay increases as VDD is reduced.
Pipelining can effectively reduce the delay back to the desired value as VDD is reduced. We observe from Equation 1 that reducing the supply voltage quadratically reduces power dissipation for a fixed throughput, which is highly desirable. However, manually pipelining the filter architectures along with VDD scaling contributes to the complexity of searching the design space for optimal filter structures. Therefore, we use our framework to investigate the effects of combined pipelining and VDD scaling on the hardware performance of the filter structures. Figure 6 shows an example of the different types of pipelining we use to improve both throughput and power dissipation for the FIR filter architectures. We use optimized Synopsys DesignWare adders, multipliers and pipelined multipliers to construct basic signal processing computational cells for the FIR filters.

Fig. 5: Signal flow graphs for FIR filters: (a) Direct Form (DF), (b) Transpose Form (TF), (c) Direct Form II (DFII), (d) Transpose Form II (TFII), (e) Direct Form tree adder (TreeDF).
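The equivalence of the direct and transpose forms in Figures 5(a) and 5(b), and the two critical-path expressions quoted earlier, can be illustrated with a short behavioral model. This is a minimal sketch in Python, not the RTL produced by the framework; the function names and the delay values in the usage note are hypothetical.

```python
def fir_direct(h, x):
    """Direct form: a tapped delay line on the input feeds every multiplier."""
    taps = [0] * len(h)                      # holds x(n), x(n-1), ..., x(n-N)
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]          # shift the input delay line
        y.append(sum(hk * xk for hk, xk in zip(h, taps)))
    return y


def fir_transpose(h, x):
    """Transpose form: x(n) is broadcast to all multipliers; registers hold partial sums."""
    regs = [0] * (len(h) - 1)                # registers between the adders
    y = []
    for sample in x:
        products = [hk * sample for hk in h]
        y.append(products[0] + (regs[0] if regs else 0))
        # Each register captures its product plus the next register's old value.
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < len(regs) else 0)
                for i in range(len(regs))]
    return y


def critical_path_direct(n_taps, t_mult, t_add, t_reg):
    """Delay expression quoted in the text for the N-tap direct form."""
    return t_mult + n_taps * t_add + t_reg


def critical_path_transpose(t_mult, t_add, t_reg):
    """Delay expression quoted in the text for the transpose form."""
    return t_mult + t_add + t_reg
```

For any coefficient list `h` and input `x`, both functions return the same output sequence. With illustrative (assumed) delays such as `t_mult=3.0`, `t_add=0.5` and `t_reg=0.2` ns, the direct-form expression grows with the number of taps while the transpose-form expression does not, which is the throughput argument made in the text.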

Fig. 6: Different types of pipelined computational cells for FIR filter architectures.

B. Expanding the FIR Filter Design Space

Beyond the basic cascaded FIR filter structures shown in Figure 5, there exist alternative designs for specific DSP applications. Linear-phase FIR filters exhibit symmetry in the filter coefficients, which permits designers to use a folded architecture for a reduced hardware implementation. The folded design shares each multiplier between a pair of input samples, which reduces the hardware complexity by approximately 50% [29]. This can be seen from the following difference equation for a 9-tap, linear-phase FIR filter:

$$y(n) = b_0 x(n) + b_1 x(n-1) + b_2 x(n-2) + b_3 x(n-3) + b_4 x(n-4) + b_3 x(n-5) + b_2 x(n-6) + b_1 x(n-7) + b_0 x(n-8) \tag{4}$$

Grouping the terms that share a coefficient gives

$$y(n) = b_0 [x(n) + x(n-8)] + b_1 [x(n-1) + x(n-7)] + b_2 [x(n-2) + x(n-6)] + b_3 [x(n-3) + x(n-5)] + b_4 x(n-4) \tag{5}$$
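The saving expressed by Equations (4) and (5) can be checked numerically. The sketch below, written in Python for illustration only, evaluates both forms for an odd-length symmetric filter; the helper names are hypothetical and zero initial conditions are assumed.

```python
def sample(x, i):
    """x(i) with zero initial conditions for i < 0."""
    return x[i] if i >= 0 else 0


def fir_full(b, x):
    """Equation (4) generalised: one multiplier per tap of the symmetric impulse response."""
    h = b + b[-2::-1]                        # b0..b4 followed by b3..b0 in the 9-tap case
    return [sum(hk * sample(x, n - k) for k, hk in enumerate(h))
            for n in range(len(x))]


def fir_folded(b, x):
    """Equation (5) generalised: add mirrored samples first, then multiply once per coefficient."""
    mid = len(b) - 1                         # index of the centre tap, 4 for 9 taps
    last = 2 * mid                           # index of the oldest sample, 8 for 9 taps
    y = []
    for n in range(len(x)):
        acc = b[mid] * sample(x, n - mid)    # the centre tap has no mirror partner
        for k in range(mid):
            acc += b[k] * (sample(x, n - k) + sample(x, n - (last - k)))
        y.append(acc)
    return y
```

For the 9-tap case, `b` holds the five unique coefficients b0 to b4, so `fir_folded` uses five multiplications per output sample where `fir_full` uses nine, in line with the approximately 50% reduction cited above.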

Linear-phase FIR filters using a folded architecture are only a subset of filters and are typically used in specific DSP applications. The FIR filter structures illustrated in Figure 5 can be folded for implementing linear-phase filters. Pipelining techniques to improve throughput and power dissipation can also be applied to linear-phase filters. However, care must be taken, since the SFGs for folded filter structures are more complex than those of the regular cascaded designs. Design options within our framework permit the designer to select either even- or odd-tap linear-phase and pipelined FIR filters, depending on the filter order.

Polyphase structures are effective design options for multirate signal processing where decimation and interpolation filters are required, such as quadrature mirror filter banks [30]. A decimation filter can easily be implemented using a regular filter followed by a downsampler, as shown in the SFG of Figure 7(a). However, this design degrades the hardware utilization, since only 1/M of the filtered signal is retained, where M is the decimation factor. The polyphase structure increases the hardware utilization by using sub-filters that filter the signal in parallel, as shown in Figure 7(b). Both filter structures in Figure 7 are of comparable complexity; the advantage of using the polyphase structure is improved hardware utilization, which reduces power dissipation for a given throughput constraint. Each sub-filter for the polyphase structure can be implemented using any of the filter structures in Figure 5. Our framework generates different architectures for decimation filters using the polyphase structure and applies different types of pipelining to further improve the performance.
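The relationship between the two decimation structures can be sketched as follows. This Python model only demonstrates that filtering followed by downsampling and the polyphase decomposition produce the same retained samples; it is not the architecture generated by the framework, and the function names are hypothetical.

```python
def decimate_naive(h, x, M):
    """Filter at the full input rate, then keep every M-th output sample."""
    full = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]
    return full[::M]                         # only 1/M of the computed outputs survive


def decimate_polyphase(h, x, M):
    """Split h into M sub-filters and compute only the retained outputs."""
    def x_at(i):
        return x[i] if 0 <= i < len(x) else 0

    n_out = (len(x) + M - 1) // M
    y = []
    for m in range(n_out):
        acc = 0
        for p in range(M):
            sub_filter = h[p::M]             # phase p: h(p), h(p + M), h(p + 2M), ...
            acc += sum(c * x_at((m - j) * M - p) for j, c in enumerate(sub_filter))
        y.append(acc)
    return y
```

Both functions use the same coefficients, which reflects the comparable complexity noted above. The naive version performs all multiplications for every input sample and discards most results, whereas the polyphase version performs them only for the outputs that are kept.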

Fig. 7: SFG for decimation filters used in DWT systems (decimation by M): (a) decimation filter, (b) efficient decimation filter.

C. Framework Evaluation

We highlight some of the features of our framework using the FIR filter design example. We set up an experiment, with the following design entries, to analyze the effects of pipelining and VDD scaling on area, throughput and power dissipation.
1) DesignWare mathematical IP blocks implemented using the 0.18 µm standard cell library from OSU [31]
2) 31st-order FIR filter with 16 bits for the input and coefficients
3) 1000 randomly generated samples for the input sequence to measure power dissipation
4) Area, delay and power measured for 8 values of VDD: 1.0, 1.2, 1.5, 1.8, 2.0, 2.1, 2.5 and 2.7 V

Our framework generated 16 different filter structures (listed in Table I), where each filter structure was characterized for 8 different values of supply voltage. This resulted in a total of 128 design permutations for delay and power. Figure 8 shows the power-delay curve for the pipelined variations of the basic filter structures presented in Figure 5. We indicate the average power-delay curve for the entire design space with a dotted line. The results illustrate that the critical path delay of the filter structures decreased with increasing VDD, in keeping with theoretical assumptions. However, this resulted in a quadratic increase in power dissipation. This curve matches the power-delay trend of Figure 2. We used a semi-log plot to improve the visibility of the power-delay curves. Designs to the left of the dotted line were pipelined filter structures operating at lower supply voltages. These designs maintained a constant throughput at reduced power dissipation due to a balance between the pipeline type and VDD scaling. These initial results help validate the designs and measurements generated by our framework.

Fig. 8: Power-Delay curves for 32-tap FIR filters.
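The interplay between pipelining and VDD scaling behind the power-delay trend can be illustrated with a first-order model. The sketch below is in Python and rests on stated assumptions: it uses the common alpha-power-law delay model and dynamic power proportional to VDD squared at fixed throughput, which may differ in detail from Equations 1 and 2, and the threshold voltage `VT` and exponent `ALPHA` are hypothetical values, not parameters reported for the 0.18 µm library. Only the list of supply voltages is taken from the experiment setup.

```python
VT = 0.5       # threshold voltage in volts (assumed, for illustration only)
ALPHA = 1.5    # velocity-saturation exponent (assumed, for illustration only)

VDD_VALUES = [1.0, 1.2, 1.5, 1.8, 2.0, 2.1, 2.5, 2.7]   # volts, from the experiment setup


def _delay(vdd):
    """Unnormalised gate delay under the alpha-power law (a modelling assumption)."""
    return vdd / (vdd - VT) ** ALPHA


def relative_delay(vdd, vdd_ref=1.8):
    """Delay at vdd divided by the delay at vdd_ref."""
    return _delay(vdd) / _delay(vdd_ref)


def relative_power(vdd, vdd_ref=1.8):
    """Dynamic power at vdd divided by power at vdd_ref, for the same throughput."""
    return (vdd / vdd_ref) ** 2


def lowest_vdd_meeting_delay(path_fraction, vdd_ref=1.8):
    """Lowest listed VDD at which a pipelined path is no slower than the reference.

    path_fraction is the pipelined critical path divided by the original one,
    for example 0.5 when pipelining halves the critical path.
    """
    for vdd in sorted(VDD_VALUES):
        if path_fraction * relative_delay(vdd, vdd_ref) <= 1.0:
            return vdd
    return None
```

Under these assumed constants, halving the critical path lets the supply drop from 1.8 V to 1.2 V at the same throughput, and `relative_power(1.2)` shows the corresponding quadratic reduction in dynamic power. The actual voltages and savings depend on the cell library and on the synthesized design.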

Our framework expedited the synthesis process by using pre-synthesized mathematical blocks, which reduced the need for Synopsys Design Compiler to generate an optimized netlist for the entire design. Synthesizing the entire design without the pre-synthesized blocks required Synopsys' optimization tools to iteratively analyze the optimal layout for the final hardware design. Figure 9 compares the performance of the hardware designs built from pre-synthesized blocks with the performance of completely synthesized designs. These performance values were collected for a subset of filters, using a 32-tap, 16-bit FIR filter, synthesized using the 0.18 µm standard cell library, and characterized at a 1.8 V supply voltage. The results show that the estimated area, delay and power for the designs measured using the pre-synthesized blocks are slightly greater than the fully synthesized designs. The discrepancies are within 12% for the area and delay, and within 4% for the power. Similar trends were observed for other filters in the design space. Related results for a smaller set of filter structures were observed for area and delay measurements in our previous work [26]. The estimated performance metrics collected using the pre-synthesized blocks give the designer accurate design trade-offs in a fraction of the time required for completely synthesizing the designs.

TABLE I: Pipelined FIR Filter Architectures

| Architecture | Signal flow graph | Pipe | Computational cell (Figure 6) |
|---|---|---|---|
| Direct Form (DF) | Figure 5(a) | 0, 1, 2, 3 | (i), (ii), (iii), (iv) |
| Transpose Form (TF) | Figure 5(b) | 0, 1, 2, 3 | (v), (vi), (vii), (viii) |
| Direct Form II (DFII) | Figure 5(c) | 0, 2 | (ix), (x) |
| Transpose Form II (TFII) | Figure 5(d) | 0, 2 | (xi), (xii) |
| DF Tree Adder (TreeDF) | Figure 5(e) | 0, 1, 2, 3 | (xiii), (xiv), (xiv) & pipelined tree adder |

Fig. 9: Estimated versus synthesized metrics for 32-tap, 16-bit FIR filters: (a) area, (b) delay, (c) power. Values plotted per architecture:

| Metric | Source | Arch. 1 | Arch. 5 | Arch. 9 | Arch. 11 | Arch. 13 |
|---|---|---|---|---|---|---|
| Area (mm²) | Estimated | 1.824 | 1.926 | 1.915 | 1.81 | 1.813 |
| Area (mm²) | Synthesized | 1.635 | 1.745 | 1.738 | 1.61 | 1.62 |
| Delay (ns) | Estimated | 26.57 | 5.65 | 5.65 | 25.8 | 10.7 |
| Delay (ns) | Synthesized | 23.63 | 5.57 | 5.56 | 22.99 | 9.54 |
| Power (W) | Estimated | 0.053 | 0.82 | 0.813 | 0.024 | 0.066 |
| Power (W) | Synthesized | 0.053 | 0.834 | 0.834 | 0.025 | 0.068 |

VI. DESIGN EXAMPLES

We demonstrate, through the use of user-defined cost functions, how the designer can efficiently search the design space for hardware structures that perform well. The design optimizations presented in the previous sections provide a more comprehensive set of hardware designs with variations in performance. We considered four design examples to illustrate the effectiveness of our framework and the usefulness of the cost functions. The first two design examples focused on FIR filters used for adaptive equalizers. For these two cases, we looked at the design space for high-throughput FIR filters and then considered area-efficient/high-throughput filter designs, respectively. For the third design example, we considered a matched FIR filter design used in signal communication applications, where we analyzed the design space for low-power/high-throughput filter structures. The fourth example focused on developing efficient decimation filters used for quadrature mirror filters, and for this case, we examined the performance of low-power/area-efficient FIR filter architectures.

It must be noted that the results reported here do not consider place and route, which would result in different area, timing and power measurements. However, we assume that the resulting values would scale in the same relative manner for the different implementations, given that these designs are logic dominated and not interconnect dominated. Moreover, for all design comparisons, we re-implemented the architecture being compared against. We used our framework to generate various pipelined filter structures and analyzed their performance for different values of VDD. Cost functions provided by our framework enabled us to reduce the design space to a smaller set of filter structures and design permutations that met performance specifications.

For simplicity, we present a sample of the reduced design space for each design example, along with the performance results. The pre-synthesized computational blocks within our framework reduced synthesis times for the entire design space by approximately 14-fold. For complex designs, such as the 32-tap FIR filter, this reduced synthesis times from days to hours. Table II compares synthesis times with and without the pre-synthesized cores.

A. High-Throughput Filters

The first example we considered was a high-throughput FIR filter for magnetic recording read channel applications [32]. Current read channels use partial response maximum likelihood (PRML) equalizers, which require an FIR filter for fine signal equalization. Efficient timing recovery in a read channel is imperative for fast phase and frequency acquisition. The PRML channel requires an FIR filter to fully equalize the data samples, which are used to extract timing information. The FIR filter output samples are further processed using a maximum likelihood sequence detector in order to increase timing robustness. Therefore, it is crucial to use filters that exhibit minimal delay in order to provide sufficient bandwidth. An 8-tap, 6-bit non-pipelined transpose form FIR filter structure was presented in [33], which was sufficient for achieving tolerable mean square errors in the output. We manually designed the filter structure of [33] in Verilog RTL using Booth-recoded Wallace-tree multipliers and carry-lookahead adders in order to measure the performance. The design specifications are summarized in Table III.

TABLE II: Synthesis times for different order FIR filters

| Design example | Filter length | Coefficient/input word size | Output word size | With pre-synthesized cores: per design | With pre-synthesized cores: design space | Without pre-synthesized cores: per design | Without pre-synthesized cores: design space |
|---|---|---|---|---|---|---|---|
| High-throughput | 8 | 6 bits | 12 bits | 13.6 secs | 29 mins | 3 mins | 6 hrs 30 mins |
| Low-area & high-throughput | 16 | 12 bits | 24 bits | 1 min 50 secs | 240 mins (4 hrs) | 26 mins | 55 hrs |
| Low-power & high-throughput | 32 | 16 bits | 32 bits | 3 min 45 secs | 480 mins (8 hrs) | 50 mins | 107 hrs |
| Low-power & low-area | 9 | 8 bits | 16 bits | 14.2 secs | 31 mins | 3 mins 25 secs | 7 hrs 20 mins |
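The approximately 14-fold reduction can be checked directly against the entries of Table II. The following Python sketch converts the tabulated times to seconds and computes the ratio for each filter length; the dictionary and function names are hypothetical.

```python
# (time with pre-synthesized cores, time without), in seconds, keyed by filter length.
PER_DESIGN = {
    8: (13.6, 3 * 60),
    16: (1 * 60 + 50, 26 * 60),
    32: (3 * 60 + 45, 50 * 60),
    9: (14.2, 3 * 60 + 25),
}

DESIGN_SPACE = {
    8: (29 * 60, 6 * 3600 + 30 * 60),
    16: (240 * 60, 55 * 3600),
    32: (480 * 60, 107 * 3600),
    9: (31 * 60, 7 * 3600 + 20 * 60),
}


def speedup(times):
    """Ratio of synthesis time without cores to synthesis time with cores."""
    with_cores, without_cores = times
    return without_cores / with_cores


per_design_speedups = {taps: speedup(t) for taps, t in PER_DESIGN.items()}
design_space_speedups = {taps: speedup(t) for taps, t in DESIGN_SPACE.items()}
```

Every ratio computed this way falls between 13 and 14.5, which agrees with the rounded 14-fold figure quoted in the text.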

TABLE III: 8-tap non-pipelined transpose form FIR filter specifications

| Parameter | Value |
|---|---|
| Technology | 0.18 µm CMOS |
| Supply voltage | 1.8 V |
| Coefficient and input word size | 6 bits |
| Throughput | 550 Msamp/sec |
| Area | 0.12 mm² |
| Power | 57 mW |
| Power density | 475 mW/mm² |

We used our framework to search for 8-tap FIR filter structures that met the throughput specification of 550 Msamp/sec. Our framework generated and synthesized the filter structures within 29 minutes (an average of 14 seconds per design option) on a 3 GHz Intel Pentium 4 Linux machine. Of the 128 filter designs and permutations, only 30 operated at 550 Msamp/sec or greater. The next task was selecting architectures from the reduced design space that performed well in terms of area and power dissipation. For that, we used the power density cost function, which measures power dissipation per unit area [16]. Table IV summarizes the results for the top five design options in the reduced design space. A designer can choose any of the filter designs returned by our framework that meet the design specifications and still outperform the non-pipelined transpose form filter in terms of throughput and power. Comparing the first filter design option in Table IV with the non-pipelined transpose form in Table III, we found that we can achieve a 12% increase in throughput while reducing power dissipation by approximately 2.22-fold. However, this came at the cost of a 20% increase in area.

TABLE IV: Performance results for 8-tap high-throughput FIR filters

| Architecture | Pipe | VDD (V) | Area (µm²) | Delay (ns) | Throughput (Msamp/sec) | Latency (cycles) | Power (mW) | Power density (mW/mm²) |
|---|---|---|---|---|---|---|---|---|
| TreeDF | 2 | 1.8 | 144162 | 1.62 | 617 | 3 | 25.66 | 178 |
| TreeDF | 3 | 1.8 | 153026 | 1.62 | 617 | 4 | 27.34 | 179 |
| TF | 3 | 1.8 | 176920 | 1.62 | 617 | 10 | 34.70 | 196 |
| TreeDF | 2 | 2.0 | 147088 | 1.47 | 680 | 3 | 34.40 | 234 |
| DF | 3 | 2.0 | 172200 | 1.47 | 680 | 10 | 42.10 | 244 |

B. Area-Efficient and High-Throughput Filters

The next design example we considered was a reduced-area, high-throughput FIR filter used in equalizers for communication systems. The QAM modulation standard employs an adaptive equalizer for minimizing channel distortions such as intersymbol interference. Yu et al. presented a popular area-efficient scheme for implementing a 64-QAM system by time-multiplexing a 16-tap FIR filter by a factor of 4 [34]. The authors implemented the time-multiplexed FIR filter using a non-pipelined transpose form structure. We manually designed and synthesized that filter structure in order to measure its performance, which is summarized in Table V. Each filter architecture generated by our framework was time-multiplexed by a factor of 4.

TABLE V: 16-tap area and throughput-efficient, time-multiplexed FIR filter specifications

| Parameter | Value |
|---|---|
| Technology | 0.18 µm CMOS |
| Supply voltage | 1.8 V |
| Coefficient and input word size | 12 bits |
| Power | 314 mW |
| Throughput | 190 Msamp/sec |
| Area | 0.8 mm² |
| Latency | 2 cycles |
| Power efficiency | 0.61 (Msamp/sec)/mW |
| Area efficiency | 238 (Msamp/sec)/mm² |

We used the area and throughput of the transpose form FIR filter as the design constraints to search the design space. Only 3 of the 128 design options met both area and throughput constraints, one of them being the non-pipelined transpose form structure operating at 1.8 V. From the results in Table VI, we notice a 3.5-fold improvement in power efficiency and a 5% increase in area efficiency when using the pipelined direct form II structure operating at 1.8 V compared to the transpose form structure. The entire process of generating the filter structures and converging on a likely filter implementation required 4 hours using our framework for this design example. This provided an improvement in both design productivity and design quality compared to manually constructing the hardware architectures and analyzing their performance.

TABLE VI: Performance results for 16-tap area and throughput-efficient, multiplexed FIR filters

| Architecture | Pipe | VDD (V) | Area (µm²) | Delay (ns) | Throughput (Msamp/sec) | Latency (cycles) | Power efficiency ((Msamp/sec)/mW) | Area efficiency ((Msamp/sec)/mm²) |
|---|---|---|---|---|---|---|---|---|
| DFII | 1 | 1.8 | 762896 | 5.26 | 190 | 2 | 2.11 | 249 |
| DFII | 1 | 2.0 | 760529 | 4.95 | 202 | 2 | 1.74 | 265 |
| TF | 0 | 1.8 | 804872 | 5.26 | 190 | 2 | 0.61 | 238 |

C. Low-Power and High-Throughput Filters

The third design example we considered was a binary pseudo-random code matched filter for IS-2000 CDMA systems constructed using a 32-tap, low-power, high-throughput FIR filter. CDMA communication systems are based on spreading the passband frequencies generated by multiple users, allowing for multi-channel utilization. Each signal is uniquely coded using a pseudo-random code generator to avoid interference between users occupying the same channel. Receivers use matched filters that despread or decode each user's frequency bandwidth in order to remove undesired interference from other transmissions. The matched filter in the receiver maximizes the signal-to-interference ratio using different signal processing optimization techniques. The hardware overhead introduced in maximizing signal recovery leaves little room for over-designing the basic, yet vital, FIR filter. Chen et al. proposed a low-power reconfigurable 32-tap FIR filter used in the CDMA receiver [35]. We used our framework to search the design space for FIR filters that met both the power and throughput design specifications summarized in Table VII. We normalized the performance results using a method suggested in [35] to match the technology size, word sizes and VDD values we used in analyzing the performance of the filter designs generated by our framework.

TABLE VII: 32-tap low-power, high-throughput FIR filter specifications

| Parameter | Value |
|---|---|
| Technology | 0.18 µm CMOS |
| Supply voltage | 1.8 V |
| Coefficient and input word size | 16 bits |
| Power | 129 mW |
| Throughput | 134 Msamp/sec |
| Area | 2.33 mm² |
| Latency | 12 cycles |
| Power efficiency | 1.04 (Msamp/sec)/mW |
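The cost functions used in the design examples (power density, power efficiency and area efficiency) are simple ratios of the measured metrics. The Python sketch below writes them out and reproduces several tabulated values; the function names are hypothetical, and throughput is taken as the reciprocal of the critical path delay, which is consistent with the delay and throughput columns of the tables.

```python
def throughput_msps(delay_ns):
    """Throughput in Msamp/sec for one sample per clock period of delay_ns."""
    return 1e3 / delay_ns


def power_density(power_mw, area_mm2):
    """Power dissipated per unit area, in mW/mm^2 (lower is better)."""
    return power_mw / area_mm2


def power_efficiency(throughput, power_mw):
    """Throughput delivered per unit power, in (Msamp/sec)/mW (higher is better)."""
    return throughput / power_mw


def area_efficiency(throughput, area_mm2):
    """Throughput delivered per unit area, in (Msamp/sec)/mm^2 (higher is better)."""
    return throughput / area_mm2


# Reference designs from the specification tables.
table_iii_density = power_density(57, 0.12)           # Table III lists 475 mW/mm^2
table_v_power_eff = power_efficiency(190, 314)        # Table V lists 0.61
table_vii_power_eff = power_efficiency(134, 129)      # Table VII lists 1.04

# First row of Table IV: area converted from um^2 to mm^2.
table_iv_density = power_density(25.66, 144162 / 1e6)  # Table IV lists 178 mW/mm^2
table_iv_throughput = throughput_msps(1.62)            # Table IV lists 617 Msamp/sec
```

Ranking a reduced design space by one of these ratios, after discarding designs that miss a hard constraint, is the selection procedure applied in each design example.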

Figure 10 illustrates the design permutations that lie within the accepted power/throughput region. We observe a similar trend in the power-delay curve to that of Figure 8. The designs that lie in the lower left region yield good power efficiency results. Of the 128 FIR filter design options, only 14 filters dissipated 129 mW or less while operating at 134 Msamp/sec or more. However, most of these filter structures required some form of pipelining and therefore exhibited large latencies. We therefore further reduced the design space to pipelined filter structures that met both power and throughput constraints while requiring low latencies. The results for the top five design options in terms of power efficiency are summarized in Table VIII. All 14 filter architectures in the reduced design space met the design specifications for area, throughput and power. The designer can use power efficiency to find a design that is optimized for both power and throughput, since these were the two crucial design constraints for this example. In this case, the direct form tree-adder FIR filter with pipelined multipliers operating at 1.2 V was an appropriate design option, which exhibited only 4 clock cycles of latency versus 12 and achieved a 2.7-fold increase in power efficiency compared to the non-pipelined transpose form design. The entire design space for the 32-tap FIR filter was generated and synthesized in approximately 8 hours using our framework.

Fig. 10: Design permutations that lie within the acceptable power/throughput region.

TABLE VIII: Performance results for 32-tap low-power, high-throughput FIR filters

| Architecture | Pipe | VDD (V) | Area (µm²) | Delay (ns) | Throughput (Msamp/sec) | Latency (cycles) | Power (mW) | Power efficiency ((Msamp/sec)/mW) |
|---|---|---|---|---|---|---|---|---|
| TreeDF | 3 | 1.2 | 2338520 | 4.58 | 218 | 4 | 76.82 | 2.84 |
| TreeDF | 3 | 1.0 | 2191672 | 7.23 | 138 | 4 | 62.83 | 2.20 |
| TreeDF | 2 | 1.5 | 2158008 | 5.37 | 186 | 3 | 109.10 | 1.70 |
| TF | 2 | 1.0 | 2193152 | 6.19 | 162 | 3 | 110.20 | 1.47 |
| TreeDF | 1 | 1.8 | 2185272 | 6.90 | 145 | 2 | 123.60 | 1.17 |

D. Low-Power and Area-Efficient Filters

The final design example we analyzed was the design space for a decimation filter used in quadrature mirror filter structures for subband decomposition. Signal and image compression engines, such as the JPEG2000 compression standard, use the discrete wavelet transform (DWT) to decompose a signal or image into subbands. Subbands that contain more detail are encoded with more bits than subbands with less detail [36]. The 2D DWT block used in the JPEG2000 design is constructed by applying the 1D DWT across the rows and columns of the image in a product-separable form. The 1D DWT of a signal x is obtained by filtering the signal and then decimating the output by a factor of 2. This type of filter operation is typically implemented using a two-phase polyphase filter, which improves the power dissipation of the DWT system, as was discussed in Section V. The biorthogonal wavelet family provides improved coding gain and an efficient treatment of the signals at the boundaries. The linear-phase property of the biorthogonal DWT can be combined with the polyphase structure of the FIR filter to offer a filter structure with smaller area and lower power than ad-hoc designs. One efficient hardware architecture in particular was implemented using a symmetric polyphase structure, shown in Figure 11 [37]. We manually designed this filter structure and synthesized the Verilog model to assess the performance of that design. The performance results are summarized in Table IX.

Fig. 11: Symmetric decimation FIR filter using polyphase structure

TABLE IX: 9-tap low-power, area-efficient decimation FIR filter specifications

| Parameter | Value |
|---|---|
| Technology | 0.18 µm CMOS |
| Supply voltage | 1.8 V |
| Coefficient and input word size | 8 bits |
| Power | 53 mW |
| Throughput | 190 Msamp/sec |
| Area | 0.26 mm² |
| Latency | 2 cycles |
| Power efficiency | 3.58 (Msamp/sec)/mW |

We used our framework to generate similar polyphase structures and apply additional optimization techniques to further improve the performance of the basic design of Figure 11. The entire filter design space was synthesized in approximately 30 minutes, and only 16 decimation filter structures met both power and area specifications. We used the power efficiency cost function once more to compare the performance of the pipelined polyphase decimation filters, since the hardware complexity was similar for the architectures in the reduced design space. Table X lists the top five design permutations in order of power efficiency. The first design option of Table X exhibits a 2.8-fold reduction in power with a 36% increase in throughput, for an overall 3.8-fold improvement in power efficiency. The efficient hardware design generated by our framework, shown in Figure 12, combined both VDD scaling and pipelining to reduce power and increase throughput. However, this came at the cost of slightly increased filter latency.

Fig. 12: Symmetric decimation FIR filter generated using our framework
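The selection procedure for this example, which filters on the power and area constraints of Table IX and then ranks by power efficiency, can be sketched in Python as follows. The `Design` class and `reduce_design_space` function are hypothetical names, not the framework's interface; the data are the five rows reported in Table X.

```python
from dataclasses import dataclass


@dataclass
class Design:
    arch: str
    pipe: int
    vdd: float
    area_um2: float
    throughput_msps: float
    power_mw: float

    @property
    def power_efficiency(self):
        return self.throughput_msps / self.power_mw


def reduce_design_space(designs, max_power_mw, max_area_mm2):
    """Keep designs that meet both constraints, best power efficiency first."""
    feasible = [d for d in designs
                if d.power_mw <= max_power_mw and d.area_um2 / 1e6 <= max_area_mm2]
    return sorted(feasible, key=lambda d: d.power_efficiency, reverse=True)


# Reference specification from Table IX.
BASE_POWER_MW, BASE_AREA_MM2, BASE_THROUGHPUT = 53, 0.26, 190

TABLE_X = [
    Design("DF", 1, 1.5, 259770, 258, 19),
    Design("TF", 0, 1.2, 228970, 191, 15),
    Design("DF", 1, 2.0, 255050, 339, 41),
    Design("TF", 0, 1.5, 241818, 258, 32),
    Design("TF", 0, 1.8, 258990, 277, 52),
]

ranked = reduce_design_space(TABLE_X, BASE_POWER_MW, BASE_AREA_MM2)
best = ranked[0]
power_reduction = BASE_POWER_MW / best.power_mw
throughput_gain = best.throughput_msps / BASE_THROUGHPUT
efficiency_gain = best.power_efficiency / (BASE_THROUGHPUT / BASE_POWER_MW)
```

With these inputs the ranking matches the order of Table X, and the three ratios for the top entry correspond to the 2.8-fold, 36% and 3.8-fold figures quoted in the text.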

VII. CONCLUSION

This paper presents a framework for generating DSP architectures and analyzing their hardware performance. Our framework can be used to efficiently guide the designer in selecting recognizable DSP hardware architectures that perform well in terms of area, throughput, power dissipation and latency. We used our framework to investigate the effects of pipelining and voltage scaling on the quality of hardware designs for FIR filters. We used pre-synthesized arithmetic blocks to reduce the synthesis process from days to hours when considering hundreds of hardware permutations in the design space. Our framework gives the designer the freedom to select the cost functions to use when assessing the quality of hardware designs. The goal of our framework was to efficiently reduce the design space to a smaller set of hardware architectures that met design specifications, which facilitated the architectural selection process. The filter architectures generated by our framework conformed to structures that permitted hardware designers to efficiently apply additional optimizations. We illustrated some of the merits of our framework using different applications for FIR filters such as adaptive equalizers, matched filters and decimation filters. The results showed that filter structures generated by our framework yielded, on average, a 3-fold improvement in power efficiency when compared to ad-hoc designs.
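One plausible reading of the average 3-fold figure is the mean of the power efficiency gains across the four design examples. This is an assumption: the text does not state how the average was computed, and the gain for the first example is derived here from the throughput and power values in Tables III and IV rather than reported directly. The Python sketch below carries out that calculation.

```python
# Power efficiency gain of the top-ranked generated design over the reference design.
gains = {
    # Example A: throughput / power from Tables IV (first row) and III (derived, not reported).
    "A: 8-tap high-throughput": (617 / 25.66) / (550 / 57),
    # Examples B to D: reported power efficiencies from Tables VI/V, VIII/VII and X/IX.
    "B: 16-tap area-efficient": 2.11 / 0.61,
    "C: 32-tap low-power": 2.84 / 1.04,
    "D: 9-tap decimation": 13.59 / 3.58,
}

mean_gain = sum(gains.values()) / len(gains)
```

Under this reading the mean gain is slightly above 3, which is consistent with the stated average.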

TABLE X: Performance results for 9-tap low-power, area-efficient decimation FIR filters

| Architecture | Pipe | VDD (V) | Area (µm²) | Delay (ns) | Throughput (Msamp/sec) | Latency (cycles) | Power (mW) | Power efficiency ((Msamp/sec)/mW) |
|---|---|---|---|---|---|---|---|---|
| DF | 1 | 1.5 | 259770 | 3.88 | 258 | 3 | 19 | 13.59 |
| TF | 0 | 1.2 | 228970 | 5.24 | 191 | 2 | 15 | 12.73 |
| DF | 1 | 2.0 | 255050 | 2.95 | 339 | 3 | 41 | 8.27 |
| TF | 0 | 1.5 | 241818 | 3.88 | 258 | 2 | 32 | 8.06 |
| TF | 0 | 1.8 | 258990 | 3.61 | 277 | 2 | 52 | 5.33 |

ACKNOWLEDGMENTS

The work of Ramsey Hourani was supported by the Office of Naval Research and Historically Black Engineering Colleges (ONR/HBEC) Future Engineering Faculty Fellowship Program.

REFERENCES

[1] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, and R. W. Brodersen, “HYPER-LP: a system for power minimization using architectural transformations,” in Computer-Aided Design, 1992. ICCAD-92. Digest of Technical Papers., 1992 IEEE/ACM International Conference on, Santa Clara, CA, Nov. 1992, pp. 300–303.
[2] P. Banerjee, M. Haldar, A. Nayak, V. Kim, V. Saxena, S. Parkes, D. Bagchi, S. Pal, N. Tripathi, D. Zaretsky, R. Anderson, and J. Uribe, “Overview of a compiler for synthesizing MATLAB programs onto FPGAs,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 3, pp. 312–324, Mar. 2004.
[3] J. Hopf, “A parameterizable Handel-C divider generator for FPGAs with embedded hardware multipliers,” in Field-Programmable Technology, 2004. Proceedings. 2004 IEEE International Conference on, 2004, pp. 355–358.
[4] M. Potkonjak and J. Rabaey, “Exploring the algorithmic design space using high level synthesis,” in VLSI Signal Processing, VI, 1993., [Workshop on], Veldhoven, Oct. 1993, pp. 123–131.
[5] “Synopsys design compiler, synopsys corporation,” http://www.synopsys.com/products/logic/design compiler.html.
[6] “Synopsys module compiler 1999, synopsys corporation,” http://www.synopsys.com.
[7] P. Banerjee, “An overview of a compiler for mapping MATLAB programs onto FPGAs,” in Design Automation Conference, 2003. Proceedings of the ASP-DAC 2003. Asia and South Pacific, Jan. 2003, pp. 477–482.
[8] “Acceldsp synthesis tool, xilinx design tools,” http://www.xilinx.com/ise/dsp design prod/acceldsp/index.htm.
[9] M. R. Stan, “Low-power CMOS with subvolt supply voltages,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, no. 2, pp. 394–400, Apr. 2001.
[10] T. Gemmeke, M. Gansen, H. J. Stockmanns, and T. G. Noll, “Design optimization of low-power high-performance DSP building blocks,” IEEE Journal of Solid-State Circuits, vol. 39, no. 7, pp. 1131–1139, July 2004.
[11] S. Shah, A. Al-Khalili, and D. Al-Khalili, “Comparison of 32-bit multipliers for various performance measures,” in Microelectronics, 2000. ICM 2000. Proceedings of the 12th International Conference on, Tehran, Oct./Nov. 2000, pp. 75–80.
[12] S. Summerfield, Z. Wang, and K. Parhi, “Area-power-time efficient pipeline-interleaved architectures for wave digital filters,” in Circuits and Systems, 1999. ISCAS ’99. Proceedings of the 1999 IEEE International Symposium on, vol. 3, Orlando, FL, May/June 1999, pp. 343–346.
[13] K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York, NY: John Wiley and Sons, 1999.
[14] S. Y. Kung, VLSI Array Processing. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1987.
[15] Q. Yue, L. Zhancai, and W. Qin, “Low-power FIR filter based on standard cell,” in ASIC, 2005. ASICON 2005. 6th International Conference On, vol. 1, Oct. 2005, pp. 208–211.
[16] S. Velusamy, W. Huang, J. Lach, M. Stan, and K. Skadron, “Monitoring temperature in FPGA based SoCs,” in Computer Design: VLSI in Computers and Processors, 2005. ICCD 2005. Proceedings. 2005 IEEE International Conference on, Oct. 2005, pp. 634–637.

[17] D. Markovic, B. Nikolic, and R. W. Brodersen, “Power and area efficient VLSI architectures for communication signal processing,” in Communications, 2006 IEEE International Conference on, vol. 7, Istanbul, June 2006, pp. 3223–3228.
[18] M. Vujkovic and C. Sechen, “Optimized power-delay curve generation for standard cell ICs,” 2002. ICCAD 2002. IEEE/ACM International Conference on Computer Aided Design, pp. 387–394, Nov. 2002.
[19] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-power CMOS digital design,” IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473–484, Apr. 1992.
[20] “Systemc version 2.0 user’s guide,” http://www.systemc.org, Jan. 2002.
[21] “Describing synthesizable rtl in systemc,” http://www.synopsys.com, Jan. 2002.
[22] A. Baganne, I. Bennour, M. Elmarzougui, R. Gaiech, and E. Martin, “A multi-level design flow for incorporating IP cores: case study of 1d wavelet IP integration,” Design, Automation and Test in Europe Conference and Exhibition, 2003, pp. 250–255, 2003.
[23] R. Hourani, W. Alexander, and T. Raithatha, “A hardware performance analysis framework for architectural exploration of DSP systems,” in Global Signal Processing Expo GSPX, San Jose, CA, Oct. 2005.
[24] J. V. McCanny, J. G. McWhirter, and S. Y. Kung, “The use of data dependence graphs in the design of bit-level systolic arrays,” Acoustics, Speech, and Signal Processing [see also IEEE Transactions on Signal Processing], IEEE Transactions on, vol. 38, no. 5, pp. 787–793, May 1990.
[25] “Opensocdesign,” http://www.opensocdesign.com/.
[26] R. Hourani, R. Jenkal, W. R. Davis, and W. Alexander, “Automated Architectural Exploration for Signal Processing Algorithms,” in Signal Processing Systems Design and Implementation, 2006. SIPS ’06. IEEE Workshop on, Banff, AB, Canada, Oct. 2006, pp. 274–279.
[27] W. Davis, “Getting high-performance silicon from system-level design,” in VLSI, 2003. Proceedings. IEEE Computer Society Annual Symposium on, Feb. 2003, pp. 238–243.
[28] “Methodologies for user-friendly system-on-a-chip experimentation,” http://www.ece.ncsu.edu/muse/sshaft/.
[29] R. Hourani, Y. Kim, S. Ocloo, and W. Alexander, “Automated hardware IP generation for digital signal processing applications,” in IEEE 40th ASILOMAR Conf. on Signals, Systems and Computers, Pacific Grove, CA, Nov. 2006.
[30] P. P. Vaidyanathan, Multirate systems and filter banks. Englewood Cliffs, NJ: Prentice Hall, 1993.
[31] “The oklahoma state university system on chip (soc) design flows,” http://vcag.ecen.okstate.edu/.
[32] H.-J. Ki, W.-H. Paik, J.-S. Yoo, and S.-W. Kim, “A high speed, low power 8-tap digital FIR filter for PRML disk-drive read channels,” in Solid-State Circuits Conference, 1997. ESSCIRC ’97. Proceedings of the 23rd European, Sept. 1997, pp. 312–315.
[33] R. B. Staszewski, K. Muhammad, and P. Balsara, “A 550-MSample/s 8-tap FIR digital filter for magnetic recording read channels,” IEEE Journal of Solid-State Circuits, vol. 35, no. 8, pp. 1205–1210, Aug. 2000.
[34] H. Yu, B. W. Kim, Y. G. Cho, J. D. Cho, J. W. Kim, J. K. Lee, H. C. Park, and K. W. Lee, “Area-efficient and reusable VLSI architecture of decision feedback equalizer for QAM modem,” in Design Automation Conference, 2001. Proceedings of the ASP-DAC 2001. Asia and South Pacific, Yokohama, Jan./Feb. 2001, pp. 404–407.
[35] K. H. Chen and T. D. Chiueh, “A low-power digit-based reconfigurable FIR filter,” vol. 53, no. 8, pp. 617–621, Aug. 2006.
[36] A. Skodras, C. Christopoulos, and T. Ebrahimi, “The JPEG 2000 still image compression standard,” vol. 18, no. 5, pp. 36–58, Sept. 2001.
[37] I. Uzun, A. Amira, and A. Bouridane, “An efficient architecture for 1-d discrete biorthogonal wavelet transform,” in Circuits and Systems, 2004. ISCAS ’04. Proceedings of the 2004 International Symposium on, vol. 2, May 2004, pp. 697–700.

Ramsey Hourani received his B.S. degree in Electrical and Computer Engineering from Iowa State University, Ames, in 1998 and M.S. degree in Electrical and Computer Engineering from North Carolina State University, Raleigh, in 2001. He is currently working toward the Ph.D. degree in electrical engineering at the same university. He worked two years with the Cellular Subscriber Sector for Motorola as a hardware designer for CDMA wireless systems. His research interests include developing efficient hardware architectures that map digital signal and image processing applications onto ASICs and FPGAs.

Ravi Jenkal received his B.E. degree in Electronics and Communications from S. J. College of Engineering and M.S. degree in Electrical and Computer Engineering from North Carolina State University, Raleigh, in 2004. He is currently working toward the Ph.D. degree in Computer Engineering at the same university. His research interests include the creation of novel architectural solutions for Multi-Antenna Systems-on-a-Chip (SoC), 3DIC, low-power and high-speed ASIC/SoC design methods, and high-level performance estimation.

W. Rhett Davis received B.S. degrees in electrical and computer engineering from North Carolina State University, Raleigh, in 1994 and M.S. and Ph.D. degrees in electrical engineering from the University of California at Berkeley in 1997 and 2002. He has worked briefly with Hewlett-Packard (now Agilent) in Boeblingen, Germany, and Chameleon Systems in San Jose, California. Since 2002, he has been an Assistant Professor of Electrical and Computer Engineering at North Carolina State University. His research interests are centered on developing methodologies, CAD tools, and circuits for systems-on-chip in emerging technologies. His interests include 3D IC design and low-power and high-performance circuit design for digital signal-processing and embedded systems.

Winser E. Alexander received the B.S. degree in Electrical Engineering from North Carolina A&T State University in 1964. He received the M.S. degree in Engineering in 1966 and the Ph.D. degree in Electrical Engineering in 1974 from the University of New Mexico. He is currently a Professor in the Department of Electrical and Computer Engineering at North Carolina State University. He previously served as an officer in the U.S. Air Force, was a Member of Technical Staff at Sandia Laboratories, Albuquerque, NM, and was Chair of the Department of Electrical Engineering at North Carolina A&T State University, Greensboro, NC. His research interests include genomic signal processing, parallel algorithms, and parallel computer architectures for applications such as communications, image processing, and multimedia. Dr. Alexander is a senior member of IEEE and a registered professional engineer in North Carolina.