ARCHITECTURAL SYNTHESIS OF DIGITAL SIGNAL PROCESSING APPLICATIONS DEDICATED TO SUBMICRON TECHNOLOGIES

Emmanuel CASSEAU, Christophe JEGO, Eric MARTIN
LESTER Laboratory, Université de Bretagne Sud
rue Saint Maudé, 56100 LORIENT FRANCE
[email protected]

ABSTRACT

Architectural synthesis is an efficient design process that reduces the gap between algorithms and architectures by raising the abstraction level. However, this process currently does not take the VLSI circuit interconnection cost into account, although this cost becomes predominant with submicron technologies. In this paper, an interconnection cost analysis is performed at the behavioural level in order to provide rapid prototyping results and to direct the synthesis process with additional path constraints. Results showing the benefit of this approach are presented.

1. INTRODUCTION

VLSI feature sizes decreased from 2 µm in 1985 to 0.18 µm in 1999. According to the National Technology Roadmap for Semiconductors [1], they will further decrease at the rate of 0.7x per generation (Moore's law) to reach 0.05 µm by 2011. This evolution has a significant impact on the design of VLSI circuits: the performance of a design will be increasingly determined by the interconnection performance. In fact, the wiring delay becomes more important than the operator propagation time [2]. Interconnection design thus plays a crucial role in the design of chips with sub-micron technologies.

In other respects, architectural synthesis enables a significant productivity increase by raising the abstraction level of digital circuits. One of its characteristics is an "optimal" reuse (sharing) of the arithmetic components and registers [3]. This reuse is achieved by allocating interconnection components and entails an additional interconnection wire cost. However, synthesis processes currently take into account neither the wiring area, which is known to be difficult to predict, nor, unfortunately, the associated path delay. When complex applications are concerned, this leads to very different timings at the layout level. For example, we have performed the architectural synthesis of the Viterbi algorithm [4]. The results highlighted the problem of the interconnection cost: depending on the real time constraint, a difference of up to 25% occurred between the path delay estimated by the architectural synthesis and that of the placed and routed architecture. In fact, the more complex the architecture, the larger the difference. For these reasons, architectural synthesis tools have to take interconnection cost into account.

This paper is structured in the following way: section 2 presents a method of characterisation of the DFG elements that enables an interconnection cost analysis at the behavioural level. The main features of the algorithms used for the high-level synthesis process we propose, including behavioural level interconnection information, are presented in section 3. Results are presented in section 4.

2. INTERCONNECTION COST ANALYSIS AT THE BEHAVIOURAL LEVEL

2.1 Characterisation of DFG elements

Architectural synthesis tools are based on a generic architectural model. In this paper, we assume that the synthesis of data flow algorithms under a real time constraint is concerned [5], and that a bus-based architectural model with registers dedicated to arithmetic component inputs is used for the processing unit (Figure 1). However, the proposed method can be used with other target structural models (bus-based, multiplexer-based or register-based) [6].

[Figure: elementary cells built from operators (OP), registers (R), multiplexors (Mux), demultiplexors (Dem) and tristates, with I/O and a parallel multi-bus]

Figure 1. Structural model of the processing unit

The structural model of the processing unit can be partitioned into virtual elementary cells (Figure 1) including an arithmetic component, its connected registers, interconnection components (multiplexor, demultiplexor, tristate) and interconnection wires. We can observe that a processing unit associated with this kind of typical structural model is composed of three different types of interconnection wires:
- local to an elementary cell (type 1): these interconnections perform the data transfers inside an elementary cell,
- local to the processing unit (type 2): these interconnections perform the data transfers between elementary cells,
- global to the architecture (type 3): these interconnections perform the parallel multi-bus access.

With this architectural model, the parallel multi-bus performs the communication with the other functional units of the architecture (memory unit and I/O communication unit). The processing unit interconnection costs (wiring area and propagation delay) depend on the type of interconnection wires. On the one hand, these costs may be low for interconnection wires that are local to an elementary cell if this cell is hierarchically placed and routed (see section 3.3). For interconnection wires that are global to the architecture, the propagation delay may be significant. However, since we assume that synchronisation registers are used for this kind of transfer, this cost is not critical. On the other hand, for interconnection wires that are local to the processing unit, the interconnection cost depends on the complexity of the architecture (related to the algorithm complexity and the real time constraint) and on the place and route tool performance, and may therefore be critical.

At the behavioural level, interconnection wire types are not known. However, interconnection wires are associated with data transfers, which are known at the behavioural level. Like many tools dedicated to the synthesis of data flow applications, our high-level synthesis process starts from the behavioural specification of the application, which is then compiled into a data flow graph (DFG) representation. This graph is composed of data and operation nodes, and dependency edges. Since interconnections are associated with data transfers, the purpose is to characterise the types of data (temporary processing data, constants or signals) during the architectural synthesis process and to take advantage of their type to estimate the associated propagation delay.
Three categories of data have thus been defined according to the interconnection features:
- category 1: temporary processing data which are linked to a single arithmetic operation,
- category 2: temporary processing data which are linked to two or more different arithmetic operations,
- category 3: temporary processing data and constants which are stored in the memory unit and I/O signals.

The data of the DFG are first characterised according to this principle. Then the operations are also characterised, according to the previous characterisation of the data they are associated with. A vector thus characterises each operation. This vector specifies the categories of the input data that are necessary for the computation, and the category of the data associated with the result.

2.2 Interconnection cost models

In order to direct the synthesis process, the purpose is to associate an interconnection cost with each computation of the specification, according to the data concerned. At this step of the process, we assume that type 1 interconnection wires will perform category 1 data transfers, and so on. This means that we first assume that temporary processing data which are linked to a single arithmetic operation in the DFG (category 1) will be associated with a single elementary cell, i.e. will require type 1 wires, thanks to arithmetic component sharing, and so on. A lower bound on the interconnection cost is thus provided at this step, and the synthesis process is directed with this path constraint. Interconnection cost mainly depends on the length of the interconnections. According to the interconnection features, relation (1) is observed:

\[ Length_{Type1} < Length_{Type3} < Length_{Type2} \qquad (1) \]

Furthermore, the interconnection lengths can be formulated according to the models described by Mecha [7] and Hallberg [8] and relation (1):

\[
\begin{aligned}
Length_{Type1} &= \alpha_1 \sqrt{2 \cdot Area_{Component}} \\
Length_{Type2} &= \alpha_2 \sqrt{2 \cdot \max[Area_{Component}]} \\
Length_{Type3} &= \alpha_3 \, \frac{Length_{Type1} + Length_{Type2}}{2}
\end{aligned}
\qquad (2)
\]

Area_{Component} and max[Area_{Component}] are respectively the area of the elementary cell associated with the interconnection and the area of the most complex elementary cell of the processing unit. Both areas are taken without the interconnection wires. Although formulas (2) are very simple, they give rather good length estimations with regard to physical synthesis results. However, the precision of these estimations can vary according to the application (complexity), the technology (size and number of metal layers) and the real time constraint. For this reason, the lengths can be weighted by coefficients (α1, α2 and α3). The results given in section 4 have been obtained with αi = 1. Then, for a particular computation, propagation delay models are used for the different types of interconnection it requires:

\[ Delay_{Type(i)} = Length_{Type(i)} \cdot Delay_{Wire}, \quad i \in \{1, 2, 3\} \qquad (3) \]

The interconnection propagation delay is determined from its length and a wiring delay associated with the technology. This model will be used during the selection/allocation task.
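As an illustration of sections 2.1 and 2.2, the following Python sketch characterises a toy DFG and derives a lower-bound wire delay for each operation. The graph, the cell areas and the wiring delay are made-up values. Three choices are assumptions of this sketch rather than statements about the actual tool: "linked to a single arithmetic operation" is read as "touched by a single operator kind", the lengths of equation (2) are taken as the square root of twice the cell area, and the worst wire delay over the data of an operation is kept as its path constraint.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy DFG: operation -> (operator kind, input data, output datum).
DFG = {
    "m1": ("mul", ["x", "c0"], "t1"),
    "m2": ("mul", ["t1", "c1"], "t2"),
    "a1": ("add", ["t2", "x"], "s1"),
    "a2": ("add", ["s1", "c2"], "y"),
}
EXTERNAL = {"x", "c0", "c1", "c2", "y"}       # memory-resident data and I/O signals
CELL_AREA = {"mul": 20000.0, "add": 5000.0}   # µm², made-up values
DELAY_WIRE = 0.5                              # ps per µm, made-up value


def data_categories(dfg, external):
    """Category 3: memory/I-O data; 1: one operator kind; 2: several kinds."""
    kinds = defaultdict(set)
    for kind, inputs, output in dfg.values():
        for datum in inputs + [output]:
            kinds[datum].add(kind)
    return {d: 3 if d in external else (1 if len(k) == 1 else 2)
            for d, k in kinds.items()}


def operation_vectors(dfg, cats):
    """Vector of an operation: (categories of its inputs, category of its result)."""
    return {op: (tuple(cats[d] for d in inputs), cats[output])
            for op, (_, inputs, output) in dfg.items()}


def wire_lengths(kind, cell_area, alpha=(1.0, 1.0, 1.0)):
    """Equation (2) under the square-root reading; keys are wire types 1, 2, 3."""
    l1 = alpha[0] * sqrt(2 * cell_area[kind])
    l2 = alpha[1] * sqrt(2 * max(cell_area.values()))
    l3 = alpha[2] * (l1 + l2) / 2
    return {1: l1, 2: l2, 3: l3}


def lower_bound_delay(op, dfg, vectors, cell_area, delay_wire):
    """Equation (3), keeping the worst wire over the data used by `op`."""
    lengths = wire_lengths(dfg[op][0], cell_area)
    inputs, output = vectors[op]
    return max(lengths[c] for c in inputs + (output,)) * delay_wire


cats = data_categories(DFG, EXTERNAL)
vectors = operation_vectors(DFG, cats)
for op in DFG:
    print(op, vectors[op], lower_bound_delay(op, DFG, vectors, CELL_AREA, DELAY_WIRE))
```

In this toy case, `a2` only touches category 1 and category 3 data, so its bound comes from a type 3 wire, whereas `a1` consumes the category 2 datum `t2` and inherits the longer type 2 wire.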

3. INTERCONNECTION COST CONTROL DURING HIGH-LEVEL SYNTHESIS

Our high-level synthesis process is composed of two major tasks: selection/allocation and scheduling/binding. The process then integrates a clustering task in order to specifically control the interconnection lengths and to locally optimise the number of registers.

3.1 Selection/allocation task

The purpose of the selection algorithm is to find the optimal set of components from a given library, for a behavioural description and a real time constraint. The allocation task then determines the minimum number of components of every selected type. In this synthesis process, the selection/allocation task first determines a set of mono-function components. This is performed according to the operation characterisation described previously, the interconnection cost model and, obviously, the propagation time of the components. This task is composed of different steps that are applied for each possible set of components. The first one associates the propagation times of the selected components with the operations. Then, the mobility of the operations (difference between the ALAP and ASAP dates) is computed. Finally, an allocation is performed in order to minimise the area cost of the architectural solution. We then try to optimise the set of components by exploring solutions with multi-function components: according to the operation characterisation, when a group of costly operations is located (a costly operation is an operation that uses data associated with category 2 rather than data associated with category 1 or 3), the use of multi-function components is evaluated in order to obtain a better set of components in terms of characterisation. At this step of the process, the operation characterisation done before the selection/allocation task is refined according to the selected components.
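The mobility computation mentioned above can be sketched as follows in Python. The graph, the delays and the deadline are hypothetical; each delay is meant to stand for the propagation time of the selected component plus the lower-bound wire delay of the operation, expressed here in control steps. This is a textbook ASAP/ALAP pass, not necessarily the exact procedure of the tool.

```python
# Hypothetical DFG: operation -> list of predecessors, listed in topological order.
PREDS = {
    "m1": [],
    "m2": [],
    "m3": [],
    "a1": ["m1", "m2"],
    "a2": ["a1", "m3"],
}
# Made-up delays in control steps (component time plus wire delay).
DELAY = {"m1": 2, "m2": 2, "m3": 2, "a1": 1, "a2": 1}
DEADLINE = 5  # made-up real time constraint, in control steps


def asap_dates(preds, delay):
    """Earliest start: an operation starts once all its predecessors have finished."""
    asap = {}
    for op in preds:  # relies on the topological insertion order of the dict
        asap[op] = max((asap[p] + delay[p] for p in preds[op]), default=0)
    return asap


def alap_dates(preds, delay, deadline):
    """Latest start that still lets every successor meet the deadline."""
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    alap = {}
    for op in reversed(list(preds)):
        latest_finish = min((alap[s] for s in succs[op]), default=deadline)
        alap[op] = latest_finish - delay[op]
    return alap


def mobility(preds, delay, deadline):
    """Mobility = ALAP date - ASAP date; 0 means the operation is critical."""
    asap = asap_dates(preds, delay)
    alap = alap_dates(preds, delay, deadline)
    return {op: alap[op] - asap[op] for op in preds}


print(mobility(PREDS, DELAY, DEADLINE))
```

With these values every operation on the path m1/m2, a1, a2 has a mobility of one step, while `m3` has two, so it is the natural candidate to be deferred when multipliers are scarce.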

3.2 Scheduling/binding task

In this synthesis process, the scheduling and binding algorithms are performed concurrently in order to take the operation characterisation into account. A resource-constrained scheduling [3] (list-based scheduling) is used. It is a generalisation of the ASAP algorithm with the inclusion of constraints. The scheduling priority list is built according to two criteria. The first is mobility: in each iteration step, operations with low mobility are scheduled first, and operations with large mobility are deferred to later control steps. The second criterion depends on the operation characterisation vectors and the structure of the components. The allocated components are themselves characterised by vectors (data path knowledge) that are updated throughout the synthesis process, according to the category of the data they are associated with. The objective is to schedule the operations while preserving the similarity between the operation characterisation vectors and the component characterisation vectors. In the same way, the binding step assigns data to registers and operations to the allocated components by taking their characterisation into account, in order to minimise the number of different characterisation vectors associated with each component and consequently the number of data paths. If necessary, the characterisation vectors associated with the components are updated again.
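A simplified reading of this concurrent scheduling/binding step is sketched below in Python. All names and values are hypothetical. Mobility orders the ready list, and the characterisation vectors steer the choice of component: a free component that already carries the vector of the operation is preferred. How the tool actually combines the two criteria is not detailed in the text, so this combination is an assumption.

```python
PREDS = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
KIND = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
DELAY = {"m1": 2, "m2": 2, "m3": 2, "a1": 1, "a2": 1}       # control steps
MOBILITY = {"m1": 1, "m2": 1, "m3": 2, "a1": 1, "a2": 1}    # from an ASAP/ALAP pass
# Made-up characterisation vectors: (input categories, result category).
VECTORS = {
    "m1": ((3, 3), 2), "m2": ((3, 3), 1), "m3": ((3, 3), 1),
    "a1": ((2, 1), 1), "a2": ((1, 1), 3),
}
UNITS = {"MUL0": "mul", "MUL1": "mul", "ADD0": "add"}       # allocated components


def list_schedule(preds, kind, delay, mobility, vectors, units):
    """Resource-constrained list scheduling with vector-aware binding."""
    start, binding = {}, {}
    busy_until = {u: 0 for u in units}
    unit_vectors = {u: set() for u in units}
    step = 0
    while len(start) < len(preds):
        # An operation is ready once all its predecessors have finished.
        ready = [op for op in preds if op not in start and all(
            p in start and start[p] + delay[p] <= step for p in preds[op])]
        for op in sorted(ready, key=lambda o: mobility[o]):  # low mobility first
            free = [u for u in units
                    if units[u] == kind[op] and busy_until[u] <= step]
            if not free:
                continue  # deferred to a later control step
            # Prefer a component that already carries this vector.
            free.sort(key=lambda u: vectors[op] not in unit_vectors[u])
            unit = free[0]
            start[op], binding[op] = step, unit
            busy_until[unit] = step + delay[op]
            unit_vectors[unit].add(vectors[op])
        step += 1
    return start, binding, unit_vectors


start, binding, unit_vectors = list_schedule(
    PREDS, KIND, DELAY, MOBILITY, VECTORS, UNITS)
print(start)
print(binding)
```

In this example `m3` is deferred to step 2 because both multipliers are busy, and it is then bound to `MUL1`, which already carries its vector, so each multiplier ends up with a single characterisation vector.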

3.3 Clustering and register optimisation task

For a given architectural solution, the interconnection cost depends on the length of the wires. Since conventional place and route tools like Silicon Ensemble can take hierarchical descriptions and placement directives into account, generating a hierarchical RTL description enables components to be placed locally, i.e. with low cost interconnections. A clustering step is thus performed in this synthesis process. This step consists in introducing hierarchy into the RTL description. As said previously, the processing unit can be partitioned into elementary cells including an arithmetic component, its connected registers and interconnection operators (Figure 1). The lengths of wiring associated with category 1 variables can thus be controlled if elementary cells are hierarchically described. For the inter-cluster data transfers, placement directives are given in order to minimise the length of wiring associated with category 2 variables. With regard to register sharing, our optimisation algorithm is applied to each distinct cluster rather than to the whole processing unit, as is usually done. Obviously, in terms of the number of registers, this register sharing is not as efficient as a register sharing applied to the whole processing unit. However, it ensures local data transfers (i.e. short interconnections) and few interconnection components.
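The trade-off of per-cluster register sharing can be illustrated with a short Python sketch. The text does not name the register optimisation algorithm, so the classical left-edge algorithm is used here as a stand-in; the variable lifetimes and the cluster assignment are made up.

```python
# Hypothetical variable lifetimes as half-open intervals [birth, death).
LIFETIMES = {"a": (0, 2), "b": (2, 4), "c": (1, 3), "d": (3, 5)}
# Hypothetical cluster (elementary cell) of each variable.
CLUSTER_OF = {"a": "mul_cell", "d": "mul_cell", "b": "add_cell", "c": "add_cell"}


def left_edge(lifetimes):
    """Left-edge register sharing: returns one list of variables per register."""
    registers = []  # each entry is [death of the last variable, variables]
    for var, (birth, death) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for reg in registers:
            if reg[0] <= birth:       # register is free again: share it
                reg[0] = death
                reg[1].append(var)
                break
        else:
            registers.append([death, [var]])
    return [variables for _, variables in registers]


def registers_per_cluster(lifetimes, cluster_of):
    """Apply the sharing to each cluster separately instead of globally."""
    by_cluster = {}
    for var, life in lifetimes.items():
        by_cluster.setdefault(cluster_of[var], {})[var] = life
    return {cluster: left_edge(lives) for cluster, lives in by_cluster.items()}


global_regs = left_edge(LIFETIMES)
local_regs = registers_per_cluster(LIFETIMES, CLUSTER_OF)
print(len(global_regs), global_regs)
print(sum(len(r) for r in local_regs.values()), local_regs)
```

Here the global sharing needs two registers but mixes variables from both cells in each of them, whereas the per-cluster sharing needs three registers and keeps every register inside one elementary cell.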

4. EXPERIMENTAL RESULTS

A synthesis tool implementing this interconnection cost control approach has been developed on the basis of GAUT [5], a behavioural synthesis tool dedicated to signal and image processing applications under a given real time constraint, and syntheses have been performed. The BLMS (Block Least Mean Square) adaptive filter algorithm [9] is used in acoustic echo cancellation applications that appear in different domains such as teleconference systems. The second application presented here is a DWT (Discrete Wavelet Transform). It can be viewed as a multiresolution decomposition of a signal or an image. The JPEG2000 international image coding standard [10] is based on compression techniques and in particular on the wavelet transform using the Lifting Scheme (LS) algorithm [11]. We synthesised the VHDL behavioural descriptions of:
- a BLMS adaptive filter with 1024 taps under a 160 Kb/s throughput constraint,
- a DWT LS for 3 levels of resolution on a 64*64 tile under an 8000 tile/s throughput constraint, which corresponds to applications with a 30 image/s throughput on 1024*1024 images (256 tiles per image at 30 image/s gives 7680 tile/s).

The synthesis library we used is composed of 16 bit components that have been defined according to a 3 metal layer AMS 0.35 µm CMOS technology. The syntheses have been performed first with a classical architectural synthesis process (i.e. without any particular care for interconnection wires), then with the architectural synthesis process dedicated to submicron technologies presented in this paper. The selection/allocation task of the latter process is usually more costly in terms of the number of allocated arithmetic components. Furthermore, since the register sharing step is performed with cluster constraints, the provided architectures require more registers. However, the estimated areas of the architectures obtained with interconnection cost control are smaller than those obtained without it. In fact, with interconnection cost control, the sets of selected arithmetic components enable more scheduling and binding possibilities for the operations to be computed. Consequently, fewer interconnection components and wires are necessary. Furthermore, thanks to the smaller number of interconnection components (multiplexors, demultiplexors, tristates), the control unit is much simpler.

Then, from the provided RTL description, the logic and physical syntheses have been performed with Design Compiler from Synopsys and Silicon Ensemble from Cadence, with the same control options and row utilisation. Results obtained after placement and routing are presented in Table 1. For syntheses ➀, which correspond to a usual architectural synthesis process, a classical logic and physical flow has been used. Synthesis ➁ results have been obtained after a hierarchical placement and routing of the regions that correspond to the clusters obtained with the process presented in this paper. A classical placement is potentially more efficient than a hierarchical placement in terms of chip area. However, areas are similar for the BLMS filter algorithm, and synthesis ➁ is 30 % less costly than synthesis ➀ for the DWT algorithm. These results confirm the area estimations provided by the architectural synthesis tool.

Technology: AMS 0.35 µm CMOS

                                       BLMS (1024 taps)  DWT LS (64*64)
                                              ➀       ➁         ➀       ➁
Cell             number                   10027    9156     14260    9388
                 area (10^3 µm²)           1784    1771      1884    1658
Chip             area (10^3 µm²)           2751    2730      3249    2694
Interconnection  number                   12091   11348     13944   10404
                 average length (µm)         89      86       149      99
                 max length (µm)           3048    1562      4712    1731

Table 1. Synthesis results after placement and routing

Routing results are also presented in Table 1. The total number of interconnections is lower for syntheses ➁, and the longest interconnections are respectively 2 and 2.7 times shorter with solutions ➁ than with solutions ➀. The initial objective is thus met: the number and the lengths of the interconnections are significantly smaller. Although the path with the longest interconnection wire may not be the critical path, the characterisation of the components carried out throughout the architectural synthesis process gives the type of the data associated with them. According to this information, costly transfers, i.e. category 2 coefficients and hence type 2 wires, can be handled with care during the placement task. Furthermore, since the process aims to reduce the number of category 2 data, this placement task is made easier. For instance, for the architectures provided by the architectural synthesis process with interconnection cost control, the path delay of the placed and routed architecture always meets the specified real time constraint, whereas this constraint was exceeded otherwise.
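The interconnection ratios quoted above can be rechecked from the routing figures of Table 1 with a few lines of Python. The values are copied from the table; the dictionary layout and the helper name are only illustrative.

```python
# Routing figures from Table 1 as (synthesis 1, synthesis 2) pairs.
ROUTING = {
    "BLMS": {"number": (12091, 11348), "average": (89, 86), "max": (3048, 1562)},
    "DWT LS": {"number": (13944, 10404), "average": (149, 99), "max": (4712, 1731)},
}


def ratio(application, row):
    """Ratio of the classical flow (1) to the interconnect-aware flow (2)."""
    classical, controlled = ROUTING[application][row]
    return classical / controlled


for application, rows in ROUTING.items():
    for row in rows:
        print(f"{application:7s} {row:8s} {ratio(application, row):.2f}")
```

The `max` rows give ratios of about 1.95 and 2.72, which are the factors of 2 and 2.7 stated in the text.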

5. CONCLUSION

Currently, architectural synthesis tools attempt to provide RT level design solutions with a fairly good trade-off between cost and performance from a behavioural description. However, these tools do not take the VLSI circuit interconnection wire cost into account, although this cost becomes predominant as technology feature sizes decrease and application complexity increases. An interconnection cost analysis method and an architectural synthesis process that enables the interconnection cost to be controlled have been proposed. Results obtained with DSP applications show an important decrease of the interconnection wire cost, which significantly reduces the gap between the high-level estimated performance and the real performance.

References

[1] Semiconductor Industry Association, "The National Technology Roadmap For Semiconductors", 1999.
[2] J. O. Piednoir, "Very Deep Submicron Design Flows", tutorial, DCISC, Montpellier, France, Nov. 21-24, 2000.
[3] D.D. Gajski, N. Dutt, A. Wu, S. Lin, "High-level Synthesis", Ed. Kluwer Academic Publisher, 1992.
[4] C. Jégo, E. Casseau, E. Martin, "Architectural Synthesis of a Complex Application: the Viterbi Algorithm", User Forum, DATE 99, pp. 69-73, 1999.
[5] E. Martin, O. Sentieys, H. Dubois, J.L. Philippe, "GAUT, an Architecture Synthesis Tool for Dedicated Signal Processors", EURO-DAC 93, pp. 14-19, 1993.
[6] C. Jégo, "Architectural Synthesis of Real Time Applications dedicated to Sub-micron Technologies", PhD Thesis (in French), Lester Lab, Rennes1 University, France, Dec. 11, 2000.
[7] H. Mecha, M. Fernandez, F. Tirado, J. Septién, D. Mozos, K. Olcoz, "A Method for Area Estimation of Data-Path in High-Level Synthesis", IEEE Trans. on CAD of Integrated Circuits and Systems, Vol. 15, pp. 258-265, 1996.
[8] J. Hallberg, Z. Peng, "Estimation and Consideration of Interconnection Delays during High-Level Synthesis", Euromicro'98, pp. 349-356, 1998.
[9] E. Bjarnason, E. Haensler, M. Rupp, "Acoustic Echo Control: Advances in Algorithm techniques", 3rd International Workshop of Acoustic Echo Control, 1993.
[10] ISO/IEC JTC1/SC29 WG1, JPEG 2000, Editor M. Boliek, Coeditors C. Christopoulos & E. Majam, "JPEG2000 Image Coding System", Part I Final Committee Draft Version 1.0.
[11] W. Sweldens, "The Lifting Scheme: A Custom-design Construction of Biorthogonal Wavelets", Appl. Comput. Harmon. Anal., Vol. 3, N° 2, pp. 186-200, 1996.