DIGITAL IF FILTER FOR MOBILE RADIO

Kent Palmkvist, Peter Sandberg, Mark Vesterbacka, and Lars Wanhammar
Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden

SUMMARY

A multirate IF filter for mobile radio has been implemented in silicon. The filter consists of a decimation stage followed by a bandpass filter. Both parts use a lattice wave digital structure. The design and implementation are described beginning with the filter specification and proceeding through algorithmic design, operation scheduling, and resource allocation and assignment. Every step aims to minimize the amount of resources in the final implementation, thereby reducing the power consumption. Finally, the architecture is selected and the system is described using synthesizable VHDL in order to arrive at a chip layout in a standard-cell technology. This design technique reduces the design work.

1. INTRODUCTION

This paper describes an implementation of a multirate digital filter. The presented implementation is to be used as an IF filter for mobile radio. Since it contains decimation stages, it is well suited for systems that use oversampling techniques to simplify the analog interfacing. The implementation approach is, however, also applicable to most digital filters as well as many other types of DSP algorithms.

2. OVERVIEW OF THE DESIGN PROCESS

One common approach to mapping an algorithm onto an architecture is to map the algorithm onto an intermediate virtual machine, followed by a mapping onto a programmable hardware structure. The resulting implementation then consists of standard programmable components, for example digital signal processors. An alternative to this approach is a direct mapping. This may be the only option for some high-performance systems; the design effort is, however, larger than with the virtual machine approach. A direct mapping technique suitable for the design of fixed-function systems is outlined below.

1. Ideally, the specification of the system contains a complete specification of the system and its components. In reality, however, the specifications of the components from which the system is built must be derived from the system specification. In a top-down design approach, the specification of the components at a certain system level will depend on the result of the design decisions made at the level above. Hence, the specifications of the components will evolve as the design progresses.

2. A good DSP algorithm is selected and tuned to the application. Usually, a high-level language is used to develop and validate the correctness of the DSP algorithm. Note that this implementation of the DSP algorithm can serve as specification for the next design phase.

3. In the next design step, the algorithm is successively partitioned into a hierarchy of processes. Here we regard any task in the algorithm as a process; for example, storage of data is considered as a process. The higher level processes will typically be mapped onto virtual machines while the lower level processes will be mapped onto hardware components such as PEs and memories. The partitioning of the system into a hierarchy of processes is a crucial design step that will have a major influence on both the system cost and its performance.

4. In practice, most complex DSP algorithms are formulated using a sequential high-level programming language. The sequential algorithm must therefore be transformed into a more parallel description. The processes in the parallel description must be scheduled in time so that the specified performance is met and the required amount of resources is minimized. Often, only the number of PEs is minimized, but it is also possible to minimize the memory and communication resources. In fact, it is possible to minimize the total amount of hardware resources as well as the power consumption by proper scheduling.

Fig. 1. Synthesis path for the digital IF filter: DSP ALGORITHM, SCHEDULING, RESOURCE ALLOCATION AND ASSIGNMENT, CLASS OF IDEAL ARCHITECTURES, OPTIMAL ARCHITECTURE.

5. An adequate amount of resources is allocated according to the given schedule. The number of PEs can be estimated from the number of concurrent operations, while the number of logical memories is determined by the largest number of simultaneous memory transactions.

6. In the resource assignment step, the processes are assigned to specific resources, i.e., PEs, memories, and communication channels. The methods we use are based either on clique partitioning or on the left-edge algorithm; the latter is used to optimize the utilization of memory. The left-edge algorithm is well known from wire routing.

7. The generic architecture is a shared-memory architecture. In this step, we minimize the cost of the PEs, memories, and communication circuitry. It is advantageous to use bit-serial PEs and bit-parallel RAMs for storage of data. The bit-serial PEs must therefore communicate with the memories through serial/parallel converters that act as cache memories (a small behavioral sketch of such a converter is given after this list).

Steps 3 through 7 must be iterated until a satisfactory solution is found.

8. The next step involves the logic design of the modules in the architecture, i.e., the PEs, memories, control units, etc. The control signals, which can also be derived from the schedule, are defined in this step.

9. In the circuit design step, the modules are designed at the transistor level. Transistor sizes are optimized with respect to performance in terms of speed, power consumption, and chip area.

10. The last step, the VLSI design phase, involves the layout, i.e., floor-planning, placement, and wire routing. A key idea is that neither system nor circuit design is done in this phase.
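The serial/parallel conversion mentioned in step 7 can be illustrated with a short behavioral sketch. The LSB-first bit order, the 12-bit word length, and the function names below are assumptions made only for the illustration; the sketch simply shows how a parallel word from the RAM is turned into a bit stream for a bit-serial PE and back again.

```python
# Behavioral sketch of the serial/parallel converters that let bit-serial PEs
# exchange data with a bit-parallel RAM (illustration only; LSB-first order
# and the 12-bit word length are assumptions, not the paper's exact design).

WORD_LENGTH = 12  # Wd = 12 bits, as chosen for the filter described later

def parallel_to_serial(word, wl=WORD_LENGTH):
    """Load a parallel word from the RAM and shift it out bit by bit, LSB first."""
    return [(word >> i) & 1 for i in range(wl)]

def serial_to_parallel(bits):
    """Collect the bit stream produced by a bit-serial PE into a parallel word."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word

if __name__ == "__main__":
    sample = 0b101101001011                      # a 12-bit data word in the RAM
    bit_stream = parallel_to_serial(sample)      # fed serially into the PE
    assert serial_to_parallel(bit_stream) == sample
    print(bit_stream)
```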

3. FILTER SPECIFICATION

The IF filter specification is shown in Fig. 2. There is also a limitation on the group delay variation in the passband. The specification does, however, not define the sample frequency. The following design is based on a sample rate that is 8 times the center frequency of the bandpass filter. This corresponds to one decimation by 2 in front of the bandpass filter. The output sample rate is 380 kHz and the input sample rate is consequently 760 kHz. Such decimation steps are typically needed in systems using delta-sigma A/D converters.

Fig. 2. Magnitude specification for the bandpass filter (attenuation A [dB] versus frequency f [kHz]).

4. ALGORITHM DESIGN AND OPTIMIZATION

The filter algorithms are selected according to the robustness requirement set by the application. One large family of filters with this property is the lattice wave digital filters [1, 2]. These structures are guaranteed to be free from parasitic oscillations and have close to minimum coefficient sensitivity, high dynamic range, and good computational properties. The data and coefficient word lengths have been determined to Wd = 12 bits and Wc = 6 bits, respectively.

Fig. 3. Multirate IF filter with n decimation stages and bandpass filter.

Decimation is done by two stages, each decimating by a factor of two. This allows the decimation stages to be radically simplified. The transition band of the decimation filters may be rather broad, as the remaining high-frequency components are filtered out by the bandpass filter.
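To make the lattice wave digital structure concrete, the following sketch builds a small half-band (bireciprocal) lattice WDF from the symmetric two-port adaptor, the kind of section that can be used for decimation by 2. The adaptor equations are the standard ones; the class names and the coefficient alpha0 = -1/3 (a third-order Butterworth half-band) are illustrative choices, not the coefficients of the implemented IF filter.

```python
# Lattice WDF sketch: symmetric two-port adaptors, first-order allpass branch
# in z^2, and a half-band lattice of the kind usable for decimation by 2.
# Coefficient values and class names are illustrative assumptions.

class Adaptor:
    """Symmetric two-port adaptor, the single multiplication of a WDF section:
    b1 = a2 + alpha*(a2 - a1),  b2 = a1 + alpha*(a2 - a1)."""
    def __init__(self, alpha):
        self.alpha = alpha

    def compute(self, a1, a2):
        d = self.alpha * (a2 - a1)
        return a2 + d, a1 + d          # (b1, b2)


class AllpassInZ2:
    """One adaptor with port 2 closed by two delays, realizing the allpass
    A(z) = (z^-2 - alpha) / (1 - alpha * z^-2)."""
    def __init__(self, alpha):
        self.adaptor = Adaptor(alpha)
        self.state = [0.0, 0.0]        # two cascaded delay elements

    def step(self, x):
        a2 = self.state[1]             # reflected wave from two samples ago
        b1, b2 = self.adaptor.compute(x, a2)
        self.state = [b2, self.state[0]]
        return b1


class HalfbandLatticeWDF:
    """y(n) = 0.5 * (A0(z^2){x(n)} + x(n-1)): two parallel lattice branches."""
    def __init__(self, alpha0=-1.0 / 3.0):   # -1/3 -> Butterworth half-band
        self.branch0 = AllpassInZ2(alpha0)
        self.z1 = 0.0                        # the pure-delay branch

    def step(self, x):
        y = 0.5 * (self.branch0.step(x) + self.z1)
        self.z1 = x
        return y


if __name__ == "__main__":
    lp = HalfbandLatticeWDF()
    dc = [lp.step(1.0) for _ in range(100)][-1]           # passband: -> 1.0
    lp = HalfbandLatticeWDF()
    ny = [lp.step((-1.0) ** n) for n in range(100)][-1]   # stopband: -> 0.0
    print(f"passband gain {dc:.4f}, stopband gain {abs(ny):.4f}")
```

In a decimator, only every second output sample of such a section needs to be kept (or computed).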

5. SCHEDULING OF THE ARITHMETIC OPERATIONS

Scheduling of the operations is performed to find execution times for all operations. This ordering must meet the deadline and require a minimal amount of resources. These resources consist of PEs, memory cells, and communication channels. The control circuitry should also be included, even though it is often neglected. The scheduling problem is formulated using a computational graph, which is a signal flow graph with a time property added. DSP algorithms such as filters are often periodic, which makes the schedule periodic as well. This can be utilized by connecting k computational graphs as shown in Fig. 4.

Fig. 4. Circularly concatenated computational graphs.

The multirate property of the original problem implies that a transformation must be done in order to use this formulation. The transformation consists of expressing the higher sample rate parts of the algorithm at the lowest sample rate of the algorithm. The number of operations is thereby clearly described by the signal flow graph. Both the decimation stages and the bandpass filter consist of wave digital filters. It is therefore suitable to select the adaptor operation as the atomic operation to be implemented in a PE. The multiplications by –1 as well as overflow detection and handling are also included in this PE. A PE will therefore consist of a few additions, one multiplication, and miscellaneous logic. The simplest bit-serial implementation requires approximately Wd + Wc + 6 clock cycles. It is possible to pipeline this operation, giving a higher throughput. The simple implementation thus requires 22 clock cycles. Standard cells are not very well suited for bit-serial implementations, as the flip-flop is rather large and the maximum clock frequency is below 100 MHz, while a full-custom implementation may be clocked up to and beyond 300 MHz. A bit-parallel PE was therefore selected. It has approximately the same area as the bit-serial PE and can be clocked beyond 10 MHz. The minimal number of PEs is therefore 12 × 380 × 10³/(10 × 10⁶) = 0.46 < 1, i.e., a single PE is sufficient.

The latency of the memory system is an important parameter when scheduling the operations and must therefore be estimated. Using memories to store temporary values and relaxing the access time forces the memory latency to be of the same order as that of a PE operation. Storage of variables can be implemented in a 2-port memory, since the required access rate 12 × 2 × 380 × 10³ Hz ≈ 9.1 MHz is lower than 10 MHz. The latency of an operation in such a system is at most 3 TPE. The algorithm must therefore be pipelined, which only affects the group delay of the system. There are delay elements in the algorithm that are connected together, forming a 2 Tsample delay. This forces the schedule to span 2 sample periods in order to describe the variables, which indicates that the control circuitry is periodic with a period of 2 Tsample.
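The resource estimates above amount to a short calculation; the snippet below simply reproduces that arithmetic with the numbers quoted in the text (the variable names are mine).

```python
# Reproduces the PE and memory estimates quoted in the text.
import math

f_out = 380e3          # output sample rate [Hz] (from the specification)
ops_per_sample = 12    # adaptor operations per output sample period
f_pe = 10e6            # clock rate of the selected bit-parallel PE [Hz]

# The text's estimate corresponds to one adaptor operation per PE clock cycle.
utilization = ops_per_sample * f_out / f_pe      # = 0.46
n_pe = max(1, math.ceil(utilization))            # -> a single PE is sufficient

# Each operation needs one read and one write of the shared variable memory.
mem_access_rate = 2 * ops_per_sample * f_out     # = 9.12 MHz < 10 MHz

print(f"PE utilization {utilization:.2f} -> {n_pe} PE")
print(f"memory access rate {mem_access_rate / 1e6:.2f} MHz")
```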

Fig. 5. Operation schedule over 2 sample periods (adaptor operations 1–12 in each period Tsample; inputs x(4n), ..., x(4n+3); outputs y(2m), y(2m+1)).

The sample period is divided into 12 slots of equal length. Distributing the operation start times over these slots guarantees that the variable store is accessed by only one PE at a time. The scheduling problem is then to put one operation into each slot in such a way that the precedence relations are not violated and the minimal lengths of the arcs (due to the latency of the memory system) are fulfilled. It is very important that the cost in terms of resources is correctly calculated during the scheduling process. This cost may be determined by executing the resource allocation and assignment step every time a new schedule is tried, or by using approximations. Such approximations may consist of lower bounds such as the maximal number of simultaneous operations, the maximal number of simultaneously stored variables, etc. All of these are easily extracted from the scheduled graph.
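The slot assignment described above can be sketched as a simple greedy list scheduler: operations are placed one per slot, no earlier than their predecessors plus a minimum arc length that models the memory latency. The dependence graph, the latency value of 3 slots, and the greedy strategy below are illustrative assumptions; the actual schedule in Fig. 5 was constructed with the resource cost in mind, as discussed above.

```python
# Greedy slot scheduling sketch: one operation per slot, precedence respected,
# and a minimum distance between dependent operations modeling the memory
# latency. The graph and numbers below are illustrative, not the IF filter's.

MIN_ARC = 3   # slots between an operation and its successor (memory latency)

# op -> list of predecessor ops (a small made-up dependence graph)
preds = {
    "A": [], "B": [], "C": ["A"], "D": ["A", "B"], "E": ["C", "D"],
}

def schedule(preds, min_arc=MIN_ARC):
    slot_of = {}
    used_slots = set()
    remaining = dict(preds)
    while remaining:
        # pick a ready operation (all predecessors already scheduled)
        op = next(o for o, ps in remaining.items()
                  if all(p in slot_of for p in ps))
        earliest = max((slot_of[p] + min_arc for p in remaining[op]), default=0)
        slot = earliest
        while slot in used_slots:      # one operation per slot
            slot += 1
        slot_of[op] = slot
        used_slots.add(slot)
        del remaining[op]
    return slot_of

print(schedule(preds))   # -> {'A': 0, 'B': 1, 'C': 3, 'D': 4, 'E': 7}
```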

6. RESOURCE ALLOCATION AND ASSIGNMENT

The resource allocation step determines the amount of hardware resources that should be employed. Each process in the schedule has a lifetime and must have some resource allocated; this includes the storage and communication processes. One common approach is to use a one-to-one mapping, where one resource is allocated and assigned to every operation in the schedule. This is also known as an isomorphic mapping. Resources may often be shared if the processes sharing the resource are not overlapping in time. The minimal number of resources is often equal to the maximal number of simultaneously needed resources in the schedule. Operations must then indicate the lifetime as well as the latency, as it is the lifetime that defines the number of simultaneous operations.

Fig. 6. Life-time table for the memory variables.

The resource allocation and assignment problem may be solved using connectivity graphs, where nodes represent processes and arcs connect processes that do not overlap in time. Each graph represents one type of resource. The resource allocation problem is then to find the minimal number of cliques, where a clique is a fully connected subgraph.

Each node must be present in one and only one clique. Each clique is then allocated a processing element, and the resource is assigned to every process represented by the nodes in the clique. The problem of finding these cliques is in general NP-complete, which indicates that this method is not very suitable if a large number of resources are to be allocated and assigned. Another method that may be used is a modified left-edge algorithm. This is a heuristic algorithm used for VLSI routing problems and, as such, it does not always find the optimal solution. It is, however, rather well suited for allocation and assignment of variable storage. A sorted list of all processes is searched for processes that may be assigned to one allocated resource; when the end of the list is reached, a new resource is allocated and the search restarts from the beginning of the list.
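A minimal sketch of such a left-edge style assignment is given below: variable lifetimes, given as (start, end) slots, are sorted by start time and packed into as few memory cells as possible. The example lifetimes are made up and are not those of Fig. 6.

```python
# Left-edge style assignment of variable lifetimes to memory cells.
# Lifetimes are (start, end) in schedule slots; the example data is made up.

def left_edge_assign(lifetimes):
    """Return {variable: memory_cell} using a left-edge style heuristic."""
    order = sorted(lifetimes, key=lambda v: lifetimes[v][0])  # sort by start
    assignment = {}
    cells = []   # cells[i] = end slot of the last lifetime placed in cell i
    for var in order:
        start, end = lifetimes[var]
        for i, last_end in enumerate(cells):
            if start > last_end:          # lifetimes do not overlap: reuse cell
                assignment[var] = i
                cells[i] = end
                break
        else:                             # no reusable cell: allocate a new one
            assignment[var] = len(cells)
            cells.append(end)
    return assignment

lifetimes = {"v1": (0, 4), "v2": (1, 6), "v3": (5, 9), "v4": (7, 11), "v5": (2, 3)}
print(left_edge_assign(lifetimes))
# -> {'v1': 0, 'v2': 1, 'v5': 2, 'v3': 0, 'v4': 1}
```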

7. OPTIMAL ARCHITECTURE

Sharing resources implies that the variables must be stored in a common place and that the PEs communicate with this storage. We use a shared multi-bus structure, where each memory has its own bus. The PEs have cache memories at their inputs and outputs, enabling them to work asynchronously while variables are fetched and stored.

Fig. 7. Architecture with asynchronous PEs: counter and ROM for control, RAM for variables, coefficient ROM, cache memories, and processing element with inputs/outputs.

8. IMPLEMENTATION

The complete design has been described using VHDL. The VHDL code consists of both behavioral and structural parts. A logical representation in the form of a netlist is then generated using logic synthesis tools and a standard-cell library (AMS 0.8 µm CMOS). Automatic placement and routing was then used to produce a layout. Each step has been tested by simulating the design.

The initial implementation of the bandpass part, i.e., one PE and 16 memory cells, resulted in a standard-cell implementation [3] using 4 mm² (9 mm² including pads). This implementation consumed 45 mW at 5 V. The high power dissipation was partly due to a poor implementation of the RAM: each RAM cell was generated as a flip-flop continuously clocked at the system clock speed. Implementing the memory using a RAM generator reduced the area to 3 mm² (6.4 mm² including pads). This version has not yet returned from fabrication.

9. REFERENCES

[1] Wanhammar L.: DSP Integrated Circuits, Prentice-Hall, 1995 (in preparation).
[2] Eriksson S., Wanhammar L.: Tidsdiskreta filter, del 1–3 [Discrete-time filters, parts 1–3], LinTek IC, 1978.
[3] Sandberg P., Palmkvist K., Wanhammar L.: Some Experiences from Automatic Synthesis of Digital Filters, Proceedings of NorChip '94, Göteborg, Sweden, Nov. 8–9, 1994.
