A Heterogeneous MPSoC with Hardware Supported Dynamic Task Scheduling for Software Defined Radio

T. Limberg, M. Winter, M. Bimberg, R. Klemm, M.B.S. Tavares, H. Ahlendorf, E. Matúš, G. Fettweis
Technische Universität Dresden
Vodafone Chair Mobile Comm. Systems
D-01062 Dresden, Germany

[email protected]

ABSTRACT

In this paper, we present a fully programmable, heterogeneous single-chip SDR platform with multimedia support. Running at 175 MHz, the chip delivers a peak performance of 40 GOPS while dissipating 1.5 W. It contains a hardware unit called CoreManager for run-time scheduling of tasks. The CoreManager solves the typical MPSoC programmability problem, improves energy efficiency and makes the platform scalable.

Categories and Subject Descriptors

C.1 [Computer Systems Organization]: Processor Architectures; C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems—Signal Processing Systems

General Terms

Design

Keywords

MPSoC, DSP, Signal Processing, Task Scheduling, Run-Time Scheduling

1. INTRODUCTION

Emerging next generation cellular standards like 3GPP LTE and WiMAX require vast amounts of modem signal processing. Both standards represent high data rate, low latency, packet-optimized technologies, incorporating OFDMA/MIMO, adaptive modulation and state-of-the-art channel coding techniques. In such systems, the dynamic variability of configurations due to user resource allocation, in conjunction with the high computational demand and low


H. Eisenreich, G. Ellguth, J.-U. Schlüssler
Technische Universität Dresden
Parallel VLSI-Systems and Neural Circuits Chair
D-01062 Dresden, Germany

[email protected]

latency requirements call for programmable, distributed baseband architectures. At the same time, broadband media applications (e.g. H.264) will run in handsets as well. Their data-dependent control flow does not allow effective scheduling at compile time, so a run-time solution to this problem is preferable.

Multi-core architectures - e.g. from Icera, Coresonic, PicoChip, Infineon's MuSIC [11] or Sandbridge's SB3011 platform [6] - are acknowledged to be power efficient [7, 1] in scenarios with stringent performance requirements. However, such systems are usually very hard to program. Furthermore, traditional multi-core programming techniques impose penalties on energy efficiency which can easily dilute the benefits gained from parallelism. Previous studies [13] have shown that the power efficiency of DSP-based modem implementations (considering the GSM part) did not scale with semiconductor technology improvements, due to the additional complexity inherent to programmable systems and the increased algorithmic complexity (e.g. the enhanced GSM speech codec). At least one fourth of these losses can be attributed to ineffective multi-core programming techniques. In this context, while it might seem reasonable to use many application-tailored hardware accelerators, which are power efficient on their own, the interrupt-based synchronization of these accelerators with the control code causes considerable run-time overhead that ultimately erodes the energy advantage.

We propose a hardware scheduling unit, called the CoreManager, to solve the above mentioned problems. The CoreManager enables a C-based programming model following the synchronous data flow model [8]. In this programming model, atomic computational kernels, called tasks, are instantiated from the control code and then automatically scheduled to processing elements (PE) at run-time. Besides enabling convenient programming of an MPSoC, the CoreManager helps to improve the energy efficiency of such systems in several ways:

1. Task scheduling is performed by a highly specialized hardware component which does not require any additional software.

2. Communication with the control code is not interrupt based. This reduces the overhead caused by interrupt processing and by cache misses due to context switching.

3. The partitioning of control code processing and number crunching allows selecting the appropriate processor type for each task.

In this sense, any kind of hardware unit, such as ASICs, ASIPs or general-purpose processors, can be used as a CoreManager-controlled processing element. Furthermore, the CoreManager enables scalability: processing elements can be added or removed without any software changes, which allows easy adaptation of existing systems to new demands. In this paper we present the Tomahawk MPSoC - a fully programmable, heterogeneous, low-power software defined radio platform with support for multimedia applications - based on the CoreManager concept.

Figure 1: Sequence diagram showing the interaction of components for dynamic task scheduling (T1, T2 and T4 are independent, T3 depends on T1; some edges are left out to retain visual clarity).

2. MPSOC PROGRAMMING MODEL

The C-based programming model of the Tomahawk, which is similar to the CellSS [2] programming model for the Cell processor, completely hides scheduling details from the programmer. However, in contrast to the software-based scheduling of CellSS, the CoreManager [12] computes the schedule of the tasks issued from the control code in dedicated hardware and thus achieves significantly better performance and energy efficiency. The programmer is merely required to annotate each C function that shall be executed as a task on a certain processing element type with special #pragma directives. The tasks are then instantiated with pointers to their input and output data blocks, as well as the size of each of these blocks. These parameters are provided within the task macro, which actually requests the task execution. Listing 1 shows a simple example of how a task is called with different parameters depending on the control flow. It is worth mentioning that any off-the-shelf C compiler can compile the shown program without modifications; only the task.h header has to be replaced by a version that defines the task macro to perform a simple function call.

// the task macro is defined here:
#include "task.h"

#pragma TASK_BEGIN some_task    // start task declaration
#pragma TASK_TYPE somedsp       // specify target PE
void some_task(void *i1, void *i2, void *out)
{
  // function implementation goes here
  // calls to sub-functions are allowed
}
#pragma TASK_END                // finish task declaration

int main()
{
  // task data declarations
  int *input1, *input2, *output1, *output2;
  // initialize memories and data
  ...

  // instantiate tasks
  if (this_is_true)
    task(some_task, IN(input1, 256),
         OUT(output1, 16), OUT(output2, 64));
  else
    task(other_task, IN(input1, 64), IN(input2, 32),
         OUT(output1, 1024));
}

Listing 1: Simple programming model example

Considering the compiler framework, an automated, script-controlled tool chain splits the task code from the control code and generates task binaries using the compilers of the different processing elements. The task binaries are then linked as data sections into the control code. Finally, the task macros are expanded to function calls, forming so-called Task Descriptions which are sent to the CoreManager at run-time. The Task Descriptions, which contain the addresses of the task binary and of the input and output data, as well as the size of each of these memory blocks, are sent over the interconnection network to the CoreManager. Based on the input and output data blocks, dependency checking against all previously queued tasks is performed and annotated in the CoreManager. As soon as all dependencies of a task are resolved and an appropriate processing element is available, the task is started. Before task execution, program and input memories are copied from the global memory to the local PE memory; after task completion, the local PE memory is copied back to the global memory. Figure 1 shows an example of the dynamic task scheduling procedure.
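To make the fallback concrete, here is a minimal sketch of what such a replacement task.h could look like; the macro bodies are assumptions for illustration (the actual header is not shown in the paper), and the unknown TASK_* pragmas are simply ignored by a standard compiler:

/* Hypothetical fallback task.h: turns every task request into a
 * plain, synchronous function call so that any off-the-shelf C
 * compiler can build the program. Macro bodies are assumptions. */
#define IN(ptr, size)   (ptr)            /* keep the pointer, drop the size */
#define OUT(ptr, size)  (ptr)
#define task(fn, ...)   fn(__VA_ARGS__)  /* execute the task in place */

With these definitions, the call task(some_task, IN(input1, 256), OUT(output1, 16), OUT(output2, 64)) from Listing 1 simply expands to the ordinary call some_task(input1, output1, output2).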

Figure 3: Tomahawk MPSoC schematic (external IP is dark gray). The schematic shows two DC212GP control processors, the FPGA bridge, the entropy decoder, the VGA and stream interface, three DDR SDRAM controllers (DDR Ctrl), a PCI Express endpoint MAC (PCIe), I2C and GPIO, the LDPC decoder, the deblocking filter, the CoreManager with its DMA controllers, the 256 KByte scratchpad memory, six vector DSPs (VDSP) and two scalar DSPs (SDSP), all attached to the network-on-chip (NoC) via master (M) and slave (S) ports.

Figure 2: Dependency checking between two 2-D sub-blocks of a 2-D memory which is stored line after line in memory (numbers are line start addresses).

Concerning the data transfers, the CoreManager tries to maximize the local reuse of program memories, thus decreasing the need for reloading. Consequently, the required NoC bandwidth and energy consumption are reduced. For dependency checking, the CoreManager and the DMA controllers also support two-dimensional memory accesses, which allows for a more efficient implementation of multimedia and MIMO algorithms. Figure 2 shows how 2-D dependency checking on the memories is performed.
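The paper does not disclose the CoreManager's internal comparison logic, but the 2-D check of Figure 2 amounts to a rectangle intersection test over (column, row) coordinates. The following C sketch, with assumed descriptor fields, illustrates the idea for two sub-blocks of the same line-by-line stored memory:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor of a 2-D sub-block inside a larger 2-D
   memory stored line after line; the field names are assumptions,
   as the paper does not spell out the CoreManager's internal format. */
typedef struct {
    uint32_t base;   /* start address of the first sub-block line */
    uint32_t stride; /* line-to-line distance of the 2-D memory   */
    uint32_t width;  /* bytes per sub-block line                  */
    uint32_t lines;  /* number of lines in the sub-block          */
} block2d;

/* Two sub-blocks of the same line-by-line 2-D memory (equal stride)
   overlap iff both their column ranges and their row ranges
   intersect, i.e. a plain rectangle intersection test. */
static bool blocks_overlap(const block2d *a, const block2d *b)
{
    if (a->stride != b->stride)   /* different parent memories: */
        return false;             /* assumed handled elsewhere  */
    uint32_t col_a = a->base % a->stride, row_a = a->base / a->stride;
    uint32_t col_b = b->base % b->stride, row_b = b->base / b->stride;
    return col_a < col_b + b->width && col_b < col_a + a->width &&
           row_a < row_b + b->lines && row_b < row_a + a->lines;
}

A 1-D block is covered by the same test as the special case of a single line.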

3. MPSOC ARCHITECTURE

The Tomahawk MPSoC exploits instruction-, data- and task-level parallelism in order to meet stringent performance requirements with low energy consumption. Figure 3 shows a schematic of the Tomahawk. Below, we briefly discuss the components of this architecture. Two Tensilica DC212GP RISC processors form the platform for operating system and control code execution. The signal processing block of the Tomahawk is composed of six 4-fold SIMD fixed-point vector DSPs (VDSP), two scalar floating-point DSPs (SDSP), a low-density parity-check (LDPC) decoder ASIP, a deblocking filter ASIP and an entropy decoder ASIC. The CoreManager schedules the signal processing tasks issued from the control code onto the VDSPs and SDSPs. The VDSPs are meant for vectorized parallel algorithms such as FFTs or DCTs. Each VDSP provides 20 MOPS/MHz, resulting in an accumulated processing power of 21 GOPS at an operating frequency of 175 MHz. Algorithms with a high dynamic range, such as matrix inversions for MIMO processing, can be processed by the SDSPs. Furthermore, the SDSPs are intended for bit stream processing; for that purpose, all processor instructions can be executed conditionally. Both SDSPs jointly contribute 0.7 GOPS to the total processing power of the Tomahawk.
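The accumulated figures follow directly from the per-core rates and the 175 MHz clock (the implied per-SDSP rate of 2 MOPS/MHz is back-calculated from the stated total):

\[
6 \times 20\ \tfrac{\text{MOPS}}{\text{MHz}} \times 175\ \text{MHz} = 21\ \text{GOPS}, \qquad
2 \times 2\ \tfrac{\text{MOPS}}{\text{MHz}} \times 175\ \text{MHz} = 0.7\ \text{GOPS}.
\]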

The LDPC decoder ASIP [3] is able to decode variable block lengths at throughputs of several hundred MBit/s; for instance, our implementation achieves a gross data rate of 1.1 GBit/s for a (3, 6) code. The 64-way parallel, 8-bit fixed-point SIMD-VLIW decoder architecture delivers a performance of 12 GOPS.

The peripheral part of the Tomahawk consists of the following components: an FPGA bridge enabling additional functionality by mapping off-chip components into the address space of the Tomahawk, a single-lane PCI Express interface realizing a 2 GBit/s communication link to a host computer, a VGA/streaming interface for attaching an AD/DA converter or a standard VGA display, a freely programmable DMA controller, general purpose I/Os and a UART interface.

All components on the chip are connected by two low-latency, high-bandwidth, crossbar-like master/slave networks-on-chip (NoC) [14] with 32-bit bus width. The NoC performs static priority arbitration per slave and supports burst transfers of up to 63 data words. To improve latency, the NoCs operate on the negative clock edge. A sustained throughput of 5.47 GBit/s is achieved for each master-slave connection, and the crossbar-like architecture allows parallel communication channels for each master-slave pair. For extensibility, the FPGA bridge provides a full-bandwidth interface to the external world, so additional components can be connected to the Tomahawk. Moreover, depending on the function of the attached component, the FPGA bridge can be configured to work as master or as slave.

The VDSP and SDSP, LDPC decoder and deblocking filter implementations are all based on the synchronous transfer architecture (STA) [4]. STA processors make register file bypassing explicit: in contrast to the traditional approach, the STA functional units hold and exchange data directly with each other instead of going through a central register file. This significantly reduces the I/O bandwidth, size and power consumption of the register file.

3.1 Memory Architecture

The Tomahawk contains both local and global memories. The local memories are part of the signal processing elements, which do not have direct access to the global memories. The global memories, in turn, are accessible from all NoC master components (see Figure 3) and consist of external DDR-SDRAM and I2C memories as well as an internal 256 KByte SRAM, which is used as scratchpad memory.

Three DDR RAM controllers, accessible independently and in parallel, provide a large memory bandwidth in order to supply the processing elements (PE) and the control processors (CP) with data. The I2C memory serves as boot ROM for the Tomahawk.

4. DESIGN IMPLEMENTATION

4.1 RTL Design and Functional Verification

In order to implement, test and verify the Tomahawk SoC, a bottom-up design approach has been used. The scalar and vector processors as well as the deblocking filter and LDPC decoder ASIPs have been implemented using our automated STA design tools [4]. These tools generate a C++ based instruction-set simulator (ISS) as well as an assembler from a high-level XML machine description. Furthermore, a Verilog template is generated from this XML description. This template contains the fully functional instruction decoder and stubs of all functional units (FU) of the processor. Additionally, the interconnect between the FU stubs is generated automatically, avoiding error-prone hand-written interconnect design; only the behavior of the FUs needs to be implemented manually. Along with the Verilog templates, appropriate test benches are generated which allow automated cross-checking of the Verilog implementation against the generated ISS. During the verification process, corner-case and random test patterns as well as complete assembly programs have been used to verify the correct functionality of the processors in simulation.

The networks-on-chip have also been generated automatically. Hand-written test master and test slave modules have been used to verify the functional correctness of the network and its connected components. Finally, in order to verify the complex interaction of the different MPSoC components, several chip configurations have been implemented as FPGA prototypes on Altera Stratix II 180 and Stratix II GX 130 platforms.

4.2 Logical and Physical Implementation

A flat logical and physical implementation of the complete Tomahawk MPSoC was not feasible, as the tool run-times would have been unacceptably high for a chip of this size. Therefore, a hierarchical approach was chosen: a synthesis, place & route and sign-off flow was set up not only for the top level, but also separately for each PE and the CoreManager. The resulting subchips were then used as macro blocks during the top-level implementation. Besides drastically reducing tool run-times, this approach allowed the logical and physical implementation work to be parallelized. Synthesis was done with Synopsys Design Compiler Ultra; place & route was done with Cadence SoC Encounter (including the QRC gate-level extractor and the CeltIC crosstalk analyzer). The setups of all tool flow runs were reproducible (script/makefile based), which allowed fast design iterations. Starting with "dirty" netlists based on incomplete RTL, initial floorplanning was done for the subchips and the top level. In this way, critical implementation issues (e.g. design constraining and critical timing paths) could be identified and addressed early in the project schedule. To find a suitable power mesh, IR-drop analyses based on statistical assumptions and on post-layout netlist simulation data for

the node activity were performed for the subchips and the assembled top level (flat). The timing sign-off was done with Synopsys PrimeTime/SI, based on the flat top-level netlist and gate-level parasitics obtained with Synopsys Star-RCXT. Remaining timing violations were fixed by ECO runs. The physical sign-off was based on Mentor Calibre. To further reduce tool run-times, multi-threading capabilities were used where possible: the execution speed of routing (NanoRoute), gate-level parasitics extraction (QRC, Star-RCXT) and DRC scaled almost linearly with the number of processors used, and LVS could also be accelerated significantly by enabling the Calibre multi-thread option.

5. RESULTS

5.1 Implementation Results

The Tomahawk chip was designed using a UMC 130 nm, 8-metal-layer CMOS standard cell design flow. The 57M-transistor chip occupies 10 × 10 mm² (including all 480 I/O cells) and runs at 175 MHz. The typical-case core supply voltage is 1.2 V; the I/O voltages are 3.3 V and 2.5 V for the normal and the high-speed SSTL2 I/Os, respectively.

Figure 4: Tomahawk reference board plugged into a PCI Express slot of a barebone PC.

Figure 4 shows the Tomahawk prototype board. All presented results have been measured on this board unless otherwise noted. For the core power measurements, the PCB provides an independent power supply for the Tomahawk core. The core supply voltage can be adjusted from 0.9 V to 1.35 V and has been set to 1.3 V for all measurements. Since the Tomahawk has only a single power domain for all components, exact power numbers for individual components cannot be obtained. Therefore, we approximated the power numbers by ensuring that only the component under observation was running during the measurement, and by activating and deactivating that component between measurements. All measurement results were in the same range as power simulations on back-annotated place-and-route netlists. Table 1 summarizes the power and area results of the core components; more detailed results for the individual sub-components are given in [9].

Table 1: Programmable functional units and CoreManager implementation results.

Unit          Power/mW   Area/mm²   Memory area
SDSP              27       3.33       91.1 %
VDSP              85       3.80       79.5 %
CoreManager      282       5.95       24.3 %
DC212GP           30       2.50       15.8 %
LDPC             354       7.89       64.0 %
Deblocker         86       4.54       86.0 %

The CoreManager needs an average of about 60 clock cycles to schedule one task. Running at 175 MHz and consuming 282 mW, it thus dissipates about 100 nJ of energy per task. Compared to the roughly 500 nJ per task that would be required if the scheduling algorithm ran for 3000 cycles on a standard RISC processor such as the Tensilica DC212GP core, this is a significant improvement. In order to save power, the CoreManager explicitly switches off the clock of PEs which are not in use. This is done in addition to the clock gates which are available for all registers in the PEs. The CoreManager itself is not clock gated, which leaves room for significant power reductions in future designs. The presented CoreManager power numbers are simulated on a back-annotated place-and-route netlist; measuring the real power consumption is practically impossible, because the CoreManager cannot run without at least one DC212GP and the DSPs running and both NoCs under load.

Considering that both DC212GPs, all vector and scalar DSPs, the CoreManager and the LDPC decoder are fully loaded and running simultaneously, the overall dynamic power consumption of these components is 1260 mW. However, full utilization is unlikely to occur in practice. If we assume a more realistic utilization of 80% for each component, a dynamic power of about 950 mW would be observed. Adding the static power consumption of 130 mW and the clock tree power consumption of 445 mW (for all inactive components) yields 1525 mW core power consumption for a realistic application scenario. The large clock tree power of 445 mW is due to missing clock gating at all peripheral components and the CoreManager.

In order to compare the performance of the complete Tomahawk MPSoC with [10], we scaled the dynamic power consumption of the fully loaded chip by 0.69 · 0.5, where the first factor accounts for the voltage difference between 1.3 V and 1.0 V (i.e. (1.0 V)²/(1.3 V)² = 0.69) and the second factor reflects the process gain when going from 130 nm to 90 nm at constant frequency [5]. As can be observed in Figure 6, the Tomahawk outperforms existing designs by an order of magnitude, nearly achieving the ASIC results of [10].
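Spelled out, the energy per scheduled task and the normalization used for the comparison follow as

\[
E_{\text{task}} \approx \frac{60\ \text{cycles}}{175\ \text{MHz}} \times 282\ \text{mW} \approx 343\ \text{ns} \times 282\ \text{mW} \approx 97\ \text{nJ},
\]
\[
P_{90\,\text{nm},\,1.0\,\text{V}} \approx P_{130\,\text{nm},\,1.3\,\text{V}} \cdot \left(\frac{1.0\ \text{V}}{1.3\ \text{V}}\right)^{2} \cdot 0.5 \approx 0.35 \cdot P_{130\,\text{nm},\,1.3\,\text{V}}.
\]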


5.2 Applications Benchmarks

A part of an OFDM receiver with a configurable number of sub-carriers and an H.264 video decoder have been implemented on the Tomahawk platform as demanding application examples for modem and multimedia signal processing. The OFDM receiver consists of an FFT, the equalizer and the demapper. For a configuration with 256 sub-carriers and 16-QAM modulation, the accumulated task run-time of one OFDM symbol on a single processor is 14500 clock cycles. Linear scaling of the application can be achieved if multiple OFDM symbols are processed in parallel. The H.264 implementation is capable of decoding a QVGA stream at 25 frames per second. The scalability of this H.264 implementation is currently not competitive because the order of the macro-blocks is not optimized for parallel processing; a scale factor of only 1.2 was observed for two processing elements, and further processors did not speed up this implementation.

In order to show the scaling potential of the CoreManager, we implemented two benchmark applications. The first one starts tasks which are all independent of each other; the second one assumes a more realistic dependency rate of 50%. The length of the started tasks is configurable. Figure 7 shows the speedup achieved by both benchmarks with an accumulated 4 KByte of input and output data. The reason for the inferior performance gain in the test case with 2500 clock cycles is the limited memory bandwidth: transferring the 4096 bytes of input and output data takes about 1100 clock cycles, which is almost half the task execution time. At this ratio of memory transfer time to execution time, the DMA controllers are fully loaded but unable to serve all tasks with the required data in time.
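The memory bandwidth limit can be made explicit: for the 2500-cycle tasks, the data transfers alone occupy

\[
\frac{t_{\text{transfer}}}{t_{\text{exec}}} \approx \frac{1100\ \text{cycles}}{2500\ \text{cycles}} \approx 0.44
\]

of each task's run-time, so with several cores active the DMA controllers saturate before all processing elements can be supplied with data.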

Figure 5: Chip micrograph (visible blocks: CoreManager, six VDSPs, two DC212GPs, SDSP 0 and 1, LDPC-CC decoder, deblocking filter, 256 KByte scratchpad memory, DDR controller and PLLs).

Figure 6: Tomahawk, normalized to a 1 V, 90 nm process and 12-bit adder equivalents, added to a graph from [10].

Figure 7: CoreManager scaling for different task run-times with a fixed input and output size of 2 kByte each (speedup over the number of cores, 1 to 6, for task lengths of 2500, 5000 and 10000 cycles, each with 0% and 50% dependencies).

6. CONCLUSION AND FUTURE WORK

We presented Tomahawk, a low-power, heterogeneous MPSoC for signal processing applications. The major innovation of the Tomahawk is the hardware-based run-time scheduler called CoreManager, which is capable of automatically scheduling tasks to a set of processing elements. Task dependencies are analyzed at run-time and accounted for during scheduling. We consider the Tomahawk a proof of concept for CoreManager-based platforms. The CoreManager combines three advantages: the platform is easily programmable in a high-level language like C, it is scalable, and its energy efficiency is improved. Our future research efforts aim to improve the CoreManager in four ways. First, we aim to maximize the local reuse of data memories at the processing elements. Second, we are extending the CoreManager with explicit real-time support. Third, the effort required to use the CoreManager in many-core systems is being studied. Finally, a CoreManager-controlled dynamic power management with voltage and frequency scaling of the PEs is planned.

7. ACKNOWLEDGMENTS

We would like to acknowledge Prof. René Schüffny, head of the Parallel VLSI-Systems and Neural Circuits Chair at the TU Dresden, for providing his experienced team of researchers for the backend work. Furthermore, we would like to thank Frank Siebler, Pablo Robelly, Markus Ullmann, Johannes Lange, Arne Lehmann, Boris Boesler and Patrick Herhold for providing benchmark applications. We are also thankful to the ZMD AG Dresden for providing access to their wafer prober and to the Institute of Semiconductors and Microsystems at the TU Dresden for providing a cross-section polish image of a die. Finally, we would like to thank Synopsys, Tensilica and Altera for sponsoring software, IP and prototyping FPGAs. The major part of this research work has been done within the scope of the WIGWAM project, funded by the German Federal Ministry of Education and Research. A minor part was funded by the European Union within the scope of the E2R project.

8. REFERENCES

[1] K. Asanovic, R. Bodik, B. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 2006.
[2] P. Bellens, J. M. Perez, R. M. Badia, and J. Labarta. CellSS: a programming model for the Cell BE architecture. In Proceedings of the ACM/IEEE Supercomputing 2006 Conference, November 2006.
[3] M. Bimberg, M. Tavares, E. Matus, and G. Fettweis. A high-throughput programmable decoder for LDPC convolutional codes. In Proceedings of the 18th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP'07), Montreal, Canada, July 2007.
[4] G. Cichon, P. Robelly, H. Seidel, E. Matus, M. Bronzel, and G. Fettweis. Synchronous transfer architecture (STA). In Proceedings of the 4th International Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS'04), pages 126-130, Samos, Greece, July 2004.
[5] K. Flautner. The wall ahead is made of rubber. In 4th HiPEAC Industrial Workshop on Compilers and Architectures, Cambridge, UK, November 2007.
[6] J. Glossner, D. Iancu, M. Moudgill, G. Nacer, S. Jinturkar, S. Stanley, and M. Schulte. The Sandbridge SB3011 platform. EURASIP J. Embedded Syst., 2007(1):16-16, 2007.
[7] M. Horowitz and W. Dally. How scaling will change processor architecture. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, pages 132-133, February 2004.
[8] E. A. Lee and D. G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235-1245, 1987.
[9] T. Limberg, M. Winter, M. Bimberg, R. Klemm, E. Matus, M. Tavares, G. Fettweis, H. Ahlendorf, and P. Robelly. A fully programmable 40 GOPS SDR single chip baseband for LTE/WiMAX terminals. In Proceedings of the 34th European Solid-State Circuits Conference (ESSCIRC), Edinburgh, Scotland, September 2008.
[10] D. Markovic, B. Nikolic, and R. Brodersen. Power and area minimization for multidimensional signal processing. IEEE J. Solid-State Circuits, 42(4):922-934, April 2007.
[11] U. Ramacher. Software-defined radio prospects for multistandard mobile phones. Computer, 40(10):62-69, 2007.
[12] H. Seidel. A Task-level Programmable Processor. PhD thesis, WiKu, Duisburg, October 2006.
[13] O. Silven and K. Jyrkkä. Observations on power-efficiency trends in mobile communication devices. EURASIP J. Embedded Syst., 2007(1):17-17, 2007.
[14] M. Winter and G. Fettweis. Interconnection generation for system-on-chip design. In Proceedings of the International Symposium on System-on-Chip 2006, pages 91-94, Tampere, Finland, November 2006.