ePUMA: Embedded Parallel DSP Processor Architecture with Unique Memory Access

Jian Wang, Joar Sohl, Olof Kraigher, Lan Dong and Dake Liu
Electrical Engineering Department, Linköping University, Sweden
Email: {jianw, joar, kraigher, lan, dake}@isy.liu.se

Abstract

ePUMA[1] is an ongoing project in the Division of Computer Engineering at Linköping University, Sweden. It is supported by the SSF, the Swedish Foundation for Strategic Research. The goal of the project is to develop a parallel ASIP DSP processor for real-time stream computing. The essential technology is to separate data access kernels from arithmetic computing kernels so as to hide or minimize data access time by running the two types of kernels in parallel. Both our hardware system and programming flow are primarily aimed at low power, low cost and high performance embedded parallel computing for real-time communication and media signal processing.

Keywords

Parallel DSP; SIMD processor; conflict-free memory access; data permutation; parallel programming; template based programming

1 Introduction

To meet the continuously increasing computational load of various signal processing applications, DSP processor design has gone through a development process that explores different types and levels of parallelism: for example, data parallelism in SIMD extensions or vector machines, instruction level parallelism in VLIW processors, and task parallelism in multi-core computers. These parallel DSP architectures have shown their advantages through several industrial successes. The SIMD extension architecture, such as PowerPC's AltiVec, offers parallel data processing with a single instruction to improve performance at very low, almost negligible, extra power cost. The VLIW architecture, used in TI's DaVinci media DSP processor, efficiently utilizes hardware resources by running several instructions in parallel. However, it is hard for these architectures to handle today's high performance embedded computing, characterized by much larger data volumes and more complex processing algorithms, especially in stream processing applications with additional real-time constraints. Such applications include MIMO communications, HD video codecs, and radar signal processing.

In recent years, a new trend in parallel DSP processor design has been towards a master-multi-SIMD architecture, for example the IBM Cell processor. This architecture can bring more computing power from both the multi-core topology and the SIMD datapath. However, it also comes with several important design challenges, which are the research focus of our ePUMA project. The hardware design challenges of ePUMA stem from the memory access latency problem. Previous research has shown that for parallel DSP architectures, performance becomes very much communication limited[2]. Our idea is to define separate data access kernels and computing kernels, and run them in parallel to hide or minimize the memory access latency. Another important source of access latency in SIMD or vector operations is vector data load and store. To improve on today's time-consuming SIMD shuffle instructions operating on the vector register file, we use the P3RMA solution proposed in [2]. Other references on conflict-free parallel memory access can be found in [3] and [4]. Moreover, the on-chip communication architecture, the memory architecture and an intelligent DMA controller also pose design challenges in reducing memory access latency and consequently improving performance. The software design challenges of the ePUMA project include the parallel programming model and the compiler tool-chain. We choose kernel based parallel programming, which is compatible with the new OpenCL™[5] standard. OpenCL™ is the first open standard for parallel programming of heterogeneous platforms. Another idea in the ePUMA project is intelligent kernel identification: by running profiling tools on application software, we extract the potential for parallelism and form the data access kernels. Some research on this has been done in our group[4].

The following sections generally describe the ePUMA hardware architecture (Section 2) and the programming model (Section 3). After these, some performance benchmarks are presented (Section 4). At the end there is a conclusion (Section 5).

2 Hardware Architecture

Figure 1: ePUMA hardware architecture overview (a master RISC core with DMA controller and main memory, and eight 8-way SIMD coprocessors SIMD 1-8, connected by a star network and a ring network through nodes N1-N8; DMA data permutation moves data between main memory and each SIMD core's local memories PM, CM and the multi-bank DM, Bank-0 to Bank-7, while a SIMD data permutation stage feeds the 8-way SIMD core)

As shown in Figure 1, the ePUMA platform uses a heterogeneous chip multiprocessing architecture consisting of one host controller and eight SIMD coprocessors. It decouples program control from data processing to maximize computing performance. The host controller is a RISC processor; it controls the coprocessors and the memory subsystem so that computation and data transfers can proceed independently. Each SIMD coprocessor has an 8-way parallel data path, and together the eight SIMD processors deliver the majority of the system's computing power.

The ePUMA memory subsystem, including the DMA controller, the on-chip interconnection and the SIMD local memory system, is designed to maximally reduce the data access overhead for predictable embedded DSP applications. The essential technologies are the enhanced DMA controller and the software programmable SIMD local memory system. Conflict-free parallel access to multi-bank memory is important for achieving high execution efficiency of SIMD programs. The ePUMA memory subsystem supports software programmable data allocation in the SIMD local vector memory, and the SIMD instruction set supports lookup table based vector addressing. Combining these two, we can achieve very flexible access to multi-bank memory without bank conflicts, as illustrated by the example in Figure 2.

Figure 2: Conflict-free multi-bank memory access (elements 0-15 in global memory are distributed by the DMA permutation over the four banks of the SIMD local memory: Bank-0 holds 0,7,10,13; Bank-1 holds 1,4,11,14; Bank-2 holds 2,5,8,15; Bank-3 holds 3,6,9,12; after the SIMD permutation, both load.vector[0,1,2,3] and load.vector[0,4,8,12] access four different banks)
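To make the Figure 2 example concrete, the short C model below reproduces the skewed layout drawn in the figure, placing element i of a 4×4 matrix in bank (i/4 + i%4) mod 4, and checks that both vector accesses, load.vector[0,1,2,3] and load.vector[0,4,8,12], hit four different banks. The mapping function is only an illustration of one conflict-free layout; on ePUMA the actual placement is programmed through the DMA permutation and the lookup tables.

#include <stdio.h>

/* Skewed bank mapping for a 4x4 matrix in a 4-bank memory:
 * element i (row r = i/4, column c = i%4) goes to bank (r + c) % 4,
 * which reproduces the bank contents drawn in Figure 2. */
static int bank_of(int i) { return (i / 4 + i % 4) % 4; }

/* Returns 1 if the four lanes of a vector access hit four distinct banks. */
static int conflict_free(const int idx[4]) {
    int used[4] = {0, 0, 0, 0};
    for (int k = 0; k < 4; k++) {
        int b = bank_of(idx[k]);
        if (used[b]) return 0;   /* two lanes hit the same bank: conflict */
        used[b] = 1;
    }
    return 1;
}

int main(void) {
    int row[4] = {0, 1, 2, 3};   /* load.vector[0,1,2,3]  (matrix row)    */
    int col[4] = {0, 4, 8, 12};  /* load.vector[0,4,8,12] (matrix column) */
    printf("row access conflict-free: %d\n", conflict_free(row));
    printf("column access conflict-free: %d\n", conflict_free(col));
    return 0;
}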

3 Programming Model

In the ePUMA kernel based parallel programming model, the software is partitioned into a master program running on the host processor and kernel programs running on the SIMD coprocessors. The host program is written in C with certain extension functions. These extensions are for system control, including DMA data transactions, multi-core synchronization, on-chip interconnection configuration and task scheduling. Listing 1 shows a master program that sets up a DMA data transaction using the ePUMA C extensions. Another task of the host program is to execute the sequential parts of the application.

The SIMD coprocessors are programmed in assembly, using ePUMA SIMD instructions with lookup table based memory access. We decided to program the SIMD coprocessors in assembly for two reasons. The first is to extract the full performance of the SIMD core by programming at a low level. The second is that today's compiler techniques cannot yet deliver satisfactory automatic vectorization, especially for our ePUMA SIMD processor, whose vector operations are more powerful and complex than those of conventional SIMD architectures. Listing 2 gives an example of an ePUMA SIMD assembly program.

Listing 1: ePUMA host program

/* Define a DMA task */
DMATask T1 {
    TYPE     = MM_LVM;
    MODE     = BROADCAST;
    SIMD_SEL = 0x000F;
    SIZE     = 128;
    VAGU {                    // SIMD0 VAGU-1 configuration
        SIMD_ID = 0;
        SC = {
            INIT = 0; TOP = 512; BOT = 0;
            STRIDE = 8; INCR = 1;
        };
        PT = {
            VAL_SEL = 1; INIT = 0; TOP = 5;
            BOT = 0; INCR = 1;
        };
        LUT = {
            7, 6, 5, 4, 3, 2, 1, 0;
            0, 9, 18, 27, 36, 45, 54, 63;
        };
    };
};

/* Configure DMA controller */
int dma_hdl;
dma_hdl = ConfigDMATask(T1);
/* Start DMA transaction */
StartDMATask(dma_hdl);
/* Some other tasks */
...
/* Wait for DMA finish */
WaitOnDMAFinish(dma_hdl);
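Listing 1 configures and launches a single DMA task. The fragment below sketches, in the same style, how the host extension functions could be combined with kernel execution to overlap data transfer with computation, which is the latency-hiding strategy described in Section 1. The kernel-control calls RunKernel and WaitKernelFinish, the task descriptors T_in[2], and the constants NUM_BLOCKS and KERNEL_ID are hypothetical names used only to illustrate the ping-pong pattern.

/* Hypothetical double-buffering flow on the host: while the SIMD
 * coprocessors compute on one local buffer, the DMA controller fills
 * the other. Only ConfigDMATask, StartDMATask and WaitOnDMAFinish are
 * taken from Listing 1; the remaining names are illustrative placeholders. */
int dma[2];
dma[0] = ConfigDMATask(T_in[0]);    /* transfer into local buffer 0 */
dma[1] = ConfigDMATask(T_in[1]);    /* transfer into local buffer 1 */

StartDMATask(dma[0]);               /* prefetch the first data block */
for (int i = 0; i < NUM_BLOCKS; i++) {
    int cur = i % 2;
    WaitOnDMAFinish(dma[cur]);      /* data for this block has arrived */
    if (i + 1 < NUM_BLOCKS)
        StartDMATask(dma[1 - cur]); /* fetch the next block in parallel */
    RunKernel(KERNEL_ID, cur);      /* compute kernel on current buffer */
    WaitKernelFinish(KERNEL_ID);
}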

Listing 2: ePUMA SIMD program

.code
kernel_start:
    vcopy  vr0 m0[cm[0]].vw
    vcopy  vr1 m1[cm[car0+=1]].vw
    vcopy  vr2 m0[ar0+=S + cm[car0+=1%]].vw
    vadd   vr3 vr0 vr1
    vsub   m0[ar0+=8].vw m1[ar1+=8].vw vr2
    256*   vadd m0[ar0+=8].vw m1[ar1+=8].vw vr3
    vscmac vacr m0[ar2+=2].sd m1[ar0+=8%].vd
    ...
    intq   0xa1
    stop
    jmpq   kernel_start
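The SIMD instructions in Listing 2 use the lookup table based vector addressing mentioned in Section 2: each lane fetches its element through an index taken from a software-loaded permutation table. As a rough functional model, and only under the simplifying assumption that the table holds one index per lane, such a vector load behaves like the C gather below; the names lvm, lut, VLEN and the 16-bit element type are placeholders, not ePUMA identifiers.

#include <stdint.h>

#define VLEN 8                        /* 8-way SIMD data path */

/* Functional model of a LUT-addressed vector load: lane k of the
 * destination register receives lvm[lut[base + k]]. With a table such as
 * {7,6,5,4,3,2,1,0} this model would load a vector in reversed order. */
static void vload_lut(int16_t dst[VLEN], const int16_t *lvm,
                      const uint16_t *lut, int base)
{
    for (int k = 0; k < VLEN; k++)
        dst[k] = lvm[lut[base + k]];
}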

4 Benchmark Results

The current benchmarking programs are executed on the ePUMA cycle-true and pipeline-accurate simulator. A fully functional simulator, including components such as the host processor, the SIMD coprocessors, the on-chip interconnection and the memory subsystem, has been developed in C++. The command line simulation environment and the GUI have been developed in Python. We assume an external memory access throughput of 128 bits per cycle. The benchmark results of selected DSP kernel algorithms are listed in Table 1, which compares a non-optimized SIMD processor, one ePUMA SIMD processor and eight ePUMA SIMD processors.



Table 1: Current benchmarking (cycle cost)

Algorithms: 64×64 Real Matrix Mul., 64×64 Complex Matrix Mul., 64×64 LR-factorization, 8k FFT, 64k FFT, 8k 8-tap FIR
General (non-optimized) 8-way SIMD: 275342, 82166, 73216
One ePUMA 8-way SIMD: 42531, 29435, 30720, 9127
Eight ePUMA 8-way SIMDs (8×8-way): 5926, 10921, 9472, 81920, 3624

5 Conclusion


The ePUMA project aims at providing a parallel DSP solution for high performance embedded applications. With research focused on reducing memory access latency and on a programmer-friendly parallel programming flow, as well as on low silicon cost and low power consumption, we expect to achieve three times higher performance per silicon area and half the power consumption compared with COTS products.

Acknowledgements The authors would like to thank SSF, Swedish Foundation for Strategic Research, for the support of this project.

References

[1] "ePUMA: embedded Parallel computing architecture with Unique Memory Access." http://www.da.isy.liu.se/research/scratchpad/.

[2] D. Liu, Embedded DSP Processor Design, ch. 20. Linköping, Sweden: Morgan Kaufmann, 2008.

[3] M. Gössel, B. Rebel, and R. Creutzburg, Memory Architecture and Parallel Access. New York, USA: Elsevier Science Inc., 1994.

[4] B. Lundgren and A. Ödlund, "Expose of patterns in parallel memory access," M.Sc. thesis, Linköping University, Linköping, Sweden, Sept. 2007.

[5] "OpenCL - the open standard for parallel programming of heterogeneous systems." http://www.khronos.org/opencl/.