Reconfigurable hardware implementation of BinDCT


C.W. Murphy and D.M. Harvey

Enhancing the coding gain and operand throughput of a data compression application through dynamic hardware reconfiguration is described. A novel implementation of the BinDCT, a discrete cosine transform (DCT) approximation, has been realised on a run-time reconfigurable custom XC6200 field programmable gate array-based dynamic coprocessor, coupled to a TMS320C40 digital signal processor system.

Introduction: Xilinx XC6200 field programmable gate arrays (FPGAs) are among the few devices that exhibit dynamic run-time reconfigurable (RTR) properties. They have a FastMAP™ interface that facilitates partial reconfiguration [1]. While this occurs, unaffected regions of the FPGA remain active, enabling them to exhibit virtual hardware characteristics [2]. Virtual hardware can increase the computational density by reusing and redeploying FPGA hardware resources dynamically during run-time. The design of the BinDCT [3] algorithm has provided an application well suited to this technology. Using XC6264 FPGAs, a dynamic coprocessor implementation has been designed that demonstrates the feasibility of using RTR hardware to enhance system performance [4].

BinDCT algorithm: The BinDCT is a multiplier-less, fixed-point-friendly approximation of Chen's fast discrete cosine transform (DCT) [5] implementation. Within the BinDCT, the butterflies and plane rotations of the Chen DCT have been replaced by a series of dyadic lifting steps. Lifting steps can be considered as scaling and addition operations using integer, fixed-point-friendly values. A characteristic of lifting steps is that an input can be reconstructed from an output response without error if identical coefficient values are used in both operations. Using this property, nine dyadic lifting step configurations were developed [3], known as BinDCT-C1 to BinDCT-C9. Each generates an approximation of the DCT algorithm with a different degree of accuracy. The processing complexity of the configurations falls from BinDCT-C1 to BinDCT-C9 as the DCT approximation error increases.

Research: BinDCT configurations were evaluated using Ramp, DC level, Mexican Hat, Unit Step, and Impulse input functions. The results indicated that the BinDCT root mean square error (RMSE) was dependent upon the configuration used and the frequency content of the inputs [4]. Forward BinDCT approximations generated the largest RMSEs. For an input coefficient range from 0 to 255, the maximum RMSE of an 8-bit forward BinDCT (FBinDCT-C9) was 70.8 for a step input function but much less for other inputs. The largest RMSE for an 8-bit BinDCT-C1 was, however, much smaller at 0.89 for a step input function, and less for other inputs. The reverse BinDCT transforms' RMSEs were very small (typically 1.2 × …) and considered negligible. This can be attributed to the loss-less data reconstruction characteristics of lifting ladders, with numerical error due to finite word lengths alone to blame. The results showed that BinDCT-C9 achieved the greatest inherent data redundancy of all nine configurations while also exhibiting the greatest RMSEs.
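The reversibility property of lifting steps can be illustrated with a minimal Python sketch. The coefficients used here (3/8 for both steps) and the function names are illustrative assumptions, not values taken from any of the nine BinDCT configurations; the point is only that integer lifting steps with dyadic coefficients are undone exactly when the inverse reuses the same coefficients.

```python
def lift_forward(x0, x1, p=3, u=3, shift=3):
    """Apply two integer lifting steps with dyadic coefficients.

    The coefficients p / 2**shift and u / 2**shift are illustrative
    assumptions, not taken from a published BinDCT configuration.
    """
    y1 = x1 - ((p * x0) >> shift)  # first step: adjust x1 by a scaled copy of x0
    y0 = x0 + ((u * y1) >> shift)  # second step: adjust x0 by a scaled copy of y1
    return y0, y1


def lift_inverse(y0, y1, p=3, u=3, shift=3):
    """Undo lift_forward by reversing the step order and flipping the signs."""
    x0 = y0 - ((u * y1) >> shift)
    x1 = y1 + ((p * x0) >> shift)
    return x0, x1


def roundtrip_is_exact(lo=0, hi=256):
    """Check every input pair in [lo, hi) is reconstructed without error."""
    return all(
        lift_inverse(*lift_forward(a, b)) == (a, b)
        for a in range(lo, hi)
        for b in range(lo, hi)
    )
```

The right shift discards low-order bits in the forward direction, but the inverse recomputes and subtracts exactly the same truncated quantity, so no information is lost. If the inverse uses a different coefficient (for example `p=5`), the reconstruction no longer matches the input.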

To investigate the DCT coding gain, the forward transform coefficients of the most accurate (BinDCT-C1) and least accurate (BinDCT-C9) BinDCT configurations were used. For high frequency input data, higher compression ratios were obtained using BinDCT-C1 than BinDCT-C9. Paradoxically, for low frequency content inputs BinDCT-C9 generated greater loss-less compression ratios than BinDCT-C1. The design was therefore required to determine which transform configuration generated the greatest coding gain for each 8 × 8 pixel tile in an image. The optimal BinDCT configuration for each input sequence was determined by analysing the input data frequency content, which would normally vary within a real-time operation. If the BinDCT configuration remained static while the frequency content of the input image data-stream varied, optimal loss-less DCT coding gain and throughput would be unobtainable, because the BinDCT configuration most suited to each image tile would not always be the one configured. To correct this and achieve maximum compression at all times, the BinDCT configuration must be updated for each image tile during system operation. This operational notion provided the basis for increasing loss-less compression and operand throughput by using a dynamic RTR hardware implementation.

Hardware implementation: Dynamically reconfigurable forward and reverse BinDCT transforms have been constructed using VHDL and implemented on a DSP/dynamic hardware development platform. This consists of a commercial TIM-40 TMS320C40 (C40) parallel processor, a custom-designed XC6200 FPGA prototype environment, and software development tools [4, 6]. Both systems interact within a host PC, enabling the XC6200 FPGA to be configured as a C40 dynamic RTR coprocessor, and/or an inter-node routing and processing engine. Transforms BinDCT-C1 and BinDCT-C9 have been configured within an XC6264 FPGA functioning as a C40 memory-mapped coprocessor. Compared to the latest static FPGA architectures, XC6200 FPGAs suffered from limited and inefficient logic and routing resources, but were unique in being dynamically reconfigurable. These limitations restricted the BinDCT transform implementation to serial pipeline architectures and forced active transform selection to be conducted within the C40. The XC6264 hardware statistics of transforms BinDCT-C1 and BinDCT-C9 are listed in Table 1.

Table 1: XC6264 BinDCT hardware implementation characteristics

Configuration   Maximum frequency   BinDCT throughput at maximum frequency (ops/s)
                (MHz)               Pipeline full        Pipeline empty
FBinDCT-C9      5.3                 265.11k BinDCT       88.37k BinDCT
RBinDCT-C1      4.5                 225.27k BinDCT       37.54k BinDCT
RBinDCT-C9      4.17                208.46k BinDCT       69.48k BinDCT

Dynamic update of the active BinDCT transform was performed using a novel custom-designed 'self-configuration' control mechanism (Fig. 1). A self-configuration controller, together with interfaces to the C40 global bus and an external configuration store, was designed and run on the XC6264. The BinDCT processor hardware was implemented within the remaining XC6264 coprocessor function area, approximately 77% of the configurable logic cell (CLC) area. RTR was instigated by the C40 through the coprocessor's (XC6264) address space. Once RTR commenced, the XC6264 configuration was updated from the 262 kbyte configuration store. This process occurred using partial configuration through the XC6264's FastMAP™ interface, independently of the C40 or host PC. Custom software tools have been developed that reduce the volume of RTR configuration data required. These function by analysing two successive XC6200 configurations to determine the routing and logic changes that occur between them. To switch between transform configurations C1 and C9 requires 6401 (12 ms) and 6751 (12.5 ms) XC6264 address updates (via FastMAP™) for forward and reverse transforms, respectively. However, if configuration data could be stored on chip in a multi-context memory, reconfiguration would be instantaneous.

ELECTRONICS LETTERS 29th August 2002 Vol. 38 No. 18

[Fig. 1 block labels: XC6264 FPGA; CLC array; external configuration memory store]
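The idea behind the configuration-reduction tools, comparing two successive configurations and keeping only the differences, can be sketched as follows. This is a simplified illustration under stated assumptions: a configuration is modelled as a hypothetical dictionary from address to value, unlisted addresses are assumed to hold 0, and none of this reflects the actual XC6200 configuration format or the authors' tools.

```python
def config_delta(current, target):
    """Return the {address: value} writes that turn `current` into `target`.

    Both arguments map a configuration address to its value. This dictionary
    model is a hypothetical simplification, not the XC6200 format.
    """
    delta = {}
    for addr in sorted(set(current) | set(target)):
        new_value = target.get(addr, 0)  # assumption: unlisted addresses hold 0
        if current.get(addr, 0) != new_value:
            delta[addr] = new_value
    return delta


def apply_delta(current, delta):
    """Return a copy of `current` with the delta writes applied."""
    updated = dict(current)
    updated.update(delta)
    return updated


# Two made-up configurations that share most of their contents.
config_a = {0x10: 0xA5, 0x11: 0x3C, 0x12: 0x00}
config_b = {0x10: 0xA5, 0x11: 0x7E, 0x13: 0x01}
writes = config_delta(config_a, config_b)  # only the differing addresses
```

Only the addresses whose contents differ need to be written during reconfiguration, which is why the number of address updates, rather than the full configuration size, determines the switching time in a scheme of this kind.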

Using this dynamic technique, an additional 1956 forward transform DCT coefficients (Dynamic BinDCT minus True DCT) were at zero compared to the respective static configuration versions (BinDCT-C1: 38777, BinDCT-C9: 3335, True DCT: 38899, Dynamic BinDCT: 40855). This gave a DCT coding gain of 5%. For the test image (Fig. 2), 2D forward then reverse BinDCTs were run on the XC6264 designs to assess pixel errors. The static BinDCT-C1 had an RMSE of 0.0758, the static BinDCT-C9 an RMSE of 0.0439, and the dynamic BinDCT an RMSE of only 0.0027. The dynamic BinDCT was therefore more accurate than either of the static BinDCTs tried for the test image.

At the beginning of this research, XC6200 FPGAs were the only commercially available dynamic RTR FPGAs. Limitations within the XC6200 FPGA architecture restricted BinDCT transform implementations and RTR performance. Compared with cutting-edge FPGA and processor technologies, the operating properties of this implementation appear inefficient. However, this image processing application has demonstrated in practice how RTR hardware can be used to increase both compression ratios and operand throughput. Combined with the loss-less compression properties of the BinDCT configurations, dynamic BinDCT implementation provides a basis for enhancing speed-dependent, low-power compression applications.
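The 5% coding gain can be checked from the zero-coefficient counts quoted in this section. The short Python sketch below only repeats that arithmetic; the dictionary and function names are illustrative.

```python
# Zero-valued forward transform coefficient counts quoted in the text.
ZERO_COEFFS = {"True DCT": 38899, "Dynamic BinDCT": 40855}


def relative_gain(dynamic, reference):
    """Fractional increase in zero coefficients relative to the reference."""
    return (dynamic - reference) / reference


extra_zeros = ZERO_COEFFS["Dynamic BinDCT"] - ZERO_COEFFS["True DCT"]
gain_percent = 100 * relative_gain(
    ZERO_COEFFS["Dynamic BinDCT"], ZERO_COEFFS["True DCT"]
)
```

Here `extra_zeros` is the 1956 additional zero coefficients, and `gain_percent` rounds to the 5% figure.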

Fig. 1 XC6200-based dynamic coprocessor topology

Operating characteristics of the transforms (Table 1) demonstrated that, for both forward and reverse transforms, BinDCT-C9's throughput was greater than that of BinDCT-C1, since the computational complexity of BinDCT-C9 was less than that of BinDCT-C1. BinDCT-C1 transforms required six-stage pipeline designs, whereas BinDCT-C9 required three-stage pipeline designs. For similar clocking speeds the forward and reverse BinDCT-C9 was, respectively, 205 and 185% faster than the BinDCT-C1 with the pipelines initially empty. Therefore, by dynamically switching between transforms, real performance advantages can be gained. This is an inherent feature of the BinDCT algorithm, regardless of implementation.
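The relationship between the Table 1 figures and the pipeline depths can be examined with a few lines of Python. The values below are read from Table 1 (the FBinDCT-C9 pipeline-full entry is taken as 265.11k). The interpretation in clock cycles is an inference from the arithmetic, not a statement made in the letter.

```python
# (maximum frequency in Hz, pipeline-full ops/s, pipeline-empty ops/s), Table 1
TABLE_1 = {
    "FBinDCT-C9": (5.3e6, 265.11e3, 88.37e3),
    "RBinDCT-C1": (4.5e6, 225.27e3, 37.54e3),
    "RBinDCT-C9": (4.17e6, 208.46e3, 69.48e3),
}
FULL, EMPTY = 1, 2  # column indices into the tuples above


def cycles_per_transform(name, column):
    """Clock cycles per BinDCT operation implied by a Table 1 entry."""
    freq_hz = TABLE_1[name][0]
    return freq_hz / TABLE_1[name][column]


def throughput_ratio(fast, slow, column):
    """Ratio of two throughput figures taken from the same column."""
    return TABLE_1[fast][column] / TABLE_1[slow][column]


reverse_empty_ratio = throughput_ratio("RBinDCT-C9", "RBinDCT-C1", EMPTY)
```

With these values, `reverse_empty_ratio` is about 1.85, matching the 185% figure quoted for the reverse transforms. The implied cycle counts come out at roughly 20 per operation with the pipeline full, and roughly 60 (three-stage BinDCT-C9) or 120 (six-stage BinDCT-C1) with the pipeline empty, which is consistent with about 20 cycles per pipeline stage.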

Experimental application: To demonstrate two-dimensional (2D) dynamic BinDCT operation, 8 × 8 pixel kernels within a standard 512 × 512 grey scale image (Fig. 2) were analysed to determine which BinDCT forward transform generated the greatest DCT coding gain. The results indicated that 529 out of 4096 (13%) kernel operations exhibited greater inherent loss-less compression using configuration BinDCT-C9 than BinDCT-C1. The distribution of BinDCT configuration usage within the source image (Fig. 2) is shown in Fig. 3, with the BinDCT-C9 kernel operation locations represented in black and the BinDCT-C1 operations in white.
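The per-kernel selection described above can be sketched in Python as follows. Several things here are assumptions made for illustration: coding gain is approximated by the number of zero-valued output coefficients, ties go to the first configuration listed, and the two transforms are simple difference operators standing in for real BinDCT configurations, which are not implemented here.

```python
def count_zeros(coeffs):
    """Number of zero-valued coefficients in a 2D list."""
    return sum(1 for row in coeffs for c in row if c == 0)


def select_configuration(tile, transforms):
    """Name of the transform giving the most zeros; ties go to the first listed."""
    best_name, best_zeros = None, -1
    for name, transform in transforms.items():
        zeros = count_zeros(transform(tile))
        if zeros > best_zeros:
            best_name, best_zeros = name, zeros
    return best_name


def configuration_map(image, transforms, size=8):
    """Select a configuration for every size x size tile of a 2D list image."""
    result = []
    for top in range(0, len(image), size):
        row_of_choices = []
        for left in range(0, len(image[0]), size):
            tile = [r[left:left + size] for r in image[top:top + size]]
            row_of_choices.append(select_configuration(tile, transforms))
        result.append(row_of_choices)
    return result


# Stand-in transforms (hypothetical, NOT BinDCT): first differences.
def horizontal_diff(tile):
    return [[r[0]] + [r[i] - r[i - 1] for i in range(1, len(r))] for r in tile]


def vertical_diff(tile):
    cols = range(len(tile[0]))
    return [list(tile[0])] + [
        [tile[r][c] - tile[r - 1][c] for c in cols] for r in range(1, len(tile))
    ]


TRANSFORMS = {"config_a": horizontal_diff, "config_b": vertical_diff}
stripes = [[r] * 8 for r in range(8)]                    # constant along rows
columns = [[c for c in range(8)] for _ in range(8)]      # constant along columns
```

A tile that is constant along each row is best served by `config_a`, and one that is constant along each column by `config_b`. Applied to a whole image, `configuration_map` yields a grid of per-tile choices of the kind visualised in Fig. 3.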

Conclusions: The implementation of dynamic coprocessor hardware has demonstrated how RTR implementations can improve the compression ratio, accuracy and operand throughput of a BinDCT application. Through RTR of two BinDCT modes (BinDCT-C1 and BinDCT-C9), image data DCT coding gain has increased by 5%, and operand throughput can be increased by over 200%. With improved dynamic semiconductor technologies, reconfiguration times will decrease and all nine BinDCT configurations could be used, allowing even more dramatic results.

© IEE 2002
28 May 2002
Electronics Letters Online No: 20020711
DOI: 10.1049/el:20020711

C.W. Murphy and D.M. Harvey (Coherent Electro Optics Research Group, School of Engineering, Liverpool John Moores University, Byrom Street, Liverpool, L3 3AE, United Kingdom)

References

1 'XC6200 FPGA family data sheet', April 1997, (Xilinx, Version 1.10)
2 SINGH, S., and BELLEC, P.: 'Virtual hardware for graphics applications using FPGAs'. Field Programmable Custom Computing Machines (FCCM94), Napa Valley, USA, April 1994, (IEEE Computer Society Press), pp. 49-58
3 LIANG, J., and TRAN, T.D.: 'Fast multiplierless approximations of the DCT with the lifting scheme', IEEE Trans. Signal Process., 2001, 49, pp. 3032-3044
4 MURPHY, C.: 'Run-time reconfigurable DSP parallel processing system using dynamic FPGAs'. Ph.D. Thesis, LJMU, 2002
5 CHEN, W.H., SMITH, C.H., and FRALICK, S.C.: 'A fast computational algorithm for the discrete cosine transform', IEEE Trans. Commun., 1977, COM-25, (9)
6 MURPHY, C., HARVEY, D., and NICOLSON, L.: 'Dynamic configurable DSP parallel processing architecture'. IASTED Int. Conf. Applied Informatics, Int. Symp. on Parallel and Distributed Computing and Network, Innsbruck, Austria, February 2002, pp. 13-18

Fig. 2 Benchmark image

Fig. 3 BinDCT distribution
