a lowpass filter and downsampling the image (decimation part) which results in the low frequency coefficients. Based on this coarse version, the original image ...
Submitted to the 18th Iranian Conference on Electrical Engineering (ICEE)
Low Computational Complexity Hardware Implementation of Laplacian Pyramid Seyed Mohammad Ali Zeinolabedin, Nader Karimi, Shadrokh Samavi Department of Electrical and Computer Engineering Isfahan University of Technology, Isfahan, Iran.
Abstract: In this paper a new implementation of the Laplacian Pyramid (LP) algorithm is proposed which uses the polyphase representation and noble identities to facilitate a new pipeline architecture. Our approach saves a large number of mathematical operations which results in the reduction of power consumption. Furthermore, the proposed architecture decreases the number of employed resources as compared with the existing designs. The implementation results reveal the correct functionality of the proposed architecture.
Multiresolution data representation is a powerful tool in many signal processing applications. An example of such a representation is the Laplacian pyramid (LP) presented by Burt et al . The LP is initially used for image coding which removes image correlation by combining features of predictive and transform methods. The LP can be summarized as the following. It produces a coarse approximation of the original signal by applying a lowpass filter and downsampling the image (decimation part) which results in the low frequency coefficients. Based on this coarse version, the original image is predicted by upsampling and filtering (interpolation part) which produces the high frequency coefficients. Then the difference is calculated as the prediction error . The LP has been used in various algorithms such as compression [1, 3], SIFT (Scale-Invariant Feature Transform) , image fusion routines , and video communication . The LP plays an important role in all of the mentioned algorithms. Some of these algorithms are also implemented on hardware devices [5-9]. Reference  is totally dedicated to the implementation of the LP. In  a method is proposed to implement the Laplacian pyramid image fusion by FPGA. The first part of that work is devoted to implement the LP. A 3 × 3 decimation filter is used while the interpolation filter performs the averaging function. The two previously processed rows are preserved in the FPGA and with the arrival of the third row the decimation filter starts its operation. These results are delivered to the interpolation
filter in this part two different strategies are observed. The column interleaving operation, on the even lines, is used while the row interleaving operation, on the odd lines, is applied. The specified 3-tap filter in the interpolation part results in loss of generality. They realize real-time image fusion for dual channels 640×480 images at a rate of 25 frames per second (fps). In  an FPGA-based implementation is presented for a computationally complex stereo disparity algorithm. The design consists of four stages. One of them is the Scale-Orientation Decomposition (SOD) stage. In this stage a 3-level decimation is performed. Each level of this pyramid is built by low-pass filtering using a 3-tap FIR and decimating by a factor of two. The system outputs 8-bit, subpixel disparity estimates for 256×360 pixel images at 30 fps. In this work only the low frequency coefficients are produced and there is no part devoted to the interpolation. In  the implementation of the Laplacian pyramid is considered by using a reconfigurable attached processor, called SPLASH 2. This design consists of a number of FPGAs. One of the FPGAs performs a row convolution and resamples to eliminate half of the pixels of a row. It then passes the results to a second FPGA. Then the column convolution is performed. Each of these computations requires five image pixels from five successive rows. Hence, four previous rows need to be stored and accessed simultaneously. This requires storing of image rows in a memory. Finally the image is resampled in the vertical dimension and the complete image is stored in the memory. In  a pipeline is proposed for the implementation of the LP. In that work the legacy implementation of the LP is converted into a pipeline by inserting buffers within the structure. In  an ASIC chip, called PYR, is developed to perform the standard filter and resampling operations required in the LP and the inverse pyramid transforms. The mentioned chip simultaneously constructs the decimation and interpolation parts from a 512 by 480 image with a speed of 44 fps. Image samples pass through the chip sequentially, in raster scan order. The
PYR chip computes the filter operations on images in a pipeline fashion, as samples are transferred from one memory region to another. Resampleing is performed as samples are read from and then written into memory. In all of the mentioned papers, there are some major drawbacks. All of these designs are based on the idea that resampling should be applied after the convolution is performed and the decimation part should be before the convolution in the interpolation part. Firstly, the above mentioned strategy causes substantial redundancy in the arithmetic operations that are performed during the convolution stage. When the low frequency coefficients are extracted we see that 50% of the additions and multiplications are redundant. While we are calculating the high frequency coefficients half of the multiplications multiply a number by zero and more than 50% of the additions are adding a number with a zero. Secondly, in the interpolation part half of the FIFO's that are required, for processing of the columns, can be eliminated. Furthermore, for an 𝑁 × 𝑁 input image, half of the entries of each FIFO with the length of 𝑁 are filled with zeros. Thirdly, there is a delay between the production of the last coefficient of any row and the start of the production of the coefficients of the next row. This delay 𝑁 is equal to 2 for an 𝑁 × 𝑁 image. To sum up, the mentioned drawbacks cause an increase in the output delay and a waste in the allocation of the resources and power. In this paper a new solution is discussed to solve the quoted shortcomings by offering a pipeline architecture. In our implementations we have used the noble identities  and polyphase representation  to achieve results that are same as the results that other implementations of the LP produce but lack the above mentioned shortcomings. We show that when the results for one row are produced the arrival of the results of the next row is started without any latency. This is unlike the above mentioned implementations and causes a utilization of 100%. The proposed architecture eliminates a number of computational steps which can result in lower power usage. The organization of paper is as follows. In section 2 some preliminaries about the LP are discussed. Section 3 describes the proposed method, while section 4 presents the implementation results. In section 5 computational complexity of the proposed architecture is analyzed and section 6 concludes the paper. 2.
where 𝑙 = 1, 2, … , 𝑡 which shows that there is a maximum of 𝑡 levels. Also, 𝐶0 is the input image, and r indicates the length of the h filter. The high frequency coefficients of the LP, denoted by 𝐷𝑙 , are defined by (2) and (3) based on g filter : 𝐷𝑙 = 𝐶𝑙 − 𝐷′ 𝑙+1
, 𝐷𝑡 = 𝐶𝑡
𝐷′𝑙 𝑖, 𝑗 = 4
𝑔 𝑚, 𝑛 𝐷′𝑙 𝑚 =−𝑟 𝑛=−𝑟
C Decimation Part
ℎ 𝑚, 𝑛 𝐶𝑙−1 2𝑖 + 𝑚, 2𝑗 + 𝑛
Fig. 1: The general structure of the Laplacian pyramid.
By using noble identities we could simplify multirate systems. In construction of separable filters, such as ℎ and 𝑔 of Fig. 1, noble identities can be applied to obtain polyphase representation . An important advancement in the multirate signal processing is the polyphase representation. This permits great simplification and also leads to computationally efficient implementations of decimation and interpolation filters. There are two kinds of polyphase representations as are denoted by Equations (4) and (6). In these equations 𝐻(𝑧) and 𝐺(𝑧) are the z-transforms of a given filter where 𝑀 and L are the decimation and Interpolation values : 𝑀−1
𝐸𝑙 𝑧 𝑀 𝑧 −𝑙
H z =
ℎ 𝑀𝑛 + 𝑙 𝑧 −𝑛𝑀
𝐸𝑙 𝑧 =
𝑚 =−𝑟 𝑛=−𝑟
where 𝐷′𝑙 = 𝐶𝑡 and 𝑙 = 1, 2, … , 𝑡. Let us assume that the image is represented by 𝐶0 . This image becomes the zero level of the LP. Pyramid level 1 contains image 𝐶1 , which is a low-pass filtered version of 𝐶0 . Each value within level 1 is computed as a weighted average of the values in level 0 within the ℎ filter. Each value within level 2, represented by 𝐶2 , is then obtained from values within level 1 by applying the same pattern of weights. If we apply (2) once to image 𝐶1 , within a 𝑔 filter, we obtain 𝐷′1 which has the same size as the original image 𝐶0 . We will use it to obtain the high frequency coefficients.
𝑅𝑙 𝑧 𝑀 𝑧−(𝐿−1−𝑙)
G z = 𝑙=0
𝑔 𝐿𝑛 + 𝐿 − 1 − 𝑙 𝑍 −𝑛
𝑅𝑙 𝑧 = 𝐸𝑀−1−𝑙 𝑧 = 𝑛=−∞
𝐶𝑙 𝑖, 𝑗 =
𝑖−𝑚 𝑗−𝑛 , 2 2
The general structure of the LP scheme is shown in Fig. 1. The low frequency coefficients are denoted as 𝐶𝑙 and are defined in (1) in which i, j signify the pixel position . 𝑟
In Equations (4) and (6), 𝐸𝑙 and 𝑅𝑙 are called polyphase components which decompose filters into 𝑀 and L parts. Using these equations and noble identities can facilitate the switching of the position of the
decimator with that of its filter. The same switching phenomenon can be applied to the interpolator and its corresponding filter. In the next section more details about the LP simplification are presented. 3.
In this section we go over the details of the proposed structure. With the application of polyphase representations of Equations (4) and (6), and the application of the noble identities block diagram of Fig. 2 is proposed for the implementation of the LP. Delay
Fig. 2: Proposed structure for Laplacian Pyramid
In Fig. 2, two rows of the image, Row2i-1 and Row2i 𝑁 where 𝑖 varies from 1 to 2 for an 𝑁 × 𝑁 image, are processed concurrently. Therefore, high frequency coefficients of the two mentioned rows (HFC2i-1, HFC2i), are produced concurrently, too. In the other hand the low frequency coefficients ( 𝐿𝐹𝐶𝑖 ) are produced by the proposed structure of Fig. 2. To reduce the number of computational steps we applied one dimensional filter rather than a two dimensional version. It leads to separable strategy of convolution in which rows are processed by ℎ𝑥 and ℎ𝑦 filters and the results of them are processed by 𝑔𝑥 and 𝑔𝑦 filters. We decompose 𝐻 and 𝐺 , for the LP, with Equations (4) and (6), given that both 𝑀 and 𝐿 are taken equal to 2 by which 𝐸0 , 𝐸1 , 𝑅0 and 𝑅1 are produced. To start a valid convolution, from each row of the image two pixels are needed. Afterwards, results from the horizontal convolution are produced consecutively. In Fig. 2 ℎ𝑥 , shown by a dotted rectangle, sequentially calculates the horizontal convolution. For the vertical convolution we should analyze the even-indexed results by 𝐸0 and the odd-indexed results by 𝐸1 . Then the results are added together to produce the low frequency coefficients of the input image which is shown in Fig. 1 simultaneously convolved by 𝑅0 and 𝑅1 and then the results are interleaved. The interleaved results are the outcome of the horizontal convolution in the interpolation stage of the LP. These results are processed again by 𝑅0 and 𝑅1 but there is no interleaving of the outcomes and the outputs of the two filters are the high frequency responses. In the proposed method, two pixels of two adjacent rows are processed simultaneously and therefore two similar hardware units for the row convolution (ℎ𝑥 ) are used in the decimation part. The results from the row convolutions are combined in the column convolution (ℎ𝑦 ) in the decimation part. This policy, in contrast to
the traditional designs, produces the output without any time-lapse and the low and high frequency coefficients are generated consecutively. In Fig. 2 the "Delay" block denotes that synchronization is required between the input pixels and the output of the interpolation part. These Delay blocks are realized by very short FIFOs that deliver the pixels to the subtraction unit at the correct time. It is also worth mentioning that in Fig. 2 considering two inputs form two rows is a kind of decimation and hence, row-decimation is implicitly performed. The pipeline implementation of the proposed method is shown in Fig. 3. In stage 1 two consecutive pixels of a row are stored in block 1 and both of them are sent simultaneously to the next stage of the pipeline. Same thing happens for the pixels of the other row. This requires two separate clocks for this hardware where arrival of the pixels are controlled by a clock (clk1) and the dispatch of the two pixels are triggered by a second clock (clk2) with half the frequency of clk1. In stage2 horizontal convolution is performed. In stage3 sufficient number of data, produced in previous stages, is stored by FIFOs. In stage4 the vertical convolution is applied and the low frequency coefficients are calculated. These stages belong to the decimation part of the LP. In stage5 the horizontal convolution is done and sufficient number of data, which is needed to performing the vertical convolution, is stored in FIFOs of stage6. Finally the vertical convolution is being done in stage7 and the high frequency coefficients of two adjacent rows are simultaneously produced. 4.
For implementing the LP algorithm, we used the 9/7 filter that is introduced in  and some standard images for evaluation of proposed method. All of the arithmetic operations are done in fixed-point mode; therefore, we quantized our results in three different places. The quantization was applied to the input data, coefficients of the filters, and the results of the arithmetic operations. Input data are constrained to 8 bits but for the filter coefficients, the selection of sufficient number of bits is more complicated. Two important parameters should be considered. Firstly, the frequency response of the quantized filter should be preserved. Hence, the sensitivity function  can be used to evaluate the effects of the quantization. Secondly, the quality of the outputs should be acceptable. For this reason, we start with the quantization of the filter coefficients of the decimation part and keep the rest of the stages in an ideal form. Under these circumstances we use MSE, as shown in Fig. 4, to test the quantization of the low frequency coefficients. Then the proper number of bits for a desired MSE can be figured out. After the quantization of the decimation coefficients, we can go to the interpolation part. MSE for the high frequency coefficients and PSNR between ideal and quantized reconstructed images are shown respectively in Figs. 5 and 6 for various amount of
Decimation Part Stage1
Fig. 3: Implemented architecture of the proposed method.
quantizations. The required number of bits that produce best MSE and PSNR can be figured out. It is evident in Fig. 4, that for a number of bits that exceed 6, MSE of low frequency coefficients are lower than 10 and we hence choose 8-bit quantization. To achieve more precise selection for number of the required bits we go through the interpolation part. Based on Fig. 5, it is clear that for 8 bits MSE of the high frequency coefficients is less than 5. Therefore, we select 8 bits for the interpolation part. For this case, if we were to choose 6 bits, MSE would be greater than 10. As a result, by selecting 8 bits, where 1 bit is for the sign, for all of the filters, resulted PSNR is higher than 40 dB. In table I, some of the standard images are evaluated. In Fig. 7, the 8-bit and 6-bit quantization effect on the frequency response of 9 filter is shown. It shows that 8-bit quantized 9 filter approximately satisfies the ideal frequency response. Hardware realization of 𝐸0 , 𝐸1 , 𝑅0 and 𝑅1 blocks are in direct form . Since ℎ and 𝑔 are linear phase filters, the reduced direct form is applicable which leads to lower number of multipliers.
Fig. 4: Fixed-point analysis for the E0 and E1 filters coefficients.
TABLE I : MSE of low frequency coefficients (LFC), high frequency coefficients (HFC) and PSNR of the reconstructed images
Image Cameraman Barbara Lena Baboon
MSE(LFC) 6.4354 6.4063 6.2835 6.4563
MSE(HFC) 3.2741 2.7167 2.1329 2.4270
PSNR(R) 40.5050 40.9748 44.8239 40.7610
Fig. 5: MSE analysis for the R0 and R1 filters coefficients.
Another point that should be considered in quantization is that each FIFO requires large resources. The strategy of finding the proper number of bits helps the reduction of the number of bits that are needed in FIFOs. By considering the 8 bits for '9-7' filter, we reach the width of 10 and 12 for the FIFOs in Fig. 3. Fig. 6: PSNR analysis for the R0 and R1 filters coefficients.
Fig 7. Quantization effects on 9 filter frequency response
The results of the interpolation part should be subtracted from the pixels of the input image. For this purpose we need to know the delay between the arrival of the first pixel and the production of the first output to synchronize them as the inputs of a subtractor. The mentioned delay is equal to 10 clock cycles. Therefore, FIFOs of length 10 are deployed for this purpose in stage 6 of the pipeline. We implement this design on a EP3C120F780I7 chip from the Cyclone III family, manufactured by Altera. This FPGA is a cost-optimized, memory-rich chip. The implementation results are gathered in Table II. We see that there is a total of 61477 logic elements are consumed that is about 51.6% of the resources. The logic cells that are devoted to FIFOs are 46972 which is about 77% of the total involved logic cells. Fig. 8. shows the floorplan of the chip. The blocks that are numbered belong to FIFOs. Blocks that are numbered as 1 are related to FIFOs in the decimation part, and the ones indicated by 2 are related to FIFOs in the interpolation part. About 0.3% of the used logic cells are in the control units. This indicates that the design is straight forward. The maximum frequency that is applicable is 1GHZ. It is critical that if anyone desires to perform the LP in a 𝑁×𝑁 hierarchical method, one memory module is needed 4 and therefore read and write delays of the memory should also be considered. TABLE II: Results of implementation
Part # of LC % of LC
𝐸0 , 𝐸1 𝑅0 , 𝑅1
Our design is not platform dependent. But we tried to use the specific capabilities of this chip to further improve the performance of the implementation. Implementation results lead us to this fact that a vast area of chip is devoted to FIFOs which is about 77%. This value shows that making any kind of memory devices consumes many useful building blocks that are logic elements. By using this chip, it is possible to use a memory-rich capability to decrease the misuse of logic cells. This family of FPGAs contains memory blocks that are called M9K, which provide up to a total of 3.98 M bits of RAM at a maximum 0f 315 MHZ for synchronous operation . In this chip there are 432 of such memory blocks and each block has 9216 bits. A simple calculation shows that with this amount of memory blocks we could implement all of the FIFOs and also make one 256*256 memory for applying the hierarchical LP. Hence, we need only 119 M9k blocks. This means that 32% of the memory blocks are consumed and the total frequency of the design using the memory blocks is 315MHZ. Other improvements are possible by choosing a different multiplier from possible options of this chip. There are 288 embedded multiplier blocks available that have high speed interfaces to external memory . If memory blocks are exploited then the allocated logic cells will be 12.16% of the total logic elements while previous design required 51.6% of the resources. Considering a clock frequency of 300 MHZ, it will take 0.44msec to analyze a single 512x512 image. 5.
Computational Complexity Analysis
For an 𝑁 × 𝑁 input image, the decimation filter size is 𝐾 × 1 and the interpolation filter size is 𝑃 × 1. we could calculate the number of required additions and multiplications throughout the whole process. Using separable filters the aggregated total reduction of the arithmetic operations (addition, multiplication) is in 4𝐾−1 the order of 2𝐾 2 −1. For example if 𝐾 = 9 the mentioned reduction is about 20%. On the other hand, in our method the numbers of additions and multiplications in the 3 3 decimation part are respectively 4 𝑁 2 (𝐾 − 1) and 4 𝐾𝑁 2 . While the numbers of additions and multiplications in decimation part in the traditional method are 3 2 3 𝑁 (𝐾 − 1) and 2 𝐾𝑁 2 . These values reveal that half of 2 the arithmetic operations are redundant. The numbers of additions and multiplications in the 3 3 interpolation part are 𝑁 2 (𝑃 − 2) and 𝑃𝑁 2 while the 4 4 corresponding values in the traditional method are 3 2 3 𝑁 𝑃 − 2 and 2 𝑃𝑁 2 . This means that the percentage 2 of the redundant additions and multiplications is 𝑃 × 100%. For example if 𝑃 = 7 the ratio of the 2(𝑃−1) redundant multiplications is about 58%. Further results using the same justifications are presented in Table III. These values ascertain the mentioned reductions in the number of the arithmetic operations.
In the proposed method, we need 3𝐾 + 2𝑃 − 5 adders, 3(𝐾+𝑃) + 3 number of multipliers. Also 𝐾 − 1 number of 2 𝑁
FIFOs of the length of 2 and 2 FIFOs with the length of 𝑁 are required. The main employed assets for the pipeline implementation are the registers and FIFOs. An important consideration is the number of elements. Convolving the 𝑃−1 input data by ℎ𝑥 and 𝑔𝑥 requires 𝐾 − 1 and registers. 2 Also convolving the output of the horizontal analysis requires storage in a FIFO to calculate the results. This 𝑃−1 imply that 𝐾 − 1 and FIFOs are required for this 2 purpose. In the traditional method the same number of registers and FIFOs are required in the decimation part but twice the registers and FIFOs are required in the interpolation part. This shows a 50% improvement in the interpolation part. The whole process requires some clock pulses to finish the analysis of an 𝑁 × 𝑁 image.
 P. J. Burt and E. H. Adelson, “The Laplacian pyramid as a compact image code,” IEEE Trans. Commun., vol. COM-31, pp. 532–540, Apr.1983. 
TABLE III :
Comparison of Number of + and ∗ in decimation and interpolation parts
1.5 × 𝑁 2 (𝐾 − 1)
1.5 × 𝑁 𝐾 1.5 × 𝑁 2 (𝑃 − 1)
1.5 × 𝑁 2 𝑃
Proposed Method 0.75 × 𝑁 2 (𝐾 − 1)
In this paper, we have presented a new approach to the hardware implementation of the Laplacian Pyramid. The presented architecture utilizes the hardware more efficiently by reducing the number of arithmetic operations in comparison to the methods of previous papers [5-9]. This architecture also reduces the number of logic blocks such as FIFOs and registers. The outputs are produce with no latency, i.e. the output from the second row of the image arrives right after the first row is finished. If the ASIC implementation of the circuit is realized then the 50% reduction in the number of the arithmetic operations is translated into substantial reduction of the chip area and reduction of the power consumption of the chip.
M. N. Do and M. Vetterli, “Framing Pyramids,” IEEE Transaction on Signal Processing, VOL. 51, NO. 9, SEPTEMBER 2003.
 D. Tzovaras, M.G. Strintzis, “Optimal construction of reduced pyramids for lossless and progressive image coding”, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 47., pp. 332-348, 2000.
0.75 × 𝑁 𝐾
 D. J. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision 60 (2): 91– 110.
0.75 × 𝑁 2 (𝑃 − 2)
 Y. Song, K. Gao, G. Ni, R. Lu, " Implementation of real-time Laplacian pyramid image fusion processing based on FPGA," Proc. Of SPIE Vol. 6833, 683316, 2007.
0.75 × 𝑁 2 𝑃
 A. Darabiha, W. J. MacLean, J. Rose, "Reconfigurable hardware implementation of a phase-correlation stereo algorithm," Machine Vision and Applications, 7 March 2006.
 A. L. Abbott, P. M. Athanas, L. Chen, R. L. Elliott, "Finding Lines and Building Pyramids with Splash2," in IEEE Proc. 1994 FPGAs for Custom Computing Machines, pp. 155-163.  P. J. Burt, "Multiresolution Pyramid Architectutres for Real-Time Motion Analysis," IAPR Workshop on Machine Vision Applications Nov. 28-30,1990.  G. S. V. D. Wal, P. J. Burt, "A VLSI Pyramid Chip for Multiresolution Image Analysis," international Journal of computer vision,177-189, 1992.  P. P. Vaidyanathan. Multirate Systems and Filter Banks. PrenticeHall,Englewood Cliffs, NJ, 1993.  M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, “Image coding using wavelet transform,” IEEE Trans. Image Processing, vol. 1, pp. 205–220, Apr. 1992.  X. Hu, L. S. Debrunner, V. Debrunner, “An efficient design for FIR filters with variable Precision,” in IEEE International Symposium on Circuits and Systems, Volume 4, 26-29 May 2002.  S. K. Mitra. Digital Signal Processing: A Computer-Based Approach New York, McGraw-Hill, 2001.
Fig. 8: Floor plan of implemented method and area that is occupied by FIFOs
 Cyclone III Device handbook, October 2008. Available: http://www.altera.com.