" # 0 0 * "$ ! - 1 -#. # #. # 2 2 2 3


Figure 5.9: Total number of logic elements comparison between the Le Gall 5/3 filter of the traditional JPEG2000 and the proposed 5/3 LS filter resources for the four versions of the design (Lena image size versions V64, V128, V256, and V512)

After performing synthesis and the other verification processes, an RTL simulation of the proposed efficient FDWT module was carried out. The RTL or technology-map view helps to check the design visually. The detailed RTL architecture of the whole developed 2-D FDWT module is shown in Figure 5.10.


Figure 5.10: RTL view of the proposed efficient FDWT module

(The schematic comprises the Memory:U1, RWTU:U2, DWTControl:U3, and DWT2DControl:U4 blocks together with their control and data interconnections.)

It is necessary to describe the physical behaviour and then simulate it using the different versions of grayscale test image data, i.e. by writing a test bench to verify the functionality. Test benches are frequently utilized to provide an automated testing environment around the design. The test bench serves as a supplier of test vectors to the input pins and is also capable of receiving the responses from the output pins (Altera Corporation, 2014). The functionality of the individual hardware components comprising the 2-D DWT chipset was examined using the ModelSim-Altera 6.5b software. The stimulus signals presented to the FDWT module by the test-bench environment include Reset, Clock, and Start; Ready is the output signal returned to the control unit, as shown in Figure 5.11.

Figure 5.11: Simulation waveform result of the proposed FDWT module

To carry out the 1-D transforms row-wise, the 2-D DWT Control module starts the 1-D DWT control unit by applying the reset signal. While the external reset signal is asserted, it disables the 5/3 WTU core and the memory block and waits for the rising edge of the start signal; no values are written to the data bus during this time. The 2-D DWT Control block triggers the 1-D DWT processor upon receiving an active start signal from the system environment. The start signal allows the 1-D DWT processor to read an 8-bit pixel from the original location of the memory and start the transform computation. The data are read in sequence from the memory. For every computation level, pixel values are first read row by row. This process continues until all pixel values from all rows have been read and the transformed values saved in the memory. Once all the rows have been read and the computation completed, the results are written to the external memory when the write signal is asserted. After all the rows of the image have been transformed, the 2-D DWT Control module restarts the transformation process column-wise, thereby completing the level-1 2-D DWT transformation. The results are saved in the memory after finishing the computations for all columns. The size of the external memory is double the original image size.

After completing the transformation and computing all the requested levels of the image, the ready signal is asserted by the processor to signify that the system is now able to read the transformed image data results from the memory. The total amount of computation depends on the number of levels indicated by the NL signal. Every level requires reading a definite number of rows and columns representing a specified number of pixels. Computations are performed for all levels, beginning with level 1, up to the level signalled by NL.
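The row-column flow described above can be summarised by a small software reference model. The following MATLAB sketch is illustrative only and is not the VHDL source: it assumes the standard JPEG2000 5/3 predict/update lifting steps with mirrored boundary samples and an even (power-of-two) line length, and the names dwt2d_53 and lift53_fwd are hypothetical.

```matlab
% Illustrative MATLAB reference model (not the VHDL source): multilevel
% row-column 2-D DWT using the standard 5/3 lifting steps.
function img = dwt2d_53(img, NL)
    img = double(img);
    rows = size(img, 1);  cols = size(img, 2);
    for level = 1:NL
        for r = 1:rows                      % 1-D transform of each row
            img(r, 1:cols) = lift53_fwd(img(r, 1:cols));
        end
        for c = 1:cols                      % then of each column
            img(1:rows, c) = lift53_fwd(img(1:rows, c)')';
        end
        rows = ceil(rows / 2);              % next level works on the LL subband only
        cols = ceil(cols / 2);
    end
end

function y = lift53_fwd(x)
    % One 1-D level: predict (high-pass d) then update (low-pass s),
    % with mirrored samples at the boundaries (even-length input assumed).
    e = x(1:2:end);  o = x(2:2:end);
    d = o - floor((e + [e(2:end), e(end)]) / 2);        % predict step
    s = e + floor(([d(1), d(1:end-1)] + d + 2) / 4);    % update step
    y = [s, d];                                          % L half then H half
end
```

Each level operates only on the low-pass region produced by the previous one, which is the behaviour the NL-controlled loop in the hardware follows.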


The IDWT module applies equivalent code. The inverse transform can be derived directly from the same structural FDWT design: the original pixel data input to the FDWT can be completely recovered from the computed approximation and wavelet coefficient components. The FDWT and IDWT are executed with the lifting-scheme formulation of the efficient embedded-extension 5/3 wavelet transform and have similar computational complexity, so the required logic resources are comparable. The IDWT synthesis process was conducted to produce the RTL schematic shown in Figure 5.12.
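Under the same assumptions as the forward sketch above, the inverse step simply undoes the update and then the predict operations, so integer inputs are recovered exactly. The sketch below is again illustrative, with the hypothetical name lift53_inv:

```matlab
% Illustrative MATLAB sketch of the matching 1-D inverse 5/3 lifting step.
function x = lift53_inv(y)
    n = numel(y);  h = ceil(n / 2);
    s = y(1:h);  d = y(h+1:end);
    e = s - floor(([d(1), d(1:end-1)] + d + 2) / 4);   % undo update
    o = d + floor((e + [e(2:end), e(end)]) / 2);        % undo predict
    x = zeros(1, n);  x(1:2:end) = e;  x(2:2:end) = o;  % re-interleave samples
end

% Quick self-check of perfect reconstruction on arbitrary 8-bit data:
%   x = randi([0 255], 1, 256);
%   assert(isequal(lift53_inv(lift53_fwd(x)), x));
```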

Clock cycles are consumed continuously at the different levels of calculation, from the time the start signal is asserted until the ready signal is issued to complete the 2-D DWT. The clock cycle usage of all the VHDL architectures has been verified and tested using different levels of decomposition and different input image data set sizes. The number of cycles used for the 2-D DWT coding process rises considerably for the larger image size versions, and the clock cycle usage also increases with the number of decomposition levels; it ranges from 22655 clock cycles (N=64, L=1) to 1761137 clock cycles (N=512, L=7) in the efficient 2-D DWT proposed design. This is due to the larger number of memory accesses needed as the memory used to store the image information grows and the number of levels increases, as shown in Table 5.3.


Figure 5.12: RTL view of the proposed efficient IDWT module




(The schematic comprises the Memory:U1, IRWTU:U2, IDWTControl:U3, and IDWT2DControl:U4 blocks together with their control and data interconnections.)

Table 5.3: Number of clock cycles used for the efficient 2-D DWT 5/3 lifting filter coding

Number of levels | Lena 64×64 (clock cycles) | Lena 128×128 (clock cycles) | Lena 256×256 (clock cycles) | Lena 512×512 (clock cycles)
Level L=1 | 22655 | 84863 | 332159 | 1318271
Level L=2 | 28162 | 106114 | 415618 | 1649026
Level L=3 | 29637 | 111621 | 436869 | 1732485
Level L=4 | 30056 | 113096 | 442376 | 1753736
Level L=5 | 30227 | 113515 | 443851 | 1759243
Level L=6 | 30314 | 113686 | 444270 | 1760718
Level L=7 | - | 113773 | 444441 | 1761137

The clock cycle usage for (N=64, L=7) is not given in Table 5.3 because the image size is reduced by a quarter at each level of decomposition until the target (final) level is reached and the data from the external memory are exhausted; a 64×64 image therefore does not support a seventh level. A comparative analysis of the implementations in terms of the clock cycles used for the 2-D DWT coding process, between the proposed efficient design and the conventional JPEG2000 lifting, indicates a clearly lower clock cycle consumption for the proposed design, as illustrated in Table 5.4. The clock cycle consumption varies with both the image size and the decomposition level, and ranges from 63741 clock cycles at (N=64, L=1) to 5258207 clock cycles at (N=512, L=7) in the traditional JPEG2000 Le Gall 5/3 filter. As the 2-D DWT level and the image size increase, the filter architecture requires a longer computation clock cycle period. The number of clock cycles used for the 2-D DWT JPEG2000 5/3 lifting filter coding is therefore noticeably higher than the clock cycle count of the efficient 2-D DWT proposed design, as shown in Table 5.4.


Table 5.4: Number of clock cycles used for the 2-D DWT JPEG2000 5/3 lifting filter coding

Number of levels | Lena 64×64 (clock cycles) | Lena 128×128 (clock cycles) | Lena 256×256 (clock cycles) | Lena 512×512 (clock cycles)
Level L=1 | 63741 | 248957 | 988029 | 3940733
Level L=2 | 79552 | 311296 | 1235584 | 4927365
Level L=3 | 83619 | 327107 | 1297923 | 5174915
Level L=4 | 84694 | 331174 | 1313734 | 5237254
Level L=5 | 84993 | 332249 | 1317801 | 5253065
Level L=6 | 85144 | 332548 | 1318876 | 5257132
Level L=7 | - | 332699 | 1319175 | 5258207

In order to evaluate the performance of the hardware architecture, it is necessary to use metrics that characterize the architecture in terms of the hardware resources used and the computation time. The hardware resources used for filtering are measured by the number of multipliers and adders, while those used for the storage of data and filter coefficients are measured by the number of registers (Mahapatra et al., 2014). The numbers of multipliers and adders employed by the architecture depend only on the filter length, whereas the number of buffer registers depends on the image size, as mentioned in Section 2.6.1. In general, the computation time is technology dependent, since the clock period may differ significantly from one architecture to another (Mahapatra et al., 2014).

However, a metric that is technology independent and can be used to determine the computation time T is the number of clock cycles NCLK elapsed between the first and the last samples input to the architecture. Assuming that the clock period is Tc, the total computation time can then be obtained as T = NCLK × Tc (Mahapatra et al., 2014). The results of Table 5.1 show that the proposed efficient 2-D DWT architecture can run at maximum operational clock frequencies of (176.40 MHz, Tc = 5.669 ns), (166.22 MHz, Tc = 6.016 ns), (160.23 MHz, Tc = 6.241 ns), and (167.39 MHz, Tc = 5.974 ns) for the 64×64, 128×128, 256×256 and 512×512 input image sizes, respectively. The resulting overall time (T = NCLK × Tc) taken by the FDWT encoder for all image size versions of the efficient 5/3 architecture is presented in Table 5.5.
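As a quick check of the T = NCLK × Tc relation, the level-1 entry of Table 5.5 for the 256×256 image can be reproduced from the cycle count in Table 5.3 and the 6.241 ns clock period quoted above:

```matlab
% Worked example of T = NCLK x Tc for the 256x256 image at level 1.
NCLK = 332159;            % clock cycles (Table 5.3, N = 256, L = 1)
Tc   = 1 / 160.23e6;      % clock period in seconds (about 6.241 ns)
T_ms = NCLK * Tc * 1e3    % ~2.073 ms, matching Table 5.5
```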

Table 5.5: Computation time for the efficient 2-D DWT 5/3 lifting filter coding

2-D DWT | Lena 64×64 (ms) | Lena 128×128 (ms) | Lena 256×256 (ms) | Lena 512×512 (ms)
Level L=1 | 0.12842 | 0.510546 | 2.07301 | 7.87544
Level L=2 | 0.15964 | 0.63839 | 2.59388 | 9.85140
Level L=3 | 0.168010 | 0.67152 | 2.72651 | 10.34999
Level L=4 | 0.17038 | 0.68039 | 2.76088 | 10.47694
Level L=5 | 0.17135 | 0.68292 | 2.77008 | 10.50984
Level L=6 | 0.17184 | 0.68394 | 2.77270 | 10.51865
Level L=7 | - | 0.68447 | 2.77376 | 10.52116

In contrast, the conventional JPEG2000 lifting implementation ranges from 0.52420 ms at (N=64, L=1) to 39.21108 ms at (N=512, L=7) for the FDWT decompositions of the various image versions, as shown in Table 5.6 and illustrated in Figure 5.13.


Table 5.6: Computation time for the JPEG2000 2-D DWT 5/3 lifting filter coding

2-D DWT | Lena 64×64 (ms) | Lena 128×128 (ms) | Lena 256×256 (ms) | Lena 512×512 (ms)
Level L=1 | 0.52420 | 2.16842 | 8.59753 | 29.38652
Level L=2 | 0.65421 | 2.71140 | 10.75168 | 36.74392
Level L=3 | 0.68765 | 2.84911 | 11.29414 | 38.58997
Level L=4 | 0.69649 | 2.88453 | 11.43172 | 39.05483
Level L=5 | 0.69895 | 2.89390 | 11.46711 | 39.17274
Level L=6 | 0.70019 | 2.89650 | 11.47647 | 39.20307
Level L=7 | - | 2.89782 | 11.47907 | 39.21108

Figure 5.13: Computation time comparisons between the proposed efficient design and conventional JPEG 2000

The outcomes in Table 5.2 show that the conventional JPEG2000 2-D DWT architecture runs at lower frequencies of (121.60 MHz, Tc = 8.224 ns), (114.81 MHz, Tc = 8.710 ns), (114.92 MHz, Tc = 8.702 ns), and (134.10 MHz, Tc = 7.457 ns) for the 64×64, 128×128, 256×256 and 512×512 input image sizes, respectively. The results of the FPGA implementation show that the FDWT circuit of the proposed efficient 2-D DWT architecture can process a 256×256 image in 2.07301 ms, which is at least four times faster than the JPEG2000 FPGA implementation for the same image size (8.59753 ms), even with less hardware utilization. The aim of a short computation time has been achieved by minimizing the number of clock cycles required for the DWT computation with little operating cost on the hardware resources.

The embedded-extension architecture incorporated into the main 5/3 DWT algorithm has been developed to efficiently reduce the computation area as well as the power loss of the system, by substantially cutting down the number of 5/3 wavelet transform arithmetic operations. These advantages bring about considerable savings in both area and power, which in turn lower the hardware utilization and enhance the processing speed. The performance appraisal of the embedded-extension algorithm and its architecture indicates that they are appropriate for resource-efficient and high-speed devices.

Therefore, a safety margin should be considered when judging the real-time processing capability of a system. The design is suitable for real-time processing since it performs the N×N image wavelet transform in the order of the N² computation clock cycles required to complete the 2-D DWT (Benkrid et al., 2003). Observing Tables 5.5 and 5.6 and Figure 5.13, one concludes that the efficient architecture can handle the real-time image processing of still images of sizes 64×64, 128×128, 256×256 and 512×512, with minimal calculation time for the whole structure, in a low-power mode and with low hardware utilization.


The hardware design process involves trade-offs among limiting factors such as area, speed and power, which are balanced to enhance efficiency. The FPGA input/output (I/O) pins consume a large amount of power because they are designed with a larger geometry than the core in order to support the sinking currents of all the I/O standards. The power consumption evaluation showed that the chipset of the proposed efficient design consumes around 0.033 W. The power report of the proposed 2-D DWT structure is illustrated in Figure 5.14.

Figure 5.14: PowerPlay power optimization in the Quartus II software design flow; a- Cyclone II PowerPlay Early Power Estimator report, b- the main worksheet of the PowerPlay Early Power Estimator


Related studies on efficient hardware architecture implementations of multilevel 2-D DWT processors are reviewed in this section. Emphasis is given to validating the proposed hardware architecture schemes on FPGA systems and the matching HDL model for the DWT computation of wavelet-based image compression applications.

An FPGA architecture for the separable 2-D biorthogonal DWT decomposition was presented in (Benkrid et al., 2003). The architecture is based on PA analysis and handles the computation along the image margins efficiently using the technique of symmetric extension. The architecture is scalable for diverse filter lengths and various octave levels. The design of a 2-D biorthogonal 9/7 wavelet transform and its implementation on the Xilinx Virtex-E is selected as a case study. For a 256×256 input image, the architecture operates at a speed of 75 MHz. It uses 4720 slices of the Xilinx XCV600E FPGA and 10 Block RAMs.

Furthermore, a non-separable, parallel, and recursive 2-D DWT architecture was developed in Palero et al. (2006). This architecture employs distributed control to compute the three decomposition levels and utilizes a modification of the RPA. It is built from basic blocks that, when correctly combined, allow an increase in the number of levels or a change in the characteristics of the filters; it is merely necessary to increase the number of control units in order to increase the number of levels. The results obtained with the Xilinx XCV600E FPGA device show that the implementation can be attained at a frequency of about 45 MHz, which allows images of 512×512 pixels to be processed in 5.88 ms computation time with a hardware cost of 15017 LEs.


Three main 2-D DWT lifting-based computation schedules, the RC, LB and BB, have been applied on FPGA-based platforms and evaluated in terms of performance, area and energy requirements (Angelopoulou et al., 2008). The computation schedules are analysed comparatively with respect to variations in image size M×M and the number of transform levels L. The RC, LB and BB designs run on the XC4VLX15 FPGA device at 172.4, 113.6 and 117.6 MHz respectively. The slice count for RC ranges from 280 (M=256, L=3) to 329 slices (M=1024, L=6); for LB and BB this range falls within 2,659-3,001 and 2,646-3,597 slices, respectively. The estimated on-chip power dissipation of the RC is 214 mW, and ranges from 241 mW (M=256) to 268 mW (M=1024) in the case of BB. The power dissipation of LB is comparatively higher, starting at 264 mW (M=256) and reaching 289 mW (M=1024). These results show that the selection of the appropriate schedule is a decision that should depend on the specified algorithmic requirements.

A modified flipping architecture (MFA) is proposed for the implementation of image compression using the 2-D DWT in (Parvatham and Gopalakrishnan, 2012), using a new multiplier algorithm called the Modified Baugh-Wooley pipelined constant coefficient multiplier (MBW-PKCM) for signed multiplication. A distributed arithmetic architecture with a constant coefficient multiplier is used to implement the MBW multiplier. The 2-D DWT scheme is implemented on an Altera FPGA using the MFA blocks with 9/7 biorthogonal filters and MBW-PKCM multipliers, on a Cyclone II EP2C35F672C6 FPGA trainer kit. The Verilog HDL language is used to describe the functionality of the circuit, and after the circuit is described in HDL the functionality is verified using the Altera Quartus II simulation tool. From the implementation results, it is verified that the MBW-PKCM is faster (147.15 MHz) than the non-pipelined BW-KCM (131.48 MHz) and is register efficient, using only 1079 LEs, which is less than the BW-PKCM with 1121 LEs. Only a one-level 2-D DWT scheme is implemented on a NIOS II processor, with a power of 116.86 mW compared with 116.82 mW for the BW-PKCM.

New algorithms and hardware architectures to address the memory requirements (for storing intermediate signals) and critical path issues in the 2-D dual-mode (supporting 5/3 lossless and 9/7 lossy coding) lifting-based discrete wavelet transform (LDWT) are presented in (Chih-Hsien et al., 2013). Moreover, a new approach, namely the interlaced read scan algorithm (IRSA), changes the signal reading order from a purely row-wise signal flow to a mixed row- and column-wise order and thus reduces the transpose memory (TM). The prototyping chip took 29196 gate counts and could operate at 100 MHz with 15.47 mW estimated power.

A multiplier-less architecture based on an algebraic integer (AI) representation for computing the Daubechies 6-tap wavelet transform for 1-D/2-D signal processing has been proposed in Madishetty et al. (2014). The design is physically implemented for a 4-level 1-D/2-D decomposition using a Xilinx Virtex-6 vcx240t-1ff1156 FPGA device. The FPGA implementations of the 1-D/2-D transforms are tested using hardware co-simulation and an ML605 board with a clock of 100 MHz. A 45 nm CMOS synthesis of the 2-D designs shows an improved clock frequency of better than 306 MHz for a supply voltage of 1.1 V.


Table 5.7 shows a comparative analysis of the hardware performance of the associated implemented architectures in terms of maximum frequency (Fmax), number of FPGA logic elements (LEs), image size, consumed power, and computation time (T). The proposed DWT architecture has been developed based on a lifting scheme for the Le Gall 5/3 filter computation related to the JPEG2000 standard. The FPGA LEs used in the proposed efficient 2-D DWT architecture are much fewer than in the other architectures: the number of logic elements covers a range from 120 (N=64) up to 137 slices (N=512). The architecture employed 127 slices, only 1% of the 33216 FPGA slices, for the 256×256 image size version.

This is regarded as very low in contrast to the other architectures. Since the number of clock cycles is a principal factor in real-time processing computations, the proposed 256×256 image size version was evaluated at a frequency of 160.23 MHz (Tc period = 6.241 ns). The results show that a computation time of 2.07301 ms is sufficient to process and compute the 2-D DWT coefficients of 256×256 pixels. The proposed architecture therefore exhibits the shortest computation time compared with the other 5/3 or 9/7 LS structures. Simply put, the considerably smaller computation duration leads to lower power consumption compared with the other architectures.


Table 5.7: Comparison of various 2-D DWT FPGA architecture implementations

Architecture | Image size (N) | DWT level | Number of LEs | Fmax (MHz) | T (ms) | Power (W) | FPGA device
(McCanny et al., 2002) symmetrically extended 9/7 | 512 | 3 | 2559 | 44.1 | 9 | N/A | XC2V500
(Hongyu et al., 2004) 9/7 RPA | 512 | 3 | 879 | 50 | 5.3 | N/A | XC2V250
(Raghunath and Aziz, 2006) symmetrically extended 9/7 | 512 | 5 | 1700 | 171.8 | 3.1 | N/A | Virtex-2
(Palero et al., 2006) 5/3 parallel | 512 | 3 | 3580 | 45 | 5.88 | N/A | XCV600E
(Angelopoulou et al., 2008) 5/3 RC single | 256 | 6 | 280 | 172.4 | N/A | 0.214 | XC4VLX15
(Dia et al., 2009) 5/3 parallel | 256 | 3 | 1835 | 108 | 2.36 | 0.047 | XC4VLX15
(Parvatham and Gopalakrishnan, 2012) symmetrically extended 9/7 | 512 | 1 | 1079 | 147.15 | N/A | 0.1168 | Cyclone II
(Chih-Hsien et al., 2013) dual-mode 5/3 9/7 LDWT | 256 | 3 | 29196 | 100 | N/A | 0.1547 | TSMC 0.18 µm (CMOS)
(Madishetty et al., 2014) Daub-6 | 256 | 4 | 1040 | 306.15 | N/A | 4.15 | Virtex-6
Proposed embedded extension 5/3 | 256 | 7 | 127 | 160.23 | 2.07 | 0.033 | Cyclone II EP2C35F672C6

The second aim was to compare the performance of the proposed architecture comprehensively with results from other wavelet models. In the following section, the HWT design results and analysis are presented as a comparison.


5.3 Hardware Results and Analysis for HWT Design

Following the completion of the design of every VHDL module, each one must be compiled independently for precise analysis and debugging. The VHDL model for the HWT has been analyzed, synthesized, and simulated using the same Altera DE2 development board, Cyclone II FPGA family and EP2C35F672C6N chipset used previously. The Quartus II software tool (V 9.1 SP2 web edition) has built-in tools for performing these operations. The VHDL model for the HWT is tested to verify the performance of the FDWT and IDWT. To evaluate the 2-D DWT performance of the HWT design fully, the same three 256-level grayscale images are again selected. The FDWT and IDWT modules apply the 2-D DWT to the three test images of size 256×256 pixels. These modules can be applied for different numbers of FDWT transformation levels, up to seven levels, as shown in Figures 5.15, 5.16 and 5.17. Unlike MATLAB, VHDL does not provide many built-in functions such as convolution; consequently, while implementing the algorithm in VHDL, the linear equations of the FDWT and IDWT are used. In discrete form, Haar wavelets are related to a mathematical operation called the HWT. Like all wavelet transforms, the HWT decomposes a discrete signal into two subsignals of half its length: one subsignal is a running average or trend; the other is a running difference or fluctuation (Walker, 2008).

Figure 5.15: Lena original image of size 256×256 and its seven levels of FDWT

Figure 5.16: Woman original image of size 256×256 and its seven levels of FDWT

Figure 5.17: House original image of size 256×256 and its seven levels of FDWT

The FDWT module, which consists of an adder and a right shifter, is used to obtain the low-pass average and high-pass difference components. The current and next pixel samples are applied to the adder, and the adder output is right-shifted by one bit (division by two) to give the average wavelet component. The difference component is calculated by subtracting the next pixel value from the current one, again followed by a one-bit right shift. The Haar linear equations that compute an average (Li) and a difference wavelet coefficient (Hi) from the current and next pixel samples in the input image data are given by equation (4.1) and equation (4.2), respectively.
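Assuming that equations (4.1) and (4.2) are the usual integer forms L = floor((a + b)/2) and H = floor((a - b)/2), realised by the adder/subtractor and the one-bit right shift described above, the pair computation can be sketched in MATLAB as follows (the function name haar_pair is illustrative):

```matlab
% Illustrative sketch of the Haar average/difference pair, assuming the
% integer forms of equations (4.1)-(4.2).
function [L, H] = haar_pair(a, b)
    L = floor((a + b) / 2);    % low-pass running average (add, shift right by 1)
    H = floor((a - b) / 2);    % high-pass running difference (subtract, shift right by 1)
end
```

With this integer rounding, L + H and L - H recover the original pair exactly only when the two samples have the same parity, which is one source of the rounding error quantified by the MSE figures later in this section.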

The HWT also has limitations, which can be a problem for some applications. In generating the averages for the next level and each set of coefficients, the Haar transform performs an average and a difference on a pair of values, then shifts over by two values and calculates another average and difference on the next pair. The high-frequency coefficients should ideally reflect all high-frequency changes, but the Haar wavelet transform window slides without overlap and contains only two elements. If a large change takes place from an even-indexed sample to the following odd-indexed sample, i.e. across the boundary between two adjacent pairs, the change will not be reflected in the high-frequency coefficients of that level.
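A tiny numeric illustration of this limitation, using the integer pair form assumed above:

```matlab
% The large jump occurs between the two pairs (10,10) and (200,200),
% so the level-1 difference coefficients do not register it at all.
x = [10 10 200 200];
L = floor((x(1:2:end) + x(2:2:end)) / 2);   % averages:    [10 200]
H = floor((x(1:2:end) - x(2:2:end)) / 2);   % differences: [0 0]
```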

To estimate the rounding errors introduced by the arithmetic operations used to obtain the average and difference, an evaluation was made between the original image and the resulting image after the FDWT and IDWT processes. Table 5.8 shows the mean square error (MSE) of the three images as the level of decomposition increases. It is computed pixel by pixel, by summing the squared differences between all pixel values of the original image and those of the image reconstructed by the IDWT module and dividing by the total pixel count (a short MATLAB sketch of this computation is given after the table). The three selected 256-level grayscale images gave similar performance, as shown in Table 5.8 and Figure 5.18.

Table 5.8: MSE of the three test images with increasing level of decomposition

FDWT coding process | Lena 256×256 | Woman 256×256 | House 256×256
Level 1 of 2-D DWT | 0.9350 | 0.9971 | 0.9941
Level 2 of 2-D DWT | 2.4141 | 2.4981 | 2.4551
Level 3 of 2-D DWT | 4.4028 | 4.5529 | 4.4028
Level 4 of 2-D DWT | 6.9421 | 7.0429 | 4.3771
Level 5 of 2-D DWT | 9.6506 | 9.7497 | 9.6972
Level 6 of 2-D DWT | 14.0419 | 12.6461 | 13.6324
Level 7 of 2-D DWT | 18.6094 | 19.0865 | 18.4241
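The pixel-wise MSE described above can be sketched as follows (illustrative MATLAB, with hypothetical variable names):

```matlab
% Pixel-wise MSE between an original image and its FDWT/IDWT round trip,
% as used for Table 5.8.
function e = image_mse(original, reconstructed)
    d = double(original(:)) - double(reconstructed(:));
    e = sum(d .^ 2) / numel(d);   % sum of squared differences / pixel count
end
```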

Figure 5.18: MSE performance evaluation of the test images (Lena, Woman, House) with different decomposition levels (MSE versus the number of decomposition levels L)

The MSE created by the arithmetic processes of the FDWT and IDWT clearly increases with the number of decomposition levels, as depicted in Figure 5.18. The figure also shows that the difference between the MSE reached by the Lena image and by the other test images is relatively insignificant. Therefore, the simulation was carried out using the Lena test image with various sizes and with several transformation levels for each size. Four versions of the design are presented, each adapted to a different image size; the modules perform the 2-D DWT on images of size 64×64, 128×128, 256×256, and 512×512 pixels.

The image size is determined before the program is loaded onto the FPGA device. The final design is therefore issued in four versions: 64×64, 128×128, 256×256, and 512×512 pixels. The 2-D DWT hardware architecture performs the FDWT and IDWT in RC fashion on the input image text file; for the IDWT architecture the design is nearly identical. These modules can be configured for a variable number of transformation levels, from one to seven, for complete image compression applications. Figures 5.19 to 5.22 show the results of the four versions of the Lena image with various decomposition levels for every size.

Figure 5.19: FDWT simulation using the Lena image for the 64×64 image size and 7 transformation levels

Figure 5.20: FDWT simulation using the Lena image for the 128×128 image size and 7 transformation levels

Figure 5.21: Level 1 FDWT simulation using the Lena image for the 256×256 image size

Figure 5.22: Level 1 FDWT simulation using the Lena image for the 512×512 image size

Table 5.9 and Figure 5.23 reveal that the MSE results clearly increase with the number of decomposition levels for all versions of the Lena image. The flow clarified in Figure 5.24 is applied to the first 8×8 block of samples of the 256×256 pixels version of the Lena test image. This part of the work was executed entirely in MATLAB to ensure that the HWT calculation was fully understood and to serve as a validation and approval reference. The MATLAB code is considered validated when the reconstructed IDWT images are found to match the original images visually, apart from small difference errors.

Table 5.9: MSE for the 2-D DWT coding process of the Lena images

MSE for Lena image coding process | 64×64 pixels | 128×128 pixels | 256×256 pixels | 512×512 pixels
Level 1 of 2-D DWT | 0.9880 | 0.9919 | 0.9350 | 0.6804
Level 2 of 2-D DWT | 2.4724 | 2.4141 | 2.4141 | 2.0134
Level 3 of 2-D DWT | 4.4543 | 4.5991 | 4.4028 | 3.8794
Level 4 of 2-D DWT | 6.9485 | 7.2880 | 6.9421 | 6.2535
Level 5 of 2-D DWT | 9.2927 | 9.6750 | 9.6506 | 9.1710
Level 6 of 2-D DWT | 10.8352 | 15.2166 | 14.0419 | 12.2256
Level 7 of 2-D DWT | - | 21.1798 | 18.6094 | 16.3184

Figure 5.23: Lena test images MSE performance evaluation with different decomposition levels


Figure 5.24: MATLAB simulation results on the 256×256 pixels size version of the Lena image; a- 256×256 Lena test image version, b- the first 8×8 Lena image block, c- 256×256 Lena IDWT image, d- first 8×8 IDWT result of the Lena image, e- the transformed DWT result image, f- first 8×8 DWT result of the Lena image

Figure 5.25: VHDL simulation results on the 256×256 pixels size version of the Lena image; a- 256×256 Lena test image version, b- the first 8×8 Lena image block, c- the reconstructed IDWT result image, d- first 8×8 IDWT result of the Lena image, e- the transformed DWT result image, f- first 8×8 DWT result of the Lena image


Figure 5.25 illustrates the FDWT hardware usage of the HWT algorithm for the same 256×256 pixels size version example. The different-sized grayscale images, which are originally stored in the simulated memory module, are stripped of their image header information to produce a raw text (HEX format) file. As a result of the FDWT hardware implementation, the wavelet coefficients are produced as HEX text data. The first 8 selected DWT result samples of the Lena image in Figure 5.25 f are demonstrated using the ModelSim-Altera 6.5b software in Figure 5.26.
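One illustrative way to produce such a raw HEX text file from a grayscale image in MATLAB is sketched below; the file names are hypothetical and not those used in the actual test environment:

```matlab
% Read a grayscale image (the header is parsed and discarded by imread)
% and write one 8-bit pixel per line as two HEX digits, in row-major order.
img = imread('lena256.png');          % hypothetical input file name
fid = fopen('lena256_mem.txt', 'w');  % hypothetical memory-initialisation file
fprintf(fid, '%02X\n', uint8(img'));  % transpose so traversal is row-major
fclose(fid);
```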

Figure 5.26: Waveform indicating DWT results of memory module

From the first 8 selected DWT samples of the Lena image in Figure 5.25 f, it is assumed that the memory stores pairs of coefficients: the low coefficient L0 = A1 at address 0, the high coefficient H0 = A2 at address 1, and so on, as illustrated in Figure 5.27. The pair of coefficients L0 = A1, H0 = A2 is formed by one level of HWT decomposition (FDWT filtering) of the 256×256 Lena test image version. Given that the input coefficients are stored in the external memory module, L0 and H0 can be stored at their addresses 0 and 1. Similarly, every pair of resultant wavelet coefficients L(i) and H(i) can be saved at addresses 2i and 2i+1, respectively, as shown in Figure 5.27.
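The interleaved layout can be sketched as follows (illustrative MATLAB; only L0 = A1 and H0 = A2 are taken from Figure 5.26, the second pair is an arbitrary example):

```matlab
% Interleaved coefficient layout: L(i) at address 2i, H(i) at address 2i+1
% (0-based addresses, as in Figure 5.27).
L = [hex2dec('A1') hex2dec('B3')];   % low-pass coefficients (second value arbitrary)
H = [hex2dec('A2') hex2dec('0F')];   % high-pass coefficients (second value arbitrary)
mem = zeros(1, 2 * numel(L));
mem(1:2:end) = L;                    % MATLAB indices 1,3,... -> addresses 0,2,...
mem(2:2:end) = H;                    % MATLAB indices 2,4,... -> addresses 1,3,...
```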

Figure 5.27: Waveform indicating FDWT results

The intermediate 256×256 wavelet HEX coefficient data are stored in the external memory and then the dump signal is activated, which signifies the end of the transformation process. The 2-D DWT module generates the ready signal to the outside environment, indicating the end of the first level of decomposition after the 989565 clock cycles required for coding 256×256 pixels through FDWT filtering, as shown in Figure 5.28 and later in Table 5.11, which lists the number of clock cycles used for the 2-D DWT coding process of the HWT.

Figure 5.28: Waveform indicating the end of the first level of the FDWT decomposition process

The VHDL behavioural description can be translated into physically realizable circuits by means of synthesis, using computer-aided design tools, as illustrated in Figure 5.29. A simulation flow summary report is generated after this compilation. As a practical method of design confirmation, the simulation report covers the four output versions, 64×64, 128×128, 256×256, and 512×512 pixels, and confirms that the responses conform to the desired input requirements.

Figure 5.29: Simulation flow summary reports for the four FDWT versions using the Lena test image; a- 64×64 pixels version, b- 128×128 pixels version, c- 256×256 pixels version, d- 512×512 pixels version


The proposed HWT structure has been enhanced by employing fast computation techniques implemented with linear algebra equations. The performance of this arithmetic unit shows significant improvements in the total number of computations and yields a lower equivalent gate count, as shown in Table 5.10.

Table 5.10: Performance comparison in overall equivalent gate usage for the 2-D DWT coding process

Analysis & synthesis resource usage | 64×64 pixels | 128×128 pixels | 256×256 pixels | 512×512 pixels
Total logic elements (LEs, by LUT inputs) | 617 (2% of 33216) | 697 (2% of 33216) | 700 (2% of 33216) | 718 (2% of 33216)
Total registers | 313 (<1% of 33216) | 345 (1% of 33216) | 377 (1% of 33216) | 409 (1% of 33216)
I/O pins | 41 | 45 | 49 | 53
Global clocks (GCLKs) | 2 | 2 | 2 | 2
Frequency | 131.18 MHz | 129.25 MHz | 130.89 MHz | 111.35 MHz
Thermal power | 0.128 W | 0.130 W | 0.131 W | 0.132 W

Figure 5.30 shows the synthesis report with a summary of the main hardware device utilization, in terms of registers and total logic elements (LEs) by number of LUT inputs, together with the ModelSim simulation result.

Figure 5.30: Synthesis report and simulation result of the proposed 2-D DWT module; a- Quartus II simulation flow summary report, b- ModelSim-Altera 6.5 test bench


After performing synthesis and other verification processes, a Register Transfer Level (RTL) simulation of FDWT Module has been achieved. The RTL or Technology Map helps to check the design visually. The functionality of the individual hardware components comprising the 2-D -DWT chipset was tested using ModelSim-Altera 6.5b. Test benches are often used to provide an automated testing environment wrapper around the design.

After synthesis and place-and-route are completed, the Power Analyzer is used for full information on the design power estimates. Figure 5.31 shows the RTL schematic of the proposed 2-D DWT architecture with different blocks and their interconnections. The total number of logic elements LEs used determines hardware size.

As Table 5.11 shows, the number of cycles used for the 2-D DWT coding process increases significantly for the larger image size versions, and the clock cycle usage also rises with the number of decomposition levels; it ranges from 64323 clock cycles at (N=64, L=1) to 5264303 clock cycles at (N=512, L=7) for the 2-D DWT coding process of the HWT.


Figure 5.31: a- VHDL simulation of the HWT module using the Quartus II software tool, b- Register Transfer Level (RTL) view of the FDWT module after synthesis

Table 5.11: Number of clock cycles used for the 2-D DWT coding process of the HWT

2-D DWT coding process | 64×64 pixels | 128×128 pixels | 256×256 pixels | 512×512 pixels
Level 1 of 2-D DWT | 64323 | 249725 | 989565 | 3943805
Level 2 of 2-D DWT | 80128 | 312448 | 1237888 | 4931968
Level 3 of 2-D DWT | 84291 | 328649 | 1300611 | 5180291
Level 4 of 2-D DWT | 85414 | 332812 | 1316614 | 5243014
Level 5 of 2-D DWT | 85737 | 333935 | 1320777 | 5259017
Level 6 of 2-D DWT | 85840 | 334258 | 1321900 | 5263180
Level 7 of 2-D DWT | - | 334361 | 1322223 | 5264303

The outcomes in Table 5.10 show that the HWT 2-D DWT architecture can run at maximum operating frequencies of 131.18 MHz, 129.25 MHz, 130.89 MHz, and 111.35 MHz for the 64×64, 128×128, 256×256 and 512×512 input image sizes, respectively. From the simulation results, it can be deduced that the algorithm performs correctly and is capable of reducing both hardware expenditure and power consumption. The synthesis process showed that the four proposed versions are similar in terms of maximum frequency and the LEs used in the target device, while they differ in the number of clock cycles needed for coding, which in turn depends on the number of levels required. The smaller image versions therefore require significantly less computation time than the larger ones, and the computation time also increases with the level of decomposition for the 2-D DWT process, as illustrated in Table 5.12.


Table 5.12: HWT computation time for the 2-D DWT coding process

2-D DWT coding process | Lena 64×64 (ms) | Lena 128×128 (ms) | Lena 256×256 (ms) | Lena 512×512 (ms)
Level 1 | 0.49034 | 1.93210 | 7.56027 | 35.41809
Level 2 | 0.61082 | 2.41739 | 9.45746 | 44.29248
Level 3 | 0.64255 | 2.54273 | 9.93667 | 46.56441
Level 4 | 0.65112 | 2.57494 | 10.05893 | 47.12821
Level 5 | 0.65358 | 2.58363 | 10.09074 | 47.27206
Level 6 | 0.65436 | 2.58613 | 10.09932 | 47.30948
Level 7 | - | 2.58693 | 10.10178 | 47.31957

However, the HWT architecture requires a considerably higher computation time than the proposed efficient embedded-extension 5/3 LS filter structure. Table 5.13 compares the computation time of the implemented architectures for the Lena image of size 256×256, in ms, presenting comparative results for the proposed efficient embedded-extension 5/3 filter, the Le Gall 5/3 LS filter, and the HWT architecture.

Table 5.13: Comparison of computation time for the 2-D DWT HWT, Le Gall 5/3 lifting filter and the proposed efficient embedded 5/3 lifting filter (Lena image, 256×256 pixels)

2-D DWT decomposition | HWT (ms) | Le Gall 5/3 LS (ms) | Proposed 5/3 LS (ms)
Level L=1 | 7.56027 | 8.59753 | 2.07301
Level L=2 | 9.45746 | 10.75168 | 2.59388
Level L=3 | 9.93667 | 11.29414 | 2.72651
Level L=4 | 10.05893 | 11.43172 | 2.76088
Level L=5 | 10.09074 | 11.46711 | 2.77008
Level L=6 | 10.09932 | 11.47647 | 2.77270
Level L=7 | 10.10178 | 11.47907 | 2.77376


Furthermore, Table 5.14 compares the hardware performance of the implemented architectures, presenting comparative results in terms of the number of FPGA logic elements, image size, and consumed power. The proposed HWT architecture runs at a frequency of 130.89 MHz, compared with the other architectures, for the 256×256 image size version; its LE count is 700 LEs (2% of the available 33216). There is a genuine trade-off among the limitations whenever area, speed and power are balanced for significantly increased efficiency. The final result is a cost-effective, low-consumption HWT with considerably reduced hardware area compared with the conventional Le Gall 5/3 LS. However, the HWT consumes more hardware area and computation time than the proposed efficient 5/3 2-D DWT architecture.

Table 5.14: Performance comparison in gate usage for the 2-D DWT HWT, Le Gall 5/3 lifting filter and the proposed embedded 5/3 lifting filter (256×256 pixel images, Altera Cyclone II 2C35 FPGA device)

Analysis & synthesis resource | Traditional Haar (HWT) filter | Le Gall 5/3 lifting filter | Proposed embedded 5/3 lifting filter
Total logic elements (LEs) | 700 (2% of 33216) | 886 (3% of 33216) | 127 (<1% of 33216)
Total registers | 377 (1% of 33216) | 407 (<2% of 33216) | 80 (<1% of 33216)
I/O pins | 49 (10% of 475) | 50 (10% of 475) | 49 (10% of 475)
Global clocks (GCLKs) | 2 (13% of 16) | 2 (13% of 16) | 2 (13% of 16)
Frequency | 130.89 MHz (period 7.640 ns) | 114.92 MHz (period 8.702 ns) | 160.23 MHz (period 6.241 ns)
Thermal power | 0.131 W | 0.132 W | 0.114 W

The results of the implementation on the same FPGA chipset show that the FDWT circuit of the proposed efficient 5/3 2-D DWT architecture can process a 256×256 image in 2.07301 ms, which is around four times faster than the JPEG2000 implementation and more than three times faster than the HWT implementation for the same image size, with less hardware utilization. The main objective of high speed, or short computation time, has been achieved by minimizing the number of clock cycles required for the multilevel 2-D DWT computation with little overhead on the hardware resources.

5.4 Summary

From the simulation results, it can be deduced that the proposed 5/3 algorithm is competent enough to reduce hardware expenditure and computation time in contrast to the traditional JPEG2000 filter for lossless image compression, the Le Gall 5/3 lifting filter. The proposed 5/3 LS hardware filter produces results identical to its MATLAB model, while the HWT deviates very slightly from its MATLAB model over the seven levels of DWT decomposition. The execution time and hardware area for performing the DWT in the HWT architecture are, to some extent, higher than for the 5/3 LS. The hardware logic and register area generated for the proposed 5/3 filter is 127 slices, which uses less than 1% of the Altera DE2 development board Cyclone II FPGA hardware area, and the energy consumption of the 2-D DWT decomposition process is only 0.033 W. Simulations have been carried out using grayscale images of various sizes to validate the proposed design and to reach a speed performance suitable for real-time applications. Four versions of the design are presented, each adapted to one of the 64×64, 128×128, 256×256 and 512×512 input image sizes.


CHAPTER 6

CONCLUSIONS AND FUTURE STUDIES

6.1 Conclusion

This study deals with the problems of DWT-based image compression algorithms and their associated transforms, such as power consumption, hardware cost and computation time. For such architectures, the major goals include multiplierless operation, low memory demands and efficient utilization. A reduction in power consumption in such architectures may be attained by decreasing or avoiding the use of high-power components such as multipliers. Therefore, multiplierless multilevel decomposition architectures for image compression algorithms and their related DWT that compensate for these issues were proposed. The objective of this study has been to develop a scheme for the design of hardware resource-efficient and high-speed architectures for the computation of the 1-D and 2-D DWT. Efficient low-power architectures for multilevel FDWT and IDWT decomposition are outlined in Chapter 3 to Chapter 5.

The goal of high speed, or short computation time, has been achieved by minimizing the number of clock cycles required for the DWT computation and maximizing the operating frequency with little or no overhead on the hardware resources. To achieve this goal, certain characteristics inherent in the 1-D and 2-D discrete wavelet transforms have been exploited through a number of basic ideas, so that the boundary extension can be performed without additional computational complexity.

The JPEG2000 conventional symmetric extension requires supplementary memory units along with operations of high power and area consumption; hence, extra clock cycles at the start and at the end are dissipated in processing each row and column of the image. In order to enhance the conventional symmetric extension, a study was undertaken in Chapter 3 to determine the number of data terms to extend, based on whether the pixel data set begins or ends with an even- or odd-indexed term. The enhancement of the symmetric data extension is based on the fact that, because both the extended data and the lifting structure are symmetrical, all the intermediate and final results of the lifting are also symmetrical with regard to the boundary points. The symmetric extension of the image is managed so that no supplementary computations or clock cycles are needed. It is shown in Chapter 5 that this embedded extension technique is the most practical from a hardware standpoint, as it decreases the power consumption.

Chapter 3 also proposed a comprehensive mathematical model of DWT energy consumption, which involves calculating the computational and data-access loads to estimate the energy efficiency of the proposed architectures. The DWT algorithm requires data storage, memory transfer operations and arithmetic operations (additions and shifts). The energy of the wavelet algorithm adopting the 2-D DWT has been computed for the integer 5-tap/3-tap filter as a direct function of the consumed coefficient units. The total computation energy used by the hardware architecture for the 5/3 LS 2-D DWT is computed as a weighted sum of the computational load and the memory data-access load, covering the shift, add, read and write basic operations per unit pixel.
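The form of this weighted sum can be sketched as follows; the actual operation counts and per-operation energy costs are those derived in Chapter 3, and the argument names here are illustrative placeholders:

```matlab
% Sketch of the weighted-sum energy model: total energy as the sum of the
% computational load and the memory data-access load, with per-operation
% costs supplied by the caller.
function E = dwt_energy(n_add, n_shift, n_read, n_write, e_add, e_shift, e_read, e_write)
    E_comp   = n_add * e_add + n_shift * e_shift;     % computational load
    E_access = n_read * e_read + n_write * e_write;   % data-access load
    E = E_comp + E_access;
end
```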


Chapter 3 also presents a modified LS DWT computation analysis based on the 5/3 algorithm. The modified LS DWT circumvents the need to compute the high-pass filter coefficients, thereby decreasing hardware usage and enhancing the processing speed of image data transmission. Since the high-pass filter coefficients entail no computation, this reduces the computation energy used in the entire wavelet-based image compression procedure by decreasing the number of computation operations required to compress an image. Even though the energy saved between the first two levels of decomposition is considerable (25%), there is almost no identifiable difference in the quality of the two reconstructed images, since the image quality degradation is within only 9 dB. The communication energy saved increases with the number of decomposition levels. The impact of the decomposition level on the computation energy savings is more considerable at the lower transform levels and becomes nearly stable from the third decomposition level onwards.

In Chapter 4 and Chapter 5, a distinct 1-D DWT filter is integrated into the proposed linear-algebra HWT-based coding by means of a multilevel RC architecture implementation to lessen hardware usage. The RC scheme applies the forward 1-D DWT in both the horizontal and vertical directions of the image for a chosen number of levels. Multiple levels of computation can be implemented in this architecture, with the intermediate results between levels saved in the external memory. The proposed architectures have been examined using diverse images, different levels of decomposition and clock frequencies. The results showed that the number of occupied LEs is small relative to the LEs available in the target device. The HWT approach needs a somewhat longer duration to complete its processing compared to the traditional Le Gall 5/3 LS or the efficient embedded-extension 5/3 LS techniques. In addition, the proposed efficient architecture surpasses other similar architectures in terms of hardware usage, speed and power consumption.

6.2 Major Contributions

The contributions towards the realization of the research objectives stated in Chapter 1 have been fully explained in Chapter 3 to Chapter 5, and are summarized below:

a. The main contribution is the development of a high-speed and resource-efficient 5/3 LS-based DWT multilevel decomposition hardware architecture that lowers both power consumption and computation time. The reduction is attained by incorporating the data extension into the main lifting-based DWT such that the computation procedures are reduced.

b. A comprehensive mathematical model of DWT power consumption is developed by computing the computational and data-access loads associated with the multi-resolution levels of the 5/3 LS-based DWT decomposition process.

c. A modified computation of the DWT stages is explored from the standpoint of avoiding the computation of the high-frequency subbands, with a considerable reduction in power consumption.

d. Analysis results from the FPGA evaluation measures prove that the proposed system attains high performance levels compared with the standard JPEG2000 Le Gall 5/3 LS and HWT filters.


6.3

Future Studies

Further study on the following points can be carried out: 1. The future extension recommended by this work includes further work on the transformation phase as well the coding phase, to be able to develop a comprehensive scheme for image compression. A potential research direction is to merge the proposed DWT architectures with a quantization and encoding algorithm to generate comprehensive DWT-based image and video compression systems. DWT has been utilized for image compression algorithms comprising the Embedded Zerotree EZW, Set Partitioning in Hierarchical Trees SPIHT, and Embedded Bock Coding with Optimized Truncation EBCOT encoders. Hence, a promising research approach is to relate 5/3 DWT architecture with such encoders in order to establish a high performance DWT-based compression system.

2. Complete 3-D compression systems based on a 3-D DWT architecture can also be regarded as a potential research path to achieve high throughput and low memory demands, particularly for high-definition video processing and 3-D DWT-based applications. Comparative analysis of the performance of new architectures against the architectures introduced in this study can then be carried out.


REFERENCES ACHARYA, T. & CHAKRABARTI, C. 2006. A survey on lifting-based discrete wavelet transform architectures. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 42, pp. 321-339. ACHARYA, T. & TSAI, P.-S. 2005. JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures, John Wiley & Sons. ADAMS, M. D. 2013. Multiresolution Signal and Geometry Processing: Filter Banks, Wavelets, and Subdivision, University of Victoria Press, Victoria, BC, Canada, (Version: 2013-09-26), ISBN 978-1-55058-508-7. ADAMS, M. D. & KOSSENTNI, F. 2000. Reversible integer-to-integer wavelet transforms for image compression: performance evaluation and analysis., IEEE Transactions on Image Processing, vol.9, pp. 1010-1024. AHSAN, M. R., IBRAHIMY, M. I. & KHALIFA, O. O. 2011.VHDL modelling of fixed-point DWT for the purpose of EMG signal denoising, Third International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN), pp. 236-241. AKANSU, A. N. & HADDAD, P. R. 2000. Multiresolution Signal Decomposition: Transforms, Subbands, and Wavelets, Academic Press. AKYILDIZ, I. F., SU, W., SANKARASUBRAMANIAM, Y. & CAYIRCI, E. 2002. Wireless sensor networks: A survey. Computer Networks, vol.38, pp. 393422. AL MUHIT, A., ISLAM, M. S. & OTHMAN, M. 2004.VLSI implementation of discrete wavelet transform (DWT) for image compression. Proceedings of the 2nd International Conference on Autonomous Robots and Agents, ICARA, Palmerston North, New Zealand, pp. 13-15. ALTERA CORPORATION. 2007. Cyclone II Device Handbook. 1. ALTERA CORPORATION. 2008. Quartus II Introduction using Schematic Design. ALTERA CORPORATION. 2014. Quartus II Handbook Volume 3: Verification. 3. ANDRA, K., CHAKRABARTI, C. & ACHARYA, T. 2002. A VLSI architecture for lifting-based forward and inverse wavelet transform., IEEE Transactions on Signal Processing, vol.50, pp. 966-977. ANDRES, E., WIDHALM, M. & CALOTO, A. 2009. Achieving high speed CFD simulations: Optimization, parallelization, and FPGA acceleration for the unstructured DLR TAU Code. 47th American Institute of Aeronautics and Astronautics AIAA, Aerospace Sciences Meeting and the New Horizons Forum and Aerospace Exhibit, vol.47, pp. 8745-8764.


ANDREWS, G. E. 1998. The geometric series in calculus. American Mathematical Monthly, pp. 36-40. ANGELOPOULOU, M., MASSELOS, K., CHEUNG, P. K. & ANDREOPOULOS, Y. 2008. Implementation and comparison of the 5/3 lifting 2D discrete wavelet transform computation schedules on FPGAs, Journal of Signal Processing Systems, vol.51, pp. 3-21. AZIZ, S. M. & PHAM, D. M. 2012. Efficient parallel architecture for multi-level forward discrete wavelet transform processors, Computers & Electrical Engineering, vol.38, pp. 1325-1335. BAGHERI HAMANEH, M., CHITRAVAS, N., KAIBORIBOON, K., LHATOO, S. & LOPARO, K. 2014. Automated removal of EKG artifact from EEG data using independent component analysis and continuous wavelet transformation, IEEE Transactions on Biomedical Engineering, vol.61(6), pp. 1634 - 1641. BARUA, S., CARLETTA, J., KOTTERI, K. A. & BELL, A. E. 2005. An efficient architecture for lifting-based two-dimensional discrete wavelet transforms. INTEGRATION, the VLSI Journal, vol.38, pp. 341-352. BARUA, S., CARLETTA, J., KOTTERI, K. A. & BELL, A. E. 2012. A detailed survey on VLSI architectures for lifting based DWT for efficient hardware implementation, International Journal of VLSI Design & Communication Systems (VLSICS), vol.3, pp. 143-164. BENDERLI, O., TEKMEN, Y. C. & ISMAILOGLU, N. 2003.A real-time, low latency, FPGA implementation of the 2-D discrete wavelet transformation for streaming image applications, Euromicro Symposium on Digital System Design, Belek-Antalya, Turkey, pp. 384-389. BENKRID, A., BENKRID, K. & CROOKES, D. 2003. Design and implementation of a generic 2D orthogonal discrete wavelet transform on FPGA, 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM, pp. 162-172. BENKRID, K., CROOKES, D. & BENKRID, A. 2002. Towards a general framework for FPGA based image processing using hardware skeletons, Parallel Computing, vol.28, pp. 1141-1154. BEYLKIN, G., COIFMAN, R. & ROKHLIN, V. 1992. Wavelets in numerical analysis, Wavelets and Their Applications, pp. 181-210. BHARDWAJ, A. & ALI, R. 2009. Image compression using modified fast haar wavelet transform, World Applied Sciences Journal, vol.7, pp. 647-653. BHASKER, J. 1999. A VHDL Primer, Prentice Hall PTR.


BOIX, M. & CANTO, B. 2010. Wavelet transform application to the compression of images, Mathematical and Computer Modelling, vol.52, pp. 1265-1270. BOLUK, P. S., BAYDERE, S. & HARMANCI, A. E. 2011. Robust image transmission over wireless sensor networks. Mobile Networks and Applications, vol.16, pp. 149-170. BURT, P. J. & ADELSON, E. H. 1983. The laplacian pyramid as a compact image code, IEEE Transactions on Communications, vol.31, pp. 532-540. CHAKRABARTI, C. & MUMFORD, C. 1996. Efficient realizations of analysis and synthesis filters based on the 2-D discrete wavelet transform, IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP-96, Atlanta, GA, vol.6, pp. 3256-3259. CHAKRABARTI, C. & VISHWANATH, M. 1995. Efficient realizations of the discrete and continuous wavelet transforms: From single chip implementations to mappings on SIMD array computers, IEEE Transactions on Signal Processing, vol.43, pp. 759-771. CHANG, C.-C. & LIN, Y.-L. 2004. A Dual Mode (5,3)/(9,7) FDWT/IDWT hardware accelerator IP, 15th International Symposium on VLSI Design, Automation and Test CAD, Kenting Taiwan. CHANG, Q. & GAOFENG, W. 2006. A wavelet-based parallel implementation for image encoding, 8th IEEE International Conference on Signal Processing, Beijing, China, vol.2, pp. 16-20. CHAO-TSUNG, H., PO-CHIH, T. & LIANG-GEE, C. 2002. Flipping structure: an efficient VLSI architecture for lifting-based discrete wavelet transform, AsiaPacific Conference on Circuits and Systems APCCAS '02, vol.1, pp. 383-388. CHAO-TSUNG, H., PO-CHIH, T. & LIANG-GEE, C. 2005. Generic RAM-based architectures for two-dimensional discrete wavelet transform with line-based method, IEEE Transactions on Circuits and Systems for Video Technology, vol.15, pp. 910-920. CHAO, W., ZHILIN, W., PENG, C. & JIE, L. 2007. An efficient VLSI architecture for lifting-based discrete wavelet transform, IEEE International Conference on Multimedia and Expo, Beijing, China, pp. 1575-1578. CHAVER, D., TENLLADO, C., PIÑUEL, L., PRIETO, M. & TIRADO, F. 2002. 2D wavelet transform enhancement on general-purpose microprocessors: Memory hierarchy and SIMD parallelism exploitation, High Performance Computing²HiPC 2002. Springer. CHEN, C.-Y., YANG, Z.-L., WANG, T.-C. & CHEN, L.-G. 2001. A programmable parallel VLSI architecture for 2-D discrete wavelet transform, Journal of VLSI signal processing systems for signal, image and video technology, vol.28, pp. 151-163. 191 

CHEN, P.-Y. 2004. VLSI implementation for one-dimensional multilevel lifting-based wavelet transform, IEEE Transactions on Computers, vol.53, pp. 386-398. CHENG, C.-C., HUANG, C.-T., CHEN, C.-Y., LIAN, C.-J. & CHEN, L.-G. 2007. On-chip memory optimization scheme for VLSI implementation of line-based two-dimensional discrete wavelet transform, IEEE Transactions on Circuits and Systems for Video Technology, vol.17, pp. 814-822. CHENG, C. & PARHI, K. K. 2008. High-speed VLSI implementation of 2-D discrete wavelet transform, IEEE Transactions on Signal Processing, vol.56, pp. 393-403. CHIH-HSIEN, H., JEN-SHIUN, C. & JING-MING, G. 2013. Memory-efficient hardware architecture of 2-D dual-mode lifting-based discrete wavelet transform, IEEE Transactions on Circuits and Systems for Video Technology, vol.23, pp. 671-683. CHRISTOPOULOS, C., SKODRAS, A. & EBRAHIMI, T. 2000. The JPEG2000 still image coding system: An overview, IEEE Transactions on Consumer Electronics, vol.46, pp. 1103-1127. CHRYSAFIS, C. & ORTEGA, A. 2000. Line-based, reduced memory, wavelet image compression, IEEE Transactions on Image Processing, vol.9, pp. 378-389. COLOM-PALERO, R. J., GADEA-GIRONES, R., BALLESTER-MERELO, F. J. & MARTINEZ-PEIRO, M. 2004. Flexible architecture for the implementation of the two-dimensional discrete wavelet transform (2D-DWT) oriented to FPGA devices, Microprocessors and Microsystems, vol.28, pp. 509-518. COMPTON, K. & HAUCK, S. 2008. Automatic design of reconfigurable domain-specific flexible cores, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.16, pp. 493-503. COMPTON, K. & HAUCK, S. 2002. Reconfigurable computing: A survey of systems and software, ACM Computing Surveys, vol.34, pp. 171-210. CONG, J. & XIAO, B. 2011. mrFPGA: A novel FPGA architecture with memristor-based reconfiguration, IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), San Diego, CA, pp. 1-8. COSTA, D. G. & GUEDES, L. A. 2012. A discrete wavelet transform (DWT)-based energy-efficient selective retransmission mechanism for wireless image sensor networks, Journal of Sensor and Actuator Networks, vol.1, pp. 3-35. CROCHIERE, R. E., WEBBER, S. A. & FLANAGAN, J. L. 1976. Digital coding of speech in sub-bands, Bell System Technical Journal, vol.55, pp. 1069-1085.


CROISIER, A., ESTEBAN, D., GALAND, C. 1976. Perfect channel splitting by use of interpolation/decimation/tree decomposition techniques, Proc. of Int. Symposium on Information Circuis and Systems Information, Patras, Greece, pp. 443-446. DAI, Q., CHEN, X. & LIN, C. 2004. A novel VLSI architecture for multidimensional discrete wavelet transform, IEEE Transactions on Circuits and Systems for Video Technology, vol.14, pp. 1105-1110. DARJI, A., AGRAWAL, S., OZA, A., SINHA, V., VERMA, A., MERCHANT, S. & CHANDORKAR, A. 2014. Dual-scan parallel flipping architecture for a lifting-based 2-D discrete wavelet transform, IEEE Transactions on Circuits and Systems II: Express Briefs, vol.61, pp. 433-437. DAUBECHIES, I. 1992. Ten lectures on wavelets, SIAM. DAUBECHIES, I. & SWELDENS, W. 1998. Factoring wavelet transforms into lifting steps, Journal of Fourier Analysis and Applications, vol.4, pp.247-269. DENK, T. C. & PARHI, K. K. 1998.Systolic VLSI architectures for 1-D discrete wavelet transforms, 1998, IEEE Conference Record of the Thirty-Second Asilomar Conference on Signals, Systems & Computers, pp. 1220-1224. DI CARLO, S., PRINETTO, P., SCIONTI, A., FIGUERAS, J., MANICH, S. & RODRIGUEZ-MONTANÉS, R. 2009. A Low-cost FPGA-based test and diagnosis architecture for SRAMs, First IEEE International Conference on Advances in System Testing and Validation Lifecycle VALID '09, Porto, pp. 141-146. DIA, D., ZEGHID, M., SAIDANI, T., ATRI, M., BOUALLEGUE, B., MACHHOUT, M. & TOURKI, R. 2009. Multi-level discrete wavelet transform architecture design, Proceedings of the World Congress on Engineering, pp. 1-3. DONG-GI, L. & DEY, S. 2002. Adaptive and energy efficient wavelet image compression for mobile multimedia data services, IEEE International Conference on Communications ICC, vol.4, pp. 2484-2490 ELFOULY, F. H., MAHMOUD, M. I., DESSOUKY, M. I. & DEYAB, S. 2008. Comparison between Haar and Daubechies wavelet transformions on FPGA technology, International Journal of Computer and Information Engineering, vol.2, pp. 37-42. ENZLER, R. 1999. The Current Status of Reconfigurable Computing, Swiss Federal Institute of Technology (ETH) Zurich, Electronics Laboratory. FARAHANI, M. A. & ESHGHI, M. 2006. Architecture of a wavelet packet transform using parallel filters, IEEE International Conference on Applied ElectronicsTENCON,Hong Kong, PP. 7-10.


FAROOQ, U., MARRAKCHI, Z. & MEHREZ, H. 2012. Tree-Based ASIF using heterogeneous blocks, Tree-based Heterogeneous FPGA Architectures, Springer New York,pp. 153-171. FRAZIER, M. 1999. An Introduction to Wavelets Through Linear Algebra, Springer. GABOR, D. 1946. Theory of communication. Part 1: The analysis of information, Journal of the Institution of Electrical Engineers-Part III: Radio and Communication Engineering, vol.93, pp. 429-441. GAJSKI, D. D. & KUHN, R. H. 1983. Guest editors' introduction: New VLSI tools, Computer, vol.16, pp.11-14. GAO, Z.-R. & XIONG, C.-Y. 2005. An efficient line-based architecture for 2-D discrete wavelet transform, IEEE International Conference Proceedings on Communications, Circuits and Systems,vol.2, pp. 1322-1325. GOKHALE, M. B. & GRAHAM, P. S. 2006. Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays, Springer Science & Business Media. GONZALEZ, R. C. & WOODS, R. E. 2002. Digital Image Processing, Prentice hall Upper Saddle River, NJ. GRGIC, S., GRGIC, M. & ZOVKO-CIHLAR, B. 2000. Optimal decomposition for wavelet image compression, IEEE Proceedings of the First International Workshop on Image and Signal Processing and Analysis IWISPA 2000, Pula, PP. 203-208. GRGIC, S., GRGIC, M. & ZOVKO-CIHLAR, B. 2001. Performance analysis of image compression using wavelets, IEEE Transactions on Industrial Electronics, vol.48, pp. 682-695. GRZESZCZAK, A., MANDAL, M. K. & PANCHANATHAN, S.1996. VLSI implementation of discrete wavelet transform, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.4, pp. 421-433. GUO, Y., ZHANG, H., WANG, X. & CAVALLARO, J. R. 2001.VLSI implementation of Mallat's fast discrete wavelet transform algorithm with reduced complexity, IEEE Global Telecommunications Conference GLOBECOM'01,San Antonio, TX, vol.1, pp. 320-324. GUPTA, V. & RAJ, K. 2012. An efficient modified lifting based 2-D discrete wavelet transform architecture, 1st International Conference on Recent Advances in Information Technology (RAIT), pp. 832-837. HAMBLEN, J. O., HALL, T. S. & FURMAN, M. D. 2006, Rapid Prototyping of Digital Systems: Quartus® II Edition, Springer.


HONGYU, L., MANDAL, M. K. & COCKBURN, B. F. 2002. Novel architectures for the lifting-based discrete wavelet transform, IEEE Canadian Conference on Electrical and Computer Engineering CCECE, vol.2, pp. 1020-1025. HONGYU, L., MANDAL, M. K. & COCKBURN, B. F. 2004. Efficient architectures for 1-D and 2-D lifting-based wavelet transforms, IEEE Transactions on Signal Processing, vol.52, pp. 1315-1326. HSIA, C.-H., CHIANG, J.-S. & GUO, J.-M. 2013. Memory-efficient hardware architecture of 2-D dual-mode lifting-based discrete wavelet transform, IEEE Transactions on Circuits and Systems for Video Technology, vol.23, pp. 671-683. HUANG, C.-T., TSENG, P.-C. & CHEN, L.-G. 2004. Flipping structure: an efficient VLSI architecture for lifting-based discrete wavelet transform, IEEE Transactions on Signal Processing, vol.52, pp. 1080-1089. HUANG, C.-T., TSENG, P.-C. & CHEN, L.-G. 2005a. Analysis and VLSI architecture for 1-D and 2-D discrete wavelet transform, IEEE Transactions on Signal Processing, vol.53, pp. 1575-1586. HUANG, C.-T., TSENG, P.-C. & CHEN, L.-G. 2005b. Generic RAM-based architectures for two-dimensional discrete wavelet transform with line-based method, IEEE Transactions on Circuits and Systems for Video Technology, vol.15, pp. 910-920. HUNG, K.-C., HUNG, Y.-S. & HUANG, Y.-J. 2001. A nonseparable VLSI architecture for two-dimensional discrete periodized wavelet transform, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.9, pp. 565576. JAIN, R. & PANDA, P. R. 2007. An efficient pipelined VLSI architecture for liftingbased 2d-discrete wavelet transform, IEEE International Symposium on Circuits and Systems ISCAS,New Orleans, LA, pp.1377-1380. JAYAKUMAR, A. & BABU ANTO, P. 2013. Comparison of wavelet packets and DWT in a gender based multichannel speaker recognition system, IEEE Conference on Information & Communication Technologies (ICT), JeJu Island, pp. 847-850. JENSEN, A. & LA COUR-HARBO, A. 2001. Ripples in Mathematics: The discrete wavelet transform, Springer. JER MIN, J., PEI-YIN, C., YEU-HORNG, S. & MING-SHIANG, L.1999. A scalable pipelined architecture for separable 2-D discrete wavelet transform, Proceedings of the Asia and South Pacific Design Automation Conference ASP-DAC '99, Wanchai , vol.1, PP. 205-208. JIA, J. Y., SINGARAJU, P., DHAOUI, F., NEWELL, R., LIU, P., MICAEL, H., TRAAS, M., SAMMIE, S., HAWLEY, F. & MCCOLLUM, J. 2012. Performance and reliability of a 65nm Flash based FPGA, 11th IEEE 195 

International Conference on Solid-State and Integrated Circuit Technology (ICSICT),Xi'an, PP. 1-3. KAISER, G. 1998. The fast Haar transform. Potentials, IEEE, vol.17, pp. 34-37. KAISER, G. 2010. AFriendly Guide to Wavelets, Springer. KHAN, A., THAKARE, A. & GULHANE, S. 2010. FPGA-based design of controller for sound fetching from codec using Altera DE2 Board, International Journal of Scientific & Engineering Research, vol.1, pp. 1-8. KRISHNAIAH, G. C., PRASAD, T. J. & PRASAD, M. G. 2012. Algorithm for improved image compression and reconstruction performances, Signal & Image Processing : International Journal (SIPIJ), vol.3, pp. 79-98 KRONLAND-MARTINET, R., MORLET, J. & GROSSMANN, A. 1987. Analysis of sound patterns through wavelet transforms. International Journal of Pattern Recognition and Artificial Intelligence, vol.1, pp. 273-302. LAFRUIT, G., NACHTERGAELE, L., VANHOOF, B. & CATTHOOR, F. 2000. The local wavelet transform: a memory-efficient, high-speed architecture optimized to a region-oriented zero-tree coder, Integrated Computer-Aided Engineering, vol.7, pp. 89-103. LAKSHMANAN, M. K. & NIKOOKAR, H. 2006. A review of wavelets for digital wireless communication. Wireless Personal Communications, vol.37, pp. 387-420. LAN, X., ZHENG, N. & LIU, Y. 2005. Low-power and high-speed VLSI architecture for lifting-based forward and inverse wavelet transform, IEEE Transactions on Consumer Electronics, vol.51, pp. 379-385. LANDMAN, P.1996. High-level power estimation, IEEE International Symposium on Low Power Electronics and Design, Monterey, CA , pp. 29-35. LE GALL, D. & TABATABAI, A. 1988. Sub-band coding of digital images using symmetric short kernel filters and arithmetic coding techniques, IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP-88,New York, NY, vol.2 , pp.761-764. LECUIRE, V., DURAN-FAUNDEZ, C. & KROMMENACKER, N. 2007. Energyefficient transmission of wavelet-based images in wireless sensor networks. EURASIP Journal on Image and Video Processing, 2007. LEWIS, A. & KNOWLES, G. 1991. VLSI architecture for 2-D Daubechies wavelet transform without multipliers, Electronics Letters, vol.27, pp. 171-173. LIAO, H., MANDAL, M. K. & COCKBURN, B. F. 2004. Efficient architectures for 1-D and 2-D lifting-based wavelet transforms, IEEE Transactions on Signal Processing, vol.52, pp. 1315-1326. 196 

LOUIS, A. K., MAASS, D. & RIEDER, A. 1997. Wavelets: Theory and Applications, vol. 36 of Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts. LUO, Y. & WARD, R. K. 2003. Removing the blocking artifacts of block-based DCT compressed images, IEEE Transactions on Image Processing, vol.12, pp. 838-842. MADISHETTY, S. K., MADANAYAKE, A., CINTRA, R. J. & DIMITROV, V. S. 2014. Precise VLSI architecture for AI based 1-D/ 2-D Daub-6 wavelet filter banks with low adder-count, IEEE Transactions on Circuits and Systems I: Regular Papers, vol.61, pp. 1984-1993. MAHAPATRA, C., LEUNG, V. C. & STOURAITIS, T. 2014. An orthogonal wavelet division multiple-access processor architecture for LTE-advanced wireless/radio-over-fiber systems over heterogeneous networks, EURASIP Journal on Advances in Signal Processing, vol.77, pp. 1-16. MAHMOUD, M. I., DESSOUKY, M. I., DEYAB, S. & ELFOULY, F. H. 2007. Comparison between Haar and Daubechies wavelet transformations on FPGA technology, Proceedings of World Academy of Science, Engineering and Technology, vol.26, pp. 68-72. MALLAT, S. G. 1989a. A theory for multiresolution signal decomposition: The wavelet representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.11, pp. 674-693. MALLAT, S. G. 1989b. Multifrequency channel decompositions of images and wavelet models, IEEE Transactions on Acoustics, Speech and Signal Processing, vol.37, pp. 2091-2110. MAMUN, M., JIA, X. & RYAN, M. J. 2014. Nonlinear elastic model for flexible prediction of remotely sensed multitemporal images, IEEE Geoscience and Remote Sensing Letters, vol.11, pp. 1005-1009. MANDAL, M. 2003. Digital Image Compression Techniques. Multimedia Signals and Systems, The Springer International Series in Engineering and Computer Science, vol.716, pp 169-202. MANO, M. M. 1993. Computer System Architecture (3rd ed.), Prentice-Hall, Inc. MANSOURI, A., AHAITOUF, A. & ABDI, F. 2009. An efficient VLSI architecture and FPGA implementation of high-speed and low power 2-D DWT for (9, 7) wavelet filter, IJCSNS International Journal of Computer Science and Network Security, vol.9, pp. 50-60. MARCELLIN, M. W. 2002. JPEG2000 Image Compression Fundamentals, Standards and Practice: Image Compression Fundamentals, Standards, and Practice, Springer. 197 

MARINO, F. 2000. Efficient high-speed/low-power pipelined architecture for the direct 2-D discrete wavelet transform, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol.47, pp. 1476-1491. MARINO, F. 2001. Two fast architectures for the direct 2-D discrete wavelet transform, IEEE Transactions on Signal Processing, vol.49, pp. 1248-1259. MARINO, F., GUEVORKIAN, D. & ASTOLA, J. T. 2000. Highly efficient highspeed/low-power architectures for the 1-D discrete wavelet transform, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol.47, pp.1492-1502. MARINO, F., PIURI, V. & SWARTZLANDER, E. 1999. A parallel implementation of the 2-D discrete wavelet transform without interprocessor communications, IEEE Transactions on Signal Processing, vol.47, pp. 31793184. MASUD, S. & MCCANNY, J. V. 2004. Reusable silicon IP cores for discrete wavelet transform applications, IEEE Transactions on Circuits and Systems I: Regular Papers, vol.51, pp. 1114-1124. MCCANNY, P., MASUD, S. & MCCANNY, J. 2002. Design and implementation of the symmetrically extended 2-D wavelet transform, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. III3108-III-3111. MEHER, P. K., MOHANTY, B. K. & PATRA, J. C. 2008. Hardware-efficient systolic-like modular design for two-dimensional discrete wavelet transform, IEEE Transactions on Circuits and Systems II: Express Briefs, vol.55, pp. 151-155. MEHER, P. K., MOHANTY, B. K. & SWAMY, M. M. S. 2015. Low-Area and Low-Power Reconfigurable Architecture for Convolution-Based 1-D DWT Using 9/7 and 5/3 Filters, 28th International Conference on VLSI Design (VLSID), pp. 327-332. MEYER, Y. 1993. Wavelets-Algorithms and Applications. Wavelets-Algorithms and applications Society for Industrial and Applied Mathematics Translation SIAM. MOHANTY, B. K. & MEHER, P. K. 2011a. Memory-efficient architecture for 3-D DWT using overlapped grouping of frames, IEEE Transactions on Signal Processing, vol.59, pp. 5605-5616. MOHANTY, B. K. & MEHER, P. K. 2011b. Memory efficient modular VLSI architecture for highthroughput and low-latency implementation of multilevel lifting 2-D DWT, IEEE Transactions on Signal Processing, vol.59, pp. 20722084.


MOHANTY, B. K., MAHAJAN, A. & MEHER, P. K. 2012. Area-and powerefficient architecture for high-throughput implementation of lifting 2-D DWT, IEEE Transactions on Circuits and Systems II: Express Briefs, vol.59, pp. 434-438. MOHANTY, B. K. & MEHER, P. K. 2013. Memory-efficient high-speed convolution-based generic structure for multilevel 2-D DWT, IEEE Transactions on Circuits and Systems for Video Technology, vol.23, pp. 353363. MOVVA, S. & SRINIVASAN, S. 2003. A novel architecture for lifting-based discrete wavelet transform for JPEG2000 standard suitable for VLSI implementation. Proceedings, 16th IEEE International Conference on VLSI Design, pp. 202-207. NAYAK, S. 2005. Bit-level systolic implementation of 1D and 2D discrete wavelet transform, Circuits, IEEE Proceedings Devices and Systems, vol.152(1), pp. 25-32. PALERO, R., GIRONÉS, R. & CORTES, A. 2006. A Novel FPGA architecture of a 2-D wavelet transform, Journal of VLSI signal processing systems for signal, image and video technology, vol.42, pp. 273-284. PAPADOMANOLAKIS, K., KAKAROUNTAS, A., KOKKINOS, V., SKLAVOS, N. & GOUTIS, C. 2001.The effect of fault secureness in low power multiplier designs, IEEE International Workshop on Power And Timing Modeling, Optimization and Simulation (PATMOS'01),Switzerland, pp. 1-13. PARHI, K. K. & NISHITANI, T. 1993. VLSI architectures for discrete wavelet transforms, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.1, pp. 191-202. PARK, T. & JUNG, S. 2002. High speed lattice based VLSI architecture of 2D discrete wavelet transform for real-time video signal processing, IEEE Transactions on Consumer Electronics, vol.48, pp. 1026-1032. PARVATHAM, N. & GOPALAKRISHNAN, S. 2012. A novel architecture for an efficient implementation of image compression using 2D-DWT, Third International Conference on Intelligent Systems, Modelling and Simulation (ISMS), Kota Kinabalu, pp . 374-378. PATIL, N., DAS, D., SCANFF, E. & PECHT, M. 2013. Long term storage reliability of antifuse field programmable gate arrays, Microelectronics Reliability, vol.53, pp. 2052-2056. PEDRONI, V. A. 2004. Circuit Design with VHDL, Massachusetts Institute of Technology MIT press. PELLERIN, D. & THIBAULT, S. 2005. Practical FPGA programming in C, Prentice Hall Press. 199 

PHANG, C. & PHANG, P. 2008. Modified fast and exact algorithm for fast haar transform, International Journal of Computer Science and Engineering, vol.2, pp. 55-58. PITTNER, S. & KAMARTHI, S. V. 1999. Feature extraction from wavelet coefficients for pattern recognition tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.21, pp. 83-88. PO-CHIH, T., CHAO-TSUNG, H. & LIANG-GEE, C. 2002. Generic RAM-based architecture for two-dimensional discrete wavelet transform with line-based method, Asia-Pacific Conference on Circuits and Systems, APCCAS '02, vol.1, pp. 363-366. POGREBNYAK, O., HERNÁNDEZ-BAUTISTA, I., CAMACHO NIETO, O. & MANRIQUE RAMÍREZ, P. 2014. Wavelet filter adjusting for image lossless compression using pattern recognition, Pattern Recognition, Springer International Publishing, Lecture Notes in Computer Science, vol.8495, pp. 221-230 POWELL, S. R. & CHAU, P. M. 1990. Estimating power dissipation of VLSI signal processing chips: The PFA technique, IEEE Workshop on VLSI Signal Processing, vol. IV, pp. 250-259. POWELL, S. R. & CHAU, P. M. 1991. A model for estimating power dissipation in a class of DSP VLSI chips, IEEE Transactions on Circuits and Systems, vol.38(6), pp. 646-650. QUIJAS, J. & FUENTES, O. 2014. Removing JPEG blocking artifacts using machine learning, IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI),San Diego, CA, pp. 77-80. QING, S., JIANG, J., YONGXIN, Z. & YUZHUO, F. 2013. A Reconfigurable Architecture for 1-D and 2-D Discrete Wavelet Transform, 21st IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 81-84. RAGHUNATH, S. & AZIZ, S. M. 2006. High Speed Area Efficient Multi-resolution 2-D 9/7 filter DWT Processor, IFIP International Conference on Very Large Scale Integration,Nice, pp. 210-215. RAJAN, S. 1995. Practical state machine design using VHDL, Integrated System Design, pp. 58-70. RAO, R. M. & BOPARDIKAR, A. S. 1997. Wavelet Transforms-Introduction to Theory and Applications, Addison-Wesley-Longman. REIN, S. & REISSLEIN, M. 2011. Low-memory wavelet transforms for wireless sensor networks: A tutorial, IEEE Communications Surveys & Tutorials,vol. 13(2), pp. 291-307.


RIOUL, O. & VETTERLI, M. 1991. Wavelets and signal processing, IEEE Signal Processing Magazine,vol.8(4), pp. 14-38. ROESER, P. R. & JERNIGAN, M. E. 1982. Fast Haar Transform Algorithms, IEEE Transactions on Computers, vol.31(2), pp. 175-177. SALEHI, S. A. & SADRI, S. 2009. Investigation of lifting-based hardware architectures for discrete wavelet transform, Circuits, Systems & Signal Processing, vol.28, pp. 1-16. SALOMON, D. 2004. Data compression: the complete reference, Springer. SANCHEZ, F., FAJARDO, C. A., ANGULO, C. A., REYES, O. M. & BOUMAN, C. A. 2014. A computational architecture for discrete wavelet transform using lifting scheme, IEEE Symposium on Image, Signal Processing and Artificial Vision (STSIVA),Armenia, pp. 1-4. SAZISH, A. N. & AMIRA, A. 2008. An efficient architecture for HWT using sparse matrix factorisation and DA principles, IEEE Asia Pacific Conference on Circuits and Systems APCCAS, pp. 1308-1311. SENHADJI, L., CARRAULT, G. & BELLANGER, J. J. 1994. Interictal EEG spike detection: a new framework based on wavelet transform, IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, Philadelphia, PA, pp. 548-551. SEO, Y.-H. & KIM, D.-W. 2007. VLSI architecture of line-based lifting wavelet transform for motion JPEG2000, IEEE Journal of Solid-State Circuits, vol.42, pp. 431-440. SHAHBAHRAMI, A., JUURLINK, B. & VASSILIADIS, S. 2008. Implementing the 2-D wavelet transform on SIMD-enhanced general-purpose processors, IEEE Transactions on Multimedia, vol.10, pp. 43-51. SHANNON, C. E. 2001. A mathematical theory of communication, ACM SIGMOBILE Mobile Computing and Communications Review, vol.5, pp. 355. SIFUZZAMAN, M., ISLAM, M. & ALI, M. 2009. Application of wavelet transform and its advantages compared to Fourier transform, Journal of Physical Sciences, vol.13, pp. 121-134. SILVA, S. V. & BAMPI, S. 2005. Area and throughput trade-offs in the design of pipelined discrete wavelet transform architectures, IEEE Proceedings Design, Automation and Test in Europe, vol.3, pp. 32-37. SODAGAR, I., LEE, H.-J., HATRACK, P. & ZHANG, Y.-Q. 1999. Scalable wavelet coding for synthetic/natural hybrid images, IEEE Transactions on Circuits and Systems for Video Technology, vol.9(2), pp. 244-254.


STOKSIK, M., LANE, R. & NGUYEN, D. 1994. Accurate synthesis of fractional Brownian motion using wavelets, Electronics Letters, vol.30, pp. 383-384. STRANG, G. & NGUYEN, T. 1996. Wavelets and filter banks, SIAM. STROMME, O. & MCGREGOR, D. R. 1997. Study of wavelet decompositions for image/video compression by software codecs, Sixth International Conference on Image Processing and Its Applications, vol.1, pp. 61-63. SUBBARAYAN, S. & KARTHICK RAMANATHAN, S. 2009. Effective watermarking of digital audio and image using Matlab technique, Second International Conference on Machine Vision ICMV '09,Dubai, pp. 317-319. SWELDENS, W. 1996. The lifting scheme: A custom-design construction of biorthogonal wavelets, Applied and computational harmonic analysis, vol.3, pp. 186-200. TALUKDER, K. H. & HARADA, K. 2010. Haar wavelet based approach for image compression and quality assessment of compressed image, International Journal of Applied Mathematics IJAM, vol.36(1), pp.1-9. TAN, K. & ARSLAN, T. 2002. An embedded extension algorithm for the lifting based discrete wavelet transform in JPEG2000, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, FL, USA, vol.4, pp. IV-3513-IV-3516. TAUBMAN, D. 2000. High performance scalable image compression with EBCOT, IEEE Transactions on Image Processing, vol.9(7), pp. 1158-1170. TAUBMAN, D. S., MARCELLIN, M. W. & RABBANI, M. 2002. JPEG2000: Image compression fundamentals, standards and practice, Journal of Electronic Imaging, vol.11, pp. 286-287. TESSIER, R. & BURLESON, W. 2001. Reconfigurable computing for digital signal processing: A survey, Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol.28(1-2), pp. 7-27. TIAN, X., WU, L., TAN, Y.-H. & TIAN, J.-W. 2011. Efficient multi-input/multioutput VLSI architecture for two-dimensional lifting-based discrete wavelet transform, IEEE transactions on computers, vol.60(8), pp. 1207-1211. TRAYLOR, R. 2001. Essential VHDL for ASICs, Version 0.1. UZUN, I. S. & AMIRA, A. 2004. Design and FPGA implementation of nonseparable 2-D biorthogonal wavelet transforms for image/video coding, IEEE International Conference on Image Processing ICIP'04, vol.4, pp. 28252828.


VAN DE WOUWER, G., SCHEUNDERS, P. & VAN DYCK, D. 1999. Statistical texture characterization from discrete wavelet representations, IEEE Transactions on Image Processing, vol.8(4), pp. 592-598. VAN DER SPIEGEL, J. 2006. VHDL tutorial. University of Pennsylvania, Department of Electrical and Systems Engineering. VETTERLI, M. & KOVAČEVIĆ, J. 1995. Wavelets and subband coding, Prentice Hall PTR, Englewood Cliffs, New Jersey. VETTERLI, M. & LE GALL, D. 1989. Perfect reconstruction FIR filter banks: Some properties and factorizations, IEEE Transactions on Acoustics, Speech and Signal Processing, vol.37(7), pp. 1057-1071. VISHWANATH, M. 1994. The recursive pyramid algorithm for the discrete wavelet transform, IEEE Transactions on Signal Processing, vol.42(3), pp. 673-676. VISHWANATH, M., OWENS, R. M. & IRWIN, M. J. 1995. VLSI architectures for the discrete wavelet transform, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol.42(5), pp. 305-316. VOELMLE, J. 2009. Investigation of Altera DE2 Development and Education Board, Florida Gulf Coast University. WALKER, J. S. 2008. A primer on wavelets and their scientific applications, CRC press. WEEKS, M. & BAYOUMI, M. 2003. Discrete wavelet transform: Architectures, design and performance issues, Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol.35(2), pp. 155-178. WEI, Z., ZHE, J., ZHIYU, G. & YANYAN, L. 2012. An Efficient VLSI Architecture for Lifting-Based Discrete Wavelet Transform, IEEE Transactions on Circuits and Systems II: Express Briefs, vol.59, pp. 158-162. WU, B.-F. & LIN, C.-F. 2005. A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec, IEEE Transactions on Circuits and Systems for Video Technology, vol.15(12), pp. 1615-1628. WU, H. & ABOUZEID, A. A. 2005. Energy efficient distributed image compression in resource-constrained multihop wireless networks, Computer Communications, vol.28(14), pp. 1658-1668. WU, P.-C. & CHEN, L.-G. 2001. An efficient architecture for two-dimensional discrete wavelet transform, IEEE Transactions on Circuits and Systems for Video Technology, vol.11(4), pp. 536-545.


XIONG, C.-Y., TIAN, J.-W. & LIU, J. 2005. Efficient parallel architecture for lifting-based two-dimensional discrete wavelet transform, IEEE International Workshop on VLSI Design and Video Technology, pp. 75-78. XIONG, C.-Y., TIAN, J.-W. & LIU, J. 2006a. Efficient high-speed/low-power linebased architecture for two-dimensional discrete wavelet transform using lifting scheme, IEEE Transactions on Circuits and Systems for Video Technology, vol.16(2), pp. 309-316. XIONG, C.-Y., TIAN, J.-W. & LIU, J. 2006b. A note on" Flipping structure: an efficient VLSI architecture for lifting-based discrete wavelet Transform", IEEE Transactions on Signal Processing, vol.54(5), pp. 1910-1916. XIONG, C., TIAN, J. & LIU, J. 2007. Efficient architectures for two-dimensional discrete wavelet transform using lifting scheme, IEEE Transactions on Image Processing, vol.16(3), pp. 607-614. YU, C. & CHEN, S.-J. 1997. VLSI implementation of 2-D discrete wavelet transform for real-time video signal processing, IEEE Transactions on Consumer Electronics, vol.43(4), pp. 1270-1279. YU, P., YAO, S. & XU, J. 2009. An efficient architecture for 2-D lifting-based discrete wavelet transform, 4th IEEE Conference on Industrial Electronics and Applications ICIEA,Xi'an, pp. 3667-3670. ZERVAS, N. D., ANAGNOSTOPOULOS, G. P., SPILIOTOPOULOS, V., ANDREOPOULOS, Y. & GOUTIS, C. E. 2001. Evaluation of design alternatives for the 2-D-discrete wavelet transform, IEEE Transactions on Circuits and Systems for Video Technology, vol.11(12), pp. 1246-1262. ZHANG, C., WANG, C. & AHMAD, M. O. 2012a. A pipeline VLSI architecture for fast computation of the 2-D discrete wavelet transform, IEEE Transactions on Circuits and Systems I: Regular Papers, vol.59(8), pp. 1775-1785. ZHANG, W., JIANG, Z., GAO, Z. & LIU, Y. 2012b. An efficient VLSI architecture for lifting-based discrete wavelet transform, IEEE Transactions on Circuits and Systems II: Express Briefs, vol.59(3), pp.158-162. ZHI-RONG, G. & CHENG-YI, X. 2005. An efficient line-based architecture for 2-D discrete wavelet transform, IEEE Proceedings. International Conference on Communications, Circuits and Systems,vol.2, pp. 1322- 1325.


APPENDIX A

Hardware Description Languages and VHDL Overview

VHDL stands for VHSIC (Very High Speed Integrated Circuits) Hardware Description Language. The requirements for the language were first generated in 1981 under the VHSIC program. In this program, a number of United States (U.S.) companies were involved in designing VHSIC chips for the Department of Defense (DoD). At that time, most of these companies were using different hardware description languages to describe and develop their integrated circuits (J. Bhasker, 1999). Thus, a need arose for a standardized hardware description language for the design, documentation, and verification of digital systems. In December 1987, the U.S. DoD and the Institute of Electrical and Electronics Engineers (IEEE) sponsored the development of this VHSIC hardware description language with the goal of supporting the development of very high-speed integrated circuits. This version of the language is now known as IEEE Standard 1076-1987, hence VHDL-87 (Van der Spiegel, 2006). Various problems with this first standard were analyzed by IEEE expert groups to give reasonable ways of interpreting its unclear portions. All IEEE standards are subject to review and further development every five years, which is where the VHDL-93 standard comes from.

It has now become one of the industry's standard languages used to describe digital systems. The other widely used hardware description language is Verilog HDL. Both are powerful languages that allow describing and simulating complex digital systems. A third HDL language is ABEL (Advanced Boolean Equation Language), which was specifically designed for Programmable Logic Devices (PLDs).

ABEL is less powerful than the other two languages and is less popular in industry (Van der Spiegel, 2006).

Although these languages look similar to conventional programming languages, there are some important differences. A hardware description language is inherently parallel: commands, which correspond to logic gates, are executed (computed) in parallel as soon as a new input arrives. An HDL program mimics the behavior of a physical, usually digital, system. It also allows the incorporation of timing specifications (gate delays) as well as the description of a system as an interconnection of different components (Van der Spiegel, 2006). A digital system can be represented at different levels of abstraction (Gajski and Kuhn, 1983), which keeps the description and design of complex systems manageable. VHDL includes facilities to describe structure and function at various levels of abstraction (above gate level). Figure 1 shows the different levels of abstraction (Van der Spiegel, 2006).
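As a small illustration of this inherent parallelism and of timing specification, the following sketch uses hypothetical signal names and an arbitrary 2 ns delay; both assignments are concurrent statements and are re-evaluated whenever any of their inputs changes, irrespective of the order in which they are written.

LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY tiny_gates IS
  PORT ( a, b, c : IN  std_logic;
         y, z    : OUT std_logic );
END tiny_gates;

ARCHITECTURE dataflow OF tiny_gates IS
BEGIN
  -- both statements execute concurrently; their textual order is irrelevant
  y <= a AND b AFTER 2 ns;     -- gate delay modelled for simulation only
  z <= (a AND b) OR c;         -- evaluated in parallel with the line above
END dataflow;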

Figure 1: Levels of abstraction: Behavioral, Structural and Physical (Van der Spiegel, 2006)

The highest level of abstraction is the behavioral level, which describes a system in terms of what it does (or how it behaves) rather than in terms of its components and the interconnections between them (J. Bhasker, 1999). A behavioral description specifies the relationship between the input and output signals. This could be a Boolean expression or a more abstract description such as the Register Transfer or Algorithmic level (Van der Spiegel, 2006). The structural level, on the other hand, describes a system as a collection of gates and components that are interconnected to perform a desired function. A structural description can be compared to a schematic of interconnected logic gates, and it is a representation that is usually closer to the physical realization of a system (J. Bhasker, 1999). At the most abstract level, the system may be described in terms of algorithms; this is often called "BEHAVIOURAL or FUNCTIONAL MODELING", as shown in Figure 2. Most designers work increasingly far away from the center of the diagram.

Figure 2: Domains and levels of abstraction: Behavioral or Functional, Structural and Physical or Geometric built on semiconductor wafers (Van der Spiegel, 2006)

Contrary to regular computer programs, which are sequential, VHDL statements are inherently concurrent (parallel). For that reason, VHDL is usually referred to as code rather than as a program. Figure 3 shows the VHDL code structure.

Figure 3: VHDL code structure

A standalone piece of VHDL code is composed of at least three fundamental sections:

1. LIBRARY declarations: Contains a list of all libraries to be used in the design. To declare a LIBRARY (that is, to make it visible to the design), two lines of code are needed, one containing the name of the LIBRARY and the other a USE clause (Pedroni, 2004):

LIBRARY library_name;
USE library_name.package_name.package_parts;


At least three packages, from three different libraries, are usually needed in a design:
- IEEE.std_logic_1164 (from the IEEE library),
- Standard (from the STD library), and
- Work (the work library) (Pedroni, 2004).

2. ENTITY: Specifies the I/O pins of the circuit. An ENTITY is a list with specifications of all input and output pins (PORTS) of the circuit:

ENTITY entity_name IS
  PORT ( port_name1 : signal_mode signal_type;
         port_name2 : signal_mode signal_type;
         ... );
END entity_name;

The name of the entity can be basically any name, except VHDL reserved words (Pedroni, 2004). All names should start with an alphabetic character (a-z or A-Z). Digits (0-9) and the underscore (_) can be used in the name. Any punctuation or reserved character (!, ?, ., &, +, -, etc.), any reserved word (AND, OR, entity, ...), or two or more consecutive underscores ( __ ) within a name is invalid. The mode of a signal can be IN, OUT or INOUT: IN and OUT are truly unidirectional pins, while INOUT is bidirectional (Pedroni, 2004). Among the different signal types that VHDL offers, only a few are synthesizable, such as std_logic, std_logic_vector and integer, where std_logic_vector represents a vector of single std_logic bits.

3. ARCHITECTURE: Contains the VHDL code proper, which describes how the circuit should behave (circuit functionality). Its syntax is the following:

ARCHITECTURE architecture_name OF entity_name IS
  [declarations]
BEGIN
  (code)
END architecture_name;

As in the case of an entity, the name of the architecture can be basically any name (except VHDL reserved characters or words) (Pedroni, 2004). As shown above, an architecture has two parts: a declarative part (optional), where signals and constants (among others) are always declared in the architecture before the BEGIN statement, and the code part (from BEGIN down). Signals are declared exactly like ports in the entity, except that signals have no direction. A complete minimal example combining these three sections is sketched below.
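Putting the three sections together, the following is a minimal illustrative sketch of a complete VHDL description of a 2-to-1 multiplexer; the entity, architecture and port names are chosen for illustration and are not taken from the thesis design.

LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

-- ENTITY: declares the I/O ports of the circuit
ENTITY mux2to1 IS
  PORT ( a, b : IN  std_logic;     -- data inputs
         sel  : IN  std_logic;     -- select input
         y    : OUT std_logic );   -- output
END mux2to1;

-- ARCHITECTURE: describes the circuit functionality
ARCHITECTURE dataflow OF mux2to1 IS
BEGIN
  y <= a WHEN sel = '0' ELSE b;    -- concurrent conditional signal assignment
END dataflow;

The single conditional assignment is itself a concurrent (dataflow-style) statement, so no process is needed for this simple combinational circuit.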

A digital system in VHDL consists of a design entity that can contain other entities that are then considered components of the top-level entity. Each entity is modeled by an entity declaration and an architecture body (J. Bhasker, 1999). One can consider the entity declaration as the interface to the outside world that defines the input and output signals, while the architecture body contains the description of the entity and is composed of interconnected entities, processes and components, all operating concurrently, as schematically shown in Figure 4. In a typical design there will be many such entities connected together to perform the desired function (Van der Spiegel, 2006).

Figure 4: A VHDL entity consisting of an interface (entity declaration) and a body (architectural description) (Van der Spiegel, 2006)
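This hierarchical composition can be illustrated with a short structural sketch in which a hypothetical top-level entity instantiates the mux2to1 entity from the previous sketch twice as a component; the instance and signal names (top, u1, u2, m) are illustrative only.

LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY top IS
  PORT ( x0, x1, x2 : IN  std_logic;
         s0, s1     : IN  std_logic;
         q          : OUT std_logic );
END top;

ARCHITECTURE structural OF top IS
  -- component declaration: the interface of the lower-level entity
  COMPONENT mux2to1
    PORT ( a, b : IN std_logic; sel : IN std_logic; y : OUT std_logic );
  END COMPONENT;
  SIGNAL m : std_logic;            -- internal wire between the two instances
BEGIN
  -- two instances of the same entity, connected through signal m
  u1 : mux2to1 PORT MAP (a => x0, b => x1, sel => s0, y => m);
  u2 : mux2to1 PORT MAP (a => m,  b => x2, sel => s1, y => q);
END structural;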

The process of checking the syntax and semantics of the code is called analysis. Going through the design hierarchy and creating all of the objects defined in the declarations is known as elaboration. The passage of time is then simulated in discrete steps, as shown in Figure 5.

Figure 5: VHDL basic concepts

VHDL allows describing a digital system at the structural or the behavioral level. The behavioral level can be further divided into two kinds of styles: dataflow and algorithmic (Van der Spiegel, 2006). The dataflow representation describes how data move through the system, typically in terms of data flow between registers (Register Transfer Level, RTL). The dataflow model makes use of concurrent statements, which are executed in parallel as soon as data arrive at the input (J. Bhasker, 1999). Sequential statements, on the other hand, are executed in the sequence in which they are specified (Van der Spiegel, 2006). The internal VHDL structure, as shown in Figure 6, can be specified in any of the following styles:

Figure 6: VHDL design styles
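To make the distinction between the styles concrete, the following short sketch (with illustrative names only, not taken from the thesis code) drives one output with a dataflow-style concurrent assignment and the other from a process whose statements execute sequentially.

LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY style_demo IS
  PORT ( a, b : IN  std_logic;
         y1   : OUT std_logic;    -- driven by a dataflow (concurrent) statement
         y2   : OUT std_logic );  -- driven by an algorithmic (sequential) process
END style_demo;

ARCHITECTURE mixed OF style_demo IS
BEGIN
  -- dataflow style: a concurrent signal assignment, re-evaluated whenever a or b changes
  y1 <= a XOR b;

  -- algorithmic/behavioral style: statements inside a process run sequentially
  PROCESS (a, b)
  BEGIN
    IF a = b THEN
      y2 <= '0';
    ELSE
      y2 <= '1';
    END IF;
  END PROCESS;
END mixed;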


• Dataflow Design

To simulate a circuit, it is necessary to use some kind of intermediate signal to connect the output of a logic gate or simple combinational circuit with the inputs of the other gates. Several concurrent statements allow VHDL to describe a circuit in terms of the flow of data and of the operations performed on it. The order of statements inside the architecture does not affect the meaning of the code (since it represents a wiring). This style is called a dataflow description or dataflow design (Pedroni, 2004). The dataflow style is used to code simple combinational circuits in which the output of the circuit depends only on the current inputs. The dataflow description is written with concurrent signal assignments, using the symbol (<=).

  IF INPUT_FILENAME'LENGTH > 0 THEN
    FillMemory(mem, INPUT_FILENAME);     -- read pixels from input file
  END IF;
END IF;

-- dump memory to file
IF dump'event AND dump = '1' THEN
  IF OUTPUT_FILENAME'LENGTH > 0 THEN
    DumpMemory(mem, OUTPUT_FILENAME);    -- write coefficients to file
  END IF;
END IF;

-- Read or write process
-- This memory includes a process statement that is activated on every positive
-- edge of the clk control signal (the memory must be clocked)
IF clk'event AND clk = '1' THEN          -- synchronous clock design items
  IF enable = '1' THEN
    IF rw = '1' THEN
      -- Read memory process (odd element in the data set):
      -- if rw = '1' and enable = '1', read data from the current memory
      -- address when the current state is "TS_READY"
      CASE ctrl_sig IS
        WHEN "000" =>
          source
          source step
          destination hi
          destination lo
          dst_step_lo
          dst_step_hi
          elem_count
      END CASE;

state