EFFICIENT FRACTAL IMAGE ENCODING USING DISTRIBUTED ARCHITECTURE AND SIMD APPROACH

ABSTRACT Today there are many application areas that demand tremendous computational resources. Areas such as image processing, big data and genetic mapping are computationally intensive and require vast amounts of computing resources and time; solving such complex problems therefore calls for a powerful computing environment. This research focuses on fractal image compression. Fractal image compression reduces affine redundancy: under suitable affine transformations, fewer bits are needed to encode the same image. The encoding process, however, requires enormous computation to generate the required fractal codes. To reduce the computational time, a distributed parallel method is proposed that follows the SIMD (Single Instruction Multiple Data) approach on a distributed architecture. The research compares performance in terms of speed-up and encoding time. A computing node may fail during execution, and a failed node introduces overhead into the system; a dynamic load balancing algorithm is therefore proposed to maximize system performance. Various problems are encountered while designing a distributed architecture and a parallel algorithm, chiefly finding a good partitioning scheme and balancing the load while tasks execute. The major cost of encoding is the huge number of range-domain comparisons it requires. In this research the comparison task is distributed over multiple computing nodes, which reduces the comparison workload per node. There are, however, limits on task distribution: by Amdahl's law the speed-up of the system increases only up to a certain number of computing nodes and then decreases, because of the sequential part that cannot be parallelized. In this research a threshold number of computing nodes is therefore selected for the experiments.

CHAPTER 1 INTRODUCTION This chapter describes how fractal image compression works and why it is needed. It details the historical background of fractal geometry and how fractal geometry can be used to encode images. Research in the field of fractal image compression is worthwhile, as it carries a vision that may lead to new encoding methodologies in the future. This chapter also explains why we have chosen fractal image compression as the research area of interest and details the problem statement.

1.1 Introduction Representing an image in digital form requires large storage space to render the entire image faithfully with respect to the HVS (Human Visual System). Huge amounts of pictorial data require more space and longer transmission time; this problem can be addressed by compression. Pictorial data, that is, data that represent information in the form of pictures, are used for various scientific purposes; processing satellite images, for example, requires large computing resources, and one of the major processing tasks is to compress the image. Mathematicians have proposed methods that can compress images efficiently, such as DCT-based compression and JPEG (Joint Photographic Experts Group) compression, among others; in practice JPEG is used all over the globe for image compression. A further technique, fractal image compression, exploits the self-similarity in an image to compress it. Fractal image compression was first developed by M. Barnsley, who founded a company based on it [1]. A. Jacquin, a student of M. Barnsley, developed an automated fractal image encoding method and published it [2]. Many improvements have been made since his work, yet the encoding procedure is still not as fast as other encoding schemes; it requires complex computation that takes long execution time. Our interest is in reducing this execution time by using distributed computing.

1.2 Background & Motivation Data on the internet is growing rapidly, which increases the cost of storing it as well. Image compression techniques are needed because researchers have to deal with huge image databases.

A compressed image requires fewer bits for transmission. Several lossy image compression methods have been proposed that exploit redundancy and remove unnecessary information from an image. The most popular and efficient method for image compression today is the JPEG standard, which relies on the DCT (Discrete Cosine Transform). All images have certain geometry through which we can compress them. Fractal image compression, which evolved from fractal geometry, exploits this geometry to compress an image, and our objective is to propose a method to compress these types of geometry. Mandelbrot was the first to explore this field of mathematics in depth, in the 1960s. Later Barnsley [1] presented the collage theorem, which characterizes an IFS (Iterated Function System). An IFS is a set of contractive affine transformations on a metric space; the fixed point, or attractor, of an IFS is a fractal image. Compression consists of finding the IFS corresponding to the coded image. The data necessary to represent the IFS transformations are normally considerably less than the data in the original image; very high compression ratios, from 30 to 10000, can be obtained. Barnsley suggested that by using repeated iterations of affine transformations of the plane, one could reproduce a fractal-like image by storing the image as a collection of transformations rather than an accumulation of pixels.
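For reference, the standard formulation of a contractive planar affine map and the IFS attractor can be summarized as below. This is a minimal sketch using generic coefficient names, not notation taken from a particular source:

```latex
% A planar affine map w_i and its contractivity condition (s_i < 1):
w_i\!\begin{pmatrix} x \\ y \end{pmatrix}
  = \begin{pmatrix} a_i & b_i \\ c_i & d_i \end{pmatrix}
    \begin{pmatrix} x \\ y \end{pmatrix}
  + \begin{pmatrix} e_i \\ f_i \end{pmatrix},
\qquad
d\bigl(w_i(u), w_i(v)\bigr) \le s_i\, d(u, v), \quad 0 \le s_i < 1.

% The attractor A is the unique set left unchanged by the union of the
% maps; Banach's fixed-point theorem guarantees existence and uniqueness:
A = \bigcup_{i=1}^{n} w_i(A).
```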

Figure 1.1: Self similarities in Lena image

An image contains certain self-similarities, as can be clearly seen in the Lena image of Figure 1.1, and these similarities occur at different scales. A portion of the shoulder overlaps a smaller region that is almost identical, and a portion of the hat in the mirror is, after certain transformations, similar to a smaller part of her hat. The transformed copy is not an exact replica of the image, so we must allow some error in the digital representation

of an image. The fractal encoded image is not an identical copy of the original image but an approximation of it. Arnaud Jacquin [2] developed a method called Partitioned Iterated Function Systems (PIFS). A PIFS processes subsets of the image rather than the entire image. Jacquin suggested that the image be partitioned, using a suitable partitioning technique, into two kinds of subsets: the first are called range blocks and the second domain blocks. The encoder then seeks the best matching domain block for each range block, which requires high computational resources. Various methods have been proposed that exploit the redundancy in an image and produce an encoded image requiring fewer bits than the original, but these methods yield a very low compression ratio. To enhance the compression ratio, fractal image encoding exploits the redundancy within the image; the fractal encoding process is computationally intensive, but the decoding phase is fast and the image can be decoded at any level of resolution. Many parallel algorithms and architectures have been proposed to speed up fractal image encoding; the papers that most closely resemble a parallel implementation of fractal image compression are those of Min, Palazzari and Cao [4, 8, 11], who developed parallel architectures and proposed algorithms to minimize communication overhead and processing time. The most computationally intensive operation of fractal image compression is the computation of distances between range blocks and transformed domain blocks; matching every domain with every range involves a huge number of blocks, and if all comparisons can be made in less time, the image can be compressed efficiently. These authors used parallel computing to perform block matching and proposed parallel architectures and algorithms to improve fractal image encoding time. Fractal image compression has the following advantages over other image compression techniques:

• High Compression Ratio: It provides a higher compression ratio than JPEG.

• Codebook: As it uses an IFS (Iterated Function System), no separate codebook is needed.

• Resolution Independence: A fractal encoded image is the result of recursive transformations, so the image can be decoded at any resolution.

• Interoperability: It can easily be combined with other techniques.

1.3 Fractal Geometry and Fractal Image Compression The application of fractal geometry in multimedia can be very useful; initially this was done using IFS (Iterated Function Systems) and Hutchinson's theorem [1]. An IFS is a mathematical construct that describes complex geometrical sets and pictures, and it is used in computer graphics to compress images and data. Digital compression is the process of representing information using fewer bits. Fractal image compression relies on the fact that all images contain affine redundancy; that is, under suitable affine transformations, fewer bits are required to encode the same image. The choice of compression algorithm involves several parameters, including the degree of compression, the speed of operation, and the size of the compressed file versus the quality of the decompressed image.

Figure 1.2: Mandelbrot set

The French mathematician Benoit B. Mandelbrot first coined the term fractal in 1975. He derived the word from the Latin fractus, which means "broken", or "irregular and fragmented". In fact, the birth of fractal geometry is commonly traced to Mandelbrot and the 1977 publication of his seminal book The Fractal Geometry of Nature. Mandelbrot developed the geometry of the Mandelbrot set, which can be seen in Figure 1.2. He stated that a fractal is generally a rough or fragmented geometric shape that can be split into parts, each of which is a reduced-size copy of the whole [1]. Another construction of similar form is the Iterated Function System, or IFS [1], introduced by John Hutchinson in 1981 to describe fractals. Fractals are found everywhere in nature. An iterated function system is a collection of affine transformations that map a plane to itself. Figure 1.3 shows several iterations of this process producing a tree fern.

Figure 1.3: A Fern created using IFS

Barnsley suggested that an IFS can be used to encode digital images. The IFS can be easily understood with the help of a photocopy machine that reduces the image and makes three copies of it, again and again, as can be seen in Figure 1.4. The smiley is fed to the photocopy machine, which copies the original image and generates three reduced copies of it; this process is run iteratively. All the copies appear to converge to the same final image. Because the copying process reduces the input image, the copies of any initial image will be contracted to a point as we repeatedly feed the output back as input; there will be more and more copies, but each copy gets smaller and smaller. So the initial image does not affect the final image; in fact, it is only the position and the orientation of the copies that determine what the final image will look like.

Figure 1.4: IFS process to generate the image
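This copy-machine iteration is easy to simulate. The following is a minimal sketch, assuming three maps that each halve the input and place the copy at a different offset; the map coefficients are illustrative, not taken from the thesis. Whatever point set the iteration starts from, the points converge towards the same attractor:

```python
# "Photocopy machine" IFS: three contractive maps, each shrinking the
# input by 1/2 and placing a copy at a different position. The attractor
# depends only on the maps, never on the initial image.
import numpy as np

MAPS = [
    (np.eye(2) * 0.5, np.array([0.0, 0.0])),    # copy 1: bottom left
    (np.eye(2) * 0.5, np.array([0.5, 0.0])),    # copy 2: bottom right
    (np.eye(2) * 0.5, np.array([0.25, 0.5])),   # copy 3: top middle
]

def copy_machine(points: np.ndarray) -> np.ndarray:
    """One pass of the copier: the union of all contracted copies."""
    return np.vstack([points @ A.T + b for A, b in MAPS])

points = np.random.rand(100, 2)      # an arbitrary "initial image"
for _ in range(8):                   # 8 passes: 3**8 ever-smaller copies
    points = copy_machine(points)
# `points` now lies close to the attractor, regardless of the start set
```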

1.4 Problem Statement Fractal image compression can be defined over an image I as finding a contractive mapping X_i with fixed point X_i(I) = I that represents the image I in a smaller space. For this, I is divided into two kinds of regions, ranges R_I and domains D_I with R_I, D_I ⊆ I, such that, after certain affine transformations of the subsets D_I, the contractive mapping X_i between R_I and D_I yields a smaller representation than I, so the original image can be represented using fewer bits. The mapping function takes a long time to execute; a parallel approach can address this, because the mapping of the subsets is an independent process and can be parallelized over the range and domain blocks. To parallelize fractal compression it is necessary to identify the sequential and parallel parts; feeding the parallel part to a parallel machine can yield compression with less encoding time. Parallel fractal image encoding has certain problems, identified as follows: Problem 1: Given a finite image space I, we need to find a partition function p(n) that divides I equally for processing by the parallel machine, and we need to identify the individual parts that do not depend on each other. Figure 1.5 shows various partition schemes that divide the original image into subsets; these subsets should be of equal size, otherwise their distribution onto multiple CNs will be uneven, which can lead to inconsistency. If we do not choose an appropriate partition scheme, it affects the whole execution time. Consider four CNs, each performing image compression: the master divides the image into four parts and sends them to the four CNs in the form of tasks.

Figure 1.5: Various partition schemes (original image; uneven partition; even partition)

When the nodes receive their tasks and the tasks are not equal for all nodes, a node with a smaller task will finish in less time while a node with a larger task will take longer to execute. To avoid this we must find an appropriate partition scheme, as sketched below.
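As a concrete illustration of an even partition p(n), the following toy sketch splits an image into equal square tiles, one task per tile; the tile size is a free parameter and the function name is ours:

```python
# Even partition: every task is a tile of identical size, so no CN
# receives more pixels than any other.
import numpy as np

def partition(image: np.ndarray, tile: int):
    h, w = image.shape
    assert h % tile == 0 and w % tile == 0, "pad the image first"
    return [image[r:r + tile, c:c + tile]
            for r in range(0, h, tile) for c in range(0, w, tile)]

tasks = partition(np.zeros((512, 512), dtype=np.uint8), 128)  # 16 equal tasks
```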

Problem 2: Given the subsets, geometrically transform the domain blocks D_I of the image over the range blocks R_I, which are smaller than the domain blocks; apply the contractive mapping X_i : D_I → R_I for each domain block, measure the minimum difference against the corresponding range block at each position, and record those transformations until the desired similarity is reached, as shown in Figure 1.6. The contractive mapping ensures that the IFS (Iterated Function System) [1] converges to its fixed point; the process of recording those transformations is known as fractal image encoding, and finding the attractor is computationally intensive. The image of the smiley is first partitioned into two kinds of blocks: the first are called range blocks and the second domain blocks. As shown in Figure 1.6, the image is partitioned into 8x8 blocks, which are further partitioned using the partition function p(n) into 4x4 domain blocks and 2x2 range blocks. The mapping problem is to find which range block best matches each domain block in shape. As the sizes of domain and range blocks differ, they cannot be compared directly: first a distance d_i between the blocks must be computed, and then the contractive mapping theorem is applied to determine the best matched blocks. Once matched blocks are obtained, only the transformations of the image blocks are stored. The contractive mapping theorem and the distance calculation between range-domain blocks are computationally complex procedures that need tremendous computation time. The result of the mapping requires only a few bits to be stored; these bits record only the transformations of the matched domain blocks.

Figure 1.6: The mapping procedure (domain blocks are mapped onto range blocks by the contractive mapping X_i, with d_i the range-domain distance)
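The inner loop of this matching can be sketched as follows, assuming grayscale blocks and the usual least-squares fit of a contrast factor s and brightness offset o; this is a minimal sketch of the standard technique, with block sizes and names chosen for illustration:

```python
# Core range-domain comparison: contract a domain block to range size,
# fit R ≈ s*D + o by least squares, and use the residual as d_i.
# This loop, repeated over all range-domain pairs, dominates encoding time.
import numpy as np

def contract(domain: np.ndarray) -> np.ndarray:
    """Shrink a 2n x 2n domain block to n x n by 2x2 averaging."""
    return 0.25 * (domain[0::2, 0::2] + domain[1::2, 0::2]
                   + domain[0::2, 1::2] + domain[1::2, 1::2])

def match(range_blk: np.ndarray, domain_blk: np.ndarray):
    d = contract(domain_blk).ravel()
    r = range_blk.ravel()
    n = d.size
    denom = n * (d @ d) - d.sum() ** 2
    s = (n * (d @ r) - d.sum() * r.sum()) / denom if denom else 0.0
    o = (r.sum() - s * d.sum()) / n
    err = float(((s * d + o - r) ** 2).sum())   # squared distance d_i
    return err, s, o
```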

Problem 3: Given n tasks in total for compressing an image, find a mechanism that dynamically balances the load among multiple CNs (Computing Nodes). As shown in Figure 1.7, tasks of unequal size are distributed to three CNs, but the allocation is not equal across all CNs.

Figure 1.7: Unequal load distribution across three CNs (P1 receives two tasks; P2 and P3 receive one task each)

P1 has two tasks, while P2 and P3 each have one. It can be clearly seen that after completion of Task 2 and Task 4, the CNs P2 and P3 will remain idle, which leads to a load imbalance problem. Load imbalance leads to higher execution time and hence lower speed-up. Situations also arise where all nodes receive equal tasks but a node later fails; the task of that node is then never computed, and the consolidated result is incomplete.
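A pull-based scheduler avoids this: an idle worker fetches the next task as soon as it finishes, and a failed worker's task can simply be re-queued. The following is a minimal single-machine sketch of the idea, with placeholder task payloads and pool size:

```python
# Dynamic load balancing via a shared task queue (multiprocessing.Pool).
from multiprocessing import Pool

def encode_task(block):
    # stand-in for the range-domain search on one image tile
    return sum(block)

if __name__ == "__main__":
    tasks = [list(range(i, i + 64)) for i in range(64)]  # 64 tiles as tasks
    with Pool(processes=4) as pool:                      # 4 "computing nodes"
        # chunksize=1 gives pull-based scheduling: a worker takes the next
        # task only when it finishes its current one, so no node sits idle
        # while another is overloaded; a failed task could be re-submitted.
        results = pool.map(encode_task, tasks, chunksize=1)
```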

1.5 Organization of the Thesis This thesis is organized as follows. Chapter 1 presents the background and purpose of the research and defines the problems encountered during the research work. In Chapter 2 we review research carried out by many researchers and discuss its scope and limitations; the review covers encoding on sequential, parallel and distributed machines. Chapter 3 gives an overview of fractal image compression and how it can be implemented for real images. Chapter 4 describes parallel and distributed computing and then sets out the challenges encountered in implementing parallel fractal image compression. In Chapter 5 a distributed parallel architecture and an encoding algorithm that may reduce the domain search and processing are proposed. In Chapter 6 simulated results validate the distributed parallel architecture and its encoding algorithm. Chapter 7 concludes the research and shows how it can be efficiently extended in future.

1.6 Summary In this chapter the basics of fractal image compression were explained, along with how and why fractal image compression is needed to fulfill today's need for new encoding schemes. Fractal image compression was studied, and the various problems that arise when encoding an image using parallel computing were identified.

CHAPTER 2 LITERATURE REVIEW

This chapter reviews basic and popular image compression methods. First, basic image compression methods such as Huffman coding are reviewed. Then, as the need for better compression methods grew, researchers contributed to the field of image compression by introducing fractal image compression. Finally, some of the prominent research in parallel and distributed fractal image compression is reviewed.

2.1 Introduction The need for high quality images of smaller size motivated the development of techniques through which this requirement can be fulfilled; the digital world as it exists would not be possible without compression. Initially the need for compression started with text: how to compress text? Techniques were developed to reduce the number of bits. Later, problems appeared when the compression techniques developed for text were applied to images, and the solution came with the concept of lossy compression. The purpose of image compression is to reduce the number of bits required to represent an image. All bits can be retained after compression, or some bits can be removed in accordance with the human vision system. On the basis of retaining or removing bits, image compression schemes are divided into two categories: lossless and lossy. In lossless compression the reconstructed image is retained pixel by pixel. In lossy compression the reconstructed image is presented with some loss of bits, but this loss is normally not noticeable to human vision. We have JPEG and other efficient techniques, but they still have problems, which are discussed in the literature reviewed below. One technique being developed and refined as a future compression standard is fractal image compression. Researchers have made efforts to utilize this technique for real-time applications, but the major problem with fractal image compression is its tremendous encoding time. Once this hurdle is crossed, we may soon see this compression technology in common use.

2.2 Related Research on Image Compression Model Many techniques that reduce the complexity of image compression through their compression models are reviewed in this chapter. The Huffman coding algorithm was created in the early 1950s by David Huffman, who developed an efficient compression algorithm for sequences of text in which the sequences that occur more frequently receive shorter code words. The algorithm was later applied to binary images: for an image to be compressed, a Huffman code is generated for the set of values that any pixel may take. After assigning code words, a reduction of about 1/2 to 1 bit per pixel is obtained. Huffman coding is inefficient when the probability distribution of the letters is not skewed [2]. Among the fast and efficient coding schemes are the transform coding techniques, which include Fourier-related transforms such as the DCT and wavelet transforms such as the DWT. Lossless compression approaches include variable-length encoding, adaptive dictionary algorithms such as LZW, bit-plane coding, and lossless predictive coding; approaches for lossy compression include lossy predictive coding and transform coding. Transform coding, which applies a Fourier-related transform such as the DCT or a wavelet transform such as the DWT, is the most commonly used approach [2]. Over the past few years a variety of powerful and sophisticated fractal image compression schemes have been developed and implemented; the iterated function system provides better quality in the images. Entropy coding achieves compression by using the statistical properties of the signals and is, in theory, lossless. In the 1980s a new compression method named JPEG (Joint Photographic Experts Group) was presented. The JPEG model is a lossy image compression model in which a subset of the image, say a block of 8x8 pixels, is selected. There are basically four steps in the JPEG algorithm. First, the MxN image is broken into local blocks, most commonly 8x8 blocks. Second, these blocks are transformed using the cosine transform in order to identify the high frequency components; the cosine transform exchanges raw spatial information for information on frequency content. Third, a quantizing method, or "rounding" procedure, is applied to the transformed coefficients; the high frequencies are usually reduced, since the human eye is insensitive to them. The fourth step is encoding the output of the quantizing step; the JPEG standard uses Huffman and arithmetic coding. JPEG compression typically reaches compression ratios of 20:1 or more [2].
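To make the frequency-to-codeword-length relation concrete, here is a minimal Huffman construction sketch; it is illustrative only (a real coder must also transmit the code table):

```python
# Build Huffman codes: the two least-frequent subtrees merge repeatedly,
# so frequent symbols stay near the root and get shorter codewords.
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    # heap entries: [total_weight, tiebreak_id, {symbol: code_so_far}]
    heap = [[w, i, {s: ""}] for i, (s, w) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
        next_id += 1
    return heap[0][2]

print(huffman_codes(b"aaaabbc"))   # 'a' gets 1 bit, 'b' and 'c' get 2 bits
```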

2.3 Related Research on Fractal Image Compression Barnsley [1] showed that fractal theory could be applied to image compression. Fractal image compression was achieved by Jacquin [4] by comparing small image blocks, called range blocks, against larger image blocks, called domain blocks, after applying transformations to the domain blocks. Each transformed domain block is matched against the range blocks, and the best matched blocks are retained for image reconstruction. Since the image is divided into many range and domain blocks, the number of block comparisons grows very large; to reduce it, Jacquin used perceptual geometric features, classifying blocks into shade blocks, edge blocks, and midrange blocks. Test results were obtained using a 256x256 Lena image with 16x16 domain blocks; a PSNR of 33.1 dB and a compression ratio of 18.3 were achieved. Fisher et al. [5, 7] introduced an adaptive method to enhance fractal image encoding, using quadtree, triangular and rectangular partition schemes for the range blocks to improve image fidelity. In this method a square domain/range block is subdivided into four quadrants: upper left, upper right, lower left and lower right. The quadrants are ordered by their average pixel intensities; under the square's isometries this ordering can always be brought into one of three canonical arrangements, giving three major classes. There are 24 different possible orderings of the variances, which define 24 subclasses for every major class; in this way the domain and range blocks are represented in 72 classes. In the coding process a range block is compared only with the domain blocks that belong to the same category. With this scheme the compression-versus-fidelity trade-off of the resultant image improves. Hurtgen et al. [7] proposed a fast hierarchical search method to improve fractal image encoding. First, the average intensities of the four quadrants of a block are calculated and compared with the average intensity of the overall block. Each quadrant is assigned a bit, which is '1' if its mean is higher than the overall mean, and '0' if it is lower or equal. In this way every block is represented by four bits, which can be arranged in 16 possible ways; since the combination containing all 1's is always empty, the blocks are divided into 15 major classes. In addition, there are 24 subclasses of each major class according to the ordering of the variances, as in Fisher's method, so all blocks are classified into 360 classes. Saupe et al. [8] accelerated fractal image compression by a multidimensional nearest neighbor search method. They showed that fractal image compression is equivalent to multidimensional nearest neighbor search: finding optimal domain-range pairs is equivalent to solving nearest neighbor problems in a suitable Euclidean space, where the data points are the feature vectors of the domains and the query point is the feature vector of the range. Multidimensional nearest neighbor searching is well studied, and data structures and algorithms for it operate in logarithmic time. This approach provides an acceleration factor from 1.3 up to 11.5, depending on the image and the domain pool size, with negligible or minor degradation in both image quality and compression ratio. Ghosh et al. [10] proposed a new technique in which, given the fractal code of a reference image, one can generate a relative fractal code of any other image of the same size. This relative fractal code, combined with the code of the reference image, produces the complete fractal code of the target image. In the relative code the same range-domain mappings of the reference image are used; only the transformation of the brightness values is changed, if required. This relative coding is useful for satellite remote sensing image
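The quadrant-mean classification described above is simple to express in code. Below is a minimal sketch of the 4-bit class index; the function name is ours, and note that the all-ones pattern cannot occur because the block mean is the average of the four quadrant means:

```python
# Hurtgen-style block classification: compare each quadrant's mean to the
# block mean and pack the comparison bits into a class index, so that
# range blocks are only compared with same-class domain blocks.
import numpy as np

def hurtgen_class(block: np.ndarray) -> int:
    h, w = block.shape
    quads = (block[:h//2, :w//2], block[:h//2, w//2:],
             block[h//2:, :w//2], block[h//2:, w//2:])
    overall = block.mean()
    bits = [1 if q.mean() > overall else 0 for q in quads]
    # 4-bit index in 0..14; 0b1111 is impossible, leaving 15 major classes
    return (bits[0] << 3) | (bits[1] << 2) | (bits[2] << 1) | bits[3]
```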

compression. In satellite images the spectral components are strongly correlated, so one of them can play the role of the reference image; the other two components are then relatively coded with respect to the code of the reference one, and the selection of the component used as the reference image is important. The experiment was performed on a Pentium III (866 MHz, 256 MB RAM) system under LINUX. Two satellite images of size 512x512 were selected, and the results show improved performance in the encoded image. In this technique the convergence of the decoder is resolved by computing the limit cycle points and transferring those points while coding the spectral bands with reference to each other. Yung et al. [11] decreased the fractal encoding time by using mean and variance to classify image blocks, combined with a transformation reduction technique. To classify the domain blocks, the variance of the difference between range and domain blocks is compared to a threshold; domain blocks below the threshold are considered and classified into one of three classes. The experiment was performed on a Celeron 2 GHz PC with 512 MB RAM. The evaluated result is 480 times faster than conventional fractal encoding, and the image quality is better than that of conventional approaches. Wu et al. [12] presented a fast fractal encoding method based on an intelligent search of the Standard Deviation (STD) difference between range and domain blocks. The method enhances the STD search algorithm by introducing an intelligent domain classification algorithm based on STD-classified domain blocks. In the STD search method the difference between the standard deviations of the range and domain blocks is calculated, and on the basis of a threshold value domain blocks are either selected for matching or discarded; the major problem in the STD search algorithm is selecting an appropriate threshold value. The authors developed a method called ISA (intelligent search algorithm) and used ICA (intelligent classification algorithm) to classify the domain blocks. The ICA method groups all domains that have similar STD values and constructs a location index vector consisting of positional information for all domain blocks in all defined groups. Based on the range block STD, the algorithm searches only those domain blocks that have a similar STD and are therefore logically grouped together. The experiment was performed using 256x256 images with a fixed 4x4 range block size and shows that with the ISA method all encoded images take fewer bits. The results show less encoding time with the lowest PSNR; the approximate encoding time for all the images taken was found to be less than 10 seconds, which is approximately 10% of the search time used in the traditional STD method [13]. Furao et al. [13] proposed a fast no-search method to encode images in the fractal image compression paradigm. The algorithm first chooses a tolerance level and marks all range blocks uncovered. While uncovered range blocks remain, the corresponding domain block is taken from the original image according to the position of the range block; if the error is less than the defined tolerance level, that range block is marked covered, and if the error is larger, the range block is partitioned into smaller blocks that are marked uncovered, and the same operation is performed on them. To partition the range blocks, quadtree partitioning is used.
The no-search strategy eliminates the search of the domain pool: a group of four range blocks shares their union as a common domain block. In that case encoding and decoding both have linear cost in the image size, but the approximation quality is poorer. The experiment was performed using a 512x512 Lena image on a Pentium IV 2.8 GHz machine; the algorithm encodes the image in 0.078 seconds and the resultant image has a PSNR of 34.04 dB. The result was compared with the Tong and Wong algorithm, found to be efficient, and is suitable for real-time applications. Wang et al. [14] proposed a novel image compression scheme with block classification and sorting based on Pearson's correlation coefficient. The method uses the fact that the affine similarity between two blocks in fractal image compression is equivalent to the absolute value of Pearson's correlation coefficient (APCC). All range and domain blocks are classified using this APCC-based scheme, which increases the matching probability. The domain blocks in each class are sorted by their APCC with a preset block, and matching domain blocks are searched only in the class whose APCCs are closest. The experiment was performed with a 512x512 Lena image with domain block sizes of 8x8 and 4x4; the method encodes the image in 2.1 seconds using 8x8 domain blocks and in 8.7 seconds using 4x4 blocks. Tong et al. [15] worked on an adaptive search method to speed up fractal image encoding. To reduce the number of range-domain comparisons the authors proposed a new bit allocation scheme. They found that the scaling parameter is strongly correlated with the luminance offset, and the coupling effect this generates makes it difficult to find an optimal bit allocation for scaling and luminance offset. The method adopts a better classification scheme for domain blocks in which the standard deviations of the range and domain blocks are computed and the difference between them is measured; if the difference exceeds a defined threshold, the domain block is discarded. Initially a tolerance level describing the accepted error is set for the Lena image; lowering the tolerance level leads to higher speed-up, since the search for the best matching domain blocks depends on the quantized scaling and inner product, and in this approach the range blocks are quantized. The experiment was performed on a 256x256 Lena image with a range block size of 4x4. With the full search method the algorithm encodes the image in 65 seconds and the decoded image has a PSNR of 31.01 dB; lowering the tolerance level to 1/4 and then 1/8 reduces the encoding time to 20 and 13 seconds respectively, while the PSNR decreases slightly to 30.79 dB and 30.28 dB. The algorithm was also tested against Fisher's scheme using a 512x512 Lena image on a Dell Pentium II 233 MHz machine; encoding the image takes 125 seconds with a PSNR of 34.62 dB. Riccio et al. [17] analyzed the complexity of fractal image compression and provided a new classification technique based on an approximation error measure. By classifying the domains it is possible to reduce the number of range-domain comparisons; the basic idea is to defer the comparison between ranges and domains with respect to a preset domain block. In the first phase of encoding only the feature vectors of the ranges and domains are stored.
The experiment was performed with the 512x512 Lena image decoded at different compression ratios; compression ratios of 6.98, 16.70 and 34.03 were obtained with PSNRs of 37.29 dB, 33.95 dB and 30.43 dB respectively.
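Several of the schemes above share one idea: a cheap statistic filters the domain pool before the expensive distance computation. A minimal sketch of such an STD pre-filter follows; the threshold value and names are placeholders, not taken from the papers:

```python
# STD pre-filter: discard domain blocks whose standard deviation differs
# from the range block's by more than a threshold, trading a little PSNR
# for a large cut in range-domain comparisons.
import numpy as np

def candidate_domains(range_blk, domain_blks, threshold=4.0):
    r_std = range_blk.std()
    return [d for d in domain_blks if abs(d.std() - r_std) <= threshold]
```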

2.4 Related Research on Parallel Fractal Image Compression The sequential fractal image encoding procedure requires very high computational time, and hence a parallel architecture and an efficient algorithm are required to enhance fractal encoding. Different parallel algorithms and architectures have been presented in the literature to increase the efficiency of fractal image encoding; a few of the prominent research contributions are reported here in terms of their findings and limitations. Xue et al. [25] reduced the encoding complexity with a parallel algorithm on the SPHINX machine using a multi-SIMD (Single Instruction Multiple Data) quad pyramid. The encoding procedure first divides the image into domain and range blocks. The proposed method uses layers: the original image is loaded onto layer 0 and a reduction operation is applied in which each processor calculates the average pixel value of its four child processors in order to obtain the reduced image. Next the layer is cut into 4 identical square segments and the reduced image is loaded into each of these segments; the reduced image is moved step by step in the x and y directions, the transformations are applied, and for each transformation the similarity between the domain and the reduced range block is computed. Reducing the range block size yields significant speed-up, and the complexity reduces from O(n^4) to O(n^2); the parallel algorithm obtains a speed-up of O(n^2) and gains 25% in efficiency in the encoding procedure. Jackson et al. [26] introduced a parallel scheme for a hypercube multiprocessor to increase the speed-up of the encoding process. The method uses Fisher's [5] quadtree model to partition the image into range blocks, which are then compared to the domain blocks. The sequential encoding procedure is translated to a parallel one on PVM (Parallel Virtual Machine) using nCUBE processors, with the quadtree partition scheme used to improve performance. The implementation uses an n-cube parallel computer with 1 MB of RAM per node. The program is divided into a host process executing on a single node and a number of slave processes, one running on each slave node. The host process distributes parts of the image to the slave nodes, and each slave node performs matching between range and domain blocks. The host process checks its message buffer for requests for a new range and, if one is present, sends the range to the requesting slave node. The slave nodes receive domain blocks, and each range block is compared to the domain blocks belonging to that slave process; if the message buffer of a slave node is empty, it requests a new block from the host. The implemented scheme also includes a simple classification algorithm: when a slave receives a range block from the host, the block is compared to a flat gray block whose gray level equals the average grayscale level of the range block, and if the difference is less than the RMS threshold value, no comparison is performed and the average gray value is returned to the host. Similarly, when domain blocks are received from the host, each slave classifies them into gray-shaded and non-gray-shaded domains; the gray-shaded domain blocks are not used in comparisons with non-gray-shaded range blocks. The experiment performed on the boy image shows a compression ratio of 26.7:1. The high efficiency of the algorithm degrades when the complexity of the image is unevenly distributed, which results in load imbalance, and the proposed architecture is not scalable, as only 1 MB of memory is available per processing element.
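The gray-shade shortcut just described is easy to state in code. A minimal sketch, with an illustrative threshold and function name of our choosing:

```python
# Shade-block shortcut: a range block that is close (in RMS) to a flat
# block at its own mean gray level is coded as that mean directly and
# never enters the expensive range-domain comparisons.
import numpy as np

def try_shade_block(range_blk: np.ndarray, rms_threshold: float = 2.0):
    mean = float(range_blk.mean())
    rms = float(np.sqrt(((range_blk - mean) ** 2).mean()))
    return mean if rms < rms_threshold else None   # None => full search
```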

Nge et al. [27] sped up the fractal encoding procedure by spreading the load among 12 Hewlett Packard Visualize C180 workstations interconnected by an Ethernet network. The method is implemented on PVM (Parallel Virtual Machine) and utilizes both static and dynamic load allocation. In the static load allocation scheme the master distributes tasks to the slave nodes; the slave nodes perform the range-domain comparisons and return the data to the master. Static load distribution does not provide substantial improvement under varying load, which results in lower speed-up. In the dynamic load allocation scheme the master's task is similar to that of the static implementation. The minimum range block size used for each task is 64x64; with a 512x512 image the algorithm generates 64 tasks for 12 slaves. The master only calculates the size and coordinates of each slave task for the 4^n total tasks, where n is a positive integer, then checks its pool of tasks to see whether all tasks have been sent to slave nodes; after receiving all data from its slaves it writes the output to a separate file. Static load allocation gains an average speed-up of 6.9 times for the Lena image, but the compression time is still high, about 119.86 s. Dynamic load allocation gives a further gain in speed-up of up to 1.5 times, which is better than static allocation. The implementation distributes only 64 tasks over 12 computing nodes, and the compression time of 119.86 s is very high; it could be reduced by applying load balancing strategies. Uhl et al. [28] developed a parallel algorithm for fractal image compression on a MIMD architecture. The experiment uses the 256x256 Lena image with 256 grey values and is performed on DEC AXP 3000/400 workstations interconnected by a fiber distributed data interface (FDDI) in a star topology, using the PVM parallel programming library. The whole computational task is split into subtasks, and a host process is responsible for distributing these subtasks among the node processes. The node processes do their assigned calculations and send the results back to the host process. The subtasks are distributed dynamically where possible: whenever a node process has sent its result back, the host process assigns another subtask to the idle node until all subtasks are evaluated. The speed-up was linear at all stages. Chow et al. [29] proposed a hardware-based pipeline architecture using multiple digital signal processors that utilizes the Hurd and Barnsley algorithm. The implementation uses multiple ISA-bus DSP cards developed as low cost, general purpose digital signal processing (DSP) cards. The master is connected to multiple slave PEs, each holding a subset of the domains, in pipeline fashion; the slaves perform the comparison between ranges and domains. The parallel execution of the domain search takes 519 s. In this pipeline architecture, after completing a job the first card has to pause for a short time while waiting for the second card to finish; the scheme also requires an additional file management program that stores the compressed data after each job, which creates overhead. Palazzari et al. [30] present massively parallel processing on SIMD machines that uses High Level Parallelism (HLP) to minimize communication overhead, since with HLP the speed-up equals the number of processors. The implementation is performed on the MPP APE100/Quadrics machine.
The machine is connected according to a 3-dimensional toroidal topology and, in its largest configuration, uses 2048 processors giving a peak power of 100 Gflops. The machine is hosted by a Sparc20 workstation, and I/O is performed through an HIPPI channel (High Performance Parallel Interface) which offers a bandwidth of 20 MByte/s. The algorithm uses quantized values of the ranges and domains, which can be further exploited in a coefficient quantization algorithm. For a 512x512 image with 8x8 range blocks, the compression time with the DP step set to 2 was 11.2 s, with compression ratio CR = 18.3 and PSNR = 30.3 dB; with the DP step set to 8, the compressed image is obtained in 0.7 s with CR = 21.3 and PSNR = 30 dB. The proposed architecture gains 30% over classical IFS coding with no loss in compression ratio. The architecture has a peak power of 100 Gflops, with 2048 processors connected in a 3-dimensional toroidal topology, and compresses the image in 11.2 s, although the system is not cost effective because of the very large number of processors. Hammerle et al. [31] proposed a new parallelization strategy for fractal image compression which utilizes a block classification method. The algorithm is implemented on 8 DEC AXP 3000/400 workstations, a Parsytec CC-48, and an SGI POWER Challenge GR; PVM is employed for a message passing based host/node configuration. The algorithm uses a MIMD architecture, and for the MIMD computation two classes of parallelism are derived, one that employs parallelism via the ranges and the other via the domains. The adaptive quadtree algorithm (starting with range size 16x16 pixels and partitioning until size 4x4 is reached, after which the range is coded directly) is applied to the following images with 256 gray values: the well known Lena image (512x512 pixels for the Parsytec, 1024x1024 pixels on the NOW) and a large satellite image (2048x2048 pixels on the SGI). The domains in the domain pool are chosen to be non-overlapping (as in most fractal coders, in order to keep the complexity reasonably low). For the smallest image a speed-up of 3.3 on 12 PEs on the Parsytec is observed, a speed-up of 4.17 with 7 PEs on the NOW for the 1024x1024 image, and an increasing speed-up of 8.4 for the 2048x2048 satellite image. Anacarani et al. [32] worked on an array of processors to improve the performance of fractal encoding. The authors proposed an ASIC architecture that performs the comparison between the domain and the transformed range blocks. To increase performance, the mean absolute difference (MAD) is used for the comparison. To store the range and domain blocks temporarily there are two register files: the range blocks are stored in a range block matrix (RBM) file of 32 registers and the domain blocks are stored in a direct domain block matrix (DDBM). By reading the matrices in a fixed sequence, the right data streams for the eight MAD comparators are obtained; each MAD comparator accumulates the absolute value of the difference between the pixels, and at the end of the comparison the MAD results are compared with each other. The experiment was performed with some real images. The ASIC architecture is designed as an add-in board for a host PC platform, connected by the PCI bus. The experiment encodes the image in 9.8 seconds, which is very low; the speed measured was 300 times the software-simulated encoding speed. Acken et al. [33] focused on pixel block comparison and presented a parallel ASIC array architecture for use in fractal encoding that uses a full-domain quadtree search. The parallel architecture is implemented as an add-on board for a PC (personal computer) connected via the PCI bus. The encoding includes preprocessing, MSE minimization and post processing.
A characteristic of the MSE metric is the accumulation of pixel values into large bit-width temporary values, ultimately reducing the bit-width of the resultant value back to a small number of bits. To form the domain pool, domains are contracted down to 2x2 pixels by averaging. The experiment is performed at the 16x16 quadtree level on the pixel block processors, which deliver the sum of the pixels and the sum of the pixels squared to the MSE processors. The encoding time is measured using Lena images of sizes 128x128, 256x256, 512x512 and 1024x1024. The results show that for an image of size 256x256 the encoding time approaches real time, but for a 1024x1024 image the encoding time is slightly higher, as pixel blocks are swapped into and out of the array. The experiment is performed using both full bit precision and reduced bit precision fractal encoding; the measured encoding time is less than 6 seconds for all of the image sizes above. Jackson et al. [35] researched a pipeline computation model to address fractal image complexity. In this model a toroidal linear array of processors is employed in a pipelined fashion. Initially the image is partitioned into range and domain blocks using the quadtree partition method. The parallel architecture used for the implementation is the nCUBE-2 parallel supercomputer, which follows the MIMD (Multiple Instruction Multiple Data) paradigm: each processor executes its own instructions on its own data items. The nCUBE-2 has a hypercube array of 8192 nodes. One host distributes the tasks to the slaves, and the slaves perform the individual operations, communicating in a circulating pipeline fashion. The host first loads the image; the image is then divided into domain blocks of an appropriate size according to the current level, and these domain blocks are distributed among the slave nodes. Each slave checks its buffer for range blocks that may have arrived from other slaves; if a range block is classified as a grayscale block, the average grayscale is determined and the result is returned to the host. The parallel algorithm gives 95% efficiency, and the compressed images compare well with JPEG compressed images, which gives promising results. Era [36] worked on graphics hardware based parallel fractal image compression on a SIMD (Single Instruction Multiple Data) architecture. By using the programmable capability of the GPU (Graphics Processing Unit), image compression can be accelerated: the GPU performs the pairing test of all ranges for a given domain block in parallel. There are different modules in this scheme. The first module takes the original image and returns a texture as output, with a buffer that can accommodate a 32-bit color component texture. The second module precomputes the domain summations and can hold a 16-bit color component texture. The third module is governed by the CPU itself, which only performs the start-up routine to generate the streams for the fragment program. The fourth module is handled by the GPU itself, which performs the pairing test between range and domain blocks. The last module takes care of scaling and offset; this operation avoids useless write operations, which optimizes the memory bandwidth. The experiment is performed on a Pentium IV machine with a 3.2 GHz processor and 1 GB of RAM; the graphics card is a GeForce FX 6800 with 128 MB of video memory, a core speed of 300 MHz and a memory speed of 800 MHz. The test image has a resolution of 256x256 pixels; choosing a range size of 4x4 pixels gives 64x64 ranges. The domain blocks are twice the size of the range blocks, and a step of one pixel is used to scan the image. The CPU version takes about 280 seconds to perform all pairing tests, whereas the GPU version takes about 1 second. Yunda et al. [37] combined sequential and parallel processing to gain maximum speed-up, proposing a derivative tree topology to reduce the complexity of the parallelism.
To reduce the encoding time a dual classification scheme is used after creating the range blocks. The first classification uses domain blocks of the same size as the range blocks; the second classification is based on double-sized domain blocks, so the domain pool is classified into a same-size domain pool and a double-size domain pool. The method compares its results

with Fisher's [3] and Hurtgen's [5] results. The method uses a tree-topology based MIMD architecture consisting of one host and three slave machines; the experiment is performed on four IMS T800 machines. The first experiment was performed to choose the minimum number of classes: as the gray level interval increases, the PSNR remains unchanged while the execution time increases. The second experiment compares the PSNR with Fisher's and Hurtgen's [5] schemes; the PSNR reaches about 0.8 dB higher than both. The third experiment compares encoding time and shows higher speed-up for larger images. Cao et al. [38] proposed a parallel algorithm based on Jacquin's [2] fractal coding using the OpenMP programming model. The search range in the sequential Jacquin [2] fractal coding is partitioned into sub search ranges, which can be allocated to multiple threads in the OpenMP programming model. OpenMP is a portable, scalable model that gives shared memory parallel programmers a simple and flexible interface for developing parallel applications. An OpenMP program begins with a master thread that executes sequentially until the first parallel region is encountered; the master thread then creates a team of parallel threads (fork), and when the team of threads completes its operation the threads synchronize and terminate, leaving only the master thread (join). The variance between each range block and all transformed domain blocks is calculated to find the most similar domain block, and image slices are fed to the individual cores of the multi-core processor. The experiment is performed on an Intel(R) Core(TM) 2 8200 processor (2.33 GHz) with four cores and 2 GB RAM, and shows a speed-up four times better than the sequential algorithm; the proposed model can only operate on up to twice the number of cores. Cao et al. [39] carried out similar research on Fisher's [3] fractal image encoding scheme, using the OpenMP programming model on an Intel(R) Core(TM) 2 8200 processor (2.33 GHz) with four cores and 2 GB RAM, in an experiment in which the image is partitioned using quadtree partitioning. As in Fisher's encoding there are different classes of domains; the variance between each range block and all transformed domain blocks is computed in a loop, and the algorithm searches independently for the best matched domain block for each range block, with the multi-core threads searching the fractal codes of the range blocks in the loop. The experiment is tested on a 512x512 grayscale Lena image, and the performance is evaluated using 2, 4, 8, 16 and 32 threads. It was observed that performance increased linearly only up to 8 threads; measured against the single-threaded version, the obtained speed-up is 4 times better than the sequential algorithm. Fang et al. [40] created a distributed parallel computing system based on web services. The model is composed of a monitor, which is used to add and delete the computing nodes; a client, which establishes a connection if nodes are found and partitions the parallel computing task into many smaller task blocks; and web service computing nodes, which are responsible for computing the task blocks and sending results to the client. The model partitions an image into range blocks of sizes 4x4, 8x8 and 16x16; according to the step length these blocks are divided into multiple task blocks. The client creates a thread for each computing node, and each thread is responsible for assigning tasks to the corresponding web service computing node.
The experiment is based on five PCs, each an Intel Pentium IV 2.93 GHz with 2 GB of RAM. Two tests were performed on an 8-bit grey conference image of 256x256, a Lena image of 512x512 and a movie image of 720x576, using global search and nearest neighbor search for the domain blocks. The first test, using global search, gave a compression ratio of 15.977 for the conference image and 15.944 for the Lena image. The second test, using nearest neighbor search for the domain blocks, obtained a compression ratio of 15.944 for the Lena image, and the same test on the movie image gave a compression ratio of 15.996. The experiment also shows a speed-up of 483.5 times for the Lena image and 801.5 times for the movie image. Wakatani et al. [41] enhanced a fractal image coding algorithm using a Graphics Processing Unit (GPU) with CUDA (Compute Unified Device Architecture). CUDA is a framework for constructing and executing general purpose applications on GPUs, and consists of APIs for the C language. To facilitate communication between CPU and GPU, global memory is used, but the access speed of global memory is slower than that of shared memory; to exploit the performance of a parallel application, shared memory must be used efficiently. The parallel algorithm initially reduces the computational complexity by reducing the memory access cost. CUDA uses coalesced communication for moving data between global and shared memory: memory accesses to adjacent addresses are combined into one memory request spanning several threads, so the utilization of the memory bandwidth is enhanced. When the memory access cost dominates the computational cost, this modification enhances the effectiveness of the parallelism. The parallel algorithm yields a GPU-over-CPU speed-up of from 32 to 238, and the PSNR measured for 4x4, 8x8 and 16x16 range blocks is 34.9 dB, 30.25 dB and 25.98 dB respectively. These experiments show that the quality falls as the range block size increases, so an adaptive approach is proposed to overcome this problem: a range block of size 16x16 is chosen and the quality of each block of the compressed data is calculated; if the quality is below the required level, the block is divided into four 8x8 range blocks and these blocks are compressed. The adaptive algorithm achieves a speed-up of about 128.35 on the given GPU while the PSNR remains 34.9 dB. The problem remains that the memory access cost of the GPU is high; Wakatani et al. reduced it by adding more threads to each thread block, with the result that the algorithm achieves a speed-up of about 162.54 on the given GPU. Wakatani et al. [42] proposed an improvement of their own method, a GPGPU implementation of the adaptive fractal image coding algorithm using index vectors. They reduce the memory access cost of the parallel algorithm [31] on the GPU by adding more threads to each thread block. The problem with adaptive fractal image coding on the GPU is that the speed-up is low because of hardware constraints such as the size of the shared memory, which reduces the occupancy of the GPU cores; to increase the occupancy, more threads are added to each thread block. The resulting algorithm achieves a speed-up of about 171.98 on the given GPU [32]. Gomes et al. [50] modeled a multiprocessor architecture to reduce the fractal encoding time. The experiment is performed to show that increasing the number of cores leads to higher speed-up and efficiency, using 256x256 Lena and 512x512 Coliseum images. Initially the image is partitioned into equal subsets of range and domain blocks.
Experiment starts with the n number of threads, and the value of the threads depend on the number of range partition. The result is best for the range block of size 2x2. By using only one thread 6.57 seconds encoding time is achieved. But when we increase the number of threads to 32 gives us higher encoding time. This is behavior occurs due to

overhead of mutex, synchronization and thread management primitives. With only 2 cores the efficiency reaches 92%, but when execution starts with 4 cores the efficiency goes down, because the threads compete for the cores. A speed up of 1.97 is achieved with 2 threads, which is good; when the number of threads reaches 32 the speed up drops, owing to the lower computation grain per thread. Later the experiment was performed with dual and quad core processors, which give average efficiencies for 4 threads of 46.3% and 68.2%, respectively. Lee et al. [34] proposed an efficient parallel architecture for fractal image encoding based on a fixed-size full search algorithm. The main feature of this architecture is that it uses local communication between processors. In the proposed architecture a systolic array of PEs (processing elements) is created, capable of computing range-domain distances. The range blocks are preloaded into all PEs; each PE's range block is compared to the domain blocks in parallel, then all the domain blocks are shifted and the process is repeated. The design uses half-overlapping domains, which overlap along the vertical and horizontal directions with a fixed overlapping interval between domain blocks. Each PE has four modules: fast compression, contractive mapping, a 1-to-2 decoder and a 2-to-1 multiplexer. The proposed architecture is capable of performing fast compression and is well suited to VLSI realization. Peter et al. [48] proposed a quad-tree based MIMD (Multiple Instruction Multiple Data) architecture to maximize processor utilization in parallel. The authors solve the problem that arises when a quad-tree based partition scheme is used to partition the range and domain blocks: the quad-tree partition executes recursively and terminates when a previously fixed threshold is not exceeded, so no single processor can calculate and store the whole domain pool for classification, and some processors go into no-operation (idle) mode. Initially there are two algorithms for fractal image encoding on a MIMD architecture: the first stores the whole image on each PE (processing element); the second distributes the domain pool to all PEs statically or dynamically. Processing starts and each PE performs range-domain comparisons. The problem arises when most of the processors are idle: finding a good match for a more complex range block may require more computation, while the other PEs have to wait for this PE to finish searching before proceeding to the next quad-tree level. The utilization of processors can be increased by assigning tasks to PEs immediately. There is one master and multiple slaves in the architecture: the master divides the range blocks and sends tasks to the slaves; the slaves perform their operations and, on completion, each slave sends its result back to the master. The results show linear speed up up to a certain number of processors. Gehrig et al. [50] developed a method to compress images with a geometrical coding approach using distributed compression. They propose distributed compression for multi-view images, where each camera individually and efficiently encodes its visual information locally without any collaboration with other cameras, targeting camera sensor networks in which each camera has limited power and communication resources.
The distributed source coding approach is used to model 2D piecewise polynomial functions with 1D linear boundaries. In this work the binary-tree decomposition method is replaced by a quad tree. The experiment is simulated with real images encoded at 0.32 bpp. Six different

views of the image are captured by the cameras, and only the quad-tree structure is transmitted for the other four views. The experimental results show superior performance at the considered bit-rates.

2.5 Summary of Literature Survey
A large number of parallel and distributed fractal image encoding algorithms and architectures have been proposed by various researchers, and distributed computing appears to be very cost effective for fast computing using limited, existing resources. In this chapter the work of some of the major researchers in high performance computing has been examined, and it was also shown how the encoding time of fractal image compression can be reduced by using a GPU.

CHAPTER 3 FRACTAL IMAGE COMPRESSION

This chapter details fractal image compression and the various fractal encoding techniques proposed by researchers. It describes how fractal image compression is mathematically modelled, and presents the general fractal image encoding algorithm and its functionality.

3.1 Introduction
A fractal is a geometrical concept that lets us see objects in a different way. Research by Mandelbrot in 1982 gave us the concept of fractal geometry: a simple mathematical function is iterated repeatedly at different locations until either it converges on a single value or it is found to be unbounded. Later this concept of fractal geometry was applied in the field of image compression. Fractal image compression can be easily understood with the help of a photocopy machine that makes reduced copies: the output image converges to the original image, as shown in Figure 3.1. The output image has detail at every scale, and it consists of transformed copies of the original image. There are eight transformations that can be applied over a reduced copy of the image; such transformations skew, stretch, rotate, scale and translate an input image. Mathematically, this can be represented as

 x   a b  x   e       w(x)  w     y   c d  y   f 

Figure 3.1: A Photocopy machine that makes three reduced copies of the input image

(3.1)

These transformations are called affine transformations. To obtain a fractal encoded image, it is necessary to divide the image into smaller blocks: the smaller blocks are called ranges and the larger ones domains. Generally the domain size is twice the range block size. A. Jacquin developed an automated method for encoding images using PIFS (Partitioned Iterated Function System). After partitioning the image using an appropriate partitioning technique, an IFS is found for each of these blocks; by the contractive mapping theorem, the IFS will converge to a single fixed image after repeated iterations. The next step is to map ranges over domains and find an appropriate match; we store only the transformation (IFS code) for each block. An optimal fractal image encoding algorithm performed as a full search would require O(n^4) time.
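To make this cost concrete, consider purely illustrative numbers (not taken from this thesis): a 512x512 image with 8x8 range blocks and 16x16 domain blocks stepped by one pixel gives

$$N_R = \left(\tfrac{512}{8}\right)^2 = 4096, \qquad N_D = (512 - 16 + 1)^2 = 247\,009,$$

so a full search performs $N_R \cdot N_D \approx 1.0 \times 10^9$ range-domain comparisons before the eight isometries are even counted, which is why the encoding phase dominates the total cost.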

3.2 Mathematical Background To produce Iterated Function Systems, there must exist a space that supports images and on which distances can be measured. For an IFS to converge to an attractor, the mapping that defines this IFS must be a contraction mapping. Having a metric will allow us to measure distances on a given space, as well as determine which are contraction mappings and which are not. Also, having a contraction mapping is an essential ingredient in Fractal Image Compression. Definition 1

A metric space $(X, d)$ is a set $X$ together with a real-valued function $d : X \times X \to \mathbb{R}$, which measures the distance between pairs of points $x$ and $y$ in $X$. $d$ is called a metric on the space $X$ when it has the following properties [2]:

i. $d(x, y) = d(y, x), \quad \forall x, y \in X$

ii. $d(x, y) \geq 0, \quad \forall x, y \in X$

iii. $d(x, y) = 0 \iff x = y, \quad \forall x, y \in X$

iv. $d(x, y) \leq d(x, z) + d(z, y), \quad \forall x, y, z \in X$

Definition 2
Let $X$ be a space. A transformation, map, or mapping on $X$ is a function $f : X \to X$. If $S \subset X$, then $f(S) = \{f(x) : x \in S\}$. The function $f$ is one-to-one if $x, y \in X$ with $f(x) = f(y)$ implies $x = y$. It is onto if $f(X) = X$. It is called invertible if it is one-to-one and onto; in this case, it is possible to define a transformation $f^{-1} : X \to X$, called the inverse of $f$, by $f^{-1}(y) = x$, where $x \in X$ is the unique point such that $y = f(x)$ [2].

There are eight isometric transformations that can be applied over an image: identity, rotation through +90°, rotation through +180°, rotation through +270°, reflection about the mid-vertical axis, reflection and rotation through -90°, reflection and rotation through -180°, and reflection and rotation through -270°. These isometric transformations are affine transformations and are generally applied to range blocks.

Figure 3.2: Affine Transformations — (a) shear, (b) stretch, (c) skew, (d) rotate
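As a concrete illustration of the eight isometries, the following minimal Java sketch enumerates the transformed versions of an n x n pixel block for an index t in 0..7. The index ordering and the expression of the composed reflections as diagonal reflections are one of several equivalent conventions, assumed for illustration and not taken from the original text:

```java
// The eight isometries of a square block b (n x n), indexed t = 0..7.
static int[][] isometry(int[][] b, int t) {
    int n = b.length;
    int[][] r = new int[n][n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            switch (t) {
                case 0: r[i][j] = b[i][j]; break;                 // identity
                case 1: r[i][j] = b[n - 1 - j][i]; break;         // rotate +90
                case 2: r[i][j] = b[n - 1 - i][n - 1 - j]; break; // rotate +180
                case 3: r[i][j] = b[j][n - 1 - i]; break;         // rotate +270
                case 4: r[i][j] = b[i][n - 1 - j]; break;         // reflect, vertical axis
                case 5: r[i][j] = b[n - 1 - i][j]; break;         // reflect, horizontal axis
                case 6: r[i][j] = b[j][i]; break;                 // reflect, main diagonal
                case 7: r[i][j] = b[n - 1 - j][n - 1 - i]; break; // reflect, anti-diagonal
            }
    return r;
}
```

During encoding, each candidate domain block is compared against a range block under all eight of these variants, multiplying the search cost by eight.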

Definition 3
Affine transformations on $\mathbb{R}$ are transformations $f : \mathbb{R} \to \mathbb{R}$ of the form $f(x) = ax + b$, $x \in \mathbb{R}$, where $a$ and $b$ are real constants [2].

If a  1, then this transformation contracts the line toward the origin. If a  1 , the line is stretched away from the origin. If a  0 , the line is flipped 180 about the origin. The line is translated, or shifted, by an amount b. If b  0 , then the line is shifted to the right. If b  0 the line is translated to the left. We will consider affine transformations on the Euclidean plane. Let w : R 2  R 2 be of the form w( x, y )  ( ax  by  e, cx  dy  f )  ( x , y ) , where a , b, c, d , e, and f are real numbers, and ( x , y ) is the new coordinate point. This transformation is a two-dimensional affine transformation. We can also write this same

 x   a b  x   e        Ax  T , where transformation with the equivalent notations: w(x)  w     y   c d  y   f  e A is a 2  2 real matrix and T    represents translations [2]. The matrix A can always be f written in the form of

 a b   r1 cos 1    A    c d   r1 sin  1

 r2 sin  2  , where ( r1 , 1 ) are the polar r2 cos 2 

coordinates of the point ( a, c ) and ( r2 , ( 2  2 )) are the polar coordinates of the point (b, d) [2]. This means that r1  a 2  c 2 , tan  1 

c b , r2  b 2  d 2 , tan  2  . a d

The different types of transformations that can be made in $\mathbb{R}^2$ are dilations, reflections, translations, rotations, similitudes, and shears.

A dilation on $(x, y)$ is written in the form $w_d(x, y) = (r_1 x,\ r_2 y)$ or

$$w_d(\mathbf{x}) = \begin{pmatrix} r_1 & 0 \\ 0 & r_2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}.$$

Depending on the values of $r_1$ and $r_2$, this dilation could contract or stretch $\mathbf{x}$ [2]. A reflection about the x axis can be written as $w_{rx}(x, y) = (x, -y)$, while a reflection about the y axis is written as $w_{ry}(x, y) = (-x, y)$ [2]. In matrix representation, these reflections are

$$w_{rx}(\mathbf{x}) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \quad\text{and}\quad w_{ry}(\mathbf{x}) = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}.$$

Translations can be made in the x or y direction by adding a scalar to the corresponding component of the map [2]. Translations are written in the form $w_t(x, y) = (x + e,\ y + f)$ or

$$w_t(\mathbf{x}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} e \\ f \end{pmatrix}.$$

If $e < 0$, the map translates in the negative x direction; if $e > 0$, in the positive x direction. If $f < 0$, the map translates in the negative y direction; if $f > 0$, in the positive y direction.

A rotation mapping has the form $w_r(x, y) = (x\cos\theta - y\sin\theta,\ x\sin\theta + y\cos\theta)$, also expressed as

$$w_r(\mathbf{x}) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix},$$

for some rotation angle $\theta$, $0 \le \theta < 2\pi$ [2].

A similitude is an affine transformation $w : \mathbb{R}^2 \to \mathbb{R}^2$ of one of the forms

$$w_s(\mathbf{x}) = \begin{pmatrix} r\cos\theta & -r\sin\theta \\ r\sin\theta & r\cos\theta \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} e \\ f \end{pmatrix}, \qquad w_s(\mathbf{x}) = \begin{pmatrix} r\cos\theta & r\sin\theta \\ r\sin\theta & -r\cos\theta \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} e \\ f \end{pmatrix},$$

for some translation $(e, f) \in \mathbb{R}^2$, some real number $r \neq 0$ (the scale factor), and some angle $\theta$, $0 \le \theta < 2\pi$ [2]. A similitude combines the rotation, dilation, and translation rules together.

A shear transformation, or skew transformation, takes one of the forms

$$w(\mathbf{x}) = \begin{pmatrix} 1 & b \\ 0 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \quad\text{or}\quad w(\mathbf{x}) = \begin{pmatrix} 1 & 0 \\ c & 1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix},$$

where $b$ and $c$ are real constants [2]. In each case there is one coordinate which is left unchanged. One can imagine the action of this mapping on some rectangle as shearing a deck of cards.

3.2.1 Convergence and contractive mapping
In producing Iterated Function Systems, it is necessary to have a set of transformations that converge to a desired image. For these mappings to converge to the desired image, they must be contraction mappings. To apply IFS to Fractal Image Compression we will need the following definitions and theorems.

The Contraction Mapping Theorem. Let $f : X \to X$ be a contraction mapping on a complete metric space $(X, d)$. Then $f$ possesses exactly one fixed point $x_f \in X$, and moreover for any point $x \in X$, the sequence $\{f^{\circ n}(x) : n = 0, 1, 2, \ldots\}$ converges to $x_f$; that is,

$$\lim_{n \to \infty} f^{\circ n}(x) = x_f, \quad \text{for each } x \in X.$$

Figure 3.3: Mapping of a domain block $D_I$ to a range block $R_I$

Definition 4
A sequence $\{x_n\}_{n=1}^{\infty}$ of points in a metric space $(X, d)$ is said to converge to a point $x \in X$ if, for any given number $\varepsilon > 0$, there is an integer $N > 0$ such that $d(x_n, x) < \varepsilon$ for all $n > N$ [2]. The point $x \in X$ to which the system converges is called the limit of the sequence. This definition differs from that of a Cauchy sequence because the metric $d$ is not measuring the difference between consecutive terms, but between each term in the sequence and $x$, the constant value which the sequence is approaching. So, if a sequence converges to $x$, then as $n$ approaches infinity the $n$th term of the sequence grows closer, within some epsilon, to $x$. Relating this terminology to an IFS, the limit $x \in X$ of the sequence $\{x_n\}_{n=1}^{\infty}$ corresponds to the attractor of the IFS, where the $n$th term $x_n$ is the $n$th level of the IFS after $n$ iterations on the seed image $x_0$.

3.2.2 Iterated Function System
The most general form of encoding images is based on the IFS (Iterated Function System) described by Barnsley [2]. The IFS creates a contractive functional mapping from an original image to smaller portions of the same image. The basic problem is to find appropriate contractive transformations whose attractor is an approximation of the given image; for the purpose of image compression it is then enough to store the relevant parameters of these transformations instead of the whole image. The iterated function system is a concept developed by Barnsley [2], building on the work of Hutchinson [2]. Let $(X, d)$ be a complete metric space and let $w_i : X \to X$ be a collection of mappings $(w_i;\ i = 1, 2, 3, \ldots, N)$. Then the equation that defines an IFS is

$$\Omega = \{X, (w_i);\ i = 1, 2, 3, \ldots, N\} \tag{3.2}$$

Also, if we consider two pixel points p1 and p2, then the metric space X maps p1 to p2; for the IFS to be contractive, the distance must be smaller after evaluating the IFS. The concept of an IFS can be easily understood through the Sierpinski triangle, named after the mathematician Waclaw Sierpinski, which can be described by a Multiple Reduction Photocopy Machine: an image is placed on the machine, reduced by one half and copied three times, once onto each vertex of a triangle. The corresponding contractive affine transformations are written in the form

$$w(x, y) = (ax + by + e,\ cx + dy + f)$$

The Sierpinski triangle, shown in Figure 3.4, is the result of running the deterministic algorithm.

Figure 3.4: Sierpinski Triangle

3.2.3 Recurrent Iterated Function System
The recurrent iterated function system extends the IFS with the added capability of combining different IFSs into one. Consider the task of creating a Barnsley fern with Sierpinski triangle leaves.

Figure 3.5: Fern with Sierpinski Triangle

In this scenario we need two IFSs: one to create the Sierpinski triangle, and another to place the triangle at the positions of the leaves. As each IFS copies itself, and the fern copies the triangles, the resultant fern looks as shown in Figure 3.5.

3.2.4 Partitioned Iterated Function System
Given the IFS of an image, finding the attractor is relatively simple. The complexity of fractal coding lies in the inverse problem: given an image, finding an acceptable IFS. Finding a full IFS of an image is a difficult problem with no known automated solution. An approximate solution to this inverse problem is to create the IFS by mapping parts of the image to other parts of the image, which allows for full automation of fractal coding.

This method uses local iterated function systems (also called partitioned iterated function systems, PIFS) and was first presented by Jacquin [4]. In this method, the original image is divided into non-overlapping groups of pixels called range blocks, and the image is also divided into possibly overlapping groups of pixels called domain blocks. The range blocks are then mapped to individual domain blocks. A partitioned iterated function system is a generalisation of an IFS: an IFS is applied over an entire image, whereas a PIFS is applied over parts of an image. Let $(X, d)$ be a metric space and $D \subset X$ a set of domains. Let $W : D \to X$ be a set of contractive transformations. Then $\Omega = \{X, (w_i);\ i = 1, 2, 3, \ldots, N\}$, where $w_i \in W$, is a partitioned iterated function system. The attractor of the PIFS is close to the encoded image; the compressed image is stored as the parameters of all the mapping functions.

3.3 Fractal Image Encoding
Fractal image encoding is the core process, computed using a PIFS by dividing the image into two block sets: the first set contains the range blocks and the second the domain blocks. Let us consider an image $I$ which has to be encoded using a PIFS. We divide the image into ranges and domains, where the range pool is a set of non-overlapping blocks

RI  R1, R2 ...Rn

blocks

R  R i

j

  is called pool of I, Where

i j

N

I   Ri

(3.3)

i 1

Also a set D overlapping domain blocks DI  D1 , D2 ...Dn is called a domain pool of I, where N

I   Di

(3.4)

i 1

A set $W$ consists of contractive affine transformations $(w_i;\ i = 1, 2, \ldots, N)$ which, for each range cell, map the corresponding domain block onto it:

$$w_i : D_i \to R_i \quad (i = 1, 2, \ldots, N), \qquad R_i = w_i(D_i) = s_i\,\varphi(D_i) + o_i,$$

where $\varphi$ is the spatial contraction function, which contracts the domain cells to the size of the range cells, $s_i$ is the real-valued contraction (contrast) factor for the image measure $(0 \le s_i < 1)$, and $o_i$ is the real-valued offset factor that determines the translation. The image encoding algorithm is described below.

Figure 3.6: Partitioned Range-Domain blocks (a large domain block and a small range block marked on the image)

Step 1: Partition the image $I$ into non-overlapping range blocks $R_I$.
Step 2: For each range block $R_i \in R_I$, find a domain block $D_i \in D_I$ and transformation parameters such that the distance $d(R_i,\ s_i\,\varphi(D_i) + o_i)$ is minimized.
Step 3: Save the transformation parameters $(\varphi, s_i, o_i)$.
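The inner loop of Step 2 admits a closed-form choice of $s_i$ and $o_i$ for each candidate domain block. Below is a minimal Java sketch of this search for one range block; the flattened-array representation, the method name, and the assumption that $\varphi$ has already contracted each domain block to range size are illustrative, not taken from the original text, and the eight isometries are omitted for brevity:

```java
// Brute-force search for one range block R: for every domain block D
// (already contracted by phi), compute the least-squares (s, o) that
// minimize the squared error of s*D + o against R, and keep the best.
static double[] encodeRange(double[] R, double[][] domainPool) {
    double best = Double.MAX_VALUE;
    double[] code = new double[3]; // {domainIndex, s, o}
    int n = R.length;
    for (int d = 0; d < domainPool.length; d++) {
        double[] D = domainPool[d];
        double sumD = 0, sumR = 0, sumDD = 0, sumDR = 0;
        for (int k = 0; k < n; k++) {
            sumD += D[k]; sumR += R[k];
            sumDD += D[k] * D[k]; sumDR += D[k] * R[k];
        }
        double den = n * sumDD - sumD * sumD;
        double s = (den == 0) ? 0 : (n * sumDR - sumD * sumR) / den;
        double o = (sumR - s * sumD) / n;
        double err = 0;
        for (int k = 0; k < n; k++) {
            double e = s * D[k] + o - R[k];
            err += e * e;
        }
        if (err < best) { best = err; code[0] = d; code[1] = s; code[2] = o; }
    }
    return code;
}
```

Minimizing the squared error over $s$ and $o$ is ordinary least squares, which is why only five running sums are needed per range-domain comparison; the cost nevertheless remains proportional to the product of the pool sizes.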

The process of image coding results in a set of contractive affine transformations. The encoding process is complex and takes a huge amount of computing time for the search. The partition of an image $I$ using the above algorithm is shown in Figure 3.6. Given the subsets, we geometrically transform the domain pool $D_I$ of the image over the range pool $R_I$ (the range blocks being smaller than the domain blocks), apply the contractive mapping $X_i : D_I \to R_I$ for each domain block, measure the minimum difference against the corresponding range block at each position, and record those transformations until the desired similarity is reached.

Figure 3.7: Fractal Image Encoding Flow Chart (Begin → divide the image into non-overlapping range blocks → divide the image into overlapping domain blocks → calculate the difference between range-domain blocks → perform contractive mapping → apply affine transformations to the domain blocks → if the domain-range mapping matches, store the transformation parameters)

The process of contractive mapping ensures that the PIFS (Partitioned Iterated Function System) [1] converges to a fixed point, the attractor of the IFS of the image; recording those transformations and finding that attractor is the computationally intensive part. As shown in Figure 3.6, an input 512x512 Lena image is uniformly partitioned into domain and range blocks: the smaller blocks shown are ranges and the larger ones domains. The flow chart of the above algorithm is shown in Figure 3.7.

3.4 Fractal Image Decoding
Decoding an image is not as complex as encoding. The decoded image can be obtained by iterating $W$ from an initial image: the iterated function system is computed starting from an arbitrary initial image, and to reach convergence the IFS needs only a small number of iterations. For each range block, we read the coefficients of the isometric transformation from the encoded file and map the corresponding domain block onto the range. Fractal images have detail at every scale; therefore any resolution can be chosen in decoding, and in this sense the decoding model is of unlimited resolution.
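A minimal Java sketch of one decoding iteration follows; the field names of the code record are hypothetical, the 2x2 pixel averaging stands in for $\varphi$, and isometries are omitted, so this is an illustration of the iteration scheme rather than the thesis's implementation:

```java
// Hypothetical fractal code entry for one range block (names assumed):
// dx, dy = top-left of the matched domain block; rx, ry = top-left of
// the range block; size = range side; s, o = contrast and offset.
static class Code { int dx, dy, rx, ry, size; double s, o; }

// One decoding iteration: contract each 2*size x 2*size domain block by
// 2x2 averaging (phi) and write s*phi(D) + o into its range block.
static double[][] decodeStep(double[][] img, java.util.List<Code> codes) {
    double[][] out = new double[img.length][img[0].length];
    for (Code c : codes)
        for (int i = 0; i < c.size; i++)
            for (int j = 0; j < c.size; j++) {
                double avg = (img[c.dy + 2 * i][c.dx + 2 * j]
                            + img[c.dy + 2 * i][c.dx + 2 * j + 1]
                            + img[c.dy + 2 * i + 1][c.dx + 2 * j]
                            + img[c.dy + 2 * i + 1][c.dx + 2 * j + 1]) / 4.0;
                out[c.ry + i][c.rx + j] = c.s * avg + c.o;
            }
    return out; // iterate from any start image, e.g. 8-10 passes
}
```

Because the stored maps are contractive, repeated application of decodeStep converges to the attractor regardless of the initial image supplied.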

3.5 Parallel Fractal Image Compression
All images have a certain geometry through which we can compress them; the fractal image compression technique, which evolved from fractal geometry, utilizes this geometry to compress an image, and the objective is to compress such geometry using parallel computing. Fractal image compression can be defined over the image $I$ as finding a function $X_i(I) = I$ that represents the image $I$ in a smaller space. For that, $I$ is subsequently divided into two regions, range $R_I$ and domain $D_I$, with $R_I, D_I \in I$, such that applying certain affine transformations over the subset $D_I$ and a contractive mapping $X_i$ between $R_I$ and $D_I$ yields a smaller space than $I$ that may represent the original image using fewer bits. The mapping function takes a long time to execute; a parallel approach can address this because mapping the subsets is an independent process and can be parallelized for every range and domain block. Fisher [2] discussed that assigning parts of an image to multiple processing elements (PEs) can reduce encoding time. To parallelize fractal compression it is necessary to identify the sequential and parallel parts; feeding the parallel part to a parallel machine may result in compression with less encoding time. To understand how parallelism can be applied to an image or subsets of an image, an architecture is shown in Figure 3.8: parts of the image (subsets) are sent to individual CNs (computing nodes), each node performs its operation individually, and there is no data dependency among them. The nodes perform the operation and return their results, which must be consolidated; the consolidation is generally done by the master node, the node that initiates and distributes the task.

Figure 3.8: Parallel Encoding (the original image is divided into subsets 1-4; each subset is sent to a computing node CN1-CN4, which applies the affine transformations and contractive mapping and returns its result; the results 1-4 are consolidated into the encoded image)

3.6 Summary
In this chapter, the basics of fractal image compression and the basic algorithm for encoding on a sequential machine have been presented. The chapter also briefly described parallel fractal image compression on multiple computing nodes.

CHAPTER 4 PARALLEL AND DISTRIBUTED COMPUTING

Parallel and distributed computing is an emerging field. This chapter describes the applications of, and need for, parallel and distributed computing in various domains, and includes a comparison of the two computing techniques.

4.1 Parallel Computing
The computing needs of people, and the data used by them, are increasing exponentially. Today we don't only store data in data centres but also process it to serve end users. This type of processing requires huge computing power, which cannot be served by a single-core processor. Consider a problem of 10 GB in size: using a single processor with a processing speed of 1 MB/second would take 10 × 1024 MB / (1 MB/s) = 10240 seconds, roughly 2.8 hours [10]. If the processing time needs to be reduced, parallel processing is an alternative approach; hence, to process such data efficiently, we need parallel processing. These days parallel computing is widely used in academic research to speed up computation. The major law invoked in parallelism is Amdahl's law, which concerns speed up: speed up is the ratio of the computation time using a single processor to the computation time using multiple processors. A multiprocessor system is also known as a shared memory system, as the processors share an address space; communication among the processors takes place via shared data variables and control variables. If there are n processors and each has an equal task to compute, the speed up would be linear; in practice, however, linear speed up is not obtained. The primary objective of parallelism is to increase speed up, and large scientific problems, such as the compression of a large image, can today be solved by parallel computing.

Parallel computing is now widely performed on graphics processors. In 1999 NVIDIA developed a many-core processor called the graphics processing unit, having more than 100 cores; initially it was intended for high-end graphics games. The CUDA (Compute Unified Device Architecture) programming environment now provides the means for developers to execute parallel programs on the device without the need to map computation to a graphics API. When programming with CUDA, programmers treat the GPU as a processing device capable of executing a large number of threads in parallel.

CUDA issues and manages computations on the GPU as a data-parallel computing device, and with programmable GPUs we can program it to perform complex computations. The GPU executes instructions in SIMT (Single Instruction Multiple Thread) fashion. The GPU consists of an array of SM multiprocessors, each of which can support up to 1024 concurrent threads. A single SM contains 8 scalar SP processors, each with 1024 32-bit registers, for a total of 64 KB of on-chip memory with very low access latency and high bandwidth.

4.2 Classification of Parallel Systems
A parallel system can be classified into three types:
• Multiprocessor system: multiple processors have direct access to shared memory, which forms a common address space. A multiprocessor system is also known as UMA (Uniform Memory Access): the access time for any memory location is the same [3].
• Multicomputer parallel system: a parallel system in which the multiple processors do not share a common address space and do not have direct access to shared memory. A multicomputer system is also known as a distributed system or NUMA (Non-Uniform Memory Access): the access time to shared memory varies for different processors.
• Array processors: these computers are very tightly coupled and have a common system clock; they may not share memory, and communicate via message passing.

Figure 4.1: Single Instruction Single Data (a control unit, CU, issues an instruction stream, IS, to a processing element, PE, which operates on a data stream, DS)

Flynn identified four modes of processing, based on whether the processors execute the same or different instruction streams on the same or different data at the same time. These classifications are as follows:

• Single Instruction Single Data (SISD): a single memory and CPU are connected by the system bus. The processor executes one instruction at a time, performed on a single data stream. As shown in Figure 4.1, a CU (control unit) receives an instruction and sends it to the PE (processing element) via the system bus, and the PE performs the operation on the basis of the instruction received.

• Single Instruction Multiple Data (SIMD): this mode corresponds to processing by multiple homogeneous processors which execute in lockstep on different data items. Applications that involve operations on large arrays and matrices, such as scientific applications, can best exploit systems that provide the SIMD mode of operation because the data sets can be partitioned easily. This architecture is used when there is a large amount of data to be processed, since only one instruction operates on multiple data elements, as shown in Figure 4.2. This property makes the architecture suitable for very large and complex computations; due to its flexibility, the fractal image encoding method proposed in chapter 5 utilizes the SIMD architecture.

Figure 4.2: Single Instruction Multiple Data (one CU issues a single instruction stream, IS, to several PEs, each operating on its own data stream, DS 1 and DS 2)

• Multiple Instruction Single Data (MISD): this mode corresponds to the execution of different operations in parallel on the same data. Multiple instructions are provided to the PEs to perform operations on a single data stream, as shown in Figure 4.3. This architecture has not been practically implemented.

Figure 4.3: Multiple Instruction Single Data (two CUs issue separate instruction streams, IS1 and IS2, to PEs operating on the same data stream, DS)

• Multiple Instruction Multiple Data (MIMD): in this mode, the various processors execute different code on different data. This is the mode of operation in distributed systems, as well as in the vast majority of parallel systems.

Figure 4.4: Multiple Instruction Multiple Data (each CU issues its own instruction stream, IS1 or IS2, to a PE operating on its own data stream, DS1 or DS2)

There is no common clock among the system processors. MIMD is the most widely used architecture in grid and cloud computing; as shown in Figure 4.4, all the PEs are isolated and perform independent operations. Some of the papers reviewed in chapter 2 detail this architecture.

Problems Associated with Parallel Processing
1. Parallel machines cannot be reconfigured when the problem grows in size.
2. They cannot roll back from failure, which makes them difficult to use for real-time computation: if the computer fails, the user has to wait until the machine restarts.

4.3 Distributed Computing
A multicore processor system may take a long time to solve a large problem, or may be unable to solve it at all because of the problem's memory requirements. This bottleneck can be overcome by individual computers that cooperate to solve a problem that cannot be solved by individual processors; solving such problems with many computing nodes is known as distributed computing [14].

Figure 4.5: Computing nodes' connection through an interconnection network (start up, communication and consolidation costs arise as the CNs exchange tasks and results)

Distributed computing is also described as NUMA (Non-Uniform Memory Access) in the sense that the nodes do not share an address space; communication among the multiple computing nodes takes place via a message passing mechanism. These nodes are connected through an interconnection network, which could be a LAN (Local Area Network) or a wireless network. Parallel computers have the limitation that they cannot be scaled to match the problem size: with a single multicore processor we cannot solve problems that are too big. To overcome this limitation, researchers developed the concept of distributed computing, in which, according to the problem size, multiple computers are connected via an interconnection network.

This kind of architecture needs only commodity computers, ranging from lower speed to higher speed machines, and any number of computers can be added to or removed from the interconnection network; we can thus say that a distributed computing architecture can be reconfigured. The advantage of distributed computing is that it enhances computing power using existing resources, which makes it a low cost, high performance computing environment for academics.

4.3.1 Characteristics of Distributed Computing
• Resource sharing: resources such as peripherals, complete information sets in databases and special libraries cannot be fully replicated at all sites, because this is often neither practical nor cost-effective; nor can they be placed at a single site, because access to that site might prove to be a bottleneck. Therefore, such resources are typically distributed across the system.
• Enhanced reliability: a distributed system has the inherent potential to provide increased reliability because of the possibility of replicating resources and executions. A distributed system provides fault tolerance, which makes it reliable in case of node failure.
• Scalability: adding or removing one or more nodes in the distributed architecture, connected for example via a local area network, does not impose a bottleneck.

4.4 Challenges in Distributed Computing
If the processing time needs to be reduced, distributed processing is an alternative approach; the primary objective of distributed computing is to increase speed up. Large scientific problems, such as the processing of a large image, can be solved by distributed computing. To process an image it has to be partitioned into subtasks in such a way that they can run on different CNs independently; the task must be divided so that the subtasks are not interdependent [45]. As shown in Figure 4.6, when three computing nodes try to solve a task whose parts are dependent, the result of one CN feeds the next: the second and third CNs have to wait until CN1 completes its execution. In a distributed computing environment it is therefore necessary that the operations be separable and executable by multiple CNs without interdependence. Each CN is connected via the interconnection network; this network may fail or go down, or processes may fail to communicate with other nodes. These are the problems that must be conquered when designing any distributed computing architecture.

Figure 4.6: Serial Execution (Image → CN1 → Result 1 → CN2 → Result 2 → CN3 → Result 3)

To process a task on a distributed computing architecture, it is important to partition the task using a suitable partitioning technique. The major factors that diminish distributed computing speed up are: (i) start up cost, (ii) consolidation cost, (iii) communication cost, (iv) load imbalance, (v) synchronization, (vi) fault tolerance, and (vii) the application programming interface.
• Start up cost: initializing a number of CNs requires start up cost. If there are many CNs this can affect the overall speed up; if the task is small and the start up cost is high, it will dominate the overall computing process. This step cannot be parallelized.
• Consolidation cost: once all CNs have completed their operations, the intermediate results produced must be collected and combined. This procedure is sequential and performed by only one CN, so parallelism cannot be applied and speed up is affected.
• Communication cost: the cost of sending and receiving tasks among multiple CNs; if this cost is high it leads to lower speed up. With many processes executing in a synchronized environment, a process wanting to communicate may be forced to wait for other processes to become ready, and this waiting time affects the whole computation.
• Load imbalance: if task partitioning is uneven among the processors, the load or task distribution across the CNs becomes uneven, and the speed up may be reduced.
• Synchronization: mechanisms for synchronization or coordination between the processes are essential. In distributed computing we use a message passing mechanism for communication among the computing nodes; hence the task distribution and consolidation must be synchronized.

• Fault tolerance: when processing has been started by the various computing nodes and a node fails during processing, the task of that node is not executed, and the consolidated result is incomplete. Hence we must have a mechanism through which another node can take over the task of the failed node and complete it after finishing its own.
• Application Programming Interface (API): an API makes a distributed system easier for users, and it is important to develop one according to the architectural requirements. There are many libraries available for distributed computation; a widely used one is Open MPI, which uses a message passing mechanism for communication among multiple CNs.

4.5 Performance Measures of Parallel and Distributed Computing
The performance of any system can be measured against certain criteria, which are the result of rigorous research in the particular field. Parallel and distributed computing likewise have criteria on the basis of which we can measure their performance; these performance measures are important, as they play a vital role in evaluating computation.

4.5.1 Speed up, Efficiency and Scalability
In order to demonstrate the effectiveness of parallel processing, several concepts have been defined. The criterion most often used for parallel processing performance evaluation is speed up, defined by comparing the time needed to solve the problem on a single processor with the time needed on N processors [3]. This can be written as:

Speed up = (computation time on a uniprocessor) / (computation time on a multiprocessor)

$$S(n) = \frac{T(1)}{T(n)} \tag{4.1}$$

Figure 4.7: Speed up curve (speed up versus number of processors, showing linear and sub-linear speed up)

Linear speed up refers to performance improvement in proportion to the number of processors. As shown in Figure 4.7, the first (dotted) curve is linear speed up; the second curve grows linearly only up to some threshold point, beyond which increasing the number of processors gains no further performance. The efficiency, or parallel efficiency, is the ratio of speed up to the number of processors; it is clear from equation 4.2 that the efficiency decreases when the number of processors is increased faster than the speed up:

$$E(n) = \frac{S(n)}{n} \tag{4.2}$$
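For instance, with purely illustrative timings (not measurements from this work): if $T(1) = 100$ s and $T(4) = 30$ s, then

$$S(4) = \frac{T(1)}{T(4)} = \frac{100}{30} \approx 3.33, \qquad E(4) = \frac{S(4)}{4} \approx 0.83,$$

i.e. four processors run about 3.33 times faster than one, but each processor is only about 83% utilized.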

Scale up is the measure of the ability to handle a larger task by increasing the degree of parallelism. Figure 4.8 shows linear scale up, which refers to the ability to maintain the same level of performance when workload and resources are added proportionally. When the scale up equals 1 it is called linear scale up; when it falls below 1 it is called sub-linear scale up.

Figure 4.8: Scale up curve (scale up versus workload, showing linear and sub-linear scale up)

4.5.2 Amdahl's Law
Amdahl proposed a law stating that, for a fixed-size problem, the incremental speed up generally reduces as more processors are added, because the achievable parallel speed up is restricted by the sequential part of the program. This implies a limit on parallel performance: when the number of processors reaches some threshold, adding more processors no longer generates further performance improvement and can even result in performance degradation, due to the decrease in the time saved by further division of the task and an increase in the overhead of interprocess communication. Amdahl presented the analysis of parallel computation shown in equation 4.3 [3].

$$S(n) = \frac{1}{s + \dfrac{p}{n}} \tag{4.3}$$

where $p$ is the fraction of the program that is parallelizable, $s$ is the serial portion ($s = 1 - p$) and $n$ is the number of processors. The law states that the speed up cannot exceed $\frac{1}{s}$ however large we make $n$. In Figure 4.9 the parallel part is shown as the unshaded area and 1, 2, 4 and 8 are the numbers of processors; the shaded area is the sequential part, which cannot be parallelized.

Figure 4.9: Fixed Problem Speed up (the shaded sequential portion stays constant while the parallel portion is divided among 1, 2, 4 and 8 processors)
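Equation 4.3 is easy to evaluate directly; the small Java helper below (illustrative, not from the thesis) shows how quickly the bound $1/s$ is approached as processors are added:

```java
// Amdahl's law (Eq. 4.3): speed up for serial fraction s on n processors.
static double amdahl(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}
// For s = 0.05: amdahl(0.05, 2) ~ 1.90, amdahl(0.05, 16) ~ 9.14,
// amdahl(0.05, 256) ~ 18.62 -- never exceeding the limit 1/s = 20.
```

This is why the thesis selects a threshold number of computing nodes for the experiments: past that threshold, the serial fraction and overheads dominate.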

Gustafson found a problem in Amdahl's law and stated that speed up should be measured by scaling the problem to the number of processors rather than fixing the problem size: the amount of work that can be done in parallel varies linearly with the number of processors, while the amount of serial work (mostly vector startup, program loading, serial bottlenecks and I/O) does not grow with problem size.

4.5.3 Parallel Execution
In general, parallel execution is performed using threads. Most parallel languages today use the philosophy of master and slave threads: the thread that starts the execution is known as the master thread, and the master can create multiple threads as required by the program. Once all slaves have performed their execution, they terminate and send their results back to the master thread. To minimize communication and other overheads, threads are created only once [32]. After start-up, the master thread starts the execution of the program while the slave threads wait idly. When the master thread encounters a parallel loop, the different iterations of the loop are distributed among the slave and master threads, which take up the execution of the loop. When each thread finishes the execution of its chunk, it synchronizes with the remaining threads.

Figure 4.10: Master and Slave Threads

Figure 4.11: Parallel Execution (a task is divided into subsets, each converted into a subtask; subtasks 1-4 are dispatched to CN1-CN4, which return sub results 1-4)

As shown in Figure 4.11, the complete task is subdivided into subsets, and each subset is converted into subtasks 1, 2, 3 and 4. Each subtask is sent to a CN; since we consider the subtasks not to be interdependent, each CN performs its operation individually and produces a sub result, and the sub results are consolidated later to complete the whole operation.
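A minimal Java sketch of this master-slave pattern follows (the subtask body and node count are placeholders, not taken from the thesis): the master submits independent subtasks to a pool of slave threads and then consolidates the sub results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Master thread splits a task into independent subtasks; slave threads
// execute them in parallel; the master consolidates the sub results
// (the pattern of Figure 4.11).
public class MasterSlave {
    public static void main(String[] args) throws Exception {
        int slaves = 4;
        ExecutorService pool = Executors.newFixedThreadPool(slaves);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 1; i <= slaves; i++) {
            final int subset = i;
            results.add(pool.submit(() -> "sub result " + subset)); // slave work
        }
        StringBuilder consolidated = new StringBuilder();
        for (Future<String> f : results)       // master waits for each slave
            consolidated.append(f.get()).append('\n');
        pool.shutdown();
        System.out.print(consolidated);
    }
}
```

The Future.get calls make the consolidation step explicitly sequential, which is exactly the consolidation cost identified in Section 4.4.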

4.6 Parallel Algorithm Models
An algorithm model is a representation of a parallel algorithm obtained by choosing a strategy for partitioning the data and processing it, and by applying an appropriate technique to reduce interactions. Several models are available:

4.6.1 Data Parallel Model
In this model, the tasks are statically or semi-statically mapped onto processes, and each task performs similar operations on different data items. This form of parallelism, the result of identical operations being applied concurrently to different data items, is called data parallelism. The work may be performed in phases, and the data operated upon in different phases may differ. Typically, data-parallel computation phases are interspersed with interactions to synchronize the tasks or to fetch fresh data to the tasks. Since all tasks perform similar computations, the decomposition of the problem into tasks is usually based on data partitioning, because a uniform partitioning of the data followed by a static mapping is sufficient to guarantee load balance. This model can be implemented using a shared address space or message passing.

4.6.2 The Task Graph Model
The computations in any parallel algorithm can be viewed as a task graph. The task-dependency graph may be either trivial, as in the case of matrix multiplication, or nontrivial. In certain parallel algorithms, the task-dependency graph is explicitly used in mapping. In the task graph model, the interrelationships among the tasks are exploited to promote locality or to reduce interaction costs. This model is typically used to solve problems in which the amount of data associated with the tasks is large relative to the amount of computation associated with them. Usually, tasks are mapped statically to help optimize the cost of data movement among tasks. Sometimes a decentralized dynamic mapping may be used, but even then, the mapping uses the knowledge of the task-dependency graph structure and the interaction pattern of tasks to minimize interaction overhead. Work is more easily shared in paradigms with a globally addressable space, but mechanisms are also available to share work in disjoint address spaces.

4.6.3 The Work Pool Model
The work pool model is a dynamic mapping of tasks onto processes for load balancing, in which any task may be executed by any process (a code sketch is given after Section 4.6.5 below). The mapping may be centralized or decentralized. The tasks may be stored in a list or in a priority queue; they may be statically available at the beginning, or may be generated dynamically, i.e. the processes may generate work and add it to the global (possibly distributed) work pool.

4.6.4 The Master-Slave Model
In the master-slave (or manager-worker) model, one or more master processes generate work and allocate it to worker processes. The tasks may be allocated a priori if the manager can estimate the size of the tasks, or if a random mapping can do an adequate job of load balancing. Otherwise, workers are assigned smaller pieces of work at different times; the latter scheme is preferred when it is time consuming for the master to generate work, so that it is undesirable to make all workers wait until the master has generated all the work pieces. In some instances, work may need to be done in phases, and the work in each phase must finish before the work in the next phase can be generated; in this case, the manager may cause all workers to synchronize after each phase. Normally, there is no desired premapping of work to processes, and any worker can perform any task assigned to it. The master-worker model can be generalized to the hierarchical or multilevel manager-worker model, in which the top-level manager feeds large chunks of tasks to second-level managers, who further subdivide the tasks among their own workers and may perform part of the work themselves.

4.6.5 The Pipeline or Producer-Consumer Model
In the pipeline model, a stream of data is passed through a succession of processes, each of which performs some work on it. This simultaneous execution of different programs on a data stream is called stream parallelism. With the exception of the process initiating the pipeline, the arrival of new data triggers the execution of a new task by a process in the pipeline. The processes could form such pipelines in the shape of linear or multidimensional arrays, trees, or general graphs with or without cycles. A pipeline is a chain of producers and consumers: each process in the pipeline can be viewed as a consumer of the sequence of data items produced by the process preceding it, and as a producer of data for the process following it in the pipeline.
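The sketch referenced in Section 4.6.3 follows: a minimal Java work pool in which idle workers pull the next task from a shared queue, so faster workers automatically take more tasks. The task payload and worker count are placeholders, not from the thesis:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Work pool model (Section 4.6.3): a shared queue of tasks from which
// any idle worker takes the next task, giving dynamic load balancing.
public class WorkPool {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
        for (int t = 0; t < 100; t++) tasks.add(t);   // fill the pool
        Runnable worker = () -> {
            Integer task;
            while ((task = tasks.poll()) != null) {   // pull until the pool is empty
                // ... process the task (e.g., encode one range block) ...
            }
        };
        Thread[] workers = new Thread[4];
        for (int i = 0; i < 4; i++) (workers[i] = new Thread(worker)).start();
        for (Thread w : workers) w.join();            // wait for all workers
    }
}
```

Because assignment happens at run time, a worker stuck on an expensive task simply claims fewer tasks overall, which is the load balancing property the static data-parallel mapping lacks.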

4.6.6 Hybrid Model
A hybrid model may be composed either of multiple models applied hierarchically, or of multiple models applied sequentially to different phases of a parallel algorithm. In other cases, the major portion of the computation may be described by a dependency graph, but each node of the graph may represent a supertask comprising multiple subtasks that may be suitable for data-parallel or pipelined parallelism.

4.7 Summary
This chapter summarized parallel and distributed systems, describing their basics and classification, the differences and relationship between parallel and distributed systems, and the challenges in designing a distributed architecture.

CHAPTER 5 METHODOLOGY
This chapter details the distributed fractal image encoding architecture and the parallel algorithm. Two methods for enhancing speed up have been developed. Initially, research was done on distributed computers with varying domain and range block sizes; in the next stage, a load balancing algorithm using the SIMD (Single Instruction Multiple Data) approach on a distributed architecture was developed using fixed-size domain-range blocks.

5.1 Introduction
All images have a certain geometry through which we can compress them; fractal image compression, which evolved from fractal geometry, utilizes this geometry, and the objective is to compress such geometry using parallel computing. Fractal image encoding requires tremendous computing time, and to overcome this, parallel and distributed architectures are proposed. For parallel computing based architectures, as the number of range blocks increases, the computational complexity also increases, and it cannot be reduced simply by adding ever more processors, which is practically infeasible. To overcome this problem, distributed computing architectures have been proposed; the advantage of such architectures is that they can be reconfigured as the problem size grows, and in our case the number of range-to-domain comparisons grows enormously. Fisher [2] discussed that assigning parts of an image to multiple processing elements (PEs) can reduce encoding time. To parallelize fractal compression it is necessary to identify the sequential and parallel parts; feeding the parallel part to a parallel machine may result in compression with less encoding time. Parallel fractal image encoding has certain problems, hence the need to develop a distributed fractal image encoding architecture. Initially, we experimented with variable-size domain blocks against fixed-size range blocks; in the second stage we worked with fixed-size range blocks and their corresponding domain blocks. Variable domain and range blocks were preferred initially to determine what standard size of range and domain block should be chosen to obtain higher speed up. In the second stage of the experiment a load balancing strategy was also adopted.

5.2 Distributed Fractal Image Encoding Architecture
In the initial experiment a distributed parallel method is proposed to reduce computational time by partitioning the input image and distributing it among different computing nodes; each computing node performs encoding and block matching individually, which results in a significant reduction in the processing time needed to generate fractal codes. The research objective is to discover the threshold number of computing nodes up to which we can perform the operation without reducing the speed up. The research proposes a parallel algorithm and a distributed parallel architecture applied to enhance the speed up. In the second

stage of the experiment we built on the initial research and modified it with a new SIMD-based methodology. In the initial stage of the experiment we did not use any load balancing algorithm; the difficulties found there led us to introduce a load balancing strategy for the new distributed computing architecture.

Figure 5.1: Distributed Processing (the original image is partitioned and the partitions are distributed among CN1-CN4)

Fractal image compression reduces affine redundancy with the use of suitable affine transformations, requiring fewer bits to encode the same image; however, the encoding process requires enormous computational processing to generate the required fractal codes. The aim of this experiment is to find the minimum number of CNs for which we can efficiently apply the parallel algorithm and load balancing strategy, and to find the domain block size for which we can obtain a good quality image in less time. The experiment is performed on Linux machines using the Java programming environment.

5.2.1 Distributed Parallel Method for Efficient Fractal Image Encoding (DPM)
A distributed computing architecture is proposed to encode the image using the fractal image encoding method. The number of range-domain comparisons to be solved is huge and takes enormous computation time. In this architecture we make use of old computing resources, and it can be configured with any type of computing machine: we wanted to solve the problem with the existing computing resources already available to us. The encoding problem is solved by partitioning the comparison work and distributing it to multiple CNs, as sketched below.
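As a purely illustrative Java sketch (not the thesis's actual partitioning code), a static round-robin assignment of range-block indices to computing nodes could look as follows; the load balancing strategy of the second experiment would replace exactly this kind of static assignment:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical round-robin assignment of range blocks to computing nodes:
// node k receives ranges k, k+n, k+2n, ..., so each CN carries out roughly
// the same number of range-domain comparisons (no dynamic load balancing).
static List<List<Integer>> partitionRanges(int totalRanges, int nodes) {
    List<List<Integer>> assignment = new ArrayList<>();
    for (int k = 0; k < nodes; k++) assignment.add(new ArrayList<>());
    for (int r = 0; r < totalRanges; r++) assignment.get(r % nodes).add(r);
    return assignment;
}
```

Each CN then runs the range-domain search of chapter 3 on its own sublist and returns its fractal codes to the master for consolidation into the code book.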

Figure 5.2: Distributed Parallel Architecture (the original image's range blocks 1..n are sent as tasks to the computing slaves, each running the distributed process DP; the resulting fractal codes are consolidated into a code book and written to a file)

For an image of size N x N, if there are n tasks available after partitioning the original image, then solving them with $\alpha_n$ processors would take $\alpha_n \cdot t_n$ time, where $t_n$ is the time to perform the operations; if there are $O_i$ operations to be executed in parallel, then the computation time with $\alpha_n$ processors is $O_i/\alpha_n$. A distributed computing architecture is proposed to reduce the encoding time. For an image $I$ of size N x N, we must find a partition scheme which uniformly subdivides the subsets of $I$, i.e. range $R_I = r$ ($r = N/4$) and domain block $D_I = R_I/f$, where $f$ is a domain offset