
Deduplication and Compression Techniques in Cloud Design

Amrita Upadhyay, Pratibha R Balihalli, Shashibhushan Ivaturi and Shrisha Rao
International Institute of Information Technology Bangalore
Electronics City, Bangalore 560100
Email: {amrita.upadhyay, pratibha.br, shashibhushan.ivaturisasi}@iiitb.org and [email protected]

Abstract—Our approach to deduplication and compression in cloud computing aims at reducing storage space and bandwidth usage during file transfers. The design depends on multiple metadata structures for deduplication. Only one copy of duplicate files is retained while the others are deleted; the existence of duplicate files is determined from the metadata. Files are clustered into bins depending on their size, and are then segmented, deduplicated, compressed and stored. Binning restricts the number of segments and their sizes so that they are optimal for each file size. When the user requests a file, the compressed segments of the file are sent over the network along with the file-to-segment mapping. These are then uncompressed and combined to recreate the complete file, minimizing bandwidth requirements.

Index Terms—cloud, deduplication, compression, segmentation, binning, Eucalyptus, algorithm, hash value, SHA-1, network, bandwidth.

I. INTRODUCTION

Cloud technology has emerged as a very important aspect of many business fields: computing resources are used and accessed through networks. Storage in a public cloud, in terms of space and time, is vital for backup and recovery, email systems, etc. This paper focuses on the infrastructure services dealing with storage and network usage. Deduplication and compression [1], which are described in detail later, are among the important data-optimization services that the cloud offers.

We propose a new architecture for deduplication on the cloud using functionalities such as segmentation, compression and binning. This paper addresses local block-level deduplication, which is verified using the Eucalyptus environment [2]; global deduplication across various users can have security issues [3]. We have designed multiple metadata structures which enable faster lookups and enhance the user experience. A cloud based on the proposed architecture has better storage efficiency and lower bandwidth consumption, benefiting both cloud service providers and users.

Considerable amounts of space can be saved in the cloud by means of deduplication and compression [4]. Segmentation reduces a file to smaller chunks which are easier to transfer over the Internet, and sending only the unique compressed segments minimizes bandwidth consumption significantly. This results in reduced cost [5] and increased storage efficiency, while also improving the user experience.


Our experiments indicate that in the case of duplicate files, almost 80% of redundant data is removed. The transfer is not only less costly but also faster at the user end. The architecture and implementation of the system are explained in detail in Section II.

In this paper, deduplication is done in two situations: for existing files and for incoming (new) files. Files are initially segregated into different bins depending on their size. When segmentation is done with a constant segment size for all files irrespective of their size, it leads either to a large number of segments or to very few segments. Binning helps decide the size of each segment based on the size of the parent file. This decreases the time to process and save the segments, as well as the response time over the network. Based on our test results, a reduction of 47.5% in processing time is seen due to this process.

For a user, bandwidth is a limiting factor when accessing a public storage cloud over the Internet. With the advent of mobile devices that connect to the cloud, the end user uses cloud services that may require frequent uploads and downloads. A user is also charged by the cloud service provider based on the storage space and/or the bandwidth consumed. This paper provides a solution that aims at minimizing both, without loss of data. When a user requests a file, it is reconstructed and delivered; the file at the user end is no different from the one uploaded, making the whole process transparent to the user. Experiments show that bandwidth utilization, as measured using iperf, is reduced by 31%.

In industry, source block-level deduplication is implemented by NetApp Data ONTAP [6] and EMC Atmos [7]. Data ONTAP reduces storage space by grouping unstructured data, which in turn helps reduce duplication; this results in savings of up to 55% in storage space for most data sets and up to 95% for complete backups. EMC Atmos offers a multi-petabyte solution for information storage and distribution. For cloud storage, Atmos offers scalability, automated data placement and information services all over the globe.

He et al. [1] describe various deduplication techniques. The basic principle is to maintain only one copy of the duplicate data, with all duplicate blocks pointing to this copy. This can be achieved at the file, block or byte level. Li et al. [8] discuss various cloud storage techniques. In data deduplication technology, they suggest retaining only the unique instances of the data, reducing data storage volumes.


A semantic-aware multi-tiered source de-duplication framework is proposed by Tan et al. [9]; this provides a tradeoff between deduplication overhead and efficiency. Local block-level deduplication combined with global file-level deduplication can shorten the time needed to send datasets to a backup destination. They use file semantics to identify files that require zero deduplication overhead. Zhu et al. [10] discuss a method to solve the problem of disk bottlenecks during deduplication. Machines with insufficient RAM pose the challenge of maintaining large amounts of metadata about the stored blocks; hence, a major issue is to identify and eliminate duplicate data segments on such low-cost systems. Sarawagi et al. [11] used deduplication to remove duplicate citations on the well-known citation website CiteSeer. The deduplication method used is active learning, with which duplicate citations are detected.

Fig. 2. System Model: File Upload

II. SYSTEM ARCHITECTURE AND DEVELOPMENT

The system architecture for the proposed application is shown in Figure 1. The application implements local block-level deduplication. Information about the original files and the new segments is maintained in one of the metadata structures. Deduplicated segments and the metadata structure are sent over the network at a client's request, and these segments are combined on the client side to form the original file using the metadata structures. This saves storage space at the source and also bandwidth over the network.

Fig. 3. System Model: File Download

Fig. 1. System Architecture

System models for file upload and download are shown in Figures 2 and 3. In the first system model, Segmentation and Deduplication are the main modules; they provide functionalities such as mapping, binning, comparison and compression. In the second system model, File Retrieval is the main module. Its main functionality is mapping; other stand-alone functionalities, such as uncompressing and combining the segments, are also shown.

The aforementioned modules are described in the following subsections. The methods are briefly explained in Table I.

A. Segmentation

The deduplication process is run on existing files and also on new incoming files, as mentioned above. In the segmentation module, files are segregated into three bins depending on their initial sizes. The bin structure is as follows:

∙ Bin1 – contains files of size < 10 MB.
∙ Bin2 – contains files of size 10 MB to 1 GB.
∙ Bin3 – contains files of size > 1 GB.

Each file from each of the bins is then divided into segments of a fixed size that depends on the bin to which the file is allocated. This helps in optimizing the number of segments created and their sizes.
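As a rough illustration of the binning and fixed-per-bin segment sizes just described, the following Python sketch assigns a file to a bin by its size and splits it into fixed-size segments. The concrete per-bin segment sizes, the helper names choose_bin and divide_file, and the underscore used in the segment names are assumptions made for illustration; the paper does not specify them.

import os

# Hypothetical per-bin segment sizes; the paper fixes one segment size per bin
# but does not state the exact values, so these are placeholders.
SEGMENT_SIZE = {
    "bin1": 1 * 1024 * 1024,    # files < 10 MB
    "bin2": 16 * 1024 * 1024,   # files 10 MB to 1 GB
    "bin3": 128 * 1024 * 1024,  # files > 1 GB
}

def choose_bin(file_size):
    """Assign a file to one of the three bins based on its size."""
    if file_size < 10 * 1024 * 1024:
        return "bin1"
    if file_size <= 1024 * 1024 * 1024:
        return "bin2"
    return "bin3"

def divide_file(path):
    """Split a file into fixed-size segments; the size depends on its bin.
    Returns the bin name and the list of segment file names created."""
    bin_name = choose_bin(os.path.getsize(path))
    seg_size = SEGMENT_SIZE[bin_name]
    segments = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(seg_size)
            if not chunk:
                break
            seg_name = f"{os.path.basename(path)}_{index}"  # <file name><unique identifier>
            with open(seg_name, "wb") as seg:
                seg.write(chunk)
            segments.append(seg_name)
            index += 1
    return bin_name, segments

With the placeholder sizes above, a 2 GB file would land in bin3 and yield a small number of 128 MB segments, whereas a 5 MB file in bin1 would be split into 1 MB segments.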

TABLE I
GLOSSARY OF TERMS

Term                  Description
segmentFile()         Bins and segments the file
divideFile()          Divides the file into segments depending on the bin
fileToSegMapping()    Creates a mapping of the filename and its corresponding segment numbers
deduplicate()         Finds unique segments and updates the metadata files
hashFingerPrints()    Lists all the existing fingerprints by reading from the fingerprints file of the particular bin
save()                Saves unique segments after comparison with existing hash codes
segToHashMapping()    Updates the file metadata
updateConfigFile()    Updates the bin information with the count of segments added to it
requestForFile()      Requests the file, passing the filename as a parameter
lookupMetaData()      Collects all the segments based on the file-to-segment metadata structure
receive()             Client receives all the segments from the cloud
reconstructFile()     Decompresses the segments and combines them to create the file

There are three metadata structures, maintained by the storage controller:
1) Bin configuration file.
2) File-to-segment mapping.
3) Fingerprint mapping.
The bin configuration file defines the bin limits and the corresponding segment sizes. The structure of the configuration file is shown in Table II.

TABLE II
STRUCTURE OF CONFIGURATION FILE
::
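To make the three metadata structures more concrete, the sketch below shows one possible in-memory representation in Python. All field names, the example entries (report.pdf and the sample fingerprint), and the numeric limits are illustrative assumptions; the paper states only what each structure records, not its layout.

# Hypothetical in-memory forms of the three metadata structures.

# 1) Bin configuration: bin limits, segment size, and a running count of
#    segments stored under each bin.
bin_config = {
    "bin1": {"max_size": 10 * 1024**2, "segment_size": 1 * 1024**2, "segment_count": 0},
    "bin2": {"max_size": 1024**3, "segment_size": 16 * 1024**2, "segment_count": 0},
    "bin3": {"max_size": None, "segment_size": 128 * 1024**2, "segment_count": 0},
}

# 2) File-to-segment mapping: which segments (in order) make up each file.
file_to_segments = {
    "report.pdf": ["report.pdf_0", "report.pdf_1", "report.pdf_2"],
}

# 3) Fingerprint mapping: SHA-1 hash of each stored segment, used to detect
#    duplicate segments of newly uploaded files.
segment_to_fingerprint = {
    "report.pdf_0": "356a192b7913b04c54574d18c28d46e6395428ab",
}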

The segmentation of a file is described in Algorithm 1. The inputs to this function are the file(s) specified by the user, and the output is an array of segments for each file.

Algorithm 1: File Segmentation
Data: file, segments[ ]
Result: segments[ ]
 1 begin SegmentFile
 2   /* Every file is put into one of the 3 bins depending on its size */
 3   /* Once the segmentation is completed, file-to-segment mapping is done */
 4   if (sizeof(file) < 10 MB) then
 5     bin1 ← file
 6     /* divide file into segments */
 7     segments[ ] ← divideFile()
 8     fileToSegMapping(file, segments)
 9   end
10   if (sizeof(file) ≥ 10 MB and sizeof(file) ≤ 1 GB) then
11     bin2 ← file
12     /* divide file into segments */
13     segments[ ] ← divideFile()
14     fileToSegMapping(file, segments)
15   end
16   if (sizeof(file) > 1 GB) then
17     bin3 ← file
18     /* divide file into segments */
19     segments[ ] ← divideFile()
20     fileToSegMapping(file, segments)
21   end
22 end

In lines 4, 10 and 16, the file size is checked, and depending on its size the file is put into one of the three bins. Lines 7, 13 and 19 divide the file based on the segment size fixed for the particular bin. The convention for naming a file segment is ⟨file name⟩⟨unique identifier⟩, as used in the mappings created in lines 8, 14 and 20. If any segment matches an existing segment, it points to the already-saved segment number.

B. Deduplication and Compression

After segmentation, the hash value of each file segment is calculated using the SHA-1 algorithm. The first time deduplication is performed, hash values are calculated for every segment that is created, and these hash values are recorded in a metadata structure. For every newly uploaded file, hash values are calculated for its segments and compared with the list of existing hash values. If there is a match, the corresponding segment is a duplicate and is not saved. The file-to-segment metadata is updated with the bin information of the new file and the segment numbers created for it; for duplicate segments, the segment number remains the same. The configuration file of the bin structure is updated with the number of newly created unique segments; this file keeps a count of the number of segments under each bin.

In this module, gzip is used to compress and decompress files, which further saves storage space.

Algorithm 2: Deduplication and Compression
Data: segment[ ]
Result: fingerPrint
 1 begin Deduplicate
 2   fingerPrints[ ] ← hashFingerPrints()
 3   foreach (segment in segment[ ]) do
 4     fingerPrint ← calculateHash(segment)
 5     if (fingerPrint in fingerPrints[ ]) then
 6       /* do not save this segment */
 7     end
 8     else
 9       save(segment)
10     end
11     segToHashMapping(segment, fingerPrint)
12     updateConfigFile(fileSegmentCount, fileBin)
13   end
14 end

The procedure for calculating the hash values is described in Algorithm 2. The input is the array of file segments. For each segment, the hash value is calculated as shown in line 4. Lines 5–7 check whether the calculated hash value is already present, by a lookup in the list generated by hashFingerPrints() in line 2. If the hash value is present, that particular segment is not saved; otherwise, it is saved, as shown in lines 8–10. Lines 11 and 12 update the mappings with the new segment's information. Finally, the segment is compressed and saved.
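The following Python sketch mirrors Algorithm 2 under the assumption that segments are ordinary files and the fingerprint list is an in-memory dictionary; the function name deduplicate_and_compress, the .gz suffix and the store_dir parameter are illustrative, not taken from the paper.

import gzip
import hashlib

def deduplicate_and_compress(segments, fingerprints, store_dir="."):
    """For each segment file, compute its SHA-1 fingerprint and store a
    gzip-compressed copy only if the fingerprint has not been seen before.
    `fingerprints` maps fingerprint -> name of the stored segment."""
    mapping = {}  # segment name -> fingerprint (segToHashMapping analogue)
    for seg_name in segments:
        with open(seg_name, "rb") as f:
            data = f.read()
        fp = hashlib.sha1(data).hexdigest()
        if fp not in fingerprints:
            # Unique segment: compress it with gzip and record its fingerprint.
            with gzip.open(f"{store_dir}/{seg_name}.gz", "wb") as out:
                out.write(data)
            fingerprints[fp] = seg_name
        # Whether duplicate or not, the file's metadata points at the stored segment.
        mapping[seg_name] = fp
    return mapping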

C. File Retrieval

This module takes care of transmitting files over the network at a user's request. With the help of the file-to-segment mapping, the compressed segments of the requested file are gathered and sent over the network, along with the mapping details. Sending only the compressed segments reduces bandwidth consumption. On the user's side, these segments are put together to form the original file using the file-to-segment mapping.

The procedure for retrieving a file is described in Algorithm 3. The file name is given as input by the user. In line 2, the user requests the file(s) by providing the filename(s). A lookup is performed in line 3 to find the segment numbers of the requested files from the mapping. All segments are collected in line 4 and sent over the network in line 5. On the client's side, the segments are received as shown in line 7, and the file(s) are reconstructed from them as shown in line 8. This restores the file(s) that the user initially requested.

Algorithm 3: File Retrieval
Data: fileName
Result: segments[ ]
1 begin DownloadFile
2   requestForFile(fileName)
3   lookupMetaData(fileToSegMapping)
4   segments[ ] ← collectSegments()
5   send(segments, fileToSegMapping)
6   /* on the client's side */
7   receive(segments, fileToSegMapping)
8   file ← reconstructFile(segments)
9 end
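A minimal sketch of the client-side reconstruction step of Algorithm 3, assuming the compressed segments and the file-to-segment mapping have already been received; the .gz naming and the store_dir parameter follow the earlier sketches and are likewise assumptions.

import gzip

def reconstruct_file(file_name, file_to_segments, store_dir="."):
    """Client-side reconstruction: look up the ordered segment list for the
    requested file, decompress each received segment, and concatenate them."""
    with open(file_name, "wb") as out:
        for seg_name in file_to_segments[file_name]:
            with gzip.open(f"{store_dir}/{seg_name}.gz", "rb") as seg:
                out.write(seg.read())
    return file_name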

III. RESULTS

A. Storage

A set of sample files along with their copies is saved on the deduplicated filesystem. The set comprises different types of files, to ensure that deduplication works regardless of whether the files are binary or text. In this test case, the original storage space and the storage space after deduplication, segmentation and compression are calculated; the results are shown in Table III.

TABLE III
SPACE SAVED FOR DIFFERENT SIZES OF DATA

Test run   Number of files   File Size Before Deduplication   Space Used After Deduplication
1          13                261 MB                           201.3 MB
2          11                85.8 MB                          23 MB
3          8                 37.4 MB                          37.2 MB
4          16                74.8 MB                          37.2 MB

The results in Table III correspond to the graph in Figure 4. In the first reading, 13 files each of size 261 MB are considered; one file is the original and the rest are its copies. The storage space used is 201 MB, since just one file is saved and the other 12 are discarded. Similarly, in the second reading, 11 files each of size 85.8 MB are considered, which use only 23 MB after deduplication; again, there is one original and the other files are its copies. In the third reading, 8 files of size 37.4 MB are considered, all of which are unique, and the space used is 37.2 MB. In the last run, 8 originals and 8 duplicates of total size 74.8 MB are stored; after deduplication, the space used is 37.2 MB. In Figure 4, the test runs are plotted on the x-axis and the file size on the y-axis.

Fig. 4. Storage Space Saved

The results inferred from the test runs for storage space used are:
1) 51% of storage space is saved after the deduplication process for unique text files.
2) 10% of storage space is saved after the deduplication process for unique image files.
3) 12% of storage space is saved after the deduplication process for unique media files.
4) More than 80% of storage space is saved when duplicate binary files are present.

B. Bandwidth

When a node requests files from the cloud, bandwidth consumption is reduced by sending only unique segments; the metadata files sent along with the segments are small files of a few KB. The performance test involves requesting duplicate files in one run and only unique files in a second run, and the same process is repeated for different file types and sizes. The amount of data received on the node is recorded along with the total time taken. Response times for different sets of data are shown in Table IV.

TABLE IV
TIME TAKEN TO RETRIEVE THE FILE BY THE NODE

Type of the file                 Number of files        Size of each file   Time taken to download
text                             2                      72 KB               1.17 s
pdf                              2                      30 MB               274.053 s
mp3                              7                      33.9 MB             248.045 s
text, pdf and mp3 (cumulative)   4, 4, 3 respectively   230.5 MB            644.810 s

The time taken to download from the cloud increases proportionally with the file size. The bandwidth consumed with and without deduplication is shown in Table V. The same is plotted in Figure 5, where the x-axis is the file size (in MB) and the y-axis is the bandwidth consumption. The graph shows that the efficiency of bandwidth utilization is increased with the proposed approach.

TABLE V
BANDWIDTH CONSUMPTION

File Size (in MB)   Bandwidth used before deduplication (in MB)   Bandwidth used after deduplication (in MB)
2200                3.775                                         2.9
4500                3.925                                         2.55
9000                4.025                                         2.7
60000               2.2                                           1.475
125000              2.6                                           1.53

Fig. 5. Bandwidth Consumption

C. Binning

The time taken for fixed-size segmentation increases drastically with the uploaded file size. Binning makes the deduplication process scalable and efficient across different file sizes: the segment size of an uploaded file is proportional to the range of file sizes specified for its bin. Thus, the time taken to save files with binning is lower than without binning, as shown in Figure 6.

Fig. 6. Processing Time Reduction Using Binning

In Figure 6, the file size is plotted on the x-axis and the time taken to save the file on the y-axis. The upper curve shows that the time taken to save a file without binning increases rapidly with file size, while the lower curve indicates that the time taken with binning does not increase drastically as the file size grows. On the other hand, when a file is downloaded, recollecting the segments takes less time for a file with a smaller number of segments. Since the bin structure maintains a smaller number of segments for larger files, file reconstruction becomes faster.

IV. CONCLUSION

In this paper, a solution for deduplication on the cloud is proposed. This solution saves storage space and minimizes bandwidth requirements, which eases the maintenance of data on the cloud platform. The user can retrieve any file or data dynamically without any data loss. The binning and segmentation modules enhance the user experience: since bandwidth is a vital resource for the user, the time taken to interact with the cloud reduces considerably. Testing was carried out thoroughly, and the results suggest a considerable saving in storage space and bandwidth requirements.

Deduplication is implemented on the storage space of the cloud controller. In this solution, deduplication is possible at the level of a user's bucket; a future enhancement would be to achieve deduplication over the complete cloud storage.

ACKNOWLEDGEMENTS

We would like to thank Sarika Hablani and Chanchal Dhaker, who helped with the initial implementation of our work.

REFERENCES

[1] Q. He, Z. Li, and X. Zhang, "Data deduplication techniques," in International Conference on Future Information Technology and Management Engineering (FITME), Changzhou, China, October 2010, pp. 430–433.
[2] R. Mikkilineni and V. Sarathy, "Cloud computing and the lessons from the past," in 18th IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises, Los Altos, CA, 2009, pp. 57–62.
[3] D. Harnik, B. Pinkas, and A. Shulman-Peleg, "Side channels in cloud services: Deduplication in cloud storage," IEEE Security and Privacy, vol. 8, pp. 40–47, 2010.
[4] L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, and S. T. Klein, "The design of a similarity based deduplication system," in Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference. New York, NY, USA: ACM, 2009, pp. 6:1–6:14.
[5] L. DuBois, "Data deduplication for backup: Accelerating efficiency and driving down IT costs," White paper, EMC Corporation, May 2009.
[6] "NetApp Tech OnTap," Jun. 2009. [Online]. Available: http://www.netapp.com/us/communities/tech-ontap/tot-dedupe-unstructure-0409.html
[7] "Atmos: Multi-tenant, distributed cloud storage for unstructured content," 2009. [Online]. Available: http://www.emc.com/products/detail/software/atmos.htm
[8] Z. Li, X. Zhang, and Q. He, "Analysis of the key technology on cloud storage," in International Conference on Future Information Technology and Management Engineering (FITME), Changzhou, China, October 2010, pp. 426–429.

[9] Y. Tan, H. Jiang, D. Feng, L. Tian, Z. Yan, and G. Zhou, "SAM: A semantic-aware multi-tiered source de-duplication framework for cloud backup," in 39th IEEE International Conference on Parallel Processing, San Diego, CA, 2010, pp. 614–623.
[10] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the data domain deduplication file system," in 6th USENIX Conference on File and Storage Technologies (FAST '08), 2008, pp. 269–271.
[11] S. Sarawagi and A. Bhamidipaty, "Interactive deduplication using active learning," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '02. New York, NY, USA: ACM, 2002, pp. 269–278.