Revisiting Reliability Strategies for Peer-to-Peer Storage System

Bing Tang, Gilles Fedak
UMR CNRS - ENS Lyon - INRIA - UCB Lyon 5668, LIP Laboratory, ENS Lyon
46 allée d'Italie, 69364 Lyon Cedex 07, France

Abstract—This paper surveys previous Peer-to-Peer storage systems and the data redundancy and fault-tolerance schemes introduced to overcome the impact of host churn on data reliability. Furthermore, we propose a hybrid storage system model that integrates aggregated idle storage from volatile voluntary nodes with a stable and durable storage utility. To ensure high availability and durability for this hybrid system, we explore four reliability improvement strategies, namely the File Replica Strategy, File Encoding Strategy, Replica Repair Strategy, and Stable-Volatile Strategy, as well as combinations of these strategies. Extensive simulations based on real traces are performed, evaluating data availability, data durability, and the system utilization ratio. The simulation results show that, compared with previous Peer-to-Peer storage systems, the proposed hybrid storage system achieves higher availability and durability with lower storage consumption, thanks to the proposed strategies and a cost-effective usage policy for the stable storage utility. Our evaluation also demonstrates the prospects of a hybrid Peer-to-Peer storage system applied to cloud data storage.
Keywords-Peer-to-Peer Storage System; Data Redundancy; Data Availability; Data Durability; Cloud Storage.

I. INTRODUCTION
Peer-to-Peer storage systems have been used to realize traditional applications such as file sharing, file systems, Content Distribution Networks (CDN), backup, and archival storage. The storage capabilities of desktop PCs are often underused, with vast amounts of idle space. This creates an opportunity to "scavenge" storage, that is, to aggregate the storage space of these network-connected machines to build a low-cost data storage service. This research explores how to build a reliable storage system that offers an Amazon S3-like storage service at a much lower price by aggregating idle storage from a large number of volatile voluntary nodes. Our work is motivated by several efforts, such as OceanStore [1], PAST [2] or CFS [3], that use replication to build distributed wide-area storage systems in a peer-to-peer environment. Storage is provided by autonomous peer nodes (connected by a wide-area network such as the Internet) whose participation in the system is generally entirely voluntary. The lack of control over peer participation implies that the underlying storage substrate in such a system may be highly unreliable. Although peer-to-peer storage systems have attracted much attention over the past decade, they are still confronted with the difficulty of ensuring reliability.

Reliability has many aspects in this context; one well-studied metric is availability, defined as the fraction of time the system is able to provide access to the replicated entity. The entity is available when any one of its replicas is functioning; conversely, it is unavailable when all of its replicas are not. Another metric of reliability, which is the focus of this paper, is durability, defined as the duration of time the system is able to provide access to this entity. Thus, while availability deals with temporary inaccessibility when all replicas are non-operational, durability deals with more permanent loss, when the system no longer has even a single replica. Our work is also motivated by cloud computing and the Amazon Simple Storage Service (S3). Amazon S3's standard storage is designed to provide 99.999999999% durability and 99.99% availability of objects. It also provides an alternative, Reduced Redundancy Storage (RRS), which enables customers to reduce their costs at a relaxed 99.99% durability, making it even more cost effective. Similar to Amazon S3, providing such a system on Desktop Grids would give users the alternative of storing their data on existing Desktop Grid infrastructures to decrease the cost of on-line data storage [4]. The largest risk to durability in volunteer computing systems is the limited lifetime of volunteers, who stop participating after a few months, days, or hours, while the greatest risk to availability is the short-term churn and unavailability of hosts. Hence, techniques such as replication and erasure codes are often applied to ensure availability at any time. Replication can be used by the system to provide high availability in a statistical sense. To significantly extend the durability of an object for periods exceeding individual node lifetimes, the system must also implement a repair (reactive re-creation) mechanism that compensates for lost replicas by creating new ones. In this paper, we propose a hybrid storage architecture. To obtain strong durability while keeping costs under control, we propose an architecture that integrates two types of lower-cost storage components. First, a stable node: a node with high availability and durability; for example, this component can be a stable workstation or a storage utility such as Amazon S3. Second, volatile nodes: a large number of volatile nodes that, in aggregate, provide low-cost storage space; these nodes can, for example, be a subset of the desktops available in a company or a research institution.

This hybrid storage system brings two challenges. The first challenge is how to design efficient data redundancy strategies under host churn. The second challenge is how to ensure user-defined reliability guarantees. To tackle these challenges, this paper addresses the following questions. The factors that may affect availability and durability include the number of objects maintained by the system, the redundancy scheme, and the characteristics of the volatile nodes. The goal of this paper is to understand the relationship between these factors and availability in order to offer strategies that provide user-defined availability guarantees. We present a quantitative study of data survival in peer-to-peer storage systems. We first recall the two main redundancy mechanisms, replication and erasure codes, which are used by most previous peer-to-peer storage systems to guarantee data durability. To answer these questions, and to study the impact of these factors, we develop a low-level simulator for the proposed architecture. The simulator uses host availability traces from a real peer-to-peer system (SETI@home) and implements the details of the data redundancy schemes. Differently from previous work, the two metrics, availability and durability, are decoupled and studied individually, and we give a measurement method for each. A further novelty of this paper is that we evaluate the impact of the Synchronization Interval Time and the Failure Timeout Period, a factor that other papers have not considered. Furthermore, we consider limited storage space per host and integrate one stable node into the storage system. The rest of the paper is organized as follows. Section 2 surveys previous peer-to-peer storage systems and related data redundancy schemes. Section 3 proposes a new hybrid storage system composed of aggregated idle storage from volatile voluntary nodes and a stable storage utility. Section 4 describes a trace-driven simulation model used to study data availability and durability under different reliability improvement strategies. Data availability and durability simulation results and analysis are presented in Section 5, and the final section offers concluding remarks.

II. BACKGROUND AND RELATED WORK

A. Survey of Previous P2P Storage Systems
Peer-to-Peer (P2P) storage systems harness idle disk space from thousands of wide-area, distributed workstations and use replication both to reduce the probability of losing data and to increase availability. In recent years, the peer-to-peer model has emerged as a paradigm for building large-scale self-managing distributed systems. This has led to several efforts to build highly available distributed storage systems based on this paradigm, such as OceanStore [1], [5], PAST [2], CFS [3], TotalRecall [6], Ivy [7], Farsite [8], and Freenet [9]; [10] also discussed the feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. Similarly, cluster-based storage systems (e.g., PVFS [11], GFS [12], HDFS [13]) hosted on dedicated, well-connected components use the same technique, yet operate in a different deployment environment. Some storage systems work in environments that are relatively more stable than peer-to-peer settings; closest to our system are scavenged storage systems such as FreeLoader [14], which aggregates idle storage space from LAN-connected workstations. A number of existing distributed storage systems (e.g., cluster-based and peer-to-peer storage systems) attempt to offer cost-effective, reliable data stores on top of unreliable, commodity, or even donated storage components. To tolerate failures of individual nodes, these systems use data redundancy through replication or erasure coding. Among these systems, some use a replication scheme, such as OceanStore, while others use erasure coding to ensure high availability, such as TotalRecall. Regarding the stable node, AmazingStore [15] is a peer-to-peer storage system which also provides a stable node as one of the data sources to ensure high availability. Another work similar to ours is ThriftStore [16], [17], which also proposed the idea of a dedicated stable component plus aggregated storage from volatile nodes and explored data reliability trade-offs, although how to use the stable component is not made clear. Newer similar system implementations include PeerStrip [18], Storage@desk [19], and Sector [20], which also utilize contributory storage in desktop grids. Table I presents a comparison of these P2P storage systems.
ThriftStore explores the feasibility of a cost-efficient storage architecture that offers the reliability and access performance characteristics of a high-end system. This architecture exploits two opportunities. First, scavenging idle storage from LAN-connected desktops not only offers low-cost storage space, but also high I/O throughput by aggregating the I/O channels of the participating nodes. Second, the two components of data reliability, durability and availability, can be decoupled to control overall system cost. To capitalize on these opportunities, it integrates two types of components: volatile, scavenged storage and dedicated, yet low-bandwidth, durable storage. On one side, the durable storage forms a low-cost back-end that enables the system to restore the data the volatile nodes may lose. On the other side, the volatile nodes provide a high-throughput front-end. While integrating these components has the potential to offer a unique combination of high throughput, low cost, and durability, a number of concerns need to be addressed to architect and correctly provision the system. To this end, the authors develop analytical and simulation-based tools to evaluate the impact of system characteristics (e.g., bandwidth limitations on the durable and the volatile nodes, space constraints, replica placement scheme) on data availability and the associated costs in terms of maintenance traffic.

Table I
COMPARISON WITH PREVIOUS PEER-TO-PEER STORAGE SYSTEMS.
(Columns: data file partitioning, replication, erasure codes, replica repair, stable storage component, evaluation method, and system utilization evaluation. Systems compared: OceanStore, PAST, CFS, Freenet, Farsite, TotalRecall, Ivy, FreeLoader, PeerStrip, Sector, AmazingStore, ThriftStore, and the proposed model.)

Further, the authors implement and evaluate a prototype of the proposed architecture, namely a GridFTP server that aggregates volatile resources; their evaluation demonstrates an impressive transfer throughput of up to 800 MBps for the new GridFTP service.
B. Review of Reliability Strategies
Redundancy is essential to achieve resilience. Storage systems realize redundancy by using either replication or error-correcting codes (e.g., erasure codes), as discussed next. There are three reliability strategies: replication, erasure codes, and replica repair.
1) Replication: distributing replicas on multiple servers, i.e., mirroring, stores the same object on multiple nodes. It is normally efficient when the object is small or frequently accessed. Note that a large object may still be split into multiple fragments, and each of these fragments is then replicated.
2) Erasure codes: under this scheme, each object is divided into m blocks which are then encoded into n fragments to store, with an effective redundancy factor n/m. The object can be reconstructed from any m fragments taken from the stored n fragments.


Object availability is given by the probability that at least m out of the n fragments are available. To overcome host churn and maintain data reliability and availability, unreachable fragments are continuously recovered. Erasure codes are a class of error-correcting codes which transform an m-fragment object into n (n > m) encoded fragments, such that the original object can be reconstructed from any m out of the n encoded fragments. This typically leads to a storage overhead of slightly more than n/m. The rate of an erasure code is defined as the fraction of fragments required for decoding.
3) Replica repair: repair policies include lazy repair and deterministic threshold-based repair, and repair can be applied both to replicas and to erasure-coded fragments. Although replication is sometimes regarded as a special case of erasure codes, [21] studied replication strategies for highly available peer-to-peer storage. Other papers compare replication and erasure coding: [22] presents a quantitative comparison between erasure codes and replication, and [23] compares the two for ensuring high availability in DHTs. OceanStore and TotalRecall used erasure coding to improve reliability. As replica repair methods, TotalRecall uses lazy repair, and [24] proposed proactive replication for data durability.

C. Availability and Durability Study
Existing papers focus either on availability or on durability; the two metrics are coupled and are sometimes confused with each other. Many papers simply predict host availability and apply an equation to compute how many replicas should be stored in the system, design components to protect or re-create lost replicas, or even use a predictor of host failures to adjust the replication level or trigger repair. This paper instead aims to realize a fixed, high availability and durability, just as Amazon S3 does, targeting 99.99% availability and 99.99999999% durability. It must therefore consider the worst case, in which a host fails permanently with high probability and its cached data is lost. Chun et al. studied efficient replica maintenance for distributed systems using Markov chains [25], and [26] and [16] proposed separating availability and durability. This paper decouples these two concepts, gives a brief definition and concrete metrics for each, and studies how the data redundancy strategies relate to them. We add a stable node to reach high (e.g., six-nine) durability while keeping the stable storage cost-effective and its space efficiently used.
1) Availability Theoretical Analysis: In order to ensure a required availability, several papers use the following widely used equations to compute the replication level. Suppose the target availability is A and the availability of each individual host is p. Under a replication scheme, the availability of a data object with r replicas is the probability that at least one of the r replicas is available; solving for the smallest r that satisfies the target availability gives the minimum number of replicas, denoted min_r. For the replication strategy, we have

A = 1 − (1 − p)^r,    (1)

and

min_r = lg(1 − A) / lg(1 − p).    (2)

For the erasure codes strategy with redundancy factor r (so that an object encoded into n = mr fragments can be reconstructed from any m of them), we have

A = \sum_{i=m}^{mr} \binom{mr}{i} p^i (1 − p)^{mr−i}.    (3)

Two factors determine the availability in these equations: an availability effect (the availability p of individual hosts) and a combinatorial effect (how many of the stored fragments suffice for reconstruction) [27].
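To make Eqs. (1)-(3) concrete, the following minimal sketch (our own illustration, not part of the Java simulator described later) computes object availability under replication and under an m-of-n erasure code, together with the minimum replication level for a target availability; the host availability p = 0.7 used in the example is purely illustrative.

// Sketch of Eqs. (1)-(3); all numbers are illustrative, not measured values.
public class AvailabilityModel {
    // Eq. (1): availability of an object with r full replicas,
    // each host being available with probability p.
    static double replicationAvailability(double p, int r) {
        return 1.0 - Math.pow(1.0 - p, r);
    }

    // Eq. (2): minimum number of replicas reaching target availability A.
    static int minReplicas(double p, double targetA) {
        return (int) Math.ceil(Math.log(1.0 - targetA) / Math.log(1.0 - p));
    }

    // Eq. (3): m-of-n erasure code; the object is available when at least
    // m of the n stored fragments reside on available hosts.
    static double erasureAvailability(double p, int m, int n) {
        double a = 0.0;
        for (int i = m; i <= n; i++) {
            a += binomial(n, i) * Math.pow(p, i) * Math.pow(1.0 - p, n - i);
        }
        return a;
    }

    static double binomial(int n, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
        return c;
    }

    public static void main(String[] args) {
        double p = 0.7; // illustrative per-host availability
        System.out.println(replicationAvailability(p, 3)); // ~0.973
        System.out.println(minReplicas(p, 0.9999));         // 8
        System.out.println(erasureAvailability(p, 4, 8));   // ~0.94
    }
}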

2) Markov Model: A large number of storage systems use replication or erasure codes to improve reliability without differentiating between data durability and availability; many papers simply conflate these two concepts. Some efforts have studied reliability and durability using Markov chain and probability models [28], whereas our paper focuses on simulation.

3) Simulation-based Research: Blake and Rodrigues argued that one can have at most two of high availability, scalable storage, and dynamic peer networks [29]. Bhagwan et al. studied and modeled host availability [30]. [31] analyzed durability in replicated distributed storage systems by simulation, but only studied replication and repair.

III. PROPOSED STORAGE SYSTEM MODEL
A. System Assumptions
We first state several important assumptions of the storage system model:
–Logically centralized metadata service. Clients locate objects by contacting a metadata service. This service maintains all metadata related to objects and makes replica placement decisions. Note that clients never read or write data through the metadata service, which keeps it outside the critical I/O path.
–Failures are detected using timeouts. Volatile nodes declare their availability to the system using heart-beat messages sent to the metadata service. We define two timeouts: the Synchronization Interval Time (SIT) and the Failure Timeout Period (FTP).
–Data fragments. In order to support large-scale storage, large data items are split into chunks, whose size can be set by the scheduler. Data files are partitioned into fixed-size fragments.
–Security model. We assume a trusted deployment environment in which all system components and the communication channels among them are trusted. We do not address security issues in this paper and assume that the security solutions of existing desktop grid systems can be applied to the system.
–Network distance and communication time. We do not model the cost of replica fault tolerance, replica repair, or re-creation after loss, nor network distance, bandwidth, or data transfer time; we only study how to ensure data availability and durability and do not report access performance.
–Node availability. For general applicability, we conservatively assume that node unavailability cannot be known a priori.
–Host storage limit. Each host contributes a maximal amount of idle storage. If the utilized storage reaches this limit, the host stops accepting new data.
B. Architecture
Online storage on Desktop Grids implies requirements in terms of data durability, data availability, and access performance.

Data durability is the need to preserve the data without errors. Data availability refers to the accessibility of the data. Access performance refers to how quickly the data can be located and accessed. This section presents the high-level design of the system. The storage system consists of one stable node and a large number of volatile nodes. Here, the stable node could be a workstation or a storage server, and it can also be hired from a commercial cloud data center. Even though storage hardware is now cheap, the stable storage space should be used cost-efficiently, and whether to store a replica on the stable node as a backup is decided by the placement strategy and the scheduler algorithm. How to use the stable node efficiently without copying all of the data onto it is not discussed in this paper and is left for future research; in this paper, we place one copy of all data on the stable node. Data writes can be handled in two ways that trade off throughput against durability: 1) data writes are executed first on the stable node, which then creates replicas on volatile nodes and makes the data available for reading; 2) alternatively, data writes are first performed on the volatile nodes, and the data is then transferred to the stable node in the background. When a user issues a retrieval or download request, the system first resorts to the volatile nodes and, if the data is unavailable there, falls back to the stable node. We model the system as having one stable node and a large number of volatile nodes. In reality, the stable node could be an external archival service, a complex set of devices, or a hired storage service; however, for the purpose of our analysis, we model it as a single stable component. Integrating the stable node raises further questions: if we take one of the replicas and place it on the stable node, how much traffic is generated at the stable node, and how should replicas be scheduled onto it? Related ideas from hybrid cloud storage appliances and cloud gateways (e.g., MetaCDN), such as using a local cache and algorithms that govern which data remains local and which data is transferred to the cloud, could improve read/write performance and guide the efficient use of the stable storage space; we leave these directions to future work.
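As an illustration of the read path just described, the following minimal sketch (our own, not taken from the paper; the class and method names are hypothetical) first tries the replicas held by volatile nodes and falls back to the stable node only when none of them can serve the object.

import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the hybrid read path: volatile nodes first,
// stable node as a fallback. Node selection and data handling are simplified.
interface StorageNode {
    boolean isOnline();
    Optional<byte[]> fetch(String objectId); // empty if the replica is missing
}

class HybridReader {
    private final List<StorageNode> volatileNodes;
    private final StorageNode stableNode;

    HybridReader(List<StorageNode> volatileNodes, StorageNode stableNode) {
        this.volatileNodes = volatileNodes;
        this.stableNode = stableNode;
    }

    // Try every online volatile replica holder; fall back to the stable node.
    Optional<byte[]> read(String objectId) {
        for (StorageNode node : volatileNodes) {
            if (!node.isOnline()) continue;
            Optional<byte[]> data = node.fetch(objectId);
            if (data.isPresent()) return data;
        }
        return stableNode.fetch(objectId);
    }
}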

C. Node Failure Detection
We first describe node status migration. We define three host states, ONLINE, UNCONNECTED, and OFFLINE, and two thresholds, the Failure Timeout Period (FTP) and the Synchronization Interval Time (SIT). As shown in Figure 1, the green dots stand for synchronization times, while the red dots stand for expected synchronization times at which the host status becomes UNCONNECTED. UNCONNECTED is an intermediate state: when the server does not receive the synchronization signal from a host, the host falls to UNCONNECTED. After that, if the host sends a synchronization signal and recovers within a short time, it becomes ONLINE again; otherwise it stays UNCONNECTED. If its alive time is not updated for longer than the Failure Timeout Period, the host becomes OFFLINE. At each synchronization, the alive time is updated to the current time. Node failure detection is thus based on a timeout approach, combining host synchronization, failure determination, and data scheduling, and it must cope with host churn and the limited lifetime of individual volunteers. In a simulation environment, host status comes from trace data, whereas in a real system it is not known whether a host failure is temporary or permanent. The status migration chart is shown in Fig. 1. In a dynamic peer-to-peer environment, a node may "leave" (i.e., no longer be part of) the system at any time, at which point the system can no longer retrieve any objects stored on that node; nodes that leave the system are replaced by new ones. When a node is offline, it is temporarily unable to participate in the system, so any objects stored on that node are inaccessible until the node becomes online again. The BitDew environment implements a data scheduler algorithm through two important steps: remove old data and get new data. Volatile nodes declare their availability to the system through periodic synchronization with the BitDew service, and the two timeouts, Synchronization Interval Time (SIT) and Failure Timeout Period (FTP), are used for node failure determination. Choosing a timeout period depends on how aggressively the system wants to repair; this frequency problem is also studied in this research. We now focus on how the system can implement durable storage. A replica may be online, offline, or dead, depending on the node that stores it. Given a degree of replication r, the system starts by creating r replicas of the object and subsequently attempts to maintain r replicas at all times through a process of repair: it creates a new replica of the object whenever it detects that an existing replica is dead. We call a replica that is not online (i.e., in either the offline or the dead state) not-online. A natural mechanism to decide when to trigger a repair is to wait some period of time for the not-online replica to return to the online state, i.e., to use a timeout. When a timeout occurs, the system attempts to create a new replica; however, it may not be able to do so immediately.
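The timeout-based status migration can be summarized by the following minimal sketch (our own illustration; the actual metadata service in the BitDew-based simulator may differ). A host is marked UNCONNECTED as soon as a synchronization is missed, and OFFLINE once no heartbeat has been received for the Failure Timeout Period.

// Hypothetical sketch of timeout-based failure detection with the two
// thresholds used in the paper: SIT (heartbeat period) and FTP (failure timeout).
enum HostStatus { ONLINE, UNCONNECTED, OFFLINE }

class HostMonitor {
    private final long sitMillis;   // Synchronization Interval Time
    private final long ftpMillis;   // Failure Timeout Period (ftp > sit)
    private long aliveTime;         // time of the last received heartbeat

    HostMonitor(long sitMillis, long ftpMillis, long now) {
        this.sitMillis = sitMillis;
        this.ftpMillis = ftpMillis;
        this.aliveTime = now;
    }

    // Called when the metadata service receives a heartbeat from the host.
    void onSynchronization(long now) {
        aliveTime = now;
    }

    // Status as seen by the metadata service at time 'now'.
    HostStatus status(long now) {
        long silent = now - aliveTime;
        if (silent <= sitMillis) return HostStatus.ONLINE;      // heartbeat arrived on time
        if (silent <= ftpMillis) return HostStatus.UNCONNECTED; // missed sync, still waiting
        return HostStatus.OFFLINE;                              // FTP exceeded: trigger repair
    }
}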
Figure 1. Host synchronization and timeout-based node failure detection method.

D. Definition of Metrics
Data reliability can be split into two interrelated components: durability (the ability to preserve data over time) and availability (the ability to serve the data instantly). For many applications, durability is the critical property: they may tolerate short-term service interruptions (that is, lower availability) as long as data is not permanently lost (e.g., due to disk failure). Decoupling durability and availability offers the opportunity to engineer systems that provide strong durability guarantees (e.g., ensuring ten 9's durability) while relaxing availability guarantees to reduce the overall cost. In this section, we give the definitions of availability and durability used throughout the paper. As mentioned in the previous section, there are two reliability metrics, availability and durability. Data availability refers to the accessibility of the data; data durability is the need to preserve the data without errors over the long term. From a discrete point of view, and under our assumptions, objects are never lost thanks to the contribution of the durable component, but an object may not be instantly accessible if it is not available on at least one volatile node; therefore, we define availability as the ratio of the time during which an object replica exists on at least one volatile node to the total time. From a continuous point of view, we define durability as the ratio of the object lifetime to the total application time.
E. Reliability Improvement Strategies
Replication, erasure coding, and replica repair have been validated by much prior work on distributed P2P storage systems. Building on them, we propose the following strategies for the hybrid storage system (a configuration sketch follows the list):
1) File Replica Strategy (FR)
2) File Encoding Strategy (FE)
3) Replica Repair Strategy (RR)
4) Stable-Volatile Strategy (SV)
5) Combined strategies, including:
• FR-RR, FE-RR
• FE-FR, FE-RR, FE-FR-RR
• SV-FR, SV-FE
• SV-FR-RR, SV-FE-RR, SV-FE-FR
In the names used below, a trailing number denotes the replication level r (e.g., FR-RR-1) and (m, n) denotes an m-of-n erasure code (e.g., FE-RR-(4,8)).
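The following minimal sketch (our own; the field names are hypothetical and do not come from the simulator) shows how such a strategy combination could be expressed as a configuration object, e.g., SV-FE-RR with a 4-of-8 code.

// Hypothetical configuration of a reliability strategy combination.
class StrategyConfig {
    final int replicas;        // FR: number of full replicas (1 = no extra replication)
    final boolean erasure;     // FE: use an m-of-n erasure code
    final int m, n;            // erasure code parameters (ignored if erasure == false)
    final boolean repair;      // RR: re-create replicas/fragments lost on failed hosts
    final boolean stableNode;  // SV: keep one copy on the stable storage utility

    StrategyConfig(int replicas, boolean erasure, int m, int n,
                   boolean repair, boolean stableNode) {
        this.replicas = replicas;
        this.erasure = erasure;
        this.m = m;
        this.n = n;
        this.repair = repair;
        this.stableNode = stableNode;
    }

    // Example: SV-FE-RR with a 4-of-8 code and no extra replication.
    static StrategyConfig svFeRr4of8() {
        return new StrategyConfig(1, true, 4, 8, true, true);
    }
}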

Consider first a system built only upon volatile, unreliable nodes that employs data redundancy (data replication, erasure coding, or the fault-tolerance flag) to provide both durability and availability: which data redundancy strategy enables the maximum availability? There is one additional factor to consider, namely the role of memory in repair. Repair without memory assumes that a timed-out replica is lost (in effect, treating its node as though it were joining the system for the first time); repair with memory remembers the timed-out replica and may readmit it to the set of existing replicas when its host returns.

IV. SIMULATION-BASED EVALUATION
A. Simulation Model
1) Simulator: This paper relies on a trace-driven discrete event simulator that we have developed to study the contribution and the advantages and disadvantages of each strategy. The simulator reads a configuration file and trace data and estimates multiple metrics, including reliability; it has been used to help us choose redundancy strategies, and it can also be used to develop and test scheduler algorithms. The simulator is implemented in about 6000 lines of Java code. It is a trace-driven discrete event desktop grid simulator that can simulate data distribution algorithms and different scheduling strategies, and it runs on a single node while reproducing large-scale node join/leave activity. By loading trace data from a real volunteer computing or desktop grid system, the simulator reproduces an environment that is close to the real one. The trace data comes from the Failure Trace Archive (FTA) [32], which records node join and leave times.

In order to execute one simulation, the following parameters are configured first: 1) trace time (how much trace time to load); 2) host number; 3) data number (storage load), i.e., how many data objects are put and maintained by the system; 4) data size; 5) number of replicas; 6) erasure codes scheme; 7) replica fault-tolerance flag; 8) SIT; 9) FTP; 10) data read/write model. The main components of the simulator are: 1) Trace Data and Trace Loader; 2) Event Queue; 3) Data Write/Read Model Generator; 4) Host Queue. The Event Queue manages all the events during the simulation. There are five event types (Data Write Event, Data Read Event, Data Sync Event, Data Joining Event, and Data Leaving Event), and an event handler is defined for each. Some events come from the trace data (e.g., Data Joining Event and Data Leaving Event), while others are generated dynamically during the simulation (e.g., Data Sync Event). The Event Queue is a priority queue, so each inserted event is ordered by its simulation clock. When the simulation starts, the simulator repeatedly takes the next event from the Event Queue and triggers the corresponding event handler until the termination condition is reached.
2) Simulation environment configuration: Performance evaluation is conducted on one AMD Opteron dual-core server node of the Grid'5000 platform. We run the simulator with different parameters and periodically put a set of data objects into the storage system simulator. During the simulation, the results for data availability, data durability, and redundancy rate are written to output files. We use 30 days of trace data from real SETI@home traces from the Failure Trace Archive [32]. The number of hosts is 1000, a randomly selected subset of the full 110000 hosts in the SETI@home traces, and we put 100 data objects into the system; the data size is a random value between 8 MB and 2048 MB, for a total size of 106257 MB. After these 100 data objects are stored, we perform 1000 reads. The other parameters (chunk size, erasure codes scheme, number of replicas, SIT, and FTP) are configured with varied values. For each simulation condition, we run the simulation many times (typically 10) and report the average value.
3) Simulating Data Write/Read Operations: The data write and read models define when to write and when to read. In general, several models are possible, such as periodic random, Poisson, or Gaussian arrivals; in our simulation, both the write and the read model are the simple periodic random one. Before running the simulation, we generate the Data Write Events and Data Read Events and insert them into the Event Queue; during the simulation, the corresponding event handlers are triggered.
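The event-driven core of such a simulator can be sketched as follows (a simplified illustration of the design described above, not the actual 6000-line simulator; the event types and handler bodies are placeholders).

import java.util.PriorityQueue;

// Minimal sketch of the trace-driven discrete event loop described above.
enum EventType { DATA_WRITE, DATA_READ, DATA_SYNC, HOST_JOIN, HOST_LEAVE }

class Event implements Comparable<Event> {
    final long clock;          // simulation time of the event
    final EventType type;
    Event(long clock, EventType type) { this.clock = clock; this.type = type; }
    public int compareTo(Event other) { return Long.compare(clock, other.clock); }
}

class Simulator {
    private final PriorityQueue<Event> eventQueue = new PriorityQueue<>();
    private long clock = 0;

    void schedule(Event e) { eventQueue.add(e); }

    // Pop events in clock order and dispatch them until the trace horizon is reached.
    void run(long endOfTrace) {
        while (!eventQueue.isEmpty()) {
            Event e = eventQueue.poll();
            if (e.clock > endOfTrace) break;
            clock = e.clock;
            handle(e);
        }
    }

    private void handle(Event e) {
        switch (e.type) {
            case HOST_JOIN:  /* mark host ONLINE, possibly pull data to it */ break;
            case HOST_LEAVE: /* mark host OFFLINE, start FTP countdown */ break;
            case DATA_SYNC:  /* update alive time, schedule the next sync */ break;
            case DATA_WRITE: /* place replicas or fragments on hosts */ break;
            case DATA_READ:  /* probe: can the object still be reconstructed? */ break;
        }
    }
}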

B. Metrics of Reliability
We now describe how the defined metrics are measured, since both availability and durability are probabilities. We use a "probe" method: when the simulator handles a read event, it reads all data stored in the system, checks whether enough chunks are present to recover each original data object, and counts what percentage can be retrieved successfully.
1) Data Availability: Instead of measuring the available time of each data object, which would be prohibitively time-consuming during the simulation, we densely and periodically read all data and count the number of available objects. The percentage of available data approximates the data availability of the storage system when the simulation is run many times (typically 10 in this paper) and the available percentages are averaged.
2) Data Durability: Durability is defined as the ratio of data lifetime to the total simulation period. If durability = 1, the data can be retrieved successfully during the whole simulation period; if durability = 0.5, the data can no longer be retrieved after half of the simulation period (for example, without a fault-tolerance flag, replicas may be lost). In this paper, data durability is presented as a Cumulative Distribution Function (CDF). The main measure of durability used in our evaluation is the data object lifetime: for each data object, the lifetime is the time that elapses between the instant the system creates the object and its replicas, and the instant the system no longer holds any replica of the object. The durability of one data object is the ratio of its lifetime to the total simulation period, and the durability of the storage system is the average durability of all data objects in the system.
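A compact sketch of these two measurements is given below (our own illustration; the simulator's internal bookkeeping may differ). Availability is the averaged fraction of probe reads that succeed, and durability is derived from per-object lifetimes.

import java.util.List;

// Hypothetical helper computing the two reliability metrics described above.
class ReliabilityMetrics {

    // Availability: fraction of data objects that a probe read can reconstruct,
    // averaged over all probe reads of the simulation run.
    static double availability(List<int[]> probeResults) {
        // each entry: {recoveredObjects, totalObjects} for one probe read
        double sum = 0.0;
        for (int[] probe : probeResults) sum += (double) probe[0] / probe[1];
        return sum / probeResults.size();
    }

    // Durability of one object: its lifetime divided by the total simulation period.
    static double durability(long creationTime, long lastRecoverableTime, long simulationPeriod) {
        long lifetime = lastRecoverableTime - creationTime;
        return Math.min(1.0, (double) lifetime / simulationPeriod);
    }
}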

V. SIMULATION RESULTS AND ANALYSIS
We also track the system utilization rate and the data insert failure rate. The redundancy rate of the system reflects its storage overhead. In BitDew, there are two options for the host local cache (cache in memory or write to file). When a host leaves, the replicas on that host may be lost and are rescheduled to another host; when the host connects again, if the replica still exists, extra data redundancy occurs. Since both the replication and the erasure codes schemes lead to data redundancy, we measure the real-time data redundancy of the system. The total contributed storage space and the total utilized storage space can also be measured, taking into account the limited storage space of each host.
A. Data Availability
This section presents the data availability results of our simulation study.

Figure 2. Data availability versus the replication level with the FR Strategy and the FR-RR Strategy (y-axis: data availability in number of nines; x-axis: number of replicas).

We simulate the behavior of the storage system, including node joining/leaving and data writing/reading, over a large number of runs. For each run we first set the following parameter values: 1) data chunk size; 2) number of replicas; 3) replica fault-tolerance flag; 4) erasure coding flag; 5) erasure coding scheme (m-of-n); 6) SIT; 7) FTP. In order to explore the contribution of each strategy to availability, we test the strategies separately, measure the success rate of each full read (reading all data), and report the average success rate. Data availability is measured under varied chunk size, number of replicas, erasure codes, with and without the fault-tolerance flag, and varied SIT/FTP (timeout frequency), and the contribution of each strategy is analyzed separately.
1) File Replication Strategy: Fig. 2 shows the data availability when the number of replicas varies from 1 to 8, with and without the replica fault-tolerance flag. The parameters are as follows: the data chunk size is 64 MB, the SIT is 20 minutes, and the FTP is 60 minutes. From this figure, we see that data availability reaches 99.999% when the number of replicas is larger than 2, and that adding the replica fault-tolerance (repair) strategy does not bring a conspicuous improvement. Data availability reaches nearly 1.0 even without the repair strategy, provided the number of replicas is more than 3.
2) File Encoding Strategy: We studied 8 erasure codes in the following groups:
• the first group, r=1.25, (m, n)=(4,5) and (8,10);
• the second group, r=1.5, (m, n)=(8,12);
• the third group, r=2, (m, n)=(1,2), (4,8) and (16,32);
• the fourth group, r=3, (m, n)=(4,12);
• the fifth group, r=4, (m, n)=(4,16).
From Fig. 3, we again see that adding the replica repair strategy does not bring a conspicuous improvement. In this figure, the lowest data availability comes from the (m, n)=(4,5) scheme, whose redundancy factor is 1.25; a larger redundancy factor leads to a better result. Even without the repair strategy, a larger redundancy factor (e.g., r=3 or r=4) brings the result to nearly 1.0.

Figure 3. Data availability versus different erasure codes with the FE Strategy and the FE-RR Strategy (y-axis: data availability in number of nines; x-axis: erasure code (m, n), with one replica).

3) Replica Repair Strategy: Comparing Figure 7 with Figure 8 shows that, in a highly dynamic environment, a simple replication scheme may be more efficient than sophisticated erasure codes, because erasure coding requires a computation-intensive decoding process that decreases data read/write performance in a real hybrid storage prototype.
4) Combined strategies: If we combine the two independent redundancy strategies (replication and erasure codes) with replica fault tolerance, Fig. 4 shows the data availability when the number of replicas is 1 and 2. The erasure codes used fall into two groups: 1) r=1.25, (m, n)=(4,5) and (8,10); 2) r=1.5, (m, n)=(4,6) and (8,12). With one replica, the 8-of-10 scheme is better than the 4-of-5 scheme, and the 8-of-12 scheme is better than the 4-of-6 scheme; we may therefore roughly state that, for the same redundancy factor r, a larger m is better. With two replicas we obtain an excellent result of nearly 1.0, but we must also be aware that the cost of two replicas with fault tolerance is high, both in time and in storage overhead.
5) The impact of data chunk size on availability: We also investigate the effect of varying the data chunk size. Fig. 5 shows that data availability varies with the data chunk size, and the two-replica configuration clearly outperforms the single-replica one. Normally, a medium value is selected as the data chunk size, neither too large nor too small; for example, in our previous tests we selected 64 MB.
6) The impact of SIT/FTP on availability: Another factor that may affect availability is the value of the Synchronization Interval Time (SIT) and the Failure Timeout Period (FTP).

Figure 4. Data availability versus the combination of replicas and erasure codes with the FE-RR-1 Strategy and the FE-FR-RR-2 Strategy (x-axis: erasure code (m, n)).

Figure 5. Data availability versus varied data chunk size with the FR-RR-1 Strategy and the FR-RR-2 Strategy (x-axis: data chunk size in MB).

Figure 6. Data availability versus varied SIT/FTP values with the FR-RR-1 Strategy (x-axis: Synchronization Interval Time; curves: FTP=3*SIT to FTP=6*SIT).

As shown in Fig. 6, if we use a long Synchronization Interval Time and a long Failure Timeout Period, for example 5 hours as SIT and 20 hours as FTP, availability is poor, below 0.8. This result is explained by the host failure determination method: a long Failure Timeout Period means that host failures are detected with delay, so lost replicas are not rescheduled (re-created) on other hosts immediately, which may lead to failed data retrievals. The longer the SIT/FTP, the worse the result; on the other hand, a short SIT/FTP may impose a heavy load on the server because of frequent communication between hosts and server. Thus, there is usually a trade-off between the number of hosts and the SIT/FTP values; as a rule of thumb, with about 10,000 hosts the SIT should be longer than 30 minutes.
B. Data Durability
Data durability is presented as a Cumulative Distribution Function (CDF, on the y-axis) over the fraction of data (on the x-axis).
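Building this durability curve from per-object lifetimes can be sketched as follows (our own illustration; the plotting itself is outside the simulator).

import java.util.Arrays;

// Hypothetical helper: sort per-object durability values (lifetime divided by
// the simulation period) in ascending order, so that plotting them against the
// object rank expressed as a fraction of data gives curves like Figs. 7-10.
class DurabilityCdf {
    // Returns the durability at each requested fraction of data (e.g., 0.10, 0.20, ...).
    static double[] atFractions(double[] durability, double[] fractions) {
        double[] sorted = durability.clone();
        Arrays.sort(sorted);
        double[] result = new double[fractions.length];
        for (int i = 0; i < fractions.length; i++) {
            int index = (int) Math.min(sorted.length - 1,
                    Math.floor(fractions[i] * sorted.length));
            result[i] = sorted[index];
        }
        return result;
    }
}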

Figure 7. CDF of data durability versus the number of replicas without FT (x-axis: fraction of data in %; curves: FR with r=1 to r=8).

Similar to the study of data availability, durability is evaluated by simulation for both the individual and the combined strategies, recording the lifetime of each data object.
1) File Replication Strategy: The parameters are as follows: the data chunk size is 64 MB, the SIT is 20 minutes, and the FTP is 60 minutes. When the number of replicas varies from 1 to 8 without the replica fault-tolerance flag, good durability is obtained once the number of replicas exceeds 3; with exactly 3 replicas, only 2% of data objects have a lifetime ratio below 0.8, as can be seen in Fig. 7.
2) File Encoding Strategy: We tested durability with 8 different erasure code schemes without the replica fault-tolerance flag. Except for the three erasure codes (m, n)=(1,2), (4,5), and (8,10), all schemes obtain good results.
3) Replica Repair Strategy: If we add replica fault tolerance to the configurations of the previous two figures, the results are excellent and near 1.0; they are not presented here for lack of space.

Figure 8. CDF of data durability versus varied erasure codes with the FE Strategy (x-axis: fraction of data in %).

Figure 9. CDF of data durability versus the combination of replicas and erasure codes with the FE Strategy (x-axis: fraction of data in %).

Figure 10. CDF of data durability versus varied SIT and FTP (FTP=5*SIT, FR-RR-1 Strategy; x-axis: fraction of data in %).

4) Combined strategies: The results for the combination of replication and erasure codes are shown in Fig. 9. We evaluated 4 erasure codes, (4,5), (4,6), (8,10), and (8,12), with redundancy factors of 1.25 or 1.5. Clearly, 2 replicas lead to good durability.
5) The impact of SIT/FTP on durability: Regarding varied SIT/FTP, Fig. 10 shows the durability for varied SIT values with FTP proportional to SIT. The results indicate that, as for availability, the longer the SIT/FTP, the worse the result.
C. System Utilization Ratio
In order to evaluate the storage system overhead, we measure the number of repairs that were made, the total repaired data chunk size, the number of data objects and total data size cached by each host, the host storage utilization rate, and the overall redundancy factor. BitDew provides two host local cache options, Cache in Memory and Write to File, and we evaluate the redundancy rate under both options, as well as the storage cost of replica repair.

Figure 11. Data redundancy rate versus the FR Strategy and the FE-RR Strategy (x-axis: simulation clock).

We evaluated two situations, 1) replication with FT and 2) erasure codes with FT, with the 106257 MB storage load stored in the system. Figure 11 shows the real-time redundancy rate versus the simulation clock. Here we choose two erasure codes, (m, n)=(4,8) and (4,12), which correspond to redundancy factors of 2 and 3 (i.e., equivalent to 2 and 3 replicas). In this figure, we observe that with FT the redundancy is very large, and that the replication strategy incurs less redundancy than the erasure codes strategy; (4,12) is too costly, and (4,5) or (4,8) is a better choice. We also evaluated the impact of the host storage limit with the Write to File local cache option: with 1000 nodes contributing between 20*1024 MB and 60*1024 MB each (40960 MB on average), we inserted 10000 files of 8 MB to 1024 MB (512 MB on average) over one month, without data chunking, and obtained a storage utilization ratio of about 23%. We further evaluate the insert failure ratio and the system utilization ratio under host churn, the fault-tolerance ability, data reads, availability and durability, and the number of regenerated replicas.

D. Integrating Stable Storage Utility
Motivated by the way Amazon S3 relaxes durability to save cost, several options are available (stable node, replicas, erasure codes, fault tolerance), and the question is how to choose their parameters so as to ensure a target 99.x% reliability. If users require, say, nine-nine reliability, the system must realize it while also satisfying their cost-saving requirements, possibly by deliberately relaxing availability and durability a little to save cost; a cost-efficient SLA trade-off can thus be realized flexibly. We plan to extend the API so that users only need to specify a targeted availability level as a percentage of time (instead of having to specify, for example, replication levels), while the system ensures that the data is durable and meets the lifetime specified by the user. We would also like to augment BitDew with higher-level abstractions for durability, availability, and access performance; each application may need any subset of these requirements, and at different degrees. For example, an application should be able to specify that its data should be completely durable, available 80% of the time, and accessible at 1 Mbit/s. If we take the computation cost of erasure coding into account, replication with FT can be a good and simple solution. Online storage services on Desktop Grids that give users access to remote data storage through web services are gaining considerable popularity thanks to their simplicity and reliability. Cost-effective storage techniques such as inline deduplication, which retain data longer, protect data more efficiently, and reduce energy and maintenance requirements, are complementary to this goal.
1) Cost-Effective Storage Management and Reliability Evaluation: As demonstrated in Section 3, the proposed S3-like storage system is composed of one stable node and many volatile nodes, which we have not yet exploited in the simulations discussed so far. We assume that, collectively, the dedicated nodes have enough aggregate storage for at least one copy of all active data in the system; we argue that this solution is made practical by the decreasing price of commodity servers and large-capacity hard drives. From a probabilistic point of view, if a replica is not available on the volatile nodes, it is retrieved from the stable node, which clearly raises reliability. The previous simulations were run with the stable node removed (by setting the Stable Node Flag to false) under varied strategies. Quantifying the benefit of adding a stable node, i.e., how to use the stable storage space efficiently, how much replication traffic and disk space can be saved, and how much availability improves, is left for future work. The split of redundancy between the stable and volatile sides is summarized as follows.

–Replication strategy (r): the stable node holds 1 replica; the volatile nodes hold r-1 replicas.
–Erasure codes strategy (m-of-n): the stable node holds between 1 and m fragments; the volatile nodes hold the remaining n-m to n-1 fragments.
–Replication repair strategy (r) and erasure codes repair strategy (m-of-n): the corresponding stable/volatile split is left unspecified.
E. Implementation
Our prototype is built on top of BitDew [33]; here we give a brief introduction to BitDew and its working principles. BitDew (http://www.bitdew.net) is a data management middleware that provides a programmable environment for cloud, grid, and desktop grid computing and has been developed by INRIA. It can easily be integrated into volunteer computing and desktop grid middleware such as BOINC, XtremWeb, and Condor. BitDew proposes attribute keys to control data distribution and offers programmers a simple API for creating, accessing, storing, and moving data with ease, even in highly dynamic and volatile environments. Recently, a MapReduce programming model for desktop grids was also implemented on top of BitDew. The BitDew programming model relies on 5 abstractions to manage data: i) replication, which indicates how many occurrences of a data item should be available at the same time on the network; ii) fault tolerance, a flag which controls the policy in the presence of machine crashes; iii) lifetime, an attribute, absolute or relative to the existence of other data, which decides the life cycle of a data item in the system; iv) affinity, which drives the movement of data according to dependency rules; and v) protocol, which gives the runtime environment hints about the protocol to use to distribute the data (HTTP, FTP, or BitTorrent). Programmers define these simple criteria for every data item and let the BitDew runtime environment manage the operations of data creation, deletion, movement, replication, and fault tolerance. BitDew drives a scheduler algorithm and a periodic host synchronization mechanism. Rather than determining whether a host failure is short-term or permanent, it uses a simple timeout threshold approach for node failure detection. For the host local cache, there are two options for data cached on a host: cache in memory and write to file.

VI. CONCLUSION
In this paper, we characterized data availability and durability in distributed peer-to-peer storage systems based on several data redundancy strategies and proposed a cost-efficient way to satisfy users' storage requirements. To realize highly reliable cloud data storage, just as Amazon S3 does, and in the context of limited space contributed by volunteer nodes, we maximize data availability and durability, use the space of volatile and stable nodes efficiently, and analyze the impact of the redundancy strategies on availability and durability.

Our main contributions are as follows: we separate availability and durability, give a definition and a measurement method for each of these two metrics, and study the redundancy strategies that improve reliability. More specifically: 1) we propose a low-cost and reliable data store architecture and demonstrate its feasibility through a simulator prototype implementation; 2) we provide a simulation tool that allows detailed availability and durability analysis; 3) we evaluate the impact on availability of three data redundancy schemes, providing a systematic study of availability and durability and identifying, through simulation, the factors that affect them.
Future work includes: 1) deploying a prototype of the proposed storage system and evaluating the real data store performance of BitDewS3, including data write, read, retrieve, and delete operations, as well as the computation time for encoding/decoding; other efforts, such as MapReduce or data sharing applications, can be implemented to utilize the aggregated storage space; 2) studying how to use the storage space of the stable nodes efficiently, which remains a difficult problem; 3) studying how to leverage long-term and short-term node availability prediction and failure detection to enhance scheduling decisions in a given environment, as well as to improve availability and durability; 4) ensuring access performance via network distance, i.e., designing a network-distance-aware scheduler that improves data upload/download performance through network distance estimation. We also plan to give guidance on which strategies to choose depending on the ratio of stable storage to total storage, S/(S+V): when stable storage is scarce, when volatile storage is plentiful, and when stable storage is plentiful.

REFERENCES
[1] J. Kubiatowicz, D. Bindel, Y. Chen, S. E. Czerwinski, P. R. Eaton, D. Geels, R. Gummadi, S. C. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Y. Zhao, “Oceanstore: An architecture for global-scale persistent storage,” in ASPLOS, 2000, pp. 190–201.
[2] A. I. T. Rowstron and P. Druschel, “Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility,” in SOSP, 2001, pp. 188–201.
[3] F. Dabek, M. F. Kaashoek, D. R. Karger, R. Morris, and I. Stoica, “Wide-area cooperative storage with CFS,” in SOSP, 2001, pp. 202–215.
[4] “Amazon Simple Storage Service (Amazon S3),” Website, http://aws.amazon.com/s3/.

[5] S. C. Rhea, P. R. Eaton, D. Geels, H. Weatherspoon, B. Y. Zhao, and J. Kubiatowicz, “Pond: The oceanstore prototype,” in FAST. USENIX, 2003.
[6] R. Bhagwan, K. Tati, Y.-C. Cheng, S. Savage, and G. M. Voelker, “Total recall: System support for automated availability management,” in NSDI. USENIX, 2004, pp. 337–350.
[7] A. Muthitacharoen, R. Morris, T. M. Gil, and B. Chen, “Ivy: A read/write peer-to-peer file system,” in OSDI, 2002.
[8] A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. Wattenhofer, “Farsite: Federated, available, and reliable storage for an incompletely trusted environment,” in OSDI, 2002.
[9] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong, “Freenet: A distributed anonymous information storage and retrieval system,” in Workshop on Design Issues in Anonymity and Unobservability, ser. Lecture Notes in Computer Science, H. Federrath, Ed., vol. 2009. Springer, 2000, pp. 46–66.
[10] W. J. Bolosky, J. R. Douceur, D. Ely, and M. Theimer, “Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs,” in SIGMETRICS, 2000, pp. 34–43.
[11] P. H. Carns, W. B. Ligon, III, R. B. Ross, and R. Thakur, “Pvfs: A parallel file system for linux clusters,” in Proceedings of the 4th Annual Linux Showcase and Conference. USENIX Association, 2000, pp. 317–327.
[12] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The google file system,” in SOSP, M. L. Scott and L. L. Peterson, Eds. ACM, 2003, pp. 29–43.
[13] “Hadoop Distributed File System (Hadoop HDFS),” Website, http://hadoop.apache.org/hdfs/.
[14] S. Vazhkudai, X. Ma, V. W. Freeh, J. W. Strickland, N. Tammineedi, and S. L. Scott, “Freeloader: Scavenging desktop storage resources for scientific data,” in SC. IEEE Computer Society, 2005, p. 56.
[15] Z. Yang, B. Y. Zhao, Y. Xing, S. Ding, F. Xiao, and Y. Dai, “Amazingstore: Available, low-cost online storage service using cloudlets,” in the 9th International Workshop on Peer-to-Peer Systems (IPTPS’10), 2010.
[16] A. Gharaibeh and M. Ripeanu, “Exploring data reliability tradeoffs in replicated storage systems,” in HPDC, D. Kranzlmüller, A. Bode, H.-G. Hegering, H. Casanova, and M. Gerndt, Eds. ACM, 2009, pp. 217–226.
[17] A. Gharaibeh, S. Al-Kiswany, and M. Ripeanu, “Thriftstore: Finessing reliability trade-offs in replicated storage systems,” IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 6, pp. 910–923, 2011.
[18] C. A. Miller, A. R. Butt, and P. Butler, “On utilization of contributory storage in desktop grids,” in IPDPS. IEEE, 2008, pp. 1–12.

[19] H. H. Huang, J. F. Karpovich, and A. S. Grimshaw, “A feasibility study of a virtual storage system for large organizations,” in the Second International Workshop on Virtualization Technology in Distributed Computing (VTDC 2006), 2006.
[20] Y. Gu and R. L. Grossman, “Sector: A high performance wide area community data storage and sharing system,” Future Generation Comp. Syst., vol. 26, no. 5, pp. 720–728, 2010.
[21] R. Bhagwan, D. Moore, S. Savage, and G. M. Voelker, “Replication strategies for highly available peer-to-peer storage,” in Future Directions in Distributed Computing, ser. Lecture Notes in Computer Science, A. Schiper, A. A. Shvartsman, H. Weatherspoon, and B. Y. Zhao, Eds., vol. 2584. Springer, 2003, pp. 153–158.
[22] H. Weatherspoon and J. Kubiatowicz, “Erasure coding vs. replication: A quantitative comparison,” in IPTPS, ser. Lecture Notes in Computer Science, P. Druschel, M. F. Kaashoek, and A. I. T. Rowstron, Eds., vol. 2429. Springer, 2002, pp. 328–338.
[23] R. Rodrigues and B. Liskov, “High availability in dhts: Erasure coding vs. replication,” in IPTPS, ser. Lecture Notes in Computer Science, M. Castro and R. van Renesse, Eds., vol. 3640. Springer, 2005, pp. 226–239.
[24] E. Sit, A. Haeberlen, F. Dabek, B.-G. Chun, H. Weatherspoon, R. Morris, M. F. Kaashoek, and J. Kubiatowicz, “Proactive replication for data durability,” in Proceedings of the 5th International Workshop on Peer-to-Peer Systems (IPTPS’06), 2006.
[25] B.-G. Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, M. F. Kaashoek, J. Kubiatowicz, and R. Morris, “Efficient replica maintenance for distributed storage systems,” in NSDI. USENIX, 2006.
[26] G. Lefebvre and M. J. Feeley, “Separating durability and availability in self-managed storage,” in ACM SIGOPS European Workshop, Y. Berbers and M. Castro, Eds. ACM, 2004, p. 28.
[27] W. K. Lin, D. M. Chiu, and Y. B. Lee, “Erasure code replication revisited,” in Peer-to-Peer Computing, G. Caronni, N. Weiler, and N. Shahmehri, Eds. IEEE Computer Society, 2004, pp. 90–97.
[28] A. Datta and K. Aberer, “Internet-scale storage systems under churn – a study of the steady-state using markov models,” in Peer-to-Peer Computing, A. Montresor, A. Wierzbicki, and N. Shahmehri, Eds. IEEE Computer Society, 2006, pp. 133–144.
[29] C. Blake and R. Rodrigues, “High availability, scalable storage, dynamic peer networks: pick two,” in Proceedings of the 9th Conference on Hot Topics in Operating Systems (HotOS’03). Berkeley, CA, USA: USENIX Association, 2003.
[30] R. Bhagwan, S. Savage, and G. M. Voelker, “Understanding availability,” in IPTPS, ser. Lecture Notes in Computer Science, M. F. Kaashoek and I. Stoica, Eds., vol. 2735. Springer, 2003, pp. 256–267.

[31] S. Ramabhadran and J. Pasquale, “Analysis of durability in replicated distributed storage systems,” in IPDPS. IEEE, 2010, pp. 1–12.
[32] D. Kondo, B. Javadi, A. Iosup, and D. H. J. Epema, “The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems,” in CCGRID. IEEE, 2010, pp. 398–407.
[33] G. Fedak, H. He, and F. Cappello, “Bitdew: A data management and distribution service with multi-protocol file transfer and metadata abstraction,” J. Network and Computer Applications, vol. 32, no. 5, pp. 961–975, 2009.