
Hierarchical Data Distribution Scheme for Peer-to-Peer Networks

Shashi Bhushan*, M. Dave** and R. B. Patel***

* Dept. of Computer Science & Engineering, H.E.C. Jagadhari, Haryana, INDIA
** Dept. of Computer Engineering, National Institute of Technology Kurukshetra, Haryana, INDIA
*** Faculty of Engineering, Mody Institute of Technology & Science, Laxmangarh, Sikar, Rajasthan, INDIA

Abstract— In the past few years, peer-to-peer (P2P) networks have become an extremely popular mechanism for large-scale content sharing. P2P systems have focused on specific application domains (e.g., music files, video files) or on providing file-system-like capabilities. P2P is a powerful paradigm that provides a large-scale and cost-effective mechanism for data sharing, and a P2P system may be used for storing data globally. Can we implement a conventional database on a P2P system? A successful implementation of conventional databases on P2P systems is yet to be reported. In this paper we present a mathematical model for the replication of partitions and a hierarchical data distribution scheme for P2P networks. We also analyze the resource utilization and throughput of the P2P system with respect to availability when a conventional database is implemented over the P2P system with a variable query rate. Simulation results show that database partitions placed on peers with a higher availability factor perform better. Degradation index, throughput, and resource utilization are the parameters evaluated with respect to the availability factor.

I. INTRODUCTION

Napster [1] and Gnutella [2] are popular for music file sharing. They earned their popularity with unrestricted music distribution and with high-quality video distribution [3]. Other application areas include forming groups with common interests. These are some of the application areas where end users may use P2P networks [4]. It has been observed that traditional P2P networks handle static data, which does not change while being shared, e.g., MP3 files, video files, software, etc. The question which arises is: should a powerful computing paradigm such as P2P be limited to just file-sharing applications, or to sharing only static data? Given that P2P systems provide a large-scale and cost-effective mechanism for data sharing in general, can more complex data types also be shared via P2P systems? Can we implement a conventional database on a P2P system? Yes, we can, but to date conventional databases have not been fully exploited over P2P networks and a successful implementation of databases has not yet been reported. The major reasons identified are:

(a) Untrusted peers cannot be relied upon, as they may destroy or alter the data in the database stored at the various peers.
(b) The owner of a peer can use or misuse the private information stored at that peer with little effort.
(c) In P2P networks, peers are dynamic in nature. Each peer can join or leave the system at any time without any prior notice. Therefore data freshness is not guaranteed, i.e., the dynamic leaving and joining of peers may cause stale data on the peers.
(d) Availability of data at a particular time is not guaranteed.

It has been observed that, on the basis of the availability factor, peers can be divided into various categories [5, 6, 7, 8]. The first category is peers having high availability, approximately 0.9; these are connected through high-speed networks. Another category is workstations with availability of 0.7 to 0.8 and good bandwidth. The third category is home PCs with availability of 0.2 to 0.5, connected through poor bandwidth. This means that there is a wide variation in the availability of the peers. Data availability in the system is directly tied to the availability of the peers participating in the system. Given such wide variability, the system should not place database replicas blindly.

Data replication is a solution used in traditional distributed environments to increase data availability and improve system performance. But we cannot apply data replication directly in an environment where there are variations in data availability as well as in the availability of peers. More replicas are required when partitions are placed on peers with low availability, and fewer on highly available peers [9, 10].

Generally, a P2P application runs as a secondary application, as the primary work of the machine on which it is installed is different. There are no dedicated servers in P2P systems. So, depending upon the available CPU time, bandwidth, disk space, and input/output operations, the query response rate of a peer varies over time. The query response rate of a peer is directly proportional to the available CPU time on that peer and the bandwidth with which the peer is connected to the network, and is inversely proportional to its disk utilization and input/output operations.

Based on the above discussion, several challenges in implementing a conventional database on a P2P system are identified: data availability, query response rate, resource utilization, and the replica updation mechanism in P2P systems.
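The proportionality just described can be made concrete with a toy scoring function. This is only an illustration of the stated relationship, not part of the authors' scheme; the function name, the scaling constant, and the exact functional form are assumptions.

```python
def query_response_rate(cpu_time, bandwidth, disk_utilization, io_ops):
    """Toy estimate of a peer's query response rate.

    Follows the proportionality stated above: directly proportional to
    spare CPU time and link bandwidth, inversely proportional to disk
    utilization and pending input/output operations. The constant k and
    the "+1" smoothing terms are illustrative assumptions.
    """
    k = 1.0  # arbitrary scaling constant
    return k * (cpu_time * bandwidth) / ((1.0 + disk_utilization) * (1.0 + io_ops))

# Example: a lightly loaded, well-connected workstation vs. a busy home PC.
print(query_response_rate(cpu_time=0.8, bandwidth=100.0, disk_utilization=0.1, io_ops=5))
print(query_response_rate(cpu_time=0.2, bandwidth=2.0, disk_utilization=0.7, io_ops=50))
```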


In this paper we present a mathematical model of a hierarchical data distribution scheme for P2P networks. The scheme copes with the challenges identified above for implementing a conventional database on a P2P system, and it is analyzed through simulation. Throughput and resource utilization are considered as performance metrics. This paper is a step towards implementing conventional databases over P2P systems. We also analyze the degradation index, which is a measure of the amount of unavailability of a partition's replicas [11], and we analyze the effect of low availability, computational power, and the resource utilization factor of the peers on the throughput of the system. The rest of the paper is organized as follows. Section II reviews related work. Section III presents the system model. Section IV gives the data distribution and peer selection criteria. Section V explores the implementation and performance study, and finally the paper is concluded in Section VI.

II. RELATED WORK

In [12] file partitions are placed on various sites, but the availability factor of a site is not considered. In [13] the authors present a load balancing scheme in which load from heavy virtual servers is transferred to light virtual servers, but this scheme is implemented on static databases. It is not useful in a dynamic environment, because a dynamic environment requires periodic updates and replicas that do not hold updated information cannot respond correctly; the database partition must be stored on updated replicas. In [14] the Local Relational Model (LRM) is presented. In this scheme the authors assume that the set of all data in a P2P network consists of local (relational) databases. Each peer can exchange data and services with a set of other peers called acquaintances, and peers are fully autonomous in choosing their acquaintances. In this model the local relational database is stored at the peer, so the complete information stored at a peer may be the target of an intruder or hacker, and the peer can misuse the information stored there. In [15] the authors present a read-only cooperative file system for storage. This file system provides robustness, load balancing, and scalability, but data entries cannot be updated, i.e., the data is static in nature.

Decentralized replication algorithms are proposed in [16], which deal with storage allocation and replica placement. In this work the storage allocation process decides how many replicas can be produced for each file in an environment with limited storage space, and the replica placement procedure decides the set of peers that will store those replicas of each file to achieve a reasonable level of file availability. Various algorithms are presented for providing sufficient file availability; their success depends on the failure rate of peers in the network. Adaptive Probabilistic Replication (APRE) is presented in [17], a distributed protocol that automatically fine-tunes the replication ratio of each shared object according to the current demand for it. In this work expansion and contraction are defined: APRE couples lookup indices with an aging technique to identify query-concentrated areas within the P2P overlay. A dynamic replication scheme is presented in [18]; it is used in super-peer P2P architectures, takes into account the cost of searching a data item, and replicates the most frequently accessed data files based on their access probabilities. Periodic push-based replication (PPR) and on-demand replication (ODR) are proposed in that work. Pull-Then-Push (PtP) replication is presented in [19]: after a successful search, the requesting node enters a replicate-push phase in which it transmits copies of the item to its neighbors in order to obtain square-root (SR) replication. The problem with SR replication is that it requires knowledge of the query rate for each item. The solution to this problem in PtP is that, after each successful search, the item is copied to a number of nodes equal to the number of probes. The pull phase refers to searching for a data item; in order to reach SR replication, a number of replicas equal to the number of probed nodes is created.
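As a concrete reading of the PtP rule described above, the sketch below creates as many replicas as there were probed nodes after a successful pull phase. The peer representation and the blind-search loop are illustrative placeholders, not the protocol or API of [19].

```python
import random

def pull_then_push(item, peers):
    """Illustrative PtP-style replication: search (pull), then copy the
    item to as many other peers as the number of nodes probed, which
    approximates square-root replication without knowing query rates."""
    probed = []
    random.shuffle(peers)               # stand-in for an unstructured overlay walk
    for peer in peers:                  # pull phase: blind search
        probed.append(peer)
        if item in peer["store"]:
            break
    else:
        return 0                        # item not found, nothing to push

    # push phase: create one replica per probed node, where possible
    targets = [p for p in peers if item not in p["store"]][:len(probed)]
    for p in targets:
        p["store"].add(item)
    return len(targets)

peers = [{"id": i, "store": set()} for i in range(20)]
peers[7]["store"].add("db_partition_3")
print(pull_then_push("db_partition_3", peers))  # number of replicas created
```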

III. SYSTEM MODEL

In the presented scheme, the database may be partitioned horizontally, vertically, or both by a Master Peer (MP), depending upon the requirement. The MP decides the partition factor for the database. Based on this partition factor the MP subdivides a query into subqueries, and the partial results of the subqueries are combined to produce the result of the executed query. Each database partition is placed on a different set of replicas, and the information regarding the partitions and replicas is stored with the MP. The reason for placing the database partitions on different sets of replicas is security, as the owner of a peer may use or misuse the data stored at that peer. An authorized user of the database sends queries to the MP. Multiple replicas of each data partition exist, depending upon the load requirements and the target data availability factor of the replica set. To match the requirements of the end user, replicas are accessed in two ways: (a) when the query response rate of the replica set is greater than the query arrival rate to that set, multiple replicas may be accessed for a single query; the partial results can then be compared, and fresh data is obtained through this comparison. The advantage of this case is that it is easy to implement, e.g., with a quorum-based technique; the replicas are updated on a read-one, write-all basis. (b) When the query response rate of the replica set is lower than the query arrival rate to that set, multiple replicas are accessed for multiple queries to match the target response rate of the system. Particular replicas are authorized to respond to particular queries depending upon the access algorithm used; we use a prioritized method to access the replicas. In this case the replica updation overhead is comparatively higher. A request can be fulfilled by multiple replicas if and only if all responding replicas are up to date. In our replica access algorithm the replicas are assigned a priority, the replicas are updated according to their assigned priority, and the highest-priority replica is chosen to respond to an incoming subquery.

Assumptions:
(a) The technique may be used for any database model, but in this work we consider the relational model.
(b) Initially, all replicas are initialized from the Leader Peer (chosen among the set of replicas), and it is assumed that initially all replicas hold the same data, i.e., all replicas are synchronized with the Leader Peer.
(c) The prioritized method is used for selecting the peer that responds to an incoming subquery.
(d) When peer nodes are updated, priority is given to the responding peers over the peers that were not selected to respond to the subquery.
(e) A version number is maintained for each replica; whenever a replica updates itself, it automatically increments its version number.
(f) Each partition is stored at four replicas, arranged as shown in Figure 1.
(g) To store the replicas, geographically close peers are chosen, identified using the longest common prefix of their IP addresses.

Analysis of Scheme: Each peer can be in exactly one of two states, active or inactive. The state of the system is determined by the set RC of active peers. Each partition of the database must be present on at least one active peer among the set of all replicas of that partition; the system works as long as the last copy of each partition resides on an active peer. Due to churn, a peer may leave the system and rejoin it at any time, which affects the availability of the partition stored on that peer. A P2P system fails when the last replica of some partition becomes unavailable. The mathematical model for this problem is as follows. Let $P = \{p_1, p_2, p_3, \ldots, p_n\}$ be the set of all peers participating in the system, where $p_i$ is the $i$th peer. $DB = \{db_1, db_2, db_3, \ldots, db_m\}$ is the set of database partitions (the arrangement of the peers is shown in Figure 1), with

$$\bigcup_{i=1}^{m} db_i = DB$$

[Figure 1. Peer Arrangement: a query QR arrives at the Master Peers $MP_0, MP_1, \ldots, MP_{m-1}$; subqueries $Qr_0, Qr_1, \ldots, Qr_n$ are routed to the replica sets $R_{0,0}, \ldots, R_{0,n}$; $R_{1,0}, \ldots, R_{1,n}$; $\ldots$; $R_{m,0}, \ldots, R_{m,m}$.]

A query is subdivided by the MP into subqueries such that, when the union operation is performed on the partial results obtained from the subqueries (received from the sets of replicas), the result is equivalent to the result of the main query. These subqueries are routed to the sets of replicas holding the corresponding data:

$$Q = \bigcup_{i=1}^{n} Qr_i \quad \ldots (1)$$

where $Qr_i$ is subquery $i$ and $D_i$ is the data required by subquery $Qr_i$. The data item required by a subquery is available if and only if at least one replica of the corresponding partition is active at that time. This is defined as follows:

$$\forall Qr_i \rightarrow D_i \in \Big\{ \bigcup_{k=0}^{m} p_k \;\Big|\; p_k \in Rc_i \Big\} \quad \ldots (2)$$

where $m$ is the number of peers participating in the system, $Rc_i$ is the set of all active peers holding replicas of $db_i$, and $p_k$ is the $k$th peer in the set $Rc_i$. Here $R_i$ is the set of all peers holding replicas of $db_i$, whether they are active or inactive, and $RC = \{Rc_1, Rc_2, Rc_3, \ldots, Rc_n\}$ collects all the active peers. Thus $Rc_i \subseteq R_i \subset P$: the active peers are a subset of the total peers holding the replicas, which is in turn a subset of the set $P$ of all peers taking part in the system. The correctness condition of the system is that, for every partition of the database, there is at least one active peer holding that partition. Mathematically:

$$\forall db_i : \exists p_k \mid p_k \in Rc_i \text{ and } Rc_i \neq \{\phi\} \quad \ldots (3)$$

The system fails when any of the partitions goes down, i.e., when the set of active peers holding that partition becomes empty:

$$\exists db_i : \exists p_k \mid p_k \in R_i,\; p_k \notin Rc_i \text{ and } Rc_i = \{\phi\} \quad \ldots (4)$$

The probability of system failure increases with the rate at which peers leave the system. The degradation index may serve as a performance metric for this problem; it measures the amount of unavailability of the peers in $Rc_i$. We define the degradation index as

$$\text{Degradation Index} = 1 - \frac{\#\{\text{working peers in set } Rc_i\}}{\#\{\text{all peers in set } Rc_i\}} \quad \ldots (5)$$

This gives the fraction of peers in a particular set $Rc_i$ that are not working, relative to the total number of peers in that set.
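A minimal sketch of equations (3)-(5) follows, assuming each replica set is represented simply as a list of peers with an "active" flag. The data structures and function names are illustrative, not the authors' implementation.

```python
def degradation_index(replica_set):
    """Equation (5): fraction of non-working peers in a replica set Rc_i."""
    working = sum(1 for peer in replica_set if peer["active"])
    return 1.0 - working / len(replica_set)

def system_correct(replica_sets):
    """Correctness condition (3): every partition has at least one active
    replica; its violation corresponds to the failure condition (4)."""
    return all(any(peer["active"] for peer in rs) for rs in replica_sets)

# Four replicas per partition, as in assumption (f).
rc_1 = [{"id": k, "active": up} for k, up in enumerate([True, False, True, False])]
rc_2 = [{"id": k, "active": up} for k, up in enumerate([False, False, False, False])]

print(degradation_index(rc_1))       # 0.5 -> half of Rc_1 is unavailable
print(system_correct([rc_1, rc_2]))  # False -> partition db_2 has no active replica
```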



IV. DATA DISTRIBUTION AND PEER SELECTION CRITERIA

It is assumed that the database satisfies all the normal forms, so that no anomalies arise while dealing with it. Data placement is done by the MP, which is responsible for deciding the divide factor of database partitioning (e.g., horizontal, vertical, or both). The MP also takes care of the ACID properties of the database. The MP is chosen on the basis of the availability factor and the query response rate of the peer at that particular time. The MP holds complete information regarding the Leader Peers and Leader Shadows (in case of multiple Leader Shadows) holding the partitions. The partitions of the fields need not be of the same size, as the lengths/sizes of the different fields differ. The MP assigns a sequence number to every incoming query; this same sequence number identifies all the decomposed subqueries throughout the system. The Leader Shadow takes charge in case the Leader fails; otherwise it only monitors the incoming and outgoing queries. It also maintains a log, as the Leader does, which is useful if the Leader fails while some queries are still in the waiting queue. This second master also receives the acknowledgement messages from all the answering peers. It is assumed that only the authorized owner sends queries to the MP. The Leader sends an update message to all the replicas once it has received the acknowledgement from the answering peer and the MP commits. The Leader Peer is responsible for the serializability of the subqueries, and concurrent processes are handled by the Leader Peer. All peers synchronize with the Leader Peer when they rejoin the system after a failure, as the Leader Peer holds an updated replica of the partition. Simultaneously, the Leader Shadow maintains the log of incoming queries. A Leader Shadow is ready to take charge of the Leader Peer in case the Leader fails or leaves the system; the new Leader immediately chooses its own Leader Shadow based on the activation time and bandwidth of the peers. Each peer has a predefined availability factor, which is determined at the time of peer selection. The prioritized method is used to update all the replicas: the assigned priority of each replica is checked, and high-priority replicas are updated before comparatively low-priority ones. A peer is selected for storing a partition by the formula

$$W_1 \cdot AAT + W_2 \cdot BOP + W_3 \cdot SSOP + W_4 \cdot CP$$

where AAT (Average Activation Time) is the time for which a peer stays in the network, as a peer can leave or join at any time; a peer with a long stay time is better for storing a partition. BOP (Bandwidth of Peer) is the bandwidth with which the peer is connected to the network; a peer with good bandwidth is better for storing a partition. SSOP (Storage Space of Peer) and CP (Computation Power of the peer) are also important factors considered when choosing a peer out of a set of peers in the network. $\{W_1, W_2, W_3, W_4\}$ are constants through which we weight the parameters considered while selecting a peer; using these weights we can change the importance of a particular parameter when selecting a peer to store a partition, deciding the peer shadow, and so on. The trust of a peer may also be considered as one of the parameters, but in our case trust is not used as a performance metric.
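The selection formula above can be read as a simple weighted score used to rank candidate peers. The sketch below is one possible reading; the weight values and the normalised attribute scales are assumptions for illustration only, not values from the paper.

```python
def peer_score(peer, w1=0.4, w2=0.3, w3=0.2, w4=0.1):
    """Weighted selection score W1*AAT + W2*BOP + W3*SSOP + W4*CP from
    Section IV; weights and attribute scaling are illustrative."""
    return (w1 * peer["aat"]     # average activation time (normalised to [0, 1])
            + w2 * peer["bop"]   # bandwidth of peer (normalised)
            + w3 * peer["ssop"]  # storage space of peer (normalised)
            + w4 * peer["cp"])   # computation power of peer (normalised)

candidates = [
    {"id": "A", "aat": 0.9, "bop": 0.8, "ssop": 0.5, "cp": 0.7},
    {"id": "B", "aat": 0.4, "bop": 0.9, "ssop": 0.9, "cp": 0.9},
]
best = max(candidates, key=peer_score)
print(best["id"], round(peer_score(best), 3))  # the long-lived, well-connected peer wins
```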

V. IMPLEMENTATION AND PERFORMANCE STUDY

We have considered variation in the query response rate in our simulation and have taken two cases: (I) only the Leader Peer responds to an incoming subquery; (II) two peers, the Leader and the Leader Shadow, respond to the incoming subquery.
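To make the two cases concrete, the following Monte Carlo sketch estimates the fraction of subqueries that can be answered when only one replica may respond versus when two replicas may respond. The availability values, independence assumption, and trial count are assumptions for illustration, not the authors' simulation setup.

```python
import random

def answered_fraction(availability, responders, trials=100_000):
    """Fraction of subqueries answered when `responders` replicas of the
    partition may reply, each online independently with `availability`."""
    hits = sum(
        any(random.random() < availability for _ in range(responders))
        for _ in range(trials)
    )
    return hits / trials

for a in (0.3, 0.5, 0.7, 0.9):
    print(f"availability={a}: one responder={answered_fraction(a, 1):.2f}, "
          f"two responders={answered_fraction(a, 2):.2f}")
```

As expected from the results below, allowing a second responder raises the answered fraction markedly at low and moderate availability, and the gain shrinks as availability approaches 1.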

[Figure 2. Partition vs. Degradation Index with variation in availability factor: degradation index (0 to 0.8) plotted against availability (0 to 1).]

From the degradation graph in Figure 2, it is observed that the degradation index decreases as the availability of the peers increases. Peers having an availability of at least 0.7 are suitable to participate in the system. As availability varies, the throughput lies between 43% and 97% in the case of two peers responding to the subquery, whereas it lies between 33% and 53% in the case of one Leader Peer responding to the subqueries. The resource utilization varies from 0.5 to 0.85 in the first case and from 0.3 to 0.65 in the second case.

[Figure 3. Throughput vs. variation in availability factor: throughput (0 to 1) plotted against availability (0 to 1) for one master and two masters.]

We observe in Figure 3 that throughput increases when multiple peers respond to the subquery. Another observation is that the throughput of the system varies with the availability of the peers: it increases as the availability of the peers increases.


The utilization of the system resources increases when two peers respond to the subqueries, as shown in Figure 4. It is also observed that the utilization of resources varies with the availability factor of the peers.

[Figure 4. Utilization graph with variation in availability factor: resource utilization (0 to 1) plotted against availability (0 to 1) for one master and two masters.]

VI. CONCLUSION AND FUTURE WORK

We have presented a hierarchical scheme with which we can partition and place dynamic data over P2P networks. We analyzed the performance of the system at a variable query arrival rate and with variable availability of the peers. Two cases are considered to evaluate the performance of the system: in the first, only one peer among the set of replicas is chosen to respond to the subquery; in the second, two peers respond to the subqueries simultaneously. Our simulation demonstrates that the degradation index keeps decreasing as the availability factor increases, and with a high availability factor the variation in the degradation index is minimal. This means that peers with an availability factor higher than 0.7 are suitable to participate in the system. It is observed from the simulation that the throughput of the system increases with the availability of the peers and with the number of responding peers. It is also observed that resource utilization in a P2P system can be increased by increasing the number of responding peers in the system. We are extending this work to an environment where all the active replicas of a partition respond to the subquery, to improve the throughput and the utilization of resources. We will also try to develop a new methodology with which we can utilize the maximum resources of P2P networks.

REFERENCES

[1] Napster homepage, http://www.napster.com.
[2] Gnutella homepage, http://gnutella.wego.com.
[3] S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu, "What can databases do for peer-to-peer?", in Proceedings of the Fourth International Workshop on the Web and Databases (WebDB 2001), Santa Barbara, May 2001.
[4] R. Schollmeier, "A Definition of Peer-to-Peer Networking for the Classification of Peer-to-Peer Architectures and Applications", in Proceedings of the First International Conference on Peer-to-Peer Computing (P2P'01), Linkoping, Sweden, pp. 101-102, August 2001.
[5] R. Bhagwan, S. Savage, and G. Voelker, "Understanding availability", in Proceedings of IPTPS'03, LNCS, vol. 2735, Springer, Heidelberg, pp. 135-140, 2003.
[6] J. Tian, Z. Yang, and Y. Dai, "A Data Placement Scheme with Time-Related Model for P2P Storages", in Proceedings of the Seventh IEEE International Conference on Peer-to-Peer Computing, pp. 151-158, 2007.
[7] R. Bhagwan, S. Savage, and G. Voelker, "Replication strategies for highly available peer-to-peer storage systems", in Proceedings of FuDiCo: Future Directions in Distributed Computing, June 2002.
[8] S. Saroiu, P. Gummadi, and S. Gribble, "A Measurement Study of Peer-to-Peer File Sharing Systems", in Proceedings of Multimedia Computing and Networking (MMCN'02), San Jose, CA, USA, 2002.
[9] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, "OceanStore: An Architecture for Global-Scale Persistent Storage", ACM SIGPLAN Notices, vol. 35, no. 11, pp. 190-201, November 2000.
[10] J. Tian and Y. Dai, "Understanding the Dynamic of Peer-to-Peer Systems", in Proceedings of the 6th International Workshop on Peer-to-Peer Systems, California, November 26-30, 2007.
[11] F. A. Schreiber, "Notes on Real-Time Distributed Database Systems Stability", in Proceedings of the 5th Jerusalem Conference on Next Decade in Information Technology, Jerusalem, Israel, pp. 560-564, 22-25 October 1990.
[12] H. A. Grebla and C. Cenan, "Distributed Databases Replication - A Game Theory?", in Proceedings of the Seventh Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC'05), Timisoara, Romania, p. 4, 25-29 September 2005.
[13] A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica, "Load Balancing in Structured P2P Systems", in Proceedings of ACM 2006, 63(3), pp. 217-240, 2006.
[14] Dr. Ing, "HiPeer: An Evolutionary Approach to P2P Systems", PhD Thesis, Berlin, 2006.
[15] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica, "Wide-area Cooperative Storage with CFS", in Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (SOSP 2001), Banff, Alberta, Canada, pp. 202-215, 2001.
[16] W. K. Lin, C. Ye, and D. M. Chiu, "Decentralized Replication Algorithms for Improving File Availability in P2P Networks", in Proceedings of the Fifteenth IEEE International Workshop on Quality of Service, pp. 29-37, 2007.
[17] D. Tsoumakos and N. Roussopoulos, "An Adaptive Probabilistic Replication Method for Unstructured P2P Networks", Lecture Notes in Computer Science, vol. 4275, Springer, Berlin/Heidelberg, pp. 480-497, 2006.
[18] S. Rajasekhar, B. Rong, K. Y. Lai, I. Khalil, and Z. Tari, "Load Sharing in Peer-to-Peer Networks using Dynamic Replication", in Proceedings of the 20th International Conference on Advanced Information Networking and Applications, vol. 1, pp. 1011-1016, 2006.
[19] E. Leontiadis, V. V. Dimakopoulos, and E. Pitoura, "Creating and Maintaining Replicas in Unstructured Peer-to-Peer Systems", Lecture Notes in Computer Science, vol. 4128, Springer, Berlin/Heidelberg, pp. 1015-1025, 2006.
