Distributed Leader Election in P2P Systems for Dynamic Sets

Dominic Heutelbeck and Matthias Hemmje
University of Hagen, Computer Science Dept.
[email protected], [email protected]

Abstract The collection of and search for location information is a core component in many pervasive and mobile computing applications. In distributed collaboration scenarios this location data is collected by different entities, e.g., users with GPS enabled mobile phones. Instead of using a centralized service for managing this distributed dynamic location data, we use a peer-to-peer data structure, the so-called distributed space partitioning tree (DSPT). A DSPT is a general use peer-to-peer data structure, similar to distributed hash tables (DHTs), that allows publishing, updating of, and searching for dynamic sets. In this paper we present an efficient distributed leader election algorithm that can be used in DSPTs to eliminate redundant network traffic.

1. Introduction

In this paper we discuss certain problems in distributed space partitioning trees (DSPTs), which solve the problem of distributed location data management. DSPTs are peer-to-peer data structures that allow publishing, updating of, and searching for dynamic geometrical objects. DSPTs are described in depth in [2]. When realizing DSPTs, the problem of redundant replies occurs, as different peers in the system are able to report the same match to a given query, potentially leading to an overload of the peer that issued the query originally. In this paper, we present a distributed leader election algorithm that solves this problem without additional associated communication cost and also provides a fair workload distribution.

2. Problem Statement and Related Work

The fundamental idea of structured peer-to-peer systems like distributed hash tables (DHTs) (e.g., Chord [6] or the content-addressable network (CAN) [5]) and distributed space partitioning trees (DSPTs) (like RectNet [2]) is to partition the search space into clusters and to assign a peer to each of these clusters. The search space S of a distributed space partitioning tree (DSPT), the so-called context space, is an n-dimensional interval. Instead of using a hash value, DSPTs use a subset of the search space, the so-called location. For practical reasons, locations are limited to elements from a so-called location set. A location set L is a subset of the power set P(S), i.e., a set of subsets of S, where all elements of L are measurable, have a compact representation, and for which certain efficient set operations are available. A DSPT supports operations to publish, update, and remove so-called located objects (l_o, o), l_o ∈ L, where o is an application-specific object. In addition, it supports queries to look up all located objects whose location intersects, is contained by, or is equal to a given query location q ⊆ S. A more in-depth description of DSPTs, locations, and located objects can be found in [2].

In a DSPT, a located object (l_o, o) is stored by the peers whose cluster contains or intersects l_o. It is important to note that this is a key difference between the current DSPT realization presented in [2] and DHTs. In DHTs, a tuple is identified by the hash value of its key, which is a single element of the search space. In DSPTs, a tuple is identified by l_o, which may be an arbitrary subset of the search space. As a consequence, compared to DHTs it is not possible to assign a tuple to a unique peer, as there does not necessarily exist a single peer whose cluster completely contains l_o. Thus, the located object is replicated by all peers whose cluster intersects l_o. The query location q of a query operation is also an arbitrary subset of the search space, and the different queries are realized by routing a query message to all peers whose cluster intersects q.
Each peer receiving the query message can then locally check whether it stores a located object that matches the query and directly send a reply message to the peer that originated the query. This is the point where the so-called redundant reply problem occurs. Figure 1 illustrates the situation for an intersection query. Here, the search space is a 2-dimensional interval which is partitioned into a number of clusters, and

Figure 1. The problem of redundant replies.

a single peer is assigned to each of the clusters. Each of the peers stores a replicated version of a located object with the location l_o, and each peer whose cluster also intersects the query location q receives a query message to resolve the intersection query. In this case, all peers whose cluster intersects l_o ∩ q can individually see that the stored object matches the incoming query and send a reply to the peer originating the query. Of course, it would be sufficient if one of the peers sent a reply reporting the matching located object, as additional messages reporting the same matching object to the same peer are redundant. This is called the redundant reply problem.

This problem is of central importance for a DSPT implementation. A high number of redundant replies can easily saturate the network connection of a peer; the result can be the same as that of a denial-of-service attack. The redundant reply problem also occurs in other places: in the DSPT realization presented in [2], called RectNet, it also occurs in routing optimization and when realizing higher-level DSPT-based services.

Our solution to the redundant reply problem is to select a single peer that reports a located object matching a query. This selection should be fair and scalable. In all cases, the selection has to be made by a set of peers whose clusters intersect a certain area, called the selection area. In the example in figure 1, the selection area is l_o ∩ q. The basic approach of a DSPT is to distribute workloads by assigning clusters of variable size to peers. Thus, a fair selection means that the probability to select a peer should correlate with the part of the selection area covered by the cluster of that peer. For a fair selection of the peer for the elimination of redundant query replies, one also has to consider that a single query can match an arbitrary number of objects.
When a large number of objects match a query, reporting these matches causes a considerable amount of traffic. Thus, in this case, a fair distribution of the load means that the selection of the replying peer has to be performed independently for each match. The following requirements summarize the properties of an ideal selection algorithm:

a) All peers whose cluster intersects a selection area A should consistently select one peer among them to send a message.

b) The selection should not cause additional traffic.

c) The selection should be fair and scalable, i.e., the probability that the peer of cluster C is selected is approximately |A ∩ C|/|A|, where |x| denotes the Lebesgue measure of x.

d) When processing a query, the selection of the peer should be made independently for each matching object.

In this paper, we present different approaches to solve this problem, taking into account different location sets. The redundant reply problem is original to DSPTs. In DHTs, the problem does not occur, as the keys are points in the search space, and by partitioning the search space it is possible to identify a unique peer that is responsible for each key. In addition, the prerequisites here differ significantly from those of the consensus and leader election algorithms of classical distributed computing as discussed in [4]: the different peers do not know the number and addresses of the peers participating in the leader election, and the leader election happens after all participating peers have received certain messages (e.g., query messages).

3. Interval Location Sets

For some applications, location sets do not need to be able to represent complex geometrical objects. The simplest location set is the set of points in S. In this case, most of the problems outlined above do not occur, since a point in S is always contained in the cluster of exactly one peer, and thus no selection is necessary. If a DSPT is expected to consist of many weak hosts, i.e., hosts with limited RAM and a weak CPU, a location set has to be selected for which the geometrical operations are simple to implement and efficient to execute. Therefore, the set L_I of n-dimensional intervals in S is a good candidate. An interval [a, b] with a = (a_1, ..., a_n), b = (b_1, ..., b_n) ∈ R^n is defined as {x = (x_1, ..., x_n) ∈ S : a_i ≤ x_i ≤ b_i, i = 1, ..., n}. Geometrically, intervals in R^n are cuboids aligned with the coordinate axes. By limiting the location set to L_I, the locations of real-world objects can only be modeled with limited precision, but the representation is very compact and the geometrical algorithms are efficient and simple.
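To make the claim about simple and efficient set operations concrete, the two operations used throughout this paper, intersection and Lebesgue measure, can be sketched for L_I as follows (a minimal illustration, not the authors' implementation; intervals are represented as pairs of coordinate tuples):

```python
# Sketch of the set operations on L_I: n-dimensional intervals as
# (a, b) pairs of coordinate tuples.

def intersect(box1, box2):
    """Intersect two n-dimensional intervals [a, b]; return None if empty."""
    a = tuple(max(x, y) for x, y in zip(box1[0], box2[0]))
    b = tuple(min(x, y) for x, y in zip(box1[1], box2[1]))
    if any(ai > bi for ai, bi in zip(a, b)):
        return None  # the intervals do not overlap
    return (a, b)

def volume(box):
    """Lebesgue measure of an interval, i.e., the product of its side lengths."""
    a, b = box
    v = 1.0
    for ai, bi in zip(a, b):
        v *= bi - ai
    return v
```

Both operations are a handful of comparisons and multiplications per dimension, which is what makes L_I attractive for weak hosts.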

3.1 Point Selection for Intervals

Selecting a peer for intervals is simple. The idea is to select it implicitly before updating the object. The peer sending a query for an area q does not know the resulting selection area in advance. The only thing known is that the selection area is an interval, because the intersection of two intervals is again an interval. The peer sending the query randomly selects a point p in the unit cube [0, 1]^n. This point is included in the query message. The peers finding a matching located object (l_o, o) can now calculate the interval A := l_o ∩ q. With a simple affine transformation, all points of the unit cube are mapped to points in A. The peers use this transformation to map p to a point p′ ∈ A. Finally, only the peer of the cluster containing p′ sends a reply reporting the matching object. The probability for a peer to send the reply is equal to the part of the selection area covered by its cluster. By using the transformation from the unit cube to A, it is possible to agree on a random point in A, even if the host does not know the selection area in advance.

Requirements a) and c) are fulfilled by this algorithm. Requirement b) is fulfilled in the sense that no additional messages are required to perform the selection, but additional traffic is generated by the random point sent along with the query. Even though the intersections between q and the locations of different objects are not the same, these locations may intersect the clusters of the same peers. The transformation of p into these intersections will result in points that lie approximately in the same region, and thus it is likely that the different matches are reported by the same peers. Therefore, requirement d) is not completely fulfilled: the selections of the reporting peer for the different matching objects are not independent from each other. The following variation of this approach fulfils all four requirements.
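The implicit selection described above can be sketched as follows (an illustrative sketch; the helper names are assumptions, and `contains` stands in for the peer's local cluster test):

```python
import random

def random_unit_point(n):
    """The querying peer draws a random point p in the unit cube [0, 1]^n."""
    return tuple(random.random() for _ in range(n))

def map_to_interval(p, box):
    """Affinely map a unit-cube point into the interval box = (a, b),
    yielding the point p' in A = l_o ∩ q."""
    a, b = box
    return tuple(ai + pi * (bi - ai) for pi, ai, bi in zip(p, a, b))

def contains(box, point):
    """Check whether an interval (e.g., a peer's cluster) contains a point."""
    a, b = box
    return all(ai <= xi <= bi for xi, ai, bi in zip(point, a, b))
```

A candidate peer computes A := l_o ∩ q, maps the point p shipped with the query into A, and replies only if its own cluster contains the result p′.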

3.2 Hash Based Point Selection for Intervals

The number of located objects that could match a query can be very high. Thus, it does not make sense to send a large set of random points from the unit cube along with each query just to ensure that the reporting is distributed fairly among the peers; this would be a waste of resources. How can the peers come to an independent agreement for each match? The solution presented here makes full use of the common knowledge of the peers that can report the same match. Let q be the query area and l_o the location of a matching object. All peers whose cluster intersects the area A := q ∩ l_o are candidates to be selected, and each of them knows l_o, q, and thus A. Since all peers in the DSPT regularly exchange location data across the network, they share a common data format to represent locations. Each query message contains the network address a of the sender of the query and a query ID i. Now each candidate can calculate

a cryptographic hash, e.g., SHA-1, of the concatenation of a, i, and A. For example, the result of SHA-1(a, i, A) is a 160-bit string. The hash can be cut into n blocks of 160/n bits. Let v(k) be the integer value in {0, ..., 2^{160/n} − 1} represented by the k-th block of the hash. Then

p = 1/(2^{160/n} − 1) · (v(1), ..., v(n))

is a point in the unit cube [0, 1]^n. For each result of the same query, the hash is a new random bit string. Thus, p is also a new random point for each matching object. The number of bits returned by SHA-1 is sufficient for typical applications in a 2- or 3-dimensional context space. After calculating p, each candidate maps it to a point p′ ∈ A, as described above. Then the candidate can check whether its cluster contains p′ and decide whether it has to send a reply or not. This algorithm fulfils all four requirements: no additional traffic is generated, and the selection is performed independently for each matching object.
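The derivation of p can be sketched with Python's standard hashlib (a sketch, not the authors' code; the byte encoding of a, i, and A is an assumption, since any format shared by all peers would do):

```python
import hashlib

def hash_point(addr, query_id, area, n):
    """Derive a deterministic point in [0, 1]^n from the common knowledge
    (sender address a, query ID i, selection area A) via SHA-1."""
    data = f"{addr}|{query_id}|{area}".encode("utf-8")
    digest = hashlib.sha1(data).digest()    # 160 bits of pseudo-random data
    bits = 160 // n                         # block size per coordinate
    value = int.from_bytes(digest, "big")
    coords = []
    for k in range(n):
        # extract the k-th block v(k) and scale it into [0, 1]
        block = (value >> (bits * (n - 1 - k))) & ((1 << bits) - 1)
        coords.append(block / ((1 << bits) - 1))
    return tuple(coords)
```

Every candidate peer computes the same point from the same inputs, so no coordination messages are needed, and a different matching object yields a different A and hence an independent point.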

4. Complex Location Sets

The algorithms presented in the previous section only work for the simple location set L_I of intervals. For location sets that are more complex, it is more difficult to select a point in the location uniformly at random. This section develops an approximate solution that is efficient for more complex location sets. The idea is to model the location by a sufficiently fine-grained n-dimensional regular interval partitioning, i.e., a grid, in S. These grid intervals can be numbered, and it is straightforward to select one of them with a probability proportional to the amount of the location covered by it. Of course, it is not efficient to calculate the complete set of these intervals for each selection. The following algorithm illustrates the idea for an efficient random selection in a 1-dimensional context space; the approach can easily be translated to the n-dimensional case. The algorithm is based on a randomized binary search on the regular interval partitioning of S.

Definition 4.1 The regular interval partitioning J^d of S ⊆ R with depth d ∈ N_0 is defined as J^d := {J^(d,k) : k = 0, ..., 2^d − 1}, where

J^(d,k) := [a + k · (b − a)/2^d, a + (k + 1) · (b − a)/2^d), k = 0, ..., 2^d − 1.

[Figure 2 here: an example run on S = [0, 1) showing the intervals J_0 through J_3 of a BIIS calculation, the random values r_1 = 0.8, r_2 = 0.1, r_3 = 0.4, and the branch probabilities 0.6/0.4, 0.0/1.0, and 0.625/0.375; the finally selected interval satisfies J_3 ⊆ P_i.]

Figure 2. An example calculation of BIIS.

In the 1-dimensional case, the clusters are intervals in S. The clusters P_i of the m peers in a DSPT for S define a partitioning of S into half-open intervals, i.e.,

S = ⋃_{i=1}^{m} P_i, and P_i ∩ P_j = ∅ for 1 ≤ i, j ≤ m and i ≠ j.

The following algorithm is executed by a peer of cluster P_i ⊆ S when it has to decide whether it is the peer that is selected, e.g., whether it has to report a matching located object for a selection area A. The parameter d denotes the precision of the calculation, i.e., the depth of the interval partitioning. The algorithm returns true if the peer of P_i is selected.

Algorithm 4.1 BIIS (Basic Interval InterSection)
Input: S = [a, b), P_i the cluster of a peer, A ⊆ S, d ∈ N_0.
Output: true or false.
Algorithm:
  j := 0
  J_0 := S
  δ_0 := b − a
  while (j < d) {
    j := j + 1
    δ_j := δ_{j−1}/2
    l := lower bound of J_{j−1}
    u := upper bound of J_{j−1}
    p_j := |[l, l + δ_j) ∩ A| / |A ∩ J_{j−1}|
    r_j := random value in (0, 1]
    if (r_j ≤ p_j) J_j := [l, l + δ_j)
    else J_j := [l + δ_j, u)
  }
  if ((J_j ∩ P_i) ≠ ∅) return true
  else return false

In each step, the algorithm randomly chooses one half of the interval J_{j−1} to become J_j, based on the size of the part of A intersecting each half. The set of all possible values of J_j is J^j.

Figure 2 illustrates an example calculation of BIIS for P_i, A, d = 3, and a context space S = [0, 1). The number under each interval denotes the probability to select it in the next step. The probability to reach the interval selected in this example run is 0.4 · 1.0 · 0.625 = 0.25. This equals |A ∩ J_3|/|A| = 0.125/0.5 = 0.25. As J_3 ∩ P_i ≠ ∅, the algorithm returns true.

Lemma 4.1 The probability P(J^(j,k)) that J^(j,k) is selected as J_j after calculating step j equals |J^(j,k) ∩ A|/|A|.

Proof: The proof is done by induction. For j = 1, the probability P(J^(1,0)) that J^(1,0) is selected is equal to p_1:

P(J^(1,0)) = |[l, l + δ_1) ∩ A| / |A ∩ J_0| = |J^(1,0) ∩ A| / |A|,

since J_0 = S and [l, l + δ_1) = J^(1,0). If J^(1,0) is not selected, J^(1,1) is selected, i.e.,

P(J^(1,1)) = 1 − P(J^(1,0))
           = 1 − |J^(1,0) ∩ A| / |A|
           = (|A| − |J^(1,0) ∩ A|) / |A|    (1)
           = |J^(1,1) ∩ A| / |A|.           (2)

Equation (1) equals (2) because J^(1,0) and J^(1,1) are a partitioning of S. Thus, the induction hypothesis holds for j = 1. For the induction step, assume that the induction hypothesis holds for j − 1, i.e.,

P(J^(j−1,⌊k/2⌋)) = |J^(j−1,⌊k/2⌋) ∩ A| / |A|.

To reach J^(j,k) after step j, the algorithm has to choose J^(j−1,⌊k/2⌋) in step j − 1. Then it will select J^(j,k) with probability

|J^(j,k) ∩ A| / |J^(j−1,⌊k/2⌋) ∩ A|.

Thus, P(J^(j,k)) is the product of these probabilities:

P(J^(j,k)) = (|J^(j,k) ∩ A| / |J^(j−1,⌊k/2⌋) ∩ A|) · P(J^(j−1,⌊k/2⌋))
           = (|J^(j,k) ∩ A| / |J^(j−1,⌊k/2⌋) ∩ A|) · (|J^(j−1,⌊k/2⌋) ∩ A| / |A|)
           = |J^(j,k) ∩ A| / |A|. □
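A 1-dimensional sketch of Algorithm 4.1 in Python (not the authors' code; intervals are (lo, hi) pairs, and the rng parameter stands in for the peer's source of random values r_j):

```python
import random

def length(iv):
    """Lebesgue measure of a 1-dimensional interval; None denotes the empty set."""
    return 0.0 if iv is None else max(0.0, iv[1] - iv[0])

def isect(iv1, iv2):
    """Intersection of two intervals; None if empty."""
    lo, hi = max(iv1[0], iv2[0]), min(iv1[1], iv2[1])
    return (lo, hi) if lo < hi else None

def biis(S, P_i, A, d, rng=random):
    """Return True iff the peer of cluster P_i is selected for selection area A."""
    J = S
    delta = S[1] - S[0]
    for _ in range(d):
        delta /= 2.0
        l, u = J
        lower = (l, l + delta)
        # probability of descending into the lower half:
        # p_j = |[l, l + delta) ∩ A| / |A ∩ J|
        p = length(isect(lower, A)) / length(isect(A, J))
        J = lower if rng.random() < p else (l + delta, u)
    return isect(J, P_i) is not None
```

Run repeatedly, the frequency of True approaches |P_i ∩ A|/|A|, as Lemma 4.1 and Theorem 4.1 below predict.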

Theorem 4.1 The probability P_BIIS := P(BIIS(P_i, A, d) returns true) for a cluster P_i, a selection area A, and a depth d with |S|/2^d ≤ |P_i| for all i is bounded as follows:

|P_i ∩ A| / |A| ≤ P_BIIS ≤ |P_i ∩ A| / |A| + |S| / (2^{d−1} · |A|).

Proof: Let J^j := {J^(j,0), ..., J^(j,2^j−1)} and J* := {J ∈ J^d : J ∩ P_i ≠ ∅}. The algorithm returns true when the selected interval J_d intersects P_i, thus

P(BIIS(P_i, A, d) returns true) = P(the algorithm selects a J ∈ J*) = Σ_{J ∈ J*} P(J is selected) = Σ_{J ∈ J*} |J ∩ A| / |A|.

In the best case, ⋃_{J ∈ J*} J = P_i. This leads to the following lower bound:

P(the algorithm selects a J ∈ J*) ≥ |P_i ∩ A| / |A|.

If ⋃_{J ∈ J*} J does not exactly match P_i, the J ∈ J* containing the borders of P_i may be selected. Each of these is selected with probability

|J ∩ A| / |A| ≤ |J| / |A| = |S| / (2^d · |A|), since |J| = |S| / 2^d.

As there are at most two such border intervals, the following upper bound can be given:

P(the algorithm selects a J ∈ J*) ≤ |P_i ∩ A| / |A| + 2 · |S| / (2^d · |A|) = |P_i ∩ A| / |A| + |S| / (2^{d−1} · |A|). □

Theorem 4.1 shows that for increasing depth d the probability of the algorithm to return true converges toward |P_i ∩ A|/|A|, which is exactly the value from requirement c). The algorithm does not require any additional messages to be exchanged, thus it fulfils requirement b). BIIS does not meet requirement a), i.e., different peers do not select the same J_j, since every peer independently makes random decisions. To reach an agreement, the hashing approach from section 3 is used. For each decision a peer has to make, it can calculate a cryptographic hash based on the query and the matching object, or on the update message. For all peers participating in the selection, the hash value is the same. It can now be used in two ways to generate a consistent sequence of random numbers in [0, 1] for each peer. The first one is that all hosts agree on an algorithm to generate pseudo-random numbers; in this case, the hash is used as the seed of the random number generator. The second alternative is to interpret a sequence of blocks of the hash as a sequence of random numbers, starting again from the beginning if all bits are used. Both approaches ensure that the participating peers make the same selection, and both are appropriate for practical applications.

However, even when all peers agree on the selection of the same J_j, BIIS does not guarantee that exactly one peer is selected. In the rare case that a J_j is selected that intersects the border between two adjacent clusters, i.e., two blocks P_i, P_{i+1} ∈ P, the algorithm returns true for both of them. Because of the size of J_j, this happens with a probability of less than |S|/(2^d · |A|). The tie can easily be broken by only returning true for the peer whose cluster's upper bound intersects the selected J_j. In this way the algorithm also fulfils requirements a) and d).

In an actual implementation of the algorithm, using finite-precision floating-point arithmetic implies a value for d. For a floating-point representation, there exists a value ε that is the smallest non-zero number representable by the datatype. The sequence δ_j converges towards zero; thus for some j, δ_j < ε holds, and the variable storing δ_j becomes zero. Then, with probability 1/2 each, one can choose J_j to become either {l} or {u} to terminate the algorithm. As dividing δ_j by two equals a right shift in the bit representation, the implicit value of d is proportional to the number of bits used to represent δ_j. This leads to algorithm 4.2. The conditions in the while loop stop the calculation as soon as it is clear that the selected area will be either completely inside or completely outside of P_i. Termination is ensured by the fact that δ_j becomes zero when calculating with finite-precision floating-point numbers: in that case, J_j becomes a point, thus it will be either inside or outside of P_i and trigger the termination condition of the while loop. The impact of these optimizations is highly dependent on the chosen location set and the nature of the actual locations used.

Here, IIS is only described for the 1-dimensional case, but the approach can easily be translated into higher dimensions. The core modification for the n-dimensional case is how to calculate the J_j: in step j, J_{j−1} is cut in half at a hyperplane in the (j mod n)-th dimension. In this way, J_j remains an n-dimensional interval. Figure 3 illustrates how IIS works in the plane using the example situation from figure 1. In step a), the peer of P_i receives the query q. Then in step b), the peer starts algorithm IIS with J_0 = S. The dashed line outlines J_0, and the dotted line indicates the two sub-intervals that could become J_1. Each of them contains half of the selection area A and will become J_1 with probability 0.5. The calculation continues in c). Finally, in step d), J_2 is selected. As J_2 ∩ A ⊆ P_i, the while loop of IIS terminates and the algorithm returns true. The peer of P_i reports the match.
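The first variant, seeding a shared pseudo-random number generator with the hash, can be sketched as follows (the encoding of the shared data is an assumption; Python's random.Random accepts bytes as a seed):

```python
import hashlib
import random

def shared_random_sequence(addr, query_id, area, count):
    """Derive the same sequence r_1, r_2, ... on every participating peer
    from the common knowledge (sender address, query ID, selection area)."""
    seed = hashlib.sha1(f"{addr}|{query_id}|{area}".encode("utf-8")).digest()
    rng = random.Random(seed)  # identical seed => identical sequence everywhere
    return [rng.random() for _ in range(count)]
```

Because every peer derives the seed from the same shared data, all peers draw the same random values and therefore make the same sequence of halving decisions.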

Algorithm 4.2 IIS (Interval InterSection)
Input: S = [a, b), P_i a block of an interval partition P, A ⊆ S.
Output: true or false.
Algorithm:
  j := 0
  J_0 := S
  δ_0 := b − a
  while (¬((J_j ∩ A) ⊆ P_i) ∧ ((J_j ∩ A ∩ P_i) ≠ ∅)) {
    j := j + 1
    δ_j := δ_{j−1}/2   (* floating-point operation! *)
    l := lower bound of J_{j−1}
    u := upper bound of J_{j−1}
    if (δ_j > 0) then {
      p_j := |[l, l + δ_j) ∩ A| / |A ∩ J_{j−1}|
      with probability p_j do J_j := [l, l + δ_j)
      else J_j := [l + δ_j, u)
    } else {
      with probability 1/2 do J_j := {l}
      else J_j := {u}
    }
  }
  if ((J_j ∩ A) ⊆ P_i) then return true
  else return false
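Algorithm 4.2 can be sketched in Python for the 1-dimensional case (a sketch under assumptions: clusters are modeled as half-open (lo, hi) pairs, a degenerate pair (x, x) models the point {x}, and rng stands in for the hash-derived shared random sequence):

```python
import random

def length(iv):
    return 0.0 if iv is None else iv[1] - iv[0]

def isect(a, b):
    """Intersection of half-open intervals [lo, hi); None if empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

def iis(S, P_i, A, rng=random):
    """Return True iff the peer of cluster P_i is selected for area A;
    the depth d is implicit in the floating-point precision of delta."""
    J = S
    delta = S[1] - S[0]
    while True:
        JA = isect(J, A) or J        # once J collapses to a point, use J itself
        inside = P_i[0] <= JA[0] and JA[1] <= P_i[1]   # J ∩ A ⊆ P_i
        outside = isect(JA, P_i) is None               # J ∩ A ∩ P_i = ∅
        if inside or outside:
            return inside            # decided: fully inside or fully outside
        delta /= 2.0                 # floating-point operation!
        l, u = J
        if delta > 0.0:
            lower = (l, l + delta)
            p = length(isect(lower, A)) / length(isect(J, A))
            J = lower if rng.random() < p else (l + delta, u)
        else:
            # delta underflowed to zero: J collapses to one of its endpoints
            J = (l, l) if rng.random() < 0.5 else (u, u)
```

Note how the loop often decides after very few halvings, e.g., immediately when A is already completely inside or completely outside P_i; this is the heuristic evaluated in section 5.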

[Figure 3 here: four panels a)–d) of the plane from figure 1, showing the cluster P_i, the location l_o, the query q, and the hatched selection area A := l_o ∩ q. Successive cuts produce J_0, then J_1 (branch probabilities 0.5/0.5) and J_2 (branch probabilities 0.9/0.1), until J_2 ∩ A ⊆ P_i.]
Figure 3. Algorithm IIS in the plane.

5. Evaluation

From the perspective of communication costs, IIS is an ideal algorithm, as no additional communication is necessary to perform the leader election. The local computation time depends on the precision of the approximation of the probabilities, i.e., the maximal depth of the computation. In addition, the location set has a major impact on the computation time. However, the heuristic of IIS minimizes the number of necessary iterations. First experiments with real-world data indicate the effectiveness of the heuristic. For example, in our experiments we used the geometry of approximately 8000 ATKIS [1] objects covering the urban region of Hagen in Germany. This data set includes objects like lakes, rivers, streets, forests, housing areas, etc. The average number of iterations needed for the algorithm to terminate was about 12 when using a DSPT with 1000 peers covering the entire area.

6. Conclusion

DSPTs are relevant for many interesting applications in mobile computing, as discussed in [3]. In this paper we described the algorithm IIS to solve the redundant reply problem in DSPTs. As the algorithm makes no assumptions about the way the search space is partitioned by the DSPT, we expect that it will be possible to apply the algorithm in future DSPT implementations, going beyond the DSPT realization RectNet [2] by the authors of this paper. The algorithm presented in this paper solves the redundant reply problem without introducing new communication costs to the DSPT. We also conducted experiments that indicate that the heuristic used in IIS significantly improves the running time of the algorithm for real-world location data.

References

[1] Arbeitsgemeinschaft der Vermessungsverwaltungen der Länder der Bundesrepublik Deutschland (AdV). AdV Homepage. www.adv-online.de.
[2] D. Heutelbeck. Distributed Space Partitioning Trees and their Application in Mobile Computing. PhD thesis, FernUniversität in Hagen, 2005.
[3] D. Heutelbeck and M. Hemmje. A Research Platform for Location-Based Applications. In Proc. of 2nd GI/ITG KuVS Fachgespräch Ortsbezogene Anwendungen und Dienste, 2005.
[4] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
[5] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content Addressable Network. Technical Report TR-00-010, Berkeley, CA, 2000.
[6] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In Proceedings of the 2001 ACM SIGCOMM Conference, pages 149–160, 2001.