Keyword Fusion to Support Efficient Keyword-based Search in Peer-to-Peer File Sharing

Lintao Liu

Kyung Dong Ryu

Dept. of Computer Science & Engineering, Arizona State University, Tempe, AZ 85287, USA
{lintao.liu, kdryu}@asu.edu

Kang-Won Lee

IBM T.J. Watson Research Center, Hawthorne, NY 10532, USA
[email protected]

Abstract—Peer-to-Peer (P2P) computing has become a popular distributed computing paradigm thanks to the abundant computing power of modern desktop workstations and widely available network connectivity via the Internet. Although P2P file sharing provides a scalable alternative to conventional server-based approaches, providing efficient file search in a large-scale dynamic P2P system remains a challenging problem. In this paper, we propose a set of mechanisms to provide scalable keyword-based file search in DHT-based P2P systems. Our proposed architecture, called Keyword Fusion, adaptively unburdens peers overloaded with excessive storage consumption and reduces network bandwidth consumption by transforming users’ queries to contain more focused search terms. Through trace-driven simulations, we show that Keyword Fusion reduces the storage consumption of the top 5% most loaded nodes by 50% and decreases search traffic by up to 81% even in modest scenarios.

Keywords—P2P file sharing, keyword-based file search, distributed hash table (DHT)

I. INTRODUCTION

Peer-to-Peer (P2P) computing has become a popular distributed computing paradigm thanks to the abundant computing power of modern desktop workstations and widely available network connectivity via the Internet. In recent years, P2P file sharing applications, such as Napster, Gnutella, and KaZaa, have attracted a huge population of users and are used by millions of users every day. Although P2P file sharing provides a scalable alternative to conventional server-based approaches, providing efficient and robust file search in P2P systems has been a key challenge. The problem is especially difficult in a large-scale dynamic P2P system consisting of hundreds of thousands of peers, where peers enter, exit, or fail in an unpredictable manner.

Several approaches have been taken to provide effective search for desired files in such a large-scale P2P network. The directory-based approach adopted by Napster [1] uses file location information stored at a central directory server. Despite its simplicity and ease of management, this approach suffers from poor scalability and presents a single point of failure. The flooding-based approach was designed for fully decentralized P2P networks having only user-driven neighbor tables, such as Gnutella [2]. While its search performance is relatively effective, the flooding nature of this approach consumes substantial network bandwidth. In

addition, remotely located files may not be found due to the practical limits imposed on the search range (e.g. using a TTL). The third approach applies to structured P2P networks, such as Tapestry [3], Pastry [4], Chord [5], and CAN [6]. Typically these systems use a distributed hash table (DHT) as a substrate, in which a file or its location information is placed deterministically at certain peers. These DHT-based approaches guarantee efficient discovery of an existing file in a small bounded number of network hops (O(log N)) for a network consisting of N nodes.

Although DHTs provide an efficient lookup service, files can be located only through their globally unique IDs. Oftentimes, however, users wish to search for files using a set of descriptive keywords, and they do not always have the exact ID of the file they want to locate. In this respect, augmenting a DHT with keyword-based search capability is a valuable extension, and several researchers have studied this problem [7-10]. In general, keyword search in a DHT is supported using an ‘inverted distributed hash table’ technique, which uses keywords as indices of a DHT to locate files [7].

However, the related work so far has not considered an important problem introduced by the inverted hash table: the common keywords problem. It has been shown that certain keywords are commonly associated with a very large number of files [11], compared to most other keywords. The DHT nodes responsible for such common keywords therefore consume an excessively larger amount of storage than other peer nodes. This induces severe unfairness and may discourage users from participating in P2P networks. In addition, user queries involving these common keywords cause a large amount of network traffic.

In this paper we propose a novel keyword-based file search mechanism called Keyword Fusion.
This mechanism provides a scalable and efficient solution for DHT-based P2P file sharing systems in which files are annotated with descriptive keywords. In particular, Keyword Fusion is a fully decentralized architecture with the following features:
• Utilizing a distributed data structure, called the Fusion Dictionary, which stores the most common keywords in the system, Keyword Fusion transforms queries into more specific search terms and thereby improves search efficiency.
• By safely deleting excessively large file lists for common keywords, or redistributing them over the entire

network, Keyword Fusion addresses the disproportionately high storage consumption at overloaded peers.
Based on a set of distributed algorithms that can be easily incorporated into an existing DHT-based P2P lookup service, Keyword Fusion offers a low-overhead solution to the search inefficiency and the unfairness among peers in current keyword-based search mechanisms. The mechanism is designed for searching files annotated with descriptive keywords, such as images and movies, rather than for full-text search of documents. Many existing P2P file sharing systems have observed that most shared files are multimedia and binary data files.

The rest of this paper is organized as follows. Section II describes a vanilla extension of a DHT-based P2P system for keyword search and identifies its limitations. Section III introduces the Keyword Fusion architecture and addresses the problem of common keywords. Section IV evaluates the effectiveness of Keyword Fusion using our extended Chord simulator and annotated image files. Section V describes related work, and finally, Section VI concludes this paper.

II. KEYWORD SEARCH IN DHT-BASED P2P SYSTEMS

This section presents a vanilla extension to a distributed hash table mechanism, using Chord [12] as a reference DHT, to support keyword search. The algorithm presented in this section abstracts the key ideas of the inverted distributed hash table method proposed in [7, 8]. Briefly, in DHT-based P2P systems, locating a node that contains a particular file is done by querying a distributed lookup table that stores <file ID, value> mappings, where the file ID denotes the globally unique ID of the file and the value represents the location of the file. To accommodate the typically huge ID space, this mapping information is distributed over multiple DHT nodes. When assigning a file ID to a node, consistent hashing is used so that the load is evenly distributed across the nodes.
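This placement rule can be illustrated with a short Python sketch (not from the paper; the node names and identifier-space size are hypothetical): keys and node names are hashed into a small identifier circle, and each key is stored at its successor node, in the style of Chord's consistent hashing.

```python
import hashlib

M = 16  # bits in the identifier space, so IDs lie in [0, 2^M)

def chord_id(key: str) -> int:
    """Hash a key (file ID or node name) into the identifier circle."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % (2 ** M)

def successor(key_id: int, node_ids: list) -> int:
    """A key is stored at the first node whose ID is >= the key's ID,
    wrapping around the circle (Chord's successor rule)."""
    ring = sorted(node_ids)
    for nid in ring:
        if nid >= key_id:
            return nid
    return ring[0]  # wrap around past the largest node ID

# Hypothetical five-node network: the host of "b.jpg" is its successor.
nodes = [chord_id(f"node-{i}") for i in range(5)]
host = successor(chord_id("b.jpg"), nodes)
```

Because the hash output is effectively uniform over the identifier space, keys spread evenly across nodes in expectation, which is the load-balancing property the text relies on.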
To facilitate efficient routing, DHT-based P2P systems organize the participating nodes into a logical overlay structure; e.g., Chord organizes the nodes into a logical ring. When a user wants to locate a file, an exact file ID must be provided as the key so that the DHT algorithm can route the request to the DHT node that stores the location of the file. In Chord, when a node receives a query message, it looks up its local routing table (called the “finger table”), which contains next-node information in the logical ring, and forwards the message to the next node. Chord organizes the finger table in such a way that routing a message to its destination resembles a distributed binary search. As a result, it can route messages in O(log N) steps for a network consisting of N nodes.

We can extend Chord to support keyword-based queries in a straightforward manner by maintaining <keyword, list of file locations> information at each DHT node, in place of <file ID, file location>. Note that because the same keyword can appear in multiple files, unlike a file ID, the right-hand side of the mapping is extended to store a list of values: the locations of all files containing that keyword.

[Figure 1 shows nodes N1–N6 on a Chord ring storing keyword entries, e.g. Sunset:{a.jpg} and Mountain:{a.jpg, b.jpg} at N1, Tree:{a.jpg, b.jpg, d.jpg, e.jpg} at N3, River:{a.jpg, b.jpg, c.jpg} at N4, together with Boat:{c.jpg} and Apple:{d.jpg}, along with insert messages publishing b.jpg's keywords and query messages for Tree and Mountain, for the following file set:]

File ID   Keywords
a.jpg     Tree, River, Mountain, Sunset
b.jpg     Tree, River, Mountain
c.jpg     Boat, River
d.jpg     Apple, Tree
e.jpg     Tree

Figure 1. Chord extension for keyword-based searches

Figure 1 presents an illustration of such an extension. In this example, there are five files (a.jpg, b.jpg, c.jpg, d.jpg, e.jpg), each annotated with keywords describing its contents. The location information of the files is now distributed using the keyword as the key for consistent hashing. For example, consider b.jpg with the keyword set {Tree, River, Mountain}. To insert the location information of the file into the extended Chord, each of its keywords is hashed and assigned to a DHT node. In the example, Tree is assigned to N3, River to N4, and Mountain to N1. Since any of these keywords can be used by a query to locate b.jpg, the location information of b.jpg must be stored at all three nodes, N3, N4, and N1. In this way, we can guarantee that b.jpg’s location is returned for any query that searches for Tree, River, or Mountain.

When a user searches for files with a keyword, the extended Chord forwards the query to the node that holds the locations of the files annotated with that keyword. If a user specifies multiple keywords, the extended Chord takes the intersection of the results for each keyword before returning the results to the user. This query processing can be done in a distributed manner by filtering the results at each node hosting a keyword in the query [7][8]. For example, in Figure 1, if a user at N2 wants to find files containing both Tree and Mountain, N2 sends a query message to N3, which is responsible for Tree. N3 then sends the intermediate result set {a.jpg, b.jpg, d.jpg, e.jpg} to N1, where the file list for Mountain is stored. By intersecting the intermediate results from N3 with the file list for Mountain, N1 generates the final result, {a.jpg, b.jpg}. We call this chained query processing.
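The inverted table and chained query processing can be sketched in a few lines of Python. This is a local, single-process model of what the DHT nodes do collectively (in the real system each keyword's entry lives on the node the keyword hashes to); the data is the file set of Figure 1.

```python
# Files and their keyword annotations, as in Figure 1.
files = {
    "a.jpg": {"Tree", "River", "Mountain", "Sunset"},
    "b.jpg": {"Tree", "River", "Mountain"},
    "c.jpg": {"Boat", "River"},
    "d.jpg": {"Apple", "Tree"},
    "e.jpg": {"Tree"},
}

# Build the inverted table: keyword -> set of files containing it.
inverted = {}
for f, keywords in files.items():
    for k in keywords:
        inverted.setdefault(k, set()).add(f)

def chained_query(*keywords):
    """Conjunctive query: intersect file lists keyword by keyword,
    as each hosting node filters the intermediate result set."""
    result = inverted.get(keywords[0], set())
    for k in keywords[1:]:
        result = result & inverted.get(k, set())
    return result

# The example from Figure 1: Tree AND Mountain -> {a.jpg, b.jpg}
```

The order in which keywords are visited does not change the answer, but it does change the size of the intermediate result sets shipped between nodes, which is the cost that Keyword Fusion later targets.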


While this extension with an inverted hash table and chained query processing is conceptually simple and operates correctly, it has a few drawbacks. First, unlike the original <file ID, value> mapping of the DHT, the <keyword, file list> mapping imposes unbalanced storage requirements on DHT nodes. In particular, it is well known that certain keywords are commonly associated with a large number of files, while other keywords appear in only a few files [11] (see Section IV for dataset analysis). Consequently, storage consumption is highly skewed among peers, and a small number of nodes are unfairly burdened. Second, as a corollary of the first problem, a search query containing these common keywords generates a huge volume of network traffic, since they are associated with very long lists of file location information. Worse, not all the search results may be useful for answering the user’s query, since common keywords tend to be too generic and thus may return a large amount of irrelevant information. We therefore need a mechanism to reduce the return traffic. In the next section, we present our solution architecture, called Keyword Fusion, to address these problems of keyword search in DHTs.

III. KEYWORD FUSION ARCHITECTURE

One of the main challenges in designing a keyword search mechanism for a DHT is the problem of common keywords. Common keywords are keywords that appear in the keyword lists of a large number of files; they are very generic words such as “music”, “city”, and “tree”. In this section, we describe the concepts of Fusion Dictionary and Keyword Fusion, which handle the problem of common keywords.

A. Preliminaries

Before describing the Keyword Fusion architecture, we first define a few notations. Let h(k) denote the hosting DHT node which stores the mapping for keyword k, and K(f) denote the set of keywords associated with file f. Also, let F(k) denote the set of files whose keyword lists contain keyword k.
Throughout this paper we use file f and the location of file f interchangeably. Using these notations, we can concisely denote the mapping for keyword k as <k, F(k)>. As an example, in Figure 1, h(Tree) = N3, F(Tree) = {a.jpg, b.jpg, d.jpg, e.jpg}, and K(c.jpg) = {Boat, River}. For query processing, we consider only conjunctive queries, i.e. the keywords in a query are AND-ed. Supporting disjunctive queries (OR) is straightforward by issuing multiple queries. Once a user issues a query, it is routed to the DHT nodes that are responsible for the keywords in the query; we call these nodes destinations. For multiple-keyword queries, we use chained query processing, as explained in Section II.

B. Fusion Dictionary & Partial Keyword List

Consider a query searching for “music AND rock and roll AND Joe’s band”. Note that, in this query,

“music” is the most generic keyword and “Joe’s band” is the most specific. Thus the following inequality clearly holds: |F(Joe’s band)| < |F(rock and roll)| < |F(music)|. Therefore, when searching for files that contain all three keywords, it is advantageous to search for the most specific keyword (“Joe’s band”) first, and then filter the results using the other keywords, “music” and “rock and roll”. This example suggests that identifying common keywords and processing the query starting with uncommon keywords can optimize chained query processing.

Common keywords are associated with a large number of files, which not only causes a large amount of network traffic but also consumes considerable storage space on the hosting nodes. Removing the mapping data of common keywords from the overloaded nodes can solve both problems, as long as all queries can still be answered. This observation provides the insight behind file list consolidation using the Fusion Dictionary. In a nutshell, the Fusion Dictionary (FD) is a distributed data structure that contains the common keywords in the system. DHT-based P2P systems guarantee that each node is responsible for roughly the same number of keywords, so when a DHT node determines that its storage consumption is excessive, it must be hosting common keywords. The node then registers those common keywords in the Fusion Dictionary and removes the file entries for them from its storage. In other words, the Fusion Dictionary contains the list of common keywords whose entries have been deleted from their hosting nodes.

Once the mappings for common keywords have been removed from the hosting nodes, how can a query containing such keywords still be answered precisely? To address this case, we introduce another data structure, called the partial keyword list. Each file f in a <k, F(k)> mapping now stores a small partial keyword set PK(f) = K(f) ∩ FD, along with its other meta-data.
When a node issues a search query, it first consults the Fusion Dictionary to select only the keywords which are not in the dictionary, and accesses their hosting nodes in a chain for query processing. With partial keyword lists added to the file lists, the common keywords in the query can be processed at any of those nodes. This makes query processing more efficient by skipping the nodes hosting common keywords and avoiding the transfer of large intermediate results.

Since the partial keyword list PK(f) is determined by the current Fusion Dictionary, it is generated and maintained dynamically. When keyword k is added to the Fusion Dictionary, the node h(k) removes all the entries in F(k) and propagates the dictionary update to other nodes. When a node receives a dictionary update announcing that k has been added to the Fusion Dictionary, it first checks its local database. If the node has published a file f which contains k as one of its keywords, it re-publishes f by sending it to the nodes hosting the keywords in K(f) other than k. Those destination nodes then add k to the appropriate partial keyword lists.
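The query-side use of the Fusion Dictionary and partial keyword lists can be sketched as follows. This is a minimal Python model, not the paper's implementation; the dictionary contents and file names are hypothetical, and a single destination is shown for brevity (a query whose keywords are all in the dictionary is handled by synthetic keywords, described later).

```python
FUSION_DICT = {"music", "rock and roll"}  # assumed common keywords

# Each published file entry carries its partial keyword list
# PK(f) = K(f) ∩ FD alongside the file location. Here, the entries
# hosted under the specific keyword "Joe's band":
catalog = {
    "Joe's band": {
        "live.mp3": {"music", "rock and roll"},  # PK(live.mp3)
        "demo.mp3": {"music"},                   # PK(demo.mp3)
    }
}

def transform_and_filter(query_keywords):
    """Split the query against FD, visit only the specific keyword's
    host, and verify the common keywords via partial keyword lists."""
    common = {k for k in query_keywords if k in FUSION_DICT}
    specific = [k for k in query_keywords if k not in FUSION_DICT]
    target = specific[0]  # single-destination case for brevity
    entries = catalog.get(target, {})
    # A file matches only if every common query keyword is in its PK.
    return {f for f, pk in entries.items() if common <= pk}
```

For the running example, `transform_and_filter(["music", "rock and roll", "Joe's band"])` visits only h(Joe's band) and filters out demo.mp3, whose partial keyword list lacks "rock and roll".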

[Figure 2 shows the following steps:
1. The user issues a query: search “rock and roll & music & Joe’s band”.
2. The initiator looks up its local Fusion Dictionary and finds “music” and “rock and roll” already there.
3. The initiator transforms the query: search “Joe’s band”.
4. The destination receives the query for “Joe’s band” through the extended DHT-based P2P network.
5. The destination filters the results using the partial keyword lists.
6. The destination returns the refined result for “music & rock and roll & Joe’s band”.
7. The initiator receives the result.
8. The initiator returns the result to the user.]

Figure 2. Keyword search using the Fusion Dictionary, which contains generic keywords

In order to minimize the lookup overhead, the content of the Fusion Dictionary is replicated and propagated across DHT nodes. Managing the Fusion Dictionary and partial keyword lists is a fully decentralized operation. Each node's local Fusion Dictionary periodically exchanges heartbeat messages carrying updates with the Fusion Dictionaries of its neighbors. Query messages can also piggyback dictionary updates; since query messages travel around the whole network, they greatly accelerate propagation. Thus, after a well-defined time period T, the registration of a keyword k in the Fusion Dictionary will have propagated to all DHT nodes¹. After waiting for time T, the hosting node h(k) removes the file location information from k’s mapping. This operation is repeated until the storage consumption at the DHT node drops below a threshold.

The storage overhead of our mechanism is determined by the keyword set size of each file and the Fusion Dictionary size. The Zipf-like distribution of keyword popularity (see Section IV for trace data) indicates that only a small fraction of the keywords used for annotations are very common and would be inserted into the Fusion Dictionary. Consequently, partial keyword lists are small enough to be stored in the inverted DHT. In an implementation, partial keyword lists can be compressed using a Bloom filter [13].

We now revisit our example and illustrate the operation of keyword-based search using the Fusion Dictionary and partial keyword lists (see Figure 2). When a user generates a query for “music AND rock and roll AND Joe’s band”, the initiating DHT node first looks up its local Fusion Dictionary. From the dictionary, the initiating node discovers that “music” and “rock and roll” are not good search terms, since they are too common and have been added to the Fusion Dictionary.
Thus the node modifies the query to contain only “Joe’s band” as the search term, while carrying the original query in the query body. It then sends the query to the destination node that hosts the mapping for “Joe’s band”. At the destination node, the query result for

“Joe’s band” is further refined using the other search terms in the query, with the help of the partial keyword lists. The refined result set contains only the files with all three search terms.

C. Keyword Fusion

The key insight behind the Fusion Dictionary algorithm with partial keyword lists is that when a file is associated with multiple keywords a and b, we can safely remove the file’s information from node h(a) as long as the entry for keyword b is maintained, because the file is still searchable using the remaining keyword b. But what happens when h(b) decides that keyword b, too, is generic and must be removed from its hosting DHT node? Such situations are handled by Keyword Fusion.

To describe Keyword Fusion, we first define a function combine that generates a new keyword by concatenating a set of keywords in alphabetical order. Let K denote a set of keywords {k1, k2, …, kn}. Then combine(K) generates a new keyword k′ = “k1&k2&…&kn”, where k1, k2, …, kn are enumerated in alphabetical order.² For example, combine(music, rock and roll) generates the new keyword “music&rock and roll”. We call these new keywords synthetic keywords, to distinguish them from the original, or prime, keywords. After a synthetic keyword has been generated, a mapping for the new keyword is defined as <k′, F(k1) ∩ F(k2) ∩ … ∩ F(kn)>. In other words, the value part of the mapping for the synthetic keyword is the intersection of all the file lists in the original mappings, i.e. the list of files that contain all of k1, k2, …, kn in their keyword lists.

The operation of Keyword Fusion is as follows. Assume the Fusion Dictionary contains keywords a1, a2, …, am. Now suppose a keyword b is added to the Fusion Dictionary by its hosting node h(b). New keywords are generated by

¹ The time window T can be computed from the update period and the diameter of the DHT network.
² We maintain alphabetical order during keyword fusion to ensure that the combine function is commutative, i.e., combine(k1, k2) = combine(k2, k1). In this way, combine(k1, k2) and combine(k2, k1) are hashed to the same value.


combining b with each of the keywords in the Fusion Dictionary, and the new synthetic keywords are inserted into the P2P network using consistent hashing, along with their mappings. More precisely, Keyword Fusion ensures that all the keywords in the Fusion Dictionary, combined in a pair-wise manner, exist in the DHT. For example, if Fusion Dictionary = {a, b, c}, Keyword Fusion guarantees that the synthetic keywords a&b, b&c, and a&c exist in the DHT.

Note that synthetic keywords can be further synthesized into new keywords if they are still too common. In this case, they first need to be decomposed and recombined to generate a new keyword. More precisely, define decompose(k) as the inverse function of combine, which returns the set of prime keywords in its parameter k. Then Keyword Fusion is defined as combine(decompose(k1) ∪ … ∪ decompose(kn)). For example, combine(decompose(“Everest&Mountain&History”), decompose(“Himalaya&Mountain&Trail”)) = “Everest&Himalaya&History&Mountain&Trail”.
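The combine and decompose functions can be sketched directly in Python (a sketch, not the paper's code). Sorting enforces the alphabetical-order convention that makes combine commutative; note that strict alphabetical order places "Himalaya" before "History", since 'm' precedes 's'.

```python
def combine(*keywords):
    """Concatenate prime keywords in alphabetical order; sorting makes
    combine commutative, so combine(a, b) and combine(b, a) hash alike."""
    return "&".join(sorted(keywords))

def decompose(keyword):
    """Inverse of combine: recover the set of prime keywords."""
    return set(keyword.split("&"))

def fuse(k1, k2):
    """Fuse two (possibly synthetic) keywords as in the text:
    combine over decompose(k1) ∪ decompose(k2)."""
    return combine(*(decompose(k1) | decompose(k2)))
```

Because the result is order-independent, both sides of the network derive the same synthetic keyword string and therefore hash it to the same DHT node.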

Starting from this basic Keyword Fusion algorithm, we can employ a few optimization techniques. First, when two keywords have no common file entries, we do not need to insert their synthetic keyword into the network. For instance, we will not combine “music” and “mona lisa” if no file is annotated with both keywords. Thus, we only need to combine k with the keywords that appear in PK(f) for some f ∈ F(k), because only those keywords are common keywords appearing in the same files as k. Second, when a keyword combination involves a synthetic keyword, the fusion needs to happen only when all the sub-combinations of the new synthetic keyword already exist in the Fusion Dictionary. For example, suppose the synthetic keyword “a&b” is inserted into the Fusion Dictionary and keyword c is in the partial keyword lists of F(a&b). Keyword Fusion does not automatically generate the new synthetic keyword “a&b&c”. This is because, even if the mappings in the DHT for both “a&b” and “c” have been deleted, a search for “a AND b AND c” can still be served by the synthetic keywords “a&c” or “b&c” (whose existence is guaranteed by Keyword Fusion), with the help of their partial keyword lists. In other words, the new synthetic keyword “a&b&c” is created only when “a”, “b”, “c”, “a&b”, “b&c”, and “a&c” are all in the Fusion Dictionary. Formally stated, a combined keyword “k1&k2&…&kn” is generated only when all members of 2^K except K itself, i.e. the set of synthetic keywords generated from the power set of K = {k1, k2, …, kn}, are in the Fusion Dictionary. This property greatly reduces the number of new synthetic keywords generated; our experiments show that more than 90% of the synthetic keywords consist of only two prime keywords.

We now describe the Keyword Fusion algorithm using an example. Suppose “music” has been removed from its host.
Now when keyword “rock and roll” has to be removed from its host h(rock and roll), the hosting node first looks

up the partial keyword lists of F(rock and roll) and finds “music” there. Then, before removing the entry for “rock and roll”, the node h(rock and roll) creates a new keyword “music&rock and roll” and inserts it along with the files which contain “music” in their partial keyword lists, i.e. the mapping <“music&rock and roll”, F(music) ∩ F(rock and roll)>. Once the synthetic keyword and its corresponding mapping information have been inserted, the entry for keyword “rock and roll” can be removed from its host.

After this Keyword Fusion, suppose a user generates a query for “music AND rock and roll”. The initiating node first looks up its local Fusion Dictionary. Since both “music” and “rock and roll” are registered with the Fusion Dictionary, it knows that both keywords have been removed from their original hosts. However, Keyword Fusion guarantees that a synthetic keyword has been generated for keywords registered with the Fusion Dictionary. Thus, the initiating node rewrites the query as “music&rock and roll” and sends it out to the Chord network. The query is answered by the node hosting the synthetic keyword “music&rock and roll”, and the result F(music) ∩ F(rock and roll) is returned to the initiating node.

Figure 3 presents pseudo code for the Keyword Fusion algorithm. When the mapping for a common keyword k is removed from h(k), k is first registered with the Fusion Dictionary (line 3), and the node tries to combine it with the keywords in the partial keyword lists of the files in F(k) (lines 4, 5). Line 7 checks whether a combination is required; if so, the new synthetic keyword and its corresponding files are inserted into the DHT (line 8), and the entries for the combined files are removed (line 12). As a result, Keyword Fusion scatters the file lists for a common keyword to other nodes. Note that the algorithm uses only the local Fusion Dictionary and the partial keyword list of each file.
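The fusion operation just described can be sketched in runnable Python, mirroring the pseudo code of Figure 3 (including the power-set guard of line 7) on hypothetical local data; a real node would issue DHT inserts instead of updating a local dictionary.

```python
from itertools import chain, combinations

fusion_dict = set()  # FD: common keywords removed from their hosts
index = {}           # local model of the DHT: keyword -> {file: PK(file)}

def combine(keys):
    return "&".join(sorted(keys))

def proper_subsets(keys):
    """All synthetic keywords from the power set of `keys`, except
    `keys` itself and the empty set (the '2^l but l' condition)."""
    ks = sorted(keys)
    subs = chain.from_iterable(combinations(ks, r) for r in range(1, len(ks)))
    return {combine(s) for s in subs}

def keyword_fusion(k):
    """Sketch of Figure 3: register k in FD, combine it with common
    keywords found in partial keyword lists, then drop F(k)."""
    fusion_dict.add(k)                                        # line 3
    for f, pk in index.get(k, {}).items():                    # line 4
        for j in pk:                                          # line 5
            l = set(k.split("&")) | {j}                       # line 6
            if proper_subsets(l) <= fusion_dict:              # line 7
                index.setdefault(combine(l), {})[f] = pk - l  # line 8
    index.pop(k, None)                                        # line 12

# Demo (hypothetical data): "rock and roll" was fused away earlier,
# so it already sits in FD and in live.mp3's partial keyword list.
fusion_dict.add("rock and roll")
index["music"] = {"live.mp3": {"rock and roll"}}
keyword_fusion("music")
```

After the demo runs, the entry for "music" is gone and live.mp3 is reachable through the synthetic keyword "music&rock and roll", matching the walkthrough in the text.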
1  Keyword_Fusion(k)
2  {
3    FD ← FD ∪ {k}              // Fusion Dictionary update
4    for all files f in F(k)    // iterate through all files containing k
5      for all keywords j in PK(f)
6        l ← decompose(k) ∪ {j}
7        if (all members of 2^l but l are in FD)
8          insert file f using keyword combine(l)
9        end // if
10     end // for all keywords
11   end // for all files
12   remove F(k)
13 }

Figure 3. Pseudo code for the Keyword Fusion algorithm

To summarize, this section has presented the two main ideas of our Keyword Fusion architecture: the Fusion Dictionary and Keyword Fusion. The role of the Fusion Dictionary is to help peers with excessive storage consumption reduce their loads and to help users generate efficient search queries by avoiding

[Figure 4 graphs: (a) Keyword Count Distribution for dataset A (number of files vs. number of keywords per file); (b) Keyword Popularity Distribution for dataset A (occurrences vs. keyword, sorted by popularity); (c) Keyword Popularity Distribution for dataset B (occurrences vs. keyword, sorted by popularity).]
Figure 4. Keyword count and popularity distributions in annotated image files

common keywords. The role of Keyword Fusion is to ensure that conjunctions of deleted keywords can still be searched by users, by creating synthetic keywords. Both methods, when jointly applied, reduce the degree of storage imbalance and the likelihood of generating large volumes of query-return traffic.

IV. PERFORMANCE EVALUATION

In this section, we present a simulation-based evaluation of the proposed Keyword Fusion algorithms. In particular, we are interested in the effectiveness of the proposed scheme in terms of traffic reduction and fairness of resource consumption at each peer. For the evaluation, we implemented the proposed data structures and algorithms in the Chord simulator [14]. To drive the simulator, we used two datasets: one with 1,000 image files from Corel’s image database and the other with 40,000 image files [15, 16].

A. Dataset Analysis

Before evaluating our scheme, we first characterize the datasets used in our simulations. The first dataset (dataset A) includes 1,000 image files selected from Corel’s image database and manually annotated with relevant keywords by a media lab at the City University of Hong Kong [15]. This set contains a total of 1,000 unique keywords, and each image is annotated with 12 keywords on average. The second dataset (dataset B) was obtained from the Digital Library Project at the University of California at Berkeley [16] and contains 40,000 image files. More than 38,000 of these files are annotated with four keywords selected from 6,510 unique keywords.

Figure 4 presents the characteristics of the two datasets. Graph (a) shows the keyword count distribution for dataset A, i.e. the number of files vs. the number of keywords attached to each file. We observe that the distribution generally follows a bell shape, with an average of 12 or 13 keywords per file; as an exception, more than 60 files have only two keywords. In dataset B, on the other hand, almost all files have four keywords. Figures 4 (b) and (c) present the frequency with which each keyword occurs in file annotations for the two datasets. As expected, these two graphs show that the popularity of keywords used for annotation roughly follows a Zipf-like distribution. In dataset A, the highest-ranked keyword annotates more than 300 of the 1,000 files, and the top 5% most frequent keywords (i.e. 50 words) appear 6,608 times, about half of the total annotations. In dataset B, the top 5% most frequent keywords appear 124,534 times out of 161,051 total keyword occurrences. Recall that, in this paper, these high-frequency keywords are called common keywords, since they appear in a very large number of files. Our analysis shows that the hosting nodes for these 5% most frequent keywords will consume an excessive amount of storage, plus extra processing cycles to handle frequent updates of the keyword information due to file additions and deletions.

B. Impact of Keyword Fusion

In our architecture, sending search queries for synthetic keywords instead of the original prime keywords should reduce network bandwidth consumption, because synthetic keywords are typically mapped to much smaller file lists than the original ones. In this subsection, we briefly illustrate the impact of Keyword Fusion in terms of the reduction in transferred file information. Table 1 presents several sample keyword pairs and the sizes of the resulting file lists when Keyword Fusion is applied to dataset A. From the table, we observe that Keyword Fusion can reduce the search result traffic by up to 81%. Note that this is a conservative case of combining only two common keywords that are correlated; in practice, synthesizing a new keyword from multiple, weakly correlated keywords will further improve the reduction factor.

Keyword 1 (K1)   #Files   Keyword 2 (K2)   #Files   K1+K2   K1&K2   Reduction
PLANT            116      FLOWER           101      217     101     53%
PLANT            116      LEAVES           147      263     72      73%
SNOW MOUNTAIN    91       SKY              153      244     67      73%
SUNSHINE         85       PEOPLE           145      230     46      80%
FLOWER           91       FRAGRANCE        99       190     91      52%
BEACH            101      SKY              153      254     49      81%

Table 1. Reduction of search result traffic using Keyword Fusion
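The Reduction column of Table 1 follows from comparing the traffic of shipping both keyword file lists in chained processing (the K1+K2 column, |F(K1)| + |F(K2)|) against shipping only their intersection via the synthetic keyword (the K1&K2 column). A quick Python check of the arithmetic:

```python
# (K1, K2, |F(K1)|, |F(K2)|, |F(K1) ∩ F(K2)|) from the rows of Table 1
rows = [
    ("PLANT", "FLOWER", 116, 101, 101),
    ("PLANT", "LEAVES", 116, 147, 72),
    ("SNOW MOUNTAIN", "SKY", 91, 153, 67),
    ("SUNSHINE", "PEOPLE", 85, 145, 46),
    ("FLOWER", "FRAGRANCE", 91, 99, 91),
    ("BEACH", "SKY", 101, 153, 49),
]

def reduction(n1, n2, intersection):
    """Fraction of result traffic saved by shipping only the synthetic
    keyword's file list instead of both full lists."""
    return 1 - intersection / (n1 + n2)

# e.g. BEACH & SKY: 1 - 49/254 ≈ 0.81, the 81% entry in Table 1
percentages = [round(reduction(n1, n2, i) * 100) for _, _, n1, n2, i in rows]
```

Running this reproduces the table's percentages, including the 81% maximum for BEACH & SKY.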

[Figure 5. File list redistribution using Keyword Fusion. Storage load distribution (number of files per node, with nodes sorted by number of files): (a) dataset A with 100 nodes, showing "No KF" and thresholds T = 450, 350, and 330; (b) dataset B with 1,000 nodes, showing "No KF" and T = 1500, 1000, and 500.]

C. Balancing Storage Demand

In this subsection we present the effectiveness of Keyword Fusion in reducing resource consumption at the DHT nodes that host common keywords. Figure 5 presents the amount of file information assigned to each participating peer using the vanilla extension of Chord and using Keyword Fusion with various threshold values. Figure 5 (a) presents the case of dataset A under extended Chord with 100 nodes, while Figure 5 (b) presents the case of dataset B inserted into 1,000 nodes. Note that the nodes are sorted by load and that only a portion of the nodes is shown in the graphs, since the curves taper down slowly and steadily for the remaining nodes in Figure 5 (b). In the figures, "No KF" represents the case where Keyword Fusion is not used. In this case, we observe that the load is highly skewed and concentrated on a few nodes: in Figure 5 (a), the top 5% most loaded nodes store 3,307 file entries (660 per node) and the top 10% most loaded nodes store 5,121 file entries (512 per node), whereas the average is only 120 file entries per node. In Figure 5 (b), the skew is even worse: the top 5% most loaded nodes store 87,884 file entries, about 55% of all file entries.

In each graph, the other three curves correspond to Keyword Fusion with three different thresholds. For example, in Figure 5 (a), with T set to 450, 350, and 330, the number of files hosted by the overloaded nodes is successfully reduced below the threshold in each case. More specifically, when T is set to 350, the top 5% nodes store only 1,698 files, about a 50% reduction compared to the "No KF" case. Notice that in the case of T = 330 the distribution is more level, but the average number of files per node increases; this is the case where the system is overcorrected by setting the threshold too low.
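The threshold mechanism can be sketched with a small in-memory model. This is our own illustration, not the paper's implementation: the `fuse_overloaded` helper, the index layout, and the partner-selection rule (fuse with the most co-occurring keyword) are simplifying assumptions.

```python
from collections import Counter

def fuse_overloaded(index, threshold):
    """For every keyword whose posted file list exceeds `threshold`, form a
    synthetic keyword with its most frequently co-occurring partner.
    `index` maps keyword -> {file_id: full keyword set of that file}."""
    synthetic_index = {}
    for kw, files in index.items():
        if len(files) <= threshold:
            continue                                   # node not overloaded
        # Count how often each other keyword appears alongside kw.
        partners = Counter(k for kws in files.values() for k in kws if k != kw)
        if not partners:
            continue
        partner = partners.most_common(1)[0][0]
        fused_kw = "&".join(sorted((kw, partner)))      # e.g. "BEACH&SKY"
        # Only files carrying both keywords move under the synthetic key.
        synthetic_index[fused_kw] = {f for f, kws in files.items()
                                     if partner in kws}
    return synthetic_index

# Tiny example: SKY exceeds a threshold of 2, and BEACH co-occurs most often.
index = {"SKY": {1: {"SKY", "BEACH"}, 2: {"SKY", "BEACH"}, 3: {"SKY", "PLANT"}},
         "BEACH": {1: {"SKY", "BEACH"}, 2: {"SKY", "BEACH"}}}
print(fuse_overloaded(index, 2))   # {'BEACH&SKY': {1, 2}}
```

Setting the threshold too low, as in the T = 330 case above, would trigger this fusion step on many moderately loaded keywords and create additional synthetic entries, which is why the average load per node rises.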
Similarly, Figure 5 (b) presents the results for dataset B with the threshold set to 1500, 1000, and 500. This graph shows results similar to those for dataset A; in this case, however, a low threshold does not generate a large number of new instances. Overall, the simulation results demonstrate that Keyword Fusion can effectively reduce the excessive storage consumption at overloaded peers and redistribute the storage demand across the file sharing system without significantly increasing the storage consumption of the other peers.

To summarize, our preliminary evaluation using an extended Chord simulator has illustrated that Keyword Fusion is highly effective in addressing the common keyword problem. Using annotated image file datasets, we have shown that Keyword Fusion effectively removes the high peaks of the storage skew and the hotspots in the P2P network. In addition, by transforming search queries to contain more specific search terms, Keyword Fusion significantly reduces the network usage for query processing.

V. RELATED WORK

The P2P environment has recently become one of the most popular information sharing architectures. Although search technologies on the Web have been studied extensively, their centralized indexing schemes are not suitable for highly dynamic P2P systems. Providing search capability differs considerably between the two P2P models: unstructured and structured. Unstructured P2P file sharing systems, such as Gnutella [2], use flooding, as their links are naturally constructed by user-selected neighbors. While they can easily provide keyword- or attribute-based search, their efficiency and coverage are limited. To overcome this limitation of blind flooding, Yang et al. [17] proposed a set of mechanisms that adaptively increase the search range and prune the search space using iterative deepening, directed breadth-first traversal, and local indices. Another approach, by Crespo et al. [18], maintains routing indices at each node and forwards queries selectively to the neighbors that are more likely to have answers. However, with the unstructured P2P model, search coverage must be limited to avoid excessive search traffic. On the other hand, structured P2P systems, such as Tapestry [3], Pastry [4], Chord [5], and CAN [6], guarantee locating existing files regardless of their physical and logical locations.
In this model, distributed hash tables are used to route a location request to the destination node. However, these systems are all designed to locate a file by its unique index key and are not capable of searching by keywords. A few studies [7, 8] extended the DHT scheme to support keyword search. The main idea is to use inverted hash tables that replace a unique file ID with associated keyword IDs. With such an extension, a keyword-based search can return the list of all relevant files, but its query traffic can place a significant burden on the network. Reynolds et al. [7] addressed this problem by processing queries cooperatively at the destination nodes and using Bloom filters to compress intermediate file lists. They also cache the results of previous queries to further reduce network traffic and response time. Panache [8] is another system that uses inverted hash tables and Bloom filters. To reduce traffic, it truncates results using file popularity information, which is measured in a distributed manner by counting query hits at each node. However, these approaches do not address the storage hotspots introduced by common keywords. In addition, the query path length is still proportional to the number of keywords in the query, whereas Keyword Fusion can significantly shorten the path using combined keywords. Gnawali [19] proposed Keyword-Set Search (KSS), which groups multiple keywords into a keyword set and uses it as a hash index. This approach also reduces the query path and network traffic, and it is similar to our work in that keywords are combined. However, KSS can create excessive redundant file lists, since all possible keyword combinations are generated for each file insertion. In contrast, our Keyword Fusion combines keywords adaptively depending on their changing popularity. Nakauchi et al. [9] proposed expanding P2P search by incorporating semantics. The idea is to take into account the popularity of files and the relationships between different keywords by maintaining a Keyword Relation DataBase (KRDB) at each node. This work applies the query expansion technique, which has long been investigated in the Information Retrieval area, to P2P search.
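The inverted-hash-table extension and the Bloom-filter compression described above can be sketched together in a toy single-process model. The class names, the SHA-1 keyword hashing, and the filter parameters below are our illustrative assumptions, not details of [7] or [8]:

```python
import hashlib

class BloomFilter:
    """Compact set summary: false positives possible, false negatives not."""
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0
    def _hashes(self, item):
        for i in range(self.k):
            yield int(hashlib.sha1(f"{i}:{item}".encode()).hexdigest(), 16) % self.m
    def add(self, item):
        for pos in self._hashes(item):
            self.bits |= 1 << pos
    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._hashes(item))

class InvertedDHT:
    """Toy inverted-index overlay: each file is posted under the hash of every
    annotation keyword, so a keyword query routes like a normal DHT lookup."""
    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]  # keyword -> file set
    def _node_for(self, kw):
        h = int(hashlib.sha1(kw.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]
    def publish(self, file_id, keywords):
        for kw in keywords:
            self._node_for(kw).setdefault(kw, set()).add(file_id)
    def search(self, kw1, kw2):
        # The node for kw1 ships a compact Bloom summary of its file list,
        # and the node for kw2 returns only the candidates that pass it,
        # instead of shipping its full file list.
        summary = BloomFilter()
        for f in self._node_for(kw1).get(kw1, set()):
            summary.add(f)
        return {f for f in self._node_for(kw2).get(kw2, set()) if f in summary}
```

Because the Bloom summary admits false positives but never false negatives, every true match survives the filter, while the first keyword's full file list never crosses the network.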
Our work has a similar effect, as highly correlated keywords are combined to provide more selective queries.

VI. CONCLUSION

In this paper, we have proposed a set of mechanisms to provide scalable keyword-based file search in a DHT-based P2P system. Our proposed architecture, Keyword Fusion, balances the unfairly skewed storage consumption among peers and transforms users' queries to contain more focused search terms. In particular, based on a distributed data structure called the Fusion Dictionary, which contains common and popular keywords, Keyword Fusion guides the query initiator to generate more specific queries and to direct them to less loaded peers. In this manner, traffic and storage load are distributed and balanced over the DHT network, and the degree of unfairness among peers is reduced. To evaluate the performance, we have implemented the Keyword Fusion algorithm by extending the Chord simulator. The experiments are driven by two annotated image datasets of 1,000 and 40,000 files. The results show that Keyword Fusion can reduce the search traffic by up to 81% even in modest scenarios of combining two relatively generic keywords. We have also shown that Keyword Fusion can effectively unburden overloaded peers and distribute the file storage load across the entire DHT network. For example, Keyword Fusion reduces the storage consumption of the top 5% most loaded nodes by 50% without significantly increasing the storage consumption of the other peers. We are currently conducting more extensive experiments using various datasets, query patterns, and keyword correlations. We are also developing a distributed algorithm to dynamically adapt the threshold value for Keyword Fusion. Finally, we are investigating ways to construct and provide quality or ranking information for the search results in a decentralized manner.

REFERENCES
[1] Napster, "http://www.napster.com."
[2] Gnutella, "http://genutella.wego.com."
[3] B. Y. Zhao, J. Kubiatowicz, and A. D. Joseph, "Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing," IEEE Journal on Selected Areas in Communications, 2001.
[4] A. Rowstron and P. Druschel, "Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems," presented at IFIP/ACM Int'l Conf. on Distributed Systems Platforms, Heidelberg, Germany, 2001.
[5] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications," presented at ACM SIGCOMM 2001, San Diego, CA, 2001.
[6] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A Scalable Content-Addressable Network," presented at ACM SIGCOMM 2001, San Diego, CA, 2001.
[7] P. Reynolds and A. Vahdat, "Efficient Peer-to-Peer Keyword Searching," presented at Middleware 2003, Rio de Janeiro, Brazil, 2003.
[8] T. Lu, S. Sinha, and A. Sudan, "Panache: A Scalable Distributed Index for Keyword Search," 2003.
[9] K. Nakauchi, Y. Ishikawa, H. Morikawa, and T. Aoyama, "Peer-to-Peer Keyword Search Using Keyword Relationship," presented at CCGRID 2003, Tokyo, Japan, 2003.
[10] B. Bhattacharjee, S. Chawathe, V. Gopalakrishnan, P. Keleher, and B. Silaghi, "Efficient Peer-To-Peer Searches Using Result-Caching," presented at IPTPS '03, Berkeley, CA, 2003.
[11] G. Zipf, "Selective Studies and the Principle of Relative Frequency in Language," Harvard University Press, Cambridge, MA, 1932.
[12] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications," presented at ACM SIGCOMM 2001, San Diego, CA, 2001.
[13] B. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, pp. 422-426, 1970.
[14] Chord, "http://www.pdos.lcs.mit.edu/chord/."
[15] Benjiman, "http://abacus.ee.cityu.edu.hk/~benjiman/corel_1/."
[16] Digital-Library-Project, "http://elib.cs.berkeley.edu/photos/corel."
[17] B. Yang and H. Garcia-Molina, "Efficient Search in Peer-to-Peer Networks," presented at ICDCS 2002, Vienna, Austria, 2002.
[18] A. Crespo and H. Garcia-Molina, "Routing Indices For Peer-to-Peer Systems," presented at ICDCS 2002, Vienna, Austria, 2002.
[19] O. D. Gnawali, "A Keyword-Set Search System for Peer-to-Peer Networks," Master's thesis, MIT, Cambridge, MA, 2002.
