Scalable Keyword Search on Large Data Streams

Lu Qin, Jeffrey Xu Yu, Lijun Chang, Yufei Tao
The Chinese University of Hong Kong, Hong Kong, China
{lqin,yu,ljchang}@se.cuhk.edu.hk, [email protected]

Abstract— It is widely recognized that integrating information retrieval (IR) and database (DB) techniques provides users with a broad range of high-quality services. A new and challenging issue in this direction is IR-styled m-keyword query processing in an RDBMS framework over an open-ended relational data stream. Supporting m-keyword queries over a relational data stream lets users monitor, in a timely manner, events that are implicitly interrelated. In brief, the problem is to find all connected trees whose size, in number of nodes, is at most a user-given threshold, for an m-keyword query {k1, k2, ..., km} over a relational data stream on a database schema GS. The difficulty of the problem lies in the number of costly joins to be processed over time, which is affected by parameters such as the number of keywords (m), the maximum size of connected trees (Tmax), and the complexity of the database schema when it is viewed as a schema graph (GS). In this paper, we propose a new demand-driven approach to process such a query over a high-speed data stream. We show that we can significantly reduce the number of intermediate results when processing joins over a data stream, and can therefore achieve high efficiency.

I. INTRODUCTION

IR-styled m-keyword query processing has recently been studied in RDBMSs. The query results, represented as trees of tuples connected through primary/foreign key relationships, give users insights into their data that they cannot easily obtain using SQL. It is highly desirable to extend this capability to m-keyword queries in a relational data stream environment. Such needs stem from real applications where users need to monitor important or interesting events that are not explicitly related, over a large amount of data that arrives as a data stream. Such a data stream can be generated by sensors or by RFID-powered equipment. For example, consider an e-commerce application in which users care about how their customers' order patterns change: customers in a certain area may, over time, change their patterns of ordering fashion design products that come from different vendors.

m-keyword query processing over a relational data stream was first studied by Markowetz, Yang and Papadias in [1]. Given a database schema as a graph GS with primary/foreign key relationships among relations, an m-keyword query finds all connected trees of tuples, linked via primary/foreign key references, that are minimal and total. Total means that all m keywords must be included; minimal means that removing any tuple from the connected tree makes the result miss some keywords. Over an RDBMS, the difficulty is that a join plan with a large number of joins must be generated to process an m-keyword query. Consider a database schema of |V| relations with reasonable primary/foreign key references. The plan includes up to |V| · 2^m projected relations, to handle all possible subsets of the m keywords, which may appear in any possible way and in any relation over a data stream. It is worth noting that the number of joins is also heavily affected by the maximum size of the connected trees a user wants to observe: the larger the size, the harder the processing. In addition, the join processing itself is expensive, and the low efficiency is caused by computing a large number of intermediate results that are never used to form any connected tree. Consider the example (R ⋈ S) ⋈ T. If there are many incoming tuples for R and S, the intermediate results of R ⋈ S can be very large, yet they may not be able to join any T tuple. The costs of computing and maintaining all these intermediate results are very high.

The main contributions of this paper are summarized below. We propose a new join processing approach. We avoid using joins directly, and instead process m-keyword queries using selection/semijoin [2] to fully reduce the number of intermediate results first, followed by joins. In other words, we only maintain those tuples that can possibly be joined to output some connected trees. We conducted extensive performance studies, and confirmed that our approach significantly outperforms the up-to-date approaches.

II. PROBLEM DEFINITION

We consider a database schema in a relational database as a directed graph GS(V, E), called a schema graph, where V represents the set of relation schemas {R1, R2, ...} and E represents foreign key references between two relation schemas.
Given two relation schemas, Ri and Rj, there exists an edge in the schema graph from Ri to Rj, denoted Ri → Rj, if the primary key defined on Ri is referenced by the foreign key defined on Rj. Parallel edges may exist in GS if several foreign keys defined on Rj reference the primary key defined on Ri. To distinguish one foreign key reference among many, we label the edge with the foreign key, written Ri -X→ Rj, where X is the foreign key attribute name. A relation on relation schema Ri is an instance of the relation schema (a set of tuples) conforming to the relation schema, denoted r(Ri). A tuple can be inserted into a relation and deleted from a relation. Below, we use Ri to denote r(Ri) when the context is clear, and we use V(GS) and E(GS) to denote the set of nodes and the set of edges of GS, respectively.
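The schema graph defined above can be pictured with a small sketch. It is only an illustration: the class names (`FkEdge`, `SchemaGraph`) and the example relations (`Person`, `Message`, `Attachment`) with their foreign key attributes are hypothetical and do not come from the paper. The point it shows is that GS is a directed multigraph in which parallel edges are told apart by the foreign key attribute X.

```python
from collections import defaultdict
from typing import NamedTuple


class FkEdge(NamedTuple):
    """Edge Ri -> Rj: the primary key of `src` is referenced by `fk_attr` of `dst`."""
    src: str
    dst: str
    fk_attr: str


class SchemaGraph:
    """Directed multigraph GS(V, E); parallel edges differ only in fk_attr."""

    def __init__(self):
        self.relations = set()              # V(GS)
        self.edges = []                     # E(GS)
        self._adjacent = defaultdict(list)  # relation name -> incident edges

    def add_reference(self, src, dst, fk_attr):
        edge = FkEdge(src, dst, fk_attr)
        self.relations.update((src, dst))
        self.edges.append(edge)
        # A join may follow an edge in either direction, so index both endpoints.
        self._adjacent[src].append(edge)
        if dst != src:
            self._adjacent[dst].append(edge)
        return edge

    def edges_between(self, a, b):
        """All edges that connect relations a and b, in either direction."""
        return [e for e in self._adjacent[a] if {e.src, e.dst} == {a, b}]


# Hypothetical schema: Message has two foreign keys that both reference Person.
gs = SchemaGraph()
gs.add_reference("Person", "Message", "sender_id")
gs.add_reference("Person", "Message", "receiver_id")
gs.add_reference("Message", "Attachment", "message_id")

parallel = gs.edges_between("Message", "Person")
```

Here `parallel` holds the two Person → Message edges, one per foreign key attribute, which is the situation the labeled-edge notation is meant to disambiguate.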

An m-keyword query is a set of keywords of size m, {k1, k2, ..., km}. A result of an m-keyword query is a minimal total joining network of tuples, denoted MTJNT [3], [1]. First, a joining network of tuples (JNT) is a connected tree of tuples where every two adjacent tuples, ti ∈ r(Ri) and tj ∈ r(Rj), can be joined based on the foreign key reference defined on relation schemas Ri and Rj in GS (either Ri → Rj or Rj → Ri). Second, total means that a joining network of tuples must contain all m keywords. Third, minimal means that a joining network of tuples is no longer total if any tuple is removed. The minimal condition implies that every leaf tuple in the tree must contain at least one keyword. The size of an MTJNT is the number of nodes in the tree. A user-given parameter Tmax specifies the maximum number of nodes in an MTJNT, which keeps MTJNTs from growing too large: a result is not meaningful if two tuples are connected only by a long chain of tuples.

Keyword Query Processing over a Data Stream: The problem of m-keyword query processing we study in this paper is to find all MTJNTs of size ≤ Tmax, for a given continuous m-keyword query, {k1, k2, ..., km}, on a schema graph GS, over a high-speed large data stream, in the framework of an RDBMS. It reports new MTJNTs when new tuples are inserted and, in addition, reports the existing MTJNTs that become invalid when tuples are deleted. A sliding window (time interval), W, is specified. A tuple t inserted at time t.start has a lifespan from t.start to t.start + W − 1, unless it is deleted earlier. Two tuples can be joined only if their lifespans overlap.

In the framework of an RDBMS, the two main steps of processing an m-keyword query over a schema graph GS are candidate network generation and candidate network evaluation.
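As a rough illustration of the MTJNT conditions and the lifespan rule stated above, the following sketch checks a given tree of tuples. The names `StreamTuple`, `lifespans_overlap` and `is_mtjnt` are hypothetical. Two simplifying assumptions are made: the listed edges are taken to be joinable through a foreign key already, and the lifespan test is applied to adjacent tuples only. Minimality is read as "removing any leaf loses at least one keyword", since removing an inner tuple would disconnect the tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamTuple:
    tid: str
    keywords: frozenset      # query keywords this tuple contains
    start: int               # insertion time t.start


def lifespans_overlap(a, b, window):
    """A tuple lives from start to start + window - 1, both inclusive."""
    return a.start <= b.start + window - 1 and b.start <= a.start + window - 1


def is_mtjnt(tuples, edges, query, t_max, window):
    """tuples: {tid: StreamTuple}; edges: (tid, tid) pairs assumed joinable."""
    if not tuples or len(tuples) > t_max or len(edges) != len(tuples) - 1:
        return False
    neighbours = {tid: set() for tid in tuples}
    for u, v in edges:
        if not lifespans_overlap(tuples[u], tuples[v], window):
            return False
        neighbours[u].add(v)
        neighbours[v].add(u)
    # n nodes, n - 1 edges and connected  =>  the network is a tree.
    seen, stack = set(), [next(iter(tuples))]
    while stack:
        tid = stack.pop()
        if tid not in seen:
            seen.add(tid)
            stack.extend(neighbours[tid] - seen)
    if seen != set(tuples):
        return False

    def covered(skip=None):
        """Keywords contained in the network, optionally without one tuple."""
        return set().union(*(t.keywords for tid, t in tuples.items() if tid != skip))

    if not query <= covered():                       # total
        return False
    leaves = [tid for tid, ns in neighbours.items() if len(ns) <= 1]
    return all(not query <= covered(skip=leaf) for leaf in leaves)   # minimal


query = frozenset({"dress", "texas"})
tuples = {
    "p1": StreamTuple("p1", frozenset({"dress"}), 0),
    "g1": StreamTuple("g1", frozenset(), 1),
    "p2": StreamTuple("p2", frozenset(), 2),
    "o1": StreamTuple("o1", frozenset(), 3),
    "c1": StreamTuple("c1", frozenset({"texas"}), 4),
}
edges = [("p1", "g1"), ("g1", "p2"), ("p2", "o1"), ("o1", "c1")]

ok = is_mtjnt(tuples, edges, query, t_max=5, window=10)
too_big = is_mtjnt(tuples, edges, query, t_max=4, window=10)
expired = is_mtjnt(tuples, edges, query, t_max=5, window=1)
```

The chain is accepted with Tmax = 5, rejected with Tmax = 4 because it has five nodes, and rejected with a window of 1 because adjacent tuples no longer share any time instant.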
• In the first step, candidate network generation, a set of candidate networks over GS, denoted C = {C1, C2, ...}, is generated, to be evaluated in the second step. In brief, a candidate network (CN), Ci, corresponds to a relational algebra expression that joins a sequence of relations, with selections of tuples for keywords over the relations involved. The set of CNs shall be sound/complete and duplication-free. The former ensures that all MTJNTs can be found, and the latter is mainly for efficiency.
• In the second step, candidate network evaluation, all Ci ∈ C generated are evaluated dynamically over a high-speed data stream. One of the main issues is how to reduce the number of tuples that need to be joined in sliding windows, while tuples are inserted/deleted over a high-speed data stream in a large sliding window.

III. A NEW APPROACH

In this paper we propose a new scalable approach to process m-keyword queries. We take the same two steps in our RDBMS framework, candidate network generation and candidate network evaluation, with a focus on the latter. For evaluating all CNs generated, we propose a new demand-driven evaluation approach that fully reduces the intermediate join results. Our evaluation is a two-phase approach. In the first phase, we use low-cost selection and semijoin [2] to filter out the tuples that cannot be joined. In the second phase, we join only the tuples that can possibly be joined. We explain it using a CN: P{Dress} ⋈ G{} ⋈ P{} ⋈ O{} ⋈ C{Texas}.

[Fig. 1. A Two-Step CN Evaluation: (a) CN, (b) Filter, (c) Join]

[Fig. 2. L-Lattice: (a) Structure, (b) Storage (L-Node table with columns Vid, Rname, KSet, m; L-Edge table with columns Fid, Cid, Attr)]

(1) We construct a rooted tree for a given CN (Fig. 1(a)). Here, a node represents a projected relation and an edge represents a join operator. Suppose there are already some tuples in the projected relations, and consider query processing when a new tuple arrives.

(2) In the filter phase, when a new tuple arrives, for example gi in G{}, we first check whether its child node, P{Dress}, has a tuple that can join gi, using a selection against P{Dress}. If no tuple in P{Dress} can join gi, the processing of the newly arrived gi stops. If at least one tuple in P{Dress} can join, we use a semijoin to inform its parent node, P{}, of the newly arrived tuple gi. Assume we find a tuple pj in P{} that can join gi; we then check, using a selection, whether pj can join a tuple in the other child node of P{}, namely O{}. If no tuple in O{} can join pj, the processing stops. Note that O{} does not need to check its own child nodes any further.

(3) In the join phase, suppose we find that pj in P{} can be joined by some tuples in both its child nodes, namely G{} and O{}. The joining process then starts in a top-down manner, as indicated in Fig. 1(c). When we join, all the tuples involved are known to be joinable, so no unnecessary intermediate results are produced. In other words, we achieve full reduction in terms of intermediate results. This is the main reason that we can achieve high efficiency when handling high-speed data streams.
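The filter-then-join idea can be sketched on the example CN as follows. This is a simplified, batch version under stated assumptions, not the incremental per-tuple procedure of the paper: the class `CNNode`, the functions `filter_phase` and `join_phase`, the join attributes (`pid`, `pid_a`, `pid_b`, `cid`) and all rows are hypothetical. It shows a bottom-up semijoin pass that discards rows with no join partner, followed by a top-down join in which every partial result extends to a complete one.

```python
from itertools import product


class CNNode:
    """One node of a rooted CN tree: a projected relation plus its child joins."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows          # list of dicts, one per tuple
        self.children = []        # (child node, attribute here, attribute on child)

    def add_child(self, child, my_attr, child_attr):
        self.children.append((child, my_attr, child_attr))
        return child


def filter_phase(node):
    """Bottom-up semijoin: keep a row only if every child has a joinable surviving row."""
    for child, my_attr, child_attr in node.children:
        filter_phase(child)
        live = {row[child_attr] for row in child.rows}
        node.rows = [row for row in node.rows if row[my_attr] in live]


def join_phase(node, row):
    """Top-down join starting from one row; call only after filter_phase."""
    per_child = []
    for child, my_attr, child_attr in node.children:
        matches = [r for r in child.rows if r[child_attr] == row[my_attr]]
        per_child.append([sub for r in matches for sub in join_phase(child, r)])
    results = []
    for combo in product(*per_child):
        joined = {node.name: row}
        for part in combo:
            joined.update(part)
        results.append(joined)
    return results


# Rooted tree for P{Dress} ⋈ G{} ⋈ P{} ⋈ O{} ⋈ C{Texas}, rooted at P{}.
root = CNNode("P{}", [{"pid": 10}, {"pid": 11}, {"pid": 12}])
g = root.add_child(CNNode("G{}", [{"pid_a": 1, "pid_b": 10},
                                  {"pid_a": 2, "pid_b": 11}]), "pid", "pid_b")
g.add_child(CNNode("P{Dress}", [{"pid": 1}]), "pid_a", "pid")
o = root.add_child(CNNode("O{}", [{"pid": 10, "cid": 7},
                                  {"pid": 12, "cid": 8}]), "pid", "pid")
o.add_child(CNNode("C{Texas}", [{"cid": 7}]), "cid", "cid")

filter_phase(root)
results = [res for row in root.rows for res in join_phase(root, row)]
```

After the filter phase only the root row with `pid` 10 survives, so the join phase touches exactly the rows that end up in the single result and never materializes a G{} ⋈ P{} pair that O{} would later reject.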

A. L-Lattice

Given a set of CNs, C, we construct a lattice, L, in order to share the query processing cost among all CNs. The procedure is as follows. When a new rooted CN, Ci, is inserted into L, we generate the canonical codes for all rooted subtrees of the rooted CN tree Ci, including Ci itself. A canonical code is a string. Two trees, Ci and Cj, are identical iff their canonical codes are identical. We index all subtrees in L using their canonical codes over L, while constructing L. For a given rooted CN Ci, we use the index to find the largest subtrees in L that Ci can share, and link to the roots of such subtrees. Fig. 2(a) illustrates a partial lattice. The entire lattice, L, is maintained in two relations: the L-Node relation and the L-Edge relation (Fig. 2(b)). Let a bit-string represent a set of keywords, {k1, k2, ..., km}. For every node in L, the L-Node relation maintains a unique Vid in L, the corresponding relation name (Rname) that appears in the given database schema, GS, a bit-string (KSet) that indicates the keywords the node is associated with, and the size of the bit-string (m). The L-Edge relation maintains the parent/child relationships among all the nodes in L, with the parent Vid and child Vid (Fid/Cid) plus the join attribute, Attr (either primary key or foreign key). The two relations can be maintained in memory or on disk. Several indexes are built on the relations to quickly find given nodes in L.

[Fig. 3. Lattice and Its Inputs from a Stream]

B. Candidate Network Evaluation

In our approach we maintain only |V(GS)| relations in total to process an m-keyword query K = {k1, k2, ..., km}. In contrast, the approach in [1] needs to maintain |V(GS)| · 2^m projected relations separately, that is, 2^m projected relations for every relation Ri in GS. We need only |V(GS)| relations because of the lattice structure we use. In our approach, a node, v, in the lattice L is uniquely identified by a node id. The node v represents a projected relation Ri{K′}. By utilizing the unique node id, we can easily maintain all the 2^m projected relations for a relation Ri together.

Next, we discuss how to implement the event-driven evaluation. As shown in Fig. 3, there can be multiple nodes labeled with the same Ri{K′}. For example, G{} appears in two different nodes in the lattice. For each Ri{K′}, we maintain three lists named Rlist (Ready), Wlist (Wait) and Slist (Suspend).
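A minimal sketch of the three-list bookkeeping, under simplifying assumptions: a node counts as Ready when it and all its children hold tuples, as Waiting when only the node itself is empty, and as Suspended otherwise, and an arriving tuple is stored only at Ready or Waiting nodes. The names `State`, `LatticeNode`, `lists_for` and `insert` are hypothetical, and the sketch tests only whether relations are empty; it does not check that the stored tuples actually join.

```python
from enum import Enum


class State(Enum):
    READY = "Rlist"
    WAIT = "Wlist"
    SUSPEND = "Slist"


class LatticeNode:
    def __init__(self, vid, label, children=()):
        self.vid = vid
        self.label = label            # for example "G{}"
        self.children = list(children)
        self.rows = []                # tuples stored for this node

    def state(self):
        """Suspend if some child is empty; otherwise Ready or Wait by own content."""
        if not all(child.rows for child in self.children):
            return State.SUSPEND
        return State.READY if self.rows else State.WAIT


def lists_for(label, nodes):
    """Group the ids of all lattice nodes carrying `label` into the three lists."""
    lists = {state: [] for state in State}
    for node in nodes:
        if node.label == label:
            lists[node.state()].append(node.vid)
    return lists


def insert(label, row, nodes):
    """Store an arriving tuple only at nodes of this label that are Ready or Waiting."""
    for node in nodes:
        if node.label == label and node.state() is not State.SUSPEND:
            node.rows.append(row)


# Two lattice nodes share the label G{} but sit above different children.
p_k1 = LatticeNode(1, "P{K1}")
p_k2 = LatticeNode(2, "P{K2}")
g_a = LatticeNode(3, "G{}", [p_k1])
g_b = LatticeNode(4, "G{}", [p_k2])
nodes = [p_k1, p_k2, g_a, g_b]

insert("P{K1}", {"pid": 1}, nodes)       # only P{K1} receives a tuple
before = lists_for("G{}", nodes)
insert("G{}", {"pid": 1}, nodes)         # stored at node 3, skipped at node 4
after = lists_for("G{}", nodes)
```

In this run node 3 moves from Wlist to Rlist once a G tuple arrives, while node 4 stays on Slist and never stores the tuple, because its child P{K2} is empty.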
The three lists together hold the ids of all nodes in the lattice labeled Ri{K′}, and each such node appears in exactly one of the three lists. A node v in L appears in Wlist if the projected relations represented by all child nodes of v in L are non-empty, but the projected relation represented by v is empty. A node v in L appears in Rlist if the projected relations represented by all child nodes of v in L are non-empty, and the projected relation represented by v itself is non-empty too. Otherwise, v appears in Slist. When a new tuple t of relation Ri with keyword set K′ is inserted, we insert it only into the relations of the nodes v in L that are on the Rlist and Wlist specified for Ri{K′}. Each insertion may notify some parent nodes of v to move from Wlist or Slist to Rlist. The node v itself may also move from Wlist to Rlist. When a tuple t of relation Ri with keyword set K′ is about to be deleted, we remove it only from the relations associated with the nodes v in L that are on the Rlist specified for Ri{K′}. Each deletion may notify some parent nodes of v to move from Rlist or Wlist to Slist, and v itself may move from Rlist to Wlist.

IV. PERFORMANCE STUDIES

We compare our algorithms with the up-to-date algorithms given in [1]. There are two evaluation algorithms in [1]: Full-Mesh and Partial-Mesh. We compare ours only with Full-Mesh because it is the faster of the two when memory space allows. All algorithms are implemented in C++. We conducted all experiments on a PC with a 2.8 GHz CPU and 2 GB of memory, running Windows XP. We test the algorithms on m-keyword queries under the condition that the size of an MTJNT, in number of nodes, is at most Tmax.

We use a synthetic dataset to conduct our tests. The synthetic dataset is specified in [4] and is the same synthetic dataset used in [1]. In the synthetic dataset, the schema graph, GS(V, E), is tree-structured. A relation can join with up to 4 relations. In all relations, all attribute values are randomly and independently generated in the range [1, sel], so the join selectivity between two relations that have a primary/foreign key relationship is 1/sel. As in [1], a tuple may contain several different keywords, where each keyword appears with an independent probability KWF. The sliding window size is W minutes. Over a data stream, a tuple is inserted into or deleted from every relation every second: if there are |V| relations, there are |V| insertions/deletions simultaneously in a single second. The entire time span of the data stream is 5 hours. The parameters with their default values (bold) are shown in Table I for the synthetic dataset.
TABLE I
PARAMETERS FOR SYNTHETIC DATASET

Parameter   Range & Default
W           5, 10, 20, 40, 80 minutes
KWF         0.003, 0.007, 0.01, 0.013, 0.016
|V|         5, 10, 15, 20, 25
sel         500, 750, 1000, 1250, 1500
m           2, 3, 4, 5
Tmax        2, 3, 4, 5, 6
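The join selectivity of 1/sel quoted for the synthetic dataset can be checked with a short calculation. It assumes, beyond what the text states, that the two join attribute values a and b are integers drawn uniformly from {1, ..., sel}; independence is given in the text.

```latex
\[
\Pr[a = b]
  = \sum_{v=1}^{sel} \Pr[a = v]\,\Pr[b = v]
  = \sum_{v=1}^{sel} \frac{1}{sel}\cdot\frac{1}{sel}
  = sel \cdot \frac{1}{sel^{2}}
  = \frac{1}{sel}
\]
```

For example, with sel = 1000 a given pair of tuples joins with probability 0.001, so a larger sel means a more selective join.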

We test our CNEvalDynamic (denoted Dynamic for short) against Full-Mesh (denoted FM) over a data stream for 5 hours, starting from the arrival of the first tuple. We measured CPU time and memory consumption. Due to space limits, we do not report memory consumption. CPU time is reported in seconds. Fig. 4(a) shows the CPU time when varying W from 5 minutes to 80 minutes, and Fig. 4(b) shows the CPU time when varying KWF from 0.003 to 0.016. Fig. 4(c) and (d) show the CPU times when varying |V| (from 5 to 25) and when varying 1/sel (from 1/1500 to 1/500), respectively. The CPU times when varying m (from 2 to 5) and when varying Tmax (from 2 to 6) are shown in Fig. 4(e) and (f), respectively. Our CNEvalDynamic significantly outperforms FM. Across all tests with the various parameters, CNEvalDynamic is not significantly affected by changes of W, KWF, |V|, sel, and m. It is not significantly affected by m because our approach maintains only |V| projected relations, whereas FM needs to maintain |V| · 2^m relations. As shown in Fig. 4(f), Tmax has a great impact on m-keyword query processing in terms of CN evaluation, mainly because it determines the number of joins that need to be conducted.

[Fig. 4. CN Eval (Synthetic Dataset): CPU time (sec) of Dynamic and FM. (a) Vary W, (b) Vary KWF, (c) Vary |V|, (d) Vary sel, (e) Vary m, (f) Vary Tmax]

V. RELATED WORK

Many solutions have been proposed for static keyword search on relational databases, including DBXplorer [5], DISCOVER [3], and Mragyati [6]. They aim at processing keyword queries using a series of SQL queries. Luo et al. in [7] propose a new algorithm to compute top-k minimal connected trees with a new ranking function in an RDBMS. Kite in [8] studies efficient keyword search across heterogeneous relational databases. BANKS [9] models the database as a large weighted graph, and finds top-k minimal cost connected trees over the weighted graph. BANKS-II [10] further improves the efficiency of BANKS. DPBF [11] uses a dynamic programming approach to find the optimal exact top-1 answer in databases, and proposes an incremental method to compute approximate top-k answers. Kimelfeld et al. in [12] give a theoretical analysis of the top-k answers in keyword proximity search. BLINKS in [13] studies top-k keyword search on graphs by partitioning a data graph into blocks. Hristidis et al. in [14] propose a model that can handle queries with both "and" and "or" semantics. Liu et al. [15] give a novel IR ranking strategy for effective keyword search.

There are also reported studies on continuous keyword search in a data stream environment. The work most closely related to ours is [1], which deals with relational data streams, as we extensively discussed in this paper. Other systems such as [16] and [17] focus on a single textual document stream, where different documents do not need to be joined when processing streams. Hristidis et al. in [18] study continuous keyword search on multiple text streams.

VI. CONCLUSION

In this paper, we studied m-keyword query processing on large relational data streams. We proposed a CNEvalDynamic algorithm to significantly reduce the large number of intermediate results to be computed. Our algorithm significantly outperforms the up-to-date algorithms.

ACKNOWLEDGMENT

This work was supported by grants of the Research Grants Council of the Hong Kong SAR, China (No. 419008, CUHK 4161/07, and CUHK 4173/08).

REFERENCES

[1] A. Markowetz, Y. Yang, and D. Papadias, “Keyword search on relational data streams,” in Proc. of SIGMOD’07, 2007.
[2] P. A. Bernstein and D.-M. W. Chiu, “Using semi-joins to solve relational queries,” J. ACM, vol. 28, no. 1, 1981.
[3] V. Hristidis and Y. Papakonstantinou, “Discover: Keyword search in relational databases,” in Proc. of VLDB’02, 2002.
[4] J. Krämer and B. Seeger, “Pipes - a public infrastructure for processing and exploring streams,” in Proc. of SIGMOD’04, 2004.
[5] S. Agrawal, S. Chaudhuri, and G. Das, “Dbxplorer: A system for keyword-based search over relational databases,” in Proc. of ICDE’02, 2002.
[6] N. L. Sarda and A. Jain, “Mragyati: A system for keyword-based searching in databases,” CoRR, vol. cs.DB/0110052, 2001.
[7] Y. Luo, X. Lin, W. Wang, and X. Zhou, “Spark: top-k keyword query in relational databases,” in Proc. of SIGMOD’07, 2007.
[8] M. Sayyadian, H. LeKhac, A. Doan, and L. Gravano, “Efficient keyword search across heterogeneous relational databases,” in Proc. of ICDE’07, 2007.
[9] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan, “Keyword searching and browsing in databases using banks,” in Proc. of ICDE’02, 2002.
[10] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar, “Bidirectional expansion for keyword search on graph databases,” in Proc. of VLDB’05, 2005.
[11] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin, “Finding top-k min-cost connected trees in databases,” in Proc. of ICDE’07, 2007.
[12] B. Kimelfeld and Y. Sagiv, “Finding and approximating top-k answers in keyword proximity search,” in Proc. of PODS’06, 2006.
[13] H. He, H. Wang, J. Yang, and P. S. Yu, “Blinks: ranked keyword searches on graphs,” in Proc. of SIGMOD’07, 2007.
[14] V. Hristidis, L. Gravano, and Y. Papakonstantinou, “Efficient ir-style keyword search over relational databases,” in Proc. of VLDB’03, 2003.
[15] F. Liu, C. T. Yu, W. Meng, and A. Chowdhury, “Effective keyword search in relational databases,” in Proc. of SIGMOD’06, 2006.
[16] T. W. Yan and H. Garcia-Molina, “The sift information dissemination system,” ACM Trans. Database Syst., vol. 24, no. 4, 1999.
[17] F. Fabret, H.-A. Jacobsen, F. Llirbat, J. Pereira, K. A. Ross, and D. Shasha, “Filtering algorithms and implementation for very fast publish/subscribe,” in Proc. of SIGMOD’01, 2001.
[18] V. Hristidis, O. Valdivia, M. Vlachos, and P. S. Yu, “Continuous keyword search on multiple text streams,” in Proc. of CIKM’06, 2006.