Containment of Inequality Queries Revisited

José M. Barja, Nieves R. Brisaboa, José R. Paramá, and Miguel R. Penabad

Departamento de Computación, Universidade da Coruña, E-15071 A Coruña, Spain
{jmbarja,brisaboa,parama,penabad}@udc.es

Abstract. The containment of conjunctive queries containing inequalities (denoted inequality queries in this paper) is a thoroughly studied and long-standing problem. In [3], we offered an exact condition, along with a procedure, to test containment for this type of query, using the idea of canonical databases. In this work, we present a different approach that is considerably more efficient, based on testing query containment by means of the theory of subsumption of formulas.

1 Introduction

The study of the equivalence and containment problems among queries is a fundamental aspect of database theory, especially concerning query optimization. The containment problem is undecidable in general [10], but Chandra and Merlin [4] showed that it is decidable for conjunctive queries. Conjunctive queries are a class of queries that allow only the Selection, Projection and Cartesian product operators. These studies considered conjunctive queries without inequalities (the only possible comparison was the equality operator).

The studies were afterwards extended to inequality queries, that is, conjunctive queries with inequalities or built-in predicates of the form X θ Y, where θ ∈ {=, ≠, <, ≤, >, ≥}. Klug [8] solved the containment problem for inequality queries. However, his solution works only for dense domains, and not when the constants of the databases take their values from a nondense domain, like the integers. Other works related to the containment problem are [13, 7]. They considered different problems that are equivalent to the containment problem for inequality queries and proved their decidability, but they did not offer a procedure to test containment. In [3], we offered a necessary and sufficient condition, along with a procedure, to test containment of inequality queries.

In this work, we present a different approach that is considerably more efficient. It uses the concept of subsumption of formulas as a way to test containment.

The rest of this paper is organized as follows. Section 2 offers some preliminary definitions. Section 3 describes in some detail the work that has been done on the problem of containment of queries with inequalities. Section 4 fully describes our method to test the containment of conjunctive queries, and our conclusions are presented in the last section.

Y. Manolopoulos and P. Návrat (Eds): ADBIS 2002, pp. 31–40, 2002.

2 Preliminary definitions

A conjunctive query, under relational algebra theory, is a query that uses only Selection, Projection and Cartesian product operations. Using deductive database notation, a conjunctive query is a safe, nonrecursive rule whose body predicates are defined over EDB (Extensional Database) predicates [12]. Depending on the presence or absence of inequalities, or built-in predicates, in the bodies of the rules, we distinguish two types of conjunctive queries:

Equality queries: They are conjunctive queries where built-in predicates, except equality, are not allowed. The general form of an equality query is

q(X) :- p1(Y1), ..., pn(Yn).

where q(X) is the query predicate and every pi(Yi) is an ordinary predicate defined over EDB predicates.

Inequality queries: They are conjunctive queries with built-in predicates. An inequality query, in its general form, is a Datalog rule of the form

q(X) :- p1(Y1), ..., pn(Yn), K1, ..., Km.

where q, X, the pi's, and the Yi's are defined as above, and every Kj is a built-in predicate of the form X θ Y, where X and Y are either variables that appear in an ordinary predicate or constants of the domain (but not both constants), and θ ∈ {=, ≠, <, ≤, >, ≥}.

The application of a query written as a Datalog rule uses the well-known concept of assignment mapping to derive new facts. An assignment mapping [11] τ from a query Q to a database D is a function from the symbols of Q to those of D; τ is the identity on predicate names and constants, and it must map every ordinary predicate in the body of Q to a fact in D. If the query has built-in predicates, applying the assignment mapping to them must produce a formula that evaluates to true. The derived fact is the result of applying the mapping to the head of the rule. Given a database D and a query Q, Q(D) represents the set of derived facts obtained by applying Q to D.

A conjunctive query Q1 is set contained in a conjunctive query Q2 (written Q1 ≤ Q2) if and only if, for all databases D, Q1(D) ⊆ Q2(D), that is, the set of facts obtained by Q1 is a subset of the set of facts obtained by Q2:

Q1 ≤ Q2 ⇐⇒ ∀D, Q1(D) ⊆ Q2(D).

Equivalence among queries is defined via mutual inclusion: two conjunctive queries Q1 and Q2 are equivalent, Q1 ≡ Q2, iff Q1 ≤ Q2 and Q2 ≤ Q1:

Q1 ≡ Q2 ⇐⇒ Q1 ≤ Q2 ∧ Q2 ≤ Q1.

In this work, we shall treat queries as formulas, writing them in clausal form. Let

Q1 : q(X) :- p1(Y1), ..., pn(Yn), K1, ..., Km.

This inequality query can be rewritten in clausal form as Q1 = Q1^p ∪ Q1^b, where


Q1^p = {q(X), ¬p1(Y1), ..., ¬pn(Yn)}
Q1^b = {¬K1, ..., ¬Km}

That is, Q1 is divided into two parts: Q1^p, which contains the query predicate and the negations of the ordinary predicates, and Q1^b, which contains the negations of the built-in predicates of Q1.

Let us also define at this point the operator ~, which will be used later. This operator negates all literals in a clause: if C = {L1, ..., Lk}, then ~C = {¬L1, ..., ¬Lk}. For example, if C = {(a > b), ¬(c = d)}, then ~C = {¬(a > b), (c = d)}.
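The clausal rewriting and the ~ operator defined above can be illustrated with a few lines of Python. This is only a sketch under assumed conventions: a literal is a pair (positive, atom), atoms are plain tuples, and the function names `clausal_form` and `tilde` as well as the sample query are hypothetical, introduced here for illustration.

```python
# A literal is (positive, atom); an atom is a tuple such as ("p", "X", "Y")
# for an ordinary predicate or ("<=", "Y", "X") for a built-in predicate.
# This encoding is an assumption of the sketch, not notation from the paper.

def clausal_form(head, ordinary, builtins):
    """Split a rule into Q^p (head plus negated ordinary atoms) and Q^b (negated built-ins)."""
    q_p = {(True, head)} | {(False, atom) for atom in ordinary}
    q_b = {(False, atom) for atom in builtins}
    return q_p, q_b

def tilde(clause):
    """The ~ operator: negate every literal of a clause."""
    return {(not positive, atom) for positive, atom in clause}

# q(X, Y) :- p(X, Y), Y <= X.   (hypothetical query, used only for illustration)
q_p, q_b = clausal_form(("q", "X", "Y"), [("p", "X", "Y")], [("<=", "Y", "X")])

# The example from the text: C = {(a > b), ¬(c = d)}.
c = {(True, (">", "a", "b")), (False, ("=", "c", "d"))}
print(tilde(c))
```

Applying `tilde` twice returns the original clause, which is a quick sanity check on the encoding.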

3 Previous work

This section presents the work that has been done in the field of set containment of conjunctive queries. Note that containment under different underlying semantics, such as bag semantics [1, 2], is not considered in this paper.

The containment of equality queries was fully solved by Chandra and Merlin [4], using the concept of containment mapping: an equality query Q1 is contained in another equality query Q2 if and only if there is a containment mapping from Q2 to Q1. A containment mapping γ from a query Q2 to a query Q1, as defined in [4], is a mapping from the symbols of Q2 to those of Q1 such that:

– The mapping γ is the identity for constants and predicate names.
– It must map the head of Q2 to the head of Q1: γ(q(V)) = q(W).
– Every atom in the body of Q2 must be mapped to an atom in the body of Q1: ∀i ∃j (1 ≤ i ≤ k, 1 ≤ j ≤ l) γ(pi(Zi)) = pj(Yj), where k and l are the numbers of atoms in the bodies of Q2 and Q1, respectively.

Note that it is not necessary that every atom of Q1 is reached by an atom in Q2. With this result, the set containment of equality queries was fully solved.

There has been extensive work on the problem of set containment of inequality queries (see [8, 12, 13, 7]), but the proofs given by these authors either lack a procedure or are applicable only when the underlying domain is dense. The following theorem [12] provides one of the first results in set containment of inequality queries, giving a sufficient condition for it.

Theorem 1. Let Q1 and Q2 be the following inequality queries:

Q1 : q(X) :- p1(Y1), ..., pn(Yn), K1, ..., Km.
Q2 : q(T) :- p1(Z1), ..., pk(Zk), F1, ..., Fl.

where the Ki's and Fi's are built-in predicates, i.e., inequalities. Then, Q1 ≤ Q2 if there is a containment mapping γ from Q2 to Q1 such that:

1. γ(q(T)) = q(X), that is, the head of Q2 is mapped to the head of Q1.
2. ∀i, 1 ≤ i ≤ k, ∃b, 1 ≤ b ≤ n such that γ(pi(Zi)) = pb(Yb), i.e., every ordinary subgoal of Q2 is mapped to an ordinary subgoal of Q1.
3. Every built-in predicate of Q2, once γ is applied (i.e., every γ(Fi)), is implied by the built-in predicates of Q1 (the Ki's).

Klug [8] also sketches a proof for this theorem and shows that the existence of this containment mapping provides a necessary and sufficient condition for the set containment of a subclass of queries: left semiinterval queries and right semiinterval queries. Left semiinterval queries only admit inequalities of the form X θ c, where X is a variable, c is a constant, and θ is one of ≤, < or =. Right semiinterval queries only admit inequalities of the form c θ X, with X, c and θ defined as above. In addition, Klug gives a theorem that provides a necessary and sufficient condition to check whether Q1 ≤ Q2 holds for two inequality queries Q1 and Q2 whose variables range over any dense and totally ordered domain. However, Klug stated in his paper that most of his results (including this theorem) do not hold for nondense domains like the integers.

Ron van der Meyden [13] studied the problem of querying indefinite data over linearly ordered domains. He demonstrated that one of the problems he dealt with was equivalent to the set containment of queries with inequalities, showing that it was decidable and Π₂^p-complete. Using a different technique, based on counter machines, Ibarra and Su [7] showed that the containment/equivalence problem is decidable for linear constraint queries (with an exponential time lower bound and an exponential space upper bound), but no effective procedure to test it is given.

In [3], we gave an exact condition, as well as a procedure, to test set containment of inequality queries, using the concept of canonical databases [1]. The full description and proof of this work can be found in [9]. The present work offers a different approach that considerably increases the efficiency of this method.
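The containment-mapping test of Chandra and Merlin recalled at the beginning of this section can be sketched as a small backtracking search. The Python below is only an illustration under stated assumptions: atoms are tuples, variables are strings starting with an uppercase letter, every other term is a constant, and the names `is_var`, `extend` and `containment_mappings`, like the two sample queries, are hypothetical and not taken from [4].

```python
def is_var(term):
    """Assumed convention for this sketch: variables start with an uppercase letter."""
    return term[0].isupper()

def extend(mapping, src, dst):
    """Try to extend `mapping` so that atom `src` is sent to atom `dst`."""
    if src[0] != dst[0] or len(src) != len(dst):
        return None  # predicate names (and arities) must be preserved
    mapping = dict(mapping)
    for s, d in zip(src[1:], dst[1:]):
        if is_var(s):
            if mapping.setdefault(s, d) != d:
                return None  # variable already mapped to a different symbol
        elif s != d:
            return None  # constants are mapped to themselves
    return mapping

def containment_mappings(q2, q1):
    """Yield every containment mapping from q2 to q1; a query is (head, body)."""
    (head2, body2), (head1, body1) = q2, q1
    start = extend({}, head2, head1)  # the head of q2 must map to the head of q1
    if start is None:
        return

    def search(i, mapping):
        if i == len(body2):
            yield mapping
            return
        for target in body1:  # try every atom of q1 as the image of body2[i]
            nxt = extend(mapping, body2[i], target)
            if nxt is not None:
                yield from search(i + 1, nxt)

    yield from search(0, start)

# Hypothetical equality queries, chosen only for illustration:
#   Q1: q(X) :- p(X, Y), p(Y, X).      Q2: q(A) :- p(A, B).
q1 = (("q", "X"), [("p", "X", "Y"), ("p", "Y", "X")])
q2 = (("q", "A"), [("p", "A", "B")])
print(list(containment_mappings(q2, q1)))  # non-empty, so Q1 ≤ Q2
print(list(containment_mappings(q1, q2)))  # empty, so Q2 is not contained in Q1
```

The search is exponential in the worst case, which is expected for this problem; the sketch favours readability over pruning.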

4 Testing the containment

We shall prove in this section that the containment of inequality queries can be tested using the theory of formula subsumption. The method uses the same idea as the theorem by Chandra and Merlin [4], which proved that the existence of a containment mapping from Q2 to Q1 is a necessary and sufficient condition for Q1 ≤ Q2 when Q1 and Q2 are equality queries. It is also a generalization of the theorem by Ullman [12] shown in the previous section: that theorem gives only a sufficient condition, whereas our method offers a necessary and sufficient one.

The rest of this section first gives an overview of the procedure, and then its detailed description, as well as the proof of its correctness. For this procedure, we shall rewrite the queries in clausal form, as described in Section 2.

4.1 General outline of the procedure

The steps of the algorithm to test if Q1 ≤ Q2, by testing if Q2 subsumes Q1, are the following:

1. Test if Q2^p subsumes Q1^p. That is, for a ground substitution θ that renames the variables of Q1, find all the substitutions λi such that Q2^p λi ⊆ Q1^p θ. As we shall see, we use the resolution method to test this subsumption. If Q2^p does not subsume Q1^p, the algorithm stops and the answer is that Q1 is not contained in Q2.
2. Let λ1, ..., λj be the substitutions that prove that Q2^p subsumes Q1^p. The second step of the algorithm tests whether the following set of clauses C is unsatisfiable:

C = ~Q1^b θ ∪ {Q2^b λ1, ..., Q2^b λj}

The method of Davis and Putnam [6] is particularly well suited to this test. If the empty clause is found using this method, then C is unsatisfiable, which means that Q1 ≤ Q2. If, after the application of this algorithm, C remains a nonempty set of clauses, its satisfiability must be tested depending on the domain of the constants of the underlying EDB databases, as seen in [9].

4.2 Detailed description

We shall consider an inequality query in its clausal form, separating the built-in predicates from the ordinary predicates. That is, each query

Q : q(X) :- p1(Y1), ..., pn(Yn), K1, ..., Km.

is divided into Q^p, the set of ordinary predicates plus the query predicate, and Q^b, the built-in predicates:

Q^p = {q(X), ¬p1(Y1), ..., ¬pn(Yn)}
Q^b = {¬K1, ..., ¬Km}

Step 1

We shall first test if Q2^p subsumes Q1^p, using resolution. The algorithm presented in [5] is adapted to perform this test:

Step 1: Build W as the set of the following unit clauses:

W = {{¬q(X)θ}, {p1(Y1)θ}, ..., {pn(Yn)θ}}

That is, W is a set of unit clauses, each one being the complement (negation) of a literal in Q1^p. The variables of the query are replaced by constants using the ground substitution θ.
Step 2: Let k = 0 and Rk = {Q2^p}.
Step 3: If Rk contains the empty clause, then Q2^p subsumes Q1^p; otherwise, build Rk+1 as the set of resolvents of C1 and C2, where C1 ∈ W and C2 ∈ Rk.
Step 4: If Rk+1 is empty, terminate: Q2^p does not subsume Q1^p (and, therefore, Q1 ≰ Q2). Otherwise, set k = k + 1 and go to Step 3.

Example 1. Let Q1 and Q2 be the following inequality queries:

Q1 : q(X, Y) :- p(X, Y), r(U, V), r(V, U), Y ≤ X.
Q2 : q(X, Y) :- p(X, Y), r(U, V), U ≤ V.

Q1^p = {q(X, Y), ¬p(X, Y), ¬r(U, V), ¬r(V, U)}
Q2^p = {q(X, Y), ¬p(X, Y), ¬r(U, V)}

Following the previous algorithm (the ground substitution used for this case is θ = {X ← a, Y ← b, U ← c, V ← d}):

W = {{¬q(a, b)}, {p(a, b)}, {r(c, d)}, {r(d, c)}}
R0 = {q(X, Y), ¬p(X, Y), ¬r(U, V)}

Since R0 does not contain the empty clause, we build R1 by resolving the following:

R1 = R({¬q(a, b)}, {q(X, Y), ¬p(X, Y), ¬r(U, V)})

Using the substitution {X ← a, Y ← b},

R1 = {¬p(a, b), ¬r(U, V)}

R1 is not empty, so we build R2:

R2 = R({p(a, b)}, {¬p(a, b), ¬r(U, V)}) = {¬r(U, V)}

R2 is not empty, so we build R3. We have two choices. The first one is

R3 = R({r(c, d)}, {¬r(U, V)})

Using λ1 = {X ← a, Y ← b, U ← c, V ← d},

R3 = R({r(c, d)}, {¬r(c, d)}) = {{}}

R3 contains the empty clause. Therefore, Q2^p subsumes Q1^p. The second choice is

R3' = R({r(d, c)}, {¬r(U, V)})

Using λ2 = {X ← a, Y ← b, U ← d, V ← c},

R3' = R({r(d, c)}, {¬r(d, c)}) = {{}}

R3' also contains the empty clause. Therefore, Q2^p also subsumes Q1^p with this new substitution.

Step 2

The next step is to consider the built-in predicates of the queries, that is, Q1^b and Q2^b. The idea is to test the unsatisfiability of the following set of clauses:

~Q1^b θ ∪ {Q2^b λ1, ..., Q2^b λm}

where λ1, ..., λm are all the substitutions that can be used in the resolution process of the previous step. The method by Davis and Putnam [6], also shown in [5, p. 63], is particularly effective for this test.
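As a rough illustration of Step 2, the following Python sketch builds the clause set for the queries of Example 1, with θ, λ1 and λ2 copied from that example, and checks its unsatisfiability. It is not the Davis and Putnam procedure: it is a brute-force stand-in that tries every relative ordering of the symbols, which is only adequate under the assumption that the built-in predicates compare symbols with each other and involve no constants of the domain. The literal encoding and the names `substitute`, `tilde`, `holds` and `unsatisfiable` are assumptions made for this sketch.

```python
from itertools import product
import operator

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# A literal is (positive, (left, op, right)); a clause is a list of literals.

def substitute(clause, sub):
    """Apply a substitution (a dict) to both sides of every literal."""
    return [(pos, (sub.get(l, l), op, sub.get(r, r))) for pos, (l, op, r) in clause]

def tilde(clause):
    """The ~ operator: negate every literal."""
    return [(not pos, atom) for pos, atom in clause]

def holds(literal, rank):
    """Evaluate a literal once every symbol has been given an integer rank."""
    pos, (l, op, r) = literal
    return OPS[op](rank[l], rank[r]) == pos

def unsatisfiable(clauses):
    """Brute force: True if no ranking of the symbols makes every clause true."""
    symbols = sorted({s for c in clauses for _, (l, _, r) in c for s in (l, r)})
    # Ranks 0..n-1 are enough to realise every weak ordering of n symbols.
    for ranks in product(range(len(symbols)), repeat=len(symbols)):
        rank = dict(zip(symbols, ranks))
        if all(any(holds(lit, rank) for lit in c) for c in clauses):
            return False
    return True

theta = {"X": "a", "Y": "b", "U": "c", "V": "d"}
lambdas = [{"X": "a", "Y": "b", "U": "c", "V": "d"},
           {"X": "a", "Y": "b", "U": "d", "V": "c"}]
q1_b = [(False, ("Y", "<=", "X"))]   # Q1^b = {¬(Y ≤ X)}
q2_b = [(False, ("U", "<=", "V"))]   # Q2^b = {¬(U ≤ V)}

# ~Q1^b θ contributes one unit clause per literal; each Q2^b λi is one clause.
clauses = [[lit] for lit in substitute(tilde(q1_b), theta)]
clauses += [substitute(q2_b, lam) for lam in lambdas]
print(unsatisfiable(clauses))
```

In this sketch the clause set built from both substitutions is reported unsatisfiable, whereas the set built from λ1 alone is satisfiable, which suggests why all the λi have to be collected in Step 1 rather than a single one.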


Example 2. (Continued from Example 1.) In this case, Q1^b θ = {¬(b ≤ a)}, and the available λi's are λ1 = {X ← a, Y ← b, U ← c, V ← d} and λ2 = {X ← a, Y ← b, U ← d, V ← c}. Therefore, we must check whether the following set of clauses is unsatisfiable:

{{(b ≤ a)}, {¬(c ≤ d)}, {¬(d ≤ c)}}

This test is performed using the method by Davis and Putnam. By using the one-literal rule, we can delete all clauses containing ¬(c ≤ d) and all the occurrences of the complement of this literal (over a totally ordered domain, ¬(d ≤ c) is equivalent to c < d, which contradicts ¬(c ≤ d), so that literal is deleted from its clause), giving

{{(b ≤ a)}, {}}

Since the empty clause is obtained, the formula is unsatisfiable. This is sufficient to affirm that Q1 ≤ Q2.

4.3 Validation

The following theorem proves that the method presented in this work effectively checks whether a query Q1 is contained in another query Q2, by testing if, for any renaming of variables θ, there exists a set of substitutions {λ1, ..., λj} such that ∀i, 1 ≤ i ≤ j, Q2^p λi ⊆ Q1^p θ, and the set of clauses ~Q1^b θ ∪ {Q2^b λ1, ..., Q2^b λj} is unsatisfiable. The subsumption algorithm [5] is the operative justification of the first part of the algorithm (the verification of the first condition), while the algorithm of Davis and Putnam [6] justifies the second part, which tests the second condition.

Theorem 2. Let Q1 and Q2 be two inequality queries of the form

Q1 : q(X) :- p1(Y1), ..., pn(Yn), K1, ..., Km.
Q2 : q(T) :- p1(Z1), ..., pk(Zk), F1, ..., Fl.

The queries are rewritten as Q1 = Q1^p ∪ Q1^b and Q2 = Q2^p ∪ Q2^b, where

Q1^p = {q(X), ¬p1(Y1), ..., ¬pn(Yn)};  Q1^b = {¬K1, ..., ¬Km}
Q2^p = {q(T), ¬p1(Z1), ..., ¬pk(Zk)};  Q2^b = {¬F1, ..., ¬Fl}

Then, Q1 ≤ Q2 if and only if, for any renaming of variables θ, there exists a set of substitutions {λ1, ..., λj} such that ∀i, 1 ≤ i ≤ j, Q2^p λi ⊆ Q1^p θ and the following set of clauses is unsatisfiable:

C = ~Q1^b θ ∪ {Q2^b λ1, ..., Q2^b λj}
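The unsatisfiability condition of Theorem 2 can be unfolded into an implication between the built-in predicates, which may be easier to check by hand. The derivation below reads ~Q1^b θ as the unit clauses {K1θ}, ..., {Kmθ}, as is done in Example 2; this reading is an interpretation of the notation rather than something the theorem states explicitly.

```latex
\begin{align*}
C \text{ is unsatisfiable}
 &\iff \text{no interpretation satisfies }
       \bigwedge_{r=1}^{m} K_r\theta \;\wedge\;
       \bigwedge_{i=1}^{j} \bigl(\neg F_1\lambda_i \vee \dots \vee \neg F_l\lambda_i\bigr) \\
 &\iff \models\; \bigwedge_{r=1}^{m} K_r\theta \;\Rightarrow\;
       \neg \bigwedge_{i=1}^{j} \neg\bigl(F_1\lambda_i \wedge \dots \wedge F_l\lambda_i\bigr)
       \qquad \text{(De Morgan inside each clause)} \\
 &\iff \models\; \bigwedge_{r=1}^{m} K_r\theta \;\Rightarrow\;
       \bigvee_{i=1}^{j} \bigl(F_1\lambda_i \wedge \dots \wedge F_l\lambda_i\bigr)
       \qquad \text{(De Morgan on the conjunction)}
\end{align*}
```

For the queries of Example 1 this reads (b ≤ a) ⇒ (c ≤ d) ∨ (d ≤ c), whose right-hand side holds in any totally ordered domain.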


Proof. If: Consider any ground database D from which Q1 obtains a fact u using an assignment mapping τ. Thus, u = τ(q(X)), and every τ(pi(Yi)) is a fact in D.

Let us consider first the ordinary predicates of the queries. By hypothesis, ∀k, 1 ≤ k ≤ j, Q2^p λk ⊆ Q1^p θ. Then, each τ ∘ θ⁻¹ ∘ λk is an assignment mapping from Q2 to D, which maps every ordinary predicate of Q2 to a fact in D. We obtain τ(θ⁻¹(λk(q(T)))) = τ(q(X)) = u. The application of τ ∘ θ⁻¹ ∘ λk to every predicate pi(Zi) is always possible, because θ⁻¹(λk(pi(Zi))) = ps(Ys) for some s (1 ≤ s ≤ n). Then, τ(θ⁻¹(λk(pi(Zi)))) = τ(ps(Ys)), which is a fact in D. Therefore, every fact u derived by Q1 using the assignment mapping τ is also derived by Q2 using the assignment mapping τ ∘ θ⁻¹ ∘ λk.

Regarding the built-in predicates, every formula τ(Kr), 1 ≤ r ≤ m, must evaluate to true so that τ can be applied from Q1 to D to derive the fact u. Since

~Q1^b θ ∪ {Q2^b λ1, ..., Q2^b λj}

is unsatisfiable, there is some k such that F1 λk ∧ ... ∧ Fl λk is true. Thus, τ ∘ θ⁻¹ ∘ λk is an assignment mapping from Q2 to D such that the built-in predicates of Q2 hold, so it derives the fact u. Therefore, Q1 ≤ Q2.

Only if: If Q1 ≤ Q2, then it is clear that Q1^p ≤ Q2^p. Using the result of Chandra and Merlin [4], there exists a substitution (or containment mapping) γ such that Q2^p γ ⊆ Q1^p, and for any renaming of variables θ, Q2^p (γθ) ⊆ Q1^p θ. Let {λ1, ..., λj} denote the set of substitutions such that Q2^p λi ⊆ Q1^p θ. For every assignment mapping τ from Q1 to D that derives a fact u, there is an assignment mapping τ ∘ θ⁻¹ ∘ λi from Q2 to D that derives the same fact, so each assignment mapping that makes ~Q1^b θ true makes some of the clauses Q2^b λi false. Therefore, ~Q1^b θ ∪ {Q2^b λ1, ..., Q2^b λj} is unsatisfiable.

The following corollaries also provide interesting results about containment of inequality queries.

Corollary 1. Let Q1 and Q2 be two inequality queries. If there exists some substitution λ such that Q2^p λ ⊆ Q1^p and ~Q1^b ∪ {Q2^b λ} is unsatisfiable, then Q1 ≤ Q2.

Proof. The unsatisfiability of ~Q1^b ∪ {Q2^b λ} implies the unsatisfiability of

~Q1^b ∪ {Q2^b λ, Q2^b λ1, ..., Q2^b λj}

where {λ1, ..., λj} are the remaining substitutions such that Q2^p λi ⊆ Q1^p. Then, by Theorem 2, Q1 ≤ Q2. This result is the same as that offered by Theorem 1, and can be found in [11]. □

Corollary 2. If Q2 subsumes Q1, that is, for a renaming of variables θ, there exists a substitution λ such that Q2 λ ⊆ Q1 θ, then Q1 ≤ Q2.


Proof. Since Q2 subsumes Q1, there exist some substitution λ and a renaming of variables θ such that Q2 λ ⊆ Q1 θ. This implies that Q2^p λ ⊆ Q1^p θ (that is, Q2^p subsumes Q1^p) and Q2^b λ ⊆ Q1^b θ, which is equivalent (as seen in [5, p. 96]) to the set of clauses ~Q1^b θ ∪ {Q2^b λ} being unsatisfiable. By Corollary 1, this implies that Q1 ≤ Q2. □

5 Conclusions

We have presented in this work an algorithm to test containment among inequality queries by using the theory of subsumption of formulas, improving on our previous approach, which used canonical databases [3]. This algorithm also works for conjunctive queries without inequalities, using a method similar to that presented by Chandra and Merlin [4]. In future work, we will try to adapt this algorithm to test other types of containment, including the containment of queries under bag semantics, and to extend it to other types of queries, such as queries with negated subgoals.

References

1. Nieves R. Brisaboa. Inclusión de Consultas Conjuntivas en la semántica de bolsas. PhD thesis, Departamento de Computación, Facultade de Informática, Universidade da Coruña, A Coruña, Spain, May 1997.
2. Nieves R. Brisaboa and Héctor J. Hernández. Testing bag-containment of conjunctive queries. Acta Informatica, 34:557–578, 1997.
3. Nieves R. Brisaboa, Héctor J. Hernández, José R. Paramá, and Miguel R. Penabad. Containment of conjunctive queries with built-in predicates with variables and constants over any ordered domain. In Advances in Databases and Information Systems. Second East Symposium (ADBIS'98), number 1475 in Lecture Notes in Computer Science, pages 46–57, Poznan, Poland, September 1998. Springer-Verlag.
4. A. K. Chandra and P. M. Merlin. Optimal implementation of conjunctive queries in relational databases. In Proc. 9th ACM SIGACT Symp. on the Theory of Computing, pages 77–90, New York, 1977.
5. Ching-Liang Chang and Richard Char-Tung Lee. Symbolic Logic and Mechanical Theorem Proving. Computer Science Classics, 1987.
6. M. Davis and H. Putnam. A computing procedure for quantification theory. Journal of the ACM, 7:201–215, 1960.
7. Oscar H. Ibarra and Jianwen Su. On the containment and equivalence of database queries with linear constraints. In PODS'97, pages 32–43, Tucson, Arizona, 1997.
8. Anthony Klug. On conjunctive queries containing inequalities. Journal of the ACM, 35(1):146–160, 1988.
9. Miguel R. Penabad. A general procedure to test conjunctive query containment. PhD thesis, Departamento de Computación, Facultade de Informática, Universidade da Coruña, A Coruña, Spain, May 2001.
10. B. A. Trakhtenbrot. The impossibility of an algorithm for the decision problem for finite models. Doklady Akademii Nauk SSSR, 70:569–572, 1950.
11. Jeffrey D. Ullman. Principles of Database Systems. Computer Science Press, second edition, 1982.
12. Jeffrey D. Ullman. Principles of Database and Knowledge-base Systems, volume 1 and 2. Computer Science Press, 1988-1989.
13. Ron van der Meyden. The complexity of querying indefinite data about linearly ordered domains. Journal of Computer and System Sciences, 54(1):113–135, 1997.