String Operations in Query Languages

Michael Benedikt†
Bell Labs

Leonid Libkin§
U. Toronto and Bell Labs

Thomas Schwentick¶
U. Mainz

Luc Segoufin‖
INRIA

Abstract

We study relational calculi with support for string operations. Most prior proposals were based on adding the operation of concatenation to first-order logic. Such an extension is problematic as the relational calculus becomes computationally complete, which in turn implies strong limits on the ability to perform optimization and static analysis of properties such as query safety. In contrast, we look at extensions of relational calculus that have nice expressiveness, decidability, and safety properties, while corresponding to sets of string operations used in SQL. We start with an extension based on the string ordering and LIKE predicates. We then extend this basic model to include string length comparison. While both of these share some of the attractive properties of relational calculus (low data complexity for generic queries, effective syntax for safe queries, correspondence with an algebra), there is a large gap between these calculi in expressive power and complexity. The smaller `basic model' has low data complexity for all queries, but is quite limited in expressive power. The latter handles many natural string operations, but includes queries with high data complexity. We thus also explore the space between these two extremes. We consider two intermediate languages: the first extends our base language with functions that trim/add leading characters, and the other extends it by adding the full power of regular-expression pattern matching. We show that both these extensions inherit many of the attractive properties of the basic model: they both have corresponding algebras expressing safe queries, and low complexity of query evaluation.

1 Introduction

One of the strong points of the standard relational algebra and calculus is their support of the `data-independence principle': from the point of view of relational algebra, the stored data items themselves are indistinguishable grains of sand, with only the input structure of the database itself to lend them individuality. The data-independence principle reflects the fact that the basic relational operators apply uniformly over all datatypes, rather than having to be re-defined in ad hoc ways for each type. There are datatypes, however, which come with their own natural logical structure that interacts nontrivially with the relational operators. A vector $\vec{r}$ in a numerical domain, for example, defines a solution space $\{\vec{x} : \vec{r} \cdot \vec{x} = 0\}$, and these solution spaces have an algebraic structure: they admit products, projections, complements, etc. It is both natural and useful to access the algebraic structure within a query language, and to seek to extend relational algebra to handle the interaction of the standard relational operators with these new domain-specific ones. This consideration of the interaction between specific algebraic data structures and relational algebra is what motivates the constraint database model [25]. The constraint model introduces a general framework where query languages can be parameterized by some specific algebraic structure. But work in constraint databases has concentrated primarily on numerical domains, with an eye towards applications in GIS. Yet an even more ubiquitous and interesting example of the phenomenon described above is the case of strings. Strings come with a set of basic operations such as concatenation, prefix, and length. Furthermore, given a string we can use these operations to form solution spaces just as in the numerical case: for a string s, one can ask, for example, for all the strings in the database that belong to the set {y : y LIKE a s%}.
SQL does allow both sorts of structure to interact: from individual strings one can form new strings with operations such as concatenation and TRIM TRAILING, and also form languages from strings using LIKE. A natural question is what can be said about SQL considered as a string language, and our paper explores this issue. SQL, however, imposes several somewhat ad hoc restrictions on the way string operations can be used: we cannot freely mix relational and string operations. Our main goal, then, is to investigate query languages that allow a uniform treatment of relational algebra and string operations. We introduce several such languages in this paper. Our technique is to use the constraint database setting (applied to finite relational databases) to define query languages which extend the relational calculus with different string operations. We then study the expressive power, complexity, and static analysis properties of these languages.

* Part of this work was done while the second and the third authors visited INRIA, and the second and the fourth authors visited Mainz.
† Bell Laboratories, 263 Shuman Blvd, Naperville, IL 60566, USA. E-mail: [email protected].
§ Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 3H5, Canada. E-mail: [email protected].
¶ Johannes Gutenberg-Universität Mainz, Institut für Informatik, 55099 Mainz, Germany. E-mail: [email protected].
‖ INRIA-Rocquencourt, B.P. 105, 78153 Le Chesnay Cedex, France. E-mail: Luc.Segou [email protected].

Some approaches toward unifying string algebras with relational algebra have been developed in the prior literature. [16] studied the consequences of adding pattern-matching features to SQL. [18, 21, 17] propose an extension of the relational calculus with alignment logics and study their complexity and expressive power, while [11, 12] considered Datalog extended with appropriate transducers for string operations, proving a number of completeness results. In [14] arbitrary regions (substrings) can be queried; this, when coupled with relational calculus, gives the power of string concatenation. Closer to our approach, [19, 27] study the relational calculus/algebra extended with an operation for concatenating strings. One problem faced in all this work is that queries that manipulate string languages may return an infinite number of strings. This is the standard issue of safety. The authors tackle this problem by identifying safe fragments of their languages, using a number of syntactic restrictions (see, e.g., [18, 21, 17, 19, 27]), but they cannot capture the safe fragment of the language syntactically. A second problem concerns expressive power.
Many query languages designed in the prior literature turn out to be Turing complete, a feature that in turn makes many sorts of analysis and optimization impossible. Indeed, as noted in [19], adding just concatenation to the relational calculus already yields a query language which is Turing complete. This immediately implies that there is no effective syntax for the corresponding safe fragment [30]. In contrast to the above, in our work we seek languages that fulfill the following criteria:

1. Query evaluation is efficient;
2. There is effective syntax capturing safe queries;
3. There is an algebra equivalent to the language.

Inspired by the constraint database approach, we consider relational calculus, RC, over a number of structures on strings. The first one, RC(S), is the language over the structure with some basic string operations that correspond to SQL's LIKE pattern matching and lexicographic ordering. We then extend the language to RC(S_len) by introducing string length, more precisely, string length comparisons. This extension enables additional operations such as trimming/adding symbols on both left and right of a string, and the SIMILAR pattern matching for checking membership in a regular language [20]. Both languages satisfy criteria 2 and 3, but RC(S) queries can be evaluated rather efficiently ($\mathrm{AC}^0$), while in RC(S_len) one can express NP-complete and coNP-complete problems. RC(S), however, is unable to express certain natural queries, e.g., SELECT $a \cdot x$ FROM R, where $a$ is a fixed character. This leads us to the consideration of two intermediate languages, RC(S_left) and RC(S_reg). The first one adds operations for trimming/adding leading characters, and the second one gives us regular expression pattern matching. Both languages satisfy all three of the required criteria, while considerably extending the expressive power of RC(S).

The paper is organized as follows. The next section presents the notations. Section 3 briefly reviews results on relational calculus with concatenation. Section 4 presents the sets of string operations considered in the paper. In Section 5 we explore the expressive power and complexity of RC(S) and RC(S_len). Query safety for these languages is investigated in Section 6. In Section 7, we propose and study RC(S_left) and RC(S_reg) and extend the previous results to these languages. All proofs are in the appendix.

2 Notations

Strings and operations on them. For a finite alphabet $\Sigma$, we write $\Sigma^*$ for the set of all finite strings over $\Sigma$, and $\Sigma^{\le n}$ for the set of all strings of length at most $n$. The empty string is denoted by $\epsilon$. We shall consider a number of operations on strings; those used most often are:

• $x \cdot y$ is the concatenation of two strings $x$ and $y$.
• $x \preceq y$ is true iff $x$ is a prefix of $y$. $x \prec y$ is true iff $x$ is a strict prefix of $y$.
• $l_a(x)$, $a \in \Sigma$, is a function that adds $a$ as the last symbol: $l_a(x) = x \cdot a$.
• $f_a(x)$, $a \in \Sigma$, is a function that adds $a$ as the first symbol: $f_a(x) = a \cdot x$.
• $|x|$ is the length of string $x$.
• $x - y$ is defined to be the relative suffix of $y$ in $x$. That is, if $x = y \cdot z$, then $x - y = z$; otherwise $x - y = \epsilon$.
• $x \sqcap y$ is the longest common prefix of $x$ and $y$.
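The string operations listed above can be sketched directly on Python strings. This is only an illustration under the assumption that strings over $\Sigma$ are modelled as Python `str` values; the function names are hypothetical and do not come from the paper.

```python
def concat(x: str, y: str) -> str:
    """x . y: concatenation of x and y."""
    return x + y


def is_prefix(x: str, y: str) -> bool:
    """True iff x is a (not necessarily strict) prefix of y."""
    return y.startswith(x)


def is_strict_prefix(x: str, y: str) -> bool:
    """True iff x is a prefix of y and x differs from y."""
    return x != y and y.startswith(x)


def add_last(a: str, x: str) -> str:
    """l_a(x) = x . a"""
    return x + a


def add_first(a: str, x: str) -> str:
    """f_a(x) = a . x"""
    return a + x


def relative_suffix(x: str, y: str) -> str:
    """x - y: z if x = y . z, and the empty string otherwise."""
    return x[len(y):] if x.startswith(y) else ""


def longest_common_prefix(x: str, y: str) -> str:
    """Longest string that is a prefix of both x and y."""
    n = 0
    for cx, cy in zip(x, y):
        if cx != cy:
            break
        n += 1
    return x[:n]
```

For instance, `relative_suffix("abcd", "ab")` yields `"cd"`, while `relative_suffix("abcd", "b")` yields the empty string because `"b"` is not a prefix of `"abcd"`.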

We shall consider a number of first-order structures $M = \langle \Sigma^*, \Omega \rangle$, where $\Omega$ is a collection of predicates and functions on $\Sigma^*$. Often it is convenient to have all relational symbols in $\Omega$. For that purpose, we introduce the unary relation $L_a$ (last symbol) which is true of $x$ iff the last symbol of $x$ is $a$, and a binary relation $F_a(x, y)$ which holds iff $y = f_a(x) = a \cdot x$. Note that $|x|$ does not return a string, so it is not an operation of $\Sigma^*$. Instead, we use the binary predicate $el(x, y)$ (equal length) which is true iff $|x| = |y|$. We write $x \lessdot y$ to express that $y$ extends $x$ by exactly one symbol. Let $\mathit{prefix}(C)$ stand for the prefix-closure of $C$: $\{s \mid s \preceq s', s' \in C\}$. By $\downarrow(C)$ we denote $\{s \mid |s| \le |s'|, s' \in C\}$. Given $C \subseteq \Sigma^*$ and $x \in \Sigma^*$, by $x \sqcap C$ we denote the longest string among $x \sqcap c$, $c \in C$. Note that this is well-defined, since all the strings $x \sqcap c$ are prefixes of $x$.

Databases and query languages. A database schema SC is a collection of relation names $R_1, \ldots, R_l$, $R_i$ being of arity $p_i > 0$. In an instance of SC over a set $U$, each $R_i$ is interpreted as a finite subset of $U^{p_i}$. The active domain of a database $D$, $\mathit{adom}(D)$, is the set of elements from $U$ that appear in $D$. The general setting for query languages is that of a finite database and an infinite underlying structure $M = \langle U, \Omega \rangle$, where $\Omega$ is a set of operations (functions and predicates) on $U$. As our basic language we consider relational calculus, or first-order logic, over the schema SC and $M$, denoted by RC(SC, M). We often omit SC when it is understood, or irrelevant. For example, if $M = \langle \Sigma^*, \prec, (L_a)_{a \in \Sigma} \rangle$, the query [...]

[...] safety problem is to determine whether a query is safe, and it is known to be undecidable even for the pure relational calculus [1]. The state-safety problem is to decide, for a given $\varphi$ and $D$, if $\varphi$ is safe on $D$. We say that safe queries in RC(M) have effective syntax if there exists a recursively enumerable set $\psi_i$, $i < \omega$, of safe queries in RC(M) such that, for every SC, every safe RC(SC, M) query is equivalent to one of the $\psi_i$'s. That effective syntax exists for safe queries in the pure relational calculus is a classical relational theory result. Other results, both positive and negative, have been proved recently [8, 30].

An important restriction on queries is the restriction of quantification to the active domain. We use quantifiers $\exists x \in \mathit{adom}$ and $\forall x \in \mathit{adom}$, whose meaning is as follows: $D \models \exists x \in \mathit{adom}\ \varphi(x, \cdot)$ if $D \models \varphi(a, \cdot)$ for some $a \in \mathit{adom}(D)$ (as opposed to for some $a \in U$ in the case of the usual $\exists x$ quantifier), and similarly for the universal quantifier. These restricted quantifiers are definable in relational calculus, but it is often helpful to have them available separately. A relational calculus formula is called an active-domain formula if all quantifiers in it are of the form $\forall x \in \mathit{adom}$, $\exists x \in \mathit{adom}$. We say that RC(M) admits natural-active collapse [7] if every RC(M) formula is equivalent to an active-domain formula. We say that RC(M) admits restricted quantifier collapse if every RC(M) formula is equivalent to one in which SC-relations appear only under the scope of quantifiers $\exists x \in \mathit{adom}$ and $\forall x \in \mathit{adom}$. Note that if $M$ admits quantifier elimination, these two notions coincide.
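A few notions from this section, namely the active domain, the active-domain quantifiers, the prefix-closure and the set $\downarrow(C)$, can be illustrated with a small sketch. It assumes a database is represented as a Python dictionary from relation names to sets of string tuples; this representation and all function names are hypothetical, chosen only for illustration.

```python
from itertools import product
from typing import Callable, Dict, Set, Tuple

# Hypothetical representation: relation name -> finite set of string tuples.
Database = Dict[str, Set[Tuple[str, ...]]]


def adom(db: Database) -> Set[str]:
    """Active domain: every string that occurs in some tuple of the database."""
    return {value for relation in db.values() for row in relation for value in row}


def exists_adom(db: Database, phi: Callable[[str], bool]) -> bool:
    """Active-domain existential quantifier: phi(a) holds for some a in adom(db)."""
    return any(phi(a) for a in adom(db))


def forall_adom(db: Database, phi: Callable[[str], bool]) -> bool:
    """Active-domain universal quantifier: phi(a) holds for every a in adom(db)."""
    return all(phi(a) for a in adom(db))


def prefix_closure(strings: Set[str]) -> Set[str]:
    """All prefixes of members of the given finite set."""
    return {s[:n] for s in strings for n in range(len(s) + 1)}


def down_closure(strings: Set[str], alphabet: str) -> Set[str]:
    """All strings over `alphabet` that are no longer than some member of the set."""
    if not strings:
        return set()
    bound = max(len(s) for s in strings)
    return {"".join(w) for n in range(bound + 1) for w in product(alphabet, repeat=n)}
```

With `db = {"R": {("ab", "abc"), ("b", "ba")}}`, the call `exists_adom(db, lambda x: x.endswith("c"))` checks the predicate $L_c$ over the four strings of the active domain only, rather than over all of $\Sigma^*$. Note that `down_closure` grows exponentially with the length bound, whereas `prefix_closure` stays linear in the total length of its input.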

Complexity classes. Some complexity results in this paper refer to parallel complexity classes $\mathrm{AC}^0$, $\mathrm{TC}^0$, and $\mathrm{NC}^1$. $\mathrm{AC}^0$ is constant parallel time; more precisely, the class of languages accepted by polynomial-size constant-depth unbounded fan-in circuits. $\mathrm{TC}^0$ additionally has majority gates of unbounded fan-in. In $\mathrm{NC}^1$, there are no majority gates, the depth is allowed to be logarithmic, but fan-in is bounded. It is known that $\mathrm{AC}^0 \subsetneq \mathrm{TC}^0 \subseteq \mathrm{NC}^1$ (parity separates $\mathrm{TC}^0$ from $\mathrm{AC}^0$). We consider uniform versions of these classes [4]; uniform $\mathrm{AC}^0$ over finite structures can be characterized via definability in FO(BIT, [...]

[...] $k$. Then there are infinitely many strings $c$ such that $D \models \varphi(c)$. □

Lemma 3 Let $\varphi(x)$ be a RC(S_len) query. Then there exists (and can be effectively found if $\varphi$ only uses length-restricted quantification) a number $k > 0$ such that the following holds. Assume that $D \models \varphi(s)$ for some $s$ with $d(s, \downarrow D) > k$. Then there are infinitely many strings $c$ such that $D \models \varphi(c)$.

Proofs of these lemmas and an easy proof that the theorem follows from them are in the appendix. □

Corollary 6 For both RC(S) and RC(S_len), the classes of range-restricted and safe queries coincide, and safe queries have effective syntax. □

Note that for queries in RC(S) and RC(S_len) that use a restricted form of quantification (prefix or length), the proof gives us a stronger result: namely, the formula can be effectively found for a given $\varphi$. Not only can we get effective syntax for safe queries, but we can also capture the class of safe queries with an appropriate extension of relational algebra.

[...]_i, and addl_i.

Theorem 4 safe RC(S) = RA(S); safe RC(S_len) = RA(S_len).

Proof sketch. The previous theorem showed that there is a bound on outputs of safe queries. To prove this theorem, it suffices to show that those bounds can be computed by relational algebra expressions. See the appendix for details. □

One of the operations in RA(S_len), $\downarrow_i$, is very expensive, as it may create sets whose size is exponential in the size of the input. It is, however, unavoidable, as there are very expensive safe queries in RC(S_len).

6.2 Decidability results

It is a classical result that safety of pure relational calculus queries is undecidable. State-safety (given a query $\varphi$ and a database $D$, is $\varphi(D)$ finite?) is decidable for the pure relational calculus, and various extensions RC(M) (for example, for the natural numbers with successor [30] or the real field [8]). For S and S_len, this decidability holds as well:

Proposition 7 State-safety is decidable for RC(S) and RC(S_len). □

As query safety is undecidable, one often considers restrictions for which decidability can be obtained. Here we look at one of the most fundamental classes of queries: conjunctive queries. We take their definition in the context of interpreted operations from [8, 22]. A conjunctive query in RC(M) is a query of the form

$$\varphi(\vec{x}) \equiv \exists \vec{y} \; \bigwedge_{i=1}^{k} S_i(\vec{u}_i) \wedge \gamma(\vec{x}, \vec{y}),$$

where $k \ge 0$, each $S_i$ is a schema relation, $\vec{u}_i$ is a subtuple of $(\vec{x}, \vec{y})$ of the same arity as $S_i$, and $\gamma$ is an $M$ formula. A Datalog-like notation for such a query would be $\varphi(\vec{x})$ :- $S_1(\vec{u}_1), \ldots, S_k(\vec{u}_k), \gamma(\vec{x}, \vec{y})$. In [8], safety of conjunctive queries was shown decidable for RC(M), for various structures $M$ on the reals with numerical operations. We now show a general result from which the decidability results for string structures S, S_len and those considered in [8] follow. We say that finiteness is definable with parameters in $M$ if for each formula $\alpha(\vec{x}, \vec{y})$ in $M$, there exists and can be effectively found another formula $\alpha_{fin}(\vec{y})$ such that $M \models \alpha_{fin}(\vec{a})$ iff the set $\{\vec{b} \mid M \models \alpha(\vec{b}, \vec{a})\}$ is finite.

Theorem 5 Assume that $M$ can be expanded to $M'$ such that the theory of $M'$ is decidable, and finiteness is definable with parameters in $M'$. Then safety of Boolean combinations of conjunctive queries in RC(M) is decidable. □

We know that Th(S_len) is decidable [10]. Moreover, finiteness is definable with parameters: for $\alpha(\vec{x}, \vec{y})$, $\alpha_{fin}(\vec{y})$ is $\exists \vec{u}\, \big(\forall \vec{x}\, (\alpha(\vec{x}, \vec{y}) \rightarrow \exists \vec{z} \bigwedge_i (z_i \preceq u_i \wedge el(z_i, x_i)))\big)$. Thus:

Corollary 7 The safety of Boolean combinations of conjunctive queries in RC(S) and RC(S_len) is decidable. □

7 Tame extensions of RC(S)

In the previous two sections we considered two different relational calculi for databases with strings: RC(S) and RC(S_len). The former models operations such as the LIKE pattern matching and lexicographic ordering; the latter adds length comparisons, and enables additional operations such as trimming/adding symbols on both left and right of a string, and the SIMILAR pattern matching for checking membership in a regular language. Both languages have some nice properties: for example, there is effective syntax, and even an algebra, for safe queries. However, RC(S) misses a number of important string functions, while the complexity of RC(S_len) can be quite high: we saw how to encode NP and coNP-complete problems on inputs of a special kind. Thus, a natural question is whether one can add operations to RC(S) while maintaining its nice properties: effective syntax for safe queries and low data complexity.

We give here a positive answer to this question, by considering two extensions. The first one gives us operations for adding/trimming symbols on the left; for example, $\mathrm{TRIM}_a(s)$, where $a \in \Sigma$, produces $s'$ if $s = a \cdot s'$, and $\epsilon$ if the first symbol of $s$ is not $a$. The other extension is by allowing tests for membership in a regular language, without the full power of the equal length predicate. We show that both extensions share most of their properties with RC(S), while adding significantly to the expressiveness of the language.

The first operation we consider is adding one single character on the left: $s \mapsto a \cdot s$, and its inverse $\mathrm{TRIM}_a(s)$, denoted by $s - a$. That is, we consider the structure:

• $S_{left} = \langle \Sigma^*, \prec, (l_a)_{a \in \Sigma}, (f_a)_{a \in \Sigma} \rangle$.

This is a proper extension of S, as the graph of the function $f_a$, $\{(s, f_a(s)) \mid s \in \Sigma^*\}$, is not definable over S [9]. The graph of the subtraction operation is definable with $f_a$. We also remark that while the classes of subsets of $(\Sigma^*)^k$, $k > 1$, definable in S and S_left are different, over both structures the class of definable subsets of $\Sigma^*$ is the same, that is, the class of star-free languages [9].

The second extension we consider allows us to model more general regular expression pattern matching. Of course any regular language is definable over S_len, and thus such pattern matching can be done in the more complex model RC(S_len). We will add regular expression pattern matching directly to S, without adding the equal length predicate. Recall that S has quantifier elimination in the extension that includes predicates $P_L(x, y)$, for each star-free language $L$, whose meaning is $x \preceq y$ and $y - x \in L$. We now define S_reg to be the extension of S with all such predicates when $L$ ranges over regular languages. Note that membership of $x$ in any regular language $L$ is definable by $P_L(\epsilon, x)$. To summarize, we are dealing with RC(S_reg) where

• $S_{reg} = \langle \Sigma^*, \prec, (L_a)_{a \in \Sigma}, (P_L)_{L\ \mathrm{regular}} \rangle$.

Every set definable in S_reg is definable in S_len (as S_len expresses all predicates $P_L$), but the converse is not true since the equal length predicate is not definable in S_reg [9]. Furthermore, the class of subsets of $\Sigma^*$ definable in S_reg is exactly the class of regular languages.

We start with expressive power. Both RC(S_left) and RC(S_reg) behave similarly to RC(S):

Theorem 6 RC(S_left) and RC(S_reg) admit the restricted quantifier collapse. □

Corollary 8 RC(S_left) queries have $\mathrm{AC}^0$ data complexity, and RC(S_reg) queries have $\mathrm{NC}^1$ data complexity. Furthermore, every generic query expressible in RC(S_left) or RC(S_reg) is expressible in RC( [...]

[...]

• $\mathrm{trim}^a_i$, $a \in \Sigma$: On an $m$-attribute relation $R$, it returns an $(m+1)$-attribute relation that holds the tuples $\{(s_1, \ldots, s_{m+1}) \mid (s_1, \ldots, s_m) \in R,\ s_{m+1} = s_i - a\}$.

We now define RA(S_left) as the extension of relational algebra with the operations $\sigma_\alpha$ (where $\alpha$ ranges over S_left formulae), prefix, $\mathrm{addf}^a_i$ and $\mathrm{trim}^a_i$.

The relational algebra RA(S_reg) is defined as RA(S) except that the formulae used in selections $\sigma_\alpha$ range over FO(S_reg) formulae.

Theorem 8 safe RC(S_left) = RA(S_left); [...]
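The trim operator of the algebra described above can be sketched on relations represented as Python sets of string tuples. The representation and the names `trim_left`, `trim_op` and `addf_op` are hypothetical. The behaviour of `addf_op` in particular is an assumption made by analogy with the trim operator (append $a \cdot s_i$ as a new column), since its definition is not part of this excerpt.

```python
from typing import Set, Tuple

# Hypothetical representation of a relation: a finite set of string tuples.
Relation = Set[Tuple[str, ...]]


def trim_left(s: str, a: str) -> str:
    """s - a: the rest of s if s starts with the symbol a, the empty string otherwise."""
    return s[len(a):] if s.startswith(a) else ""


def trim_op(rel: Relation, i: int, a: str) -> Relation:
    """Extend each tuple with s_i - a as a new last column (i is 1-based)."""
    return {row + (trim_left(row[i - 1], a),) for row in rel}


def addf_op(rel: Relation, i: int, a: str) -> Relation:
    """Assumed analogue for adding a leading symbol: new last column is a . s_i."""
    return {row + (a + row[i - 1],) for row in rel}
```

Under these assumptions, a query such as SELECT $a \cdot x$ FROM R corresponds to `addf_op` followed by a projection on the new column. Both operators return exactly one output tuple per input tuple, so they preserve finiteness of the relation.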