The VLDB Journal (2012) 21:889–914 DOI 10.1007/s00778-012-0275-9

REGULAR PAPER

ANDES: efficient evaluation of NOT-twig queries in relational databases

Kheng Hong Soh · Ba Quan Truong · Sourav S. Bhowmick

Received: 6 June 2011 / Revised: 30 March 2012 / Accepted: 4 April 2012 / Published online: 16 May 2012 © Springer-Verlag 2012

Abstract Despite a large body of work on XPath query processing in relational environments, the systematic study of queries containing NOT-predicates has received little attention in the literature. In particular, the XML support of several industrial-strength commercial RDBMSs fails to evaluate such queries efficiently. In this paper, we present an efficient and novel strategy to evaluate NOT-twig queries in a tree-unaware relational environment. NOT-twig queries are XPath queries with ancestor–descendant and parent–child axes that contain one or more NOT-predicates. We propose a novel Dewey-based encoding scheme called Andes (ANcestor Dewey-based Encoding Scheme), which enables us to efficiently filter out elements satisfying a NOT-predicate by comparing their ancestor group identifiers. In this approach, a set of elements under the same common ancestor at a specific level in the XML tree is assigned the same ancestor group identifier. Based on this scheme, we propose a novel SQL translation algorithm for NOT-twig query evaluation. Our experiments confirm that the proposed approach, built on top of an off-the-shelf commercial RDBMS, significantly outperforms state-of-the-art relational and native approaches. We also explore the query plans selected by a commercial relational optimizer to evaluate our translated queries for different input cardinalities. This exploration further validates the performance benefits of Andes.

K. H. Soh · B. Q. Truong · S. S. Bhowmick
School of Computer Engineering, Nanyang Technological University, Singapore, Singapore

Keywords NOT-twig · XPath · XML query processing · Relational database · Performance · Twig evaluation tree · Plan diagrams

1 Introduction

In recent times, the database community has shown tremendous interest in devising innovative solutions to efficiently process XPath queries on large XML databases. In particular, evaluating XPath queries over a relational framework has gained popularity due to its stability, efficiency, expressiveness, and widespread usage in the commercial world. On the one hand, there has been a host of work, cf. [13,19], on enabling relational databases to be tree-aware by modifying the database kernel to support XML. On the other hand, some completely jettison the invasive approach and resort to a tree-unaware approach, cf. [3,12,24,25,28,32], where the database kernel is not modified to support XPath queries. Typically, in this approach, an XML document is shredded into relational table(s) using one of two schemes: the schema-oblivious or the schema-conscious technique. In brief, the schema-oblivious method [3,12,24,28,32] uses a fixed schema to store XML documents and does not require the existence of an XML schema/DTD. The schema-conscious method [25], on the other hand, derives a relational schema from the DTD/XML schema of the XML documents. Generally, the tree-unaware approach reuses existing code, has a lower cost of implementation, and is more portable since it can be implemented on top of off-the-shelf RDBMSs. This paper presents a novel way of evaluating XPath queries containing NOT-predicates in a schema-oblivious environment.

Fig. 1 Examples of NOT-twig queries (panels a–c)

1.1 Motivation

A wealth of existing literature has extensively studied the evaluation of various navigational axes in XPath expressions and optimization techniques in a relational environment [9,10,12,24,28,32]. However, to the best of our knowledge, no systematic study has been carried out on efficiently evaluating XPath queries with NOT-predicates in this environment. For example, the query /catalog//publisher[not(location)]/name retrieves the names of all publishers that have no location (Fig. 1a). Figure 1b, c shows graphical representations of two more XPath queries with NOT-predicates. Note that the importance of such queries has been highlighted in XPathMark [7], an XPath benchmark on top of XMark-generated data [23] consisting of a set of queries that covers the main aspects of XPath 1.0. At first glance, it may seem that this lack of study is primarily because we can efficiently evaluate these queries by leveraging the XML query processor of an existing industrial-strength RDBMS and relying on its query optimization capabilities. However, our initial investigation showed that fast evaluation of XPath queries with NOT-predicates on large data sets remains a bottleneck in several industrial-strength RDBMSs. To get a better understanding of this problem, we experimented with the XBench DCSD [33] and uniprot¹ data sets. Characteristics of these data sets are shown in Fig. 2a, b. Figure 2c shows three XPath queries with NOT-predicates (Q1–Q3) and one normal XPath query Q4 (without any NOT-predicate), along with their execution times on the dc1000 data set using the XML query processors of two industrial-strength RDBMSs (due to legal restrictions, these processors are anonymously identified as xdb1 and xdb2 in the sequel).² The queries were executed on an Intel Xeon X5570 2.93 GHz machine running Windows XP x64 Service Pack 2 with 12 GB RAM.
Interestingly, the evaluation times vary significantly across queries, and some of the queries take more than 4 h on xdb1 and about 14 h on xdb2! This is clearly unacceptable performance in most practical applications. Additionally, observe that Q1 and Q2 are semantically equivalent and return the same result set. In practice, end-users may use either of them. However, with the introduction of one more NOT-predicate in Q1, the evaluation time of Q1 in xdb1 soars by more than 100 times (xdb2's performance does not change much but is equally poor for both queries). Similarly, Q3 and Q4 are also semantically equivalent, but their performance in xdb1 differs by a factor of 100 with the introduction of a NOT-predicate. An immediate impact of such a huge performance disparity is that end-users may have to bear the burden of choosing "correct" queries that perform well.³ However, XML databases are usually consumed by non-technical users, and hence such cognitive overhead is not desirable. Clearly, both these industrial-strength RDBMSs fail to provide efficient support for the evaluation of XPath queries with NOT-predicates. Is it possible to design an RDBMS-based scheme that can address this performance limitation? In this paper, we demonstrate that novel techniques built on top of an industrial-strength RDBMS can make up for a large part of the limitation.⁴ We show that the above queries can be evaluated in less than 2 s using the techniques proposed in this paper. Additionally, the performance disparity between normal XPath queries and those with NOT-predicates diminishes significantly with the adoption of the proposed techniques.

Fig. 2 Data sets and query evaluation times (in s)

¹ Downloaded from http://www.uniprot.org/downloads/.
² We do not show the performance on uniprot because both xdb1 and xdb2 could not work on the u2843 data set.
³ Note that these industrial-strength RDBMSs do not automatically rewrite a query into a semantically equivalent, more efficient form.
⁴ A preliminary and shorter version of the paper has appeared in [26].

1.2 Overview

We take an alternative strategy that bypasses conventional logical XML query optimization and relies solely on a novel XML encoding scheme and the relational optimizer to achieve superior performance for evaluating XPath queries with NOT-predicates. Our proposed Dewey-based encoding scheme, called Andes (ANcestor Dewey-based Encoding Scheme), provides the pivotal framework for a schema-oblivious XPath query processor used primarily for read-mostly workloads and built on top of a widely available commercial RDBMS (MS SQL Server 2008). In Andes, the levels and leaf elements of an XML tree are explicitly labeled with our novel encoding scheme, whereas the non-leaf elements are implicitly labeled. Specifically, two labels among others, namely AncestorValue and AncestorDeweyGroup, are associated with each level and each leaf element, respectively. These labels enable us to efficiently group a set of elements under the same common ancestor at a specific level with the same ancestor group identifier. As we shall see later, this allows us to efficiently filter out elements satisfying a NOT-predicate by comparing their ancestor group identifiers. Specifically, this comparison translates to an equijoin, which is efficiently supported by any industrial-strength RDBMS.

Based on our new encoding scheme, we propose a novel SQL translation algorithm to translate an XPath query (represented as a twig pattern), potentially with complex structures of NOT-predicates, into an SQL query. We analyze the input twig pattern and transform it into an encoding-scheme-independent twig evaluation tree (TET). Informally, a TET is a tree that encodes all the essential information (e.g., nodes that are involved in selection operations, structural joins between these nodes) needed to evaluate a query. The SQL query is generated from the TET. Specifically, in this paper, we focus on twig patterns (XPath queries) with both positive and negative predicates; ancestor–descendant (AD) and parent–child (PC) edges; and XML elements and attributes with or without value conditions. In the sequel, we refer to such queries as NOT-twig queries.
We demonstrate with exhaustive experiments on both synthetic and real data sets that our proposed approach significantly accelerates the evaluation of NOT-twig queries. In particular, it is significantly faster than xdb1 and xdb2 (the highest observed factor being 1,000 times). Interestingly, it significantly reduces the performance gap with a state-of-the-art column-store-based XQuery processor (MonetDB/XQuery [4]) and, somewhat unexpectedly, even outperforms MonetDB/XQuery for several NOT-twig queries! We also provide insights into the plan choices a relational optimizer makes during NOT-twig query evaluation in Andes by visually characterizing its behavior over the relational selectivity space using Picasso [22]. Such insights further validate the strengths of Andes for NOT-twig evaluation.

The rest of our paper is organized as follows. We formally introduce the notion of NOT-twig query matching in the next section. We present our proposed encoding scheme, Andes, in Sect. 3. In Sect. 4, we present in detail the NOT-twig query processing strategy that utilizes our encoding scheme. The algorithm to translate a NOT-twig pattern into an SQL query over Andes is detailed in Sect. 5. We evaluate and compare the performance of our proposed techniques through an extensive set of experiments in Sect. 6. We visually depict the plan choices a relational optimizer makes during NOT-twig query evaluation in Andes in Sect. 7. We compare our approach with related work in Sect. 8. Section 9 concludes the paper. A list of symbols used in this paper is given in Table 1.

Table 1 Table of notations

Symbol                    Description
D                         An XML document tree
d                         A node (element or attribute) in an XML document
Q(N, E)                   A twig pattern with node set N and edge set E
n                         A node in the twig pattern
$n_r$                     Return node of the twig pattern
$L_{max}$                 The depth of the document tree
$M_k$                     Maximal k-consecutive leaf-node list
$A_\ell$                  The AncestorValue at level $\ell$
$G_i^\ell$                The Ancestor Group Identifier of $d_i$ at level $\ell$
$M(N_\sigma, E_\sigma)$   The twig evaluation tree
$C_n$                     A node in the twig evaluation tree corresponding to n in Q

2 Preliminaries

In this section, we first present the representation of twig queries with NOT-predicates containing ancestor–descendant (AD) and parent–child (PC) axes. Then, we formally define the NOT-twig query matching problem.

2.1 XML data and query model

We model an XML document as an ordered labeled tree D. The node set of D stores the elements and attributes of the document.⁵ Each edge ($d_1$, $d_2$) in D represents the structural relationship between node $d_1$ and node $d_2$ in the document tree. Each node d ∈ D has a string label, denoted as d.tag, representing the tag of the element/attribute. Each node d ∈ D may also optionally have a value, denoted as d.value, storing the data value associated with d. In this paper, for simplicity, we do not consider mixed elements⁶ and assume d.value is empty for all non-leaf nodes.

Twig patterns form an important part of XPath queries. In this paper, a twig pattern is essentially a tree specifying the tag conditions, the value conditions, and the structural conditions of an XPath query. In the remaining part of this paper, we use the term "node" to refer to a query node, while the term "element" refers to document elements/attributes. The tag condition associated with a node n in the twig pattern (denoted as n.tag) is a condition that the tags of elements matching n must satisfy. Note that n.tag can be either a label string or a wildcard "*". When n.tag is a label string, all elements d matching n must have d.tag = n.tag. A leaf node n in the twig pattern may also have a value condition, denoted as n.value, specifying the required data value of the matching elements. In a twig pattern, the return node, denoted as $n_r$, is a special node representing the result nodes of the XPath query. In the sequel, we graphically represent the return node of a twig pattern in italics. When a non-leaf node has more than one child, it is called a branching node; otherwise, it is called a non-branching node.

An edge between two nodes $n_i$ and $n_j$ in a twig pattern, denoted as edge($n_i$, $n_j$), specifies the structural condition between matching elements. In this paper, a structural condition represents either a parent–child (PC) or an ancestor–descendant (AD) relationship. Consequently, twig pattern edges can be classified into one of the following two types.

– Positive edge: This corresponds to an edge($n_i$, $n_j$) without a NOT-predicate in the query expression. It is represented as "|" or "||" in a twig pattern for PC or AD edges, respectively. Node $n_j$ is called the positive PC (resp. AD) child of $n_i$.
– Negative edge: This corresponds to an edge($n_i$, $n_j$) with a NOT-predicate and is represented as "|¬" or "||¬" in the twig for PC or AD edges, respectively. In this case, node $n_j$ is called the negative PC (resp. AD) child of $n_i$.

A descendant of a node n is called a positive descendant if the path between them contains no negative edges. For example, in Fig. 1c, title is a positive descendant of catalog.

⁵ In our model, we ignore comments, namespaces, and processing instructions.
⁶ Mixed elements can easily be supported by adding a dummy "value" tag.
Since predicates in XPath can be in any order, a twig pattern tree Q(N, E) typically has no order. However, to simplify our discussion, we assume that for any node n ∈ N, its negative children always precede its positive children. A node n is called a rightmost descendant of a node n′ if n is a descendant of n′ and all nodes on the path from n to n′ (excluding n′) are the rightmost children of their parents. We also assume that the return node $n_r$ is a rightmost descendant of the root node in Q. For instance, in Fig. 1c, the negative children description and publisher of book precede (on the left) the positive child title, and the return node book is a rightmost descendant of the root catalog.

Since handling negative edges is the main focus of this paper, we specifically refer to a twig query with at least one negative edge as a NOT-twig query. On the other hand, a query without any negative edges is referred to as a normal twig query. For example, consider the NOT-twig query in Fig. 1a. edge(catalog, publisher) and edge(publisher, name) are AD and PC positive edges, respectively, whereas edge(publisher, location) is a negative PC edge. Node publisher has two children, among which name is a positive PC child and location is a negative PC child. Note that the node name is the return node of the query.

Fig. 3 An XML document

2.2 NOT-twig pattern matching

Given a NOT-twig query Q, a query node n, and an XML tree D, an element $d_n$ ∈ D satisfies the subquery rooted at n of Q iff:

1. $d_n$.tag = n.tag, $d_n$.value = n.value; and
2. n is a leaf node of the NOT-twig query Q; or
3. For each child node $n_c$ of n in Q:
– If $n_c$ is a positive child of n, then there is an element $d_{n_c}$ ∈ D such that $d_{n_c}$ is a child (when edge(n, $n_c$) is PC) or descendant (when edge(n, $n_c$) is AD) element of $d_n$ and satisfies the sub-query rooted at $n_c$.
– If $n_c$ is a negative child of n, then there does not exist any element $d_{n_c}$ ∈ D such that $d_{n_c}$ is a child (when edge(n, $n_c$) is PC) or descendant (when edge(n, $n_c$) is AD) element of $d_n$ and satisfies the sub-query rooted at $n_c$.

For example, consider the query in Fig. 1c and the XML tree in Fig. 3. The subtree rooted at the first book element is not a matching answer since it does not satisfy the subquery rooted at the publisher node because it has a publisher child without any website children. The second book element is not in the results of the query because it does not satisfy the subquery rooted at the title node. However, the last book element satisfies the query.

3 Dewey-based encoding scheme

In this section, we present in detail the schema-oblivious encoding scheme called Andes (ANcestor Dewey-based Encoding Scheme). Our evaluation technique for NOT-twig queries is built on top of this scheme.
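Before turning to the encoding itself, the matching semantics of Sect. 2.2 can be made concrete with a small in-memory sketch. This is only an illustration of the definition, not the relational evaluation strategy of this paper; the classes `Elem` and `QNode`, the edge tuple layout, and the helper `publisher_names_without_location` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Elem:
    """A document element: tag, optional value, ordered children."""
    tag: str
    value: Optional[str] = None
    children: List["Elem"] = field(default_factory=list)


@dataclass
class QNode:
    """A twig-pattern node. Each edge is (child, axis, negative),
    with axis "pc" (parent-child) or "ad" (ancestor-descendant)."""
    tag: str                      # a label string or the wildcard "*"
    value: Optional[str] = None   # optional value condition
    edges: List[Tuple["QNode", str, bool]] = field(default_factory=list)


def descendants(d: Elem) -> Iterator[Elem]:
    """All proper descendants of d in document order."""
    for c in d.children:
        yield c
        yield from descendants(c)


def satisfies(d: Elem, n: QNode) -> bool:
    """True iff element d satisfies the subquery rooted at query node n."""
    if n.tag != "*" and d.tag != n.tag:
        return False
    if n.value is not None and d.value != n.value:
        return False
    for child, axis, negative in n.edges:
        candidates = d.children if axis == "pc" else descendants(d)
        found = any(satisfies(c, child) for c in candidates)
        # A positive edge needs a match; a negative edge forbids one.
        if found == negative:
            return False
    return True


def publisher_names_without_location(root: Elem) -> List[Optional[str]]:
    """Sketch of /catalog//publisher[not(location)]/name (Fig. 1a)."""
    if root.tag != "catalog":
        return []
    pub = QNode("publisher", edges=[(QNode("location"), "pc", True),
                                    (QNode("name"), "pc", False)])
    return [c.value
            for p in descendants(root) if satisfies(p, pub)
            for c in p.children if c.tag == "name"]
```

Following the convention stated above, the negative child `location` is listed before the positive child `name`. The recursion visits the document directly, which is exactly what the relational approach avoids; it serves only as a reference for what a correct answer looks like.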

Fig. 4 An instance of Andes for the document in Fig. 3

The schema of the schema-oblivious XML storage based on Andes is as follows.

– Document(DocID; Name)
– DocLevel(DocID; Level; AncestorValue)
– StructSummary(PathId; PathExp)
– PathValue(DocID; LeafOrder; PathId; BranchOrder; AncestorDeweyGroup; LeafValue)
– Attribute(DocId; LeafOrder; PathId; AncestorDeweyGroup; LeafValue)

Document stores the document identifier DocID and the name Name of a given input XML document D. Each distinct root-to-leaf path appearing in D, namely PathExp, is associated with an identifier PathId and stored in the StructSummary table. Essentially, each path is a concatenation of the labels of the elements on the path from the root to the leaf. An example of the StructSummary table containing the root-to-leaf paths of Fig. 3 is shown in Fig. 4. Note that '#' is used as the delimiter of steps in the paths instead of '/' for reasons described in [32]. Each level of an XML tree is associated with an attribute called AncestorValue and stored in the DocLevel relation. The PathValue and Attribute relations store the data of the leaf elements and attributes, respectively. Each leaf element is associated with four attributes, namely LeafOrder, BranchOrder, AncestorDeweyGroup, and LeafValue. Each non-leaf element is implicitly assigned the AncestorDeweyGroup of its first descendant leaf element for reasons discussed later. Each attribute is associated with LeafOrder, AncestorDeweyGroup, and LeafValue. We now elaborate on these attributes.

The LeafValue attribute stores the data value of the elements/attributes, while PathId is a foreign key to the StructSummary table, indicating the root-to-leaf path of the elements/attributes. In PathValue, LeafOrder stores the document order of the leaf element. That is, given two leaf elements $d_1$ and $d_2$, $d_1$.LeafOrder < $d_2$.LeafOrder iff $d_1$ precedes $d_2$ in document order. The LeafOrder of the first leaf element of the document is 1, and $d_2$.LeafOrder = $d_1$.LeafOrder + 1 iff $d_1$ is the leaf element immediately preceding $d_2$. For example, the superscript of each leaf element in Fig. 3 denotes its LeafOrder value. In the Attribute table, for each attribute $d_a$, the stored LeafOrder is the LeafOrder of the parent element of $d_a$ in PathValue. Note that if the parent is an internal element, then the LeafOrder value of the first leaf descendant of the parent is assigned as the LeafOrder of $d_a$ (for reasons discussed later).

3.1 BranchOrder attribute

Given two consecutive leaf elements $d_1$ and $d_2$ where $d_1$.LeafOrder + 1 = $d_2$.LeafOrder, $d_2$.BranchOrder is the level of the nearest common ancestor (NCA) of $d_1$ and $d_2$ (denoted as NCALevel($d_1$, $d_2$)). For example, the leaf element name with LeafOrder = 3 in Fig. 3 has a BranchOrder value of 2, as the NCA of this element and the preceding price element is at the second level. The BranchOrder of the first leaf element is 0. The BranchOrder values of the remaining leaf elements are shown in Fig. 4.

Lemma 1 Given an element d whose level is $\ell$ and a descendant leaf element $d_c$ of d, $d_c$.BranchOrder < $\ell$ ⇔ $d_c$ is the first descendant leaf element of d (i.e., the left-most descendant leaf element).

Proof Let $d'_c$ be the leaf element immediately preceding $d_c$. From the definition of BranchOrder, $d_c$.BranchOrder = NCALevel($d'_c$, $d_c$). We prove the lemma by proving both directions.

⇐: When $d_c$ is the first descendant leaf element of d, since $d'_c$ precedes $d_c$, $d'_c$ is not a descendant of d. Therefore, NCALevel($d'_c$, $d_c$) < d.Level = $\ell$.

⇒: When $d_c$.BranchOrder = NCALevel($d'_c$, $d_c$) < $\ell$, let $d_a$ be the NCA of $d'_c$ and $d_c$. Since both $d_a$ and d are ancestors of $d_c$ and $d_a$.Level < d.Level, $d_a$ must be an ancestor of d. Therefore, d cannot be an ancestor of $d'_c$ (otherwise, d rather than $d_a$ would be the NCA). Since the immediately preceding leaf element of $d_c$ is not a descendant of d, $d_c$ must be the first descendant leaf element of d. □

From Lemma 1, for each element d (either internal or leaf) in the document, we can easily locate its first leaf descendant. Observe that BranchOrder has two key benefits. First, only leaf elements need to be explicitly stored in Andes. Data regarding an internal element can be deduced from its first descendant leaf element $d_c$. Second, when an internal element d needs to be retrieved, it can be "represented" by $d_c$. Consequently, only the latter needs to be retrieved. Note that since each d has only one $d_c$, no redundant elements are retrieved.
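The BranchOrder definition and Lemma 1 can be checked on a toy document with the following sketch. It assumes, purely for illustration, that each leaf is described by its Dewey label (the tuple of sibling positions from the root), so that the NCA level of two leaves is the length of their common Dewey prefix. The first four labels and the last one follow the running example; the remaining labels and all function names are hypothetical.

```python
# Leaves in document order, each given by an assumed Dewey label.
LEAVES = [
    (1, 1, 1), (1, 1, 2), (1, 1, 3, 1), (1, 1, 3, 2),
    (1, 2, 1), (1, 2, 2, 1),
    (1, 3, 1), (1, 3, 2, 1), (1, 3, 2, 2),
]


def nca_level(a, b):
    """Level of the nearest common ancestor: length of the common prefix."""
    level = 0
    for x, y in zip(a, b):
        if x != y:
            break
        level += 1
    return level


def branch_orders(leaves):
    """BranchOrder of every leaf in document order; the first leaf gets 0."""
    orders = [0]
    for prev, cur in zip(leaves, leaves[1:]):
        orders.append(nca_level(prev, cur))
    return orders


def first_leaf_descendant(leaves, orders, ancestor):
    """LeafOrder values of the leaves below `ancestor` (a Dewey prefix at
    level l) whose BranchOrder is smaller than l. By Lemma 1 this list
    holds exactly one entry: the first leaf descendant."""
    level = len(ancestor)
    return [i + 1
            for i, (leaf, bo) in enumerate(zip(leaves, orders))
            if leaf[:level] == ancestor and bo < level]
```

For the element with Dewey label (1, 1, 3), `first_leaf_descendant` returns the single LeafOrder 3, matching the statement above that the leaf with LeafOrder = 3 has BranchOrder 2.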

3.2 AncestorValue attribute

The AncestorValue attribute is associated with each level of an XML tree and is crucial in Andes for XPath evaluation. We first introduce the notion of a maximal k-consecutive leaf-node list to facilitate our exposition. Consider a list of consecutive leaf elements S: [$d_1, d_2, d_3, \ldots, d_r$] in D. Let k ∈ [1, $L_{max}$], where $L_{max}$ is the depth of the document. Then, S is called a k-consecutive leaf-node list of D iff $d_i$.BranchOrder ≥ k for all 0 < i ≤ r. S is called a maximal k-consecutive leaf-node list, denoted as $M_k$, if there does not exist a k-consecutive leaf-node list S′ such that |S| < |S′|. For example, $M_2$ in Fig. 3 contains three leaf elements, as |S| = 3 for $M_2$.

Definition 1 (AncestorValue) Let $L_{max}$ be the maximum level of an XML tree. Then, the AncestorValue of level $\ell$ for $0 < \ell < L_{max}$, denoted as $A_\ell$, is defined as follows:

– If $\ell = L_{max} - 1$, then $A_\ell = 1$
– If $0 < \ell < L_{max} - 1$, then $A_\ell = A_{\ell+1} \times (|M_{\ell+1}| + 1)$

For example, reconsider the XML tree in Fig. 3. Here, $L_{max} = 4$, $|M_3| = 1$, and $|M_2| = 3$. Hence, $A_3 = 1$, $A_2 = 1 \times (1 + 1) = 2$, and $A_1 = 2 \times (3 + 1) = 8$. Next, we discuss certain characteristics of AncestorValue that play a pivotal role in the evaluation of NOT-predicates.

Lemma 2 Let $\ell$ be a level in an XML tree where $0 < \ell < L_{max}$. Then, $A_\ell$ is divisible by all $A_{\ell+m}$ where $0 < m < (L_{max} - \ell)$.

Proof The maximum number of consecutive leaf elements with BranchOrder ≥ k is $|M_k|$. Given any element at level k, all but one of its descendant leaf elements have BranchOrder ≥ k (Lemma 1). Hence, any node at level k has at most $|M_k| + 1$ descendant leaf elements. Recall that the AncestorValue of a level is the product of the AncestorValue of the next level and the maximum number of descendant leaf elements at that level. Suppose that $A_\ell$ is not divisible by $A_{\ell+m}$ where $0 < m < (L_{max} - \ell)$.
Based on the definition of AncestorValue:

$$A_\ell = A_{\ell+1} \times (|M_{\ell+1}| + 1) = A_{\ell+2} \times (|M_{\ell+2}| + 1) \times (|M_{\ell+1}| + 1)$$

Then,

$$\frac{A_\ell}{A_{\ell+1}} = |M_{\ell+1}| + 1, \qquad \frac{A_\ell}{A_{\ell+2}} = (|M_{\ell+2}| + 1) \times (|M_{\ell+1}| + 1), \qquad \frac{A_{\ell+1}}{A_{\ell+2}} = |M_{\ell+2}| + 1$$

$$\Rightarrow \frac{A_\ell}{A_{\ell+m}} = \prod_{i=\ell+1}^{\ell+m} (|M_i| + 1)$$

where $\mathbb{Z}$ is the set of integers and $\prod_{i=\ell+1}^{\ell+m} (|M_i| + 1) \in \mathbb{Z}$. ∴ $A_\ell$ is divisible by all $A_{\ell+m}$ where $0 < m < (L_{max} - \ell)$, and $A_\ell$ is larger than $A_{\ell+m}$. □

Let us illustrate the above lemma using the previous example. Let $\ell = 1$. Then, 0 < m < 3. Hence, based on the above lemma, $A_1/A_2 = 8/2 = 4$ and $A_1/A_3 = 8/1 = 8$.

3.3 AncestorDeweyGroup attribute

The AncestorDeweyGroup attribute encodes an element's order information together with its ancestors' order using a single integer. First, let us elaborate on the ancestor's order concept. Let parent(w) denote the parent of an element w. Consider a leaf element d ∈ D at level $\ell$. Then, for $1 < k \le \ell$, Ord(d, k) = i iff (i) there exists an element $d_a$ at level k, which is either an ancestor of d or d itself; and (ii) $d_a$ is the i-th child of parent($d_a$). For example, consider the rightmost leaf element website in Fig. 3 (denoted as d). Ord(d, 2) = 3, as the rightmost book element in the second level is an ancestor of d as well as the third child of the root. Similarly, Ord(d, 3) = 2 and Ord(d, 4) = 2.

Definition 2 (AncestorDeweyGroup) Consider a leaf element d at level $\ell$ in an XML document. The AncestorDeweyGroup of d, d.AncestorDeweyGroup, is defined as $\sum_{j=2}^{\ell} [Ord(d, j) - 1] \times A_{j-1}$. The AncestorDeweyGroup of an attribute is that of its parent element.

For example, reconsider the last leaf element in Fig. 3 with Dewey value "1.3.2.2". The AncestorDeweyGroup of this element is d.AncestorDeweyGroup = $(Ord(d, 2) - 1) \times A_1 + (Ord(d, 3) - 1) \times A_2 + (Ord(d, 4) - 1) \times A_3 = 2 \times 8 + 1 \times 2 + 1 \times 1 = 19$. The AncestorDeweyGroup values of the remaining elements are shown in the PathValue table in Fig. 4. Note that Andes explicitly stores only the AncestorDeweyGroup of leaf elements. Each internal element is implicitly assigned the AncestorDeweyGroup of its first leaf descendant.
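Definitions 1 and 2 can be reproduced numerically with the sketch below. As an assumption made only for this illustration, leaves are again given by Dewey labels, so that Ord(d, j) is simply the j-th component of the label; the labels beyond the first four and the last one, and all function names, are hypothetical.

```python
# Leaves in document order, each given by an assumed Dewey label.
LEAVES = [
    (1, 1, 1), (1, 1, 2), (1, 1, 3, 1), (1, 1, 3, 2),
    (1, 2, 1), (1, 2, 2, 1),
    (1, 3, 1), (1, 3, 2, 1), (1, 3, 2, 2),
]


def branch_orders(leaves):
    """BranchOrder per leaf: common-prefix length with the previous leaf."""
    orders = [0]
    for prev, cur in zip(leaves, leaves[1:]):
        level = 0
        for x, y in zip(prev, cur):
            if x != y:
                break
            level += 1
        orders.append(level)
    return orders


def max_run(orders, k):
    """|M_k|: longest run of consecutive leaves with BranchOrder >= k."""
    best = run = 0
    for bo in orders:
        run = run + 1 if bo >= k else 0
        best = max(best, run)
    return best


def ancestor_values(leaves):
    """Definition 1: a dict mapping each level l (0 < l < L_max) to A_l."""
    orders = branch_orders(leaves)
    l_max = max(len(leaf) for leaf in leaves)
    a = {l_max - 1: 1}
    for level in range(l_max - 2, 0, -1):
        a[level] = a[level + 1] * (max_run(orders, level + 1) + 1)
    return a


def ancestor_dewey_group(leaf, a):
    """Definition 2: sum over j = 2..l of (Ord(d, j) - 1) * A_{j-1}."""
    return sum((leaf[j - 1] - 1) * a[j - 1] for j in range(2, len(leaf) + 1))
```

On this toy document the sketch yields the AncestorValue figures quoted in the text (8, 2, and 1 for levels 1 to 3) and an AncestorDeweyGroup of 19 for the leaf labeled 1.3.2.2.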
The AncestorDeweyGroup of each leaf element in a document is also unique, and the pair (DocId, AncestorDeweyGroup) can be used as the key of the PathValue table.

3.4 Storage space comparison

The space efficiency of Andes is comparable to most state-of-the-art encoding schemes for the following three reasons. Firstly, the number of tuples in the Document, DocLevel, and StructSummary tables depends on the number of documents, their depths, and the size of their structural summaries, respectively. All of these are orthogonal to the data size and usually small in practice. Secondly, Andes only stores leaf elements and attributes but not internal elements. Thus, significantly fewer tuples are materialized compared to encoding schemes such as the Dewey-based scheme [17,20,31], the containment scheme [6,35], and the prime number scheme [30], where both internal and leaf elements are stored. Thirdly, for each node (leaf element or attribute), we store six attributes, namely DocId, LeafOrder, AncestorDeweyGroup, BranchOrder, PathId, and LeafValue. Note that LeafValue must be stored by any encoding scheme, and DocId must be stored to support multi-document queries. Hence, our encoding scheme essentially uses four integers, whereas containment encoding schemes use at least three attributes (pre, post, level) for each node. As a result, Andes does not take significantly more storage space.

Table 2 Storage space of Andes vs. containment encoding scheme (without indexes)

Data set   Andes (MB)   Pre-post-level scheme (MB)
dc1000     1,062        1,292
U284       512.4        539.6

Table 2 reports the storage space requirements (without indexes) of two representative benchmark data sets (Fig. 2) in MS SQL Server 2008 using Andes and the containment encoding scheme. Observe that Andes takes less space than the containment encoding scheme. Note that for the containment scheme, we only use the basic pre-post-level scheme along with node values. Storing additional attributes such as the kind of node (e.g., [12]) would further increase its storage space. Additionally, a naïve Dewey-based scheme requires an array of integers for each node, taking significantly more space when the document is deep. Although [17,20,31] compress this integer array into a single binary string, its length remains dependent on the document depth. Note that although the prime encoding scheme stores a single prime number for each node, to compare the document order between nodes it uses a list of Simultaneous Congruence values, which can grow large as the data size increases.

4 Ancestor group-based NOT-twig evaluation

In this section, we present a novel approach for efficiently evaluating NOT-twig queries using Andes. The main idea of our approach is to group all elements under the same ancestor at a specific level $\ell$ with a unique ancestor group identifier. This allows us to compute and compare elements' ancestors at a specific level $\ell$ easily. We begin by formally introducing the notion of ancestor group identifier. Then, we present how it can be used for evaluating NOT-twig queries. In the next section, we present an SQL translation algorithm that realizes our proposed approach.

4.1 Ancestor group identifier

Given an internal element d at level $\ell$ of an XML tree, a unique ancestor group identifier with respect to $\ell$ is assigned to all the descendant leaf element(s) of d. It is computed using the AncestorDeweyGroup values of the leaf elements and the AncestorValue at level $\ell - 1$. Formally, it is defined as follows.

Definition 3 (Ancestor Group Identifier) Let $d_i$ be a leaf element in the XML tree D. Let $d_a$ be an ancestor element of $d_i$ at level $\ell > 1$. Then, the Ancestor Group Identifier of $d_i$ with respect to $d_a$ at level $\ell$, denoted as $G_i^\ell$, is defined as follows.

$$G_i^\ell = \left\lfloor \frac{d_i.\mathit{AncestorDeweyGroup}}{A_{\ell-1}} \right\rfloor$$

For example, consider the leaf elements $d_1$, $d_2$, $d_3$, and $d_4$ in Fig. 3 ($d_i$ refers to the leaf element with LeafOrder i). The AncestorDeweyGroup values of these elements are 0, 2, 4, and 5, respectively. Also, $A_1 = 8$ and $A_2 = 2$. If we consider the first book element at level 2 as the ancestor element of these elements, then $G_1^2 = \lfloor 0/A_{2-1} \rfloor = \lfloor 0/8 \rfloor = 0$, $G_2^2 = \lfloor 2/A_{2-1} \rfloor = \lfloor 2/8 \rfloor = 0$, $G_3^2 = \lfloor 4/8 \rfloor = 0$, and $G_4^2 = \lfloor 5/8 \rfloor = 0$. However, if we consider the publisher element at level 3 as the ancestor element, then $G_3^3 = \lfloor 4/2 \rfloor = 2$ and $G_4^3 = \lfloor 5/2 \rfloor = 2$. Observe that the descendant leaf elements of a given ancestor element have identical ancestor group identifiers. As we shall see later, this equality of identifier values is important for our NOT-twig evaluation strategy. Note that we do not define the ancestor group identifier with respect to the root element ($\ell = 1$) because it is a trivial case: all leaf elements in the document would have the same identifier value.

Ancestor group identifiers of non-leaf elements. Observe that in the above definition only the leaf elements have explicit ancestor group identifiers. In fact, we can assign the ancestor group identifiers to the internal elements implicitly. The basic idea is as follows. Let $d_c$ be the NCA at level $\ell$ of two leaf elements $d_i$ and $d_j$ with ancestor group identifiers equal to $G^\ell$. Then, the ancestor group identifier of all non-leaf elements in the subtree rooted at $d_c$ is $G^\ell$. Note that the ancestor group identifier of the leaf elements is conceptually propagated to their ancestor non-leaf elements in $d_c$. For example, reconsider the first book element at level 2 as the root of the subtree. Then, the ancestor group identifiers of the publisher and name elements are 0. Note that these identifiers are not stored explicitly in Andes, as they can be computed from AncestorDeweyGroup and AncestorValue.

Role of ancestor group identifiers to evaluate descendant axis. Observe that a key property of the ancestor group identifier is that all descendants of an ancestor element at a specific level must have the same identifier. We can exploit this feature to efficiently evaluate the descendant axis. Given a query a//b,


let $d_a$ and $d_b$ be elements of types a and b, respectively. Then, whether $d_b$ is a descendant of $d_a$ can be determined using the above definition, as all descendants of $d_a$ must have the same ancestor group identifier. As we shall see later, this equality property is also important for structural joins and the not-twig evaluation strategy.

4.2 Computation of common ancestor

Given two elements in an XML tree, in this section, we discuss how the ancestor group identifier can be exploited to compute the level of a common ancestor. We first introduce the following lemma that we shall be using subsequently.

Lemma 3 Let $d$ be a leaf element at level $\ell$ of an XML tree $D$. Then, $\left\lfloor \sum_{j=k}^{\ell} ([Ord(d, j) - 1] \times A_{j-1}) / A_{k-2} \right\rfloor = 0$ for $k \in [3, \ell]$.

Proof The maximum number of consecutive leaf elements with BranchOrder $\ge k$ is $|M_k|$. Given any element at level $k$, all but one of the descendant leaf elements of this element have BranchOrder $\ge k$ (Lemma 1). Hence, any element at level $k$ has at most $|M_k| + 1$ descendant leaf elements. Given $Ord(d, t)$ at each level $t \in [k, \ell]$, any ancestor of $d$ at level $t - 1$ has at least $[Ord(d, t) - 1]$ children that are neither $d$ nor an ancestor of $d$. Each of these nodes is either a leaf element or has at least one descendant leaf element. Thus, an ancestor of $d$ at level $t - 1$, excluding $d$, has at least $[Ord(d, t) - 1]$ descendant leaf elements, all of which are descendants of the ancestor of $d$ at level $k - 1$ and are not descendants of any ancestor of $d$ at a level greater than $t - 1$. Therefore, there is an element at level $k - 1$ with at least $\sum_{t=k}^{\ell} [Ord(d, t) - 1] + 1$ leaf descendants (including $d$). Accordingly, $\sum_{t=k}^{\ell} [Ord(d, t) - 1] + 1 \le |M_{k-1}| + 1$. Then,

[...] $\sum_{j=k}^{\ell} ([Ord(d, j) - 1] \times A_{j-1}) < A_{k-2}$, and therefore $\left\lfloor \sum_{j=k}^{\ell} ([Ord(d, j) - 1] \times A_{j-1}) / A_{k-2} \right\rfloor = 0$.

[...]

[...] $\left\lfloor \frac{\sum_{j=2} ([Ord(d_2, j) - 1] \times A_{j-1})}{A_{\ell-1}} \right\rfloor$ [...] $\sum_{j=2} ([Ord(d_2, j) - 1] \times A_{j-1})$

Thus, $\exists t \in [2, \ell]$, $Ord(d_1, t) \ne Ord(d_2, t)$. It means $d_1$ and $d_2$ do not have a common ancestor at level $t$. Since $t \le \ell$, $d_1$ and $d_2$ also do not have a common ancestor at level $\ell$.

[...] Using Lemma 4, $G_1^{\ell} = G_2^{\ell}$ indicates:

$$\left\lfloor \frac{d_1.\mathit{AncestorDeweyGroup}}{A_{\ell-1}} \right\rfloor = \left\lfloor \frac{d_2.\mathit{AncestorDeweyGroup}}{A_{\ell-1}} \right\rfloor$$

$$\left\lfloor \frac{\sum_{j=2} (O_{1,j} \times A_{j-1})}{A_{\ell-1}} \right\rfloor = \left\lfloor \frac{\sum_{j=2} (O_{2,j} \times A_{j-1})}{A_{\ell-1}} \right\rfloor$$

$$\sum_{j=2} (O_{1,j} \times A_{j-1}) = \sum_{j=2} (O_{2,j} \times A_{j-1}) \qquad (1)$$

We will prove that $Ord(d_1, j) = Ord(d_2, j)$ $\forall j \in [2, \ell]$. Let $t \in [2, \ell]$ be the minimum value for which $Ord(d_1, t) \ne Ord(d_2, t)$. Assuming $Ord(d_1, t) < Ord(d_2, t)$, that means $Ord(d_1, t) + 1 \le Ord(d_2, t)$. From Lemma 3, $\sum_{j=t+1} ([Ord(d_1, j) - 1] \times A_{j-1}) < A_{t-1}$. Hence,

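$$\sum_{j=2} (O_{1,j} \times A_{j-1}) = \sum_{j=2}^{t} (O_{1,j} \times A_{j-1}) + \sum_{j=t+1} (O_{1,j} \times A_{j-1}) \; [\ldots]$$

[...] $\ell > 1$, $d_i$ must have the same ancestor as $d_j$ at level $\ell$ iff $G_i^{\ell} = G_j^{\ell}$.

Proof The proof follows from Lemmas 5 and 6.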
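Definition 3 and the equality test stated in Theorem 1 can be illustrated with a short Python sketch. The function names `ancestor_group_id` and `share_ancestor_at` and the dictionary layout are assumptions made for this illustration only; the numeric values are the ones quoted in the Fig. 3 example of Sect. 4.1.

```python
def ancestor_group_id(ancestor_dewey_group: int, ancestor_values: dict, level: int) -> int:
    """Return floor(AncestorDeweyGroup / A_{level-1}) as in Definition 3."""
    if level <= 1:
        # The identifier is not defined with respect to the root (level 1).
        raise ValueError("ancestor group identifiers require level > 1")
    return ancestor_dewey_group // ancestor_values[level - 1]


def share_ancestor_at(adg_i: int, adg_j: int, ancestor_values: dict, level: int) -> bool:
    """Equality test of Theorem 1: same identifier at `level` means same ancestor."""
    return (ancestor_group_id(adg_i, ancestor_values, level)
            == ancestor_group_id(adg_j, ancestor_values, level))


# Values quoted in the text: A_1 = 8, A_2 = 2, and the AncestorDeweyGroup
# values of d_1..d_4, keyed here by LeafOrder (hypothetical layout).
A = {1: 8, 2: 2}
adg = {1: 0, 2: 2, 3: 4, 4: 5}

level2 = {i: ancestor_group_id(v, A, 2) for i, v in adg.items()}
level3 = {i: ancestor_group_id(v, A, 3) for i, v in adg.items()}

print(level2)                                  # all four leaves fall under the first book
print(level3[3], level3[4])                    # d_3 and d_4 fall under the same publisher
print(share_ancestor_at(adg[3], adg[4], A, 3))
```

Integer floor division is all that is needed, which is why the comparison can be pushed into a relational join condition without retrieving the ancestor elements themselves.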

 

Note that Lemmas 5 and 6, and Theorem 1 can also be used for internal elements since the ancestor group identifier of an internal element of a subtree rooted at the NCA is identical to that of any leaf element in the subtree (Sect. 4.1).

4.3 Evaluation of NOT-twig query

We now discuss how ancestor group identifiers are exploited for evaluating not-twig queries. To illustrate the idea, let us consider the example shown in Fig. 5. Figure 5b depicts the logical steps to evaluate the twig pattern in Fig. 5a. In this example, we use the document in Fig. 3 and the fragment of the PathValue table in Fig. 4 for illustration. Note that, for clarity, Fig. 5b only depicts the AncestorDeweyGroup values of the elements (denoted as $Y_i$ where $i$ is the LeafOrder). Our evaluation can be broadly classified into two steps (enclosed in two dotted rectangles).

The first step consists of selection operators based on the conditions specified in the nodes (i.e., tag condition and value condition). For example, Fig. 5b depicts the selections of the condition node catalog/book//publisher/name='Pearson' and the return node catalog/book/title. The two retrieved groups of elements are $N_{name} = \{Y_9\}$ and $N_{title} = \{Y_1, Y_5, Y_8\}$, respectively.

The second step consists of a join operation (Theorem 1), which is preceded by computation of the ancestor group identifiers based on Definition 3. For example, in Fig. 5b, the ancestor group identifiers of all elements in $N_{name}$ and $N_{title}$ are computed and then joined together by exploiting Theorem 1. The join is actually a right anti-join (see footnote 7) that retrieves those elements in $N_{title}$ that do not have an ancestor group identifier identical to that of an element in $N_{name}$. Consequently, the elements $Y_1$ and $Y_5$ are returned as the result of this join (as well as of the query).

Observe that by combining Lemmas 5 and 6, ancestor comparison between two elements (more specifically,

Footnote 7: A right anti-join is a variant of the relational join operator which outputs only those tuples in the right relation that do not join with any tuple in the left relation.
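The two-step evaluation described in Sect. 4.3 can be sketched in Python as a selection followed by an anti-join on ancestor group identifiers. This is only an illustration under stated assumptions: the AncestorDeweyGroup values assigned to $Y_1$, $Y_5$, $Y_8$ and $Y_9$ below are hypothetical (Fig. 4 is not reproduced here), and the helper name `group_id` is invented for the sketch. Only $A_1 = 8$ and the two selected sets come from the text.

```python
A1 = 8  # AncestorValue at level 1, as quoted in the Sect. 4.1 example

# Hypothetical AncestorDeweyGroup values keyed by LeafOrder (illustrative only).
adg = {1: 0, 5: 8, 8: 16, 9: 20}


def group_id(leaf_order: int) -> int:
    """Ancestor group identifier at level 2 (the book level): floor(ADG / A_1)."""
    return adg[leaf_order] // A1


# Step 1: selections (taken from the running example in the text).
n_name = [9]          # catalog/book//publisher/name='Pearson'
n_title = [1, 5, 8]   # catalog/book/title

# Step 2: right anti-join on the level-2 identifier. A title survives only
# if no selected name shares its book ancestor.
blocked = {group_id(y) for y in n_name}
result = [y for y in n_title if group_id(y) not in blocked]

print(result)
```

With these assumed values, $Y_8$ and $Y_9$ share a level-2 identifier and $Y_8$ is filtered out, mirroring the outcome described for Fig. 5b.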


Fig. 5 Overview of evaluation of a not-twig query in Andes: (a) the twig pattern; (b) the logical evaluation steps

5 SQL translation algorithm

In the preceding section, we discussed how ancestor group identifiers can be utilized to evaluate a not-twig query. In this section, we present an algorithm to translate a not-twig query to an equivalent SQL query over Andes that utilizes these ancestor group identifiers. We begin by introducing the notion of twig evaluation tree, which we shall be exploiting subsequently.

5.1 Twig evaluation tree (TET)

Reconsider the example in Sect. 4.3. Notice that although the twig pattern in Fig. 5a has five nodes, only two of these are involved in the selection operation. Consequently, although there are four edges in the twig, only one structural join is sufficient for evaluating it. In other words, it is important to minimize the number of selections and joins in order to optimize query performance. To achieve that, we extend the Path Materialization (PM) technique introduced in [32]. The key idea of the PM technique is to avoid selection on an internal query node by enforcing the path condition on its positive leaf descendants. For instance, consider the subtwig rooted at the publisher node in the twig in Fig. 5a. If we strictly follow the twig pattern matching rule (Sect. 2.2), then all publisher matches and name matches are retrieved and then their PC relations are validated. Using PM, only name matches satisfying the path expression catalog/book//publisher/name are retrieved (denoted as $N_{name}$). All publisher matches must be an ancestor of a node in $N_{name}$. Similarly, matches for the subtwig rooted at title are all nodes satisfying catalog/book/title (denoted as $N_{title}$). The twig query results are all nodes in $N_{title}$ whose book parent has no descendants in $N_{name}$ (we call this operation a join between $N_{title}$ and $N_{name}$).

Compared to the PM technique, our approach makes two significant contributions. First, [32] uses the (pre, post) containment scheme [6,35], which can only support AD evaluation. Thus, to support the join between $N_{title}$ and $N_{name}$, [32] needs to retrieve all matches to book. On the other hand, our


novel encoding scheme supports checking of common ancestors without physically retrieving the ancestors. For example, we can simply evaluate whether nodes in $N_{title}$ and $N_{name}$ have a common ancestor at level 2 (all matches to book must be at level 2). Second, [32] only focused on normal twig queries in which all children are positive. However, in not-twig queries, some internal nodes may have no positive children and, as a result, the PM technique cannot be applied here.

To systematically summarize our technique and simplify discussions, we introduce the notion of twig evaluation tree (TET). Informally, a TET is a tree whose nodes are query nodes requiring selection (named σ-nodes) and whose edges represent the required joins between them.

Definition 4 (σ-node) Given a twig pattern tree $Q(N, E)$, a node $n \in N$ is a σ-node if $n$ is either: (i) the return node $n_r$; or (ii) a leaf node; or (iii) a node with at least one negative child and no positive child. The set of all σ-nodes in $Q$ is denoted as $N_\sigma \subseteq N$.

For example, consider the not-twig in Fig. 1a. It has four nodes but only the location and name nodes are σ-nodes because they are a leaf node and the return node, respectively. Similarly, the not-twig in Fig. 1b has three σ-nodes (price, name, title). The not-twig in Fig. 1c has five σ-nodes (title, description, website, publisher and book). Note that although publisher is neither a leaf node nor a return node, it is a σ-node because it has one negative child but no positive child.

Before defining the edges of the TET, we introduce an auxiliary definition of υ-nodes. Reconsider the twig in Fig. 5. The negative edge between publisher and its parent book node is materialized by a join between matches to name and title. We call the publisher node the υ-node of the name node.

Definition 5 (υ-node) Given a twig pattern tree $Q(N, E)$ and a σ-node $n \in N$, a node $n' \in N$ is called the υ-node of $n$, denoted as $\upsilon(n)$, iff (i) $n = n'$ or $n$ is a positive rightmost descendant of $n'$; (ii) there are no σ-nodes on the path from $n$ to $n'$ except $n$ itself; and (iii) $n'$ is the highest (minimum level) node in $N$ satisfying both (i) and (ii).
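Definitions 4 and 5 can be checked mechanically on a small twig. The sketch below is written under assumptions: the class `QNode` and the functions `is_sigma` and `upsilon` are hypothetical names, "positive rightmost descendant" is read as "reached by repeatedly following the rightmost positive child", and the shape of the not-twig of Fig. 1a (catalog, publisher, a negative location child and the return node name) is inferred from the prose rather than taken from the figure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class QNode:
    name: str
    positive: bool = True        # sign of the edge to the parent
    is_return: bool = False
    children: list = field(default_factory=list)
    parent: Optional["QNode"] = None

    def add(self, child: "QNode") -> "QNode":
        child.parent = self
        self.children.append(child)
        return child


def is_sigma(n: QNode) -> bool:
    """Definition 4: return node, leaf, or only-negative-children node."""
    if n.is_return or not n.children:
        return True
    return not any(c.positive for c in n.children)


def upsilon(n: QNode) -> QNode:
    """Definition 5: climb while n stays the rightmost positive descendant
    and the node climbed to is not itself a sigma-node."""
    top = n
    while top.parent is not None:
        p = top.parent
        positives = [c for c in p.children if c.positive]
        if not top.positive or positives[-1] is not top or is_sigma(p):
            break
        top = p
    return top


# Twig assumed for Fig. 1a: catalog -> publisher -> {NOT location, name (return)}.
catalog = QNode("catalog")
publisher = catalog.add(QNode("publisher"))
location = publisher.add(QNode("location", positive=False))
name = publisher.add(QNode("name", is_return=True))

sigma_nodes = [q.name for q in (catalog, publisher, location, name) if is_sigma(q)]
print(sigma_nodes)
print(upsilon(name).name, upsilon(location).name)
```

On this assumed twig the sketch reports location and name as the σ-nodes, with υ-nodes location and catalog, which agrees with the description of Fig. 1a given in Example 3.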

Fig. 6 Twig evaluation trees of the not-twigs in Fig. 1: panels (a), (b) and (c) correspond to the not-twigs in Fig. 1a, b and c, respectively

Note that the υ-node of $n_r$ is always the root of the twig and the υ-nodes are unique for all σ-nodes. For example, in the not-twig in Fig. 1b, the three υ-nodes of the three σ-nodes price, name, title are price, publisher, book, respectively. Figure 1c has five σ-nodes description, website, publisher, title and book with five υ-nodes description, website, publisher, title and catalog, respectively. Note that although title is a positive rightmost descendant of catalog, its υ-node is not catalog but itself because the path from title to catalog contains one σ-node (book).

Definition 6 [Twig Evaluation Tree (TET)] Given a not-twig query $Q(N, E)$, a twig evaluation tree $M(N_\sigma, E_\sigma)$ is a tree satisfying the following properties:

– $N_\sigma$ consists of the set of σ-nodes in $Q$.
– The return node $n_r$ is the root of the tree.
– A node $n_\sigma \in N_\sigma$ ($n_\sigma \ne n_r$) is a child of $n'_\sigma \in N_\sigma$ in $M$ iff $parent(\upsilon(n_\sigma))$ is on the path from $\upsilon(n'_\sigma)$ to $n'_\sigma$ in $Q$, where $parent(n)$ returns the parent of $n$ in $Q$.
– Each edge in $E_\sigma$ from child $n_\sigma$ to parent $n'_\sigma$ has a boolean label pFlag, which is positive when the edge between $\upsilon(n_\sigma)$ and $parent(\upsilon(n_\sigma))$ in $E$ is positive; otherwise, it is negative.

Example 3 Figure 6 depicts the TETs of the three not-twigs in Fig. 1. Consider the query in Fig. 1a. $N_\sigma$ consists of the location and name nodes, whose υ-nodes are location and catalog, respectively. They are connected by an edge with negative pFlag (Fig. 6a) since location is a negative child of publisher, which is on the path from catalog to name. Similarly, consider the twig pattern in Fig. 1b. The σ-nodes are price, name and title whose υ-nodes are price, publisher and catalog, respectively. The TET is depicted in Fig. 6b. Finally, the TET of Fig. 1c is shown in Fig. 6c. The σ-nodes are description, website, publisher, title and book with υ-nodes description, website, publisher, title and catalog, respectively.
Matching a TET. We now briefly discuss how we can match data nodes to the TET and, accordingly, produce the results for the not-twig pattern. Given a twig query Q on


a document $D$, two nodes $n, n_a \in Q$ and two node sets $N_1$, $N_2$, we assume two generic operations exist, namely, $select(n)$ and $join(N_1, N_2, n_a)$. The operation $select(n)$ returns all nodes in $D$ satisfying the value and path conditions of $n$. The operation $join(N_1, N_2, n_a)$ returns all nodes in $N_1$ that have a common ancestor in $select(n_a)$ with at least one node in $N_2$. In Sect. 5.2, we shall discuss how to implement those two operations efficiently in Andes. In the sequel, to avoid confusion between nodes in $Q$ and nodes in the TET, we use $C_n$ to refer to a node $n$ in the TET.

Definition 7 (σ-node matching) Given a twig pattern $Q$ on document $D$ with its corresponding TET $M$ and a σ-node $n$ in $Q$ with its corresponding node $C_n$ in $M$, the set of matches in $D$ to $C_n$, denoted as $D(C_n)$, is defined as follows:

– If $C_n$ is a leaf node in $M$ then $D(C_n) = select(n)$.
– If $C_n$ is a non-leaf node in $M$ then $D(C_n) = \bigcap_{C_{n_i} \in children(C_n)} f(n, n_i)$ where $n_a$ is the NCA of $n$ and $n_i$ in $Q$ and (a) $f(n, n_i) = join(select(n), D(C_{n_i}), n_a)$ if $edge(C_{n_i}, C_n).pFlag$ is positive in $M$, or (b) $f(n, n_i) = select(n) \setminus join(select(n), D(C_{n_i}), n_a)$ if $edge(C_{n_i}, C_n).pFlag$ is negative.

For example, consider the TET in Fig. 6a and the corresponding twig pattern in Fig. 1a. If we match this TET against the document in Fig. 3, then $D(C_{location}) = select(location) = \{d_4\}$ ($d_i$ refers to the node with ID $i$ in Fig. 3) and $D(C_{name}) = select(name) \setminus join(select(name), \{d_4\}, publisher) = \{d_9\}$. Note that publisher is the NCA of name and location in Fig. 1a and $join(select(name), \{d_4\}, publisher) = \{d_3\}$. Notice that $D(C_{name}) = \{d_9\}$ is also the result of the twig.

Theorem 2 Given a twig $Q$ on document $D$ with its corresponding TET $M$, $D(C_{n_r})$ is equal to the result of $Q$.

Proof We shall prove this theorem using the loop invariant technique. The invariant is: for any σ-node $n$ in $Q$, $D(C_n)$ is equal to the set of matches of $n$ satisfying the subtwig rooted at $\upsilon(n)$ in $Q$.

Initialization: When $C_n$ is a leaf node in $M$, $D(C_n) = select(n)$. Since $C_n$ is a leaf node in $M$, according to Definition 6, all query nodes in $Q$ from $\upsilon(n)$ to $n$ are not branching nodes. Thus, all nodes in $select(n)$ satisfy the subtwig rooted at $\upsilon(n)$.

Maintenance: When $C_n$ is an internal node in $M$, assume our invariant holds for all children of $C_n$. For any query node $n_a$ on the path from $n$ to $\upsilon(n)$, from Definition 6, a child $n_c$ of $n_a$ is either an ancestor-or-self of $n$ or the υ-node of a child of $C_n$ in $M$. Let $d$ be a data node in $D(C_n)$. Due to Definition 7, there exists an ancestor of $d$ satisfying $n_a$. Thus, $d$ satisfies the subtwig rooted at $\upsilon(n)$.

Termination: The process terminates at $C_{n_r}$, the root of $M$. Recall that $\upsilon(n_r)$ is the root of the twig $Q$. The loop invariant shows that all nodes in $D(C_{n_r})$ satisfy $Q$; that is, $D(C_{n_r})$ is the result of the twig pattern $Q$.

In summary, the TET offers the following benefits for processing not-twig queries.

– It provides a structured method to identify which nodes require selection and how to join them. Thus, our translation algorithm (discussed in Sect. 5.4) is built upon the TET. In fact, the translation algorithm and the generated SQL query closely follow the structure of the TET.
– A TET is independent of any specific XML encoding scheme. Thus, it can be used on any encoding scheme (not necessarily Andes) using selection and join as the basic operators.

5.2 The algorithm translateANDES

Algorithm 1 outlines the procedure to translate a not-twig pattern to an SQL query over Andes. It consists of the following three key steps.

1. The procedure buildTET is used to parse the twig pattern $Q$ and build the TET of $Q$.
2. The procedure generateSQL is used to exploit the data stored in the TET to build an SQL query $SQL$. The structural summary $S$ (available in Andes through the StructSummary table) is used to resolve the path expression of the twig patterns. To improve performance and reduce redundancy, all subqueries in $SQL$ only retrieve the composite key attribute values (DocId, LeafOrder).
3. The procedure generateFinalSQL is invoked to create the final SQL query $SQL_{final}$ by selecting all relevant subtree information of the set of composite keys generated in the preceding step. It also enforces document order of the final results and exploits the query hints feature of the underlying relational framework to improve performance.

Next, we elaborate on these procedures in turn.

Algorithm 1: Algorithm translateANDES
Input: The twig query Q
Input: Structural Summary S
Output: An SQL query SQL_final
    C_{n_r} ← buildTET(Q, Q.root)                 /* Build TET of Q */
    SQL ← generateSQL(C_{n_r}, S)                 /* Build the core SQL */
    SQL_final ← generateFinalSQL(SQL, C_{n_r})    /* Build the final SQL */
    return SQL_final

Algorithm 2: Algorithm buildTET
Input: The twig query Q
Input: A node n in Q
Output: A σ-node C_n which is the root of a TET built by the subtree of Q rooted at n
1   r_n ← null;
2   for each child node n_i of n do
3       r_{n_i} ← buildTET(Q, n_i);
4   if n is the return node or n has no positive child then
5       Create C_n and r_n ← C_n                  /* n is a σ-node */
6   else
7       r_n ← r_{n_k} where n_k is the rightmost positive child of n
8   for each r_{n_i} ≠ r_n do
9       r_n.addChild(r_{n_i}, edge(n, n_i).pFlag);
10  return r_n
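The recursion of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the classes `QNode` and `TetNode` and the function `build_tet` are hypothetical names, the twig query Q is represented implicitly by its node objects, and the sample twig is the shape assumed for Fig. 1a from the prose (catalog, publisher, a negative location child and the return node name).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class QNode:
    name: str
    positive: bool = True      # sign of the edge from the parent (pFlag)
    is_return: bool = False
    children: list = field(default_factory=list)


@dataclass(eq=False)
class TetNode:
    query_node: QNode
    children: list = field(default_factory=list)   # (TetNode, p_flag) pairs


def build_tet(n: QNode) -> TetNode:
    # Lines 2-3: visit the children first (bottom-up, postorder).
    results = [build_tet(c) for c in n.children]
    positives = [i for i, c in enumerate(n.children) if c.positive]
    if n.is_return or not positives:
        # Lines 4-5: n is a sigma-node, so it gets its own TET node.
        r_n = TetNode(n)
    else:
        # Line 7: inherit the TET node of the rightmost positive child.
        r_n = results[positives[-1]]
    # Lines 8-9: attach every other child result, labelled with the edge sign.
    for child, r_child in zip(n.children, results):
        if r_child is not r_n:
            r_n.children.append((r_child, child.positive))
    return r_n


# Twig assumed for Fig. 1a.
name = QNode("name", is_return=True)
location = QNode("location", positive=False)
publisher = QNode("publisher", children=[location, name])
catalog = QNode("catalog", children=[publisher])

root = build_tet(catalog)
print(root.query_node.name)
print([(c.query_node.name, flag) for c, flag in root.children])
```

Each query node is visited exactly once, so the sketch keeps the linear cost stated for the procedure; on the assumed twig it yields a root for name with a single negative edge to location, as described for Fig. 6a.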

5.3 The buildTET procedure

The buildTET procedure is outlined in Algorithm 2. It traverses $Q$ recursively by visiting each node $n$ in bottom-up postorder fashion (Lines 2–3). The variable $r_n$ stores the result returned by buildTET($n$). For any $n$, $r_n$ refers to the TET node $C_{n_\sigma}$ of a σ-node $n_\sigma$ such that $n$ satisfies conditions (i) and (ii) of Definition 5. That is, $n_\sigma$ is a positive rightmost descendant of $n$ and there are no σ-nodes on the path from $n$ to $n_\sigma$ in $Q$. Notice that, for any σ-node $n_\sigma$, bottom-up traversal ensures that $\upsilon(n_\sigma)$ is the last traversed node that returns $n_\sigma$, satisfying condition (iii) of Definition 5.

The condition in Line 4 checks whether $n$ is a σ-node. If yes, $C_n$ is created (Line 5) to store all data of $n$ (tag condition, value condition, path and hierarchical relationships) and is assigned to $r_n$. When $n$ is not a σ-node, $r_n$ is assigned to $r_{n_k}$ (Line 7) where $n_k$ is the rightmost positive child of $n$ (see conditions (i) and (ii) of Definition 5). Lines 8–9 are used to add $r_{n_i}$ as a child of $r_n$ in the TET where $n_i$ is a child of $n$ but not $n_k$. Notably, with this condition, $n_i$ is the υ-node of $r_{n_i}$ while $n$ is on the path from $r_n$ to $\upsilon(r_n)$. The positive or negative label of $edge(n, n_i)$ is also recorded in pFlag following Definition 6. Note that the aforementioned procedure is executed once for each node in $Q$. Hence, its time complexity is $O(|N|)$ where $N$ is the node set of $Q$.

Example 4 Consider the not-twig in Fig. 1a. In postorder traversal, the location node is processed first. Since it is a leaf node, $C_{location}$ is created. Next, the name node is processed. Since it is the return node, $C_{name}$ is created. Then, the publisher node is processed. The condition in Line 4 fails and $r_{publisher} \leftarrow C_{name}$. Hence, when $r_{publisher}$ and $r_{location}$ are joined with an edge, it is actually $C_{name}$ and $C_{location}$ that are joined. Note that since $r_{publisher} = r_{name}$, the condition in Line 8 fails ($C_{name}$ is not joined with itself). Finally, the node catalog is processed and $r_{catalog} \leftarrow C_{name}$. $C_{name}$ is then returned as the final result. Note that


the root of the TET always corresponds to the return node $n_r$. The generated TET is shown in Fig. 6a.

Consider now the not-twig in Fig. 1c. The first processed nodes are description and website. $C_{description}$ and $C_{website}$ are created, respectively. Next, the publisher node is processed. It is not a leaf node and has no positive child. Hence, $C_{publisher}$ is created and linked with $C_{website}$ with pFlag set to negative. After that, the title node is processed and $C_{title}$ is created. Then, the book node is processed. It has a positive child and it is the return node of the query. Hence, $C_{book}$ is created and connected with $C_{description}$, $C_{publisher}$ and $C_{title}$. Finally, the node catalog is processed where $r_{catalog} = C_{book}$ and $C_{book}$ is returned by the algorithm (Fig. 6c).

5.4 The generateSQL procedure

Algorithm 3 traverses the TET to construct an SQL query that realizes the TET matching on Andes. Intuitively, the two generic functions select() and join() (see Sect. 5.1) are realized on Andes using relational select and join operations, respectively. Algorithm 3 generates the SQL query for these operations. In particular, Lines 1–26 realize the select() function while Lines 27–35 realize the join() function. In Algorithm 3, we assume that a method add() is available, which encapsulates some simple string manipulations to ensure a well-formed result SQL query.

Line 2 finds the parent node $C_p$ of $C_i$ in the TET. Note that when $C_i$ is the root (i.e., $C_{n_r}$), $C_p$ is empty. Lines 3–6 are used to build the SELECT statement. If $C_i$ is the root node, the built SELECT statement is the outermost SELECT statement to retrieve the final results. Therefore, the DISTINCT condition needs to be enforced to retrieve non-duplicate tuples. Note that only DocId and LeafOrder are retrieved since they are sufficient to uniquely identify an element. On the other hand, if $C_i$ is not the root node, the built SELECT statement is an inner SELECT statement, which is within an IN operator (Lines 29–35). Hence, the DISTINCT condition is not necessary and AncestorDeweyGroup is retrieved to match with the retrieved column in Line 30. Lines 7–10 are used to distinguish elements and attributes. If the matching results are elements, they are retrieved from the PathValue table. Otherwise, they are retrieved from the Attribute table.

Lines 11–35 are used to build the WHERE clause of the SQL query. First, Lines 11–12 add a condition to enforce the value condition of $C_i$ using the LeafValue attribute. Lines 13–28 are used to build the conditions to select the tuples matching $C_i$ and enforce the join condition between $C_i$ and its parent $C_p$ in the TET. Specifically, if $C_i$ is the root node (Lines 13–16), the algorithm selects the tuples matching $C_i$ without considering the join condition as $C_p$ is empty. Line 14 enforces the path condition for element selection. However, since Andes only stores leaf elements, when internal elements are required,


Algorithm 3: Algorithm generateSQL
Input: A condition C_i in the TET M
Input: Structural Summary S
Output: SQL query SQL
1   Initialize SQL;
2   C_p ← parent of C_i in M                      /* C_p is empty if C_i is the root of M */
3   if C_p is empty then
4       SQL.add("SELECT DISTINCT Vi.DocId, Vi.LeafOrder");
5   else
6       SQL.add("SELECT Vp.AncestorDeweyGroup");
7   if C_i.path is an element path then
8       SQL.add("FROM PathValue Vi");
9   else
10      SQL.add("FROM Attribute Vi");
11  if C_i.node.value is not empty then
12      SQL.add("WHERE Vi.LeafValue" + C_i.node.value);
13  if C_p is empty then
14      SQL.add("AND Vi.PathId IN" + getPathIds(getPaths(C_i.pathExpr, S)));
15      if C_i.path is an element path then
16          SQL.add("AND Vi.BranchOrder