Efficient Processing of XPath Queries Using Indexes

Yan Chen1, Sanjay Madria1, Kalpdrum Passi2, and Sourav Bhowmick3

1 Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65409, USA [email protected]
2 Dept. of Math. & Computer Science, Laurentian University, Sudbury ON P3E 2C6 Canada [email protected]
3 School of Computer Engineering, Nanyang Technological University, Singapore [email protected]

Abstract. A number of query languages have been proposed in recent times for processing queries on XML and semistructured data. All these query languages use regular path expressions to query XML data, and a number of indexing schemes have recently been proposed to optimize the processing of query paths. XPath provides the basis for processing queries on XML data in the form of regular path expressions. In this paper, we propose two algorithms, the Entry-point algorithm and the Rest-tree algorithm, that exploit the different types of indexes we define to process XPath queries efficiently. We also discuss and compare two variations in implementing these algorithms: Root-first and Bottom-first.

1 Introduction

With the wide acceptance of XML as a common format to store and exchange data, it is imperative to query XML data. Several query languages have been proposed to query semistructured data, such as XQuery[4], XML-QL[7], XML-GL[3], Lorel[1], and Quilt[5]. XPath[6] is a language that describes the syntax for addressing path expressions over XML data. To improve the performance of queries on large XML files, it is essential to employ indexes on XML data. Indexing techniques used in relational and object-oriented databases do not suffice for XML data because of its semistructured nature.

In this paper, we introduce three types of indexes (name index, value index and path index) to improve the performance of XPath queries on XML data. We propose two algorithms, the Entry-point algorithm and the Rest-tree algorithm, to process XPath queries efficiently using the proposed indexes, and we present a performance evaluation of the algorithms. We present simulation results showing that XPath queries on large XML data execute faster using different types of indexes with the proposed Entry-point algorithm than with traditional methods of querying with or without indexing. The traditional methods comprise those that use no index and evaluate a query by traversing the complete XML DOM tree, and those that use indexes but do not exploit the index information of the ancestor nodes.

1 Partially supported by UM Research Board Grant and Intelligent Systems Center.
2 Partially supported by NSERC grant 228127-01 and an internal LURF grant.

R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 721–730, 2002.  Springer-Verlag Berlin Heidelberg 2002


1.1 Related Work

Semistructured data such as XML do not conform to a rigid, predefined schema and have an irregular structure. Indexing techniques in relational or object-oriented databases depend on a fixed schema based on a known, strongly typed class hierarchy. Therefore, such techniques are not directly applicable to XML data. Several indexing schemes have been proposed for semistructured data [10,11], including dataguides [8], 1-indexes, 2-indexes, and T-indexes [12], ToXin[13] and XISS[9]. Dataguides record information on the existing paths in a database; however, they do not provide any information on parent-child relationships between nodes in the database and, as a result, cannot be used for navigation from an arbitrary node. T-indexes are specialized path indexes, which summarize only a limited class of paths. The 1-index and 2-index are special cases of T-indexes. In our approach, we add path information to every node so that the parent-child relationship can be traced for every node.

In the Lore system, four different types of index structures have been proposed [10]: value, text, link, and path indexes. The value index and text index are used to search for objects that have specific values; the link index and path index provide fast access to the parents of an object and to all objects reachable via a given labeled path. Lore uses OEM (Object Exchange Model) to store data and OQL (Object Query Language) as its query language.

ToXin has two different types of index structure: the value index and the path index. The path index has two parts, the index tree and the instance functions, and these functions can be used to trace the parent-child relationship. The ToXin path index contains only parent and children information, whereas in our model we store the complete path from the root to each node. ToXin uses an index for a single level, while we use multiple indexes for different levels. Our proposal also considers different types of indexes.

2 Indexing XML Data

Consider the XML file given in Fig. 1, which holds information about a bookstore containing, say, 100,000 books. The DOM tree for the XML fragment in Fig. 1 is shown in Fig. 2. Suppose we need to retrieve all the books with author name “Chris” from the Benny-bookstore (a simple query that is often executed in information retrieval systems). Without any optimization technique, we need to find all the nodes in the DOM tree labeled BOOK and then test the author's name for each BOOK. After about 100,000 comparisons we get a couple of books with author “Chris” as the output.

By using an index on AUTHOR, we do not need to test the author of each BOOK node. With the index key “Chris”, we can find all the matching author nodes faster (e.g., only two such nodes). The nodes obtained can then be checked against the query condition. The execution time can be reduced considerably by using the index. This is a “bottom-up” query plan. Such a plan is useful when we have a relatively “small” result set at the bottom, which can be pre-selected.

However, suppose the query is to find all the books whose name begins with “glory” and whose author is “Chris”, and assume that “Chris” is such a famous author that she has more than 5,000 books in the store. The query plan could then be to get all the books named “glory” regardless of their authors. If only a small number of books satisfy this constraint (e.g., four “glory” books), it is useful to introduce another type of index, which is built on the values of some nodes. Here, we need an index on strings. Starting from the nodes obtained in the first step, we can then test the other condition of the query. Hence, we can build a set of nodes as the “entry set”, which depends on the specific query and on the type of XML data.

[Figure: XML fragment of the Benny-bookstore; the markup was lost in extraction. Surviving ISBN and author values: 1-1-1 David, 1-1-2 Chris, 1-1-3 Chris, 1-1-4 Michael, 1-1-5 Jason, 1-1-6 Tomas.]

Fig. 1. An XML Fragment

[Figure: DOM tree rooted at &1 [BOOKSTORE: Benny-bookstore]. The root has six BOOK children (&2, &3, &4, &13, &16, &19) with the titles “Brave the New World”, “Glory days”, “I love the game”, “What lies beneath”, “Matrix II” and “The Root”. Each BOOK has an ISBN child and an AUTHOR child (&7 to &12, &14, &15, &17, &18, &20, &21), with ISBN values 1-1-1 to 1-1-6 and authors David, Chris, Chris, Michael, Jason and Tomas.]

Fig. 2. Simple DOM Tree of Benny-bookstore

2.1 Types of Indexes

We now describe the types of indexes that can be built over an XML file. We then describe query plan operations that use the indexing structures. To speed up query processing on an XML file, we build three different types of index structures. The first type identifies objects that have specific values; the next two are used to efficiently retrieve the objects in an XML (DOM) tree. In XML Structures [2], each node has an ID. We exploit this attribute and establish a relationship between node ID and storage address to speed up our search.

Name-index (Nindex)
A name index locates nodes by their tag names. Using this index, we group nodes that have the same tag name. The Nindex for the tag BOOK over the XML fragment in Fig. 2 is {&2, &3, &4, &13, &16, &19}.

Value-index (Vindex)
A value index locates nodes with a given value. XML classifies different data types; therefore, we do not need type coercion [10]. A Vindex can be built selectively over basic types, such as numbers and strings, or over other types. The Value-index for the word “Chris” is {&10, &12}; for the word “the” it is {&2, &4}. Clearly, not every value is worth indexing; the index on the word “the” seems to be redundant. The administrator or the user can decide which values would be useful as a value index. The Nindex and Vindex can also be grouped together, as we need the name or type of nodes to restrict the value index. We can integrate these two indexes to facilitate and accelerate the query. In the simulation program, we abstract the features of the Nindex and Vindex into a “type” index, which could be replaced by a specific index in an actual environment.

Path-index (Pindex)
A path index locates nodes by their path from the root node. To execute a query, after we get the first selected set of nodes, we need to test the nodes to see whether they satisfy the input expression. To make the testing efficient, it is helpful to record the path from the root to the current test node. This may be an extra attribute added to each node. The path index is the information we attach to each node to record the path of its ancestors. This information is also very useful in our algorithm. As in some of the traditional algorithms [10, 8], if we do not execute the query from top to bottom, we need to get the ancestors of certain nodes at the middle level. Since we need this information for each node, an index is useful. In Fig. 2, the ancestral path information of &11 is {&1, &4}; the ancestral path information of node &7 is {&1, &2}.

Definition 1: Descent Number (DN)
The Descent Number is the information we attach to every node to record the number of its descents. The descents of a node include all the arcs going out of that node and reaching the leaf nodes. This information is used to select one index node set among several index node sets. The usage of the DN is elaborated in our algorithms. In Fig. 2, the DN of node &11 is 0; the DN of node &3 is 2.
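As a rough illustration of the three indexes and the descent number, the following Python sketch builds them for part of the tree in Fig. 2. The `Node` class, the helper names `build_indexes` and `book`, the whitespace tokenization of values, and the pairing of ISBN values with node IDs are assumptions made for illustration; the paper does not prescribe a storage layout. Only the nodes whose IDs are fixed by the text (&1 to &4 and &7 to &12) are included.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    nid: str                      # node ID such as "&3"
    tag: str                      # tag name such as "BOOK"
    value: str = ""               # text value attached to the node
    children: list = field(default_factory=list)


def build_indexes(root):
    """Return (nindex, vindex, pindex, dn) for the tree under root."""
    nindex = defaultdict(set)     # tag name -> node IDs
    vindex = defaultdict(set)     # word occurring in a value -> node IDs
    pindex = {}                   # node ID -> ancestor IDs, root first
    dn = {}                       # node ID -> number of descendants

    def visit(node, ancestors):
        nindex[node.tag].add(node.nid)
        for word in node.value.split():
            vindex[word].add(node.nid)
        pindex[node.nid] = list(ancestors)
        count = 0
        for child in node.children:
            # one arc to the child plus every arc below it
            count += 1 + visit(child, ancestors + [node.nid])
        dn[node.nid] = count
        return count

    visit(root, [])
    return nindex, vindex, pindex, dn


def book(nid, title, isbn_id, isbn, author_id, author):
    """Hypothetical helper: a BOOK node with one ISBN and one AUTHOR child."""
    return Node(nid, "BOOK", title, [
        Node(isbn_id, "ISBN", isbn),
        Node(author_id, "AUTHOR", author),
    ])


root = Node("&1", "BOOKSTORE", "Benny-bookstore", [
    book("&2", "Brave the New World", "&7", "1-1-1", "&8", "David"),
    book("&3", "Glory days", "&9", "1-1-2", "&10", "Chris"),
    book("&4", "I love the game", "&11", "1-1-3", "&12", "Chris"),
])
nindex, vindex, pindex, dn = build_indexes(root)

print(sorted(vindex["Chris"]))              # ['&10', '&12']
print(sorted(vindex["the"]))                # ['&2', '&4']
print(pindex["&11"], dn["&11"], dn["&3"])   # ['&1', '&4'] 0 2
```

On this subset the sketch reproduces the values quoted in the text for the Vindex entries, the ancestral paths, and the descent numbers.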


3 Entry-Point Algorithm

XPath is a language for addressing parts of an XML document. XPath also provides basic constructs for the manipulation of strings, numbers and Boolean data. XPath operates on the abstract, logical structure of an XML document, rather than on its surface syntax, and uses path notation for navigating through the hierarchical structure of an XML document. A query written in any of the query languages, such as XQuery[4], XML-QL[7], XML-GL[3], Lorel[1], and Quilt[5], is easily transformed into an XPath expression. If we need to retrieve a relatively small part of the data from a large XML file under certain constraints expressed using XPath, it is expensive to compare each node with the given search conditions.

Assuming that the various types of indexes defined in Section 2 (Nindex, Vindex and Pindex) have been created, we give two techniques to process and optimize the XPath expression generated from a given query. In the first technique we find an entry-point node among a set of middle level nodes in the XPath expression. We then split the XPath expression at the entry point, test the path condition for the first part, and eliminate the nodes of the DOM tree that do not satisfy the path condition. We then test the remaining part of the XPath expression, recursively eliminating nodes that do not satisfy the path condition. The algorithm can be implemented using either a root-first approach or a bottom-up approach.

We explain the technique using the example in Fig. 1 before formally giving the algorithm. Suppose we are given the following query:

Select BOOKSTORE/BOOK
Where BOOK.title = “Glory days” and /AUTHOR.name = “Chris” and BOOKSTORE.name = “Benny-bookstore”

The above query is transformed to the following XPath expression:

/BOOKSTORE[@name = “Benny-bookstore”]/child::BOOK[@title = “Glory days”]/child::AUTHOR/child::AUTHOR[@name = “Chris”]

The different types of indexes enable us to obtain specific node sets. For example, we can use the Nindex to get all BOOK nodes or all AUTHOR nodes. There are two ways to approach the query to accelerate its execution. We can either get all books named “Glory days” and then test whether the author of each one is “Chris”, or first get all authors named “Chris” and then test whether the book name of each parent node is “Glory days”. In the first strategy, we evaluate the former part of the XPath expression first, that is,

/BOOKSTORE[@name = “Benny-bookstore”]/child::BOOK[@title = “Glory days”]

Then, we test each author child node against the latter part of the XPath expression,

/child::AUTHOR/child::AUTHOR[@name = “Chris”]

In terms of the XML DOM tree, we find entry-point nodes among the “middle level” nodes using the index, test the path information of these nodes to check whether they match the first part of the XPath expression, and eliminate the entry-point nodes that do not. Then, we take the remaining entry-point nodes as the root nodes of a set of sub trees and recursively eliminate nodes from the sub trees to get the final nodes. The other, bottom-up strategy is similar, except that the entry-point nodes constitute the final result node set.
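The two strategies can be compared on a toy data set. The sketch below is only an illustration under simplifying assumptions: the three books are flattened into a dictionary, and the names `BOOKS`, `index_by`, `title_first` and `author_first` are hypothetical. Each function returns the result set together with the size of the entry set it had to test, which is the quantity that decides which strategy is cheaper.

```python
# Hypothetical flat encoding of three books from Fig. 2: book ID -> (title, author).
BOOKS = {
    "&2": ("Brave the New World", "David"),
    "&3": ("Glory days", "Chris"),
    "&4": ("I love the game", "Chris"),
}


def index_by(position):
    """Build a value index on the title (position 0) or author (position 1)."""
    index = {}
    for book_id, record in BOOKS.items():
        index.setdefault(record[position], set()).add(book_id)
    return index


TITLE_INDEX = index_by(0)
AUTHOR_INDEX = index_by(1)


def title_first(title, author):
    """Enter through the title index, then test the author of each hit."""
    entry = TITLE_INDEX.get(title, set())
    return {b for b in entry if BOOKS[b][1] == author}, len(entry)


def author_first(title, author):
    """Enter through the author index, then test the title of each hit."""
    entry = AUTHOR_INDEX.get(author, set())
    return {b for b in entry if BOOKS[b][0] == title}, len(entry)


print(title_first("Glory days", "Chris"))    # ({'&3'}, 1)
print(author_first("Glory days", "Chris"))   # ({'&3'}, 2)
```

Both strategies return the same book, but the title-first plan tests one entry node while the author-first plan tests two; on a large store this difference is what the choice of entry point exploits.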


3.1 Entry-Point Algorithms

We now present two versions of the Entry-point Algorithm: the Root-first Algorithm and the Bottom-first Algorithm.

INPUT: XPath expression root/X1/X2/…/Xi/…/Xm
STEP 1: FOR each Xi
        BEGIN
          IF Xi is indexed THEN
          BEGIN
            get every node xi of type Xi
            get the DN ni of each xi
            Sumi = ∑ni
          END
        END
STEP 2: get entry point Xn with minimum Sum, add all xn to a node set S;
        consider the tree obtained after deleting all branches that do not have the node xn in its path;
        split the XPath into root/X1/X2/…/Xn-1 and /Xn+1/…/Xm by the entry point Xn
STEP 3: FOR each node xn in S
        BEGIN
          IF the path starting from root to node xn is not included in the path root/X1/X2/…/Xn-1/Xn THEN
            delete the sub tree that does not satisfy the path condition
        END
STEP 4: FOR each node xn in S, consider all sub trees starting with xn
        BEGIN
          IF Xn+1/…/Xm is same as /Xm THEN
            return nodes Xm
          ELSE
            INPUT = Xn/Xn+1/…/Xm
            GO TO STEP 1
        END

Algorithm 1.1 Entry-point Root-first

INPUT: XPath expression root/X1/X2/…/Xi/…/Xm
STEP 1: FOR each Xi
        BEGIN
          IF Xi is indexed THEN
          BEGIN
            get every node xi of type Xi
            get the DN ni of each xi
            Sumi = ∑ni;
          END
        END
STEP 2: get entry point Xn with minimum Sum, add all xn to a node set S;
        consider the tree obtained after deleting all branches that do not have the node xn in its path;
        split the XPath into root/X1/X2/…/Xn-1 and /Xn+1/…/Xm by the entry point Xn
STEP 3: FOR each node xn in S
        BEGIN
          IF the path starting from node Xn to a leaf node is not included in the path Xn/Xn+1/…/Xm THEN
            delete the sub tree that does not satisfy the path condition
        END
STEP 4: FOR each node xn in S, consider all sub trees starting with xn as leaf nodes
        BEGIN
          IF root/…/Xn is same as /Xn THEN
            return nodes Xm
          ELSE
            INPUT = root/X1/X2/…/Xn
            GO TO STEP 1
        END

Algorithm 1.2 Entry-point Bottom-first
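The Root-first variant can be sketched in Python as follows. This is a simplified reading of the steps, not the authors' implementation: it handles only child steps without predicates, it builds the name index by a traversal instead of reading a stored index, it computes descent numbers on demand, and it finishes Step 4 by plain navigation rather than by recursing on the remaining expression. The class `Node` and all function names are hypothetical, and the tree is a made-up example.

```python
class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def descent_number(node):
    """Number of arcs below the node (0 for a leaf)."""
    return sum(1 + descent_number(c) for c in node.children)


def path_tags(node):
    """Tags on the path from the root down to the node (a stand-in for the Pindex)."""
    tags = []
    while node is not None:
        tags.append(node.tag)
        node = node.parent
    return tags[::-1]


def all_nodes(root):
    yield root
    for child in root.children:
        yield from all_nodes(child)


def choose_entry_point(root, steps, indexed):
    """Steps 1 and 2: pick the indexed step whose nodes have the smallest DN sum."""
    nindex = {}
    for node in all_nodes(root):
        nindex.setdefault(node.tag, []).append(node)
    sums = {i: sum(descent_number(n) for n in nindex.get(tag, []))
            for i, tag in enumerate(steps) if tag in indexed}
    pos = min(sums, key=sums.get)       # raises ValueError if no step is indexed
    return pos, nindex.get(steps[pos], [])


def evaluate_root_first(root, steps, indexed):
    pos, candidates = choose_entry_point(root, steps, indexed)
    # Step 3: keep entry nodes whose root path matches the first part of the expression.
    survivors = [n for n in candidates if path_tags(n) == steps[:pos + 1]]
    # Step 4 (simplified): follow the remaining child steps below the survivors.
    result = survivors
    for tag in steps[pos + 1:]:
        result = [c for n in result for c in n.children if c.tag == tag]
    return result


tree = Node("A", [
    Node("B", [
        Node("C", [Node("E", [Node("H"), Node("F")])]),
        Node("D", [Node("E", [Node("H")])]),
    ]),
    Node("B", [Node("C", [Node("E", [Node("H")])])]),
])
steps = ["A", "B", "C", "E", "H"]
result = evaluate_root_first(tree, steps, indexed={"B", "E"})
print(len(result))   # 2
```

In this toy tree the DN sum is 10 for the B nodes and 4 for the E nodes, so E is chosen as the entry point; the E node under D fails the path test, and the two remaining E nodes each contribute one H node.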

Next, we illustrate the Entry-point Root-first Algorithm.


Example: Let the XPath expression to be evaluated be A/B/C/E//H. Thus, we need to retrieve the three nodes marked by circles in Fig. 3. Assume that indexes have been built on nodes B and E. In Step 1, we find the entry point E by calculating the descent numbers of the nodes that have indexes. In Fig. 3, the descents of nodes B and E are shown in brackets. The descent numbers for nodes B and E sum to 31 (14 + 17) and 18 (8 + 4 + 6) respectively. Thus, node E becomes the entry point. After applying Step 2, we obtain the DOM tree shown in Fig. 4. Note that the total DN of the entry-level nodes at a higher level can be smaller than the total DN of the entry-level nodes at a lower level of the DOM tree. This happens when there is a large number of entry-level nodes at the lower level compared with the number of entry-level nodes at the higher level. In Step 3, we test A/B/C/E on each E node and discard the rightmost sub tree with node E. The tree obtained is shown in Fig. 5. In Step 4, we evaluate E//H on each E and finally get the three H nodes.

[Figure: DOM tree rooted at A. The indexed nodes carry their descent numbers in brackets: B (14), B (17), E (8), E (4), E (6).]

Fig. 3. Entry-point Algorithm

[Figure: the tree of Fig. 3 after Step 2.]

Fig. 4. Entry-point Algorithm (Cont.)

[Figure: the tree after Step 3.]

Fig. 5. Entry-point Algorithm (Cont.)

It is easy to see that the Entry-point algorithm also works with regular path expressions. Given an XPath expression of the form root/X1/X2/…/Xi/…/Xm in which some step Xi = *, we can check in Step 3 of the algorithm whether the path of the entry-point node xn matches the given regular path root/X1/X2/*/…/Xn-1/Xn by using string matching and containment techniques. However, the “containment” relationship needs further study.
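A minimal sketch of such a string-matching check is given below. It assumes that `*` stands for exactly one step, as in XPath, and that the path of a node is available as a slash-separated string of tag names; the function name `path_matches` is hypothetical, and containment between two patterns is not addressed.

```python
def path_matches(path, pattern):
    """True if path and pattern have the same length and agree step by step,
    where a pattern step of '*' accepts any tag."""
    path_steps = path.strip("/").split("/")
    pattern_steps = pattern.strip("/").split("/")
    if len(path_steps) != len(pattern_steps):
        return False
    return all(p == "*" or p == s for s, p in zip(path_steps, pattern_steps))


print(path_matches("root/A/B/C", "root/A/*/C"))   # True
print(path_matches("root/A/B/D", "root/A/*/C"))   # False
print(path_matches("root/A/B/C", "root/*/C"))     # False
```

The last call fails because the wildcard covers a single step only; a descendant step such as `//` would need a different matching rule.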

4 Rest-Tree Conception

In the Entry-point Algorithm we find an entry point among a number of middle level nodes and split the XPath expression to eliminate the nodes that do not satisfy the condition. Under certain conditions the Entry-point Algorithm may not perform well. For instance, in the BOOKSTORE database shown in Fig. 1, suppose we want to find the books written by “David” whose title contains the word “book”. The XML file might have hundreds of books with the word “book” in the title, and there might also be a large number of books by author “David”, but only one of them has the word “book” in its title. The Entry-point algorithm first eliminates all the nodes that do not have the word “book” in their title. Then it eliminates the nodes that do not have “David” as the author. Similarly, in the Entry-point Bottom-first approach, the nodes that do not have “David” as the author name are eliminated first, and then the parent nodes that do not have the word “book” in their title are eliminated. In this case, because of the relatively large number of instances at the two levels, a large number of eliminations are required. We refine the Entry-point Algorithm to handle such cases more efficiently by selecting two or more entry-level points among the middle level nodes and eliminating nodes based on those entry-level nodes.

Definition 2. Rest-Tree
The tree formed by the nodes that meet a certain condition at their level, along with their descendant and ancestor nodes. For example, in Fig. 2, the Rest-tree of the node that satisfies the condition of having the word “glory” in its title is formed by &1, &3, &9, and &10.

In the first step we employ the Entry-point Algorithm to find all nodes that meet the condition statements at each level. The final result is then the intersection of the Rest-trees of these nodes. In practice, we do not need to find the Rest-tree of every node satisfying the condition. Since we are left with only a small set of nodes after applying the Entry-point Algorithm, we need to find the Rest-trees of a relatively small set of nodes within a small sub tree. To get the intersection of the Rest-trees, we note that the level whose nodes satisfy the query condition and have the minimum number of descendants is available from the Entry-point Algorithm. This minimum level is the anchor level of the Rest-tree algorithm. We just need to intersect the Rest-trees at this minimum level.

For example, suppose that after the first step of the Entry-point algorithm we know that there are 2000 nodes at Level A that meet condition A, 1000 nodes at Level B that meet condition B, 200 nodes at Level C, 3000 at Level D, and 400 at Level E. The minimum level is Level C and the ordering of the levels is C→E→B→A→D. As the ancestor node information is available in the path index, we can filter some nodes at Level C by checking the grandparent node information of the 400 nodes at Level E. Similarly, we can filter some other nodes at Level C by checking their parent node information against the nodes at Level B. The intersection at Level C is completed by checking the ancestor information of the Level D nodes. The final step is to get all the nodes that satisfy the query requirement.

4.1 Rest-Tree Algorithm

The Entry-point algorithm finds the entry point and eliminates nodes at the entry level by checking ancestor level conditions, followed by checking descendant level conditions. However, as stated earlier, under certain conditions it might be better to check the descendant level nodes first and eliminate the nodes that do not satisfy the XPath expression. We can apply the elimination step starting from either the descendant level or the ancestor level. It is better to begin with the descendant level when the descendant condition eliminates more nodes than the ancestor level condition. Descendant level nodes have ancestor node information in the path index. Using this information, we can eliminate entry-level nodes that do not satisfy the path condition of the descendant nodes.


If the entry-level nodes are not in the set of ancestor nodes of valid descendant nodes, they are eliminated. Similarly, if the valid ancestor nodes are not in the path information of entry-level nodes, those entry-level nodes are eliminated. By comparing the number of entry-level nodes that are eliminated in the Root-first and Bottom-first approaches, we can decide which level condition is better for optimizing the search. This can be implemented by testing whether the number of valid descendant nodes is less than the number of valid ancestor nodes; if so, we select the descendant way, and otherwise the ancestor way. We now present the Rest-tree algorithm.

INPUT: XPath expression root/X1/X2/…/Xi/…/Xm
STEP 1: FOR each Xi
        BEGIN
          IF Xi is indexed THEN
          BEGIN
            get every node xi of type Xi
            get the DN ni of each xi
            Sumi = ∑ni;
          END
        END
STEP 2: get entry point Xj with minimum Sum, add all xj to a node set Sj;
        get comparison point Xk with second minimum Sum, add all xk to a node set Sk;
STEP 3: IF level j > k THEN
        BEGIN
          FOR each node xk in Sk
            IF its ancestor is not in Sj THEN delete xk from Sk
          Si = Sk
        END
        ELSE
          FOR each node xj in Sj
            IF its ancestor is not in Sk THEN delete xj from Sj
STEP 4: FOR each node xj in Sj
        BEGIN
          IF the path starting from root to node xj is not included in the path root/X1/X2/…/Xj THEN
            delete the sub tree that does not satisfy the path condition
        END
STEP 5: FOR each node xj in Sj, consider all sub trees starting with xj
        BEGIN
          IF Xj+1/…/Xm is same as /Xm THEN
            return nodes Xm
          ELSE
            INPUT = Xj/Xj+1/…/Xm
            GO TO STEP 1;
        END

Algorithm 2. Rest-tree Algorithm
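The level ordering and the elimination of entry-level nodes can be sketched as follows. The sketch follows the prose description of the elimination rule rather than the pseudocode literally, and it covers only the choice of levels and the ancestor check (Steps 2 and 3). The function names, the dictionary layout of the path index, and the small bookstore query are assumptions made for illustration.

```python
def order_levels(counts):
    """Sort level names by their number of candidate nodes, smallest first."""
    return sorted(counts, key=counts.get)


def filter_entry_nodes(entry, comparison, pindex, comparison_is_deeper):
    """Keep the entry-level nodes that are consistent with the comparison level.

    entry, comparison: sets of node IDs that satisfy the condition of their level.
    pindex: node ID -> list of ancestor IDs (root first), as in the path index.
    """
    if comparison_is_deeper:
        # An entry node survives if some valid descendant lists it as an ancestor.
        reachable = set()
        for nid in comparison:
            reachable.update(pindex[nid])
        return entry & reachable
    # Otherwise an entry node survives if one of its own ancestors is valid.
    return {nid for nid in entry if comparison & set(pindex[nid])}


# Candidate counts from the example in Section 4.
counts = {"A": 2000, "B": 1000, "C": 200, "D": 3000, "E": 400}
print(order_levels(counts))   # ['C', 'E', 'B', 'A', 'D']

# Hypothetical query on Fig. 2: books with "the" in the title written by "Chris".
pindex = {"&2": ["&1"], "&4": ["&1"], "&10": ["&1", "&3"], "&12": ["&1", "&4"]}
books_with_the = {"&2", "&4"}        # Vindex entry for "the"
authors_chris = {"&10", "&12"}       # Vindex entry for "Chris"

print(filter_entry_nodes(books_with_the, authors_chris, pindex, True))    # {'&4'}
print(filter_entry_nodes(authors_chris, books_with_the, pindex, False))   # {'&12'}
```

Both directions agree on the same book: filtering the BOOK level by its valid descendants leaves &4, and filtering the AUTHOR level by its valid ancestors leaves &12, the author child of &4.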


5 Conclusions

In this paper, we have proposed three types of indexes on XML data to execute XPath queries efficiently. We proposed two algorithms that use these indexes to optimize the processing of XPath queries. We have simulated both the Root-first and Bottom-first approaches and have observed that processing an XPath query using our Entry-point indexing technique performs much better than traditional algorithms with or without indexes (results not reported here due to space constraints).

References

1. Abiteboul, S., Quass, D., McHugh, J., Widom, J., Wiener, J.L.: The Lorel Query Language for Semistructured Data. International Journal on Digital Libraries 1(1) (April 1997), pp. 68-88.
2. Biron, P.V., Malhotra, A. (eds.): XML Schema Part 2: Datatypes. W3C Recommendation (May 2, 2001), http://www.w3.org/TR/xmlschema-2/.
3. Ceri, S., Comai, S., Damiani, E., Fraternali, P., Paraboschi, S., Tanca, L.: XML-GL: A Graphical Language for Querying and Restructuring XML Documents. In: Proceedings of the 8th International World Wide Web Conference, Toronto, Canada (May 1999), pp. 93-109.
4. Chamberlin, D., Clark, J., Florescu, D., Robie, J., Siméon, J., Stefanescu, M.: XQuery 1.0: An XML Query Language. W3C Working Draft (20 December 2001).
5. Chamberlin, D., Robie, J., Florescu, D.: Quilt: An XML Query Language for Heterogeneous Data Sources. In: International Workshop on the Web and Databases (WebDB), Dallas, TX (May 2000).
6. Berglund, A., Boag, S., Chamberlin, D., Fernandez, M.F., Kay, M., Robie, J., Siméon, J.: XML Path Language (XPath) 2.0. W3C Working Draft (20 December 2001).
7. Deutsch, A., Fernandez, M., Florescu, D., Levy, A., Suciu, D.: A Query Language for XML. In: Proceedings of the 8th International World Wide Web Conference, Toronto, Canada (May 1999), pp. 77-91.
8. Goldman, R., Widom, J.: DataGuides: Enabling Query Formulation and Optimization in Semistructured Database. In: Proceedings of the Twenty-third International Conference on Very Large Data Bases, Athens, Greece (August 1997), pp. 436-445.
9. Li, Q., Moon, B.: Indexing and Querying XML Data for Regular Path Expressions. In: Proceedings of the 27th VLDB Conference, Roma, Italy (2001).
10. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., Widom, J.: Lore: A Database Management System for Semi-structured Data. SIGMOD Record 26(3) (1997), pp. 54-66.
11. McHugh, J., Widom, J., Abiteboul, S., Luo, Q., Rajaraman, A.: Indexing Semistructured Data. Technical Report, Stanford University, Stanford, CA (February 1998).
12. Milo, T., Suciu, D.: Index Structures for Path Expressions. In: Proceedings of the International Conference on Database Theory (1999), pp. 277-295.
13. Rizzolo, F., Mendelzon, A.: Indexing XML Data with ToXin. In: Fourth International Workshop on the Web and Databases (in conjunction with ACM SIGMOD 2001), Santa Barbara, CA (May 2001).