Web Warehousing: An Algebra for Web Information

W.-K. Ng

E.-P. Lim

C.-T. Huang

S. Bhowmick

F.-Q. Qin

Centre for Advanced Information Systems, School of Applied Science Nanyang Technological University, Singapore 639798, SINGAPORE fwkn,aseplim,cth,sourav,[email protected]

Abstract

While conventional keyword indexes maintained by web search engines such as Yahoo, Lycos, and World Wide Web Worm work well for most simple keyword searches, they are inadequate when more complex and structured queries involving the underlying hypertext structure of the World Wide Web are desired. Building from a database perspective, existing work to support such queries focuses on constructing SQL-like query languages for the WWW that assume a relational abstraction of the WWW. Nonetheless, the WWW is a directed graph, and imposing a relational abstraction filters out its inherent topological structure. In this paper, we propose a data model for the WWW that retains its topological structure and construct a web algebra to manipulate objects in this model. The web algebra establishes a formal foundation from which different web query languages can be designed.

1 Introduction

From a user's perspective, the World Wide Web is a broadcast medium where a wide range of up-to-date information can be obtained at a low cost. Information on the WWW is important not only to individual users, but also to business organizations, especially when critical decision making is concerned. While most users obtain WWW information using a combination of search engines and browsers, these two types of retrieval mechanisms do not necessarily address all of a user's information needs. This is particularly true in the case of business organizations, which currently lack suitable tools to systematically harness strategic information from the Web that may impact the organization.

The Web Warehousing Project at the School of Applied Science, Nanyang Technological University, Singapore, started in April 1997 with the following key objective: to design and implement a warehousing capability that materializes and manages useful information from the Web so as to support strategic decision making. This research project combines the areas of data warehousing, data mining, information retrieval and the World Wide Web. We aim to build a web warehouse containing strategic information derived from the Web that may also interoperate with conventional data warehouses.

This paper describes our preliminary work in web warehousing. In particular, we propose a data model for formulating web queries and representing web information, and a web algebra for retrieving information from the Web and manipulating this information to derive additional information. The algebra provides a formal foundation for data representation and manipulation for the web warehouse. The next section gives an overview of this aspect of the Web Warehousing project. Sections 3, 4 and 5 describe the data model and web algebra; various examples are given for illustration. In Section 6, we discuss our work with respect to existing work in WWW modeling. The last section concludes the paper.

This work was supported in part by the Nanyang Technological University, Ministry of Education (Singapore) under Academic Research Fund #4-12034-5060, #4-12034-3012, #4-12034-6022. Any opinions, findings, and recommendations in this paper are those of the authors and do not reflect the views of the funding agencies.

2 An Overview

In order to manipulate web information effectively, one must first understand the characteristics of the WWW. Apart from its size, we have identified some important characteristics:

(i) The WWW is a set of directed graphs with web documents corresponding to vertices and hyperlinks to directed edges.
(ii) Web information is unstructured, as opposed to the more structured data found in relational databases.
(iii) Web information is dynamic. As different organizations and institutions put up web information, they may modify it at any time.
(iv) Web information can be discovered by two main mechanisms: browsers and search engines. The former let users surf the Web by navigating through the links among web pages. The latter treat the WWW as a huge document collection and support keyword-based queries on the collection.

The last characteristic reveals a shortcoming of existing search engines. While web browsers fully exploit hyperlinks among web pages, search engines have so far made little progress in exploiting link information. Not only do most search engines fail to support queries on the Web utilizing link information, they also fail to return link information as part of a query's result. Thus, although the WWW is a graph, only vertex-level information is provided by search engines.

Conventional search engine queries may be described as single-node queries: one specifies a set of keywords that are expected to be found in certain web documents. The web algebra we propose supports structured or topological querying: different sets of keywords may be specified on multiple nodes, and additional criteria may be defined for the hyperlinks among the nodes. Thus, the query is a graph-like structure and it is used to match portions of the WWW satisfying the conditions. In this way, a query result is a set of graphs (called web tuples) reflecting the query graph, instead of a set of nodes as in the conventional case.
To meet the warehousing objective, we materialize a query result as a web table with the query graph forming the web schema for the table. We then define a set of web operators with web semantics so as to equip the warehouse with the basic capability to manipulate web tables. These operators include web select, web join, web intersection, web union, and so on. The querying and web operators constitute the primitive operators in the web algebra. They are discussed in more detail in the next few sections.

[Figure 1: Levels of abstraction in a web database. A web database contains web tables; a web table contains web tuples; a web tuple contains nodes and links; each node and link has attributes.]

3 Web Data Model

In this section, we propose a data model for the World Wide Web; this is the model from which an algebra is defined. We introduce the concept of a web schema formally and describe its role in the web warehouse.

3.1 Web Objects

Our data model of the WWW consists of a hierarchy of web objects (Figure 1). We begin with two basic web objects: Nodes and Links. The World Wide Web is a set of directed graphs where nodes correspond to HTML or plain text documents and edges correspond to hyperlinks interconnecting the documents. We define a Node type and a Link type to refer to these two sets of distinct objects:

Node = [url, title, format, size, date, text]
Link = [source_url, target_url, label, link_type]

The Node and Link object types consist of a set of attributes which are labeled simply as `Attributes' in Figure 1. Each attribute may either be atomic or it may itself be an object type consisting of another level of attribute components (not shown in the figure). For simplicity, we assume all atomic attributes to be character strings, i.e., their domain is $\Sigma^{*}$ where $\Sigma$ is the ASCII character set. For the Node object type, the attributes are the URL of a Node instance and its title, document format, size (in bytes), date of last modification, and textual contents. For the Link type, the attributes are the URL of the source document containing the hyperlink, the URL of the target document, the anchor or label of the link, and the type of the link.

The link_type attribute requires further elaboration. Hyperlinks in the WWW may be characterized into three types: interior, local, and global [12, 13]. A link is interior when the target document is the same as the source document, i.e., the link points to another portion of the same document. A link is local when the source and target documents are hosted on the same server machine. A link is global when the source and target documents are hosted on different server machines. Therefore, link_type = {interior, local, global}.

The next higher level of abstraction is a web tuple. A web tuple is a set of connected, directed graphs, each consisting of a set of nodes and links which are instances of Node and Link respectively. A collection of web tuples is called a web table. If the table is materialized, we associate a name with the table. There is a schema (see next section) associated with every web table. A web database consists of a set of web schemas and a set of web tables.

As mentioned earlier, finer gradation in the levels of abstraction is possible. For instance, we may define url = {protocol, host, port_number, path, search_part} (based on RFC 1738) and text = {header, body}, where header and body may be decomposed further. The granularity of the abstraction affects query expressiveness: a more detailed model allows finer bits of information to be retrieved in a query.
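As an illustration of the Node and Link types and of the three link types, the following is a minimal Python sketch. The class layout, the helper names `classify_link` and `make_link`, and the rule that "same server machine" means "same host name in the URL" are assumptions made for this example only; the paper does not prescribe an implementation.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag, urlsplit


@dataclass(frozen=True)
class Node:
    # All atomic attributes are character strings in the model.
    url: str
    title: str
    format: str
    size: str
    date: str
    text: str


@dataclass(frozen=True)
class Link:
    source_url: str
    target_url: str
    label: str
    link_type: str  # one of "interior", "local", "global"


def classify_link(source_url: str, target_url: str) -> str:
    """Classify a hyperlink as interior, local, or global."""
    # Interior: source and target are the same document once the
    # fragment part (#...) is ignored.
    if urldefrag(source_url).url == urldefrag(target_url).url:
        return "interior"
    # Local: different documents on the same server. Assumption: the
    # server is identified by the host name in the URL.
    if urlsplit(source_url).hostname == urlsplit(target_url).hostname:
        return "local"
    return "global"


def make_link(source_url: str, target_url: str, label: str) -> Link:
    """Build a Link instance with its link_type filled in."""
    return Link(source_url, target_url, label,
                classify_link(source_url, target_url))
```

For example, `make_link("http://a.example/p.html", "http://b.example/", "home")` yields a Link whose `link_type` is `"global"`. The host names used here are placeholders.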

3.2 Web Schema

A web schema contains meta-information that binds a set of web tuples in a web table. Web tables are materialized results of web queries. In a web warehouse, a user expresses a web query by describing a query graph. Below is an example:

Example 1 Suppose a fresh graduate student, John, is interested in operating systems research. He would like to know who is working in this area and some of their publications. Although he can perform an `operating system' keyword search using any of the current web search engines, he does not wish to be overwhelmed by the amount of irrelevant results returned by search engines. He wants to focus on academic and research work in operating systems only. In all likelihood, John knows that he can begin his probe with computer science departments worldwide. There are websites that maintain such lists. The following are a few examples:

http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniCompSciDepts.html
http://www.nerdworld.com/users/dstein/nw403.html
http://www.cis.temple.edu/cislabs/hotlist/hotlist cs depts.html

From there, he figures there will be links with anchor labels `people', `faculty', or `research' that might be useful. For instance, the `faculty' link would point to a list of faculty members. From the hyperlinks associated with each faculty member, he can probe further for their research interests or publications in operating systems. John constructs the query graph shown in Figure 2.

When John's query is evaluated, he receives a set of web tuples, each satisfying the query graph. By collecting the tuples as a table, the query graph may be used as the table's schema to bind the tuples. Hence, the web schema of a table is the query graph that is used to derive the table. Formally, a web schema is a 4-tuple $M = \{X_n, X_\ell, C, P\}$ where

(i) $X_n$ is a set of node variables,
(ii) $X_\ell$ is a set of link variables,
(iii) $C$ is a set of connectivities (in DNF),
(iv) $P$ is a set of predicates (in DNF).

[Figure 2: John's query graph. Nodes: x (`computer science departments'), z, u (`operating systems'), w (`operating systems'), y (`operating systems'). Links: e (`people'), f (`faculty'), g (`research'), h (`publications').]

Let us elaborate on each of the schema components. In Figure 2, the boxes and directed lines correspond to web documents and hyperlinks respectively. Observe that some of them have keywords imposed on them. In order to express these conditions, we introduce node and link variables in the query graph. Thus, node x represents those web documents whose texts contain the words `computer science departments'. In other words, variables denote arbitrary instances of Node or Link. There are two special variables: a node variable denoted by the symbol `#' and a link variable denoted by the symbol `-'. These two variables differ from the other variables in that they are never bound (see next section). In Figure 2, the sets of node and link variables are $X_n = \{x, y, z, u, w\}$ and $X_\ell = \{e, f, g, h, -\}$.

Structural properties of web tuples are expressed by a set of connectivities. Formally, a connectivity $k$ is an expression of the form $x\langle\rho\rangle y$ where $x \in X_n$, $y \in X_n$, and $\rho$ is a regular expression over $X_\ell$ (see Appendix A). (The angle brackets around $\rho$ are used for delimitation purposes only.) Thus, $x\langle\rho\rangle y$ describes a path or a set of possible paths between two nodes x and y. For instance, the expression $x\langle (ef|g) \rangle y$ (i.e., $\rho \equiv (ef|g)$) says that there exists either a simple link $g \in X_\ell$, or two links $e \in X_\ell$ followed by $f \in X_\ell$, between x and y. In Figure 2, the connectivities are:

$k_1 \equiv x\langle -ef \rangle z$
$k_2 \equiv z\langle - \rangle u$
$k_3 \equiv z\langle -h \rangle w$
$k_4 \equiv x\langle -g \rangle y$
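To make the path-expression idea concrete, here is a small Python sketch that checks whether a sequence of links conforms to a connectivity's regular expression. It is only an illustration under simplifying assumptions: each link variable is a single character, a path is written as a string of such characters, and the unbound link variable `-' is treated as "any single link". The function name `path_matches` is hypothetical, and the precise regular-expression semantics of the paper's Appendix A may differ.

```python
import re


def path_matches(path_expr: str, link_path: str) -> bool:
    """Check a sequence of link variables against a path expression.

    Each link on the path is one character, so "ef" means link e
    followed by link f. The unbound link variable '-' stands for any
    single link and is translated to the regex wildcard '.'.
    """
    pattern = path_expr.replace("-", ".")
    return re.fullmatch(pattern, link_path) is not None


# The expression (ef|g) accepts the two-link path "ef" or the single link "g".
print(path_matches("(ef|g)", "ef"))
print(path_matches("(ef|g)", "g"))
# The expression -ef accepts any one link followed by e and then f.
print(path_matches("-ef", "aef"))
```

Python's `re` module is used here purely for convenience; any regular-expression matcher over the link alphabet would serve the same purpose.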

The last schema component is the set of predicates $P$. Predicates provide a means to impose additional conditions on web information to be retrieved. Let $p$ be a predicate. If x, y are node or link variables, then the following are possible forms of predicates:

$p(x) \equiv$ [x.attribute CONTAINS "A"], or
$p(x) \equiv$ [x.attribute EQUALS "A"], or
$p(x, y) \equiv$ [x.attribute = y.attribute]

where attribute refers to an attribute of Node, Link or link_type, A is a regular expression over the ASCII character set, and x and y are arguments of $p$. For instance, [x.text CONTAINS "data warehouse"] is a predicate with argument $x \in X_n$ which is true if and only if the web document corresponding to x contains the phrase "data warehouse". Likewise, [e.source_url EQUALS "http://www.yahoo.com"] is a predicate with argument $e \in X_\ell$ which is true if and only if the link is found within Yahoo!'s home page. In Figure 2, the predicates are:

$p_1(x) \equiv$ [x.url EQUALS "http://sunsite.doc..../UniCompSciDepts.html/"]
$p_2(y) \equiv$ [y.text CONTAINS "operating systems"]
$p_3(z) \equiv$ [z.title CONTAINS "faculty"]
$p_4(u) \equiv$ [u.text CONTAINS "operating systems"]
$p_5(w) \equiv$ [w.text CONTAINS "operating systems"]
$q_1(e) \equiv$ [e.label CONTAINS "people"]
$q_2(f) \equiv$ [f.label CONTAINS "faculty"]
$q_3(g) \equiv$ [g.label CONTAINS "research"]
$q_4(h) \equiv$ [h.label CONTAINS "publications"]

[Figure 3: John's query graph after transformation. The query graph of Figure 2 is split into three connected graphs: x, e, f, z, u; x, e, f, z, h, w; and x, g, y.]
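The three predicate forms can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: nodes and links are represented as plain dictionaries, CONTAINS is taken to mean "the regular expression matches somewhere in the attribute", EQUALS is taken to mean "the regular expression matches the whole attribute", and matching is case-sensitive. The function names and sample values are hypothetical.

```python
import re


def contains(obj: dict, attribute: str, pattern: str) -> bool:
    """[x.attribute CONTAINS "A"]: A matches somewhere in the attribute."""
    return re.search(pattern, obj[attribute]) is not None


def equals(obj: dict, attribute: str, pattern: str) -> bool:
    """[x.attribute EQUALS "A"]: A matches the entire attribute value."""
    return re.fullmatch(pattern, obj[attribute]) is not None


def same_attribute(obj_x: dict, obj_y: dict, attribute: str) -> bool:
    """[x.attribute = y.attribute]: both objects carry the same value."""
    return obj_x[attribute] == obj_y[attribute]


# Sample objects, invented for the example.
page = {"url": "http://www.yahoo.com", "text": "notes on data warehouse design"}
link = {"source_url": "http://www.yahoo.com", "label": "people"}

print(contains(page, "text", "data warehouse"))
print(equals(link, "source_url", "http://www.yahoo.com"))
```

Note that because A is a regular expression, characters such as `.` in a URL act as wildcards unless escaped; a stricter implementation might escape literal URLs before matching.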

3.3 Schema Connected Components

Although Figure 2 shows a single connected graph, a query graph may consist of one or more connected graphs in general. For each connected graph, all connectivities and predicates on nodes and links in the graph are true. That is,

$C \equiv k_1 \wedge k_2 \wedge k_3 \wedge k_4$
$P \equiv p_1 \wedge p_2 \wedge p_3 \wedge p_4 \wedge p_5 \wedge q_1 \wedge q_2 \wedge q_3 \wedge q_4$

Allowing more than one connected graph increases the expressive power of a query graph. In Figure 2, it is unlikely that many documents in the Web are interconnected as shown and satisfy those predicates. Consider the connectivities $k_2 \equiv z\langle - \rangle u$ and $k_3 \equiv z\langle -h \rangle w$. They describe two paths that must be satisfied simultaneously. However, the original intention is to get those web pages containing the keywords `operating systems' via $k_2$ or $k_3$. The same applies to $k_1$ and $k_4$. In order to express the query more realistically, we permit logical disjunction for connectivities and predicates. When a query contains logical disjunctions for some of its connectivities and/or predicates, we transform the query graph into a disjunction of connected graphs, each containing conjunctions only. For instance, Figure 2 will be transformed to Figure 3 before it is processed. The resultant connectivity and predicate are:

$C \equiv (k_1 \wedge k_2) \vee (k_1 \wedge k_3) \vee k_4$
$P \equiv (p_1 \wedge p_3 \wedge p_4 \wedge q_1 \wedge q_2) \vee (p_1 \wedge p_3 \wedge p_5 \wedge q_1 \wedge q_2 \wedge q_4) \vee (p_1 \wedge p_2 \wedge q_3)$

Note that both $C$ and $P$ are in disjunctive normal form (DNF).

3.4 Schema Satisfaction

How does a web schema bind a web table? We explore this notion in this section. First, let us introduce some notions. Let $M = \{X_n, X_\ell, C, P\}$ be a web schema. Then,

(i) a variable in $X_n$ or $X_\ell$ is bound when it is the argument of at least one predicate in $P$;
(ii) for a connectivity $k$ in $C$, we say that its leftmost and rightmost variables begin and end $k$ respectively.

For instance, x and y begin and end $x\langle e \rangle y$ respectively, e and f begin $\langle (e|fg) \rangle y$ while y ends it, and so on. A web tuple $w$ satisfies a schema $M$ if and only if the following are true:

(i) For each predicate $p$ in $P$, there exist some nodes or links in $w$ such that they instantiate bound variables in $M$.
(ii) For each connectivity $k$ in $C$, there exist some sequence(s) of nodes and links in $w$ such that each sequence is in the language $L(k)$ (see Appendix A).

A web table satisfies a web schema if and only if each web tuple in the table satisfies the schema. Note that the definition of schema satisfaction is conservative: a web tuple satisfies the schema as long as some nodes and links in the web tuple satisfy the connectivities and predicates of the schema. There is no requirement that all nodes and links be described by the schema. Thus, it is possible for a subset of nodes and links of the web tuple to escape schema binding.

3.5 Remarks

We denote a typical web query as $\varphi_M(W)$ where $M = \{X_n, X_\ell, C, P\}$ is a schema, $W$ is a web table, and $\varphi$ is the derive operator (see next section). Note that the $\varphi$ operator returns a web table, so we may write $W' = \varphi_M(W)$ to denote the fact that the resultant web tuples are materialized in $W'$.

Having introduced the notion of a web table, how does the World Wide Web relate to a web table? Contrary to conventional perception, the WWW is not one huge hypertext, but a collection of hypertexts. So long as there exists a single web page without any inlinks or outlinks, the WWW is not a singleton. If we let each disjoint hypertext be a web tuple, then the collection of web tuples constitutes a special web table denoted as `WWW' (Courier face, upper case). In this paper, we use WWW to refer to the World Wide Web when it appears as an argument of web operators. For instance, when posing a query against the World Wide Web, we write $\varphi_M(\mathtt{WWW})$.

4 Web Operators

With the World Wide Web now transformed into a web table WWW, we are ready to discuss the family of web operators for WWW and its derived web tables. Some preliminary definitions are needed to facilitate our exposition: Let $S$ denote a web-schema data type. Then, $dom(S)$ is the set of all possible web schemas. Similarly, let $T$ denote a web-table data type and $dom(T)$ the set of all possible web tables. Due to space limitations, we can only present an overview of the functions of these operators.

4.1 Web Derive

The derive operator $\varphi : dom(S) \times dom(T) \rightarrow 2^{dom(T)}$ takes in a web table and a schema and extracts a set of sub-web tuples from the table satisfying the schema. This is generally the first operator to be executed in any web query whenever fresh web information is required from the WWW.

Example 2 Suppose we wish to determine a collection of web documents about `java' starting from `http://www.yahoo.com/'. We can use $\varphi_M(\mathtt{WWW})$ where $M = \{X_n, X_\ell, C, P\}$ such that

$X_n = \{x, y\}$
$X_\ell = \{-\}$
$C \equiv x\langle -^{+} \rangle y$
$P \equiv$ [x.url EQUALS "http://www.yahoo.com/"] $\wedge$ [y.text CONTAINS "java"]

As the symbol `-' refers to an unbound link variable, the path expression in the connectivity means one or more links.
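The query of Example 2 can be mimicked on a small in-memory graph. The sketch below is not the authors' evaluation strategy: it assumes the web is given as a dictionary from URL to page text and outgoing links, uses breadth-first search from the bound start URL, caps the path length with a hypothetical `max_links` parameter, and reports each matching document once, via a shortest link path. The toy URLs other than the Yahoo start page are invented.

```python
from collections import deque


def derive_paths(web, start_url, keyword, max_links=3):
    """Return one shortest link path from start_url to each page whose
    text contains keyword, following one or more links."""
    results = []
    visited = {start_url}
    queue = deque([[start_url]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_links:
            continue  # do not extend paths beyond the link budget
        for target in web.get(path[-1], {}).get("links", []):
            if target in visited or target not in web:
                continue
            visited.add(target)
            new_path = path + [target]
            if keyword in web[target]["text"]:
                results.append(new_path)
            queue.append(new_path)
    return results


# A toy web; every page except the start page is hypothetical.
web = {
    "http://www.yahoo.com/": {"text": "directory",
                              "links": ["http://a.example/", "http://b.example/"]},
    "http://a.example/": {"text": "java tutorials", "links": ["http://c.example/"]},
    "http://b.example/": {"text": "cooking", "links": []},
    "http://c.example/": {"text": "more java notes",
                          "links": ["http://www.yahoo.com/"]},
}

for path in derive_paths(web, "http://www.yahoo.com/", "java"):
    print(" -> ".join(path))
```

Each returned path plays the role of a simple sub-web tuple: a start node bound by its URL, a chain of unbound links, and an end node bound by its text predicate.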

Some computability issues arise when applying the $\varphi$ operator on WWW and any web tables derived from WWW. We say that the $\varphi$ operator is bound if and only if all variables that begin a connectivity in the schema specified for the operator are bound. A query which embeds a bound $\varphi$ operator is always computable. Let us see why.

Suppose a web query with schema $M$ is posed against the WWW, i.e., we wish to compute $\varphi_M(\mathtt{WWW})$. Intuitively, the $\varphi$ operator is evaluable when there are starting points in the WWW from which we begin our search. With current web technology, there are two methods to locate a web resource: we either know its URL and access the resource directly, or we go through a web search engine by supplying keywords to obtain URLs. Let x be a node variable; then a predicate such as [x.url EQUALS "a-URL-here"] in a query allows us to use the URL specified to locate the document corresponding to x. The second method is embodied by predicates such as [x.text CONTAINS "some-keywords"], [x.title EQUALS "a-title-here"], and [e.label CONTAINS "some-keywords"]. Here, x and e are the bound variables. When a node or link variable is bound, we can access the web resource it corresponds to either directly or through a web search engine. Variables that begin connectivities and are bound provide the starting point in the WWW for retrieving sub-web tuples. Hence, queries with such variables are computable.

Query computability is generally an issue only when the $\varphi$ operator is applied to the WWW. As mentioned above, the $\varphi$ operator must be bound in order for the query to be computable. However, when applied to a web table that was derived from WWW earlier, the $\varphi$ operator need not be bound for the query to be computable, because we can enumerate all the nodes in a web tuple; each node in the tuple provides a starting point for searching.

4.2 Web Select

Analogous to the select operation in the relational model, we would like the select operation on a web table to extract web tuples from a web table satisfying certain conditions. However, since the schema of web tables is more complex than that of relational tables, selection conditions have to be expressed as predicates on node and link variables, as well as connectivities of web tuples. The web select operation augments the schema of web tables by incorporating new conditions into the schema. Thus, it is different from its relational counterpart.

Before we describe the web select operation, we first give a proper definition of the notion of `selection condition' in a web select operator. Let $W$ be a web table with schema $M = \{X_n, X_\ell, C, P\}$. Selection condition(s) on that table are denoted by another schema $M_s = \{X_{sn}, X_{s\ell}, C_s, P_s\}$ where $C_s$ contains the selection criteria on connectivities, and $P_s$ contains predicates on node and link variables in $X_{sn}$ and $X_{s\ell}$ respectively. In a relational selection, attributes in the selection predicate come from the schema of the relation to be selected. Likewise, there is a need to map node and link variables in $X_{sn}$ and $X_{s\ell}$ to those in $X_n$ and $X_\ell$ in order to apply selection criteria and predicates on the web tuples. We establish this mapping below.

Establishing Variable Equivalence

Determining variable equivalence establishes matching node and link variables in $M$ and $M_s$. This is performed for node variables followed by link variables. Let $x_n$ be a node variable in $M$. We define $pred(x_n)$ to be the set of predicates in $P$ that involve $x_n$ only. Similarly, we define $pred(x_{sn})$ in the same manner for each node variable $x_{sn}$ of $M_s$, using the predicates in $P_s$. The following example illustrates:

Example 3 Suppose $M = \{X_n, X_\ell, C, P\}$ where $X_n = \{x, y, z\}$, $X_\ell = \{e, f\}$, $C \equiv x\langle e \rangle y \wedge x\langle f \rangle z$, and $P \equiv p_1 \wedge p_2 \wedge p_3 \wedge p_4 \wedge p_5$ such that

$p_1(x) \equiv$ [x.url EQUALS "http://www-ccs.cs.umass.edu/db.html/"]
$p_2(y) \equiv$ [y.text CONTAINS "(data warehouse | data mining)"]
$p_3(e) \equiv$ [e.label CONTAINS "(data warehouse | data mining)"]
$p_4(f) \equiv$ [f.label CONTAINS "information retrieval"]
$p_5(z) \equiv$ [z.text CONTAINS "(information retrieval | clustering)"].

Let $M_s = \{X_{sn}, X_{s\ell}, C_s, P_s\}$ where $X_{sn} = \{u, v, w\}$, $X_{s\ell} = \{g, h\}$, $C_s \equiv u\langle g \rangle v \wedge u\langle h \rangle w$, and $P_s \equiv p_6 \wedge p_7 \wedge p_8 \wedge p_9$ such that

$p_6(u) \equiv$ [u.url EQUALS "http://www-ccs.cs.umass.edu/db.html/"]
$p_7(u) \equiv$ [u.text CONTAINS "database group"]
$p_8(v) \equiv$ [v.text CONTAINS "(data warehouse | data mining)"]
$p_9(w) \equiv$ [w.text CONTAINS "clustering"].

Then, we have $pred(x) = \{p_1\}$, $pred(y) = \{p_2\}$, $pred(z) = \{p_5\}$, $pred(u) = \{p_6, p_7\}$, $pred(v) = \{p_8\}$ and $pred(w) = \{p_9\}$.

Formally, establishing variable equivalence consists of two steps defined by functions $\mu_n$ and $\mu_\ell$ respectively:

$\mu_n(M_s, M) = \{(x, y) \mid x \in X_{sn},\ y \in X_n,\ (x.url = y.url) \vee (pred(x) = pred(y))\}$   (1)

$\mu_\ell(M_s, M) = \{(u, v) \mid \exists (x, y) \in \mu_n(M_s, M),\ (z, w) \in \mu_n(M_s, M)\ \text{s.t.}\ u.source\_url = x.url,\ v.source\_url = y.url,\ u.target\_url = z.url,\ v.target\_url = w.url\}$   (2)

$\mu_n(M_s, M)$ is a set of node-variable pairs whereby both variables in a pair refer to the same WWW node (i.e., they have the same URL), or they are the arguments of identical predicates. For a given $(x, y)$ pair, we say that x is semantically equivalent to y, and vice versa. $\mu_\ell(M_s, M)$ is the set of link-variable pairs whose source and target node pairs are semantically equivalent. In Example 3, $\mu_n(M_s, M) = \{(u, x), (v, y)\}$ since u and x have the same URL and v and y are the arguments of the same predicates. We have $\mu_\ell(M_s, M) = \{(g, e)\}$ since the source and target nodes of g and e are semantically equivalent (as determined in $\mu_n(M_s, M)$).

The process of establishing variable equivalence is said to produce a proper matching if $\mu_n(M_s, M)$ contains 1-1 mappings between variables from $M_s$ and $M$. That is, each variable in $M_s$ has exactly one equivalent variable in $M$.
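The two equivalence steps can be traced on Example 3 with a short Python sketch. The encoding is an assumption made for illustration: each predicate is stored as an (attribute, operator, pattern) triple so that predicates on different variables can be compared, a variable's URL is read off its `url EQUALS` predicate, and each link variable is associated with the source and target node variables of the simple connectivity it appears in. The function names `node_equiv` and `link_equiv` are hypothetical.

```python
from itertools import product

URL = "http://www-ccs.cs.umass.edu/db.html/"

# Node predicates of M and Ms from Example 3, keyed by node variable.
P = {
    "x": {("url", "EQUALS", URL)},
    "y": {("text", "CONTAINS", "(data warehouse | data mining)")},
    "z": {("text", "CONTAINS", "(information retrieval | clustering)")},
}
Ps = {
    "u": {("url", "EQUALS", URL), ("text", "CONTAINS", "database group")},
    "v": {("text", "CONTAINS", "(data warehouse | data mining)")},
    "w": {("text", "CONTAINS", "clustering")},
}
# Simple connectivities: link variable -> (source node, target node).
C = {"e": ("x", "y"), "f": ("x", "z")}
Cs = {"g": ("u", "v"), "h": ("u", "w")}


def url_of(preds):
    """Return the URL fixed by a url-EQUALS predicate, if there is one."""
    for attribute, operator, pattern in preds:
        if attribute == "url" and operator == "EQUALS":
            return pattern
    return None


def node_equiv(ps, p):
    """Pairs (xs, x) with the same URL or with identical predicate sets."""
    pairs = set()
    for xs, x in product(ps, p):
        same_url = url_of(ps[xs]) is not None and url_of(ps[xs]) == url_of(p[x])
        if same_url or ps[xs] == p[x]:
            pairs.add((xs, x))
    return pairs


def link_equiv(cs, c, node_pairs):
    """Pairs of links whose source nodes and target nodes are equivalent."""
    return {
        (ls, l)
        for ls, (s1, t1) in cs.items()
        for l, (s2, t2) in c.items()
        if (s1, s2) in node_pairs and (t1, t2) in node_pairs
    }


node_pairs = node_equiv(Ps, P)
link_pairs = link_equiv(Cs, C, node_pairs)
print(sorted(node_pairs))  # [('u', 'x'), ('v', 'y')]
print(sorted(link_pairs))  # [('g', 'e')]
```

The output agrees with the sets stated for Example 3: u pairs with x through the shared URL, v pairs with y through identical predicates, and g pairs with e because both endpoints match.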

Resolving Equivalent Variables

When a proper matching is obtained from establishing variable equivalence, variable resolution can be performed on $M_s$ to replace variables in $M_s$ by their equivalents in $M$. Formally, variable resolution produces a schema $M'_s = \{X'_{sn}, X'_{s\ell}, C'_s, P'_s\}$ where

(i) $X'_{sn} = \{x \mid x \in X_{sn},\ \nexists y \in X_n\ \text{s.t.}\ (x, y) \in \mu_n(M_s, M)\} \cup \{y \mid x \in X_{sn},\ \exists y \in X_n\ \text{s.t.}\ (x, y) \in \mu_n(M_s, M)\}$,
(ii) $X'_{s\ell} = \{u \mid u \in X_{s\ell},\ \nexists v \in X_\ell\ \text{s.t.}\ (u, v) \in \mu_\ell(M_s, M)\} \cup \{v \mid u \in X_{s\ell},\ \exists v \in X_\ell\ \text{s.t.}\ (u, v) \in \mu_\ell(M_s, M)\}$,
(iii) $C'_s \equiv \bigwedge_{c_s \in C_s} c'_s$, where $c'_s$ is obtained from $c_s$ by replacing all $x \in X_{sn}$, $y \in X_{s\ell}$ by their equivalents in $M$,
(iv) $P'_s \equiv \bigwedge_{p_s \in P_s} p'_s$, where $p'_s$ is obtained from $p_s$ by replacing all $x \in X_{sn}$, $y \in X_{s\ell}$ by their equivalents in $M$.

Note that $X'_{sn}$ consists of a set of node variables in $M_s$ that are not equivalent to any variable in $M$ and a set of node variables that are equivalent to some variable in $M$. The same applies to $X'_{s\ell}$. The set of connectivities $C'_s$ is the same as $C_s$ except that equivalent variables appearing in the connectivities are resolved. The same goes for $P'_s$. In Example 3, we have $X'_{sn} = \{x, y, w\}$, $X'_{s\ell} = \{e, h\}$, $C'_s \equiv x\langle e \rangle y \wedge x\langle h \rangle w$, and $P'_s \equiv p_{10} \wedge p_{11} \wedge p_{12} \wedge p_{13}$ where

$p_{10}(x) \equiv$ [x.url EQUALS "http://www-ccs.cs.umass.edu/db.html/"]
$p_{11}(x) \equiv$ [x.text CONTAINS "database group"]
$p_{12}(y) \equiv$ [y.text CONTAINS "(data warehouse | data mining)"]
$p_{13}(w) \equiv$ [w.text CONTAINS "clustering"].

Definition of Web Select

Given a web table $W$ with schema $M$, a web selection $\sigma_{M_s}(W)$ computes a new web table $W'$ with schema $M' = \{X'_n, X'_\ell, C', P'\}$. Let $M'_s = \{X'_{sn}, X'_{s\ell}, C'_s, P'_s\}$ be identical to $M_s = \{X_{sn}, X_{s\ell}, C_s, P_s\}$ except that equivalent variables are resolved. The components of $M'$ are defined as follows:

(i) $X'_n = X_n \cup X'_{sn}$,
(ii) $X'_\ell = X_\ell \cup X'_{s\ell}$,
(iii) $C' \equiv C \wedge C'_s$,
(iv) $P' \equiv P \wedge P'_s$.

Web table $W'$ is the set of tuples in $W$ satisfying $M'$. In Example 3, we have $X'_n = \{x, y, z, w\}$, $X'_\ell = \{e, f, h\}$, $C' \equiv x\langle e \rangle y \wedge x\langle f \rangle z \wedge x\langle h \rangle w$, and $P' \equiv p_1 \wedge p_2 \wedge p_3 \wedge p_4 \wedge p_5 \wedge p_{11} \wedge p_{13}$.
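Variable resolution and the construction of the selected schema can be checked against Example 3 with the following Python sketch. It assumes purely conjunctive schemas, so that a conjunction of conditions can be stored as a set and combined by set union (which also merges duplicates such as the resolved copy of the URL predicate). The dictionary layout, the tuple encodings, and the function names are hypothetical choices for this illustration; the variable mapping is the one stated in the text for Example 3.

```python
URL = "http://www-ccs.cs.umass.edu/db.html/"
DW = "(data warehouse | data mining)"

# Schema M: connectivities are (source, link, target) triples and
# predicates are (variable, attribute, operator, pattern) tuples.
M = {
    "nodes": {"x", "y", "z"},
    "links": {"e", "f"},
    "conn": {("x", "e", "y"), ("x", "f", "z")},
    "preds": {
        ("x", "url", "EQUALS", URL),
        ("y", "text", "CONTAINS", DW),
        ("e", "label", "CONTAINS", DW),
        ("f", "label", "CONTAINS", "information retrieval"),
        ("z", "text", "CONTAINS", "(information retrieval | clustering)"),
    },
}
# Selection schema Ms.
Ms = {
    "nodes": {"u", "v", "w"},
    "links": {"g", "h"},
    "conn": {("u", "g", "v"), ("u", "h", "w")},
    "preds": {
        ("u", "url", "EQUALS", URL),
        ("u", "text", "CONTAINS", "database group"),
        ("v", "text", "CONTAINS", DW),
        ("w", "text", "CONTAINS", "clustering"),
    },
}
# Equivalences established for Example 3: u ~ x, v ~ y, g ~ e.
mapping = {"u": "x", "v": "y", "g": "e"}


def resolve(var, mapping):
    """Replace a variable by its equivalent in M, if it has one."""
    return mapping.get(var, var)


def resolve_schema(ms, mapping):
    """Rewrite every variable occurrence in Ms using the mapping."""
    return {
        "nodes": {resolve(v, mapping) for v in ms["nodes"]},
        "links": {resolve(v, mapping) for v in ms["links"]},
        "conn": {(resolve(s, mapping), resolve(l, mapping), resolve(t, mapping))
                 for s, l, t in ms["conn"]},
        "preds": {(resolve(v, mapping), a, op, pat)
                  for v, a, op, pat in ms["preds"]},
    }


def select_schema(m, ms, mapping):
    """Combine M with the resolved selection schema, component by component."""
    resolved = resolve_schema(ms, mapping)
    return {key: m[key] | resolved[key] for key in m}


M_prime = select_schema(M, Ms, mapping)
print(sorted(M_prime["nodes"]))  # ['w', 'x', 'y', 'z']
print(sorted(M_prime["links"]))  # ['e', 'f', 'h']
print(len(M_prime["preds"]))     # 7
```

The seven remaining predicates correspond to the five predicates of M plus the two resolved predicates that are new, matching the result stated for Example 3.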

4.3 Web Intersect

In performing an intersection of two web tables, we are interested in those web tuples from either of the two tables satisfying both web schemas. This is explained below. Let $W_i$ and $W_j$ be two web tables with schemas $M_i = \{X_{in}, X_{i\ell}, C_i, P_i\}$ and $M_j = \{X_{jn}, X_{j\ell}, C_j, P_j\}$ respectively. Then, $W = W_i \cap W_j$ is a set of web tuples satisfying schema $M = \{X_n, X_\ell, C, P\}$ where

(i) $X_n = X_{in} \uplus X_{jn}$,
(ii) $X_\ell = X_{i\ell} \uplus X_{j\ell}$,
(iii) $C \equiv C_i \wedge C_j$,
(iv) $P \equiv P_i \wedge P_j$.

By definition, a web tuple that is described by a schema satisfies all the connectivities and predicates in the schema. With two schemas, a web tuple that is common to both schemas must satisfy the combined set of connectivities and predicates. This combination is achieved as follows. First, node and link variables in the two schemas must be disambiguated, i.e., they must be made distinct. Assuming that $X_{in} \cap X_{i\ell} = \emptyset$ and $X_{jn} \cap X_{j\ell} = \emptyset$, then for each variable in $X_{jn}$ that is nominally identical to a variable in $X_{in}$, we replace its occurrence in $X_{jn}$, $P_j$ and $C_j$ with another variable that is nominally distinct. We denote this disambiguation process by the $\uplus$ operator. Likewise, we have $X_\ell = X_{i\ell} \uplus X_{j\ell}$. Note that $|X_n| = |X_{in}| + |X_{jn}|$ and $|X_\ell| = |X_{i\ell}| + |X_{j\ell}|$.

Once the variables are disambiguated, the new set of connectivities C is the logical conjunction of Ci and Cj . The new set of predicates P is similarly derived from Pi and Pj .

Example 4 Let $M_1 = \{X_{1n}, X_{1\ell}, C_1, P_1\}$ and $M_2 = \{X_{2n}, X_{2\ell}, C_2, P_2\}$ be the schemas for two tables $W_1$ and $W_2$ respectively. The schema components are defined as follows:

$X_{1n} = \{x, y, \#\}$
$X_{1\ell} = \{e, -\}$
$C_1 \equiv x\langle -^{*} \rangle y \wedge x\langle e \rangle \#$
$P_1 \equiv$ [x.url EQUALS "http://www.yahoo.com/"] $\wedge$ [y.text CONTAINS "java"] $\wedge$ [e.label CONTAINS "activeX"]

$X_{2n} = \{x, \#\}$
$X_{2\ell} = \{e\}$
$C_2 \equiv \#\langle e \rangle x$
$P_2 \equiv$ [x.text CONTAINS "javascript"] $\wedge$ [e.label CONTAINS "java"].

The first schema describes a collection of web documents about `java' starting from URL `http://www.yahoo.com/'. These documents also contain anchors with label `activeX'. The second schema describes a collection of web documents containing anchors with label `java' that point to web documents with `javascript' appearing in the texts. When intersecting $W_1$ and $W_2$, the schemas are combined. Nominally identical variables in the two schemas are renamed as necessary. Therefore, $W_1 \cap W_2$ has the following combined schema:

$X_n = \{x, y, z, \#\}$
$X_\ell = \{e, f, -\}$
$C \equiv x\langle -^{*} \rangle y \wedge x\langle e \rangle \# \wedge \#\langle f \rangle z$
$P \equiv$ [x.url EQUALS "http://www.yahoo.com/"] $\wedge$ [y.text CONTAINS "java"] $\wedge$ [e.label CONTAINS "activeX"] $\wedge$ [z.text CONTAINS "javascript"] $\wedge$ [f.label CONTAINS "java"]
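The disambiguation step can be sketched in Python as below. Two points are assumptions rather than statements from the paper: the special variables `#' and `-' are treated as shared and are never renamed (this is inferred from the combined schema of Example 4), and fresh names are drawn from a caller-supplied pool of candidate letters. The function name `disambiguate` is hypothetical.

```python
SPECIAL = {"#", "-"}  # never-bound variables; assumed shared, not renamed


def disambiguate(xi, xj, pool):
    """Combine two variable sets, renaming clashing variables of xj.

    Returns the combined set and the renaming applied to xj, which
    would also have to be applied to the connectivities and predicates
    of the second schema.
    """
    used = set(xi) | set(xj)
    renaming = {}
    for var in sorted(xj):
        if var in xi and var not in SPECIAL:
            # Pick the first candidate name not used by either schema.
            fresh = next(name for name in pool if name not in used)
            used.add(fresh)
            renaming[var] = fresh
    combined = set(xi) | {renaming.get(v, v) for v in xj}
    return combined, renaming


# Node and link variables of Example 4.
nodes, node_renaming = disambiguate({"x", "y", "#"}, {"x", "#"}, "xyzuvw")
links, link_renaming = disambiguate({"e", "-"}, {"e"}, "efgh")
print(sorted(nodes), node_renaming)
print(sorted(links), link_renaming)
```

With these pools, the second schema's x is renamed to z and its e to f, which reproduces the variable sets of the combined schema in Example 4.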

4.4 Web Union

In performing a union of two web tables, we are interested in those web tuples from either of the two tables satisfying either one or both web schemas. Let $W_i$ and $W_j$ be two web tables with schemas $M_i = \{X_{in}, X_{i\ell}, C_i, P_i\}$ and $M_j = \{X_{jn}, X_{j\ell}, C_j, P_j\}$ respectively. Then, $W = W_i \cup W_j$ is a set of web tuples satisfying schema $M = \{X_n, X_\ell, C, P\}$ where

(i) $X_n = X_{in} \uplus X_{jn}$,
(ii) $X_\ell = X_{i\ell} \uplus X_{j\ell}$,
(iii) $C \equiv C_i \vee C_j$,
(iv) $P \equiv P_i \vee P_j$.

As in web intersection, variables are disambiguated to derive $X_n$ and $X_\ell$. When a web tuple is described by one or both of two schemas, the new set of connectivities $C$ is the logical disjunction of $C_i$ and $C_j$. The new set of predicates $P$ is similarly derived from $P_i$ and $P_j$.

Example 5 Suppose the two tables in Example 4 are unioned. The disambiguated node and link variables will be the same. The new predicates and connectivities are expressed as follows. Let

$p_1(x) \equiv$ [x.url EQUALS "http://www.yahoo.com/"]
$p_2(y) \equiv$ [y.text CONTAINS "java"]
$p_3(e) \equiv$ [e.label CONTAINS "activeX"]
$q_1(z) \equiv$ [z.text CONTAINS "javascript"]
$q_2(f) \equiv$ [f.label CONTAINS "java"]

$k_1 \equiv x\langle -^{*} \rangle y$
$k_2 \equiv x\langle e \rangle \#$
$m_1 \equiv \#\langle f \rangle z$

Then, the new predicate $P$ is derived as follows:

$P \equiv P_1 \vee P_2 = (p_1(x) \wedge p_2(y) \wedge p_3(e)) \vee (q_1(z) \wedge q_2(f))$

Likewise, we have

$C \equiv C_1 \vee C_2 = (k_1 \wedge k_2) \vee m_1$
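Because C and P are kept in disjunctive normal form, the schema combinations used by web intersect and web union can be sketched with two small Python helpers. The representation is an assumption made for this example: a DNF formula is a list of disjuncts, each disjunct a frozenset of condition names that are conjoined. The names `dnf_or` and `dnf_and` are hypothetical.

```python
from itertools import product


def dnf_or(a, b):
    """Disjunction of two DNF formulas: keep the disjuncts of both."""
    return a + b


def dnf_and(a, b):
    """Conjunction of two DNF formulas: distribute over the disjuncts,
    merging every disjunct of a with every disjunct of b."""
    return [ca | cb for ca, cb in product(a, b)]


# Connectivities of Examples 4 and 5 after disambiguation.
C1 = [frozenset({"k1", "k2"})]
C2 = [frozenset({"m1"})]

union_c = dnf_or(C1, C2)       # (k1 AND k2) OR m1, as in web union
intersect_c = dnf_and(C1, C2)  # k1 AND k2 AND m1, as in web intersect
print([sorted(d) for d in union_c])
print([sorted(d) for d in intersect_c])
```

The same two helpers apply unchanged to the predicate component P, and they keep the result in DNF even when the inputs already contain several disjuncts.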

4.5 Web Join

While performing web intersection and union is quite intuitive, doing a web join is not so. In a relational database, two relations which are union-compatible can be joined on a set of common attributes. What are the common attributes for two web tables with different schemas? When are web tables join-compatible? Let $W_i$ and $W_j$ be two web tables with schemas $M_i = \{X_{in}, X_{i\ell}, C_i, P_i\}$ and $M_j = \{X_{jn}, X_{j\ell}, C_j, P_j\}$ respectively. Then, $W_i$ and $W_j$ are join-compatible (or joinable) if and only if there exist identical documents (having the same URLs) between some pair of web tuples from $W_i$ and $W_j$, or these documents are described by some common predicates in $M_i$ and $M_j$.

Let us elaborate further. Let $w_i$ and $w_j$ be two web tuples from $W_i$ and $W_j$ respectively. Suppose a document at URL `http://www.abc.com/' appears in both $w_i$ and $w_j$. Then, $w_i$ and $w_j$ are joinable if there exist predicates [a.url EQUALS "http://www.abc.com/"] and [b.url EQUALS "http://www.abc.com/"] in $P_i$ and $P_j$ respectively. Otherwise, $w_i$ and $w_j$ are still joinable if there exist predicates of the form [e.attribute CONTAINS "some keywords"] in both $P_i$ and $P_j$ (e is any node variable). Therefore, $W = W_i \bowtie W_j$ is a set of web tuples satisfying schema $M = \{X_n, X_\ell, C, P\}$ where $X_n$ is the set of node variables appearing in $P$, $X_\ell$ is the set of link variables appearing in $P$, and $C$ and $P$ are obtained from $M_i$ and $M_j$ through the same variable resolution process as in web select.

Example 6 Web tables $W_1$ and $W_2$ from Example 4 are not joinable because their schemas do not contain identical nodes. Suppose $W_2$ has the following schema instead:

\begin{align*}
X_{2n} &= \{a, b, \#\} \\
X_{2\ell} &= \{e\} \\
C_2 &\equiv a\langle e \rangle b \\
P_2 &\equiv [\text{a.url EQUALS "http://www.yahoo.com/"}] \land [\text{b.text CONTAINS "javascript"}] \land [\text{e.label CONTAINS "java"}]
\end{align*}

Then, $M_1$ and $M_2$ refer to an identical WWW node at URL `http://www.yahoo.com/'. Hence, they are join-compatible. When deriving the new schema, we note that node variables $a \in X_{2n}$ and $x \in X_{1n}$ refer to the same node. After variable resolution, $W_1 \bowtie W_2$ has the following schema:

\begin{align*}
X_n &= \{x, y, b, \#\} \\
X_\ell &= \{e, f, -\} \\
C &\equiv x\langle - \rangle y \land x\langle e \rangle \# \land x\langle f \rangle b \\
P &\equiv [\text{x.url EQUALS "http://www.yahoo.com/"}] \land [\text{y.text CONTAINS "java"}] \land [\text{e.label CONTAINS "activeX"}] \\
  &\quad \land [\text{b.text CONTAINS "javascript"}] \land [\text{f.label CONTAINS "java"}]
\end{align*}

4.6 Summary

The above set of operators on web objects (defined in Section 3) forms a web algebra. Web algebra manipulates web tables as first-class objects: each operator accepts one or two web tables as input and produces a web table as output. A new web schema is produced each time a web operator is applied. (We ignore trivial applications of operators whereby the output web table is the same as the input web table.) The web derive operator projects web tuples from a web table. In particular, portions of the WWW are extracted when it is applied to the WWW. Web intersect, union and join are binary operators on web tables. Web select extracts a subset of web tuples from a web table.

It is interesting to contrast web algebra with relational algebra. In relational algebra, a relation is the first-class object of manipulation by relational operators. The select operator performs row-wise extraction of tuples from a relation. Its counterpart in web algebra is the web select operator, which extracts a subset of web tuples from a web table. The project operator performs column-wise extraction of attributes from a relation. Its counterpart, the web derive operator, projects portions of web tuples from a web table. The other operators in both algebras are quite similar, although their precise semantics are different. Unlike web operators, relational operators do not always produce a relation with a new schema; only certain operators such as project and join result in new schemas.

Another difference between the two algebras is that web operators function at the intension (schema) level while relational operators function at the extension (tuple) level. For instance, when performing a web intersection, we look at the two input schemas and derive a composite schema from them. In a relational intersection, we look at tuples from both relations and decide whether they should go into the output relation. This is the reason why a new schema always results from a web operation.
In conventional relational databases, actual data (tuples) changes more frequently than schemas as a result of standard database operations. Even in relational project and join where new schemas are produced, the new schema is a simple composition of attributes from source schemas. In a web database, both schemas and tuples change per web operation.

5 Web Query Examples

Web algebra is a formal query language for the WWW. In this section, we see examples of its usage to query the WWW. Let $W$ and $M$ be a web table and a schema respectively; then $\varphi_M(W)$ is a web algebraic query. If $Q_i$ and $Q_j$ are web algebraic queries, then so are $\varphi_M(Q_i)$, $\sigma_M(Q_i)$, $Q_i \cup Q_j$, $Q_i \cap Q_j$, and $Q_i \bowtie Q_j$. Since the result of a web algebraic query is a web table, we may use a `web query' in place whenever a web table is called for (e.g., in $\varphi_M(Q_i)$). Examples of valid query expressions include $\varphi_M(\mathrm{WWW})$, $\varphi_M(W_i \cup W_j)$, $\varphi_{M_i}(W_i) \cap \varphi_{M_j}(W_j)$, and $Q_i \bowtie \varphi_M(\mathrm{WWW}) \cap Q_j$. We shall use Example 1 introduced in Section 3.2 to illustrate web algebraic queries in this section.
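The recursive definition of web algebraic queries given above can be sketched as a small expression tree in Python. The class names (`Table`, `Derive`, `Select`, `Binary`) and the textual rendering are assumptions made for this illustration; they are not the syntax of any query language defined in this paper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Table:
    name: str  # a stored web table, or "WWW" itself


@dataclass(frozen=True)
class Derive:
    schema: str
    source: "Query"


@dataclass(frozen=True)
class Select:
    schema: str
    source: "Query"


@dataclass(frozen=True)
class Binary:
    op: str  # "union", "intersect" or "join"
    left: "Query"
    right: "Query"


Query = Union[Table, Derive, Select, Binary]


def render(q: Query) -> str:
    """Write a query as text; every sub-query is itself a valid query."""
    if isinstance(q, Table):
        return q.name
    if isinstance(q, Derive):
        return f"derive[{q.schema}]({render(q.source)})"
    if isinstance(q, Select):
        return f"select[{q.schema}]({render(q.source)})"
    return f"({render(q.left)} {q.op} {render(q.right)})"


# Because a query result is a web table, queries nest freely.
w1 = Derive("M1", Table("WWW"))
refined = Select("M2", w1)
joined = Binary("join", w1, Derive("M3", Table("WWW")))
print(render(refined))
print(render(joined))
```

The nesting works only because the algebra is closed: each constructor takes queries as operands and is itself a query.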

Using the Derive Operator

Using web algebra, John expresses his query as $\varphi_{M_1}(\mathrm{WWW})$ where $M_1 = \{X_{1n}, X_{1\ell}, C_1, P_1\}$ describes his query graph as follows:

\begin{align*}
X_{1n} &= \{x, y, z, u, w\} \\
X_{1\ell} &= \{e, f, g, h, -\} \\
C_1 &\equiv (k_1 \land k_2) \lor (k_1 \land k_3) \lor k_4 \\
P_1 &\equiv (p_1 \land p_3 \land p_4 \land q_1 \land q_2) \lor (p_1 \land p_3 \land p_5 \land q_1 \land q_2 \land q_4) \lor (p_1 \land p_2 \land q_3)
\end{align*}

The connectivities and predicates are defined in Section 3.2. Note that the predicates and connectivities are in disjunctive normal form (DNF) in order to reflect the intention of John's query. Figure 2 shows John's query graph. In this query, John used predicates $p_2$, $p_4$, $p_5$, which are semantically the same. However, their arguments ($y$, $u$, $w$) are different. Three separate predicates are used because $y$, $u$, $w$ may be matched to different WWW nodes during query evaluation.

The result of his query is a set of web tuples instantiating schema $M_1$. One such tuple is this: the page corresponding to $x$ contains the Computer Science Department of the University of Wisconsin. From the CS Department's page (http://www.cs.wisc.edu/), there is a link labeled `People and Organizations' (http://www.cs.wisc.edu/directories/). This link matches the query (see connectivity $k_1$), so it would be taken. From there, another link labeled `Faculty and Research Staff Directory' (http://www.cs.wisc.edu/directories/facstafflist.html) would again be taken. This brings us to a page corresponding to $z$, a directory of faculty members. From $z$, we collect those faculty pages containing the keyword `operating systems'. We also collect faculty pages containing `publications' links that lead to pages with keyword `operating systems'.
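To make the evaluation of a derive query more concrete, the following Python sketch matches a single connectivity of the form x⟨e⟩y against a toy graph. It is a simplified illustration, not the query processor of the project: the pages and links are invented `example.org` data, the function `derive_path` is hypothetical, and CONTAINS is assumed to be a case-insensitive substring test.

```python
from typing import Dict, List, Tuple

Node = str                     # a URL
Link = Tuple[Node, str, Node]  # (source URL, anchor label, target URL)


def derive_path(pages: Dict[Node, str], links: List[Link], start_url: str,
                label_kw: str, text_kw: str) -> List[Tuple[Node, Link, Node]]:
    """Return tuples (x, e, y) such that x.url EQUALS start_url,
    e.label CONTAINS label_kw and y.text CONTAINS text_kw."""
    tuples = []
    for link in links:
        src, label, dst = link
        if (src == start_url
                and label_kw.lower() in label.lower()
                and text_kw.lower() in pages.get(dst, "").lower()):
            tuples.append((src, link, dst))
    return tuples


# Hypothetical toy web: one directory page and two faculty pages.
DEPT = "http://example.org/dept/"
ALICE = "http://example.org/dept/alice.html"
BOB = "http://example.org/dept/bob.html"
pages = {
    DEPT: "Department home page",
    ALICE: "Alice works on operating systems",
    BOB: "Bob works on databases",
}
links = [
    (DEPT, "Faculty: Alice", ALICE),
    (DEPT, "Faculty: Bob", BOB),
    (ALICE, "Home", DEPT),
]

result = derive_path(pages, links, DEPT, "faculty", "operating systems")
for x, e, y in result:
    print(x, "--", e[1], "-->", y)
```

Each returned tuple keeps the link as well as the two nodes, so the matched subgraph, and not only its attribute values, is available to later operators.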

Figure 4: Query graph of M3.

Using the Select Operator

Web tuples from John's query are collected in web table $W_1$. John may use this table to fine-tune his queries in future. For instance, he may wish to identify only those researchers working on `processor scheduling' within the field of operating systems. A good indication of whether someone works on processor scheduling is to look at his/her publications. Thus, John may impose additional constraints on the node corresponding to $w$ in Figure 2. Specifically, the node must also contain the keyword `processor scheduling'. Essentially, John is selecting tuples from $W_1$ that satisfy new constraints. In web algebra, this query is expressed as $\sigma_{M_2}(W_1)$. The selection schema $M_2 = \{X_{2n}, X_{2\ell}, C_2, P_2\}$ is defined as follows:

\begin{align*}
X_{2n} &= \{z, \#\} \\
X_{2\ell} &= \{e\} \\
C_2 &\equiv \#\langle e \rangle z \\
P_2 &\equiv [\text{z.text CONTAINS "operating systems"}] \land [\text{z.text CONTAINS "processor scheduling"}] \land [\text{e.label CONTAINS "publications"}]
\end{align*}

The process of establishing equivalent variables maps $z$ to $w \in X_{1n}$ and $e$ to $h \in X_{1\ell}$. Once the equivalents are resolved, the last two predicates in $P_2$ allow John to impose the above-mentioned constraints.
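A minimal Python sketch of this selection step is given below. It assumes a web tuple can be represented as a dictionary from schema variables to attribute values and that CONTAINS is a substring test; the function `web_select` and the two sample tuples are hypothetical and serve only to illustrate the variable mapping z to w and e to h.

```python
from typing import Dict, List, Tuple

WebTuple = Dict[str, Dict[str, str]]  # variable name -> attribute values
Condition = Tuple[str, str, str]      # (selection variable, attribute, keyword)


def web_select(table: List[WebTuple], mapping: Dict[str, str],
               conditions: List[Condition]) -> List[WebTuple]:
    """Keep the tuples whose bound objects satisfy every CONTAINS condition.

    mapping resolves each variable of the selection schema to the
    equivalent variable of the table's schema."""
    selected = []
    for tup in table:
        if all(kw in tup[mapping[var]][attr] for var, attr, kw in conditions):
            selected.append(tup)
    return selected


# Two hypothetical tuples of W1, showing only the variables w and h.
W1 = [
    {"w": {"text": "papers on operating systems and processor scheduling"},
     "h": {"label": "publications"}},
    {"w": {"text": "papers on operating systems and file systems"},
     "h": {"label": "publications"}},
]
mapping = {"z": "w", "e": "h"}
conditions = [
    ("z", "text", "operating systems"),
    ("z", "text", "processor scheduling"),
    ("e", "label", "publications"),
]

W2 = web_select(W1, mapping, conditions)
print(len(W2))
```

Note that the selection never consults the WWW: it works entirely on tuples already materialized in the web table.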

Using the Join Operator

As John browses through the WWW, he comes across a page containing many links about database researchers (http://www.lpac.ac.uk/SEL-HPC/Articles/GeneratedHtml/UserHome.db.html). Now, he wishes to know if the people he has found earlier who work on operating systems also have an interest in databases. What he can do is to construct a web table $W_3$ of people working on database systems (using $\varphi_{M_3}(\mathrm{WWW})$) and then perform a web join between $W_1$ and $W_3$ (i.e., $W_1 \bowtie W_3$). The query schema for constructing $W_3$ is given as $M_3 = \{X_{3n}, X_{3\ell}, C_3, P_3\}$ where

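\begin{align*}
X_{3n} &= \{x, y\} \\
X_{3\ell} &= \{e, -\} \\
C_3 &\equiv x\langle -- \rangle y \land x\langle -e \rangle z \\
P_3 &\equiv [\text{x.url EQUALS "http://www.lpac.ac.uk/SEL-HPC/.../db.html"}] \land [\text{y.text CONTAINS "database"}] \\
    &\quad \land [\text{z.text CONTAINS "database"}] \land [\text{e.label CONTAINS "publications"}]
\end{align*}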

The query graph is shown in Figure 4. When performing $W_1 \bowtie W_3$, a tuple from $W_1$ and $W_3$ will be joined on common nodes $u$, $w$, $y$ from $M_1$ and $y$, $z$ from $M_3$. As there are two nodes in $M_3$ that can be matched to some nodes in $M_1$, there are only two mappings to be resolved. One possible pair of mappings is shown in Figure 5 (which is transformed into Figure 6).

Figure 5: Query graph of M4.

Here, $y$, $z$ are resolved to $u$, $w$ respectively. Let $W_4$ be the resultant table with schema $M_4$. Then, $M_4 = \{X_{4n}, X_{4\ell}, C_4, P_4\}$ is defined as:

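\begin{align*}
X_{4n} &= X_{1n} \cup \{t\} \\
X_{4\ell} &= X_{1\ell} \cup \{m\} \\
C_4 &\equiv C_1 \land (k_5 \lor k_6) \\
P_4 &\equiv P_1 \land p_6 \land p_7 \land p_8
\end{align*}

The connectivities and predicates come from $M_3$ and are defined as follows:

\begin{align*}
k_5 &\equiv t\langle -- \rangle u \\
k_6 &\equiv t\langle -m \rangle w \\
p_6(t) &\equiv [\text{t.url EQUALS "http://sunsite.doc..../UniCompSciDepts.html/"}] \\
p_7(u) &\equiv [\text{u.text CONTAINS "database"}] \\
p_8(w) &\equiv [\text{w.text CONTAINS "database"}] \\
q_5(m) &\equiv [\text{m.label CONTAINS "publications"}]
\end{align*}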
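The renaming that turns the connectivities and predicates of M3 into k5, k6 and p7, p8, q5 can be sketched in Python as below. The representation (a connectivity as a source node, a tuple of link variables and a target node) and the helper names `resolve_conns` and `resolve_preds` are assumptions made for the illustration. The mapping is read off the example: x, y, z, e of M3 become t, u, w, m of M4.

```python
from typing import Dict, List, Tuple

Pred = Tuple[str, str, str, str]          # (variable, attribute, operator, value)
Conn = Tuple[str, Tuple[str, ...], str]   # (source node, link variables, target node)


def resolve_preds(preds: List[Pred], mapping: Dict[str, str]) -> List[Pred]:
    """Rename the variable of every predicate; unmapped names are kept."""
    return [(mapping.get(v, v), attr, op, val) for v, attr, op, val in preds]


def resolve_conns(conns: List[Conn], mapping: Dict[str, str]) -> List[Conn]:
    """Rename node and link variables; the anonymous link '-' is kept."""
    return [(mapping.get(src, src),
             tuple(mapping.get(l, l) for l in path),
             mapping.get(dst, dst))
            for src, path, dst in conns]


# Connectivities and the keyword predicates of M3 as given in the text.
C3 = [("x", ("-", "-"), "y"), ("x", ("-", "e"), "z")]
P3 = [("y", "text", "CONTAINS", "database"),
      ("z", "text", "CONTAINS", "database"),
      ("e", "label", "CONTAINS", "publications")]

mapping = {"x": "t", "y": "u", "z": "w", "e": "m"}
print(resolve_conns(C3, mapping))
print(resolve_preds(P3, mapping))
```

Renaming x to t avoids a clash with the variable x already used in M1, while y and z are mapped onto existing variables of M1 because they denote the joined nodes.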

6 Related Work

There has been work on query languages for the World Wide Web [5, 9, 11, 13, 12, 16, 10] and hypertext documents [2, 3, 14]. There is also work that examines the computability issues of web queries [1]. Due to space limitations, we compare our work with some of the more closely related work: Mendelzon, Mihaila and Milo's WebSQL [13, 12], Konopnicki and Shmueli's W3QS [9], Fiebig, Weiss and Moerkotte's RAW [5, 16], and Lakshmanan, Sadri and Subramanian's WebLog [10].

Figure 6: Query graph of M4 after transformation.

Mendelzon, Mihaila and Milo proposed a WebSQL query language based on a formal calculus for querying the WWW. The query language allows both content and structure queries. In addition, a cost model based on query locality (i.e., how much of the network must be visited when evaluating a query) has been proposed for evaluating the cost of processing a query. In many aspects, their motivations and objectives are similar to ours: to permit more complex and expressive queries on the WWW. One major difference between their work and ours is that the result of a WebSQL query (a set of web tuples) is immediately flattened once available. This relational abstraction on the WWW has two consequences: the resultant table cannot be used further in a WebSQL query, and structure information of the web tuples is lost permanently. This limits the expressiveness of queries to a certain extent, as complex queries involving operators such as web join are not possible. Direct implementation of a relational abstraction of the WWW exhibits a fundamental irony: while we expect a set of graph structures (i.e., web tuples) from the WWW as the result of a structure query, projecting the structure information away and retaining only the attributes as linear tuples in a relation does not realize the full potential of web queries. In our work, both structure and content are intact as web tuples in a web table. Furthermore, they can be manipulated by web operators to satisfy a new query, or to obtain a relation. Hence, query languages based on web algebra have higher expressiveness and are potentially more powerful, as they permit more complex queries to be constructed.

Konopnicki and Shmueli proposed a high-level querying system called W3QS for the WWW whereby users may specify content and structure queries on the WWW and maintain the results of queries as database views of the WWW.
This is a more comprehensive system with graphical interfaces for locating, filtering and presenting WWW information. We note that there is no formal modeling of the WWW. Clearly, the WWW is a graph with URLs as nodes and hyperlinks as edges. Nonetheless, there are different document formats in the WWW, and it is unclear what attributes of WWW objects are available due to the lack of a formal data model. As a consequence, the foundation of the W3QL query language in W3QS is unclear. For instance, what are the precise definitions of the objects manipulated and the operators used in a query? In a W3QL query, users specify which WWW search engines to use for evaluating the query. Since W3QL is a declarative query language, we think this should be hidden from the user and be automatically embedded in the query processor. Although structure queries are supported, it appears that only path queries are actually implemented, as evidenced by the grammar of the query language, the examples given, and the table storage structure for query answers. A path query is a linear sequence of alternating nodes and links beginning and ending in nodes. This is a simple form of a graph; it is unclear how more general graphs can be expressed in the W3QL language. As stated in their work, the default format for a query result in W3QS is a table. However, this is not a direct relational abstraction of the WWW. From the examples shown, the table accommodates instances of graphs. However, only examples with simple paths are shown, whereby each tuple is a record of alternating node and link attribute values. When more general graphs are to be stored, it is unclear how the graph structure (i.e., web schema) will be represented and associated with the table storing the query results. We note that W3QL queries are always made to the WWW. Past query results are not used in the evaluation of future queries. On the whole, the W3QL language appears to be quite complicated, as users need to specify parameters in the query such as which web search engine to use and which algorithm to use (e.g., breadth-first search to search the WWW for matches of a graph pattern). Again, we feel that these parameters should be automatically handled by the query processing engine.

Fiebig, Weiss and Moerkotte extended relational algebra to the World Wide Web by augmenting the algebra with new domains (data types) and functions that apply to the domains [5, 16].
The extended model is known as RAW (Relational Algebra for the Web). Only two low-level operators on relations, scan and index-scan, have been proposed, to expand a URL address attribute in a relation and to rank results returned by web search engine(s) respectively. Compared to WebSQL and our proposed web algebra, RAW provides only a minor improvement on the existing relational model to accommodate web data.

Inspired by concepts in declarative logic, Lakshmanan, Sadri and Subramanian designed WebLog to be a language for querying and restructuring web information. Conceptually, a web page is modeled as groups of related information called rel-infons. A WebLog query is defined by a logic program containing one or more rules deriving a web page containing some selected rel-infon attributes or hyperlinks, or a relation containing useful URLs. Using WebLog, one can consolidate the result of a query within a web page, and restructure the returned attributes in the web page. However, we make some observations. First, WebLog, being a logic-based query language, is not easy to use. There is no formal definition of web operations such as join, intersection and union of web tables. Hence, the expressive power of WebLog and that of our proposed web algebra are quite different. Second, WebLog does not capture schema information in web tables, thus making it difficult to use WebLog query results in subsequent queries. Third, the evaluation of WebLog queries using the traditional unification and resolution approaches may not be efficient.


7 Conclusions

The World Wide Web is a collection of semi-structured and heterogeneous data that contain invaluable information. Hence, it is logical to extend data warehouses to the World Wide Web so as to construct web warehouses. In this paper, we construct a web data model and a web algebra for web information manipulation that is central to a web warehouse. The proposed data model is unique in its ability to capture the topological structure of the WWW within the schema. A table in the web data model consists of subgraphs of the WWW as web tuples. To manipulate web tables, we have defined a set of operations as part of the web algebra. In contrast with existing query models that flatten web query results into linear tuples, the proposed model retains topological information in the query results. Thus, the web algebra is closed, i.e., each web operator accepts one or two web tables as operands and returns a web table as its result. The unique features of our web data model and algebra allow future queries to be defined on the results returned by earlier queries. Due to space limitations, we have only described an algebra for our web warehouse. Thus far, we have designed an SQL-like web query language based on the web algebra and a query processor for evaluating web queries. These and other aspects of our web warehousing project will be reported in future papers.

A Regular Expression

A regular expression over the alphabet $\Sigma$ is defined as follows: $\epsilon$, the empty string, is a regular expression; if $a \in \Sigma$, then $a$ is a regular expression; if $a$ and $b$ are regular expressions, then so are $a|b$ and $ab$; if $a$ is a regular expression, then so are $a^\star$ and $(a)$; nothing else is a regular expression.

To each regular expression $r$ over the alphabet $\Sigma$, we associate a language $L(r) \subseteq \Sigma^\star$ in the following sense. $L(r)$ is defined recursively by:

\begin{align*}
L(\epsilon) &= \{\epsilon\} \\
L(a) &= \{a\} \quad \text{if } a \in \Sigma \\
L(a|b) &= L(a) \cup L(b) \\
L(ab) &= L(a)L(b) = \{uv : u \in L(a),\ v \in L(b)\} \\
L(a^\star) &= L(a)^\star
\end{align*}

where $L^\star = \bigcup_{i \geq 0} L^i$ is the reflexive and transitive closure of the language $L$.
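The recursive definition of L(r) can be exercised with a short Python sketch that enumerates the strings of L(r) up to a length bound. The tuple encoding of regular expressions and the function name `language` are assumptions made for the illustration, and the sketch treats L(ab) as the concatenation of L(a) and L(b), which is the standard definition.

```python
from typing import Set, Tuple, Union

# "" is epsilon, a one-character string is a symbol of the alphabet;
# ("|", a, b), (".", a, b) and ("*", a) build alternation, concatenation, closure.
Regex = Union[str, Tuple]


def language(r: Regex, max_len: int) -> Set[str]:
    """All strings of L(r) whose length is at most max_len."""
    if isinstance(r, str):  # epsilon or a single symbol
        return {r} if len(r) <= max_len else set()
    op = r[0]
    if op == "|":           # L(a|b) is the union of L(a) and L(b)
        return language(r[1], max_len) | language(r[2], max_len)
    if op == ".":           # L(ab) is the concatenation of L(a) and L(b)
        left, right = language(r[1], max_len), language(r[2], max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if op == "*":           # L(a*) is the union of all powers of L(a)
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:     # stops once no new string fits the length bound
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown operator: {op!r}")


# (a | bc)* restricted to strings of length at most 3.
r = ("*", ("|", "a", (".", "b", "c")))
print(sorted(language(r, 3)))
```

The length bound is needed only because L(a*) is infinite in general; the recursion itself follows the five cases of the definition one for one.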

References

[1] S. Abiteboul, V. Vianu. Queries and Computation on the Web. Proceedings of the 6th International Conference on Database Theory, Greece, 1997.
[2] C. Beeri, Y. Kornatzky. A Logical Query Language for Hypertext Systems. Proceedings of the European Conference on Hypertext, pp. 67-80, Cambridge University Press, 1990.
[3] M. P. Consens, A. O. Mendelzon. Expressing Structural Hypertext Queries in Graphlog. Hypertext, pp. 269-292, 1989.
[4] C. J. Date. A Formal Definition of the Relational Model. Chapter 7 of Relational Database: Selected Writings, Addison-Wesley Publishing Company, 1986.
[5] T. Fiebig, J. Weiss, G. Moerkotte. RAW: A Relational Algebra for the Web. Workshop on Management of Semistructured Data (PODS/SIGMOD'97), Tucson, Arizona, May 16, 1997.
[6] R. H. Güting. GraphDB: Modeling and Querying Graphs in Databases. Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, pp. 297-308, 1994.
[7] R. H. Güting, R. Zicari, D. M. Choy. An Algebra for Structured Office Documents. ACM Transactions on Information Systems, Vol. 7, No. 2, pp. 123-157, 1989.
[8] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo. Extracting Semistructured Information from the Web. Workshop on Management of Semistructured Data (PODS/SIGMOD'97), Tucson, Arizona, May 16, 1997.
[9] D. Konopnicki, O. Shmueli. W3QS: A Query System for the World Wide Web. Proceedings of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland, 1995.
[10] L. V. S. Lakshmanan, F. Sadri, I. N. Subramanian. A Declarative Language for Querying and Restructuring the Web. Proceedings of the Sixth International Workshop on Research Issues in Data Engineering, February, 1996.
[11] E.-P. Lim, W. K. Ng. A Relational Interface for Heterogeneous Information Sources. Proceedings of IEEE International Conference on Advances in Digital Libraries (ADL'97), Library of Congress, Washington, D.C., May 7-9, 1997.
[12] A. O. Mendelzon, G. A. Mihaila, T. Milo. Querying the World Wide Web. Proceedings of the International Conference on Parallel and Distributed Information Systems (PDIS'96), Miami, Florida, 1996.
[13] G. A. Mihaila. WebSQL - A SQL-like Query Language for the World Wide Web. Master's Thesis, Department of Computer Science, University of Toronto, 1996.
[14] T. Minohara, R. Watanabe. Queries on Structure in Hypertext. Foundations of Data Organization and Algorithms (FODO'93), pp. 394-411, Springer-Verlag, 1993.
[15] E. Sandewall. Towards a World-Wide Data Base. Proceedings of the Fifth International World Wide Web Conference, Paris, France, May 6-10, 1996.
[16] J. Weiss. Entwurf und Implementierung Einer Algebra für das World-Wide Web. Diplomarbeit, Fakultät für Informatik, Universität Mannheim, Lehrstuhl Praktische Informatik III, February 1997.
