A Logic Query Language and its Algebraic Optimization for a Multiprocessor Database Machine^1
Maurice A.W. Houtsma^2, Hendricus J.A. van Kuijk^2, Jan Flokstra^2, Peter M.G. Apers^2, Martin L. Kersten^3
Memorandum INF-88-52, December 1988

^1 The work reported in this document was conducted in the PRISMA project, a joint effort with Philips Research Laboratories Eindhoven, partially supported by the Dutch "Stimuleringsprojectteam Informaticaonderzoek Nederland (SPIN)".
^2 University of Twente, Computer Science Department, P.O. Box 217, 7500 AE Enschede, the Netherlands.
^3 Centre for Mathematics and Computer Science, Kruislaan 413, Amsterdam, the Netherlands.

Abstract

A logic query language, called PRISMAlog, is introduced. The language is one of the interfaces of a multiprocessor, main-memory database machine called PRISMA. It is a language with a purely declarative semantics; the meaning of a program is given by its least fixed-point. Besides allowing recursive rules, PRISMAlog supports operations such as negation, arithmetic, aggregates, and group-by. Optimization of PRISMAlog programs is completely algebraic, and focuses on the use of distributed database techniques to introduce parallelism. The optimization criterion is minimization of response time. Techniques used to optimize PRISMAlog programs and to produce parallel schedules are illustrated.

Chapter 1

Introduction

In the PRISMA project one of the main research issues is the development of a multiprocessor, main-memory database machine. The research focuses on the use of distributed database design techniques to achieve a high degree of parallelism, which is used to improve query response time. Besides a powerful relational database machine, we would like to offer a powerful query language as well. Therefore, we have developed a logic database query language, called PRISMAlog. The choice for a pure logic language was made for several reasons, such as:

• It offers a high level of abstraction, allowing specification of rules and complex views, thereby offering possibilities for support of reasoning.
• It allows for recursive rules.
• It is a good basis for extensions, such as complex objects.
• It has a clear semantics, but allows for different control structures specifying its execution model, for instance, tuple-oriented resolution or set-oriented database compilation.
• It is amenable to parallel processing.

Logic-based query languages for database systems have been chosen in several research projects, such as [12,24,28]. We introduce our approach by comparing it with some of the predominant approaches.

To extend the expressive power of a database, research has been conducted on coupling a logic language, like Prolog, to an existing database system [10,18]. Because of the close fit between a Horn clause-like language and a relational database, this coupling is conceptually simple. However, Prolog has a sequential execution model, which heavily influences the efficiency of this coupling. It leads to single-tuple calls to the database system, and to a nested-loop join employed by the Prolog program to combine tuples in different relations. The relational database system itself, in contrast, works in a set-oriented way and can, for example, choose from a variety of join strategies, making use of all its knowledge about indexes, sorting, etc. Moreover, Prolog contains some non-logical features, like the cut operator, which turn it into a rather imperative language with a strict sequential execution model. This makes it more difficult to write programs, and heavily influences its amenability to parallel processing.

The aforementioned drawbacks have triggered research on developing a better logic language. For instance, LDL is a logic-based data language that has a non-sequential execution model. It uses compilation techniques to achieve a set-oriented model of computation, and to support some extra features [28]. The design of LDL is based on pure Horn clause logic; it contains sets as primitive data objects and it supports complex terms. The development of LDL takes place in the context of pure logic. Its semantics is described by means of a term model, which is not completely satisfactory because this description leads to a very complex and extensive model. For example, capturing the semantic properties of the equality relation (=) is a very tedious, extensive task [6]. Evaluable predicates (e.g. arithmetic functions) are not included in the language, and neither is typing. A number of possible extensions are being investigated [7,20].

In PRISMAlog, we start from a pure Horn clause language too (sometimes called Datalog [24]). We have restricted PRISMAlog to be a pure data retrieval language. In this way, we can supply a simple semantics for the language.
Data definition primitives and updates are not yet well understood in the context of logic, and would thus lead to a very complex semantics. Therefore, data definition and updates have to take place through the SQL interface of the PRISMA database machine. Because we noticed that a language such as SQL offers a number of retrieval capabilities not encountered in logic languages, we have sought ways to incorporate these in PRISMAlog as well. This is done for arithmetic functions, group-by, and aggregate operations.

PRISMAlog is based on a strict set-oriented model of computation. The meaning of a program is given by its least fixed-point, and therefore the order of the rules, and of the predicates within a rule, is of no importance for its meaning. Hence, PRISMAlog is a purely declarative language. Actually, the semantics of PRISMAlog is defined in terms of an eXtended Relational Algebra (XRA). This allows a complete algebraic optimization of a PRISMAlog program, and gives the relational query optimizer ample opportunities to employ any of the known optimization strategies. The translation of recursive rules uses a fixed-point operator, and a rewriting strategy is employed by the query optimizer to rewrite a recursive program into a number of relational expressions and transitive closure operations. This rewriting strategy itself is described in more detail in [1,2,15]. In this document we show how the query optimizer handles recursive programs. We also briefly discuss the transitive closure operation and the use of parallelism to compute it.

The remainder of the document is structured as follows: in Ch. 2 we discuss the PRISMA machine and the architecture of the database system, in Ch. 3 we describe PRISMAlog in full detail, and finally in Ch. 4 we describe the optimization of PRISMAlog programs and the production of parallel schedules.


Chapter 2

The PRISMA Database Machine

The PaRallel Inference and Storage MAchine (PRISMA) is a highly parallel machine for data and knowledge processing. The PRISMA machine is designed to support both a main-memory database system and an expert system shell. In this chapter we focus on the architecture, as far as it is relevant for the database machine, and on the design of the database system. More information about the PRISMA database machine can be found in [3,19].

The PRISMA database machine consists of 64 nodes. Each node is composed of a data processor, a communication processor, and 16 Mbyte of memory. The nodes are connected by a high-speed network. The machine is built from commercially available, state-of-the-art hardware.

Most database management systems are organized as a set of tightly coupled programs. The requirement of good performance often results in systems that are coupled more tightly than is advisable for good software maintainability. Because the theory of distributed database systems has passed its infancy, we claim that it has become possible to apply distributed techniques within a single database management system as well. According to this philosophy, which we have adopted for the PRISMA database system, a traditional database management system is viewed as a tightly coupled distributed system.

This approach is realized in PRISMA by fragmenting the relations, and letting each fragment be managed by its own local database management system. Such a local database management system is called a One-Fragment Manager (OFM). It contains all functionality normally encountered in a full-blown DBMS. These OFMs are regarded as the unit of parallelism in the database machine. Parallelism is obtained by having a number of OFMs working simultaneously on several nodes. This shows that mainly coarse-grain parallelism is considered.

The design supports two major ways of improving the performance of a database system. First, the query of a user is transformed into a parallel schedule. This leads to several OFMs working simultaneously to process the query, and minimizes response time. Second, each user has a private instantiation of a parser, a query optimizer, and a transaction manager. Hence, concurrent users work in parallel on the machine, provided the consistency of the database allows it.

The PRISMA database machine has two different user interfaces. An SQL interface is included to support existing applications and gateways to other systems. A logic programming interface, PRISMAlog, is supported to allow for complex query formulation including recursion. In the next chapter, PRISMAlog is discussed in full detail.
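The fragment-based parallelism described in this chapter can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the PRISMA implementation: the function names (`fragment`, `select_fragment`, `parallel_select`), the hash-based fragmentation, and the use of a thread pool to stand in for OFMs on separate nodes are all hypothetical choices made here. The sketch shows the shape of a parallel schedule: every fragment is processed independently, and the partial results are combined by a union.

```python
from concurrent.futures import ThreadPoolExecutor


def fragment(relation, n_fragments, key_index=0):
    """Hash-partition a relation (a set of tuples) into disjoint fragments.

    Hypothetical fragmentation scheme; the text does not prescribe one.
    """
    fragments = [set() for _ in range(n_fragments)]
    for row in relation:
        fragments[hash(row[key_index]) % n_fragments].add(row)
    return fragments


def select_fragment(frag, predicate):
    """Work done by one (simulated) One-Fragment Manager on its own fragment."""
    return {row for row in frag if predicate(row)}


def parallel_select(fragments, predicate, max_workers=4):
    """Run the selection on all fragments concurrently and union the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(lambda f: select_fragment(f, predicate), fragments)
        return set().union(*partials)
```

Python threads do not give true CPU parallelism, so the sketch conveys only the structure of the schedule (independent work per fragment, followed by a union), not its performance.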


Chapter 3

The logic query language PRISMAlog

A logic query language like PRISMAlog allows for the formulation of complex queries, and therefore a powerful database machine is required to obtain an acceptable performance. To use the available parallelism in the database machine, it should be a language with a set-oriented model of computation, which does not contain extra-logical features (such as the Prolog cut) that affect this model. The reason is that in this way we can leave all the issues regarding `inference' and program execution up to the relational query optimizer. Additional constructs, which do not affect the set-oriented model of computation and are supported by the relational database, can be integrated in the PRISMAlog language. The approach taken in the design of PRISMAlog is the following:

• It is based on pure Horn clause logic; no arbitrary functions are allowed.
• It has a purely declarative semantics; the meaning of a program is given by its least fixed-point.
• It is a data retrieval language; there are no data definition or update predicates.
• It uses a derived typing mechanism.

Because a program is based on pure Horn clause logic, and its meaning is given by its least fixed-point, the sequential order of execution of the rules in a program has been removed. Also, the evaluation order of the subgoals within a rule is not determined by their syntactic ordering. This relieves the programmer of the burden of controlling the efficient execution of a program, and it means the programmer does not have to worry about non-termination of a program caused by an improper order of the rules or subgoals (cf. Prolog, where programs can be written that never terminate because of such an improper order).

We have not introduced any data definition constructs in PRISMAlog, as is done, for instance, in LDL [28]. For a language to contain data definition constructs it should, in our opinion, at least contain typing, and the meaning of data definition constructs and their consequences should be perfectly clear (which is true for a relational environment). Unfortunately, the concept of an update (be it at the schema level or at the tuple level) is not yet well understood in the context of logic languages. Instead, we rely on the SQL interface to the PRISMA database machine for the definition and maintenance of relations.

Since relations are defined through an SQL interface, we can use the data type of the attributes as an extra check on the correctness of a PRISMAlog program. Whenever we use a relation in a PRISMAlog program, we can thus determine the type of the variables from the relation definition in the data dictionary.

We will now describe the PRISMAlog language, its extra constructs, and its meaning (which is given by a translation into eXtended Relational Algebra).

3.1 Simple Horn Clauses

PRISMAlog resembles other logic languages, such as Prolog, in its syntax. Its alphabet is composed of constants, variables, predicates, and the boolean connective & (`logical and'). We adopt the Prolog convention that variables are denoted by identifiers starting with an uppercase character, predicates are denoted in lowercase characters, and constants are denoted in lowercase characters or between quotes. A PRISMAlog program consists of a set of Horn clauses. The three types of Horn clauses that can occur are facts, which are unit clauses [23], rules, and queries, which are definite goals that start with a question mark.

An example of a PRISMAlog program, which derives all employees, with their salary, who work either for the accounting department or for the sales department, is:

    acc_or_sales(Enr, Sal) ← employee(Enr, Name, accounting, Sal).
    acc_or_sales(Enr, Sal) ← employee(Enr, Name, sales, Sal).
    ? acc_or_sales(X, Y).

In this program we suppose a relation EMPLOYEE with attributes employee number, name, department, and salary to exist in the database, and we solve the program w.r.t. the actual database extension. This amounts to a proof of satisfiability in the model-theoretic sense [13], where the database extension is viewed as an interpretation and the query as a formula to be evaluated on this interpretation.

On the PRISMA database machine, however, we do not take a logic approach in solving a query. Instead, a query is translated into an eXtended Relational Algebra (XRA) expression. This expression is regarded as the meaning of a query, and its result is the least fixed-point of the query. The meaning of the above-mentioned program would thus be given by the following XRA expression, in which attributes are referred to by position:

$$\pi_{1,4}(\sigma_{3=\mathrm{accounting}}\,\mathrm{EMPLOYEE}) \;\cup\; \pi_{1,4}(\sigma_{3=\mathrm{sales}}\,\mathrm{EMPLOYEE})$$

As can be seen, every predicate definition leads to one XRA expression, and when there are several clauses that define the same predicate, a union of their corresponding XRA expressions is taken. By considering the meaning of a program in this way, it is guaranteed that the order of the predicates in a rule does not influence the meaning, or execution, of the program; the same is true for the order of the rules. The translation of non-recursive Horn clauses into Relational Algebra is straightforward [9]. Effectively, every predicate definition (which can be composed of several PRISMAlog rules) is translated into a view definition in XRA. The translation of recursive rules is not completely straightforward, as is explained in the next section.
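As a minimal sketch of this translation, the following Python fragment evaluates the acc_or_sales predicate as a union of one relational expression per clause. The helper names `select` and `project`, the representation of a relation as a set of tuples, and the sample tuples are assumptions made for illustration only; attribute positions are counted from 1, with department as the third attribute of EMPLOYEE in the order given above.

```python
def select(relation, position, value):
    """Selection: keep tuples whose attribute at 1-based `position` equals `value`."""
    return {row for row in relation if row[position - 1] == value}


def project(relation, positions):
    """Projection: keep only the attributes at the given 1-based positions."""
    return {tuple(row[p - 1] for p in positions) for row in relation}


# Made-up database extension: (employee number, name, department, salary).
EMPLOYEE = {
    (1, "jansen", "accounting", 3000),
    (2, "de vries", "sales", 2800),
    (3, "bakker", "research", 3500),
}


def acc_or_sales(employee):
    """One expression per clause; clauses for the same predicate are unioned."""
    clause_1 = project(select(employee, 3, "accounting"), (1, 4))
    clause_2 = project(select(employee, 3, "sales"), (1, 4))
    return clause_1 | clause_2
```

Because set union is commutative, swapping `clause_1` and `clause_2` gives the same result, which mirrors the statement that the order of the rules does not influence the meaning of the program.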

3.2 Recursive Rules

One important feature of logic languages is the possibility to specify recursive rules. Let us present an example in which we assume a relation CONN in our database, with attributes departure city and arrival city. This relation represents all direct connections between two cities that can be made by train. The rules that define all possible connections are specified as follows:

    ind_conn(Dep, Arr) ← ind_conn(Z, Arr) & conn(Dep, Z).
    ind_conn(Dep, Arr) ← conn(Dep, Arr).

This example clearly shows that the order of specification of the rules is irrelevant. As a Prolog program it would never terminate because of the order of the rules; as a PRISMAlog program it poses no problem. The least fixed-point of the program defines its meaning and no inference strategy is imposed by the language.

Since we cannot specify recursion in Relational Algebra, we have extended Relational Algebra with a fixed-point operator, called a μ-calculus expression. This concept is borrowed from theoretical computer science, where it is used for describing the semantics of sequential programs [27]. The translation of our example would now be as follows:

$$\mathrm{IND\_CONN} = \mu X\,[\,\mathrm{CONN} \cup \pi_{1,4}(\mathrm{CONN} \bowtie_{2=1} X)\,]$$

where IND_CONN has two attributes, just like the relation CONN, denoting departure and arrival city. The meaning of this expression is obtained by iterating over the variable $X$. First, the empty relation ($\emptyset$) is substituted for $X$ and the result of the expression is computed. Then, this result is substituted for $X$ and the expression is computed again, and so on. Because all operations are monotone (negation is not allowed) and the database is finite, it is guaranteed that the least fixed-point of an expression exists. Therefore, the result of the expression will, at a certain iteration step, become stable; no new tuples are generated beyond this iteration step. This process is shown in Table 3.1, where $\mathrm{CONN}^i$ denotes $\pi_{1,4}(\mathrm{CONN} \bowtie_{2=1} \mathrm{CONN}^{i-1})$, $\mathrm{CONN}^1 = \mathrm{CONN}$, and $\mathrm{CONN}^0$ is the identity relation. The projection is necessary to make the result of the expression union compatible with the starting relation (here CONN), so that it can be substituted without any problems.

Note that Table 3.1 gives the meaning of the expression. It does not imply that the result of a μ-calculus expression is obtained by such a computation.
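The iteration that gives the expression its meaning can be sketched in Python as follows. This is a naive fixed-point loop written purely to illustrate the semantics; as the text stresses, it is not how the result is actually computed. The function names and the representation of a relation as a set of (departure, arrival) pairs are assumptions of this sketch.

```python
def join_project(conn, x):
    """Join CONN with X on CONN.arrival = X.departure, then keep
    (CONN.departure, X.arrival), i.e. attributes 1 and 4 of the join."""
    arrivals_by_departure = {}
    for dep, arr in x:
        arrivals_by_departure.setdefault(dep, set()).add(arr)
    return {
        (dep, arr)
        for dep, via in conn
        for arr in arrivals_by_departure.get(via, ())
    }


def least_fixed_point(conn):
    """Iterate X := CONN union join_project(CONN, X), starting from the
    empty relation, until no new tuples are generated.

    Returns the stable relation and the number of iterations performed.
    """
    x = set()
    iterations = 0
    while True:
        new_x = conn | join_project(conn, x)
        iterations += 1
        if new_x == x:
            return x, iterations
        x = new_x
```

The loop terminates on any finite relation, including cyclic ones, because each step can only add tuples drawn from a finite set of city pairs.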
In practice, we use a rewriting algorithm for regular recursive queries [1,15], and an iterative parallel strategy for non-regular recursive queries [2]. The terms regular and non-regular stem from formal languages, and mean that there exists a corresponding regular or context-free (non-regular) grammar [14]. Note that in the context of logic languages one sometimes uses the term regular in a different way, to indicate what we shall call linear recursion. Linear recursion means there is precisely one recursive rule for each predicate, and the only recursive predicate allowed in the body of a rule is the one corresponding to the rule that is being defined.

    iteration   variable                                result
    1           $\emptyset$                             $\mathrm{CONN}$
    2           $\mathrm{CONN}$                         $\mathrm{CONN} \cup \pi_{1,4}(\mathrm{CONN} \bowtie_{2=1} \mathrm{CONN})$
    ...         ...                                     ...
    n           $\bigcup_{i=1}^{n-1} \mathrm{CONN}^i$   $\bigcup_{i=1}^{n} \mathrm{CONN}^i$

    Table 3.1: Meaning of μ-calculus expression

We will now extend our example to make a system of regular, mutually recursive rules. Assume a relation TRAIN in our database with attributes departure city, arrival city, departure time, and arrival time. There also exists a relation BUS with the same schema, and a relation CHANGE with an attribute city, which indicates that one can change from bus to train and vice versa in that city. When we want to define a, possibly alternating, sequence of train and bus connections, we have the following possibilities. First, we can take a single connection by train. Second, we can take a single connection by train, followed by a number of train and bus connections. We have to make sure that we do not go back to the departure city, and that we only take trips leaving a city later than our time of arrival in that city. Third, we can take a single train connection, change to a bus, and continue with a number of bus and train connections. The same restrictions concerning time of departure and place of arrival apply. The same three cases as distinguished for trips starting with a train can be described for trips starting with a bus.

The complete example, which forms a system of regular, mutually recursive rules, and defines a, possibly alternating, trip of bus and train connections, is expressed in PRISMAlog as follows:

    train_trip(Dep, Arr, Dtime, Atime) ← train(Dep, Arr, Dtime, Atime).
    train_trip(Dep, Arr, Dtime, Atime) ← train(Dep, Z, Dtime, At) & train_trip(Z, Arr, Dt, Atime) & Arr ≠ Dep & At < Dt.
    train_trip(Dep, Arr, Dtime, Atime) ← train(Dep, A, Dtime, At) & change(A) & Arr ≠ Dep & At < Dt & bus_trip(A, Arr, Dt, Atime).
    bus_trip(Dep, Arr, Dtime, Atime) ← bus(Dep, Arr, Dtime, Atime).
    bus_trip(Dep, Arr, Dtime, Atime) ← bus(Dep, Z, Dtime, At) & bus_trip(Z, Arr, Dt, Atime) & Arr ≠ Dep & At < Dt.
    bus_trip(Dep, Arr, Dtime, Atime) ← bus(Dep, A, Dtime, At) & change(A) & Arr ≠ Dep & At < Dt & train_trip(A, Arr, Dt, Atime).

When we translate these rules into μ-calculus expressions we get the following: TRAIN TRIP = X [TRAIN [ 1;6;3;8(TRAIN 12=1^16=2^4