A Calibration Mechanism Identifying the Optimization Technique of a Multidatabase Participant

Myra Spiliopoulou
Institut für Wirtschaftsinformatik, Humboldt-Universität zu Berlin
myra@wiwi.hu-berlin.de, http://www.wiwi.hu-berlin.de/myra

Abstract

A multidatabase optimizer draws an abstract plan for a query, the fragments of which are optimized by the underlying autonomous databases. In order to construct a plan that can be optimized coherently, the multidatabase optimizer must know the principles of the local optimizers. We propose a methodology for the identification of the techniques used for join ordering: a suite of test queries is issued against the local database, the processing time is measured and used to draw conclusions on the characteristics of the query execution plan. We establish a taxonomy of optimizers and apply our methodology to identify the category to which an unknown optimizer belongs.

Keywords: Multidatabase query optimization, optimizer calibration, non-conformant databases

* A summary of part of this work has been presented in the 8th Workshop "Grundlagen von Datenbanken", Friedrichsbrunn, Germany, 1996.

1 Introduction

A multidatabase system provides an integrated environment for data retrieval from multiple databases. A challenging problem in this area is query optimization in the absence of adequate information on the characteristics of the local databases. We propose a methodology for the identification of the optimization technique of the local databases, assisting in the effective decomposition of the original query and the efficient processing of its fragments.

A multidatabase query involves two or more Local DataBase Systems (LDBSs). The Multidatabase Query Optimizer (MQO) partitions the query into subqueries processed by the LDBSs and coordinates the merging of results. In the studies of [2, 4], multidatabase optimization mechanisms are proposed. Du et al. present a tree balancing mechanism, which transforms left-deep trees into bushy ones, in order to increase the concurrency and independent processing in multidatabase query execution [2]. However, the local optimizers processing the query subtrees may violate those transformations because they adhere to different principles. Salza et al. propose a distributed optimization mechanism, according to which the query is executed as a set of multiple subqueries by the LDBSs [4]. Each LDBS forwards the cost estimate of the subquery it processed to the LDBS waiting for its output. The query execution plan (QEP) is constructed incrementally. However, it is not always possible to require that an LDBS send a cost estimate prior to query execution, nor to exploit an estimate coming from another LDBS. Furthermore, the technique demands that all involved relations coming from the same LDBS are joined together, even by means of cartesian products, which might have a prohibitive overhead.

In order to assist the MQO in preparing the subqueries towards the LDBSs, researchers have turned towards methods for the identification of LDBS characteristics. In [1], a calibration database is established in each LDBS to estimate the cost of restrictions and joins according to an abstract cost model. In [7], multiple regression is proposed to estimate the cost parameters of restriction and join algorithms.

In a multidatabase query, the MQO must specify the order and execution sites of the multidatabase joins, as well as their interleaving with local joins. The subqueries are optimized locally, and the local optimal (sub)QEPs cannot be changed by the MQO because of the LDBS autonomy. The execution time of a query plan may vary considerably. Since access to an LDBS may be charged or communication delays may be considerable, the MQO must be able to estimate a good approximation of the query execution cost. Knowledge of the local optimization techniques is indispensable for a realistic prediction of this cost.

We propose a model that identifies the optimization techniques of an LDBS. A multidatabase optimizer can exploit this knowledge to foresee the order of query

operators in the subquery towards each LDBS. To this end, our model is complementary to mechanisms identifying cost functions [1, 7] and to techniques evaluating the behaviour of different join algorithms and index-based access methods [1]: by knowing the cost function, the join algorithms and the join ordering strategy, the multidatabase optimizer can simulate the decisions of the local optimizers and produce a global QEP conformant to their principles.

In the next section, we present a motivating example for the need to identify the principles of a local optimizer. In section 3, we introduce our methodology and categorize the LDBSs in terms of optimization functionality. In the next two sections, we present the tests performed by our model to identify various aspects of the LDBS optimizer functionality and the heuristics it uses. The last section concludes our study.

2 A Motivating Example

Let DB1, DB2 be two LDBSs. Relations R1, ..., R5 belong to DB1, relations S1, S2, S3 belong to DB2. The example query graph in Fig. 1 contains two "multidatabase joins", R1 ⋈ S1 and R2 ⋈ S2. The MQO must specify the order and the execution sites of those joins and their interleaving with local joins. In Fig. 2 we show two QEPs that may be generated by the MQO in order to compute the query cost. In QEP (a), the multidatabase joins are performed by the query originator, while in QEP (b) they are assigned to DB2.

Figure 1: A query graph

The join ordering specified by the MQO and used for cost estimation cannot be enforced upon the autonomous LDBSs: the global perspective of the MQO is not known to the LDBSs, which optimize the subQEPs according to their local criteria. Further, if DB1 is a file system, the execution of 4 joins in one

batch may yield a prohibitive cost. If it is a DBMS considering only linear QEPs, any cost estimation based on a bushy QEP as in Fig. 2(a) is unreliable.

Figure 2: Two example QEPs

3 Optimizer Calibration Methodology

We identify an LDBS optimization technique by "calibrating" the optimizers: we issue a series of queries towards a synthetic database in the LDBS and use their response times to detect the structure of the QEPs thus executed.

3.1 An LDBS Taxonomy

In order to identify the optimization technique of an LDBS, we establish a taxonomy of LDBSs according to their functionality, as shown in Figure 3. We use our calibration mechanism to identify the category to which each LDBS belongs. The hierarchy root is the LDBS, which may or may not contain an optimizer. For LDBSs without optimizers, we verify whether they process joins as such or implement them as restrictions over cartesian products. The latter is expected in file systems. For LDBSs having an optimizer, we test whether the optimizer is activated before or during execution and whether it is exhaustive or heuristic-based. This knowledge is necessary to estimate the duration of the optimization phase, which is part of the overall processing cost. Some optimizers, like the one of System R, combine an exhaustive strategy with a heuristic reducing the size of the search space. We still consider such optimizers as "exhaustive". Non-exhaustive ones are based on deterministic algorithms or use a combinatorial optimization strategy [6]. In this first study in this area, we present methods for assigning optimizers to the categories pointed to by thick arrows.

Figure 3: A categorization of LDBSs (taxonomy labels: no optimizer, run-time optimizer, compile-time optimizer, exhaustive, non-exhaustive, deterministic algorithms, randomized algorithms, joins in order of appearance, joins as restrictions over cartesian products, Top-Down Heuristic, Minimum Selectivity Heuristic, combination of TDH and MSH)

3.2 Methodology

We establish a "synthetic" database with special-purpose characteristics on each LDBS. First, we issue a set of "reference" queries, whose QEPs are trivial. Then, we issue a set of "test" queries with non-trivial QEPs. We compare their processing time, i.e. the time needed by the LDBS to optimize and execute each one. Based on the relative processing times of reference and test queries, we draw conclusions on the QEP which was used for the test queries, and hence on the optimization technique that produced it. This process is depicted in Figure 4 for an example set of queries. RefQuery1 : R0 ⋈ Ri produces the empty result. RefQuery2 : R0 ⋈ R1 produces a nonempty output. TestQuery contains n joins, all identical to the join of RefQuery2 except for R0 ⋈ Ri. By comparing the processing times of these queries, we can find which join was executed first by the LDBS.
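For illustration, a minimal Python sketch of this comparison is given below. It assumes a DB-API style connection object (here called conn) to the LDBS and plain SQL strings for the reference and test queries; the median over a few repetitions and the 10% threshold are arbitrary example choices, not part of the methodology.

import time

def processing_time(conn, sql, repetitions=5):
    # Median wall-clock time (seconds) to optimize and execute one query.
    samples = []
    for _ in range(repetitions):
        cursor = conn.cursor()
        start = time.perf_counter()
        cursor.execute(sql)
        cursor.fetchall()                      # force materialization of the result
        samples.append(time.perf_counter() - start)
        cursor.close()
    return sorted(samples)[len(samples) // 2]

def first_join_is_empty(conn, ref_query1, ref_query2, test_query, threshold=0.1):
    # refQuery1: empty join, refQuery2: non-empty join, testQuery: n joins.
    t_ref1 = processing_time(conn, ref_query1)
    t_ref2 = processing_time(conn, ref_query2)
    t_test = processing_time(conn, test_query)
    # If the test query is barely slower than the empty reference join, the
    # empty-result join was placed first in the QEP executed by the LDBS.
    return (t_test - t_ref1) < threshold * max(t_ref2 - t_ref1, 1e-9)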

Figure 4: The calibration process

3.3 Calibration Framework

We make very few assumptions on the willingness of the LDBS to export meta-information on its functionality. In the terminology of [1], we consider both "conformant" LDBSs, which provide database statistics, and "non-conformant" LDBSs, which give no information on their processing strategy. We assume that (i) the LDBS permits the creation of the synthetic database, (ii) it is possible to store meta-information on join selectivities in the data dictionary, and (iii) all queries of the same experiment are issued under the same LDBS workload. The first assumption has also been taken for granted in [1], since such an activity is permissible in cooperating databases. The second assumption concerns information that is anyway indispensable for an optimizer. The last assumption is related to our notion of processing time: rather than assuming an idle system or no foreign load, we define as "query execution time" the time required by the system to respond under a load typical for it. Hence, we only require a uniform workload during the execution of a single experiment; if this is not possible, the experiment should be repeated to obtain a reliable sample.

Establishing the synthetic database. Our synthetic database consists of three sets of relations. The number of relations depends on the parameter n, the size of the largest join query that is expected in this LDBS. A possible setup is sketched after the list.

- Database size: 2 · (n+1) + k relations, organized in two sets of n+1 relations, A1 and A2, and one set of k relations, A3. The relations in A1 are all of equal size a, those in A2 of equal size a' = 10 · a, while A3 is a small set of relations of arbitrary size, needed in the tests of section 5.

- Relation size: Relation sizes are measured in LDBS block units. The relations should be large enough to permit reliable measurement of the time needed to join them. However, they should fit in memory; otherwise the selection of execution algorithms may interfere with the optimization decisions we are trying to identify.

- Attributes: All relation attributes are of the same type and have a constant length. The values of each attribute are uniformly distributed.

- Number of attributes: n join attributes per relation. An attribute participates in only one join.

Note that access methods and join algorithms, which generally depend on the relation size, are not relevant, since the optimization principles we study are orthogonal to the suite of algorithms considered by the LDBS. Knowledge of those principles can be combined with a method to estimate the cost of different algorithms and access methods [1].
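As an example of such a setup, the following hypothetical sketch creates the relation sets through a Python DB-API connection with qmark ('?') parameters (as in sqlite3). All table and column names (A1_0, ..., join attributes j1, ..., jn) are placeholders introduced only for this sketch; the join-attribute values are filled uniformly here and would in practice be chosen so as to realize the join types introduced below.

import random

def create_synthetic_db(conn, n, k, a, seed=0):
    # Sets A1 and A2 contain n+1 relations each (sizes a and 10*a); A3 contains
    # k small relations (one-tuple relations here, as used in section 5).
    # Every relation has n join attributes j1 .. jn with uniform values.
    random.seed(seed)
    cur = conn.cursor()
    columns = ", ".join("j%d INTEGER" % i for i in range(1, n + 1))
    for prefix, count, size in (("A1", n + 1, a), ("A2", n + 1, 10 * a), ("A3", k, 1)):
        for r in range(count):
            table = "%s_%d" % (prefix, r)
            cur.execute("CREATE TABLE %s (%s)" % (table, columns))
            marks = ", ".join("?" for _ in range(n))
            for _ in range(size):
                row = [random.randrange(size) for _ in range(n)]  # placeholder fill
                cur.execute("INSERT INTO %s VALUES (%s)" % (table, marks), row)
    conn.commit()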

Joins on the synthetic database. We introduce four join types:

- ⋈1 produces the empty result
- ⋈2 outputs a single tuple
- ⋈3 outputs a relation equal in size to its left input
- ⋈4 produces two tuples

The join selectivity factors should be stored as meta-information, either explicitly or by issuing queries to establish them. Our calibration methodology is based on the placement of type ⋈1 joins within the QEP. Therefore, the selectivity factor of ⋈1 joins should not be set to zero. Since selectivity factors are approximate values computed statistically, ⋈1 and ⋈2 joins have the same selectivity. This must be specified explicitly or by means of appropriate queries and updates before each experiment. Alternatively, the output of a ⋈1 join can be a tuple which satisfies no other join predicate: for queries with this join type, the empty result is then produced by the immediately subsequent join.
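One possible way (not prescribed above) to realize the four join types is to control how many join-attribute values the two operands share. The hypothetical generator below returns the values of one join column for two relations of a tuples each, so that the equi-join has exactly the required output size; types 2 and 4 assume a >= 2.

def join_columns(a, join_type):
    # Returns (left_values, right_values) for one join attribute.
    left = list(range(a))                      # values 0 .. a-1, all distinct
    if join_type == 1:                         # empty result: disjoint value ranges
        right = list(range(a, 2 * a))
    elif join_type == 2:                       # exactly one matching tuple
        right = [0] + list(range(a, 2 * a - 1))
    elif join_type == 3:                       # output as large as the left input
        right = list(range(a))                 # one join partner per left tuple
    elif join_type == 4:                       # exactly two matching tuples
        right = [0, 1] + list(range(a, 2 * a - 2))
    else:
        raise ValueError("unknown join type")
    return left, right

Note that the type 1 and type 2 columns look the same to any statistics kept as approximate selectivity factors, which is exactly the property the calibration queries exploit.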

4 Calibrating an LDBS

Our calibration mechanism assigns an LDBS to a leaf category of the taxonomy in Figure 3. We first test how the LDBS processes join queries. If there is an optimizer, its basic characteristics are identified stepwise.

4.1 Is there an optimizer?

Principles. If the LDBS does not have any optimizer at all, then query execution follows some primitive pattern, such as (i) performing the cartesian product of all relations and applying the join predicates as restrictions to the result, or (ii) executing the joins in order of appearance. A more sophisticated approach is interpreted as the output of an optimizer.

How does the LDBS execute joins? We consider three relations R0, R1, R2 ∈ A1, all having size a. We execute two reference and two test queries; the parentheses stress the order of appearance:

- refQuery1 : R0 ⋈1 R1 produces no tuples
- refQuery2 : R0 ⋈3 R2 produces a tuples
- testQuery1 : (R0 ⋈3 R2) ⋈1 R1; the second join produces the empty relation
- testQuery2 : (R0 ⋈3 R2) ⋈3 R1 produces a tuples

Since all relations have the same meta-information, the preparation time for refQuery1 and refQuery2 is the same. Their execution time differs only by the time needed by refQuery2 to write its output:

time(refQuery2) = time(refQuery1) + write(a)    (1)

where time() denotes processing time and write(a) is the time needed to write a tuples to the output. We measure time(refQuery1) and time(refQuery2), so that write(a) can be calculated. testQuery2 executes the same join as refQuery2 and produces the same output. The same holds for testQuery1 and refQuery1. Thus:

time(testQuery1) = time(refQuery1) + x1
time(testQuery2) = time(refQuery2) + x2

where the values of x1 and x2 account for the different preparation time required for a query containing one more join and for the much larger time needed to execute that join. Since the two test queries access the same relations with the same meta-information, their preparation time is the same. Therefore, any difference between x1 and x2 is due to different execution times.

Evaluation of the results. By comparing the processing times of the queries, we observe the following:

- Join order:
  1. If x1 is much smaller than x2, the join R0 ⋈1 R1 was executed first in testQuery1: then x1 consists only of parsing time and is closer to zero than to x2.
  2. If x1 is comparable to x2, the join producing the empty result was not executed first, i.e. the LDBS ignored the selectivity factors.

- Use of cartesian products:
  1. If the joins are executed as such, the two joins in testQuery2 have almost the same execution time, because the output of each ⋈3-join consists of a tuples. Hence time(testQuery2) <= 2 · time(refQuery2), i.e. x2 <= time(refQuery2). The inequality covers the fact that the intermediate relation may not need to be read from disk.
  2. If we observe that x2 is much larger than 2 · time(refQuery2), the LDBS performs joins as restrictions over a cartesian product.

Hence, by issuing four queries and comparing their processing times, we can deduce whether the LDBS executes joins as such or builds cartesian products. If there is no optimizer, the MQO must optimize the join sequence itself and forward to the LDBS an already optimal QEP with a given join order.
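The evaluation rules above can be mechanized as in the following sketch; the factors 0.5 and 1.5 are arbitrary stand-ins for "much smaller" and "much larger" and would have to be tuned per system.

def classify_join_processing(t_ref1, t_ref2, t_test1, t_test2):
    # Timings of refQuery1/2 and testQuery1/2 from the two-join experiment.
    x1 = t_test1 - t_ref1
    x2 = t_test2 - t_ref2
    return {
        # x1 much smaller than x2: the empty-result join was scheduled first,
        # i.e. the LDBS took the selectivity factors into account.
        "selectivity-aware join order": x1 < 0.5 * x2,
        # x2 much larger than 2 * time(refQuery2): joins were evaluated as
        # restrictions over a cartesian product rather than as joins.
        "joins as cartesian products": x2 > 1.5 * (2 * t_ref2),
    }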

4.2 Can the LDBS process large queries?

Motivation. In the experiment of subsection 4.1 we have used queries with one or two joins only. However, the LDBS may be using its optimizer only for large queries that justify the optimization overhead. Large join queries occur in decision support applications and expert systems, which may acquire information from one or more LDBSs. They can also arise in an LDBS with an object-oriented querying interface. It is therefore essential to verify whether the LDBS can process large join queries.

Formulation and processing of large join queries. For the efficient execution of a large join query, optimization is indispensable. Hence, an LDBS that does not use an optimizer for such a query does not have any. We consider n + 1 relations R0, ..., Rn ∈ A1, where n is the typical size of a large query towards the LDBS. We issue the following queries:

- refQuery1 : R0 ⋈1 R1.
- refQuery2 : R0 ⋈3 R1.
- testQuery1 : R0 ⋈1 R1 ⋈1 R2 ... ⋈1 Rn, which corresponds to a linear query graph R0 - R1 - ... - Rn and produces the empty result.
- testQuery2 : R0 ⋈3 R1 ⋈3 R2 ... ⋈3 Rn, which produces an output of a tuples.

The results concerning the reference queries can be obtained from the test in subsection 4.1. Equation 1 is applicable here, too. The two test queries are applied on the same relations, have identical query graphs and consist of joins with the same selectivity. Therefore, they have the same optimization time o1 = o2. Hence:

time(testQuery1) >= time(refQuery1) + o1
time(testQuery2) >= time(refQuery2) + o2

where the excess over the right-hand side accounts for the execution of the additional joins.

Evaluation of the results. The outcome of the experiment can be success or LDBS failure:

- If testQuery1 succeeded but testQuery2 failed, then the optimizer processed it, but the query processor apparently ran out of resources.
- If testQuery1 failed, then either the LDBS used cartesian products to execute the query, eventually running out of resources, or the optimizer could not process it. In the latter case, the optimizer is not appropriate for large join queries and the MQO must partition them into smaller fragments before sending them to the LDBS.

If the test queries are successfully executed:

1. If time(testQuery1) is approximately time(refQuery1), then optimizationTime(testQuery1) is approximately 0. This means that the LDBS has no optimizer. Moreover, the query processor is performing joins, not cartesian products: the execution time of testQuery1 was reduced to that of processing a single join.
2. If time(testQuery1) is approximately time(testQuery2) - write(a), there is no optimizer: this equality holds only if the LDBS executes a cartesian product instead of optimizing the query properly.

In this test, we identify LDBSs using optimizers only for large queries. We also identify LDBSs supporting large join query optimization and LDBSs that can execute such queries but do not optimize them. The latter must obtain QEPs already optimized by the MQO.

In this experiment, we based our conclusions on a clear distinction between optimization and execution time. However, if the LDBS has a run-time optimizer, then optimization and execution times are interleaved. We present a solution to this problem in [5]; we omit it here due to space limitations.
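The decision rules of this subsection could be evaluated with a sketch like the one below. Here run is assumed to be a callable that returns the processing time of a query and raises an exception when the LDBS fails, write_a is the write(a) term of Eq. 1, and the 10% tolerance is only an example value.

def evaluate_large_query_test(run, ref_query1, test_query1, test_query2, write_a, tol=0.1):
    try:
        t_test1 = run(test_query1)
    except Exception:
        return "failure: cartesian products exhausted resources, or the optimizer cannot handle large queries"
    t_ref1 = run(ref_query1)
    try:
        t_test2 = run(test_query2)
    except Exception:
        return "optimizer processed the query, but execution ran out of resources"
    if abs(t_test1 - t_ref1) <= tol * max(t_ref1, 1e-9):
        return "no optimizer, but joins are executed as joins (empty result propagated)"
    if abs(t_test1 - (t_test2 - write_a)) <= tol * t_test2:
        return "no optimizer: the query was evaluated via a cartesian product"
    return "the LDBS optimizes large join queries"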

4.3 Exhaustive optimizers

Motivation. When the MQO estimates the processing time of a query, it must also consider the optimization time required by each LDBS. Optimization time is particularly important for large queries. If the search space of a query is scanned exhaustively, the optimization time increases exponentially with the query size. Non-exhaustive optimizers have a polynomial time increase. To recognize whether the scan is exhaustive or not, we need to measure the optimization time for a series of queries and draw the curve of optimization time against the query size.

Duration of the optimization phase. We consider m star queries over the set A1, so that the i-th query Qi has as center the relation R0, which is joined with R1, ..., Ri using join type ⋈1. For LDBSs supporting large queries, m can be equal to n. The processing time of Qi consists of the optimization time oi and the execution time ei. Since all joins produce the empty result, all queries have the same execution time e1, the time needed to execute one join. The optimization time of a one-join query is negligible, so that time(Q1) is approximately e1. Therefore:

oi = time(Qi) - time(Q1)    (2)

Evaluation of the results. According to the above formula, we measure the processing times of Q1, ..., Qm, compute their optimization times and draw the curve of their increase with the query size: if the increase is exponential, the optimizer scans the search space exhaustively. Otherwise, the optimizer performs a heuristic non-exhaustive search [6].
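One simple way (an assumption of this sketch, not of the method itself) to decide between exponential and polynomial growth is to compare how well log oi fits a straight line in i versus in log i. The sketch expects at least three measured processing times and guards against noisy, near-zero optimization times.

import math

def optimizer_scan_type(times):
    # times[i-1] = processing time of the star query Q_i, i = 1 .. m.
    # Eq. 2: o_i = time(Q_i) - time(Q_1); clamp noisy near-zero values.
    o = [max(t - times[0], 1e-9) for t in times[1:]]
    sizes = list(range(2, len(times) + 1))

    def sse(xs, ys):
        # Sum of squared residuals of an ordinary least-squares line fit.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
        alpha = my - beta * mx
        return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))

    log_o = [math.log(v) for v in o]
    semilog = sse(sizes, log_o)                        # log o_i linear in i: exponential
    loglog = sse([math.log(s) for s in sizes], log_o)  # log o_i linear in log i: polynomial
    return "exhaustive (exponential growth)" if semilog < loglog else "non-exhaustive (polynomial growth)"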

The side-effects of parallel joins. The above test assumes that query execution consists of performing a single join, which produces the empty result. However, an optimizer might decide to execute several joins involving R0 in parallel and merge their results, in order to increase concurrency. Then, the duration of a join is increased and Eq. 2 does not hold. To detect this case, we repeat part of the above test for the larger relations in set A2: we execute the queries Q'1 : R'0 ⋈1 R'1 and Q'm, the corresponding star query joining R'0 with R'1, ..., R'm by ⋈1 joins. The optimization times are o'1 (approximately 0) and o'm = om, as before, since the relations of A1 and A2 differ in size but have the same characteristics. The values time(Q'1) and time(Q'm) are measured. We can then verify the formula:

time(Q'm) = o'm + time(Q'1)

If the formula holds, the above conclusions are valid. If not, the LDBS executes joins simultaneously and merges their results. This implies that bushy trees are considered [3]. This knowledge is valuable, as it identifies the search space of the LDBS optimizer. Moreover, it indicates that the optimizer is not exhaustive: exhaustive optimization of bushy spaces is prohibitively expensive. Hence, a polynomial increase of query optimization time can be assumed, for which the results of this experiment supply a rough approximation.
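The check of this formula can be coded directly; om is taken from the A1 experiment as an estimate of o'm, and the tolerance is again an arbitrary example value.

def considers_bushy_trees(t_qm_prime, t_q1_prime, o_m, tol=0.1):
    # time(Q'_m) should equal o'_m + time(Q'_1) if only a single join is really
    # executed; a clear violation indicates parallel joins, i.e. bushy QEPs.
    predicted = o_m + t_q1_prime
    return abs(t_qm_prime - predicted) > tol * predicted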

5 Deterministic Heuristics

An optimizer may be using one or more heuristics. We currently support the top-down heuristic (TDH), executing the least expensive join first, and the minimum selectivity heuristic (MSH) [6]. We present the experiment verifying the usage of the MSH hereafter. The experiment for the TDH can be found in [5]. Since heuristic-based optimizers perform a total ordering of joins according to some ranking criterion, we exclude from our tests the optimizers considering bushy trees, as identified in the experiment of subsection 4.3.

5.1 Minimum Selectivity Heuristic

Motivation. The Minimum Selectivity Heuristic (MSH) ranks joins by selectivity factor. An optimizer may use it as a self-standing heuristic or as a criterion for the TDH. For the latter case, we note that according to the TDH, the least expensive join is executed first; this can be the join with the highest selectivity factor (MSH), the join on the smallest input relations, or the join with the lowest execution cost. Therefore, the calibration test must be such that the optimal QEP can be produced by the MSH but not by the related TDH.

Join ranking. We issue a star query testQuery with m joins. The center of the star is R0. The query consists of m - 1 type ⋈2 joins R0 ⋈2 Ri, i = 1, ..., m - 1, and of one join R0 ⋈1 R'm which produces the empty result. Relation R'm belongs to set A2, so it is larger than all other relations. Hence R0 ⋈1 R'm has the highest selectivity, but also the largest input streams and probably the highest join cost. Therefore, it would not be the first one to execute if joins were ranked by relation size or join cost as in the TDH; only the MSH gives this join the highest ranking. The reference queries are:

- refQuery1 : R0 ⋈1 R'm
- refQuery2 : R0 ⋈2 R1 produces a one-tuple relation S0
- refQuery3 : S1 ⋈2 R2, where S1 ∈ A3 is a one-tuple relation of the same size as S0; the output of this join is again a one-tuple relation
- refQuery4 : S1 ⋈1 R'm produces no tuples

Evaluation of the results. If the MSH is used, then the first and only join executed is R0 ⋈1 R'm. Its processing time is:

time(testQuery) = om + time(refQuery1)

If the MSH is not used, the first join executed is one producing a one-tuple relation. Then:

time(testQuery) = om + time(refQuery2) + z

where z = time(refQuery4) or z >= time(refQuery3), depending on whether the join executed next is R0 ⋈1 R'm or not.
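Analogously, the comparison of the measured time(testQuery) with the two predictions can be sketched as follows, with om the optimization time of an m-join query as estimated in subsection 4.3 and an illustrative tolerance.

def uses_msh(t_test, t_ref1, t_ref2, o_m, tol=0.1):
    # MSH: the empty, highest-selectivity join R0 join R'_m is executed first,
    # so time(testQuery) should be close to o_m + time(refQuery1).
    msh_prediction = o_m + t_ref1
    other_prediction = o_m + t_ref2        # some one-tuple-producing join ran first
    return abs(t_test - msh_prediction) <= tol * msh_prediction and t_test < other_prediction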

6 Conclusions

In this study, we have presented a calibration mechanism, the aim of which is to assist a multidatabase optimizer in obtaining knowledge on the LDBS optimization technique. This knowledge is necessary for the cooperation between the multidatabase and the LDBS optimizers, and for the generation of QEPs compatible with this technique. We introduce a categorization of LDBSs, based on their optimization functionality, and we identify the category of an unknown LDBS by calibrating its optimizer. We have presented the methodology used by our model to verify the existence of an optimizer in an LDBS, to estimate the optimization time, and to identify whether the optimizer is exhaustive, compile-time or run-time, or based on heuristics. The usage of two particular heuristics is also detected. Orthogonally to those characteristics, the potential of executing large join queries and the support of concurrent join execution (as in bushy trees) are also studied.

We are currently designing a working prototype of this model to test some commercial DBMSs and the exhaustive optimizer of System R. We are extending our model in two directions: on the one hand, we are further refining the categorization of optimizers by introducing orthogonal directions, e.g. concerning the query tree form (linear or bushy) and the combination of heuristics with exhaustive search. On the other hand, we are considering methods for the detection of more sophisticated heuristics, including those used by run-time optimizers. Finally, we would like to identify the influence of data caching, which is extensively used in modern LDBSs. To avoid the impact of caching on our experimental results, we can perform operations that empty the cache. However, the exploitation of cached data leads to more efficient query execution, a factor that should be taken into account by a multidatabase optimizer. Since this issue is related to multiquery optimization, we regard it as future work.

References

[1] Weimin Du, Ravi Krishnamurthy, and Ming-Chien Shan. Query optimization in heterogeneous DBMS. In Int. Conf. on Very Large Databases, pages 277-291, Vancouver, Canada, 1992.

[2] Weimin Du, Ming-Chien Shan, and Umeshwar Dayal. Reducing multidatabase query response time by tree balancing. In SIGMOD Int. Conf. on Management of Data, pages 293-303, San Jose, CA, 1995.

[3] Goetz Graefe. Parallelizing the Volcano database query processor. In 35th CompCon Conf., Digest of Papers, pages 490-493, San Francisco, CA, 1990. IEEE.

[4] Silvio Salza, Giovanni Barone, and Tadeusz Morzy. Distributed query optimization in loosely coupled multidatabase systems. In Fifth Int. Conf. on Database Theory (ICDT'95), pages 40-53, Prague, Czech Republic, 1995.

[5] Myra Spiliopoulou. Identifying the optimization principles of a DBMS participating in a multidatabase. Technical Report ISS-23, Institut für Wirtschaftsinformatik, Humboldt-Universität zu Berlin, Berlin, Germany, 1996.

[6] Michael Steinbrunn, Guido Moerkotte, and Alfons Kemper. Optimizing join orders. Technical Report MIP9307, Faculty of Mathematics, University of Passau, Passau, Germany, 1993.

[7] Qiang Zhu and Per-Åke Larson. A query sampling method for estimating local cost parameters in a multidatabase system. In 10th Int. Conf. on Data Engineering, pages 144-153, Houston, Texas, 1994.