
Generating Synthetic Database Schemas for Simulation Purposes

Carlos Eduardo Pires1, Priscilla Vieira1, Márcio Saraiva1, and Denilson Barbosa2

1 Federal University of Campina Grande, Computer Science Depart., Campina Grande, PB, Brazil {cesp,vieira,marcio}@dsc.ufcg.edu.br
2 University of Alberta, Depart. of Computing Science, Edmonton, Alberta, Canada [email protected]

Abstract. To simulate query answering in Peer Data Management Systems (PDMSs), simulators need to associate a database schema with each peer in the overlay network. Finding or creating a large number of database schemas can be a time-consuming and tendentious task. This work proposes an automatic process to generate multiple synthetic database schemas with semantically coherent variations of a given base schema. The schemas are obtained by applying different types of modifications to subsets of the base schema. Our experimental validation has shown that the proposed method is able to produce random schemas that can be used in realistic simulations.

Keywords: Synthetic Database Schema, Ontology, PDMS, Simulation, Random.

1 Introduction

Peer Data Management Systems (PDMSs) [1, 2, 3] are advanced P2P applications in which each peer is an autonomous data source that makes a local schema available. Peers manage their data locally, revealing part of their internal schemas to other peers. Schema mappings are generated to allow information exchange between peers. PDMS overlay networks tend to be large and allow complex interactions between the physical machines, underlying network, application, and user [4]. Testing a PDMS overlay network or protocol in a realistic environment is often a complex and costly undertaking. Hence, simulation is the most popular technique for investigating overlay networks and PDMS applications [5].

Researchers interested in simulating a PDMS overlay network tend to avoid developing a complex simulator and focus instead on a specific issue, e.g. query answering. For that issue, PDMS simulators need to associate a local database schema with each peer in the simulated network. To make sense, the local schemas should belong to the same application domain. Depending on the domain considered, it can be easy to find a small number of related schemas. For instance, after a quick navigation in the Database Answers web site [6], we found 13 database schemas related to the education domain. However, to simulate a realistic and large-scale PDMS environment, one often needs a large number of schemas. Moreover, these synthetic schemas must belong to a common domain and exhibit sufficiently large overlap (so as to allow the definition of mappings between them). On the other hand, there must also be some heterogeneity among the synthetic schemas in order to better test the strengths and weaknesses of the simulated approaches.

One possible solution is to generate these schemas manually, which is time-consuming and error-prone, and thus does not scale. Instead, this work proposes an automatic process to generate a large number of database schemas to be used in PDMS simulations. As shown in Figure 1, the proposed process consists in, given a base schema S, automatically generating multiple synthetic database schemas S1, S2,…, Sj, where each synthetic schema Sj corresponds to a modified subschema of S. We use two kinds of operations to modify the new schemas: structural changes and element renaming. Structural changes ensure that: (i) synthetic schemas will have a different number of elements; (ii) elements will differ in the number of properties; and (iii) properties will have different data types. Element renaming guarantees that elements in synthetic schemas will have a different label that is semantically similar to the original one. Each operation is performed by a different schema modifier algorithm. The process is extensible in the sense that new schema modifiers can be added. Some schema modifiers use a domain ontology as an external resource. Parameters are used to influence the characteristics of the synthetic schemas to be produced. The following sections offer a detailed description of how synthetic schemas are produced and modified.

A. Hameurlain et al. (Eds.): DEXA 2011, Part II, LNCS 6861, pp. 502–510, 2011. © Springer-Verlag Berlin Heidelberg 2011

Fig. 1. Generating and modifying synthetic database schemas

2 Building Synthetic Schemas

To meet diverse application requirements, the schema generation process accepts several types of parameters. Depending on the parameter values provided, different synthetic schemas S1, S2,…, Sj can be produced for the same base schema S. The input parameters include: (i) Base schema: reference schema used as a basis to produce synthetic database schemas; (ii) Number of schemas: quantity of synthetic schemas to be generated as output; (iii) Schema size: number of elements that each schema should contain; (iv) Schema format: to simplify matters, the synthetic schemas are represented in the relational format; and (v) Modifying operations: different types of operations that can be used to modify the schemas.

Figure 2 depicts the proposed algorithm to generate multiple synthetic schemas. It accepts the general parameters (line 1) described above and returns a collection of synthetic schemas (line 14). Each synthetic schema Sj is created incrementally: elements are selected from the base schema and added to Sj one at a time (lines 6-9). To guarantee the generation of ad-hoc schemas, we use a random function to pick elements from the base schema. Each selected element must maintain a relationship with at least one of the elements already included in the current synthetic schema. This requirement exists because, if a new element were simply selected and added to the synthetic schema (i.e. ignoring its relationships with the other elements in the base schema), manual intervention would be needed to link the new element with the elements already in the synthetic schema. Once a synthetic schema is built, it is modified through the operations provided as input (line 10). These operations are detailed in Section 3.

 1  GenerateSyntheticDatabaseSchemas (input: Parameters; output: SyntheticSchemaCollection) {
 2    Counter ← 0;
 3    While Counter < Parameters.NumberSchemas Do
 4      SyntheticSchema[Counter].Elements ← GetElement(Parameters.BaseSchema.Elements, Random);
 5      SyntheticSchema[Counter].Size ← 1;
 6      While SyntheticSchema[Counter].Size < Parameters.SchemaSize Do
 7        SyntheticSchema[Counter].Elements ← SyntheticSchema[Counter].Elements + GetElement(SyntheticSchema[Counter].Elements.ReferencedElements – SyntheticSchema[Counter].Elements, Random);
 8        SyntheticSchema[Counter].Size++;
 9      End While;
10      ModifySyntheticSchema(SyntheticSchema[Counter], Parameters.ModifyingOperations);
11      SyntheticSchemaCollection ← SyntheticSchemaCollection + SyntheticSchema[Counter];
12      Counter++;
13    End While;
14    Return(SyntheticSchemaCollection); }

Fig. 2. Algorithm for generating and modifying synthetic database schemas
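The selection loop of Figure 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the representation of the base schema as a dictionary mapping each element to the set of elements it is related to, and the decision to stop early when no related element remains are all assumptions made for illustration. The toy graph uses only element names and relationships mentioned in the text.

```python
import random


def generate_synthetic_schemas(base_graph, num_schemas, schema_size,
                               modify=None, rng=None):
    """Build `num_schemas` connected sub-schemas of `base_graph`.

    base_graph: dict mapping an element name to the set of related elements
    (assumed symmetric). modify: optional callable applied to each schema.
    """
    rng = rng or random.Random()
    collection = []
    for _ in range(num_schemas):
        # Seed the schema with one randomly chosen base element.
        selected = [rng.choice(sorted(base_graph))]
        while len(selected) < schema_size:
            # Candidates are related to a selected element and not yet selected.
            candidates = set()
            for element in selected:
                candidates |= set(base_graph[element])
            candidates -= set(selected)
            if not candidates:
                break  # assumption: stop early if nothing related is left
            selected.append(rng.choice(sorted(candidates)))
        collection.append(modify(selected) if modify else selected)
    return collection


# Toy base schema; the real IMDb base schema has 60 relations.
TOY_BASE = {
    "movies": {"certificate"},
    "certificate": {"movies", "country"},
    "country": {"certificate", "location", "shotin"},
    "location": {"country"},
    "shotin": {"country"},
}

schemas = generate_synthetic_schemas(TOY_BASE, 3, 4, rng=random.Random(7))
```

Sorting the candidates before the random choice only makes runs reproducible for a given seed; it does not affect which elements are eligible.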

We use the Internet Movie Database (IMDb) [7] both to illustrate our ideas and for experimental purposes. The IMDb schema is used as the base schema and corresponds to a graph consisting of 60 relations. An excerpt from the IMDb schema is shown in Figure 3.

Fig. 3. An excerpt from the IMDb schema


Fig. 4. A step-by-step example illustrating the generation of a 4-size synthetic schema

Figure 4 illustrates an example in which a 4-size synthetic schema has been requested. The first element selected by the random function is certificate. The next element to be added to the synthetic schema is necessarily one of the elements that maintain a relationship with certificate in the base schema. According to the IMDb schema, the candidate elements are movies and country. Suppose that country is selected by the random function. The third element to be added to the synthetic schema must maintain a relationship with certificate and/or country in the base schema. Besides movies, the candidate elements are location, shotin, releasein, prodcompany2country, country2sfx, located, and distributor2country. Certificate and country are also candidate elements since they maintain a relationship with each other. However, since these elements are already included in the synthetic schema, they are discarded. Assuming that location is selected, the synthetic schema contains three elements: certificate, country, and location. The current candidate elements are now movies, shotin, releasein, prodcompany2country, country2sfx, located, and distributor2country. Finally, consider that the fourth element selected by the random function is movies.
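The candidate sets of this example can be reproduced with a short Python sketch. The edge list below contains only the relationships stated in the example, so it is an assumption-laden fragment rather than the actual IMDb schema; the helper names build_graph and candidate_elements are hypothetical.

```python
# Relationships taken only from the worked example in the text.
COUNTRY_LINKS = ["location", "shotin", "releasein", "prodcompany2country",
                 "country2sfx", "located", "distributor2country"]
EDGES = [("certificate", "movies"), ("certificate", "country")]
EDGES += [("country", other) for other in COUNTRY_LINKS]


def build_graph(edges):
    """Turn an edge list into a symmetric adjacency dictionary."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def candidate_elements(graph, selected):
    """Elements related to any selected element, minus those already selected."""
    related = set()
    for element in selected:
        related |= graph[element]
    return related - set(selected)


graph = build_graph(EDGES)
step1 = candidate_elements(graph, ["certificate"])
step2 = candidate_elements(graph, ["certificate", "country"])
step3 = candidate_elements(graph, ["certificate", "country", "location"])
```

Here step1 holds movies and country, step2 holds movies plus the seven elements related to country, and step3 is step2 without location, which matches the three candidate lists given in the example.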

3 Modifying Synthetic Database Schemas

Assuming that a synthetic schema Sj has been generated, the possible operations to modify Sj are described as follows.

Removal of Properties - consists in eliminating properties from elements of a synthetic schema Sj. At each iteration, an element is selected from Sj and one of its properties is removed. In both cases a random function (Random) is invoked. To guarantee that each element keeps a minimum percentage of its original properties, a parameter (MinimumPct) is used. Each element must keep at least one property. The parameter MaxModifications bounds the number of properties removed from the entire synthetic schema: the variable Modifications is incremented at every iteration, whether or not a property is actually removed. For instance, a property cannot be removed if it is the only remaining property of an element. Figure 5 illustrates the algorithm to remove properties from elements in a synthetic schema.

RemoveProperties (input: SyntheticSchema, MaxModifications, MinimumPct; output: SyntheticSchema) {
  Modifications ← 0;
  While Modifications […] = MinimumPct Then
      Property ← GetProperty(Element.Properties, Random);
      Element.Properties ← Element.Properties – Property;
      SyntheticSchema.Elements(Element).Properties ← Element.Properties;
    End If;
    Modifications++;
  End While;
  Return(SyntheticSchema);
}

Fig. 5. Algorithm to remove properties from elements of a synthetic schema
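The removal operation can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm of Figure 5 itself: the schema is represented as a dictionary from element name to a list of property names, MinimumPct is taken to be a fraction between 0 and 1, and the guard that decides whether a removal is allowed is our reading of the prose description. The property names in the usage example are invented.

```python
import random


def remove_properties(schema, max_modifications, minimum_pct, rng=None):
    """Return a copy of `schema` with randomly removed properties.

    schema: dict mapping element name -> list of property names.
    minimum_pct: assumed to be a fraction (0..1) of the original property
    count that every element must retain.
    """
    rng = rng or random.Random()
    original_sizes = {name: len(props) for name, props in schema.items()}
    result = {name: list(props) for name, props in schema.items()}
    names = sorted(result)
    if not names:
        return result
    # Every iteration counts as a modification, whether or not it removes anything.
    for _ in range(max_modifications):
        name = rng.choice(names)
        props = result[name]
        remaining = len(props) - 1
        # Assumed guard: keep at least one property and at least
        # minimum_pct of the element's original properties.
        if remaining >= 1 and remaining >= minimum_pct * original_sizes[name]:
            props.remove(rng.choice(props))
    return result


example = {"movies": ["id", "title", "year", "budget"], "country": ["id", "name"]}
reduced = remove_properties(example, 10, 0.5, random.Random(1))
```

Because failed attempts also consume an iteration, the number of properties actually removed can be smaller than max_modifications, as the text notes.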

Insertion of Properties - consists in adding semantically related properties to the elements of a synthetic schema Sj. To this end, a domain ontology O on the same topic as the base schema must be available to provide the related properties. Figure 6 shows an excerpt from the Movie Ontology [8]. At each iteration, an element ei is randomly selected from the synthetic schema Sj. Its label is used as input to identify a corresponding term ti in O as well as terms that are semantically equivalent to ti, i.e. the synonyms of ti. All properties of ti and of its synonyms are considered candidate properties. One of them is randomly selected and added to ei. String is used as the default data type for the new property. If a corresponding term ti is not found in O, a new property cannot be inserted.

Fig. 6. An excerpt from the Movie Ontology [8]
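A single insertion step can be sketched in Python as shown below. The ontology layout (a dictionary from term to its properties and synonyms), the case-insensitive exact match between an element label and an ontology term, and the function name insert_property are assumptions for illustration; the text does not specify how labels are matched. The toy ontology is invented and is not taken from the Movie Ontology [8].

```python
import random


def insert_property(element, ontology, rng=None):
    """Try to add one ontology-derived property to `element`.

    element: {"label": str, "properties": {property_name: data_type}}.
    ontology: {term: {"properties": [...], "synonyms": [...]}} (assumed layout).
    Returns True if a property was added, False otherwise.
    """
    rng = rng or random.Random()
    # Assumed matching rule: exact, case-insensitive lookup of the label.
    term = ontology.get(element["label"].lower())
    if term is None:
        return False  # no corresponding term, so nothing can be inserted
    # Candidates: properties of the term and of each of its synonyms.
    candidates = list(term["properties"])
    for synonym in term["synonyms"]:
        candidates += ontology.get(synonym, {}).get("properties", [])
    if not candidates:
        return False
    chosen = rng.choice(candidates)
    if chosen in element["properties"]:
        return False  # a property with the same label already exists
    element["properties"][chosen] = "string"  # default data type
    return True


TOY_ONTOLOGY = {  # invented for illustration
    "movie": {"properties": ["title", "runtime"], "synonyms": ["film"]},
    "film": {"properties": ["director"], "synonyms": ["movie"]},
}

movie = {"label": "Movie", "properties": {"id": "integer"}}
added = insert_property(movie, TOY_ONTOLOGY, random.Random(0))
```

Returning a flag instead of raising an error lets a caller count each attempt as one modification regardless of its outcome.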

Figure 7 illustrates the algorithm that adds properties to the elements of a synthetic schema. The parameter MaxModifications determines the number of properties to be added to the elements of the synthetic schema Sj. To prevent all properties from being added to the same element, the parameter MaximumPct is used. It controls the maximum number of properties that can be added to each element. The variable Modifications is incremented whether or not a property is added to an element. For instance, a property cannot be added to an element ei when ei already contains a property with the same label.

Replacement of Element Label - consists in replacing the label of an element ei in the synthetic schema Sj by a semantically related label. Again, a domain ontology O is used as an external resource. At each iteration an element ei is randomly selected from Sj. Its label is


AddProperties (input: SyntheticSchema, MaxModifications, MaximumPct, DomainOntology; output: SyntheticSchema) {
  Modifications ← 0;
  While Modifications