Knowledge Discovery with FlexiMine

Carmel Domshlak, Tzachi Rozen, Elad Schiller, Solomon Eyal Shimony
Dept. of Math. and Comp. Sci., Ben-Gurion University of the Negev
P. O. Box 653, Beer-Sheva 84105, ISRAEL
e-mail: {carmel, tzachi, schiller, shimony}@cs.bgu.ac.il

April 8, 1998

Abstract

FlexiMine is a KDD system designed as a testbed for ongoing data-mining research, as well as a generic knowledge discovery tool for varied database domains. Among its tasks are learning dependencies from databases, in the forms of Association Rules and probabilistic models. Our current application databases (student information, and hospitalization records) have proved problematic to existing schemes for automated induction of these models. Difficulties were encountered in the multiple-relation nature of the databases on the one hand, and in the existence of meaningful numerical data on the other, resulting in generation of nonsense dependencies, to the exclusion of all else, unless handled carefully. In addition to capabilities for in-process filtering performed within FlexiMine (usually a separate "data-cleaning" step in other systems), a powerful abstraction mechanism allows for induction of more meaningful relationships. In order to apply the mined knowledge, a clear global model semantics is required. Using Bayesian Knowledge Bases, a generalization of Bayes networks, enables construction of a consistent probabilistic model from the association rules.

Keywords: Knowledge Discovery in Databases, Inducing Probabilistic Dependencies, Data-Mining, Association Rules, Bayesian Knowledge Bases.

1 Introduction

Data Mining and Knowledge Discovery (KDD) have become an important technology and research area in recent years, in both the AI and databases communities [7]. FlexiMine is a prototype KDD system currently being developed at Ben-Gurion University (BGU) for testing techniques and algorithms for data mining and knowledge discovery, and for their evaluation in the context of real-life databases and users. The emphasis of this system and its software architecture is on integration of most KDD operations (including database access and selection, preprocessing, data transformations such as abstraction, data-mining algorithms, and interactive visualization) and on extensibility. Thus, the system facilitates incorporation of new algorithms or their improved variants, and convenient extension of support to new databases or abstraction hierarchies. Yet the system preserves a friendly and easy-to-use interface, which enables users and domain experts to access the system from both local and remote locations, and to comment on, and evaluate, result quality. Currently we have two databases on which we evaluate our algorithms: a database containing personal information on students at the faculty of natural science at BGU, including course grades, and a medical database containing information on emergency room visits and hospitalization in a large hospital (Soroka hospital, a close affiliate of BGU). At present FlexiMine contains algorithms of several types and from various fields of AI: learning association rules from a single relation and from multiple relations, learning probabilistic models, decision-tree learning, and meta-queries. Of these, we have gathered some experience with learning association rules from our application databases, much of it in the form of numerous automatically generated nonsense association rules, to the exclusion of any useful rules.
Using filtering schemes existing in the literature and their variants, most nonsense rules can be eliminated, but that does not help in generating the missing desired rule types (for example, in the student database, rules associating eventual success in CS courses with early indicators, such as student admissions data, or grades in early math courses). Problems were encountered in generating the relation for association-rule generation, and in the domains of the attributes, which are extremely multiple-valued. We have had some success in using abstraction filters to overcome the latter problem. Overcoming the former problem is also possible, by using a novel scheme of mining association rules directly from the multirelation database. Another problem we observed is in using the mined knowledge, even after abstraction and filtering. We have a large number of rules, on which we would like to use an automated reasoning tool, and yet a set of association rules exhibits no clear global semantics. The well-known probabilistic model, Bayes networks, seems a good candidate model, but cannot handle incompleteness or cycles. We have thus opted to use Bayesian Knowledge Bases [20], which overcome the above difficulties, have a closer relationship to the association rules, and yet have a consistent global semantics. Section 2 provides background on association rules and Bayesian Knowledge Bases. In Section 3, we discuss some of our failed early attempts at knowledge discovery. Section 4 discusses FlexiMine's abstraction mechanism, and some preliminary usage results. Section 5 presents our methods of generating association rules from multiple relations.

Performance issues and distributed association-rule generation are discussed in Section 6. The reasons for using BKBs and our implementation scheme are outlined in Section 7. We conclude with a discussion of learning BKBs directly from data (Section 8), an evaluation of lessons learned to date, and future directions.

2 Background

We discuss association rules, including existing parallelized schemes and an introduction to PVM, over which our distributed association-rule generator was implemented. Bayesian Knowledge Bases are introduced as a generalization of Bayes networks.

2.1 Association Rule Generation

One of the most important data mining problems is association rule (AR) generation. Formally, this problem is defined in [3] as follows. Let I = {i1, i2, ..., im} be a set of items. Let D be our database, where each transaction T is a set of items such that T ⊆ I. We say that a transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. The rule X ⇒ Y holds in the database D with confidence c if at least c% of the transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the database D if at least s% of the transactions in D contain X ∪ Y.

As an example, let I = {a, b, c, d, e, f} and let D = {{a,b}, {a,b,c}, {a,d}, {a,e}, {b,c}, {b,d}, {b,e}, {c,d}, {e,f}}. The transaction {a,b} contains {a}, since {a} ⊆ {a,b}. The implication {a} ⇒ {b} is an association rule. It has support of 2/9, since two out of the nine transactions in D contain {a,b}. It has confidence of 2/4, since two out of the four transactions in D that contain {a} also contain {b}. Running a rule generator over our hospital database, with fields PatientAge, MainSymptom, FirstDiagnose and HospitalizationPeriod, we would like to produce a rule such as:

PatientAge is ``LessThan17'' and MainSymptom is ``Fever''
  ⇒ FirstDiagnose is ``Debriation'' and HospitalizationPeriod is ``MoreThanDay''
with 74% confidence and support of 34% of the transactions
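The support and confidence measures defined above can be checked directly on the small example database D; the following is an illustrative sketch, not FlexiMine code:

```python
def support(D, itemset):
    """Fraction of transactions in D that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= T for T in D) / len(D)

def confidence(D, X, Y):
    """Of the transactions containing X, the fraction that also contain Y."""
    X, Y = set(X), set(Y)
    containing_x = [T for T in D if X <= T]
    return sum(X | Y <= T for T in containing_x) / len(containing_x)

# The example database D from the text
D = [{'a','b'}, {'a','b','c'}, {'a','d'}, {'a','e'},
     {'b','c'}, {'b','d'}, {'b','e'}, {'c','d'}, {'e','f'}]

print(support(D, {'a','b'}))        # 2/9
print(confidence(D, {'a'}, {'b'}))  # 2/4 = 0.5
```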

Defining a query that can produce the information for such a rule is possible in a conventional database, but a typical user (even a domain expert) may not know which rules to look for. Consequently, one needs an unsupervised method for association rule discovery. Several algorithms for finding association rules were developed in the last five years [1, 11, 3, 13]. These algorithms find all of the rules that satisfy a given support/confidence criterion. Most algorithms share a common main structure, a two-step process: finding all sets of items that have support above the given minimum (frequent itemsets), and generating the desired rules from these itemsets.
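The first step of this two-step process can be sketched as a level-wise search; this simplified version omits Apriori's subset-based candidate pruning and is an illustration only:

```python
def frequent_itemsets(D, min_count):
    """Level-wise search: frequent k-itemsets are built from
    frequent (k-1)-itemsets, as in Apriori (pruning omitted)."""
    # Level 1: frequent single items
    items = {i for T in D for i in T}
    freq = {}
    level = []
    for i in sorted(items):
        c = sum(i in T for T in D)
        if c >= min_count:
            freq[frozenset([i])] = c
            level.append(frozenset([i]))
    k = 2
    while level:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = []
        for cand in candidates:
            c = sum(cand <= T for T in D)
            if c >= min_count:
                freq[cand] = c
                level.append(cand)
        k += 1
    return freq

# The example database D from the text; minimum count 2 (support 2/9)
D = [{'a','b'}, {'a','b','c'}, {'a','d'}, {'a','e'},
     {'b','c'}, {'b','d'}, {'b','e'}, {'c','d'}, {'e','f'}]
F = frequent_itemsets(D, min_count=2)
```

On this data the frequent pairs are {a,b} and {b,c}, each contained in two transactions; no 3-itemset reaches the threshold, so the search stops after the third level.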


2.1.1 Parallel Association-Rule Algorithms

Three parallel algorithms for mining association rules, based on the Apriori algorithm, were suggested in [2]. These algorithms were designed to investigate the performance trade-offs between computation, memory usage, communication, and the use of problem-specific information in parallel data mining. Specifically:

1. The focus of the Count Distribution algorithm was on minimizing communication, even at the expense of carrying out redundant duplicate computations and duplicate storage in parallel.

2. The Data Distribution algorithm attempted to utilize main memory more efficiently, at the expense of extra communication.

3. The Candidate Distribution algorithm exploits the semantics of the particular problem at hand to reduce synchronization between the processes, and has load balancing built into it.

An empirical comparison of these algorithms was made [2], implemented on an IBM POWERparallel System SP2, a shared-nothing machine [25]. The communication primitives used by the implementation are part of MPI (Message Passing Interface); candidates for a message-passing communication standard were under discussion at the time [12]. The algorithms, as presented, assume a shared-nothing architecture, where each of N processes runs on its own processor, with a private memory and a private disk.

2.1.2 PVM - Parallel Virtual Machine

PVM (Parallel Virtual Machine), which we used as a platform for distributed induction of association rules, provides a unified framework within which parallel programs can be developed in an efficient manner using existing hardware [8]. PVM enables a collection of heterogeneous computer systems to be viewed as a single parallel virtual machine. PVM transparently handles all message routing, data conversions and task scheduling across a network of incompatible computer architectures. The user writes an application as a collection of cooperating tasks. Tasks access PVM resources through a library of standard interface routines. These routines allow the initiation and termination of tasks across the network, as well as communication and synchronization between tasks. Its message-passing primitives are oriented towards heterogeneous operation, involving strongly typed constructs for buffering and transmission.

2.2 Bayesian Knowledge Bases

Probabilistic models are widely used for summarization of the database, or for performing diagnostic or predictive reasoning based on the database. The attributes of the database are the variables of such a model, and the domain of each attribute defines all possible instantiations of the corresponding variable. One well-known graphical probabilistic model is Bayesian Networks (BNs) [14].

A recent variant of Bayes networks is Bayesian Knowledge Bases (BKBs), introduced in [18]. BKBs are a generalization of weighted proof graphs and BNs that allow cycles, as well as incompleteness of the local conditional probabilities. The dependencies in the model are defined between instantiations, and cycles are allowed between variables as well as between instantiations of variables.
Figure 1: BKB example - variables A, B, C, D, and E

The BKB model consists of a directed graph, called a correlation graph, which has two distinct types of nodes (Fig. 1). The first type is called an instantiation node (I-node), and corresponds to the instantiations of the variables (several I-nodes correspond to a single node in a Bayes network - one I-node for each domain value of the Bayes network node). The second type is called a support node (S-node) and is used to quantify the conditional dependencies between instantiations of variables (loosely corresponding to one or more conditioning cases, or to one or more conditional probability table (CPT) entries, in a Bayes network). Each S-node has exactly one outgoing arc, to an I-node (its consequent), and a (possibly empty) set of incoming arcs, each from an I-node (its antecedents). The S-nodes are labeled by a number that indicates the conditional probability of the I-node at the head of their outgoing arc, given the I-nodes at the tails of their incoming arcs. If there is no incoming arc, this number indicates the prior probability of the I-node at the head of the outgoing arc.
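The correlation-graph structure described above can be captured by a small data model; the concrete probabilities below are illustrative values, not figures from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class INode:
    var: str    # variable name, e.g. "A"
    value: int  # one instantiation of that variable

@dataclass(frozen=True)
class SNode:
    antecedents: frozenset  # tail I-nodes (empty set => a prior probability)
    consequent: INode       # the single head I-node
    prob: float             # P(consequent | antecedents)

# An S-node encoding P(B = 2 | A = 1) = 0.1 (numbers illustrative)
s = SNode(frozenset({INode("A", 1)}), INode("B", 2), 0.1)

# An S-node with no incoming arcs: the prior P(A = 1) = 0.8
prior = SNode(frozenset(), INode("A", 1), 0.8)
```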

3 Initial Failed Attempts

From the initial stages of FlexiMine system evaluation on the application databases, it became evident that the straightforward association rule induction process leads to no useful results.

1. Interestingness of the resulting rules - Interesting, and therefore useful, dependencies are mingled with many unintuitive, uninformative rules, which may even merely restate functional dependencies in the database [24]. Unfortunately, experiments show that the output consists mostly of the latter type of rules. Furthermore, because of such "informative garbage", it is impossible to focus on the real object of the search.

2. Global understanding of the resulting rules - Even if we focus on the rules that contain the useful information, it is impossible to perform global reasoning on them, due to the lack of a global semantics of the rules.

3. Computational performance issues - The complexity of the algorithm is potentially exponential; thus, even under optimal conditions, it takes a long time for the process to terminate. In order to achieve speedup, several parallel algorithms were proposed [2], but no KDD system implements them, as far as we know. Furthermore, all parallel algorithms were designed and evaluated in a multiprocessor environment, which is not widely accessible.

4. Multiple relations - The application databases consist of multiple relations. In addition to the semantic problem of how to generate a single joint relation from the data in a manner that makes sense, performance of actually executing the requisite join operation was abysmal.

5. Overspecification - Attribute domains may be at a resolution that is too high for meaningful summarization or rule induction. Without a taxonomy to create meaningful terms from sets of values, no association can receive sufficient support. Thus, useful combinations between values may be invisible to the data mining algorithms, while, on the other hand, garbage associations stand out in contrast.

As an example of the latter difficulty, consider the table in Figure 2, a part of a larger table (both in number of records and in number of attributes) obtained from our student database - grades in calculus and algebra. Numerical grades are from the original database, but the additional "mark" and "status" columns were added to illustrate the overspecification issue. Attempting to mine rules directly from the numbers achieves nothing, since no pair of grades (algebra, calculus) appears more than once.

An abstraction of the domain, that is, classifying the values of the domain according to some pre-defined categories, losing excessive detail en route, may be helpful. For example, one may classify grades of 55 or below as F, grades from 56 to 69 as C, grades from 70 to 85 as B, and the rest as A (Figure 2). It is now easy to see, for example, that the mark in calculus is never higher than the mark in algebra, for any given student (this is not the case in the numerical domain). Further abstraction to PASS/FAIL is possible, but the appropriate level of abstraction at which to perform knowledge discovery is an open problem. FlexiMine provides a mechanism for abstraction with hierarchical capabilities, allowing creation of "taxonomic" hierarchies both statically and dynamically (see Section 4).
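The two-level abstraction just described (numeric grade, then letter mark, then PASS/FAIL) can be written directly; the boundaries follow the text:

```python
def mark(grade):
    """First abstraction layer: numeric grade (0-100) -> letter mark,
    using the boundaries 55 / 69 / 85 from the text."""
    if grade <= 55:
        return "F"
    if grade <= 69:
        return "C"
    if grade <= 85:
        return "B"
    return "A"

def status(grade):
    """Second layer, defined on top of the first: mark -> PASS/FAIL."""
    return "FAIL" if mark(grade) == "F" else "PASS"

# Student 1 from Figure 2: calculus 68 -> C PASS, student 2: 43 -> F FAIL
print(mark(68), status(68))  # C PASS
print(mark(43), status(43))  # F FAIL
```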

4 Abstraction in FlexiMine

Abstraction of domain values is in essence a process of grouping and labeling. For instance, in the student grades example, one can think of the grades as a collection of individual integers ranging from zero to one hundred, each of which is a category by itself, and of the abstraction process as their grouping under the four labels A, B, C and F. In FlexiMine, abstraction may be defined over domains, such as integer, as well as on

           Calculus                 Algebra
id    grade  mark  status     grade  mark  status
1      68     C    PASS        76     C    PASS
2      43     F    FAIL        51     F    FAIL
3      92     A    PASS        94     A    PASS
4      85     B    PASS        90     A    PASS
5      66     C    PASS        59     C    PASS
6      52     F    FAIL        48     F    FAIL
7      80     B    PASS        77     B    PASS

Figure 2: Student Grades Table

values of a specific attribute of a specific relation. The result of an abstraction is a map providing an (abstract) label or value to each of the original values. If the abstraction result types are domain-based, further abstractions can be defined on top of existing ones, leading to abstraction ("taxonomic") hierarchies [15], a well-known AI technique for classifying the system's knowledge about objects. For example, we may classify the marks A, B and C as PASS marks and the mark F as a FAIL mark, as shown in Figure 2. In that case, each grade has a hierarchy of labels. For example, the grade 68 has the label C, and C is labeled PASS in the second level of the abstraction hierarchy. Note that existing relationships between attributes at one level may be lost at a higher level. For example, we might interpret the fact that the mark in algebra is never lower than the mark in calculus as the "fact" that calculus is more difficult to pass than algebra. This distinction is lost at the higher abstraction level, since mapping A, B and C to PASS loses the useful information about the lower grade in calculus. It is not easy to predict the correct abstraction level to use, or even to choose the correct groupings. Using a predefined hierarchy may help by providing background knowledge, but may also be detrimental if the grouping happens not to fit the data well. Abstraction in our system thus allows both for predefined labeling schemes and for creating labeling schemes based on the actual data. Currently, we allow a user to do the grouping and labeling on top of histograms. The data is also available to domain-based classification algorithms in our system. Abstractions are layered, where each layer is defined in an expression language. The abstraction definition is compiled in the context of the domain over which the abstraction is defined. The compiler returns a search tree (a map) corresponding to the expression.
The search trees are then integrated into the KDD session as a temporary mapping relation, together with a retrieval procedure that translates values to labels (from the above relation). Mapping relations may be either set-of-interval based or set-of-value based. An assumption of non-overlap between set covers is made (and optionally, enforced) in both cases. After abstraction definition, the mechanism becomes transparent to the user, as well as to other system layers, enabling easy incorporation of automatic abstraction schemes (such as those based on automated clustering).
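A set-of-intervals mapping relation with the non-overlap check might look like the following; this is an illustrative sketch, not the actual FlexiMine implementation:

```python
import bisect

class IntervalMap:
    """Maps a value to a label via sorted, non-overlapping intervals."""

    def __init__(self, intervals):
        # intervals: (lo, hi, label) triples with inclusive bounds
        intervals = sorted(intervals)
        for (_, hi1, _), (lo2, _, _) in zip(intervals, intervals[1:]):
            if lo2 <= hi1:  # enforce the non-overlap assumption
                raise ValueError("overlapping intervals")
        self._los = [lo for lo, _, _ in intervals]
        self._intervals = intervals

    def label(self, value):
        # Binary search for the interval whose lower bound is <= value
        i = bisect.bisect_right(self._los, value) - 1
        if i >= 0 and value <= self._intervals[i][1]:
            return self._intervals[i][2]
        return "OTHER"  # default label for uncovered values

marks = IntervalMap([(0, 55, "F"), (56, 69, "C"),
                     (70, 85, "B"), (86, 100, "A")])
print(marks.label(68))  # C
```

A second IntervalMap from marks to PASS/FAIL would be a set-of-value mapping; stacking the two gives the layered ("taxonomic") hierarchy described above.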

4.1 Abstraction Specification Language

The domain mechanism of the language includes the predefined domains integer, real and string, the manually defined domains set and list (sequence), and the compound domain of a tuple of domains (a structure). The statements of the language define an abstraction over a given source domain into a given destination domain. The definition is made by dividing the source domain and its sub-domains into smaller sub-domains and by combining sub-domains. At each step the newly generated sub-domain is given a value from the destination domain or an internal name from the set of internal strings. The language consists of the following three statement types:

- BUILD <destination-domain> (<source-domain>) FROM [<source-domain-name>];
- <name> = <source> [<function>] [<selection>] [<exclusion>];
- COVER (<list>);

The BUILD and the COVER statements should begin and end, respectively, an abstraction description. The assignment statement may appear any number of times in between. The BUILD statement consists of a destination domain expression, a source domain expression, and a source domain name, as the tags in the statement suggest. The source domain name is used by the assignment statements, and may be a valid value of the destination domain, empty, or an internal string. The assignment statements define new subranges of the source domain. They have the following parts:

- <name> - the label to be given to the new subrange: a valid value of the destination domain of the abstraction, or an internal string.
- <source> - a name of an already defined subrange, including the name of the source domain of the abstraction.
- <function> - a function F to apply to values of the source subrange.
- <selection> - a criterion for deciding which subset of values from the source subrange is in the new subrange.
- <exclusion> - an expression for excluding values from the selected subset of values.

There are two basic methods of labeling in the assignment statement: the obvious one, as in "name = ...;", and the automatic one. To use the automatic method, specify the reserved word AUTOMATIC instead of a name, which results in a 'gensym'ed label - an internal string.
The source subrange is given by an expression consisting of names of already defined subranges. The possible forms are "<name>", "<name>.<subname>", and "(<name>, <name>, ...)". The first and the second forms refer to an already defined subrange or sub-subrange, and the third form defines a union of already defined subranges and

sub-subranges. To keep things relatively simple, in an assignment statement of the third form, only the name and the source parts are allowed. To keep track of the subrange hierarchy defined by the assignment statements, the compiler maintains a DAG, with each node corresponding to an already defined subrange. A new node defined by the first or second statement form is added to the graph immediately below its source subranges. A new node defined by the third form is added as a parent of all subranges in the list. The selection part is optional. If it appears, it takes values from the range of F in one of the following forms:

- An exact range, of the form (x..y).
- A CONS expression, of the form CONS(x), which divides the subrange into steps of size x.
- A SUB expression, of the form SUB(n), which divides the subrange into n equal-sized pieces.
- A REL expression, of the form REL(Relation, Value), where Relation is any binary relation and Value is any value. The REL expression chooses the set of values x in the subrange such that the result of applying F to x stands in the relation Relation to Value.

The COVER statement completes the abstraction definition. It takes a list of already defined names from the assignment statements of the abstraction description. The list should not include internal names; the only allowed names are values from the destination domain. The subranges associated with the list values must be pairwise mutually exclusive. Values of the source domain not included in one of the associated subranges are collected and labeled OTHER (a reserved word), by default.
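The CONS and SUB selection forms both partition a subrange; their semantics on an inclusive integer range can be sketched as follows (an illustration of the semantics only, not the language implementation):

```python
def CONS(lo, hi, x):
    """CONS(x): divide the inclusive range [lo, hi] into steps of size x."""
    return [(a, min(a + x - 1, hi)) for a in range(lo, hi + 1, x)]

def SUB(lo, hi, n):
    """SUB(n): divide [lo, hi] into n (roughly) equal-sized pieces."""
    size = (hi - lo + 1) / n
    return [(lo + round(i * size), lo + round((i + 1) * size) - 1)
            for i in range(n)]

print(CONS(0, 99, 25))  # [(0, 24), (25, 49), (50, 74), (75, 99)]
print(SUB(0, 99, 4))    # the same four pieces
```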

4.2 Some Preliminary Results

Initial experiments on association rule generation under abstraction provide encouraging results. As shown above, on actual application data, it is hard to get any useful rules over multiple-valued domains. On our student grade database, as illustrated above with respect to Figure 2, running the association rule algorithm resulted in either no rules (with a high minimum support requirement), or numerous rules, each supported by one or two transactions (but, obviously, many with a misleading confidence of 100 percent). The situation remained similar even when grouping grades into spans of 10. In the latter case, the confidence of almost all rules was around 25 to 30 percent and the support was around 5 percent. More meaningful results were obtained when grade abstraction boundaries were set in increments of 20. This time we had two rules (Table 1) with better than 50 percent confidence and better than 25 percent support. Additionally, this number remained stable under large changes in minimum required support and confidence. Compared with the rules we get with the "natural" abstraction into (bad, good, excellent) grades (Table 2), this is near-optimal. Unfortunately, although stable under support and confidence thresholds, the results were unstable under changes in the abstraction partition parameters:

rule                              confidence (%)  support (%)
algebra = A ⇒ calculus = B              57             26
algebra = B ⇒ calculus = B              63             29

  E: 0-20   D: 21-40   C: 41-60   B: 61-80   A: 81-100

Table 1: Association Rules with Uniform Abstraction

rule                                          confidence (%)  support (%)
algebra = excellent ⇒ calculus = good               76             26
algebra = good ⇒ calculus = good                    87             57
calculus = good ⇒ algebra = good                    68             57

  bad: 0-55   good: 56-84   excellent: 85-100

Table 2: Association Rules with Coarse Abstraction

slightly different categories result in a poor outcome (Table 3). This behaviour supports our observation that "good" groupings are sensitive to the data, and that predicting them is a non-trivial task. Thus, our scheme of allowing construction of abstractions based on data histograms is a useful tool.

rule                              confidence (%)  support (%)
algebra = B ⇒ calculus = C              57             25

  F: 0-55   C: 56-70   B: 71-85   A: 86-100

Table 3: Association Rules with Domain-Based Abstraction

Abstractions were also generated for parts of the International Classification of Disease (ICD9) codes in our medical database. These were created with the aid of a domain expert in cardiovascular diseases. Preliminary attempts to view and process the database using the abstraction showed that it was essential to the process, due to the large number of disease codes with nearly identical semantics.

5 Multiple-Relation Association Rule Generation

Two well-known association rule induction algorithms, Apriori and AprioriTid, were presented in [3]. The computational complexity of these algorithms is potentially exponential in the number of attributes, and their space requirements may also grow exponentially in the worst case. Any run-time improvement scheme would thus be a boon. The application databases consist of multiple relations. Thus, mining a multi-relational database forces us to perform a universal join on its relations, without considering the potential advantage that may be gained from the original structure of the database. Algorithms that pre-process the implicit joint relation, while taking advantage of the structure, are discussed below. We assume that the join to be performed is on a sequence of relations, such that between each two successive relations r_i, r_{i+1}, a one-to-many connection exists from r_i to r_{i+1}. In this case, each transaction in the joint relation consists of successive groups of attributes, each group generated from the appropriate relation in the sequence. For example, consider the following relation definitions (primary keys were italicized in the original), and data records as shown in Table 4:

1. gender(Gender, Gender ID)
2. personal(Gender ID, Degree, Native, Personal ID)
3. travel(Personal ID, Country)

gender:
  gender  id  #
  M       1   2
  F       2   2

personal:
  id  gender id  degree  native  #
  1   1          B.A.    Yes     1
  2   1          M.A.    No      2
  3   2          Ph.D.   Yes     1
  4   2          B.A.    Yes     2
  5   2          B.A.    No      4

travel:
  personal id  country
  1            IL
  2            IL
  2            GB
  3            IL
  4            USSR
  4            IL
  5            USSR
  5            IL
  5            USA
  5            GB

Table 4: Example Database

The joint relation of interest, R, is generated by projecting the join onto the attributes (Gender, Native, Degree, Country). We assume that the transactions in R are sorted lexicographically (based on <relation, attribute, value> tuples). The first transaction in R is (M, Yes, B.A., IL): M comes from the gender relation, Yes and B.A. from the personal relation, and IL from travel. The transactions of R are shown as sequences in Table 5 (ignore the parenthesized numbers for now). Both Apriori and AprioriTid divide the problem into two subproblems: generating frequent itemsets, and rule construction (see Section 2). In Apriori, the first step requires multiple passes over the database, to calculate the support counters for the potentially frequent itemsets.

Transaction: [data] (number of repetitions)
[M] (3) [Yes,B.A.]  (1) [IL]
[M] (3) [No,M.A.]   (2) [IL]
[M] (3) [No,M.A.]   (2) [GB]
[F] (7) [Yes,Ph.D.] (1) [IL]
[F] (7) [Yes,B.A.]  (2) [USSR]
[F] (7) [Yes,B.A.]  (2) [IL]
[F] (7) [No,B.A.]   (4) [USSR]
[F] (7) [No,B.A.]   (4) [IL]
[F] (7) [No,B.A.]   (4) [USA]
[F] (7) [No,B.A.]   (4) [GB]

Table 5: Joint Relation

Apriori iteration    1    2    3   Total
Our modification    22   40    8    70
Simple Apriori      40   58   10   108

Table 6: Performance Comparison

We can associate any prefix H_i(T), consisting of the first i groups of any transaction T, with a maximal group of successive transactions G(H_i(T)), such that H_i(T) is also a prefix of every transaction in G(H_i(T)). During the pre-processing stage, the counts |G(H_i(T))| for all H_i(T) are evaluated and stored. Observe that if for some transactions U, V we have H_i(U) = H_i(V), the counts are only computed and stored once - at the first time the prefix is encountered. The key issue is that, when itemset support is counted in the Apriori or AprioriTid algorithms for some itemset I, it is sufficient to consider just the first transaction T that contains I. The support of I is then exactly |G(H_i(T))|, where i = argmin_i (I ⊆ H_i(T)). For example, Table 5 contains the counts |G(H_i(T))| for i = 1, 2. For i = 3 we always have |G(H_i(T))| = 1, and this number is omitted. If prefixes of successive transactions tend to match, this scheme saves much recomputation and search. For the example database, the number of count modifications made in each algorithm iteration is shown in Table 6, for our modified scheme vs. simple Apriori. A comparative analysis of the results is complicated, and outside the scope of this paper.
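The pre-computed counts |G(H_i(T))| can be obtained in a single pass over the sorted joint relation; here is a sketch using the Table 5 data (the tuple encoding of the groups is ours):

```python
def prefix_group_counts(transactions, i):
    """For each distinct prefix of i groups, count how many successive
    transactions share it: |G(H_i(T))|. Assumes lexicographic sorting,
    so that transactions with equal prefixes are adjacent."""
    counts = {}
    for t in transactions:
        p = t[:i]
        counts[p] = counts.get(p, 0) + 1
    return counts

# Joint relation R from Table 5; groups: gender / native+degree / country
R = [("M", ("Yes", "B.A."),  "IL"),
     ("M", ("No",  "M.A."),  "IL"),
     ("M", ("No",  "M.A."),  "GB"),
     ("F", ("Yes", "Ph.D."), "IL"),
     ("F", ("Yes", "B.A."),  "USSR"),
     ("F", ("Yes", "B.A."),  "IL"),
     ("F", ("No",  "B.A."),  "USSR"),
     ("F", ("No",  "B.A."),  "IL"),
     ("F", ("No",  "B.A."),  "USA"),
     ("F", ("No",  "B.A."),  "GB")]

print(prefix_group_counts(R, 1))  # {('M',): 3, ('F',): 7}
```

The depth-2 counts reproduce the (1), (2), (4) values shown in Table 5.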


Figure 3: Process Hierarchy (user interface, PVM server, and master/slave processes grouped within the PVM environment on the hosts)

6 Distributed Induction of Association Rules

Efficient parallelization of association rule induction is of interest, and has been accomplished for supercomputer systems (see Section 2.1.1). However, much less is known about achieving effective parallelism for this task in distributed environments, which are far more accessible to most users. During the work on FlexiMine we were interested in modifying these algorithms and investigating their behavior in a multicomputer distributed network environment. In such a network system, no assumptions may be made about:

- computer architecture, computational speed, or data-format homogeneity
- the number of CPUs in each computer (both monoprocessors and multiprocessors may be connected)
- current availability and reachability of the hosts (host and network fault tolerance)
- the location of the hosts (remote hosts may be added to the system)
- the process-to-processor mapping (multiple tasks may execute on a single processor)
- machine CPU or memory load
- network load

On the other hand, we may use the fact that all computers share a common NFS, so that the whole database is available to all hosts. Following these assumptions and restrictions, we have decided to implement the Count Distribution and Data Distribution algorithms on the PVM communication platform [8]. For both algorithms, the process

hierarchy is presented in Figure 3. Each task of association rule discovery is handled by a special process (master) that creates a number (n) of identical processes (slaves), which together actually perform the task. The number of slaves may vary from case to case, and may depend on user decisions or on some learned information (previous invocations, database size, etc.) - see Figure 4. Slaves are identified by the unique PVM task identifier, or by a name uniting all the slaves in one virtual group together with the instance number of the task within that group. For this purpose, we use the group management of PVM, handled by a special group server that is started when the first group function is invoked. Message broadcasting between the slaves is local to the group, enabling a number of separate association rule discoveries to execute simultaneously (see Figure 3). In both algorithms, in each pass, synchronization between the processes is required, making the process/processor independence very important. In some situations we may prefer to bind two slave processes to one powerful processor and to leave some other, relatively weak or currently busy, processor free of our task processes. The machine-load consideration in PVM partially balances the relative progress of the slaves. A resource manager is a PVM task responsible for making task scheduling (placement) decisions. These decisions may be based on variable parameters (host load averages, memory size, CPU speed, etc.) [8].
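The essence of the Count Distribution step - each slave counting candidate itemsets on its own database partition, and the master summing the local counts - can be sketched without PVM; the parallel fan-out is simulated sequentially here, and the data is a toy example of ours:

```python
from collections import Counter

def local_counts(partition, candidates):
    """Slave step: support counts for the candidate itemsets,
    restricted to this slave's partition of the database."""
    counts = Counter()
    for T in partition:
        for cand in candidates:
            if cand <= T:
                counts[cand] += 1
    return counts

# Master step: split the database, fan out, and sum the local counts.
D = [{'a','b'}, {'a','b','c'}, {'a','d'}, {'b','c'}]
candidates = [frozenset({'a','b'}), frozenset({'b','c'})]
partitions = [D[:2], D[2:]]

total = Counter()
for part in partitions:  # with PVM these would run on separate slaves
    total += local_counts(part, candidates)

print(total[frozenset({'a','b'})])  # 2
```

Because only the per-candidate counters cross process boundaries (not the data), communication per pass is small, which is exactly the trade-off the Count Distribution algorithm makes.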

Figure 4: AR and BKB modules structure

We have performed several experiments on real data generated from the student database. The experiments ran on our department LAN, where neither the computers nor the LAN communications were dedicated to our experiments, and thus the computer and network load changed dynamically during the runs. As expected, speedup was rather far from ideal, not just due to the distributed setup, but because the "ideal" was based on the assumption that all processes were run on the fastest machine, which is not possible on the actual system, where some of the machines were slower. Additionally, the process-to-processor mapping was not necessarily one-to-one: several processes may execute on one processor. Likewise, much time and space overhead was caused by the communication primitives of PVM, which use an extremely general scheme (see Section 2.1.2). While the speedup is not as close to the ideal as for the parallel version, as expected, it is still useful: a factor of about 5 for 10 PVM processes. A comparison between the distributed versions of the algorithm, surprisingly, suggests that data distribution did not save as much main-memory space (compared to count distribution) as in the parallel version.
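The reported factor of about 5 for 10 PVM processes corresponds to a parallel efficiency of roughly 0.5. The timings below are hypothetical, chosen only to be consistent with that reported speedup:

```python
# Illustrative speedup/efficiency arithmetic (hypothetical timings,
# consistent with the reported factor of ~5 on 10 PVM processes).
def speedup(t_serial, t_distributed):
    return t_serial / t_distributed

def efficiency(s, n_processes):
    return s / n_processes

t_serial = 100.0       # hypothetical serial run time, seconds
t_distributed = 20.0   # hypothetical distributed run time, 10 processes

s = speedup(t_serial, t_distributed)   # 5.0
e = efficiency(s, 10)                  # 0.5
print(s, e)
```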

7 BKBs in FlexiMine

Although BNs are widely used [10], they have several shortcomings:

- BNs fail to capture certain conditional independencies between particular instantiations [22, 23], and thus demand a complete specification of the conditional probability of every instantiation of a variable given any instantiation of its conditioning variables.
- Cycles in the graph structure are not allowed [19], in order to ensure consistency.
- BNs cannot represent partially known distributions.

The first weakness was addressed by various extensions to the BN model, including Similarity Networks [9], Independence-Based Relevance [22, 23], Weighted Proof Graphs (WAODAGs) [6], and Context-Specific Independence [4]. The latter problems are still mostly open in a probabilistic framework, but cannot be ignored. In the real world, complete information is typically unattainable, so a representation must be able to handle and recognize incompleteness as it occurs. As to prohibiting cycles in the model, especially when the model is learned from data rather than by interviewing an expert, mutual causality must be handled and represented. The BKB model [18] overcomes these weaknesses. An additional advantage of BKBs is their simple, intuitive knowledge representation, which closely corresponds to association rules. For example, an association rule A = 1 ∧ B = 2 ∧ C = 1 ⇒ D = 0 may be represented in the constructed BKB by an S-node with three I-nodes, corresponding to the instantiations A = 1, B = 2, and C = 1, as antecedents, and the I-node corresponding to D = 0 as the consequent (Fig. 1). The probability labeling this S-node is exactly the confidence of the rule. We exploit this close relationship when constructing the BKB from the association rules. Owing to the decomposition of the Apriori algorithm, we combined on-line BKB construction with the rule-discovery part of the algorithm (Fig. 4). Initially, the BKB is empty.
After large-itemset generation, each time a rule is discovered and approved, we insert the corresponding new S-node (and possibly new I-nodes), together with all relevant edges, into the BKB. Because of this "parallel" processing, no additional traversal of the large itemsets is required. Furthermore, BKB construction may run as a separate parallel process that receives the generated rules through a message-passing mechanism, which prevents slowing down the Apriori module (in a distributed environment with sufficient free processing power). This approach can serve as the basic framework for any on-line usage of the discovered rules. When construction is complete, the BKB is saved in the format used by MACK [17]. Two further steps must be performed on the resulting model in order to make it a legal BKB: enforcing topological consistency, and normalization. These issues are beyond the scope of this paper. Suffice it to say, at this point, that one of the topological requirements, that an I-node not support an I-node that contradicts it, is automatically taken care of by the association-rule generation step. Note also that attempting to generate a Bayes network in this manner will not result in anything resembling a legal BN, due to rule cycles, which cause cycles in the correlation graph.
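The rule-to-BKB correspondence described above can be sketched as a small data structure. The class and method names below are ours (not the MACK or FlexiMine API), and the confidence value is hypothetical; the sketch only illustrates how each approved rule yields one S-node, with I-nodes for the body instantiations as antecedents and the I-node for the head instantiation as consequent.

```python
# Minimal sketch (our own names) of incremental BKB construction from
# association rules: one S-node per rule, labeled with the rule's
# confidence; I-nodes are shared attribute-value instantiations.
class BKB:
    def __init__(self):
        self.i_nodes = set()    # instantiation nodes, e.g. ("A", 1)
        self.s_nodes = []       # (antecedent I-nodes, consequent I-node, prob)

    def insert_rule(self, body, head, confidence):
        """Insert one approved rule: body is {attribute: value}, head is
        a single (attribute, value) pair; confidence labels the S-node."""
        antecedents = frozenset(body.items())
        self.i_nodes.update(antecedents)    # add any new I-nodes
        self.i_nodes.add(head)
        self.s_nodes.append((antecedents, head, confidence))

bkb = BKB()                                 # initially the BKB is empty
# The example rule from the text: A = 1 and B = 2 and C = 1  =>  D = 0
# (the confidence 0.87 is a made-up illustrative value)
bkb.insert_rule({"A": 1, "B": 2, "C": 1}, ("D", 0), confidence=0.87)
print(len(bkb.i_nodes), len(bkb.s_nodes))   # 4 I-nodes, 1 S-node
```

The topological-consistency and normalization steps mentioned above would operate on this structure after all rules have been inserted.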

7.1 Reasoning on BKBs

The BKB constructor is based on the BKB kernel of a new knowledge acquisition tool called MACK [17]. MACK is a manual tool for BKB construction that automatically guarantees consistency of the knowledge base, and assists the expert in avoiding or correcting inconsistencies as knowledge is acquired. In our case of unsupervised BKB construction, system behavior must be different, necessitating a number of changes in BKB construction. Likewise, the original BKB kernel prohibited cycles in the graph structure, mandating a modification of the consistency-testing module. Clearly, BKBs are useless without an effective reasoning engine, such as one for finding most probable inferences (abductive explanations) for observed evidence, or for making predictions over the model. The problem is analogous to (and more general than) the NP-hard problem of belief revision in BNs, or of finding a minimum-cost proof on a WAODAG. To date, finding most-probable inferences in general BKBs has been implemented as best-first heuristic search, where the heuristic used was cost-so-far, with dismal results (PESKI in [17]). We adapted and implemented a successful WAODAG-search cost-sharing heuristic [5], applying it to BKBs. Empirical results show that the heuristic saves considerable search effort on several BKBs [21]. The implemented inference engine takes as input a BKB file and the observed evidence, and searches for the most probable inferences, working as a separate, non-blocking process.
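The search scheme just described can be made concrete with a toy sketch. This is our own simplified framing, not the PESKI or FlexiMine engine, and it ignores the AND-structure of S-node antecedents: probabilities are turned into additive costs via -log p, so the most probable inference is the minimum-cost goal. A heuristic h estimates the remaining cost; setting h = 0 recovers the pure cost-so-far search the text mentions, while a sharper (admissible) h such as the cost-sharing estimate prunes more of the frontier.

```python
# Schematic best-first search for a most-probable inference (toy model,
# not the actual BKB engine; all names and the example graph are ours).
import heapq
import math

def best_first(start, expand, is_goal, h):
    """expand(state) yields (next_state, step_probability) pairs.
    Returns the cheapest (= most probable) path to a goal state."""
    frontier = [(h(start), 0.0, start, [start])]
    best_g = {}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, math.exp(-g)           # inference and its probability
        if state in best_g and best_g[state] <= g:
            continue                            # already reached more cheaply
        best_g[state] = g
        for nxt, p in expand(state):
            g2 = g - math.log(p)                # accumulate -log probability
            heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None, 0.0

# Toy example: two inference chains from evidence "e" to hypothesis "d".
graph = {"e": [("a", 0.9), ("b", 0.5)], "a": [("d", 0.4)], "b": [("d", 0.9)]}
path, prob = best_first("e", lambda s: graph.get(s, []),
                        lambda s: s == "d", h=lambda s: 0.0)
print(path, round(prob, 2))
```

Here the chain e, b, d wins with probability 0.5 × 0.9 = 0.45, beating e, a, d at 0.36, even though its first step looks worse, which is exactly why a pure cost-so-far ordering can perform poorly.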

8 Discussion

In order to be useful for reasoning and decision making, we prefer probabilistic models to sets of association rules. Our probabilistic model of choice is Bayesian Knowledge Bases (BKBs). Learning BKBs from data by using an intermediate association-rule generation step seems a simple scheme, and indeed is easy to implement. However, it does suffer from several deficiencies. First, the semantics of this learning process is not well understood. Second, association-rule generation is not all that fast a process, even with the best existing algorithms. Since BKBs are a simple generalization of BNs, for which direct learning schemes exist, it seems reasonable to consider learning BKBs directly from the data [16]. In particular, we are examining this approach with minimum description length (MDL) as the BKB evaluation criterion. The problem with these schemes is that although the learning process is better understood, the resulting search problem for the best model is hard, and the solution is typically a greedy, sub-optimal search algorithm. We are considering using the initial stages of association-rule generation as a focusing mechanism for searching for some of the BKB structure. Regardless of the model used, correct granularity of the attribute domains is required. The importance of the abstraction mechanism to the data-mining process is obvious: it is a powerful tool that allows the user to effectively apply intuition about the meaning of the raw data. Abstraction also facilitates using the results of classification (and clustering) from the data-mining algorithms in further sessions of the data-mining process. Our experiments with the standard association-rule algorithms show that without the abstraction mechanism it is hard to get any rules on multiple-valued domains. Further research is needed to show the benefits of using the abstraction hierarchy, in particular of mining rules on abstract domains that get their values from an autonomous clustering process.
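The MDL criterion mentioned above admits the standard two-part formulation; we state it only as a sketch of the kind of score involved, not necessarily the exact criterion pursued in [16]:

```latex
\mathrm{MDL}(M \mid D) \;=\; L(M) \;+\; L(D \mid M)
               \;=\; L(M) \;-\; \log_2 P(D \mid M),
```

where L(M) is the number of bits needed to encode the model's structure and probability labels, and the second term penalizes models that fit the data D poorly; the learner prefers the model M minimizing this sum, trading complexity against fit.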

Acknowledgments This research is supported in part by an infrastructure grant for data-mining from the Israeli Ministry of Science, and by the Paul Ivanier Center for Robotics and Production Management.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In ACM-SIGMOD, pages 207-216, Washington, D.C., 1993.
[2] R. Agrawal and J. C. Shafer. Parallel mining of association rules: Design, implementation and experience. IBM Research Report RJ 10004, 1996.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB-94, Santiago, Chile, 1994.
[4] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence, Proceedings of the 12th Conference, pages 115-123. Morgan Kaufmann, August 1996.
[5] E. Charniak and S. Husain. A new admissible heuristic for minimal-cost proofs. In Proceedings of the AAAI Conference, pages 446-451, 1991.
[6] E. Charniak and S. E. Shimony. Cost-based abduction and MAP explanation. Artificial Intelligence Journal, 66(2):345-374, 1994.
[7] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.
[8] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing. The MIT Press, Cambridge, Massachusetts, 1994.
[9] D. Heckerman. Probabilistic Similarity Networks. MIT Press, 1991.
[10] D. Heckerman. A tutorial on learning with Bayesian networks. Microsoft Research Technical Report MSR-TR-95-06, 1995.
[11] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In AAAI-94 Workshop on Knowledge Discovery in Databases, 1994.
[12] MPI: A Message-Passing Interface Standard. Message Passing Interface Forum, May 1994.
[13] J. S. Park, M.-S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. In ACM-SIGMOD, San Jose, California, 1995.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.
[15] E. Rich and K. Knight. Artificial Intelligence. McGraw-Hill, Inc., 1991.
[16] T. Rosen. Discovery of Rule-Based Probabilistic Knowledge in Databases. Ph.D. thesis proposal, 1997.
[17] E. Santos, Jr. and D. O. Banks. Acquiring consistent knowledge. Technical Report AFIT/EN/TR96-01, Dept. of Electrical and Computer Engineering, Air Force Institute of Technology, 1996.
[18] E. Santos, Jr. and E. Santos. Bayesian knowledge-bases. Technical Report AFIT/EN/TR96-05, Dept. of Electrical and Computer Engineering, Air Force Institute of Technology, 1996.
[19] E. Santos, Jr., E. S. Santos, and S. E. Shimony. Generalized d-separation for probabilistic reasoning: An inference-based model. (In preparation), 1998.
[20] E. Santos, Jr. Cost-based abduction and linear constraint satisfaction. Technical Report CS-91-13, Computer Science Department, Brown University, 1991.
[21] S. E. Shimony, C. Domshlak, and E. Santos, Jr. Cost-sharing in Bayesian knowledge bases. In UAI-97, pages 421-428, 1997.
[22] S. E. Shimony. The role of relevance in explanation I: Irrelevance as statistical independence. International Journal of Approximate Reasoning, 8(4):281-324, June 1993.
[23] S. E. Shimony. The role of relevance in explanation II: Disjunctive assignments and approximate independence. International Journal of Approximate Reasoning, 13(1):27-60, July 1995.
[24] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970-974, 1996.
[25] Scalable POWERparallel Systems. International Business Machines, GA23-2475-02 edition, February 1995.