Applied Artificial Intelligence, 19:677–689. Copyright © 2005 Taylor & Francis Inc. ISSN: 0883-9514 print/1087-6545 online. DOI: 10.1080/08839510590967316

ARMGA: IDENTIFYING INTERESTING ASSOCIATION RULES WITH GENETIC ALGORITHMS

Xiaowei Yan, Chengqi Zhang, and Shichao Zhang
Faculty of Information Technology, University of Technology, Sydney, Australia

Apriori-like algorithms for association rule mining rely on two user-specified thresholds: minimum support and minimum confidence. Two significant challenges stand in the way of applying these algorithms to real-world applications: database-dependent minimum support and an exponential search space. Database-dependent minimum support means that users must specify suitable thresholds for their mining tasks even though they may have no knowledge of their databases. To circumvent these problems, in this paper we design an evolutionary mining strategy, the ARMGA model, based on a genetic algorithm. Like genetic algorithms in general, our ARMGA model is effective for global search, especially when the search space is so large that deterministic search is hardly feasible.

INTRODUCTION

Association rule mining has been investigated by many researchers and practitioners for many years (Aggarwal and Yu 1998; Agrawal et al. 1993; Silverstein et al. 1998; Piatetsky-Shapiro 1991; Wu et al. 2002). One widely used approach is the support-confidence framework (Agrawal et al. 1993), where an association rule is an implication between two sets of items, and the interestingness of a rule is measured by two factors: its support and its confidence. For example, the rule {laptop} → {printer}, with confidence 0.50 and support 0.01, implies that 50% of the transactions that contain laptop also contain printer, and that in total 1% of the transactions contain both laptop and printer. A rule is extracted if and only if both its confidence and its support are greater than two thresholds, the minimum confidence and the minimum support, respectively. Therefore, users need to specify these two thresholds appropriately before they start their mining job.

This research is partially supported by a large grant from the Australian Research Council (DP0343109) and partially supported by a large grant from the Guangxi Natural Science Funds. Address correspondence to Xiaowei Yan, Faculty of Information Technology, University of Technology, Sydney, P.O. Box 123, Broadway NSW 2007, Australia. E-mail: [email protected]

There are other interestingness measures for association patterns. Piatetsky-Shapiro (1991) defined the rule interest as RI = P(A, B) − P(A)P(B) for a given rule A → B. Silverstein et al. (1998) used the chi-square (χ²) test to find correlated association patterns. Aggarwal and Yu (1998) proposed a concept called collective strength for the interestingness of an itemset. Wu et al. (2002) proposed an alternative model using the probability ratio. Although these measures are effective for mining interesting association rules, it is difficult for users to apply them if they know little about the database to be mined. For example, if a user specifies an inappropriately strict threshold, he might get nothing; on the other hand, a loose threshold may lead to the generation of many uninteresting association rules and to poor mining performance. These models are consequently referred to as database-dependent. One way to obtain appropriate thresholds is repeated database reanalysis and discussion between the miner and the users. However, such human interference sacrifices system automation.

Furthermore, many databases are simply large, so association rule mining must confront exponential search spaces. Both of these fundamental problems stand in the way of the widespread application of earlier data mining techniques.

In this paper, we propose a new approach to mining acceptable association rules using a genetic algorithm. A genetic algorithm is an efficient tool for global search, especially when the search space is too large for deterministic search methods. It imitates the mechanics of natural species evolution with genetic principles such as natural selection, crossover, and mutation. In particular, our approach does not require users to specify thresholds. Instead of generating an unknown number of interesting rules, as traditional mining models do, it returns only the most interesting rules according to the interestingness measure defined by the fitness function. Therefore, we refer to this method as database-independent, in contrast with the models mentioned previously.
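Piatetsky-Shapiro's rule interest RI, mentioned above, is simple to compute from three probabilities; a minimal Python sketch (the numeric values below are invented for illustration, not taken from the paper):

```python
def rule_interest(p_ab: float, p_a: float, p_b: float) -> float:
    """Piatetsky-Shapiro's rule interest: RI = P(A,B) - P(A)P(B).

    RI > 0 means A and B co-occur more often than independence would
    predict; RI = 0 means independence; RI < 0, negative correlation.
    """
    return p_ab - p_a * p_b

# Invented example: P(A,B) = 0.2, P(A) = 0.25, P(B) = 0.6
print(rule_interest(0.2, 0.25, 0.6))  # ≈ 0.05, a positive correlation
```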

RELATED WORK

This section recalls some concepts required for association rule mining, briefly states the challenges, and outlines the research into genetic algorithm-based mining techniques.

I = {i1, i2, ..., im} is a set of literals, or items. For example, goods such as milk, sugar, and bread for purchase in a store are items; and Ai = r is an item, where r is a domain value of attribute Ai in a relation R(A1, ..., An).

X is an itemset if it is a subset of I. For example, a set of items for purchase from a store is an itemset; and a set of items of the form Ai = ti is an itemset for the relation R(PID, A1, A2, ..., An), where PID is a key.

D = {t1, t2, ..., tn} is a set of transactions, called the transaction database, where each transaction t has a tid and a t-itemset: t = (tid, t-itemset). For example, the shopping cart of a customer going through checkout is a transaction; and a tuple (t1, ..., tn) of the relation R(A1, ..., An) is a transaction.

A transaction t contains an itemset X if and only if (iff), for all items i ∈ X, i is in t-itemset. For example, a shopping cart contains all items in X when going through checkout; for each Ai = ti in X, ti occurs at position i in the tuple (t1, ..., tn).

There is a natural lattice structure on the itemsets 2^I, namely the subset/superset structure. Certain nodes in this lattice are natural grouping categories of interest (some with names); for example, items from a particular department, such as clothing, hardware, furniture, etc., and within clothing, children's, women's, and men's, and so on.

An itemset X in a transaction database D has a support, denoted supp(X) (we also use p(X) to stand for supp(X)), which is the ratio of transactions in D that contain X:

  supp(X) = |X(t)| / |D|, where X(t) = {t in D | t contains X}.

An itemset X in a transaction database D is called a large (frequent) itemset if its support is equal to, or greater than, a threshold of minimal support (minsupp), which is given by users or experts.

An association rule is an implication X → Y, where itemsets X and Y do not intersect. Each association rule has two quality measurements, support and confidence, defined as follows:

- The support of a rule X → Y is supp(X ∪ Y), where X ∪ Y means that both X and Y occur at the same time.
- The confidence of a rule X → Y is conf(X → Y) = |(X ∪ Y)(t)| / |X(t)| = supp(X ∪ Y) / supp(X).
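The support and confidence definitions above translate directly into code; a minimal Python sketch over an invented toy transaction database (items and counts are illustrative only):

```python
def supp(itemset, db):
    """Support: fraction of transactions in db that contain itemset."""
    s = frozenset(itemset)
    return sum(1 for t in db if s <= t) / len(db)

def conf(x, y, db):
    """Confidence of rule X -> Y: supp(X ∪ Y) / supp(X)."""
    return supp(set(x) | set(y), db) / supp(x, db)

# Invented toy database: each transaction is a set of purchased items
db = [frozenset(t) for t in (
    {"laptop", "printer"},
    {"laptop"},
    {"printer", "paper"},
    {"laptop", "printer"},
)]
print(supp({"laptop", "printer"}, db))    # 2/4 = 0.5
print(conf({"laptop"}, {"printer"}, db))  # 0.5 / 0.75 ≈ 0.667
```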
That is, support measures the frequency of occurring patterns, and confidence measures the strength of the implication.

Support-confidence framework (Agrawal et al. 1993): Let I be the set of items in database D, and let X, Y ⊆ I be itemsets with X ∩ Y = ∅, p(X) ≠ 0, and p(Y) ≠ 0. Minimal support (minsupp) and minimal confidence (minconf) are given by users or experts. Then X → Y is a valid rule if

(1) supp(X ∪ Y) ≥ minsupp,
(2) conf(X → Y) = supp(X ∪ Y) / supp(X) ≥ minconf,

where conf(X → Y) stands for the confidence of the rule X → Y.

Mining association rules can be broken down into the following two subproblems:

(1) Generating all itemsets that have support greater than, or equal to, the user-specified minimum support; that is, generating all large itemsets.
(2) Generating all the rules that have minimum confidence, in the following naive way: for every large itemset X and any B ⊂ X, let A = X − B. If the confidence of the rule A → B is greater than, or equal to, the minimum confidence (i.e., supp(X)/supp(A) ≥ minconf), then it is extracted as a valid rule.

To demonstrate the use of the support-confidence framework, we illustrate the process of mining association rules with an example.

Example 1 (adapted from Brin et al. [1997]). Suppose we have a market basket database from a grocery store, consisting of n baskets. Let us focus on the purchases of tea (denoted by t) and coffee (denoted by c). When supp(t) = 0.25 and supp(t ∪ c) = 0.2, we can apply the support-confidence framework to a potential association rule t → c. The support of this rule is 0.2, which is fairly high. The confidence is the conditional probability that a customer who buys tea also buys coffee, i.e., conf(t → c) = supp(t ∪ c)/supp(t) = 0.2/0.25 = 0.8, which is very high. In this case, we would conclude that the rule t → c is a valid one.

Problem Statement

As we have seen, in the support-confidence framework (Agrawal et al. 1993), the first step in association rule discovery is usually the identification of frequent itemsets: sets of items J such that supp(J) ≥ minsupp, where the support of an itemset J is the ratio of transactions in the database containing J. One of the main challenges in association rule mining is to identify frequent itemsets in very large transaction databases that comprise millions of transactions and items.
Consequently, some recent research has focused on designing efficient algorithms for this purpose (Toivonen 1996; Webb 2000). The main limitation of these approaches, however, is that multiple
passes over the database are required. For very large databases that are typically disk resident, this requires reading the database completely for each pass, resulting in a large number of disk I/Os; the larger the database, the greater the number of disk I/Os. Therefore, more efficient mining models are being explored. Accordingly, many variants of the Apriori algorithm have been reported, such as the hash-based algorithm (Park et al. 1997), sampling (Toivonen 1996), the OPUS_AR algorithm (Webb 2000), and the instance-selection-based algorithms (Zhang et al. 2003).

Another main challenge is that the performance of these frequent itemset mining algorithms depends heavily on the user-specified threshold of minimum support. Very often a minimum-support value is too big and nothing is found in a database, whereas a slightly smaller minimum support leads to low performance. This generates a crucial requirement: users have to give a suitable minimum support for a mining task. However, in real-world applications, mining different databases requires different values of minsupp. This can be illustrated by the following cases.

Case I. The Wisconsin Breast Cancer database (D1) from UCI contains 699 records. Only attributes from column 2 to column 11 are considered. The maximal support of 2-itemsets is 0.6366, and the support of itemsets in D1 is distributed in the interval [0, 0.8283].

Case II. The Tumor Recurrence dataset (D2) from the JASA Data Archive contains 87 records (see http://lib.stat.cmu.edu/jasadata/). The maximal support of 2-itemsets is 0.3218, and the support of itemsets in D2 is distributed in the interval [0, 0.5862].

For Case I, no interesting itemsets are found when minsupp = 0.9, and 4092 interesting itemsets are obtained when minsupp = 0.03. For Case II, no interesting itemsets are found when minsupp = 0.7, and 104 interesting itemsets are generated when minsupp = 0.003.
Therefore, users find it difficult to assign appropriate minsupp values for their databases. This motivates us to design mining algorithms with database-independent minimum support.

Research into Genetic Algorithm-Based Learning

There have been many applications of genetic algorithms in the field of data mining and knowledge discovery. Most of them address the problem of classification. Genetic algorithms for rule mining are usually partitioned into two categories, according to how rules are encoded in the population of chromosomes (Freitas 1999). One encoding method is called the Michigan Approach, where each rule is encoded into an individual. Another is
referred to as the Pittsburgh Approach, in which a set of rules is encoded into a single chromosome. For example, Fidelis et al. (2000) gave a Michigan-type genetic algorithm to discover comprehensible classification rules, with an interesting chromosome encoding and a specific mutation operator; the method is impractical, however, when the number of attributes is large. Weiss and Hirsh (1998) also followed the Michigan method, to predict rare events. Pei et al. (1997), on the other hand, used the Pittsburgh Approach for the discovery of classes and feature patterns. Other applications include GA-Nuggets, a system that infers the values of goal attributes given the values of predicting attributes (Freitas 1999), and SIAO1, which finds first-order logic classification rules by generalizing a seed example (Augier et al. 1995).

Moreover, a recent work worth mentioning is dAR, designed by Au and Chan (2002) for mining association rules, or more exactly for discovering changing patterns in historical data. In dAR, the entire set of rules is encoded in a single chromosome, and each rule is represented by some nonbinary symbolic values. It uses a complicated fitness function and a Pittsburgh encoding.

Although it is known that genetic algorithms are good at searching for undetermined solutions, genetic algorithms are still rarely used to mine association rules. We investigate the application of genetic algorithms to association rule mining in the following sections.

ALGORITHM

This section describes our model, called ARMGA, for association rule mining with a genetic algorithm.

Modeling

For an association rule X → Y, X is its antecedent and Y its consequent. Rule X → Y is a k-rule if X ∪ Y is a k-itemset. The support of the rule is defined as supp(X ∪ Y), and the confidence as supp(X ∪ Y)/supp(X). The traditional task of mining association rules is to find all rules X → Y such that the supports and confidences of the rules are larger than, or equal to, a minimum support, minsupp, and a minimum confidence, minconf, respectively; both thresholds are user-specified. In our ARMGA model, we require that the confidence conf(X → Y) be larger than, or equal to, supp(Y), because we only deal with positive association rules of the form X → Y.
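The positivity constraint above, conf(X → Y) ≥ supp(Y), can be tested from three support values alone; a minimal sketch (the helper name and the value assumed for supp(c) are ours, not from the paper):

```python
def is_positive_rule(supp_xy: float, supp_x: float, supp_y: float) -> bool:
    """True iff conf(X -> Y) = supp(X ∪ Y)/supp(X) >= supp(Y),
    i.e. observing X does not lower the probability of Y."""
    if supp_x == 0:
        return False  # confidence is undefined for a trivial rule
    return supp_xy / supp_x >= supp_y

# Tea/coffee figures from Example 1: supp(t) = 0.25, supp(t ∪ c) = 0.2.
# supp(c) = 0.6 is an assumed value; the example does not state it.
print(is_positive_rule(0.2, 0.25, 0.6))  # conf = 0.8 >= 0.6 -> True
print(is_positive_rule(0.2, 0.25, 0.9))  # conf = 0.8 <  0.9 -> False
```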

Recall the definition of the rule interest, RI = P(A, B) − P(A)P(B), for A → B (Piatetsky-Shapiro 1991). Accordingly, we define the positive confidence of the rule as

  pconf(X → Y) = (supp(X ∪ Y) − supp(X)supp(Y)) / (supp(X)(1 − supp(Y)))

When 1 − supp(Y) = 0 or supp(X) = 0, this definition is meaningless. It can easily be proved that supp(X ∪ Y) = supp(X) for any itemset X if 1 − supp(Y) = 0, so we define pconf(X → Y) = 1 when 1 − supp(Y) = 0. If supp(X) = 0, then supp(X ∪ Y) = 0 for any itemset Y, and we define pconf(X → Y) = 0 when supp(X) = 0. In both cases, rule X → Y is called a trivial rule.

We now restate the mining task as follows: given a rule length k, we search for high-quality association k-rules, with their pconf values acceptably maximized, by using a genetic algorithm.

Encoding

Our ARMGA model follows the Michigan strategy, encoding each association rule in a single chromosome. First, we number all items in I = {i1, i2, ..., in} by their indexes; in other words, we can assume that the universal itemset is I = {1, 2, ..., n}. Given an association k-rule X → Y, where X, Y ⊆ I and X ∩ Y = ∅, we encode it into an individual as shown in Figure 1. In Figure 1, j is an indicator that separates the antecedent from the consequent of the rule: X = {A1, ..., Aj} and Y = {Aj+1, ..., Ak}, with 0 < j < k. Therefore, a k-rule X → Y is represented by k + 1 positive integers.

Operators

We design three genetic operators: select, crossover, and mutate. Function select(c, ps) acts as a filter of chromosomes, with consideration of their fitness and the probability ps. It returns TRUE if chromosome c is successfully selected with probability ps, and FALSE otherwise.

FIGURE 1 A chromosome for a k-rule.

boolean select(c, ps)
begin
  if (frand() × fit(c) < ps) then
    return TRUE;
  else
    return FALSE;
end

In this function, frand() returns a random real number in the range 0 to 1, fit() is the fitness function of the ARMGA model, and ps is a prespecified input parameter between 0 and 1.

Function crossover(pop, pc) uses a two-point strategy to reproduce offspring chromosomes at a probability of pc from population pop, and returns a new population. The two crossover points are randomly generated, so that any segment of a chromosome may be chosen, as illustrated in Figure 2.

population crossover(pop, pc)
begin
  pop_temp ← ∅;
  for ∀ c1 = (A10, ..., A1k) ∈ pop do begin
    for ∀ c2 = (A20, ..., A2k) ∈ pop ∧ c2 ≠ c1 do begin
      if (frand() < pc) then begin
        i ← irand(k + 1);
        j ← irand(k + 1);
        (i, j) ← (min(i, j), max(i, j));

FIGURE 2 An example of two-point crossover.

        c3 ← (A10, ..., A1,i−1, A2i, ..., A2j, A1,j+1, ..., A1k);
        c4 ← (A20, ..., A2,i−1, A1i, ..., A1j, A2,j+1, ..., A2k);
        pop_temp ← pop_temp ∪ {c3, c4};
      end
    end
  end
  return pop_temp;
end

Here, function irand(k) returns a random integer in the range 0 to k. Function mutate(c, pm) occasionally changes genes of chromosome c at a probability of pm, while also considering the fitness of c as an additional weight.

chromosome mutate(c, pm)
begin
  if (frand() × fit(c) < pm) then begin
    c.A0 ← irand(k − 2) + 1;
    i ← irand(k − 1) + 1;
    c.Ai ← irand(n − 1);
  end
  return c;
end

Fitness Function

As mentioned previously, our goal is to search for the most interesting association rules. Hence, the fitness function is crucial for determining the interestingness of a chromosome, and it directly affects the convergence of the genetic algorithm. In the ARMGA model, we define the fitness function as

  fitness(c) = (supp(A1...Ak) − supp(A1...Aj) supp(Aj+1...Ak)) / (supp(A1...Aj)(1 − supp(Aj+1...Ak)))

for a given chromosome c, as shown in Figure 1. It is, in fact, the positive confidence of the corresponding association rule.

Initialization

Given a seed chromosome s, we use the mutate(s, pm) function, with pm = 1, to produce an initial population pop[0]. This initialization
is shown in the following function.

population initialize(s)
begin
  pop[0] ← {s};
  while sizeof(pop[0]) < popsize/2 do begin
    pop_temp ← ∅;
    for ∀ c ∈ pop[0] do begin
      pop_temp ← pop_temp ∪ {mutate(c, 1)};
    end
    pop[0] ← pop[0] ∪ pop_temp;
  end
  return pop[0];
end

This function accepts a seed chromosome as its parameter and returns a population as the initial set of chromosomes. Function sizeof(pop[0]) returns the number of chromosomes in pop[0], and popsize is a user-given constant expressing the maximum number of chromosomes in a population.

ARMGA Algorithm

Suppose that the current population is pop[i]. We first apply the select operator to pop[i] to produce a new population pop[i+1]. Then any pair of chromosomes in pop[i+1] is crossed over at a probability of pc to reproduce two offspring, and each new chromosome mutates at a probability of pm. ARMGA finally outputs a population of high-quality chromosomes.

population ARMGA(s, ps, pc, pm)
begin
  i ← 0;
  pop[i] ← initialize(s);
  while not terminate(pop[i]) do begin
    pop[i+1] ← ∅;
    pop_temp ← ∅;
    for ∀ c ∈ pop[i] do
      if select(c, ps) then pop[i+1] ← pop[i+1] ∪ {c};
    pop_temp ← crossover(pop[i+1], pc);
    for ∀ c ∈ pop_temp do
      pop[i+1] ← (pop[i+1] − {c}) ∪ {mutate(c, pm)};
    i ← i + 1;
  end
  return pop[i];
end

ARMGA stops, that is, the terminate() function returns a nonzero value, if and only if one of the following cases occurs:

1. The difference between the fitness of the best and the worst chromosome is less than a given value α, which is small enough.
2. The number of iterations, i, is larger than a given maximum number, maxloop.
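Putting the pieces together, the pseudocode above can be rendered as one runnable sketch. This is our Python interpretation under stated assumptions: the transaction database, item count, rule length, population size, and population cap are invented; pconf is clipped to [0, 1] so it can serve as a selection weight; and rules whose antecedent and consequent overlap are given fitness 0, since the model requires X ∩ Y = ∅:

```python
import random

# --- invented toy setup (not from the paper) ---
DB = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}, {3, 4}, {1, 2, 4}]
N_ITEMS = 4        # universal itemset I = {1, ..., n}
K = 2              # mine k-rules; a chromosome is [j, A1, ..., Ak]
POPSIZE = 20

def supp(items):
    s = set(items)
    return sum(1 for t in DB if s <= t) / len(DB)

def fitness(c):
    """pconf of the rule encoded by c = [j, A1..Ak], clipped to [0, 1]."""
    j, genes = c[0], c[1:]
    x, y = genes[:j], genes[j:]
    if set(x) & set(y):
        return 0.0                    # penalize X ∩ Y ≠ ∅ (our addition)
    sx, sy, sxy = supp(x), supp(y), supp(genes)
    if sx == 0:
        return 0.0                    # trivial rule
    if sy == 1:
        return 1.0                    # trivial rule
    pconf = (sxy - sx * sy) / (sx * (1 - sy))
    return max(0.0, min(1.0, pconf))

def select(c, ps):
    # selection test as written in the paper's pseudocode
    return random.random() * fitness(c) < ps

def crossover(pop, pc):
    """Two-point crossover over all ordered pairs, probability pc each."""
    out = []
    for a, c1 in enumerate(pop):
        for b, c2 in enumerate(pop):
            if a == b or random.random() >= pc:
                continue
            i, j = sorted(random.randrange(K + 1) for _ in range(2))
            out.append(c1[:i] + c2[i:j + 1] + c1[j + 1:])
            out.append(c2[:i] + c1[i:j + 1] + c2[j + 1:])
    return out

def mutate(c, pm):
    c = c[:]
    if random.random() * fitness(c) < pm:
        c[0] = random.randrange(1, K)            # separator j, 0 < j < k
        i = random.randrange(1, K + 1)           # gene position to change
        c[i] = random.randrange(1, N_ITEMS + 1)  # new item from I
    return c

def initialize(seed):
    pop = [seed]
    while len(pop) < POPSIZE // 2:
        pop = pop + [mutate(c, 1.0) for c in pop]
    return pop

def armga(seed, ps, pc, pm, maxloop=50, alpha=0.01):
    pop = initialize(seed)
    for _ in range(maxloop):               # termination condition 2
        survivors = [c for c in pop if select(c, ps)]
        if len(survivors) < 2:
            break
        children = [mutate(c, pm) for c in crossover(survivors, pc)]
        pop = sorted(survivors + children, key=fitness, reverse=True)
        pop = pop[:POPSIZE]                # cap population (our addition)
        fits = [fitness(c) for c in pop]
        if max(fits) - min(fits) < alpha:  # termination condition 1
            break
    return pop

seed = [1, 1, 2]                           # encodes the rule {1} -> {2}
best = max(armga(seed, ps=0.95, pc=0.8, pm=0.1), key=fitness)
j = best[0]
print("best rule:", set(best[1:j + 1]), "->", set(best[j + 1:]),
      "pconf =", round(fitness(best), 3))
```

With K = 2 the search space is tiny, so the sketch only illustrates the control flow; on the paper's Mushroom data the same loop would run over a supp() backed by the full database.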

COMPUTATION

We use the Mushroom database from UCI (http://www.ics.uci.edu/~mlearn) to show the effectiveness of the ARMGA algorithm. The dataset contains 8124 records and 23 attributes. Figure 3 gives the parameters used in the computation. We run the program with maxloop ranging from 10 to 100 in steps of 10, from 100 to 1000 in steps of 100, and from 1000 to 9000 in steps of 1000, respectively. For example, some results when maxloop is 10 are listed as follows.