IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 1, JANUARY 2002

Maximum Independence and Mutual Information

Rosa Meo

Abstract: If I1, I2, ..., Ik are random boolean variables and the joint probabilities up to the (k-1)-st order are known, the values of the k-th-order probabilities maximizing the overall entropy have been defined as the maximum independence estimate. In this paper some contributions deriving from the definition of the maximum independence probabilities are proposed. First, it is shown that the maximum independence values are reached when the product of the probabilities of the minterms i1 i2 ... ik containing an even number of complemented variables is equal to the product of the probabilities of the other minterms. Second, a new definition of group mutual information, as the difference between the maximum independence entropy and the real entropy, is proposed and discussed. Finally, the new concept of mutual information is applied to the determination of dependencies in data mining problems.

Keywords: Association rules, data mining, dependence rules, entropy, maximum independence, mutual information.

Rosa Meo is with Universita di Torino, Dipartimento di Informatica, Corso Svizzera 185 - 10149 Torino, Italy. E-mail: [email protected].

I. Introduction

In a preceding work the author of this paper has introduced the concept of maximum independence of a k-plet and has applied it to an important problem of data mining. In particular, she has shown that, if I = {I1, I2, ..., Ik} is a set of k boolean variables, all the k-th-order joint probabilities can be calculated as functions of the joint probabilities of the orders less than k and a single k-th-order joint probability, and that the dependence of order k can be expressed in terms of a special single value called the maximum independence estimate [18].

This paper contains some new contributions deriving from the definition of the maximum independence estimate. First, some properties and a basic theorem that can be applied to quickly calculate the maximum independence values will be proved. Second, a new definition of mutual information, different from the well-known one, will be proposed and discussed. Third, the new definition and its properties will be applied to the solution of one of the most important problems of data mining, the identification of the most relevant data dependencies in a database. The new concepts and the algorithms that employ them to identify the data dependencies make it possible to override the conceptual and practical limitations of the known algorithms [11], [19]-[27].

This paper is organised as follows. Section II summarizes the contributions of the previous work of the author. Section III presents the general properties of maximum independence estimates. Section IV contains the new definition of mutual information and the discussion of its meaning. Section V proposes a definition of data dependence, which is based on the new definition of mutual information, and applies it in a new algorithm for the solution of the central problem of data mining. Finally, the experimental results relative to the new algorithm are briefly presented and discussed.

II. Maximum Independence Estimate

A. Notations

Let I = {I1, I2, ..., Ik} be a set of k random boolean variables. In the data mining problems which will be discussed in Section V, the set I will be associated to a set of binary events {i1, i2, ..., ik}, where each ij stands for either ij or its complement ¬ij, in the sense that when ij occurs, Ij = TRUE, and when ¬ij occurs, Ij = FALSE. In data mining problems, a basket of items is any subset of the set {i1, i2, ..., ik}. For example, a basket might be the set of the products purchased by a given customer of a grocery store, or the set of the words of the vocabulary contained in a given document.

In formal terms, basket data can be viewed in terms of


boolean indicator variables as follows. A set of baskets {b1, b2, ..., bn} is a collection of n k-tuples from {TRUE, FALSE}^k which represent a collection of value assignments to the k variables of I. Assigning the value TRUE to an attribute variable Ij in a basket represents the presence of item ij in the basket. The event a denotes A = TRUE or, equivalently, the presence of the corresponding item a in a basket. The complementary event ¬a denotes A = FALSE, that is, the absence of item a from a basket. The probability that item a appears in a random basket will be denoted by P(a) = P(A = TRUE). Likewise, P(a, ¬b) = P(A = TRUE ∧ B = FALSE) will be the probability that item a is present and item b is absent.

B. Definition of maximum independence estimates

Consider the case of a k-plet of boolean variables I1, I2, ..., Ik, and assume we know all the joint probabilities up to the order (k-1):

P(i1), P(i2), ..., P(ik-1), P(ik)
P(i1, i2), P(i1, i3), ..., P(ik-1, ik)
...
P(i1, i2, ..., ik-1), ..., P(i2, i3, ..., ik-1, ik).

We want to determine the k-th-order joint probabilities like P(i1, i2, ..., ik-1, ik), P(i1, i2, ..., ik-1, ¬ik), and so on. It is easy to show that the knowledge of a single k-th-order joint probability is sufficient to determine all the k-th-order probabilities. On the basis of such an observation, we can state the following definition, which will be used extensively in the following.

Definition 1 (Maximum independence estimate): If the joint probabilities up to order k-1 are known but no information is available on the joint probabilities of order k, then the estimate of P(i1, i2, ..., ik) maximizing the joint entropy of I1, I2, ..., Ik,

H = - Σ P(i1, i2, ..., ik) · log P(i1, i2, ..., ik),

will be considered as the maximum independence estimate. For any {i1, i2, ..., ik} the maximum independence estimate will be indicated with the symbol P(i1, i2, ..., ik)MI.

Such a definition assumes the uniqueness of the set of values P(i1, i2, ..., ik) for which the entropy H is maximum. This result will be proved in Subsection III-B (Theorem on the uniqueness of the maximum independence value).

Notice that if the joint probabilities up to the order k-1 and the maximum independence estimate of order k are known, a single number Δ, defined as the difference P(i1, i2, ..., ik) - P(i1, i2, ..., ik)MI, is sufficient to describe all the k-th-order joint probabilities. Indeed,

P(i1, i2, ..., ik) = Δ + P(i1, i2, ..., ik)MI

and, according to the uniqueness of the value mentioned at the beginning of this Subsection, all the k-th-order joint probabilities can be calculated as functions of the joint probabilities of the orders less than k and a single k-th-order joint probability.

The definition of a maximum independence value is rather important for our model. A known definition states that I1, I2, ..., Ik are independent if, and only if, P(I1, I2, ..., Ik) = P(I1)·P(I2)·...·P(Ik) for all the combinations of values of I1, I2, ..., Ik. However, I1, I2, ..., Ik might own only the dependence inherited from the (k-1)-order dependencies, or their dependence might be stronger. In the former case, P(i1, i2, ..., ik) is equal to P(i1, i2, ..., ik)MI and there is no real k-th-order dependence. In the latter case, there is evidence of a k-th-order dependence whose value and sign depend on the difference between P(i1, i2, ..., ik) and P(i1, i2, ..., ik)MI.

Notice that in the case of pairs of variables {A, B} the definition of maximum independence proposed above coincides with the better-known definition of independence cited above. Indeed, in this case, the joint entropy of A and B is


H = - Σ P(a, b) · log P(a, b)
  = - X · log X - [P(a) - X] · log[P(a) - X]
    - [P(b) - X] · log[P(b) - X]
    - [1 - P(a) - P(b) + X] · log[1 - P(a) - P(b) + X]

where X = P(a, b). It is easy to prove that the function H has a maximum for X = P(a, b)MI = P(a)·P(b). Notice also that the maximum value of the entropy function introduced above is

HMAX = H(I1, I2, ..., Ik-1) + H(Ik | I1, I2, ..., Ik-1)MAX.

I1, I2, ..., Ik are independent when the ignorance on the value of a variable Ik, the other variables being known, is maximum.

III. Basic Properties of Maximum Independence Estimates

A. Variability Field of a Vertex Probability

A vertex of the k-dimensional hypercube on which it is possible to represent the universe of the events ⟨i1, i2, ..., ik⟩ can be defined as even or odd according to whether it is labelled with an even or odd number of complemented literals. It is easy to prove that if X = P(i1, i2, ..., ik), the probabilities associated to the even vertices e1, e2, ..., ez (with z = 2^k/2 = 2^(k-1)) can be written as:

P(e1) = X
P(e2) = h2 + X
...
P(ez) = hz + X

while the probabilities of the odd vertices o1, o2, ..., oz are:

P(o1) = k1 - X
P(o2) = k2 - X
...
P(oz) = kz - X

The constants hp and kq depend only on the values of the (k-1)-th-order probabilities and must satisfy some nearly obvious requirements. First, k1, k2, ..., kz must be larger than 0: indeed, if for example k1 were less than 0, P(o1) would be less than 0 for any value of X = P(i1, i2, ..., ik). Second, the maximum value of X is equal to the minimum value among k1, k2, ..., kz: indeed, if X were larger than kj, P(oj) would be negative. Finally, X must be larger than the maximum among 0, -h2, ..., -hz, because all the P(ej) must be larger than 0.

In data mining problems, P(i1, i2, ..., ik) is generally less than any other vertex probability. In this case, h2, h3, ..., hz are all positive and, therefore, the minimum value of X is 0. In the following, reference will be made to this simple case. This choice does not imply any loss of generality: if this were not the case, another even node would take 0 as its minimum value and, by introducing a new set of boolean variables {j1, j2, ..., jk}, where some of the j's are the complements of the corresponding i's, we could transform the given problem into a new one where the minimum value of P(j1, j2, ..., jk) is zero.

B. The Theorem of the Products of the Even and Odd Node Probabilities

Consider the relationship describing the entropy according to the definition of maximum independence:

H(X) = - X·log X - (h2 + X)·log(h2 + X) - ... - (hz + X)·log(hz + X)
       - (k1 - X)·log(k1 - X) - (k2 - X)·log(k2 - X) - ... - (kz - X)·log(kz - X)


The derivative of H(X) is:

D(H(X)) = -1 - log X - 1 - log(h2 + X) - ... - 1 - log(hz + X)
          + 1 + log(k1 - X) + 1 + log(k2 - X) + ... + 1 + log(kz - X)
        = log [ ((k1 - X)·(k2 - X)·...·(kz - X)) / (X·(h2 + X)·...·(hz + X)) ]

From this relationship, by remembering that (kj - X) and (hj + X) are the probabilities associated to the j-th odd and even nodes, respectively, and that these are always nonnegative in the valid variability field of X, it is easy to prove the following

Theorem 1 (Equivalence of the products of the even and odd node probabilities): The maximum independence value XMI of X = P(i1, i2, ..., ik) coincides with the value for which P(e1)·P(e2)·...·P(ez) = P(o1)·P(o2)·...·P(oz), that is, the product of the probabilities of the even nodes coincides with the product of the probabilities of the odd nodes.

Notice that for X < XMI the derivative of H(X) is always positive, whereas for X > XMI it is always negative. The following theorem derives.

Theorem 2 (Uniqueness of the maximum independence estimate): The maximum independence value XMI of X = P(i1, i2, ..., ik) is unique.

C. A Simple Algorithm for the Determination of the Maximum Independence Estimates

The Theorem proved in the previous Subsection can be applied to quickly determine the maximum independence estimates by a dichotomic method. Let min and max be the minimum and maximum values of X, respectively, Πe(X) the product of the even vertex probabilities, and Πo(X) the product of the odd vertex probabilities. Let us calculate Πe(X = min + (max - min)/2) and Πo(X = min + (max - min)/2). If Πe(X) > Πo(X), the search is continued in the interval [min, min + (max - min)/2]; otherwise, attention is restricted to the other interval [min + (max - min)/2, max]. The dichotomic search is continued until the interval size is sufficiently small that the desired accuracy is reached.

IV. A New Definition of Mutual Information

The importance and usefulness of the concept of mutual information relative to two variables A and B,

I(A; B) = H(A) - H(A | B),

are well known and recognized. The extension of that definition to the case of more than two variables is interesting from a theoretical point of view as well. A number of important papers [12], [13], [14], [15], [16], [17] have investigated the formal properties and the real significance of those definitions, above all in connection with the set-theoretic structure of Shannon's information measures. However, the fact that the mutual information of three variables,

I(A; B; C) = I(A; B) - I(A; B | C),

can be negative has sometimes been considered as a limitation of those definitions (see, for example, [12] and [18]). In this paper a different definition of mutual information is proposed which seems to be a more direct extension of the well-known I(A; B) = H(A) - H(A | B). Besides, it always produces positive values and can be adopted in the solution of practical problems, as is shown in the following Section V.

Definition 2 (Group mutual information of k boolean variables): The group mutual information I of k boolean variables I1, I2, ..., Ik is the difference

I(I1; I2; ...; Ik) = H(I1, I2, ..., Ik)MI - H(I1, I2, ..., Ik)


Remember that

H(I1, I2, ..., Ik)MI = H(I2, I3, ..., Ik) + H(I1 | I2, I3, ..., Ik)MI
                     = H(I1, ..., Ij-1, Ij+1, ..., Ik) + H(Ij | I1, ..., Ij-1, Ij+1, ..., Ik)MI

for any j. Therefore,

I(I1; I2; ...; Ik) = H(Ij | I1, ..., Ij-1, Ij+1, ..., Ik)MI - H(Ij | I1, ..., Ij-1, Ij+1, ..., Ik)

for any j. Notice also that in the case of two variables A and B,

I(A; B) = H(A | B)MI - H(A | B).

But H(A | B)MI = H(A), since the maximum independence condition is the statistical distribution for which A and B are independent of each other. So, the new definition of I(A; B) coincides with the well-known

I(A; B) = H(A) - H(A | B).

The meaning of the new definition follows from the following considerations. As, in the case of two variables, mutual information is the difference between the number of bits needed to know A in the worst case and the number of bits needed to know A when B is known, so, for k variables, mutual information is the difference between the number of bits needed to know any variable Ij in the worst case (that is, when Ij is at the maximum level of independence from the other variables) and the number of bits needed to know Ij when the other variables are known. Since the information quantity necessary to know Ij when the other variables are known is less than H(Ij | I1, ..., Ij-1, Ij+1, ..., Ik)MI, I(I1; I2; ...; Ik) is the information quantity relative to Ij carried by the other variables, and is a measure of the dependence level of any variable Ij on the other ones. For this reason, it can be used in data mining as described in the following Section V.

Notice that the proposed definition, as well as the other results described in this paper, holds also in the case of non-boolean variables. Such a generalization is based on the following extension of Theorem 2 on the uniqueness of the maximum independence estimate to the case of non-binary variables. Its proof has been developed by one of the Reviewers of this paper.

Theorem 3 (Uniqueness of the maximum independence estimate in the case of non-binary variables): The maximum independence value P(I1, I2, ..., Ik), where I1, I2, ..., Ik are not binary random variables, is unique.

In order to prove that the maximum P is unique, suppose that there are two distributions P1 and P2 that achieve the maximum. Let λ be any real number satisfying 0 < λ < 1, and consider the weighted distribution

λP1 + (1 - λ)P2.

It is easy to see that λP1 + (1 - λ)P2 has the same (k-1)-th-order marginal distributions which have been assigned. Therefore

H(λP1 + (1 - λ)P2) ≤ H(P1) = H(P2)    (1)

On the other hand, the concavity of Shannon entropy implies that

H(λP1 + (1 - λ)P2) ≥ λH(P1) + (1 - λ)H(P2).

This, together with (1), implies

H(λP1 + (1 - λ)P2) = λH(P1) + (1 - λ)H(P2)    (2)
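For two boolean variables, the coincidence of Definition 2 with the classical I(A; B) = H(A) - H(A | B) can be checked numerically; the joint probabilities below are hypothetical, chosen only for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution of A, B over the cells
# (a, b), (a, ~b), (~a, b), (~a, ~b)
p_ab, p_anb, p_nab, p_nanb = 0.2, 0.1, 0.2, 0.5
p_a = p_ab + p_anb
p_b = p_ab + p_nab

h_real = entropy([p_ab, p_anb, p_nab, p_nanb])        # H(A, B)

# H(A, B)_MI: for two variables the maximum independence distribution
# is the product distribution, in which A and B are independent
h_mi = entropy([p_a * p_b, p_a * (1 - p_b),
                (1 - p_a) * p_b, (1 - p_a) * (1 - p_b)])

group_mi = h_mi - h_real                              # Definition 2
# Classical mutual information I(A;B) = H(A) + H(B) - H(A,B)
classic_mi = entropy([p_a, 1 - p_a]) + entropy([p_b, 1 - p_b]) - h_real
```

The two quantities agree (up to floating-point rounding), and the group mutual information comes out positive, as claimed.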


By the strict concavity of Shannon entropy again, (2) is possible if and only if P1 = P2. Therefore, P is indeed unique.

V. Mutual Information and Data Mining

A. A new definition of dependence

The search for association rules in data mining has the aim of identifying the phenomena that are recurrent in a data set. The solution of this problem finds application in many fields, such as the analysis of basket data of supermarkets, failures in telecommunication networks, medical test results, lexical features of texts, and so on. The extraction of association rules from very large databases has been solved by researchers in many different ways and the proposed solutions are embedded in as many powerful algorithms [11], [19], [20], [21], [22], [23], [24], [25], [26], [27]. All those solutions are based on the following concepts.

An association rule X => Y is a pair of two sets of items (called itemsets), X and Y, which are often found together in a given collection of data. For instance, the association rule X = {milk, coffee} => Y = {bread, sugar} extracted from the market basket domain has the intuitive meaning that a customer purchasing milk and coffee together is likely to also purchase bread and sugar.

The validity of an association rule has been based on two measures: the support, the percentage of transactions of the database containing both X and Y; and the confidence, the percentage of the transactions in which X and Y occur relative to those transactions in which X occurs. For instance, with reference to the above example, a value of 2% of support and a value of 15% of confidence would mean that in 2% of all the transactions customers buy milk, coffee, bread and sugar together, and that 15% of the transactions in which customers have bought milk and coffee together also contain bread and sugar.

Recently, in [28], Silverstein, Brin and Motwani have presented a critique of the concept of association rule and the related support-confidence framework. They have observed that the association rule model is well suited to the market basket problem, but that it does not address other data mining problems. Among these there are the identification of the dependencies between the presence of an itemset and the absence of another one, and the identification of the dependencies between itemsets when the sets involved in the dependence are very frequent.

Consider the purchase of tea (t) and coffee (c) in a grocery store and assume the following probabilities:

P(c, t) = 0.2
P(c, ¬t) = 0.7
P(¬c, t) = 0.05
P(¬c, ¬t) = 0.05

where ¬c and ¬t denote the events "coffee not purchased" and "tea not purchased", respectively. According to the preceding definitions, the potential rule tea => coffee has a support equal to 20% and a confidence equal to 80%, and therefore can be considered as a valid association rule. However, a deeper analysis shows that a customer buying tea is less likely to also buy coffee than a customer not buying tea (80% against more than 90%). We would write tea => ¬coffee but, on the contrary, the strongest positive dependence is between the absence of coffee and the presence of tea.

In [18], the author of this paper has described a technique based on the concept of maximum independence for determining dependencies. In this Section a new algorithm is proposed to solve the same problem, based on the definition of group mutual information formulated above. It overrides the conceptual limitations of the usual model of data mining based on the concepts of support and confidence. Indeed, it is able to determine the "real" dependencies among data, according to the information quantity carried by each item of the database. Moreover, as will be seen in Subsection V-C on experiments, it can be executed quickly, that is, in times that are suitable to the management and knowledge extraction from large databases.
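The tea/coffee figures above can be replayed directly; the dictionary keys are only labels for the four joint events:

```python
# Joint probabilities of coffee (c) and tea (t) from the example above.
p = {("c", "t"): 0.20, ("c", "~t"): 0.70,
     ("~c", "t"): 0.05, ("~c", "~t"): 0.05}

p_t = p[("c", "t")] + p[("~c", "t")]              # P(t) = 0.25
support = p[("c", "t")]                           # support of tea => coffee
confidence = p[("c", "t")] / p_t                  # P(c | t) = 0.80
p_c_given_not_t = p[("c", "~t")] / (1 - p_t)      # P(c | ~t), above 0.93

# The rule passes the support-confidence test (20% support, 80%
# confidence), and yet buying tea lowers the chance of buying coffee:
rule_misleading = confidence < p_c_given_not_t
```

This is exactly the failure mode of the support-confidence framework that the text describes: a "valid" rule whose antecedent in fact depresses the consequent.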


Definition 3 (Dependence): If

I(I1; I2; ...; Ik) / H(I1, I2, ..., Ik) > ε

where I(I1; I2; ...; Ik) is the group mutual information, H(I1, I2, ..., Ik) is the total entropy of the same variables, and ε is a given threshold, then I1, I2, ..., Ik are defined as connected by a dependence of order k. In this case, if P(i1, i2, ..., ik-1, ik) > P(i1, i2, ..., ik-1, ik)MI the dependence is defined as positive; otherwise, it will be defined as negative. The notations

Dk(I1, I2, ..., Ik) > 0
Dk(I1, I2, ..., Ik) < 0
Dk(I1, I2, ..., Ik) = 0

will be used to indicate the existence or not of a dependence of order k and its sign.

B. An algorithm for the determination of the data dependencies in a database

Almost all the algorithms so far proposed for the identification of data dependencies in data mining are based on a first step aimed at determining the k-plets of items (itemsets) having a sufficient support, namely, a sufficient statistical relevance. Such solutions are compatible with the following algorithm for determining all the relevant dependencies up to a certain order.

1. Determination of the itemsets having a sufficient support.
Most algorithms for determining the itemsets having a sufficient support proceed in the order of increasing cardinalities. In other terms, they first determine the single items, then the pairs of items, the triplets, and so on. Such algorithms are well suited to the following procedure. Other algorithms, such as [26], [25], should be modified in order to examine a k-plet P after the (k-1)-plets contained in P. In these algorithms the determination of the itemsets having a sufficient support consists in the exploration and pruning of a search space that is the lattice identifying all the itemsets of the problem. The determination of a single itemset ⟨i1, i2, ..., ik⟩ consists essentially in the computation and storage in the main memory of the number of occurrences of that itemset in the database. This number is proportional to the probability that the itemset occurs in a transaction of the database, that is, its support.

The program which has been specifically developed to verify the ideas described in this paper is based on an algorithm for the determination of the itemsets having a sufficient support [29], which has been chosen by virtue of its speed and its good characteristics when applied to large databases. The new algorithm described here, differently from the other ones, requires that two types of information be logically associated to each itemset ⟨i1, i2, ..., ik⟩:
a) the values of the joint probabilities P(i1, i2, ..., ik) related with the itemset ⟨i1, i2, ..., ik⟩;
b) the pointers to the itemset's parents in the lattice (i.e., itemsets with lower cardinality).

[Figure 1: the itemset lattice for items a, b, c. Each node (a, b, c, ab, ac, bc, abc) stores its occurrence count n and, for the itemsets of cardinality greater than one, the pointers ptr to its parent itemsets.]

Fig. 1. The data structure of the itemsets.
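A minimal sketch of the Figure 1 structure (class and field names are ours, not the paper's C++ program): each node keeps its occurrence count n and the pointers to its parents in the lattice, the itemsets of cardinality one less:

```python
from dataclasses import dataclass, field

@dataclass
class ItemsetNode:
    """One node of the itemset lattice of Figure 1."""
    items: frozenset
    n: int                                       # occurrences in the database
    parents: list = field(default_factory=list)  # ItemsetNode references

    def prob(self, n_transactions):
        """Support of the itemset: its count over the number of transactions."""
        return self.n / n_transactions

# Hypothetical counts for a tiny database of 1000 transactions.
a = ItemsetNode(frozenset({"a"}), 300)
b = ItemsetNode(frozenset({"b"}), 400)
ab = ItemsetNode(frozenset({"a", "b"}), 150, [a, b])
```

Keeping only the count and the parent pointers per node is what lets the program hold millions of itemsets in main memory, as the text explains next.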

The former information is needed in order to compute the entropy associated to an itemset; the latter one in order to compute all the joint probabilities P(i1, i2, ..., ik). In fact, for the sake of simplicity, this program is characterized by the choice of describing the itemset ⟨i1, i2, ..., ik⟩ with a single datum, its probability P(i1, i2, ..., ik), and of determining the probabilities of the related k-plets ⟨i1, i2, ..., ik⟩, with some literals complemented, on the basis of P(i1, i2, ..., ik) and of the probabilities of


the ancestors of the itemset ⟨i1, i2, ..., ik⟩ in the lattice, as described in the following. This choice makes it possible to store millions of itemsets in the main memory at the same time and to perform all the following computations without storing any partial results in the mass memory. Figure 1 shows the data structure used by the program, where ptr x denotes the pointer to itemset x.

2. Determination of all the joint probabilities of an itemset.
The computation of the joint probabilities P(i1, i2, ..., ik) for all the combinations of values of i1, i2, ..., ik can be performed by recursively applying the relationships presented in Subsection II-B. Of course, recursion proceeds towards the parents and the grandparents. For example, in the case of Figure 1,

P(a, ¬b, ¬c) = P(a, ¬c) - P(a, b, ¬c)
             = P(a) - P(a, c) - P(a, b) + P(a, b, c)

where only the probabilities directly connected to the numbers of occurrences introduced in Figure 1 appear.

3. Determination of the maximum independence estimates.
The determination of the value X for which the joint entropy

H(X) = - Σ P(i1, i2, ..., ik) · log P(i1, i2, ..., ik)

takes its maximum value is performed by applying the dichotomic algorithm described in Subsection III-C.

4. Computation of the dependencies.
The direct application of Definitions 1, 2 and 3 leads to the complete determination of the dependence for the itemset ⟨i1, i2, ..., ik⟩ on the basis of the probabilities P(i1, i2, ..., ik) of all the combinations of values of i1, i2, ..., ik. Finally, the comparison between P(i1, i2, ..., ik) and P(i1, i2, ..., ik)MI determines the sign of the dependence.
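Step 4 combines Definitions 2 and 3 into a simple test. A sketch, with an illustrative threshold value and with the maximum independence distribution supplied directly (in the program it would come from the dichotomic search of step 3):

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def dependence_sign(joint, joint_mi, threshold):
    """Definition 3 as a predicate.

    joint:    real joint distribution, with joint[0] = P(i1, ..., ik)
    joint_mi: maximum independence distribution, same cell order
    Returns +1 / -1 for a positive / negative k-th-order dependence,
    or 0 when the ratio I/H does not exceed the threshold.
    """
    h = entropy(joint)
    group_mi = entropy(joint_mi) - h          # Definition 2
    if group_mi / h <= threshold:
        return 0                              # Dk(I1, ..., Ik) = 0
    # Sign: compare P(i1, ..., ik) with its maximum independence estimate
    return 1 if joint[0] > joint_mi[0] else -1

# Tea/coffee example of Subsection V-A: cells (c,t), (c,~t), (~c,t),
# (~c,~t); the independent distribution uses P(c) = 0.9, P(t) = 0.25.
# The threshold 0.01 is illustrative, not a value from the paper.
sign = dependence_sign([0.2, 0.7, 0.05, 0.05],
                       [0.225, 0.675, 0.025, 0.075], threshold=0.01)
```

On the tea/coffee data the sign comes out negative, matching the discussion in Subsection V-A: the presence of tea and the presence of coffee are negatively dependent.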

C. Experimental Results

The proposed approach has been verified with an implementation in C++ using the Standard Template Library. The program has been run on a PC Pentium II with a 233 MHz clock, 128 MB RAM, running Red Hat Linux as the operating system.

The algorithm has been applied to a class of databases that has been taken as a benchmark by most data mining algorithms on association rules. It is the class of synthetic databases that project Quest of IBM has generated for its experiments (see [21] for detailed explanations). We have made many experiments on several databases with different values of the main parameters and of the minimum support, but the obtained results are all similar to the ones proposed here. In particular, the experiments have been run with a value of minimum support equal to 0.2% and with a precision in the computation of the dependence values equal to 10^-6.

In the generation of the databases we have adopted the same parameter settings proposed for synthetic databases: D, the number of transactions in the database, has been set at 100 thousand; N, the total number of items, has been set at 1000; T, the average transaction length, has been set at 10, since its value does not influence the program behaviour. On the contrary, I, the average length of the frequent itemsets, has been varied, since its value really determines the depth of the lattice to be generated. Each database contains itemsets with sufficient support having a different average length (3, 4, ..., 8). The extreme values of the interval [2, 10] have been discarded for the reasons that follow. The low value has been discarded because it does not make much sense to maximize the entropy related to itemsets having only two items: the direct approach, which compares the probability of such an itemset with the product of the probabilities of the two items, has been adopted in this case. The high values have been avoided in consideration of the fact that the longer the itemsets are, the higher is the probability that they go down under the threshold of the minimum support. In this case, too few itemsets turn out


to be over the threshold and the comparison of the different experiments is no longer fair. Thus it happens that, even if the "nominal" average length of the itemsets is increased, the actual average length of the itemsets with a sufficient support appears to be significantly lower. Table I reports the total number of itemsets with a sufficient support and their nominal and actual average lengths in the experiments.

In Figure 2 two execution times are shown. T1 is the CPU execution time needed for the identification of all the itemsets with a sufficient support; T2 is the time for the computation of the dependence values of the itemsets previously identified. Both times have been normalized with respect to the total number of itemsets, since this number changes considerably in the different experiments.

[Figure 2: execution times per itemset. CPU time [s], from 0.000 to 0.050, versus average itemset length, from 2.81 to 4.96, for the two times T1 and T2.]

Fig. 2. Experimental results.

You can notice that time T1 decreases with the actual average itemset length. This is a particularity of the algorithm adopted for the first step (called Seq), because it builds, during its execution, temporary data structures that do not depend on the itemset length. Furthermore, Seq has been proved to be suitable to very long databases and to searches characterized by very high levels of accuracy.

On the contrary, time T2 increases with the average itemset length, since the depth of the resulting lattice increases; thus, this fact is not surprising. On the other side, you can observe that the increments are moderate, with the exception of the experiment having an average itemset length equal to 4.82 (corresponding to a nominal average itemset length equal to 6). In that experiment, as Table I reports, the total number of itemsets that exceed the minimum support threshold, compared to the itemset length, grows suddenly: in these conditions, the generated lattice is very heavily populated. Therefore, the high dimension of the output explains the result.

TABLE I
The number of itemsets and their average lengths in the experiments.

Total number of itemsets | Nominal average length | Actual average length
 6080                    | 3                      | 2.81
11084                    | 4                      | 3.28
14913                    | 5                      | 3.67
38640                    | 6                      | 4.82
23663                    | 7                      | 4.05
42512                    | 8                      | 4.96

Finally, these experiments show that this new approach to the discovery of knowledge on the itemset dependencies is feasible and suitable to the high-resolution searches that are typical of data mining.

VI. Conclusions

In this paper three contributions have been proposed. First, the Theorem of the products of the even and odd node probabilities has been proved and applied to determining dependencies. Second, a new definition of mutual information has been suggested and discussed. Third, a new data mining technique based on the new definition of mutual information has been presented and analyzed. The three contributions show the usefulness of the old, well-known principles of information theory also in the new area of data mining.

Acknowledgments. I would like to thank the anonymous reviewers that helped me in preparing the final version of the paper. In particular, I wish to thank one of them, who suggested to me the proof of Theorem 3.

References

[1] R. Agrawal, T. Imielinski, and A. Swami, "Database mining: A performance perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914-925, December 1993.
[2] M. A. W. Houtsma and A. Swami, "Set-oriented mining for association rules in relational databases," in 11th International Conference on Data Engineering, Taipei, Taiwan, March 6-10, 1995.
[3] Charu C. Aggarwal and Philip S. Yu, "Online generation of association rules," in Proceedings of the 14th International Conference on Data Engineering, Orlando, Florida, May 1998, pp. 402-411.
[4] J. Han, Y. Cai, and N. Cercone, "Knowledge discovery in databases: An attribute-oriented approach," in Proceedings of the 18th VLDB Conference, Vancouver, British Columbia, Canada, 1992, pp. 547-559.
[5] M. Holsheimer, M. Kersten, H. Mannila, and H. Toivonen, "A perspective on databases and data mining," in First International Conference on Knowledge Discovery and Data Mining (KDD), Montreal, Canada, AAAI Press, 1995, pp. 150-155.
[6] J. F. Elder IV and D. Pregibon, "A statistical perspective on KDD," KDD-95, pp. 87-93, 1995.
[7] R. Srikant and R. Agrawal, "Mining quantitative association rules in large relational tables," in Proceedings of the ACM-SIGMOD International Conference on the Management of Data, San Jose, California, May 1996.
[8] R. Krishnamurthy and T. Imielinski, "Practitioner problems in need of database research: Research directions in knowledge discovery," SIGMOD Record, vol. 20, no. 3, pp. 76-78, September 1991.
[9] T. Imielinski, "From file mining to database mining," in Proceedings of the SIGMOD-96 Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 35-39, May 1996.
[10] R. Srikant and R. Agrawal, "Mining generalized association rules," in Proceedings of the 21st VLDB Conference, Zurich, Switzerland, September 1995.
[11] J. Han and Y. Fu, "Discovery of multiple-level association rules from large databases," in Proceedings of the 21st VLDB Conference, Zurich, Switzerland, September 1995.
[12] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York, 1981.
[13] R. W. Yeung, "A new outlook on Shannon's information measures," IEEE Transactions on Information Theory, vol. IT-37, pp. 466-474, May 1991.
[14] T. Kawabata and R. W. Yeung, "The structure of the I-measure of a Markov chain," IEEE Transactions on Information Theory, vol. IT-38, pp. 1146-1149, May 1992.
[15] Z. Zhang and R. W. Yeung, "A non-Shannon type inequality of information quantities," IEEE Transactions on Information Theory, vol. IT-43, pp. 1982-1986, November 1997.
[16] Z. Zhang and R. W. Yeung, "On characterization of entropy function via information inequalities," IEEE Transactions on Information Theory, vol. IT-44, pp. 1440-1452, July 1998.
[17] R. W. Yeung, T. T. Lee, and Z. Ye, "An information theoretic characterization of Markov random fields and its applications," MIT, Cambridge, MA, USA, August 1998, p. 73.
[18] Rosa Meo, "Theory of dependence values," ACM Transactions on Database Systems, vol. 25, no. 3, pp. 380-406, September 2000.
[19] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proc. ACM SIGMOD Conference on Management of Data, Washington, D.C., May 1993, pp. 207-216.
[20] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo, "Fast discovery of association rules," in Knowledge Discovery in Databases, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), vol. 2, AAAI/MIT Press, 1995.
[21] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proceedings of the 20th VLDB Conference, Santiago, Chile, September 1994.
[22] A. Savasere, E. Omiecinski, and S. Navathe, "An efficient algorithm for mining association rules in large databases," in Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995.
[23] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur, "Dynamic itemset counting and implication rules for market basket data," in Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, May 13-15, 1997, vol. 26(2) of SIGMOD Record, pp. 255-264, ACM Press.
[24] H. Toivonen, "Sampling large databases for association rules," in Proceedings of the 22nd VLDB Conference, Bombay (Mumbai), India, 1996.

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 1,JANUARY 2002 25] Dao I.Lin and Zvi M.Kedem, \Pincer-search: A new algorithm for discovering the maximum frequent set," in Proceedings of EDBT-98 Conference, Valencia, Spain, 1998. 26] R.J.Bayardo, \Eciently mining long patterns from databases," in Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, 1998, pp. 85{93. 27] J. S. Park, M. Shen, and P. S. Yu, \An eective hash based algorithm for mining association rules," in Proceedings of the ACM-SIGMOD International Conference on the Management of Data, San Jose, California, May 1995. 28] Craig Silverstein, Sergey Brin, and Rajeev Motwani, \Beyond market baskets: generalizing association rules to dependence rules," Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 39{68, 1998. 29] Rosa Meo, \A new approach for the discovery of frequent itemsets," in Proceedings of the 1st International Conference on Data Warehousing and Knowledge Discovery, Firenze, Italy, August/September 1999, pp. 193{202.

Rosa Meo holds a Dr. Ing. degree in Electrical Engineering and a Ph.D. in Computer Engineering, both from Politecnico di Torino. She is now an assistant professor in the Department of Computer Science at the University of Torino. Her current research interests are in the field of databases, in particular active databases and data mining.
