Submitted manuscript

An Efficient FUFP-tree Maintenance Algorithm for Record Modification

Tzung-Pei Hong Department of Computer Science and Information Engineering, National University of Kaohsiung; Department of Computer Science and Engineering, National Sun Yat-sen University Kaohsiung, {811, 804}, Taiwan, R.O.C. [email protected]

Chun-Wei Lin Department of Computer Science and Information Engineering National Cheng Kung University Tainan, 701, Taiwan, R.O.C. [email protected]

Yu-Lung Wu Department of Information Management I-Shou University Kaohsiung, 84008, Taiwan, R.O.C. [email protected]

ABSTRACT. The Frequent-Pattern-tree (FP-tree) is an efficient data structure for association-rule mining without the generation of candidate itemsets. It compresses a database into a tree structure that stores only large items. When the underlying data is updated, however, the FP-tree must be rebuilt by processing all the transactions in batch. In this paper, we therefore extend the FP-tree construction algorithm to handle record modification efficiently. A fast updated FP-tree (FUFP-tree) structure is used to ease the tree update process. An FUFP-tree maintenance algorithm is also proposed to reduce the execution time needed to reconstruct the tree when records are modified. Experimental results show that the proposed FUFP-tree maintenance algorithm for record modification runs faster than the batch FP-tree construction algorithm when handling updated records and generates nearly the same tree structure as the FP-tree algorithm. The proposed approach thus achieves a good trade-off between execution time and tree complexity.

Keywords: data mining, FP-tree, FUFP-tree, record modification, maintenance

1. Introduction

Data mining involves applying specific algorithms to extract patterns, features or rules from data sets in a particular representation. One common type of data mining is to derive association rules from transaction data, such that the presence of certain items in a transaction implies the presence of some other items. Many mining approaches have been proposed to achieve this purpose [1][2][3][6][7][9][10][11][13][14][16][17][18]. For example, Agrawal and his co-workers proposed several mining algorithms based on the concept of large itemsets to find association rules from transaction data [1][2][3]. In their approaches, candidate itemsets had to be generated first to determine large itemsets and association rules.

Cheung et al. proposed a well-known incremental mining algorithm, called the Fast UPdate algorithm (FUP) [4], to avoid the shortcomings of batch mining. The FUP algorithm modified the Apriori mining algorithm [2] and adopted the pruning techniques used in the DHP (Direct Hashing and Pruning) algorithm [14]. It first calculated large itemsets mainly from newly inserted transactions and compared them with the previous large itemsets from the original database. According to the comparison results, FUP determined whether re-scanning the original database was required, thus saving some time in maintaining the association rules.

Han et al. proposed the Frequent-Pattern-tree (FP-tree) structure for efficiently mining association rules without the generation of candidate itemsets [8]. The FP-tree [8] was used to compress a database into a tree structure that stored only large items. It was condensed and complete for finding all the frequent patterns. The construction process was executed tuple by tuple, from the first transaction to the last. After that, a recursive mining procedure called FP-Growth was executed to derive frequent patterns from the FP-tree. They demonstrated that the approach could perform better than Apriori. The FP-tree mining approach belongs to batch mining; that is, all transactions must be processed in batch.

Many mining methods for finding association rules based on the FP-tree structure have also been proposed. Qiu et al. proposed the QFP-growth mining approach to mine association rules [15]. It could generate frequent patterns without using conditional FP-trees, and its computational time and space were reduced compared to the original FP-tree approach. The QFP-growth and the batch FP-tree mining algorithms cannot, however, deal with the problem of incremental mining. Whenever records are changed by insertion, deletion or modification, the trees must be re-constructed. Even when the number of changed records is small, the traditional methods must still process the whole database in batch.

To deal with the incremental mining problem, Ezeife constructed a generalized FP-tree, which stored all the large and non-large items, for incremental mining without rescanning databases [5]. All the non-large items had to be kept, thus requiring a large amount of space. Hong et al. also proposed an efficient mining algorithm based on the FP-tree for handling the insertion of records [12]. In that approach, a fast updated FP-tree (FUFP-tree) structure was used to simplify the tree update process [12]. It was similar to the FP-tree structure except that the links between parent and child nodes were bi-directional. The counts of the sorted frequent items were also kept in the Header Table. The bi-directional links and the item counts kept in the Header Table together speed up the maintenance process.

In addition to record insertion, record modification is also common in real-world applications. Although a modification can be handled by running a deletion procedure followed by an insertion procedure, doing so requires twice the computational time needed for a single procedure.

Therefore, developing an efficient maintenance algorithm for record modification is essential. In this paper, we thus propose a maintenance algorithm based on the FUFP-tree for the efficient handling of record modification. When records in the database are modified, the proposed algorithm processes them to maintain the FUFP-tree and the Header Table. The count difference is first formed by comparing the counts of each updated item before and after record modification. The proposed maintenance algorithm then partitions the items into four sections according to whether they are large in the original database and whether their count difference is positive or negative (including zero). Each section is then processed in its own way. The Header Table and the FUFP-tree are correspondingly updated whenever necessary.

The remainder of this paper is organized as follows. Related works are reviewed in Section 2. The proposed algorithm for record modification is described in Section 3. An example to illustrate the proposed algorithm is given in Section 4. Experimental results showing the performance of the proposed algorithm are provided in Section 5. Conclusions are given in Section 6.

2. Review of the Frequent Pattern Tree

Han et al. proposed the Frequent-Pattern-tree structure (FP-tree) for efficiently mining association rules without the generation of candidate itemsets [8]. The FP-tree mining algorithm consists of two phases. The first phase constructs the FP-tree from the database, and the second phase derives frequent patterns from the FP-tree. They are described below.

2.1. Construction of an FP-tree

The FP-tree [8] is used to compress a database into a tree structure storing only large items. It is condensed and complete for determining all the frequent patterns. Three steps are involved in FP-tree construction. The database is first scanned to find all items with their frequencies. The items whose supports are larger than or equal to a predefined minimum support are selected as large 1-itemsets (items). Next, the large items are sorted in descending order of frequency. Finally, the database is scanned once more to construct the FP-tree according to the sorted order of large items. The construction process is executed tuple by tuple, from the first transaction to the last. After all transactions are processed, the FP-tree is completely constructed. The Header Table, built to facilitate tree traversal, includes the sorted large items and their pointers (called frequency heads) linked to their first occurrence nodes in the FP-tree. If more than one node has the same item name, these nodes are also linked in sequence. Note that the links between nodes are uni-directional, from parents to children.

Below, a simple example is given to illustrate the process of FP-tree construction. Assume the database contains the five transactions shown in Table 1. Each transaction has a transaction identifier (TID) and the items purchased. Each item is denoted by a symbol. Also assume the minimum support is set at 50%. The FP-tree is constructed in the following way [8].

TABLE 1. A database with five transactions

TID   Items
100   a, c, d, f, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, c, e, f, l, m, n, p

First, the database is scanned to find large items. In this example, the five transactions are scanned to find the items with their counts shown in Table 2, in which the large items are marked (here with an asterisk).

TABLE 2. All the items with their counts

Item   Frequency     Item   Frequency
a      3 *           j      1
b      3 *           k      1
c      4 *           l      2
d      1             m      3 *
e      1             n      1
f      4 *           o      2
g      1             p      3 *
h      1             s      1
i      1             w      1

From Table 2, it can be observed that the set of large 1-itemsets, named L1, is {a:3, b:3, c:4, f:4, m:3, p:3}, where the number after an item represents its count. Next, the items in L1 are sorted in descending order of frequency. The sorted L1, named L1', is {f:4, c:4, a:3, b:3, m:3, p:3}. Finally, the database is scanned again to construct the FP-tree. To make the construction process easier to follow, Table 3 shows each transaction reduced to its large items in sorted order.

TABLE 3. The transactions with only the sorted large items

TID   Sorted frequent items
100   f, c, a, m, p
200   f, c, a, b, m
300   f, b
400   c, b, p
500   f, c, a, m, p
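The first two construction steps (counting items, keeping the large ones, and sorting each transaction) can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the minimum count is assumed to be the support threshold times the number of transactions rounded up, and ties between equal counts are broken by the order listed in L1', which is an assumption made so that the output matches Table 3.

```python
import math
from collections import Counter

# Table 1 as given in the text (TID -> items purchased).
DATABASE = {
    100: ["a", "c", "d", "f", "g", "i", "m", "p"],
    200: ["a", "b", "c", "f", "l", "m", "o"],
    300: ["b", "f", "h", "j", "o"],
    400: ["b", "c", "k", "s", "p"],
    500: ["a", "c", "e", "f", "l", "m", "n", "p"],
}

# Assumption: equal counts are ordered as in the sorted list L1' of the text.
TIE_RANK = {item: rank for rank, item in enumerate("fcabmp")}


def find_large_items(database, min_support):
    """Return {item: count} for items meeting the minimum support."""
    counts = Counter(item for items in database.values() for item in set(items))
    min_count = math.ceil(min_support * len(database))
    return {item: n for item, n in counts.items() if n >= min_count}


def sort_large_items(large):
    """Sort large items by descending count, then by the assumed tie order."""
    return sorted(large, key=lambda i: (-large[i], TIE_RANK.get(i, len(TIE_RANK))))


def project(database, order):
    """Keep only large items in each transaction, in the sorted order."""
    rank = {item: r for r, item in enumerate(order)}
    return {
        tid: sorted((i for i in items if i in rank), key=rank.__getitem__)
        for tid, items in database.items()
    }


large_items = find_large_items(DATABASE, 0.5)
sorted_order = sort_large_items(large_items)
sorted_transactions = project(DATABASE, sorted_order)
```

With a 50% threshold and five transactions the minimum count is 3, so `sorted_transactions` holds the same rows as Table 3.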

In Table 3, the first transaction is (f, c, a, m, p). The root of the FP-tree is first set to null. This transaction is then inserted into the FP-tree as the first branch, and each node in the branch is given a count of 1. The result after the first transaction is processed is shown in Figure 1.

[Figure 1: the sorted transactions of Table 3 beside an FP-tree with a single branch, {} -> f:1 -> c:1 -> a:1 -> m:1 -> p:1.]

FIGURE 1. The FP-tree after the first transaction is processed.

The second transaction is processed next. It shares the prefix (f, c, a) with the first branch of the FP-tree. The counts of nodes f, c and a are therefore incremented by 1, and a new node (b:1) is created and linked to (a:2) as its child. Another new node (m:1) is then created and linked to (b:1). In addition, a link is created between the two m nodes. The result after the second transaction is processed is shown in Figure 2.

[Figure 2: the sorted transactions of Table 3 beside an FP-tree with two branches sharing the prefix {} -> f:2 -> c:2 -> a:2; below a:2 the branches are m:1 -> p:1 and b:1 -> m:1.]

FIGURE 2. The FP-tree after the second transaction is processed.

The same process is then executed for the other transactions. After all the transactions are processed, the resulting Header Table and FP-tree are shown in Figure 3.
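The transaction-by-transaction insertion walked through above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of [8]: the names `Node`, `build_fp_tree` and `branches` are hypothetical, and the Header Table is modelled as a dictionary from each item to its first node, with same-item nodes chained through a `next` field.

```python
class Node:
    """One FP-tree node; links run from parent to child only."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}   # item -> child Node
        self.next = None     # next node carrying the same item


def build_fp_tree(sorted_transactions):
    """Insert each sorted transaction as a path starting at the root."""
    root = Node(None)
    header = {}  # item -> first node with that item
    tails = {}   # item -> last node in that item's chain
    for items in sorted_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item)
                node.children[item] = child
                if item in tails:
                    tails[item].next = child   # chain same-item nodes
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1   # shared prefixes only bump counts
            node = child
    return root, header


def branches(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of 'item:count' labels."""
    if node.item is not None:
        prefix = prefix + (f"{node.item}:{node.count}",)
    if not node.children:
        yield prefix
    for child in node.children.values():
        yield from branches(child, prefix)


# Sorted transactions from Table 3.
SORTED_TRANSACTIONS = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

root, header = build_fp_tree(SORTED_TRANSACTIONS)
paths = list(branches(root))
```

After all five transactions are inserted, `paths` contains four branches, with f:4 and c:1 as the two children of the root.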

[Figure 3: the Header Table (items f, c, a, b, m, p, each with a frequency head pointer) beside the final FP-tree, whose branches are {} -> f:4 -> c:3 -> a:3 -> m:2 -> p:2; a:3 -> b:1 -> m:1; f:4 -> b:1; and {} -> c:1 -> b:1 -> p:1.]

FIGURE 3. The resulting Header Table and FP-tree in the example.

2.2. Mining of Large Itemsets

After the FP-tree is constructed from a database, a mining procedure called FP-Growth [8] is executed to find all large itemsets. FP-Growth does not need to generate candidate itemsets for mining, but derives frequent patterns directly from the FP-tree. It is a recursive process, handling the frequent items one by one and bottom-up according to the Header Table. A conditional FP-tree is generated for each frequent item, and from this tree the large itemsets containing the processed item can be recursively derived.

Specifically, a conditional FP-tree is generated in the following way. Let a prefix path of an item I in the FP-tree be the preceding part of a branch above I. The prefix paths for a large item I are first extracted from the FP-tree. The count of each node in a prefix path is set to the count of I in the same branch. The counts of each item appearing in different prefix paths are then summed. The items with counts larger than or equal to the minimum count are selected to build the conditional FP-tree for I. Each prefix path, like a transaction, is used to build the conditional FP-tree as in the FP-tree construction. A conditional FP-tree is thus similar to a sub-FP-tree with the processed item lying at its leaves. An itemset composed of the original item I and each item in the conditional FP-tree is certain to be large. The process is recursively executed until all the items in a conditional FP-tree are processed.

3. The Proposed FUFP-tree Maintenance Approach for Record Modification

3.1. Design Concept

Assume an FUFP-tree has been built in advance from the original database before records are modified. The FUFP-tree construction algorithm is the same as the FP-tree algorithm [8] except that the links between parent and child nodes are bi-directional. Bi-directional linking helps speed up item modification in the maintenance process. The counts of the sorted frequent items are recorded in the Header Table as well.
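To make the prefix-path definition of Section 2.2 and the use of child-to-parent links more concrete, the following Python sketch builds a small tree whose nodes also point to their parents, then walks upward from each node of an item to collect its prefix paths. It is an illustration under assumptions, not the authors' code: `Node`, `build_tree`, `prefix_paths` and `conditional_items` are hypothetical names, and the input is the sorted transaction list of Table 3.

```python
from collections import Counter


class Node:
    """Tree node with links in both directions (parent and children)."""

    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}
        self.next = None  # next node carrying the same item


def build_tree(transactions):
    root, header, tails = Node(None), {}, {}
    for items in transactions:
        node = root
        for item in items:
            if item not in node.children:
                child = Node(item, parent=node)
                node.children[item] = child
                if item in tails:
                    tails[item].next = child
                else:
                    header[item] = child
                tails[item] = child
            node = node.children[item]
            node.count += 1
    return root, header


def prefix_paths(header, item):
    """Return [(prefix path, count of item in that branch), ...]."""
    paths = []
    node = header.get(item)
    while node is not None:
        path = []
        up = node.parent
        while up is not None and up.item is not None:  # stop at the root
            path.append(up.item)
            up = up.parent
        paths.append((tuple(reversed(path)), node.count))
        node = node.next
    return paths


def conditional_items(paths, min_count):
    """Sum item counts over the prefix paths and keep the large ones."""
    totals = Counter()
    for path, count in paths:
        for it in path:
            totals[it] += count
    return {it: n for it, n in totals.items() if n >= min_count}


TRANSACTIONS = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]
root, header = build_tree(TRANSACTIONS)
m_paths = prefix_paths(header, "m")
m_items = conditional_items(m_paths, min_count=3)
```

For item m, the two prefix paths are (f, c, a) with count 2 and (f, c, a, b) with count 1, so only f, c and a reach the minimum count of 3 and would enter the conditional FP-tree for m.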

When records in the database are modified, the proposed algorithm processes them to maintain the FUFP-tree. The count difference is first formed by comparing the counts of each updated item before and after record modification. The proposed maintenance algorithm then partitions items into four sections according to whether they are large in the original database and whether their count difference is positive or negative (including zero). Each section is then processed in its own way. The Header Table and the FUFP-tree are correspondingly updated whenever necessary. Considering an original database and some records to be modified, the following four cases (illustrated in Figure 4) may arise.

Case 1: An item is frequent in the original database and has a positive count difference.
Case 2: An item is frequent in the original database and has a negative (including zero) count difference.
Case 3: An item is not frequent in the original database and has a positive count difference.
Case 4: An item is not frequent in the original database and has a negative (including zero) count difference.

[Figure 4: a two-by-two grid. Rows: large items and small items in the original database. Columns: positive count difference and negative (or zero) count difference. Cells: Case 1 (large, positive), Case 2 (large, negative or zero), Case 3 (small, positive), Case 4 (small, negative or zero).]

FIGURE 4. Four cases when records are modified from an existing database.

Since items in Case 1 are large in the original database and have a positive count difference, they will remain large after the database is updated. Similarly, items in Case 4 will remain small after the records are modified. Thus, Cases 1 and 4 will not affect the final large items. Items in Case 2 are large in the original database and have a negative (or zero) count difference. Some existing large items may therefore be removed after the database is modified. Whether an item is removed is easily decided, since the counts of the original large items are kept in the Header Table. Finally, items in Case 3 are small in the original database and have a positive count difference. Some large items may thus be added. The original database must be rescanned to determine the original counts of these items. The summary of the four cases and their results is given in Table 4.

TABLE 4. Four cases and their results for record modification

Cases: Original – Difference          Results
Case 1: Large – Positive              Always large
Case 2: Large – Negative (or zero)    Determined from the Header Table
Case 3: Small – Positive              Determined by rescanning the original database
Case 4: Small – Negative (or zero)    Always small
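The partition summarized in Table 4 can be written as a small Python function. This is a sketch under assumptions: `classify` and `RESULTS` are hypothetical names, an item is treated as large exactly when it appears in the Header Table, and the count differences in the usage example are invented for illustration only (the Header Table counts come from the example in Section 2).

```python
# Outcome of each case, as listed in Table 4.
RESULTS = {
    1: "always large",
    2: "determined from the Header Table",
    3: "determined by rescanning the original database",
    4: "always small",
}


def classify(item, count_difference, header_counts):
    """Return the case number (1-4) for an item in the modified records."""
    is_large = item in header_counts      # large in the original database
    is_positive = count_difference > 0    # zero counts as non-positive
    if is_large:
        return 1 if is_positive else 2
    return 3 if is_positive else 4


# Header Table counts from the Section 2 example.
HEADER_COUNTS = {"f": 4, "c": 4, "a": 3, "b": 3, "m": 3, "p": 3}

# Hypothetical count differences, for illustration only.
differences = {"f": 1, "b": -1, "o": 2, "d": 0}

cases = {item: classify(item, diff, HEADER_COUNTS)
         for item, diff in differences.items()}
```

Here f falls in Case 1, b in Case 2, o in Case 3 and d in Case 4, so only b and o require further checking.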

In the maintenance process of the FUFP-tree for record modification, item deletion is completed before item insertion. When an originally large item becomes small, it is directly removed from the FUFP-tree and its parent and child nodes are then linked together. Conversely, when an originally small item becomes large, it is added to the end of the Header Table and then inserted into the leaf nodes of the FUFP-tree. It is reasonable to insert the item at the end of the Header Table because, when an originally small item becomes large due to the modified records, its updated support is usually only a little larger than the minimum support. The FUFP-tree can thus be updated with little change, and the performance of the proposed maintenance algorithm can be greatly improved. The entire FUFP-tree can be re-constructed in a batch way when a sufficiently large number of transactions are deleted. The notation used in this paper is first described below.

3.2. Notation

D: the original database;
T: the set of modified records (after modification);
T': the set of records to be modified (before modification);
D-: the set of unchanged records, i.e., D - T;
U: the entire updated database;
M: the set of items appearing in the updated records before and after modification;
d: the number of records in D;
t: the number of records in T;
d-: the number of records in D-;
SD(I): the number of occurrences of I in D;
SM(I): the count difference of I from the updated records, I ∈ M;
SU(I): the number of occurrences of I in U;
Sup: the support threshold for large itemsets;
Decrease_Items: the set of items with which the updated records before modification (i.e. in T') are reprocessed to decrease the corresponding counts in the FUFP-tree;
Increase_Items: the set of items with which the updated records after modification (i.e. in T) are reprocessed to increase the corresponding counts in the FUFP-tree;
Rescan_Items: the set of items for which the unmodified records in the original database are rescanned;
Rescan_Transactions: the set of unmodified records with at least one item in the set of Rescan_Items.
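Using the notation above, the count difference SM(I) and the updated count SU(I) = SD(I) + SM(I) of an originally large item can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the single modified record is invented for the example, SD(I) is read from the Header Table of the Section 2 example, and an item is taken to remain large when SU(I) is at least d*Sup.

```python
from collections import Counter


def count_differences(before, after):
    """SM(I) for every item I in M: occurrences in T minus occurrences in T'."""
    old = Counter(i for record in before for i in set(record))
    new = Counter(i for record in after for i in set(record))
    modified_items = set(old) | set(new)   # the set M
    return {i: new[i] - old[i] for i in modified_items}


def updated_counts(header_counts, differences):
    """SU(I) = SD(I) + SM(I) for items already in the Header Table."""
    return {i: header_counts[i] + diff
            for i, diff in differences.items() if i in header_counts}


def still_large(updated, d, sup):
    """Check SU(I) >= d * Sup; modification leaves d unchanged."""
    return {i: n >= d * sup for i, n in updated.items()}


# Header Table counts SD(I) from the Section 2 example.
HEADER_COUNTS = {"f": 4, "c": 4, "a": 3, "b": 3, "m": 3, "p": 3}

# Hypothetical modification of one record (T' before, T after).
T_BEFORE = [["b", "f", "h", "j", "o"]]
T_AFTER = [["a", "f", "h", "j", "p"]]

sm = count_differences(T_BEFORE, T_AFTER)
su = updated_counts(HEADER_COUNTS, sm)
large_after = still_large(su, d=5, sup=0.5)
```

In this invented example, b drops from 3 to 2, which is below the minimum count of 2.5, so it would no longer be large, while a and p each rise to 4 and f stays at 4.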

The details of the proposed algorithm are described below.

The Proposed Algorithm:

INPUT: An old database, its corresponding Header Table storing the frequent items in descending order, its corresponding FUFP-tree, a support threshold Sup and a set of t modified records.

OUTPUT: A new FUFP-tree for the updated database.

STEP 1: Find all the items in the t records before and after modification. Denote them as a set of modified items, M.

STEP 2: Find the count difference (including zero) of each item in M for the modified records.

STEP 3: Check whether the items in M are large or small in the original database.

STEP 4: For each item I in M which has a positive count difference and is large in the original database (appearing in the Header Table), do the following substeps (Case 1):
  Substep 4-1: Set the new count SU(I) of I in the entire updated database as SU(I) = SD(I) + SM(I), where SD(I) is the count of I in the Header Table (original database) and SM(I) is the count difference of I after record modification.
  Substep 4-2: Update the count of I in the Header Table as SU(I).
  Substep 4-3: Put I in both the sets of Increase_Items and Decrease_Items, which will be further processed in STEP 7.

STEP 5: For each item I in M which has a negative (or zero) count difference and is large in the original database (appearing in the Header Table), do the following substeps (Case 2):
  Substep 5-1: Set the new count SU(I) of I in the entire updated database as SU(I) = SD(I) + SM(I).
  Substep 5-2: If SU(I) ≥ d*Sup, item I will still be large after the database is updated; update the count of I in the Header Table as SU(I) and add I to both the sets of Increase_Items and Decrease_Items.
  Substep 5-3: If SU(I)