THE APPLICATION OF ASSOCIATION RULE MINING TO REMOTELY SENSED DATA Jianning Dong, William Perrizo, Qin Ding and Jingkai Zhou North Dakota State University Fargo, ND 58105 [email protected]

KEYWORDS Association rule mining, data mining, remotely sensed data

ABSTRACT The explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge. Data mining is such a technique. In this paper, we consider the mining of association rules from remotely sensed data and its application in precision agriculture. Based on the characteristics of remotely sensed data and of the problem itself, we present a bit-oriented formal model and discuss the partitioning of quantitative attributes into equal, unequal and discontinuous partitions. We propose two new pruning techniques and compare their performance with a base algorithm. An improvement in performance is shown when using these pruning techniques.

1. INTRODUCTION

In the last decade, we have seen an explosive growth in our capabilities to both generate and collect data. Advances in scientific data collection (e.g., from remote sensors and satellites), the widespread introduction of bar codes for almost all commercial products, and the computerization of many business (e.g., credit card purchases) and government transactions (e.g., tax returns) have generated a flood of data. Advances in data storage technology, such as faster, higher-capacity, and cheaper devices (e.g., magnetic disks and CD-ROMs), better database management systems, and data warehouse technology, have allowed us to transform this data deluge into "mountains" of stored data. But what we really want is information. Data mining, also referred to as knowledge discovery in databases, gives users the tools to extract previously unknown and potentially useful information (such as knowledge rules, constraints, and irregularities) from data in databases [CHY96]. Data mining has been recognized as one of the most promising areas of research, encompassing databases, statistics and machine learning.

The primary goal of data mining in practice is to assist strategic decision making. It does this by building a model of the real world based on data collected from a variety of sources. Models can be used to predict as well as to describe [Fl98]. Association rule mining is one of the important data mining techniques. Association rules are primarily useful for describing behavior that is captured in the database. An association rule can be expressed as X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y. Each rule has two measures of value: support, which indicates the frequency of the occurring pattern, and confidence, which denotes the strength of the implication in the rule. The support of the rule X ⇒ Y is support(X ∪ Y). The confidence c of the rule X ⇒ Y in the transaction set D means that c% of the transactions in D that contain X also contain Y; it can be written as the ratio support(X ∪ Y) / support(X). Given a user-specified minimum support and minimum confidence, the problem of mining association rules is to find all the association rules whose support and confidence exceed the respective thresholds. Thus, it can be decomposed into two sub-problems:
* Finding the large itemsets, which have support above the predetermined minimum support.
* Deriving, from each large itemset, all rules whose confidence is above the predetermined minimum confidence.
Efficiency is an important issue for a mining algorithm, and many interesting and efficient data mining algorithms have been proposed ([AIS93], [ASR94], [CNF96]). Performance is an essential concern because mining association rules may require multiple scans of the database. In this paper, we propose two new pruning techniques for candidate itemset generation when mining association rules from remotely sensed data in precision agriculture.
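To make these definitions concrete, the following is a small illustrative sketch (Python, with made-up transactions; the function names are ours, not from any particular library) that computes support and confidence exactly as defined above:

# Hedged sketch: support and confidence of a rule X => Y.
# Transactions are sets of items; the data below is illustrative only.

def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    # confidence(X => Y) = support(X u Y) / support(X)
    return support(x | y, transactions) / support(x, transactions)

transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]]
x, y = frozenset({"a"}), frozenset({"b"})
print(support(x | y, transactions))    # 0.5
print(confidence(x, y, transactions))  # 0.5 / 0.75 = 0.666...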

2. BASE ALGORITHM (APRIORI ALGORITHM)

Various algorithms have been proposed to discover large itemsets [AIS93], [ASR94]. The Apriori algorithm is one of the most popular algorithms for mining association rules in a centralized database. The main idea of Apriori is outlined in the following [FUM96].

Table 1. Notation for the mining algorithm
k-itemset   An itemset having k items.
Lk          The set of large k-itemsets.
Ck          The set of candidate k-itemsets.

The Apriori algorithm is shown in Figure 1.

L1 = { large 1-itemsets };
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);        // new candidates
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);        // candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = { c ∈ Ck | c.count ≥ minsup };
end
Answer = ∪k Lk;

Figure 1. The Apriori algorithm
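For concreteness, a minimal runnable Python sketch of the same loop is given below; the join-and-prune step stands in for apriori-gen, and the names and toy data are illustrative, not the authors' implementation.

from itertools import combinations

def apriori(transactions, minsup):
    # minsup is an absolute support count, as in Figure 1.
    items = {frozenset([i]) for t in transactions for i in t}
    L = [{c for c in items if sum(c <= t for t in transactions) >= minsup}]
    while L[-1]:
        prev = L[-1]
        # apriori-gen: join step (unions that grow the itemset by one item)
        Ck = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
        # ...followed by prune step (all (k-1)-subsets must be large).
        Ck = {c for c in Ck
              if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
        counts = {c: sum(c <= t for t in transactions) for c in Ck}
        L.append({c for c, n in counts.items() if n >= minsup})
    return set().union(*L[:-1])  # Answer = union of all Lk

transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]]
print(apriori(transactions, minsup=2))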

3. MINING ASSOCIATION RULES FROM IMAGERY DATA

3.1 Problem Definition

An example is described as follows. We have an image scene and a yield map, both digital images taken of the same field. The image scene contains red, blue and green bands, representing the reflectance level of each pixel of the scene. The yield map also contains red, blue and green bands, and a formula converts these three bands into a single gray-scale band. We then have five attributes in our database: pixel-number, red (band1), blue (band2), green (band3) and yield (band4). The problem of mining association rules from this imagery data is to discover the associations between band1, band2, band3 and band4. This helps farmers understand which combinations of spectral reflectance values are associated with high crop yield.

3.2 Partitioning Quantitative Attributes

The Apriori algorithm can only handle categorical basket data. When a database has quantitative attributes, we cannot directly apply the techniques for mining association rules in basket databases. In a basket database, we say a transaction includes a set of items, meaning those items appear in the transaction, so a boolean value TRUE or FALSE can express the relationship between an item and a transaction. For a quantitative attribute, say one box of beer versus ten boxes of beer, this representation cannot tell the difference between the two values. Several approaches have been proposed to deal with this type of data. The most common method is simply to partition the attribute values into intervals; the intervals are mapped to consecutive integers such that the order of the intervals is preserved [SRI96]. Without loss of generality, we partition the values into equal, uneven, or discontinuous intervals. We then have a set of <attribute, interval> pairs; the "item" is "1" if the attribute had a value within that interval in the original tuple, and "0" otherwise.
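As a small illustration of the equal case, a hypothetical helper that maps an 8-bit band value into one of 2^d equal intervals (matching the diameters used in the example of Section 4.5) might look like this:

def interval_index(value, d):
    # Partition the 8-bit range [0, 255] into 2**d equal intervals
    # and return the 1-based interval number of `value`.
    width = 256 // (2 ** d)
    return value // width + 1

def to_items(pixel, diameters):
    # pixel: one value per band; diameters: d for each band.
    # Returns items like "b25" = band 2, interval 5 (value "1" in the tuple).
    return {f"b{band}{interval_index(v, d)}"
            for band, (v, d) in enumerate(zip(pixel, diameters), start=1)}

# Pixel 1 of Table 3: bands (40, 140, 200, 240), diameters (2, 3, 3, 2)
print(to_items((40, 140, 200, 240), (2, 3, 3, 2)))  # {'b11', 'b25', 'b37', 'b44'}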

3.3 Finding Large Itemsets from Imagery Data

In data mining, it is essential to collect enough data that meaningful conclusions can be drawn, so the amount of data tends to be huge; a single image scene contains 40 million pixels. Efficiency is therefore an important issue for a mining algorithm. Finding large itemsets can be done iteratively, in the sense that the large itemsets discovered in one iteration are used as the basis for generating the candidate itemsets of the next iteration. For example, in the Apriori algorithm, the large k-itemsets generated at the kth iteration are extended, using a heuristic, into the candidate (k+1)-itemsets of the next iteration. The heuristic used to construct candidate itemsets from large itemsets is crucial to performance. Clearly, to be efficient, the heuristic should generate only candidates with a high likelihood of being large itemsets, because for each candidate we must count its appearances in the database: the larger the set of candidate itemsets, the higher the processing cost of discovering the large itemsets. A performance study [PCY97] shows that the set of candidate itemsets generated during an early iteration is generally orders of magnitude larger than the set of large itemsets it really contains. Therefore, the initial candidate set generation, especially for the large 2-itemsets, is the key issue in improving the performance of data mining.

4. NEW PRUNING TECHNIQUES FOR FAST DATA MINING

Two effective pruning techniques for candidate itemset generation are proposed to progressively improve the efficiency of the base algorithm.


4.1 Technique one


Lemma 1. A pixel value cannot belong to two different intervals of the same band.
Lemma 2. A combination of k intervals (k > 1) from the same band has support zero.
From Lemma 2 we know that any candidate k-itemset (1 < k < n) generated from a single band must have support zero, so it can never become a large k-itemset. The first pruning technique proposed in this paper is therefore not to generate candidate k-itemsets among "items" that lie in the same band but in different intervals.
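A sketch of candidate 2-itemset generation with this pruning is shown below; band() is an assumed helper that extracts the band number from an interval item such as "b25" (band 2, interval 5), and the large 1-itemsets used are those of the example in Section 4.5.

from itertools import combinations

def band(item):
    # Assumed encoding: "b25" -> band 2 (first digit after 'b').
    return int(item[1])

def candidate_2_itemsets(large_1, prune_same_band=True):
    # Pair up large 1-itemsets, skipping same-band pairs (Lemma 2):
    # two intervals of one band can never co-occur in a pixel.
    return [frozenset({a, b})
            for a, b in combinations(sorted(large_1), 2)
            if not (prune_same_band and band(a) == band(b))]

L1 = ["b11", "b12", "b25", "b26", "b32", "b37", "b42", "b44"]
print(len(candidate_2_itemsets(L1, prune_same_band=False)))  # 28, as Apriori
print(len(candidate_2_itemsets(L1)))                         # 24 after Lemma 2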

We compare the Apriori algorithm with our new algorithm on candidate 2-itemset generation. The notation is shown in Table 2.

Table 2. Notation for the comparison
Ck      Candidate k-itemsets
Lk      Large k-itemsets
|Ck|    Number of itemsets in the candidate k-itemsets
|Lk|    Number of itemsets in the large k-itemsets
*       Concatenation operation
Rj      Number of intervals in band j

1. Apriori algorithm: since any subset of a large itemset must also be a large itemset, Apriori uses L1 * L1 to generate the candidate 2-itemsets C2:

|C2|apriori = C(|L1|, 2) = |L1| (|L1| - 1) / 2,

where C(n, k) denotes the binomial coefficient "n choose k".

2. New algorithm: based on Lemma 2, we never pair intervals from the same band when generating candidate 2-itemsets. Writing |L1| = R1 + R2 + ... + Rn,

|C2|new = R1 (R2 + R3 + ... + Rn) + R2 (R3 + ... + Rn) + ... + Rn-1 Rn = Σ_{j=1}^{n-1} Rj Σ_{k=j+1}^{n} Rk.

The number of candidate 2-itemsets generated by the new algorithm is much smaller than that generated by Apriori. The number pruned is

|C2|prune1 = |C2|apriori - |C2|new = Σ_{j=1}^{n} C(Rj, 2).

Note that when n and the Rj are large, |C2|prune1 becomes very large. For example, if the imagery data has eight bands and each band has sixteen intervals, the number of pruned candidate 2-itemsets is 8 × 16 × (16 - 1) / 2 = 960. This sharply reduces the processing cost: the new algorithm progressively reduces the number of candidate 2-itemsets, relieving the performance bottleneck of the whole process.
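The identity |C2|apriori - |C2|new = Σ_j C(Rj, 2) is easy to check numerically; a throwaway sketch with the eight-band, sixteen-interval example values:

from math import comb

R = [16] * 8  # eight bands, sixteen intervals each (example values)
total = sum(R)
c2_apriori = comb(total, 2)                        # pair any two intervals
c2_new = sum(R[j] * R[k] for j in range(len(R))
             for k in range(j + 1, len(R)))        # only cross-band pairs
print(c2_apriori - c2_new, sum(comb(r, 2) for r in R))  # 960 960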

4.2 Technique two

During the process of data mining, allowing the user to interact with the mining engine and exploiting the user's prior knowledge help speed up the mining algorithm by restricting the search space. In precision agriculture, farmers want to know which combinations of intervals of reflectance from certain bands imply certain agricultural phenomena, such as high yield. That is, farmers are only interested in rules in which reflectance bands appear in the antecedent and other bands appear in the consequent. The mining task is to find the association rules X ⇒ Y, where X = {collection of reflectance intervals}, Y = {output intervals}, and X ∩ Y = ∅. This gives us a heuristic: if a large itemset does not contain an item of Y, the rules derived from it later will be uninteresting to the user. In this paper, we propose a second pruning technique for candidate itemset generation based on this heuristic. Assume there are N bands in the imagery data, with 1, ..., R the indexes of the reflectance bands and R+1, ..., N the indexes of the output bands. The association rules we mine have the form band1 ∧ ... ∧ bandR ⇒ band(R+1) ∧ ... ∧ bandN.

1. Consider only one band, bandN, in the output. The association rule has the form band1 ∧ ... ∧ band(N-1) ⇒ bandN.

For candidate 2-itemsets, since bandN must appear in every candidate itemset, the new algorithm generates only those candidate 2-itemsets in which one item is a bandN interval and the other is a bandk (k = 1, ..., N-1) interval, i.e. candidate 2-itemsets of the form {<bandN, interval>, <bandk, interval>}, k = 1, ..., N-1. The number of candidate 2-itemsets is then

|C2|new = RN Σ_{j=1}^{N-1} Rj.

The reason is that we are not interested in itemsets which do not contain bandN, so we prune every candidate 2-itemset in which neither interval is chosen from bandN. The number of pruned candidate 2-itemsets is

|C2|prune2 = Σ_{j=1}^{N-2} Rj Σ_{k=j+1}^{N-1} Rk.

Compared to the Apriori algorithm, applying the pruning technique described in technique one additionally prunes

|C2|prune1 = Σ_{j=1}^{N} C(Rj, 2).

The total number of pruned candidate 2-itemsets is

|C2|prune = |C2|prune1 + |C2|prune2 = C(Σ_{j=1}^{N-1} Rj, 2) + C(RN, 2).

The remaining steps are the same as in the Apriori algorithm.

2. Consider two bands in the output. The association rule has the form band1 ∧ ... ∧ band(N-2) ⇒ band(N-1) ∧ bandN. For candidate 2-itemsets, the new algorithm generates only those candidate 2-itemsets in which at least one item is a band(N-1) or bandN interval, i.e. candidate 2-itemsets of the form {<band(N-1), interval>, <bandk, interval>} with k = 1, ..., N-2, or {<bandN, interval>, <bandk, interval>} with k = 1, ..., N-1. The number of candidate 2-itemsets is

|C2|new = R(N-1) Σ_{j=1}^{N-2} Rj + RN Σ_{j=1}^{N-1} Rj.

We prune every candidate 2-itemset in which neither interval is chosen from band(N-1) or bandN. The number of pruned candidate 2-itemsets is

|C2|prune2 = Σ_{j=1}^{N-3} Rj Σ_{k=j+1}^{N-2} Rk.

Compared to the Apriori algorithm, applying the pruning technique described in technique one additionally prunes

|C2|prune1 = Σ_{j=1}^{N} C(Rj, 2).

The total number of pruned candidate 2-itemsets is |C2|prune = |C2|prune1 + |C2|prune2, and the remaining steps are the same as in the Apriori algorithm.

3. In the general case, the association rule has the form band1 ∧ ... ∧ bandM ⇒ band(M+1) ∧ ... ∧ bandN; that is, there are (N - M) bands in the output. We generate only those candidate 2-itemsets in which at least one item lies in the intervals of some bandj (j = M+1, ..., N), i.e. candidate 2-itemsets of the form {<bandj, interval>, <bandk, interval>} with k = 1, ..., M and j = M+1, ..., N, or {<bandm, interval>, <bandn, interval>} with m = M+1, ..., N-1 and n = m+1, ..., N. The number of candidate 2-itemsets is

|C2|new = (Σ_{k=1}^{M} Rk) (Σ_{j=M+1}^{N} Rj) + Σ_{j=M+1}^{N-1} Rj Σ_{k=j+1}^{N} Rk.

We prune every candidate 2-itemset in which neither interval is chosen from a bandj (j = M+1, ..., N). The number of pruned candidate 2-itemsets is

|C2|prune2 = Σ_{j=1}^{M-1} Rj Σ_{k=j+1}^{M} Rk.

Compared to the Apriori algorithm, applying the pruning technique described in technique one additionally prunes

|C2|prune1 = Σ_{j=1}^{N} C(Rj, 2).

The total number of pruned candidate 2-itemsets is

|C2|prune = |C2|prune1 + |C2|prune2 = C(Σ_{k=1}^{M} Rk, 2) + Σ_{j=M+1}^{N} C(Rj, 2),

and the remaining steps are the same as in the Apriori algorithm. From a mathematical point of view, when the Rj and N are large, that is, when the number of intervals and the number of bands are large, the new algorithm prunes a great many unnecessary candidate 2-itemsets, greatly improving the performance of the whole mining process.

4.3 Summary of the new algorithm

The following summarizes the phases of the new algorithm, which adds the new pruning techniques to the base algorithm.
Phase 1. Choose one of the partitioning methods (equal depth, uneven depth or discontinuous partition) to determine the intervals.
Phase 2. From the large 1-itemsets, apply the new pruning techniques (technique one and technique two) to generate the candidate 2-itemsets.
Phase 3. Apply the remaining steps of the Apriori algorithm.
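A hypothetical sketch of Phase 2 combining both techniques in one pass: keep only cross-band pairs (technique one) that include at least one output-band item (technique two). The item encoding is the same assumed "b25"-style encoding as before.

from itertools import combinations

def band(item):
    return int(item[1])  # assumed "b25" -> band 2 encoding

def candidate_2_itemsets(large_1, output_bands):
    # Technique one: skip same-band pairs (their support is zero).
    # Technique two: require at least one item from an output band.
    return [frozenset({a, b})
            for a, b in combinations(sorted(large_1), 2)
            if band(a) != band(b)
            and (band(a) in output_bands or band(b) in output_bands)]

L1 = ["b11", "b12", "b25", "b26", "b32", "b37", "b42", "b44"]
print(len(candidate_2_itemsets(L1, output_bands={4})))  # 12, as in Section 4.5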

4.4 Performance analysis

Figure 2 shows how the two pruning techniques improve performance. Suppose we have eight bands and each band has the same number of intervals. The greater the number of intervals in each band, the smaller the number of candidate 2-itemsets relative to Apriori, and the greater the performance improvement. The number of comparisons required by the Apriori algorithm increases markedly as the number of intervals increases. Pruning technique one reduces the number of comparisons, and pruning technique two reduces it much further for larger numbers of intervals. From this analysis we conclude that, for larger numbers of intervals, these pruning techniques, especially technique two, substantially reduce the execution time of the Apriori algorithm. Since the number of candidate 2-itemsets grows quadratically in the number of intervals, this is a very important issue.

[Figure 2. Comparison of candidate 2-itemsets: number of candidate 2-itemsets (y-axis, 0 to 9000) versus number of intervals per band, 2^(x-1) for x = 1, ..., 5 (x-axis), for C2 using Apriori, C2 using pruning technique one, and C2 using pruning technique two.]



4.5 An Example of Applying the New Algorithm

Consider the following imagery data. Assume the table contains four attributes (bands) and five tuples (pixels).

Table 3. An example database for the new algorithm
Pixel   Band1   Band2   Band3   Band4
1       40      140     200     240
2       50      130     210     250
3       45      135     210     190
4       100     180     50      100
5       110     170     40      120

From domain knowledge we know that band1, band2 and band3 contain reflectance data and band4 contains yield data; that is, band1, band2 and band3 are input data and band4 is output data. The association rules the user wants to mine are of the form band1 ∧ band2 ∧ band3 ⇒ band4. The three phases of the new algorithm are applied as follows.

Phase 1. Assume the user selects equal-depth partitioning, with diameter two for band1 and band4 and diameter three for band2 and band3; a band with diameter d is divided into 2^d intervals. (The notation for the corresponding band intervals is given in Tables 4 and 5.)

Table 4. Diameter two for band1 and band4
        [0, 63]   [64, 127]   [128, 191]   [192, 255]
Band1   B11       B12         B13          B14
Band4   B41       B42         B43          B44

Table 5. Diameter three for band2 and band3
        [0, 31]  [32, 63]  [64, 95]  [96, 127]  [128, 159]  [160, 191]  [192, 223]  [224, 255]
Band2   B21      B22       B23       B24        B25         B26         B27         B28
Band3   B31      B32       B33       B34        B35         B36         B37         B38

After selecting the partitioning method, we map each value in Table 3 into its interval. Figure 3 shows the result.

Pixel   b11 b12 b13 b14 b21 b25 b26 b28 b31 b32 b37 b38 b41 b42 b43 b44
1       1   0   0   0   0   1   0   0   0   0   1   0   0   0   0   1
2       1   0   0   0   0   1   0   0   0   0   1   0   0   0   0   1
3       1   0   0   0   0   1   0   0   0   0   1   0   0   0   1   0
4       0   1   0   0   0   0   1   0   0   1   0   0   0   1   0   0
5       0   1   0   0   0   0   1   0   0   1   0   0   0   1   0   0

Figure 3. An example of partitioning the values into intervals

Phase 2. We use the Apriori algorithm as the base mining algorithm and apply the new pruning techniques to candidate 2-itemset generation. Assume minsup = 40% and minconf = 60%.

1. Candidate 1-itemsets: all the intervals we created belong to the candidate 1-itemsets: {b11, b12, b13, b14, b21, b22, b23, b24, b25, b26, b27, b28, b31, b32, b33, b34, b35, b36, b37, b38, b41, b42, b43, b44}.

2. Large 1-itemsets: since minsup = 40%, an itemset must be supported by at least two transactions. We scan the database, count the support of each candidate 1-itemset, and prune those with support below two, giving the large 1-itemsets {b11(3), b12(2), b25(3), b26(2), b32(2), b37(3), b42(2), b44(2)} (the numbers in parentheses are the supports).

3. Candidate 2-itemsets: applying the new algorithm, we generate |C2|new = 2 × (2 + 2 + 2) = 12 candidate 2-itemsets: {b42, b11}, {b42, b12}, {b42, b25}, {b42, b26}, {b42, b32}, {b42, b37}, {b44, b11}, {b44, b12}, {b44, b25}, {b44, b26}, {b44, b32}, {b44, b37}. According to the formulas we derived, pruning technique one removes |C2|prune1 = 1 + 1 + 1 + 1 = 4 candidates and pruning technique two removes |C2|prune2 = 2 × (2 + 2) + 2 × 2 = 12 candidates, so the total number of pruned candidate 2-itemsets is |C2|prune1 + |C2|prune2 = 4 + 12 = 16. Applying the Apriori algorithm instead, the number of candidate 2-itemsets would be |C2|apriori = (8 × 7) / 2 = 28; the percentage pruned is 16/28 ≈ 57%, so the execution efficiency of the mining process is improved.

Phase 3. The remaining steps are the same as in the Apriori algorithm.

4. Large 2-itemsets: we scan the database and count the support of each candidate 2-itemset, giving the large 2-itemsets {b42, b12}(2), {b42, b26}(2), {b42, b32}(2), {b44, b11}(2), {b44, b25}(2), {b44, b37}(2).

5. Candidate 3-itemsets: applying the Apriori algorithm, we have {b42, b12, b26}, {b42, b12, b32}, {b42, b26, b32}, {b44, b11, b25}, {b44, b11, b37}, {b44, b25, b37}.

6. Large 3-itemsets: we scan the database and count the support of each candidate 3-itemset, giving the large 3-itemsets {b42, b12, b26}(2), {b42, b12, b32}(2), {b42, b26, b32}(2), {b44, b11, b25}(2), {b44, b11, b37}(2), {b44, b25, b37}(2).

7. Candidate 4-itemsets: applying the Apriori algorithm, we have {b42, b12, b26, b32} and {b44, b11, b25, b37}.

8. Large 4-itemsets: we scan the database and count the support of each candidate 4-itemset, giving the large 4-itemsets {b42, b12, b26, b32}(2) and {b44, b11, b25, b37}(2).

9. The set of candidate 5-itemsets is empty, so the large-itemset generation terminates.

10. Derive the association rules. Applying domain knowledge, we are interested only in rules in which band1, band2 and band3 appear in the antecedent and band4 in the consequent. We derive the following association rules:
b12 ∧ b26 ∧ b32 ⇒ b42, with support = 40% and confidence = 100%;
b11 ∧ b25 ∧ b37 ⇒ b44, with support = 40% and confidence = 66.7%.
In precision agriculture, the user's domain knowledge helps eliminate unnecessary steps in deriving rules. For example, we need not compute the confidence of rules such as b11 ⇒ b44 ∧ b25 ∧ b37, b25 ⇒ b44 ∧ b11 ∧ b37, b37 ⇒ b44 ∧ b11 ∧ b25, b11 ∧ b25 ⇒ b44 ∧ b37, b11 ∧ b37 ⇒ b44 ∧ b25, and b44 ⇒ b11 ∧ b25 ∧ b37. For each large 4-itemset we only need to compute one rule out of eight, an improvement of (1 - 1/8) = 87.5%.
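As a sanity check, the supports and confidences in step 10 can be recomputed directly from the rows of Figure 3; a throwaway sketch:

rows = [  # Figure 3, keeping only the columns that are large 1-itemsets
    {"b11", "b25", "b37", "b44"},  # pixel 1
    {"b11", "b25", "b37", "b44"},  # pixel 2
    {"b11", "b25", "b37", "b43"},  # pixel 3
    {"b12", "b26", "b32", "b42"},  # pixel 4
    {"b12", "b26", "b32", "b42"},  # pixel 5
]

def support(itemset):
    return sum(itemset <= r for r in rows) / len(rows)

for x, y in [({"b12", "b26", "b32"}, {"b42"}),
             ({"b11", "b25", "b37"}, {"b44"})]:
    print(support(x | y), support(x | y) / support(x))
# 0.4 1.0       -> support 40%, confidence 100%
# 0.4 0.666...  -> support 40%, confidence 66.7%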

5. CONCLUSION

In this paper, we defined a new data mining problem, mining association rules from imagery data, and its application in precision agriculture. Since the efficiency of a mining algorithm is a very important issue in data mining, we proposed two simple and effective pruning techniques for candidate 2-itemset generation. Our performance analysis showed that, by exploiting the nature of the problem and the characteristics of imagery data, we can prune a significant number of unnecessary candidate itemsets in the very early phase of the mining process. We presented an algorithm that applies the new pruning techniques to the base algorithm.

6. REFERENCES
[AIS93] R. Agrawal, T. Imielinski and A. Swami, Mining Association Rules Between Sets of Items in Large Databases. Proc. ACM SIGMOD International Conference, May 1993.
[ASR94] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules. Proc. International Conference on Very Large Data Bases, Santiago, Chile, September 1994.
[CHY96] M.-S. Chen, J. Han and P.S. Yu, Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
[CNF96] D.W. Cheung, V.T. Ng, A.W. Fu and Y.J. Fu, Efficient Mining of Association Rules in Distributed Databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
[DSE] S.E. Dreyfus, The Art and Theory of Dynamic Programming. Academic Press.
[FUM96] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[HKK96] E.-H. Han, G. Karypis and V. Kumar, Clustering Based on Association Rule Hypergraphs. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, May 1997.
[PCY97] J.S. Park, M.-S. Chen and P.S. Yu, Using a Hash-Based Method with Transaction Trimming for Mining Association Rules. IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, September/October 1997.
[SRI96] R. Srikant and R. Agrawal, Mining Quantitative Association Rules in Large Relational Tables. Proc. ACM SIGMOD International Conference, June 1996.
[WAY91] J. Way and E.A. Smith, The Evolution of Synthetic Aperture Radar Systems and Their Progression to the EOS SAR. IEEE Transactions on Geoscience and Remote Sensing, 1991.
[YSJ96] S.-J. Yen and A.L.P. Chen, An Efficient Approach to Discovering Knowledge from Large Databases. Fourth International Conference on Parallel and Distributed Information Systems (PDIS), 1996.