Volume 5, Issue 6, June 2015

ISSN: 2277 128X

International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com

Association Rule Mining for Ground Water and Wastelands Using Apriori Algorithm: Case Study of Jodhpur District

Mainaz Faridi* Department of Computer Science Banasthali University, India

Seema Verma Department of Electronics Banasthali University, India

Saurabh Mukherjee Department of Computer Science Banasthali University, India

Abstract— Advances in data collection and storage techniques have made it possible to collect and store terabytes of data every day. This large volume of data hides meaningful and interesting information that needs to be brought to light, which has made data mining one of the most intensively researched domains of recent years. The primary goal of data mining is to uncover non-trivial, previously unknown and hidden information in large data repositories and data warehouses. Data mining applied to spatial data sets is called Spatial Data Mining or Geographic Data Mining; it can be used to characterize spatial data, to interrelate spatial and non-spatial data and to reveal hidden spatial patterns. Data mining offers many methods for discovering previously unseen patterns and trends, such as clustering, classification, prediction, regression, outlier detection and association rule mining. In this research paper, the authors propose to mine association rules between ground water and wastelands using spatial data mining techniques. Salt-affected wastelands and wastelands without scrub that have a high ground water level underneath can be irrigated using this water, thereby increasing the area under cultivation.

Keywords— Spatial Data Mining, Association Rule Mining, Apriori Algorithm, Wastelands, Ground Water.

I. INTRODUCTION
WRIS and BOOSAMPDA are two major projects run by ISRO (Indian Space Research Organization) and NRSC (National Remote Sensing Centre). They provide country-wide information on ground water and on land cover across India, respectively, in the form of maps, and produce a huge amount of data related to ground water and land cover [1].
The tremendous volume of numeric and geospatial data stored in different formats, databases and data repositories calls for a wide range of tools and techniques to analyze and query the data, uncover patterns, or even predict phenomena in cases where human intelligence alone is not sufficient [2]. New technologies and methods are needed to explore these large databases for hidden and implicit knowledge, special patterns, or correlations between spatial and non-spatial attributes [3]. Recent research on knowledge discovery in large spatial databases has laid a foundation for spatial data mining techniques.

A. Spatial data mining
Spatial data mining, i.e. the discovery of interesting, implicit knowledge in spatial databases, provides a means for understanding and using spatial data and knowledge bases. It is also referred to as Geographical Data Mining [4] and Knowledge Discovery in Spatial Databases [5]. The main difference from conventional data mining is that spatial data mining tasks use spatial attributes in addition to the non-spatial attributes that are usual when mining non-spatial data. Traditional data mining assumes little or no dependence between the studied variables and lacks the ability to correlate non-spatial attributes with spatial information [6]. Revealing interesting and potentially useful patterns hidden in large spatial datasets is much more complex than extracting the corresponding patterns from conventional numeric and categorical data sets. The complexity of spatial data types, spatial relationships and the autocorrelation of spatial attributes account for this difficulty [7].

B. Association Rule Mining using Apriori Algorithm
Association Rule Mining (ARM) is an important and widely used technique of data mining.
It is one of the most extensively studied methods of data mining and has a wide range of application areas. The most common example is market basket analysis, where associations between different consumer products are identified to support business and marketing decisions. Other application domains that provide large data sets suitable for ARM are finance, insurance, banking, fraud detection, medicine, bioinformatics, demographic studies, telecommunication, GIS, remote sensing, e-commerce and retailing. More recently, association rule mining has also been applied to areas such as pharmaceutics, law and justice, aviation management, agriculture and weather forecasting.

Let D be a database of N transactions, and let X and Y be disjoint itemsets, i.e. their intersection is empty (X ∩ Y = ∅). An association rule is written in the form X → Y, where X is the antecedent (left-hand side of the rule) and Y is the consequent (right-hand side). Both the antecedent and the consequent may contain more than one item. The strength and reliability of an association rule are measured by two factors: support and confidence.

© 2015, IJARCSSE All Rights Reserved


Faridi et al., International Journal of Advanced Research in Computer Science and Software Engineering 5(6), June- 2015, pp. 751-758

Support (prevalence) is the percentage of database transactions that contain both X and Y; it can be viewed as the probability that X and Y occur together. With σ(X ∪ Y) denoting the number of transactions that contain every item of X and of Y, and N the total number of transactions in the database, the support s of the rule X → Y is calculated as:

$$\mathrm{Support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N} \qquad (1)$$

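Confidence (predictability) is the percentage of database transactions containing X that also contain Y. In other words, it can be seen as the conditional probability P(Y|X). With σ(X) denoting the number of transactions that contain X, the confidence c of the rule X → Y is calculated as:

$$\mathrm{Confidence}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} \qquad (2)$$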
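As an illustration of equations (1) and (2), the following minimal Python sketch computes support and confidence over a handful of transactions. The function names and the four toy transactions are hypothetical and are not taken from the study's data; only the attribute naming style follows the paper.

```python
def support_count(transactions, itemset):
    """sigma(itemset): number of transactions containing every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, x, y):
    """Equation (1): sigma(X u Y) / N."""
    return support_count(transactions, set(x) | set(y)) / len(transactions)


def confidence(transactions, x, y):
    """Equation (2): sigma(X u Y) / sigma(X).

    Raises ZeroDivisionError if X occurs in no transaction.
    """
    return support_count(transactions, set(x) | set(y)) / support_count(transactions, x)


# Hypothetical toy transactions, one per map polygon.
transactions = [
    {"Taluk=Bilara", "WasteLandType=Land without scrub", "GroundWaterType=Very good to good"},
    {"Taluk=Bilara", "WasteLandType=Land without scrub", "GroundWaterType=Good"},
    {"Taluk=Jodhpur", "WasteLandType=Land without scrub", "GroundWaterType=Very good to good"},
    {"Taluk=Phalodi", "WasteLandType=Salt Affected Land", "GroundWaterType=Good"},
]

x = {"Taluk=Bilara"}
y = {"WasteLandType=Land without scrub"}
s = support(transactions, x, y)      # 2 of 4 transactions contain X and Y
c = confidence(transactions, x, y)   # both transactions with X also contain Y
print(f"support={s:.2f} confidence={c:.2f}")
```

In this toy data the rule Taluk=Bilara → WasteLandType=Land without scrub holds in two of four transactions, and in every transaction that contains the antecedent.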

Support gives statistical significance to a rule: if it is too low, the rule may have occurred merely by chance. Confidence, on the other hand, measures the reliability or predictability of the rule: if it is high, one can infer with little risk that Y is present in transactions containing X. Therefore, to select only rules of high interestingness, thresholds are set on the support and confidence values, called minsup and minconf, respectively. Generally a low minsup and a high minconf are set to ensure that all possibly interesting rules are mined.

Association rules are mined in two phases. In the first phase (frequent itemset generation), all itemsets whose support is at least minsup are found; such itemsets are called frequent itemsets. In the second phase (rule generation), the rules that satisfy the minconf threshold are extracted from the frequent itemsets [8].

1) Apriori Algorithm: Many algorithms have been proposed for association rule mining, but the most prominent remains the Apriori algorithm, proposed by Agrawal and Srikant in 1994 [9]. It is still widely studied many years after its introduction. Many improvements and extensions have been proposed for it, but its applicability to many areas has yet to be exploited. The Apriori algorithm is based on the downward closure property, also called the anti-monotone property. Generating frequent itemsets by searching all possible itemsets would require counting every candidate against the whole database, so the anti-monotone property is used to reduce the number of candidate itemsets. It states that if an itemset is frequent then all of its subsets are also frequent or, equivalently, if an itemset is not frequent then none of its supersets is frequent. Let I be the set of all items and P its power set. Reference [8] shows that a measure f is anti-monotone if ∀ X, Y ∈ P: (X ⊆ Y) → f(Y) ≤ f(X). Support is such a measure.
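To make the role of the anti-monotone property concrete, the following is a minimal Python sketch of level-wise frequent itemset generation. It is an illustration under stated assumptions, not the implementation used in this study (which relied on Weka): the function name `apriori`, the parameter `min_count` (an absolute support count) and the toy transactions are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unite pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step (anti-monotone property): a candidate with any
        # infrequent (k-1)-subset cannot be frequent, so drop it uncounted.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # One pass over the data to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data: five transactions over the items a, b, c.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(toy, min_count=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum count of 3, every single item and every pair is frequent in the toy data, while the triple {a, b, c} occurs only twice and is rejected after counting.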
The Apriori algorithm searches the candidate itemsets breadth-first. It uses frequent itemsets of length k-1 to generate candidate itemsets of length k (join step). It then applies the anti-monotone property to discard candidates that have an infrequent subset (prune step) and counts the support of the remaining candidates to obtain the frequent itemsets. Association rules of the form X → Y-X are generated from each frequent itemset Y and each non-empty proper subset X of Y. Rules whose confidence does not satisfy the minconf threshold are dropped and only the remaining strong rules are kept.

2) Pseudocode: The pseudocode for the algorithm is as follows:

ALGORITHM. Apriori
Input: D, a database of transactions; minsup, the minimum support count threshold.
Output: L, the frequent itemsets in D.

    L1 = {frequent 1-itemsets};
    for (k = 2; Lk-1 != ∅; k++) {
        // join Lk-1 with itself (Lk-1 x Lk-1), then eliminate any candidate
        // that has a (k-1)-subset which is not frequent
        Ck = candidates generated from Lk-1;
        for each transaction t in database D do {
            // increment the count of all candidates in Ck that are contained in t
        } // end for each
        Lk = candidates in Ck with count >= minsup;
    } // end for
    return ⋃k Lk;

II. AIM AND OBJECTIVES
Land and water are undoubtedly the two major natural resources essential for the very existence of life. With the increase in population, the demand for land has risen manyfold. The objective of the study is therefore to find barren lands that have a substantial ground water level, so that these lands can be used for cultivating crops and fodder for animals. The study aims to discover association rules between ground water and wastelands of Jodhpur District. The outcomes reveal useful patterns that relate ground water and wastelands.

III. RESEARCH METHODOLOGY
A. Study Area
Jodhpur district lies in the arid zone of Rajasthan, between 25° 51’ 08” and 27° 37’ 09” North latitude and 71° 48’ 09” and 73° 52’ 06” East longitude. It covers 11.60% of the total arid area of the state.
Jodhpur district, part of Jodhpur Division, covers a geographical area of 2256405 hectares and is divided into 5 sub-divisions: Jodhpur, Shergarh, Pipar City, Osian and Phalodi. The district has 7 tehsils and 9 blocks. It is bounded by Bikaner in the north, Nagaur in the east, Jaisalmer in the west, and Barmer and Pali in the south.


Fig. 1. Map showing study area location

B. Data Collection
The study required information about land use, ground water and soil in the study area in GIS format. The data were collected from the Indian Space Research Organization (ISRO) Jodhpur Center, which provided the land use, ground water and soil data for Jodhpur district for the year 2005 in GIS format. The datasets used in this study and their basic characteristics are briefly described as follows:
1) Land Use Data of Jodhpur District: The land use map of Jodhpur divides the land into Agricultural Land, Built-up, Forest, Waste-land, Water bodies and Wetlands.
2) Ground Water Data of Jodhpur District: Jodhpur District is classified into regions according to the level and quality of ground water, viz. Good, Good but saline, Good to Moderate, Moderate, Moderate to Poor, Poor, Poor to Nil, Saline, Settlement, Very Good to Good and Water Body mask.

C. Tools/Software Used
ArcMap 10 is used for creating thematic maps and overlays. Weka 3.6 is used for generating association rules.

D. Methods
The methodology developed for this study is shown in figure 2. Each block represents a processing step on the way to the final output.

Fig. 2. Overall approach of the study.

1) Pre-processing of Data: The spatial datasets are pre-processed to create a transactional database before association rule mining can be applied. The pre-processing of spatial data may include selection of non-spatial attributes, feature selection, dimension reduction, join, union or intersection operations, data categorization etc. [10]. The study required two types of data set, one for ground water and one for waste lands. The pre-processing was carried out in three steps:


a. A thematic layer with the required attributes is created for the waste land data.
b. A thematic layer with the required attributes is created for the ground water data.
c. An intersection is performed on the waste land and ground water layers, and a new thematic layer is created showing those areas of Jodhpur district that are either salt-affected waste lands or waste lands without scrub and have good ground water beneath.

The details of the above pre-processing steps are as follows:

a. Thematic Layers for Waste Land
The land use data of Jodhpur district, as provided by the ISRO Center, Jodhpur, classifies land use into the following types: Agricultural Land, Built-up, Forest, Waste-land, Water bodies and Wetlands. Table I shows the land use pattern in order of decreasing area and figure 3 shows the land use map.

Table I: Land Use Pattern
Land Type        Area (Hectares)
Agriculture      1940925.7
Waste-lands      675378.7
Built-up         29594
Water bodies     20406.8
Forest           14164.8
Wetlands         6110.4

Fig. 3. Land Use map of Jodhpur District

Of all the land classes above, the study focuses on waste-lands only. Therefore, to obtain the waste-land distribution pattern, a new thematic layer showing only waste lands is prepared. Figure 4 and table II show this layer. Within it, waste lands are further classified into Sandy-desertic Land, Salt Affected Land, Mining/Industrial waste, Land without scrub, Land with scrub, Gullied/Ravinous Land and Barren Rocky/Stony waste.

Table II: Waste Land Pattern
Waste-land Type             Area (Hectares)
Sandy-desertic Land         213737.9
Land without scrub          155328.8
Land with scrub             154027
Barren Rocky/Stony waste    141733.4
Mining/Industrial waste     4017.7
Salt Affected Land          3716.7
Gullied/Ravinous Land       2816.9

Fig. 4. Waste Land distribution of Jodhpur District

Among all types of waste land, only those that are either salt affected or without scrub are chosen for further study. The reason is that the other types either already carry some vegetation (Land with scrub) or are not suitable for growing any type of vegetation (Sandy-desertic Land, Mining/Industrial waste, Gullied/Ravinous Land, Barren Rocky/Stony waste). Therefore, a new thematic layer for “Land without scrub” and “Salt Affected Land” is created. Figure 5 and table III show this layer.

Table III: Waste Land (Salt affected/ Without Scrub)
Waste-land Type       Area (Hectares)
Land without scrub    155328.80
Salt Affected Land    3716.73

Fig. 5. Waste Land (Salt affected/Without Scrub) distribution.

Thus, the above process can be summarized as:

Fig. 6. Thematic layers of Land use data

b. Thematic Layers for Ground Water
The ground water data, as provided by the ISRO Center, Jodhpur, is classified into types such as Good, Good but saline, Good to Moderate, Moderate, Moderate to Poor, Poor, Poor to Nil, Saline, Settlement, Very Good to Good and Water Body mask. Jodhpur District is divided into regions according to this classification. The resulting distribution of ground water is shown in figure 7 and table IV.

Table IV: Ground water Pattern
Ground Water         Area (Hectares)
Good                 40115.58
Good but Saline      11168.37
Good to moderate     27975.06
Moderate             582198.82
Moderate to Poor     1460483.16
Poor                 313362.70
Poor to Nil          98935.60
Saline               3345.65
Settlement           31270.99
Very good to good    266028.01
Water Body Mask      21649.42

Fig. 7. Ground water distribution of Jodhpur.

Of these regions, only those with Good, Good but saline, Good to Moderate and Very Good to Good ground water are selected. As the next step, a new thematic layer for ground water is created containing only the selected classes, as shown in figure 8 and table V.

Table V: Good ground water Pattern
Ground Water         Area (Hectares)
Good                 40115.58
Good but Saline      11168.37
Good to moderate     27975.07
Very good to good    266028.01

Fig. 8. Good ground water distribution of Jodhpur District.

Thus, the above process can be summarized as shown in figure 9:

Fig. 9. Thematic layers of Ground water data.

c. Overlays and Intersection of Thematic Layers
As the next step, an overlay map of waste lands (salt affected and without scrub) and good ground water is created. An overlay operation is much more than a simple merging of linework: all the attributes of the features taking part in the overlay are carried through, as shown in figure 10, where wastelands (polygons) and good ground water (polygons) are overlaid to create a new polygon layer.

Fig. 10. Overlay Map of Wasteland (Salt affected/ Without Scrub) and Good Ground Water.

Then, using intersection, a new layer is created for those areas of the district where waste lands that are salt affected or without scrub have good ground water beneath. The newly constructed layer is shown in figure 11. Table VI shows the area under the mined pattern.

Table VI: Area under mining pattern.
Waste Land            Area (Hectares)
Land without scrub    13308.98
Salt Affected Land    329.96
Total                 13638.94
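The figures in Table VI can be cross-checked against Table III with a few lines of arithmetic. The sketch below, in Python, uses only areas printed in those two tables; the variable names are illustrative, and the percentages are derived here rather than reported in the study.

```python
# Areas in hectares, copied from Table III (selected waste lands) and
# Table VI (the part of them lying over good ground water).
selected = {"Land without scrub": 155328.80, "Salt Affected Land": 3716.73}
with_good_water = {"Land without scrub": 13308.98, "Salt Affected Land": 329.96}

# Total area under the mined pattern (the "Total" row of Table VI).
total = round(sum(with_good_water.values()), 2)

# Share of each selected waste-land type that has good ground water beneath.
share = {k: 100 * with_good_water[k] / selected[k] for k in selected}

print(f"total area under pattern: {total} ha")
for name, pct in share.items():
    print(f"{name}: {pct:.2f}% of the selected area")
```

The sum reproduces the total row of Table VI, and the shares indicate that somewhat under a tenth of each selected waste-land type lies over good ground water.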

Fig. 11. Intersect Map of Wastelands (Salt affected/Without Scrub) and Good Ground Water.

2) Association Rules Generation: Weka 3.6 is used to generate the association rules. The database file obtained from the above map (figure 11) is converted into ARFF format, on which association rules are generated using the Apriori algorithm.

IV. RESULTS AND DISCUSSION
The Apriori algorithm was run in Weka on the ARFF file created after pre-processing. Three attributes from the database file, viz. Taluk, WasteLandType and GroundWaterType, were chosen as predicates. From a total of 285 instances, 6 itemsets of size 1, 7 itemsets of size 2 and 2 itemsets of size 3 were discovered in 17 cycles. The minimum support and minimum confidence were set to 15% (0.15) and 90% (0.9), respectively. Tables VII, VIII and IX show the large itemsets found in the data.
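The conversion of the attribute table to ARFF mentioned above can be sketched in a few lines of Python. This is an assumed reconstruction of that step, not the authors' procedure: the helper `to_arff`, the relation name and the three example rows are hypothetical, and only the three attribute names come from the paper. Nominal values that contain spaces are wrapped in single quotes, as the ARFF format requires.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text with one nominal attribute per column."""

    def quote(value):
        # ARFF needs quotes around values containing spaces.
        if " " in value:
            return "'" + value.replace("'", "\\'") + "'"
        return value

    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(attributes):
        # The nominal domain is the sorted set of values seen in that column.
        values = sorted({row[i] for row in rows})
        lines.append(f"@attribute {name} {{{','.join(quote(v) for v in values)}}}")
    lines += ["", "@data"]
    lines += [",".join(quote(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical rows, one per polygon of the intersect layer.
rows = [
    ("Bilara", "Land without scrub", "Very good to good"),
    ("Bilara", "Land without scrub", "Good"),
    ("Phalodi", "Salt Affected Land", "Good"),
]
arff = to_arff("wasteland_groundwater", ["Taluk", "WasteLandType", "GroundWaterType"], rows)
print(arff)
```

Because every attribute is declared nominal, a file of this shape can be passed to an association rule learner without further discretization.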


Table VII. Large Itemsets L(1)
Item 1                               Count
Taluk=Bilara                         99
Taluk=Jodhpur                        76
Taluk=Phalodi                        61
WasteLandType=Land without scrub     280
GroundWaterType=Very good to good    181
GroundWaterType=Good                 44

Table VIII. Large Itemsets L(2)
Item 1                              Item 2                               Count
Taluk=Bilara                        WasteLandType=Land without scrub     99
Taluk=Bilara                        GroundWaterType=Very good to good    81
Taluk=Jodhpur                       WasteLandType=Land without scrub     76
Taluk=Jodhpur                       GroundWaterType=Very good to good    75
Taluk=Phalodi                       WasteLandType=Land without scrub     57
WasteLandType=Land without scrub    GroundWaterType=Very good to good    181
WasteLandType=Land without scrub    GroundWaterType=Good                 44

Table IX. Large Itemsets L(3)
Item 1           Item 2                              Item 3                               Count
Taluk=Bilara     WasteLandType=Land without scrub    GroundWaterType=Very good to good    81
Taluk=Jodhpur    WasteLandType=Land without scrub    GroundWaterType=Very good to good    75
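The rule generation phase can be retraced from the itemset counts in Tables VII, VIII and IX. The Python sketch below is an illustration rather than the Weka implementation: it applies the confidence formula of equation (2) to every split of each large itemset and keeps the rules that reach the 90% threshold stated above. The counts, the 285 instances and the thresholds are taken from the paper; the variable and function names are hypothetical.

```python
import math
from itertools import combinations

N, MINSUP, MINCONF = 285, 0.15, 0.90
LWS = "WasteLandType=Land without scrub"
VGG = "GroundWaterType=Very good to good"
GOOD = "GroundWaterType=Good"
BIL, JOD, PHA = "Taluk=Bilara", "Taluk=Jodhpur", "Taluk=Phalodi"

# Support counts copied from Tables VII, VIII and IX.
count = {frozenset(k): v for k, v in [
    ((BIL,), 99), ((JOD,), 76), ((PHA,), 61), ((LWS,), 280), ((VGG,), 181), ((GOOD,), 44),
    ((BIL, LWS), 99), ((BIL, VGG), 81), ((JOD, LWS), 76), ((JOD, VGG), 75),
    ((PHA, LWS), 57), ((LWS, VGG), 181), ((LWS, GOOD), 44),
    ((BIL, LWS, VGG), 81), ((JOD, LWS, VGG), 75),
]}

# 15% of 285 instances is 42.75, so an itemset needs at least this many instances.
min_count = math.ceil(MINSUP * N)

rules = []
for itemset, n in count.items():
    for size in range(1, len(itemset)):
        for body in map(frozenset, combinations(sorted(itemset), size)):
            conf = n / count[body]          # equation (2)
            if conf >= MINCONF:
                rules.append((body, itemset - body, n, conf))

for body, head, n, conf in sorted(rules, key=lambda r: -r[3]):
    print(f"{', '.join(sorted(body))} ==> {', '.join(sorted(head))}  "
          f"count={n}  conf={round(100 * conf)}%")
```

Run on these counts, the sketch keeps ten rules, the same number as listed in Table X. For example, Taluk=Phalodi ==> WasteLandType=Land without scrub has confidence 57/61, which rounds to 93%, and Taluk=Jodhpur ==> GroundWaterType=Very good to good has confidence 75/76, which rounds to 99%.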

The best rules found after applying the Apriori algorithm are listed in table X below.

Table X. Association Rules Mined for Ground Water and Waste Lands of Jodhpur District.
S.No.  Body                                                Implies  Head                                                                 Support  Conf %
1.     GroundWaterType=Very good to good                   ==>      WasteLandType=Land without scrub                                     81       100
2.     Taluk=Bilara                                        ==>      WasteLandType=Land without scrub                                     99       100
3.     Taluk=Bilara, GroundWaterType=Very good to good     ==>      WasteLandType=Land without scrub                                     81       100
4.     Taluk=Jodhpur                                       ==>      WasteLandType=Land without scrub                                     76       100
5.     Taluk=Jodhpur, GroundWaterType=Very good to good    ==>      WasteLandType=Land without scrub                                     75       100
6.     GroundWaterType=Good                                ==>      WasteLandType=Land without scrub                                     44       100
7.     Taluk=Jodhpur                                       ==>      GroundWaterType=Very good to good                                    76       99
8.     Taluk=Jodhpur, WasteLandType=Land without scrub     ==>      GroundWaterType=Very good to good                                    76       99
9.     Taluk=Jodhpur                                       ==>      WasteLandType=Land without scrub, GroundWaterType=Very good to good  76       99
10.    Taluk=Phalodi                                       ==>      WasteLandType=Land without scrub                                     61       93

Results show that 13638.94 hectares of land fall under the mined pattern. The analysis of the results is shown as a graph in figure 12. Bilara has the largest area of waste lands matching the mined pattern (6481.05 hectares). The mined area is substantial and can be used for growing vegetation with the water underneath. The same results were obtained by implementing the Apriori algorithm used in WEKA in the authors' own Java code.


Fig. 12. Graph showing distribution of Wastelands in taluks of Jodhpur District.

V. CONCLUSION
The analysis of the pattern shows that the majority of wastelands without scrub that have very good ground water lie in the Bilara region of Jodhpur District. Having a good amount of water underneath, these lands can be used to produce firewood and fodder for animals. Plant species like Acacia jacquemontii, Acacia leucophloea, Acacia senegal, Albizia lebbeck, Azadirachta indica, Anogeissus rotundifolia, Prosopis cineraria, Salvadora oleoides, Tecomella undulata, Tamarix articulata, Leucaena leucocephala, Tephrosia purpurea and Crotalaria medicaginea can be grown. Farmers can be advised to cultivate crops using ground water irrigation: if a piece of land is known to have a good ground water level, it can be irrigated using this water. Even if the water underneath is saline, salt-resistant plant species can still be grown. In this way waste-lands can be utilized effectively.

VI. FUTURE WORK
A wide variety of research is being carried out in the field of spatial data mining.
• As the next level of this research, Fuzzy Spatial Association Rules could be determined.
• Soil and crop data could also be used along with the ground water and wasteland data.
• Spatio-temporal association rules could also be determined as an extension of the current research.
Hence, much research remains to be carried out in these emerging areas, focusing on their applicability to agriculture, data mining and GIS, which will provide means for better utilization of natural resources.

ACKNOWLEDGMENT
The authors would like to thank ISRO, Jodhpur Centre for providing the necessary data about the research scenario.

REFERENCES
[1] Mainaz Faridi, Seema Verma and Saurabh Mukherjee. 2012. Impact of ground water level and its quality on fertility of land using GIS and Agriculture Business Intelligence. In Proceedings of Geomatrix’12- An International Conference on Geospatial Technologies and Applications, IIT Bombay (Feb 2012).
[2] Yuan, May, B. Buttenfield, M. Gahegan, and Harvey Miller. 2004. Geospatial data mining and knowledge discovery. Chapter 14 (2004): 365-388.
[3] Krzysztof Koperski, and Jiawei Han. 1995. Discovery of spatial association rules in geographic information databases. Advances in spatial databases, Springer Berlin Heidelberg. vol 6, 47-66.
[4] Stan Openshaw. 1999. Geographical data mining: key design issues. In Proceedings of GeoComputation, vol. 99.
[5] Krzysztof Koperski, Jiawei Han, and Nebojsa Stefanovic. 1998. An efficient two-step method for classification of spatial data. In Proceedings of International Symposium on Spatial Data Handling (SDH 1998), Vancouver, BC, Canada. 45-54.
[6] Hong Tang and Simon McDonald. 2002. Integrating GIS and spatial data mining technique for target marketing of university courses. In ISPRS Commission IV, Symposium, Ottawa Canada, (Jul 2002).
[7] D. Rajesh. 2011. Application of Spatial Data Mining for Agriculture. International Journal of Computer Applications 15,2 (2011), 7-9.
[8] Tan, Pang-Ning, and Vipin Kumar. 2005. Chapter 6. Association Analysis: Basic Concepts and Algorithms. Introduction to Data Mining. Addison-Wesley. ISBN 321321367 (2005).
[9] Rakesh Agrawal, and Ramakrishnan Srikant. 1994. Fast algorithms for mining association rules. In Proceedings of 20th Int. Conf. Very Large Data Bases, VLDB, (1994), vol. 1215, 487-499.
[10] Chen, Junming, Guangfa Lin, and Zhihai Yang. 2011. Extracting spatial association rules from the maximum frequent itemsets based on Boolean matrix. In Geoinformatics, 2011 19th International Conference on, IEEE (2011), 1-5.
