Visualization to assist users in association rule mining tasks

Claudio H. Yamamoto¹, Maria Cristina F. de Oliveira¹

¹Instituto de Ciências Matemáticas e de Computação – Universidade de São Paulo (USP)
Caixa Postal 668 – 13.560-970 – São Carlos – SP – Brazil

{haruo,cristina}@icmc.usp.br

Degree: Doctor’s Degree
Graduate Degree Program: Computer Science and Computational Mathematics
University: Universidade de São Paulo
Admission: March, 2004
Conclusion date (estimated): March, 2008

Abstract. Since the definition of the association rule mining problem in Agrawal’s 1993 paper, several efficient algorithms have been introduced to tackle the problem. Nevertheless, most approaches focus on increasing algorithm efficiency, neglecting the fact that association rule mining poses many practical difficulties to users. Mining useful information is not straightforward, and neither is interpreting the set of mined rules or establishing their actual relevance to the knowledge extraction problem at hand. In the last five to six years, many approaches to mining association rules based on statistics, pruning, visual techniques and combinations of these have been introduced in the literature. However, no general approach has been introduced in which the user can understand and control the process by interacting with the analytical association rule mining algorithm during its execution. In this work, we propose a dynamical framework for association rule mining that integrates interactive visualization techniques in order to allow users to drive the association rule finding process, giving them control and visual cues to ease understanding of both the process and its results.

∗ This work is sponsored by CNPq and FAPESP.

1. Introduction

First introduced by Agrawal [Agrawal et al. 1993], association rule mining is one of the most important data mining tasks. Let I = {I1, I2, ..., In} be a set of items, and T a transaction database in which each transaction contains a set of items from I. The task consists of obtaining a set of rules of the form X → Y such that X, Y ⊂ I, and X ∩ Y = ∅. The rules must satisfy a minimum support constraint s, meaning that at least s transactions of the database must contain X ∪ Y, and a minimum confidence constraint c, meaning that at least c% of the transactions that contain X also contain Y.

Even though many algorithms have been introduced to perform this task, e.g. [Agrawal et al. 1993, Agrawal and Srikant 1994, Brin et al. 1997, Han et al. 2000, Houtsma and Swami 1995, Savasere et al. 1995, Zaki et al. 1997], several relevant problems require additional treatment. Firstly, it is very hard for analysts to infer the ideal settings for support and confidence levels [Ivkovic et al. 2003]. Secondly, association rule mining algorithms typically produce too many rules, their operation is not intuitive to many users, and interpreting the results can also be hard [Hofmann et al. 2000]. Thirdly, it may be very tedious for a user to find interesting knowledge in a corpus that holds hundreds or even thousands of rules [Blanchard et al. 2003]. Finally, even strong correlations between attributes are not always obvious from the discovered rules [Hofmann et al. 2000].

The aim of this work is to tackle the difficulties faced by users when performing association rule mining by providing a dynamical framework for association rule mining enhanced with interactive visual representations.
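The support and confidence constraints defined in this section can be made concrete with a small sketch. The function names, the toy transactions and the choice to express confidence as a fraction rather than a percentage are assumptions made for illustration; they are not part of the proposal.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support_count(itemset: FrozenSet[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(x: FrozenSet[str], y: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    count_x = support_count(x, transactions)
    if count_x == 0:
        return 0.0
    return support_count(x | y, transactions) / count_x


def satisfies(x: FrozenSet[str], y: FrozenSet[str],
              transactions: List[Transaction],
              min_support: int, min_confidence: float) -> bool:
    """Check a rule X -> Y against both constraints (X and Y must be disjoint)."""
    if x & y:
        raise ValueError("X and Y must be disjoint")
    return (support_count(x | y, transactions) >= min_support
            and confidence(x, y, transactions) >= min_confidence)


# Hypothetical toy database with four transactions.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]
x, y = frozenset({"bread"}), frozenset({"milk"})
print(support_count(x | y, transactions), confidence(x, y, transactions))
```

Note that the support is counted on the union X ∪ Y, since X and Y are disjoint by definition.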

2. Related Work

Several works reported in the literature tackle the problem of rendering knowledge discovery processes more intuitive to data analysts. They typically resort to visualization approaches, and can be categorized into three major groups based on the role played by visualization in the process: visual data exploration for mining, visualization of mining models and visual data mining. In the first group, we find techniques that handle large amounts of raw, usually multidimensional data. The goal is to help analysts inspect the data prior to applying any statistical analysis or mining task. The second group comprises techniques that present users with a visual representation of the patterns obtained from a data mining task. Finally, the third group includes approaches that aim at inserting the user into the mining loop, resorting to visual techniques to create representations of intermediate results at different stages of the mining effort. Here, users drive the process by interacting with the mining algorithm during its execution. A good survey of solutions that fall within these different approaches may be found elsewhere [de Oliveira and Levkowitz 2003].

Approaches targeted specifically at association rule mining that support user interaction to varying extents can be categorized into four main groups: (1) pre-processing approaches in which the user directs rule discovery, (2) approaches that allow user interaction during the association rule mining task, but provide no visual support, (3) approaches that support post-processing of rules, but provide no visual support, and (4) approaches that support post-processing of rules with visual support.

The first group includes approaches that allow processing by users prior to the execution of the rule mining algorithm. Yoon & Kerschberg [Yoon and Kerschberg 1998] propose a solution whose underlying rationale is to discover association rules based on users’ queries.
In Aggarwal & Yu’s approach [Aggarwal and Yu 1998], after a pre-processing phase the user can query the pre-processed data online and obtain an instantaneous answer, resembling the idea of OLAP (OnLine Analytical Processing). In Park’s work [Park et al. 1997], sampling is performed to adjust the support and confidence levels.

The second group comprises approaches that allow users to intervene in the online algorithm execution. Hidber [Hidber 1999] allows users to freely adjust the support and confidence measures during association rule mining. By keeping a superset of all large itemsets, the approach gives the correct result at the end of the execution. Goethals [Goethals and den Bussche 2001] allows users to specify conditions (queries) on the associations to be generated, in so-called sessions, also during algorithm execution.

The third group of user-driven solutions is characterized by many approaches targeted at assisting users in analyzing the rules obtained. Aimed at helping users gain additional insight, such approaches support rule filtering [Goethals and den Bussche 1999], rule clustering [Zhao et al. 2004], identification and removal of redundant rules [Domingues 2004], multi-parameter analysis of the rules [Melanda and Rezende 2004], summarization [Jorge 2004] and rule navigation [das Neves 2002]. In [Bayardo and Agrawal 1999], the authors reveal that the most interesting rules reside along a border determined by the best rule according to a variety of interest measures, such as confidence, support, gain, chi-squared value, gini, entropy gain, laplace, lift and conviction.

Approaches in the fourth group typically present users with some visual representation of the corpus of rules extracted. A range of visual representations is used, such as tables, two-dimensional matrices and variations [Chakravarthy and Zhang 2003, Wong et al. 1999], graphs and bar charts [Klemettinen et al. 1994], Mosaic Plots and Double Decker Plots [Hofmann et al. 2000], grids and variations [Ong et al. 2002], a defined visual metaphor [Blanchard et al. 2003], parallel coordinates [Yang 2005], or a combination of different techniques [Bruzzese and Davino 2003].

None of the above approaches gives users full awareness of the execution of the association rule mining algorithm. We believe that user awareness and control can be achieved with interactive visual techniques that allow users to interact with the association rule mining algorithm during its execution, e.g., by showing a visual preview of the rules being identified that may be enhanced by filtering, browsing, details on demand, etc.

3. Project Proposal

The difficulties discussed in the previous sections concerning the association rule mining problem lay out the basis for this proposal. Our goal is to revisit solutions for association rule mining and propose a dynamical framework that integrates interactive visualization techniques into the association rule mining task. Our proposal is based on two premises. One is that the most difficult step in association rule discovery is the first one – obtaining the most frequent itemsets. Therefore, considerable effort should be directed towards this step in order to ensure a solution that really meets users’ needs. The second premise is that, although there are many efficient algorithms to find association rules, major issues remain in encouraging and favoring user participation and in helping users gain a full understanding of the process and its results. Such issues must be tackled in order to significantly advance the usage of rule mining algorithms. Given these premises, we may introduce the work proposal, which rests on four main points.

1. Dynamic adjustment of the minimum support threshold – Many approaches attempt to break away from the black box paradigm of the association rule mining problem, but no general successful solution has yet been attained. The black box paradigm consists of blindly setting the initial values for parameters such as minimum support and minimum confidence at the beginning of the execution of an association rule mining algorithm. In this case, the user has no means to intervene in the algorithm execution or to adjust parameters during the process. Our goal is to give users a “feeling” of the execution before the algorithm stops and outputs the complete set of mined rules, coupled with the possibility of adjusting parameters dynamically during execution, so that users can make them more or less restrictive based on the partial information they get.
In order to allow for this dynamic adjustment, it is necessary to keep a superset of the large itemsets to preserve a correct answer, due to the anti-monotonicity property of the support. We intend to control the size of this superset by specifying suitable upper and lower bounds for the support. Since the support threshold is more likely to be correctly set as the process approaches its end (and unlikely to be drastically changed by the user, either up or down), we assume that these bounds can be narrowed as the process gets closer to its end.

2. Dynamical visualization of the space of items – The basic idea is to show a dynamical graph-based visualization, in which items are mapped to graph nodes and node positioning is determined by some metric. The principle of Hooke’s Law shall be used to dynamically map attraction and repulsion interactions between the items by associating virtual springs with the graph edges. A similar approach has been proposed before and used to draw aesthetically pleasing graphs [Eades 1984]. A possible node positioning metric is to associate the relative frequency of pairs of items in the transactions database with the length of the springs connecting the items. According to this approach, pairs of items with a high frequency of occurrence in the transactions database are pulled closer together than those with lower frequency. An alternative approach would be to use a distance metric to calculate the distance between items in a multi-dimensional space and generate a 2D visualization of the space using some dimension reduction technique. As the algorithm executes, nodes and springs are dynamically added to the visualization, which is changed by the forces acting on the springs until it reaches an equilibrium state. We assume that equilibrium will be reached at some point, provided that the springs can be stretched to infinity, that is, they do not break as a result of the acting forces.
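As a rough illustration of the node positioning metric described above, the sketch below counts how often each pair of items co-occurs and maps the relative frequency to a spring rest length. The linear mapping and the `min_len`/`max_len` parameters are assumptions chosen for illustration; the proposal does not fix a particular formula.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def pair_frequencies(transactions: List[Set[str]]) -> Dict[Tuple[str, str], float]:
    """Relative frequency of each unordered item pair over all transactions."""
    counts: Counter = Counter()
    for t in transactions:
        # Sorting gives every pair a canonical (a, b) order with a < b.
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()} if n else {}


def rest_length(freq: float, min_len: float = 20.0, max_len: float = 200.0) -> float:
    """Assumed linear mapping: frequent pairs get short springs, rare pairs long ones."""
    return max_len - freq * (max_len - min_len)


# Hypothetical toy database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
freqs = pair_frequencies(transactions)
lengths = {pair: rest_length(f) for pair, f in freqs.items()}
print(lengths)
```

With this mapping, a pair that appears together more often receives a shorter spring, so the layout draws those items closer together.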
Users may visualize and manipulate items during algorithm execution, and from their spatial distribution obtain insight that can prove useful for understanding the data and adjusting execution parameters. In an environment with multiple virtual springs, as is the case for the above visualization, the resulting force upon a node is the vector sum of all spring forces. Figure 1 exemplifies the influence of the spring forces on the nodes. Spring s1 is in equilibrium, so there are no forces from s1 influencing either A or B. Spring s2 is compressed, so there are two forces, namely F_CB and F_BC, acting on B and C, respectively, pushing them apart. Finally, s3 is stretched, so there are two forces, F_CA and F_AC, pulling A and C together. Note that C is affected by multiple forces; thus, its movement is given by the vector sum of all the forces acting upon it, giving F_R.

Figure 1. Example of an environment with many springs.
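The situation in Figure 1 can be reproduced numerically. The following sketch applies Hooke’s Law to each spring and sums the resulting vectors per node; the coordinates, rest lengths, spring constant and the simple relaxation loop are illustrative assumptions, not values taken from the figure.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Spring:
    a: str
    b: str
    rest: float      # rest (equilibrium) length
    k: float = 1.0   # spring constant


def net_forces(pos: Dict[str, Point], springs: List[Spring]) -> Dict[str, Point]:
    """Vector sum of all spring forces acting on each node."""
    forces = {n: (0.0, 0.0) for n in pos}
    for s in springs:
        (ax, ay), (bx, by) = pos[s.a], pos[s.b]
        dx, dy = bx - ax, by - ay
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            continue  # direction undefined for coincident nodes
        # Hooke's Law: positive when stretched (pull), negative when compressed (push).
        mag = s.k * (dist - s.rest)
        fx, fy = mag * dx / dist, mag * dy / dist
        forces[s.a] = (forces[s.a][0] + fx, forces[s.a][1] + fy)
        forces[s.b] = (forces[s.b][0] - fx, forces[s.b][1] - fy)
    return forces


def relax(pos: Dict[str, Point], springs: List[Spring],
          step: float = 0.1, iterations: int = 200) -> Dict[str, Point]:
    """Move every node a small step along its net force, repeatedly."""
    pos = dict(pos)
    for _ in range(iterations):
        f = net_forces(pos, springs)
        pos = {n: (x + step * f[n][0], y + step * f[n][1]) for n, (x, y) in pos.items()}
    return pos


# Assumed layout: s1 (A-B) at rest, s2 (B-C) compressed, s3 (A-C) stretched.
pos = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (4.0, 3.0)}
springs = [Spring("A", "B", rest=4.0), Spring("B", "C", rest=5.0), Spring("A", "C", rest=3.0)]
forces = net_forces(pos, springs)
print(forces["C"])  # resultant on C: push from s2 plus pull from s3
```

Because every spring contributes equal and opposite forces to its two endpoints, the forces over all nodes sum to zero; only the distribution of the nodes changes as the layout relaxes.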

In addition to the spring-based graph visualization tailored to represent item associations, other supporting visualizations may be integrated into the interactive framework. These visualizations shall be coordinated, whenever applicable [North and Shneiderman 2000].

3. Interaction techniques – Appropriate interaction and distortion techniques [Keim 2002] must be provided for users to interact with and query the visualizations. We consider integrating at least three standard interaction techniques into the environment: filtering, zooming and distortion. Filtering will allow exploring partitions of interest (subsets of items), so users can browse them, perform queries or ask the system to generate association rules involving the remaining (un-filtered) items. The underlying assumption is that users will tend to select items that are somehow spatially grouped in the spring-based graph visualization. For zooming, we consider allowing users to zoom in on an item, in order to observe the real distance from this particular item to others, by stretching or compressing the corresponding springs. An alternative approach for zooming is to indicate future trends of approach and separation of items, based on the n previous transactions (during the transactions database scan). For distortion, we consider merging two or more items into one when they are very close (that is, when the frequency of the pair of items is very high), e.g. when a region of the visualization is pointed at with the mouse. As additional features, users shall be allowed to preview changes in the visualization as a result of modifying support and confidence settings, and also to play a “movie” of the previous movements of items in the visualization to observe trends of approach or separation between items over a time interval.

4. Adjustment of the user interference level – It is common sense that computers and human beings have very different and complementary skills.
From this perspective, one may state that a “cooperative” work between human and computer is preferable over a totally manual (user controlled) or a totally algorithmic (computer controlled) solution to the association rule mining problem. For the visualization framework suggested in this paper it is our goal to implement different levels of “cooperation” between user and computer, as described below. A similar approach was adopted by Ankerst in handling classification tasks [Ankerst et al. 1999].

Level 0 – Based on different heuristics (to be defined) the computer sets and changes the support and confidence levels during algorithm execution. The user interacts with the visualization, but cannot affect algorithm execution.

Level 1 – The user defines the support and confidence levels prior to algorithm execution. The user interacts with the visualization, but cannot change settings during algorithm execution. This is a standard execution of the algorithm enhanced with visualization.

Level 2 – The user defines when he/she can change the support and confidence levels and when the computer can change these levels during algorithm execution.

Level 3 – The user defines and changes the support and confidence levels during algorithm execution. This is similar to the previous level; however, the user can change the support/confidence levels at all times and algorithm execution is affected by the setting.
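The dynamic threshold adjustment of point 1 and the cooperation levels of point 4 can be sketched together. Everything below is a hypothetical design: the class and function names, the `slack` parameter, the linear narrowing of the bounds and the reading of Level 2 as alternating "turns" are assumptions, since the proposal leaves the heuristics to be defined.

```python
from enum import IntEnum
from typing import Dict, FrozenSet, Iterable, List


class Level(IntEnum):
    COMPUTER = 0  # computer sets and changes the levels
    PRESET = 1    # user sets levels before execution; no changes afterwards
    SHARED = 2    # user decides when the user or the computer may change them
    USER = 3      # user may change the levels at all times


def may_adjust(level: Level, actor: str, user_turn: bool = True) -> bool:
    """Whether `actor` ('user' or 'computer') may change the threshold right now."""
    if level is Level.COMPUTER:
        return actor == "computer"
    if level is Level.PRESET:
        return False
    if level is Level.SHARED:
        return actor == ("user" if user_turn else "computer")
    return actor == "user"


class SupportWindow:
    """Keeps a superset of the large itemsets so the threshold can move within bounds."""

    def __init__(self, threshold: int, slack: int) -> None:
        self.threshold = threshold
        self.slack = slack
        self.lower = max(1, threshold - slack)
        self.upper = threshold + slack
        self.counts: Dict[FrozenSet[str], int] = {}

    def record(self, itemset: Iterable[str], count: int) -> None:
        # Itemsets below the lower bound are not kept at all.
        if count >= self.lower:
            self.counts[frozenset(itemset)] = count

    def narrow(self, progress: float) -> None:
        """Shrink the bounds as the scan progresses from 0.0 to 1.0."""
        progress = min(1.0, max(0.0, progress))
        remaining = int(self.slack * (1.0 - progress))
        # The lower bound never decreases: discarded itemsets cannot be recovered.
        self.lower = max(self.lower, self.threshold - remaining)
        self.upper = self.threshold + remaining
        self.counts = {s: c for s, c in self.counts.items() if c >= self.lower}

    def set_threshold(self, new: int, level: Level, actor: str,
                      user_turn: bool = True) -> bool:
        if not may_adjust(level, actor, user_turn):
            return False
        if not self.lower <= new <= self.upper:
            return False
        self.threshold = new
        return True

    def large_itemsets(self) -> List[FrozenSet[str]]:
        return [s for s, c in self.counts.items() if c >= self.threshold]


window = SupportWindow(threshold=5, slack=3)
for items, count in (({"a"}, 6), ({"b"}, 3), ({"c"}, 1)):
    window.record(items, count)
window.set_threshold(3, Level.USER, "user")  # lowering is safe down to the lower bound
print(window.large_itemsets())
```

Lowering the threshold is only answered correctly down to the lower bound, because itemsets below it were never retained; this is why the sketch refuses requests outside the current window.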

4. Methodology and Status of the Work

The methodology adopted for project development comprises four major steps: (1) definition and implementation of the resources required for the interactive rule mining framework using a platform/language with 2D technology support (e.g. Java); (2) execution of experiments with real users, observing and comparing their performance regarding the number of algorithm executions, the amount of time spent to mine rules, the level of comprehension of the rules and the level of user satisfaction with different approaches; (3) analysis of the experimental results; and (4) refinement of the framework, introducing improvements identified during the evaluation.

As of July, 2005, the required course credits have been completed. The work proposal is currently being refined as part of the preparation stage for the Qualifying Examination, with deadline in August, 2005.

5. Expected Results

Our thesis is that the proposed framework will ease the task of association rule mining by giving users greater control over the mining task and by improving their ability to interpret the rules, evaluate their relevance and obtain insight into the knowledge mined from large datasets. We rely on interactive visualizations as an efficient approach to bridge the gap between task automation and user control in mining tasks.

The expected improvement in efficiency brought by the interactive, visually enhanced framework for association rule mining relies on four major characteristics. First, it is well known that guessing suitable values for the support and confidence constraints is not straightforward and typically requires many algorithm iterations. We expect that users interacting with the visual framework will, on average, reduce the number of iterations required for parameter tuning. Second, we hope to reduce the overall time spent in mining interesting rules. Third, we hypothesize that the level of comprehension of the rules will improve due to the visual cues and the interactive, user controlled execution of the algorithm. Finally, as a combination of the previous points, we hope to increase the level of user satisfaction with the mining environment. Since the last two criteria are somewhat subjective, they are to be evaluated directly with the users using interviews, questionnaires, etc.

6. Relevance and Contributions Applicability

Despite significant advances, two important aspects of knowledge discovery processes, namely ease of use and knowledge interpretation, have been somewhat neglected in favor of improved algorithm efficiency. In the past five to six years, this weakness has been addressed by applying a range of approaches, including interaction and visualization techniques. However, there is no integrated framework available in which users can interact with and control the execution of association rule mining algorithms as a means to improve ease of use and understanding of the results. This is the issue addressed in this work. Our goal is to provide users with enhanced tools that facilitate the task of obtaining insight from the analysis of large datasets. As such, we believe that the proposed framework can empower users and extend data mining usage from experts to novices. Experiments with different data sets and user tasks shall reveal which types of users, tasks and data sets can be favored by the approach.

References

Aggarwal, C. C. and Yu, P. S. (1998). Online generation of association rules. In Proc. Int’l. Conf. on Data Engineering, pages 402–411, Orlando, FL, USA.

Agrawal, R., Imielinski, T., and Swami, A. N. (1993). Mining association rules between sets of items in large databases. In Proc. ACM Int’l. Conf. on Management of Data, pages 207–216, Washington, DC, USA.

Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proc. Int’l. Conf. on Very Large Data Bases, pages 487–499, Santiago de Chile, Chile.

Ankerst, M., Elsen, C., Ester, M., and Kriegel, H.-P. (1999). Visual classification: An interactive approach to decision tree construction. In Proc. ACM Int’l. Conf. on Knowledge Discovery and Data Mining, pages 392–396, San Diego, CA, USA.

Bayardo, R. J. and Agrawal, R. (1999). Mining the most interesting rules. In Proc. ACM Int’l. Conf. on Knowledge Discovery and Data Mining, pages 145–154, San Diego, CA, USA.

Blanchard, J., Guillet, F., and Briand, H. (2003). A user-driven and quality-oriented visualization for mining association rules. In Proc. IEEE Int’l. Conf. on Data Mining, pages 493–496, Melbourne, FL, USA.

Brin, S., Motwani, R., Ullman, J. D., and Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In Proc. ACM Int’l. Conf. on Management of Data, pages 255–264, Tucson, AZ, USA.

Bruzzese, D. and Davino, C. (2003). Visual post-analysis of association rules. J. of Visual Languages & Computing, 14(6):621–635.

Chakravarthy, S. and Zhang, H. (2003). Visualization of association rules over relational DBMSs. In Proc. ACM Symp. on Applied Computing, pages 922–926, Melbourne, FL, USA.

das Neves, J. M. P. M. (2002). Ambiente de Pós-processamento para Regras de Associação. PhD thesis, Faculdade de Economia – Univ. do Porto.

de Oliveira, M. C. F. and Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. IEEE Trans. on Visualization and Computer Graphics, 9(3):378–394.

Domingues, M. A. (2004). Generalização de Regras de Associação. PhD thesis, Instituto de Ciências Matemáticas e de Computação, Univ. de São Paulo.

Eades, P. (1984). A heuristic for graph drawing. Congressus Numerantium, 42:149–160.

Goethals, B. and den Bussche, J. V. (1999). A priori versus a posteriori filtering of association rules. In Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 1055–1061, Philadelphia, PA, USA.

Goethals, B. and den Bussche, J. V. (2001). A tight upper bound on the number of candidate patterns. In Proc. IEEE Int’l. Conf. on Data Mining, pages 155–162, San Jose, CA, USA.

Han, J., Pei, J., and Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proc. ACM Int’l. Conf. on Management of Data, pages 1–12, Dallas, TX, USA.

Hidber, C. (1999). Online association rule mining. In Proc. ACM Int’l. Conf. on Management of Data, pages 145–156, Philadelphia, PA, USA.

Hofmann, H., Siebes, A. P. J. M., and Wilhelm, A. F. X. (2000). Visualizing association rules with interactive mosaic plots. In Proc. ACM Int’l. Conf. on Knowledge Discovery and Data Mining, pages 227–235, Boston, MA, USA.

Houtsma, M. A. W. and Swami, A. N. (1995). Set-oriented mining for association rules in relational databases. In Proc. Int’l. Conf. on Data Engineering, pages 25–33, Taipei, Taiwan.

Ivkovic, S., Yearwood, J., and Stranieri, A. (2003). Visualizing association rules for feedback with the legal system. In Proc. of 9th Int’l. Conf. on Artificial Intelligence and Law, pages 214–223, Edinburgh, Scotland, UK.

Jorge, A. (2004). Hierarchical clustering for thematic browsing and summarization of large sets of association rules. In Proc. SIAM Int’l. Conf. on Data Mining, Lake Buena Vista, FL, USA.

Keim, D. A. (2002). Information visualization and visual data mining. IEEE Trans. on Visualization and Computer Graphics, 8(1):1–8.

Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., and Verkamo, A. I. (1994). Finding interesting rules from large sets of discovered association rules. In Proc. ACM Int’l. Conf. on Information and Knowledge Management, pages 401–407, Gaithersburg, MD, USA.

Melanda, E. A. and Rezende, S. O. (2004). Pós-processamento de regras de associação. In Proc. Simp. de Teses e Dissertações do ICMC, São Carlos, SP, Brazil.

North, C. and Shneiderman, B. (2000). Snap-together visualization: A user interface for coordinating visualizations via relational schemata. In Proc. Conf. on Advanced Visual Interfaces, pages 128–135, Palermo, Italy.

Ong, K.-H., Ong, K.-L., Ng, W.-K., and Lim, E.-P. (2002). Crystalclear: Active visualization of association rules. In Proc. Int’l. Workshop on Active Mining, in conj. IEEE Int’l. Conf. on Data Mining, Maebashi, Japan.

Park, J. S., Yu, P. S., and Chen, M.-S. (1997). Mining association rules with adjustable accuracy. In Proc. ACM Int’l. Conf. on Information and Knowledge Management, pages 151–160, Las Vegas, NV, USA.

Savasere, A., Omiecinski, E., and Navathe, S. B. (1995). An efficient algorithm for mining association rules in large databases. In Proc. Int’l. Conf. on Very Large Data Bases, pages 432–444, Zurich, Switzerland.

Wong, P. C., Whitney, P., and Thomas, J. (1999). Visualizing association rules for text mining. In Proc. IEEE Symp. on Information Visualization, pages 120–123, San Francisco, CA, USA.

Yang, L. (2005). Pruning and visualizing generalized association rules in parallel coordinates. IEEE Trans. on Knowledge and Data Engineering, 17(1):60–70.

Yoon, J. and Kerschberg, L. (1998). Query initiated discovery of interesting association rules. In Discovery Science, pages 232–243.

Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. (1997). New algorithms for fast discovery of association rules. In Proc. ACM Int’l. Conf. on Knowledge Discovery and Data Mining, pages 283–286, Newport Beach, CA, USA.

Zhao, Y., Zhang, C., and Zhang, S. (2004). Discovering interesting association rules by clustering. In Proc. Australian Joint Conf. on Artificial Intelligence, pages 1055–1061, Cairns, Australia.