Research Issues in Knowledge Discovery, Management and Visualisation
Report ILA-02-003
Friday, April 26, 2002

John F. Roddick Flinders University of South Australia PO Box 2100 Adelaide 5000 South Australia [email protected]

A number of trends in ICT indicate that tools for the automated and semi-automated analysis of data will be essential for competitive and efficient organisations in the future. These trends include the rapidly increasing volumes of data being collected from a growing number of sources and the increasing complexity of that data. Both business and government are recognising the importance of data for strategic advantage while, at the same time, a more mobile workforce is eroding the institutional memory of those organisations. Advances in automated and semi-automated data analysis techniques such as data mining and knowledge discovery, together with more sophisticated methods of managing and visualising that data (and any abstractions of it), are thus essential if the volumes of information being collected are to be usefully exploited.


Motivation

Affordable storage media and improved connectivity, with data available through the internet and other sources, together with the increased complexity of data, have led to improvements in the accessibility of data but also to systems (and organisations) that are data-rich but knowledge-poor. At the same time, there is an increased recognition of the value of knowledge for strategic and operational advantage, the realisation of which is hindered by the growing mobility of the workforce and the consequent loss of institutional memory. Listed below are a number of the factors motivating the search for automated and semi-automated knowledge discovery solutions, i.e. systems that can assist in the elicitation of useful knowledge from potentially vast repositories of information.

Storage Media - The oft-quoted Moore's Law applies not only to processing capacity but also to the growth in both the capacity and the affordability of disk storage media. This has resulted in organisations choosing to retain data that would otherwise have been put on offline storage (or deleted entirely) and to collect information hitherto considered only marginally of interest. It has also resulted in changes to storage regimes, such as the use of local (i.e. duplicate) copies where previously this was not affordable. Interestingly, data access performance, particularly I/O channel speed, has not improved at the same rate, resulting in previously CPU-bound processes becoming I/O-bound. The combination of these factors means that, even for simple tasks of linear time complexity such as data backup (and information analysis is generally non-linear), the time required has been steadily increasing over the past 20 years, despite improvements in technology.

Complexity - Generally speaking, the conventional databases of the 1990s were static, location-less and of low dimensionality. With the growth of spatial and temporal data models (support for which is currently being embedded in the SQL standard) and the development of multi-dimensional data warehouse technology, this is changing. This complexity means that assumptions that were made of the data (generally due to a lack of contradictory evidence), such as "I assume this data is not seasonal" or "I assume this data applies equally to Australia", need no longer be made, as the historical or locational data can be inspected more thoroughly. This extra inspection, however, takes time and thus, while the results may be more accurate, the cost of extracting knowledge from the data increases [1].

Improved Connectivity - Part of the reason for increases in data availability is the increased connectivity now possible and the ability to integrate external data sources for comparative, reference or other purposes. This has also made real-time data analysis possible, with real-time analysis of shares, weather conditions and traffic conditions now common.

Value of Knowledge - The Knowledge Economy rests on the use of knowledge for strategic advantage. This is now widely recognised, and mechanisms (and staff) that enable the capture and analysis of data that might provide a competitive edge or a useful insight are now highly prized. Unfortunately for many organisations, the mobility of staff is increasing, resulting in valuable institutional memory being lost and new staff spending time merely rediscovering existing knowledge.

The broad area of Knowledge Discovery, Management and Visualisation therefore attempts to address these important problems through research into techniques for the discovery, management and presentation of useful knowledge from potentially extremely large and complex data repositories.

[1] Interestingly, the ability to perform more checks on data means that it could be seen as irresponsible not to do so and simply to continue with existing procedures. The Australian Bureau of Statistics, for example, is now issuing more complex caveats and explanations with its data so that the data can be interpreted correctly.

22nd April 2002


Research Agenda

The vision for this area is the construction of integration frameworks and flexible systems that enable the interrogation, analysis and mining of knowledge from large, complex and potentially distributed datasets in a manner that facilitates understanding of the semantics of that knowledge and thus enables appropriate decisions to be made. Such data will come in a variety of formats including visual, audio and textual, and will include temporal and spatial characteristics. In many cases the appropriate presentation mechanisms will be multi-dimensionally graphical.

There is also a strong ethical and legal aspect to this area, with information security a major consideration. The key to success in this area is the fusion of multiple, often real-time, and commonly highly detailed data sources. Of particular current interest is research into knowledge discovery systems, the semantics of complex data, data and rule visualisation, data fusion, vision, speech and speaker understanding, web-based search and complex decision support systems.

Indicative Research Areas

Listed below are four indicative research areas that lie within the capabilities of the ILA and which have synergies with the work of other researchers in Australia, notably those associated with the Horizons Institute.

Temporal and Spatial Mining

The mining of data with significant temporal and/or spatial semantics presents particular challenges that must be met if the opportunities offered (for example, the ability to spot cause-and-effect correlations) are not to be missed. In particular, these domains are a difficult research area in that synthetic datasets, often used in other fields of ICT, are generally of little use, and thus real-world data must be obtained and used. While considerable work has been done in some aspects of this area, notably in Canada, Germany, Australia and the US, significant open problems remain.

Data and Rule Visualisation

While fully-automated analysis is useful, the combination of computer-generated graphical presentations and human pattern recognition abilities means that visualisation techniques offer considerable opportunity for spotting trends, anomalies and associations in data.

Present mining algorithms do not adequately constrain the generation of rules to those that are of interest to the user, and thus this area also includes the broader field of interactive data mining. Interactive mining techniques aim to alleviate this problem by involving the user in the mining process, so that the user's understanding of abstract semantic concepts and domain knowledge can guide the discovery process, resulting in accelerated mining with improved results.
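As a rough illustration of how user involvement might narrow a set of rules, the sketch below filters candidate rules through predicates supplied by the user over successive rounds. It is a simplified stand-in, not a description of any particular interactive mining system: it filters rules after they have been generated rather than constraining their generation, and the Rule structure, the helper names (mentions, min_confidence, refine) and the example rules are all hypothetical.

```python
from typing import Callable, List, NamedTuple


class Rule(NamedTuple):
    """A candidate association rule (hypothetical structure for illustration)."""
    antecedent: frozenset
    consequent: frozenset
    confidence: float


# A constraint is any predicate the user supplies to say "this rule interests me".
Constraint = Callable[[Rule], bool]


def mentions(item: str) -> Constraint:
    """Keep only rules that involve the given item on either side."""
    return lambda rule: item in rule.antecedent or item in rule.consequent


def min_confidence(threshold: float) -> Constraint:
    """Keep only rules at or above the given confidence."""
    return lambda rule: rule.confidence >= threshold


def refine(rules: List[Rule], constraints: List[Constraint]) -> List[Rule]:
    """Return the rules that satisfy every user-supplied constraint."""
    return [rule for rule in rules if all(check(rule) for check in constraints)]


# Hypothetical candidate rules, as if produced by an earlier mining pass.
candidates = [
    Rule(frozenset({"rain"}), frozenset({"traffic_delay"}), 0.82),
    Rule(frozenset({"holiday"}), frozenset({"traffic_delay"}), 0.41),
    Rule(frozenset({"rain"}), frozenset({"umbrella_sales"}), 0.90),
]

# Round 1: the user asks only for rules about traffic delays.
round_one = refine(candidates, [mentions("traffic_delay")])

# Round 2: having seen the results, the user also demands higher confidence.
round_two = refine(round_one, [min_confidence(0.5)])
```

Each round applies the user's current constraints to the output of the previous round, so domain knowledge progressively shrinks the set of rules that must be inspected.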

Rule Mining

To date, most data mining algorithms and frameworks have concentrated on the extraction of interesting rules directly from the collected data. This area investigates the generic modelling of these rules and the utility of deriving rules from the results of other data mining routines, that is, mining from rulesets. The approach offers three significant advantages. Firstly, as datasets grow, mining from the complete dataset on a regular basis may become intractable. Secondly, changes in observations (and therefore in the observed system) can be more easily discovered by inspecting changes in extracted rules over time (or over any other sequential progression). Finally, the rules extracted by this process carry different semantics from those produced by a first-order discovery process. In many cases such rules are closer to the sorts of rule frequently used to describe everyday phenomena.
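The second advantage, discovering change by inspecting rulesets rather than raw data, can be sketched as a comparison of the rules extracted in two successive periods. The following is a minimal illustration under stated assumptions rather than the method proposed in this report: the Rule structure, the compare_rulesets function, the confidence tolerance of 0.1 and the example rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Rule:
    """An extracted rule with its confidence (hypothetical structure)."""
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    confidence: float


RuleKey = Tuple[Tuple[str, ...], Tuple[str, ...]]


def rule_key(rule: Rule) -> RuleKey:
    """Identify a rule by its sorted antecedent and consequent items."""
    return (tuple(sorted(rule.antecedent)), tuple(sorted(rule.consequent)))


def compare_rulesets(
    earlier: List[Rule], later: List[Rule], tolerance: float = 0.1
) -> Dict[str, List[RuleKey]]:
    """Report rules that emerged, vanished, or changed confidence by more than tolerance."""
    old = {rule_key(r): r for r in earlier}
    new = {rule_key(r): r for r in later}
    return {
        "emerged": sorted(new.keys() - old.keys()),
        "vanished": sorted(old.keys() - new.keys()),
        "shifted": sorted(
            k for k in old.keys() & new.keys()
            if abs(new[k].confidence - old[k].confidence) > tolerance
        ),
    }


# Hypothetical rulesets mined from two successive periods.
period_one = [
    Rule(frozenset({"promotion"}), frozenset({"sales_up"}), 0.80),
    Rule(frozenset({"winter"}), frozenset({"heater_sales"}), 0.75),
    Rule(frozenset({"new_store"}), frozenset({"sales_up"}), 0.60),
]
period_two = [
    Rule(frozenset({"promotion"}), frozenset({"sales_up"}), 0.55),
    Rule(frozenset({"winter"}), frozenset({"heater_sales"}), 0.78),
    Rule(frozenset({"online_ad"}), frozenset({"sales_up"}), 0.70),
]

changes = compare_rulesets(period_one, period_two)
```

The comparison touches only the two rulesets, whose size is typically far smaller than that of the underlying data, which is also the point of the first advantage above.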

Ethical Issues

The development of data mining is presenting significant ethical and social issues that must be addressed if the new technology is to be widely accepted. This area explores a range of these issues, including privacy, data accuracy, database security, legal liability and the broader research dilemmas. While some are more widely relevant, the development of data mining technologies has shown, for example, that much legislation is inadequate and even contradictory.

Application Contexts

In common with the other areas, specific contexts will be employed as they represent complex but valuable application domains for this work. In this agenda, the domains of defence and health will be targeted. Moreover, the management of knowledge within mobile environments and between organisations means that the Knowledge Discovery, Management and Visualisation area will interact substantially with the other research agendas.
