Requirements for coordinated multiple view visualization systems for industrial applications

Anton Heijs
Treparel Information Solutions
[email protected]

Abstract

In this paper we discuss the importance of commercial visualization application development for the progress of research on coordinated multiple view (CMV) visualization techniques. There is a need for scalable information visualization (infovis) solutions, which is in fact the need to control the growing complexity of business data sets. Users of CMV visualization systems need solutions which bring insight into the complexity of their data. Progress in CMV visualization will be determined by the effort companies put into these problems to help clients with business data, in combination with their collaboration with CMV researchers at the universities.

Keywords: Coordinated multiple view, data mining, text mining, visualization systems.

1 Introduction

"The important thing is not to stop questioning." (Albert Einstein)

The field of visualization has always worked on data sets which are too difficult to understand with a set of derived values or simple plots. In the beginning the focus was mainly on scientific data sets, where one of the key challenges was to use visualization techniques to show the important patterns in the data so that researchers could understand the data better. When they gained insight they designed better experiments or simulations, which almost always resulted in more data. The challenge of dealing with ever-growing data sets has therefore existed for a long time. A big advantage for scientific visualization researchers is that the data in many cases comes from well-understood knowledge domains, such as physics and chemistry, for which good research questions can be formulated since there is a priori knowledge.

Information visualization addresses data which often does not come from the exact sciences, which is often unstructured in nature, and for which there is no well-defined knowledge domain to serve as a reference framework. Marketing data is a good example: it can be very large and complex, such as the marketing data of large banks with many clients and many possibly uncorrelated variables which the banks store about these clients. Data mining techniques are then used to extract the important patterns from the data, and in combination with visualization all the patterns can then be analyzed.

The data mining and the visualization research communities are quite separate, although they have much in common and both would benefit much from each other. The data mining researchers are looking for the best algorithms, and the visualization researchers put a lot of effort into finding the best visualization algorithms, although generally speaking there are only a small number of approaches for the most common data types. MineSet [3] is an example of a system that in 1997 combined data mining and visualization techniques. Infovis researchers need to be able to extract patterns from their data, and data mining and machine learning techniques are therefore important for them, since these algorithms can help in the filtering step of the visualization pipeline. Strangely enough this is still very much an open field, although visual analytics is now getting a lot of attention.

2 The need for scalable visualization solutions

"The important thing in science is not so much to obtain new facts as to discover new ways of thinking about them." (Sir William Bragg)

The real challenge for scientific and information visualization will not be scalability with respect to the size or the dimensionality of the data, but the complexity of the many patterns which are in the data. This must be handled through the coordination of multiple view visualizations of the patterns and the interactions between the visualizations. The research on CMV visualization is rather young. It is an area of great importance, and the work of, for instance, Jonathan Roberts and Nadia Boukhelifa [1] or of Chris Weaver [6] pushes the CMV road map forward. There are many open issues on how to develop CMV visualization systems which can handle many complex patterns from one or more data sets in a formal generic framework. These are exactly the issues which are important for people who need to analyze business data.

The question now is whether research can push the state of CMV visualization forward or whether industry requirements are going to pull it forward. Application users from commercial companies do not have the required knowledge to develop new CMV visualization techniques. Future progress should therefore come from the universities or from companies focusing on these visualization challenges. Treparel is such a specialized company. Treparel focuses on combining machine learning techniques (data, text, image and graph mining) with visualization techniques, and has developed a generic software platform called KMX (footnote 1).

Figure 1: Example visualization of multiple data mining models and their relationships.

Fifth International Conference on Coordinated and Multiple Views in Exploratory Visualization (CMV 2007) 0-7695-2903-8/07 $25.00 © 2007

Companies that develop software to address the commercial opportunities of visualization should especially focus on CMV visualization and therefore also on data mining research. If they do not invest in pushing this research area forward they can only do visualization consultancy or data mining consultancy, and they miss the opportunities coming from today's research. I want to illustrate this with two cases coming from Treparel.

Footnote 1: KMX stands for Knowledge Mapping and explorations and uses Oracle for the data mining and management.

3 Data mining and CMV visualization

A first example comes from the pharmaceutical industry, although it applies to all cases where experts need control of and insight into production processes. Treparel works for clients in the pharmaceutical domain. For the production of medicine these companies collect a large amount of data from their batch-driven production process. When there is an issue with a batch they need to do a troubleshooting analysis. Given the large number of possible hypotheses from all parameter combinations, it is difficult to determine which parameters should be changed. We therefore developed a system which uses advanced data mining techniques to analyse all parameters and their relationships for each batch process. We then use CMV visualization techniques to help the experts visually inspect, very quickly, all patterns (hypotheses) which can be the cause of an issue with a specific batch. CMV visualization techniques are essential here. An example of such a CMV visualization is shown in Figure 1.

4 Text mining and CMV visualization

A second example comes from the intellectual property (IP) industry, but it applies to all cases where experts need insight into large, complex text documents. Treparel works for clients in this domain, where the analysis and classification of large, complex patent collections is very important. For the automated classification Treparel developed a system which allows patent searchers to select a small set of positive and negative learning documents for the subject for which they want to build a classifier. They then apply this classifier to their set of documents, which are then classified with high precision and recall. This is shown in Figure 2(a), where all blue lines in the parallel coordinates visualization show the decay of the classification probability distribution of patents over 7 categories. Patents whose lines show a fast decay have a high likelihood (> 80%) of being automatically classified correctly.

For complex subject areas, as is often the case with patents, it is important to select a good set of positive and negative learning documents. We therefore use projection techniques as described in research papers of one of the university partners of Treparel, the University of Sao Paulo [2, 4, 5]. The results of this work are visualized in Figure 2(b), where good learning documents are now easily identified on the outside of the cluster regions of the 7 clusters.

Figure 2: Examples of patent map visualization where the supervised and unsupervised visualization must be coupled. (a) Example of visualization of the classification probability distribution of patents over 7 categories. (b) Example of cluster visualization where all patents belong to 7 clusters as indicated by 7 colors.

CMV visualization techniques make it possible to combine advanced text mining with visualization in a system which helps patent searchers to search and classify large, complex patent sets. The interactive coupling between the visualization of the classification and the clustering results is important for users to be able to work fast and obtain very good results. Visualization can help in obtaining insight into the document data, with support for the comparison of classification with various clustering algorithms. This will help to gain better insight into complex patent portfolios and to improve the supervised and unsupervised analysis. An example of such a CMV visualization is shown in Figure 2. The user should, for instance, be able to alter the parameters of the algorithms and get visual feedback on how this affected the results, working towards an optimized algorithm. It can provide insight into questions such as: 'Why is a certain document classified wrong but clustered right?'
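The coupling between the classification view and the cluster view can be illustrated with a small sketch. It is not the KMX implementation. It assumes that the "decay" of a classification probability distribution means the class probabilities of one patent sorted from high to low, and that a patent is a candidate for the question above when its predicted category differs from the most common predicted category in its cluster. All names and data (doc_probs, doc_clusters, margin, disagreements) are hypothetical, and only 3 categories are used instead of 7 to keep the example short.

```python
from collections import Counter, defaultdict

# Hypothetical classifier output: one probability per category for each patent.
# The paper uses 7 categories; 3 are used here to keep the sketch short.
doc_probs = {
    "p1": [0.90, 0.05, 0.05],
    "p2": [0.85, 0.10, 0.05],
    "p3": [0.30, 0.40, 0.30],
    "p4": [0.10, 0.80, 0.10],
    "p5": [0.05, 0.90, 0.05],
}
# Hypothetical cluster assignment coming from an unsupervised projection.
doc_clusters = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "B"}


def decay_profile(probs):
    """Probabilities sorted from high to low (assumed meaning of 'decay')."""
    return sorted(probs, reverse=True)


def margin(probs):
    """Gap between the two largest probabilities; a large gap is a fast decay."""
    top, second = decay_profile(probs)[:2]
    return top - second


def predicted_class(probs):
    """Index of the most probable category."""
    return max(range(len(probs)), key=lambda k: probs[k])


def cluster_majority(doc_probs, doc_clusters):
    """Most common predicted category within each cluster."""
    votes = defaultdict(Counter)
    for doc_id, cluster in doc_clusters.items():
        votes[cluster][predicted_class(doc_probs[doc_id])] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def disagreements(doc_probs, doc_clusters):
    """Patents whose predicted category differs from their cluster's majority."""
    majority = cluster_majority(doc_probs, doc_clusters)
    return sorted(
        doc_id
        for doc_id, cluster in doc_clusters.items()
        if predicted_class(doc_probs[doc_id]) != majority[cluster]
    )


if __name__ == "__main__":
    # Patents to highlight in both coordinated views for closer inspection.
    for doc_id in disagreements(doc_probs, doc_clusters):
        print(doc_id, decay_profile(doc_probs[doc_id]), round(margin(doc_probs[doc_id]), 2))
```

In a coordinated system the patents returned by disagreements would be highlighted in both views at once, so that the user can see a slowly decaying line in the parallel coordinates view next to the position of the same patent in the cluster map.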

5 Will breakthrough applications of CMV visualization come from a research push or industry pull?

"Never let the future disturb you. You will meet it, if you have to, with the same weapons of reason which today arm you against the present." (Marcus Aurelius Antoninus)

For the near future, the increase in value of visualization applications is expected to come from those techniques which address the scalability problem of dealing with large and high-dimensional data sets, especially data sets which contain many patterns. Coordinated multiple view techniques will be among the more important of these. The need to solve these problems already exists for visualization solutions which address business data sets. It is expected that commercial companies whose focus is to solve their clients' infovis problems will need to develop new CMV techniques which can scale to large numbers of complex patterns in business data. They talk with their clients, the experts who know the data very well. The knowledge of these clients is also needed to define the specific CMV requirements of the visualization solutions. The commercial visualization companies, on the other hand, benefit if they work on these problems in close collaboration with university researchers working on CMV visualization techniques. The CMV researchers alone cannot push the field to a higher level, since they miss the close contacts with clients at large commercial companies; it is therefore not expected that researchers alone will pull the field substantially further.

6 Conclusions

There are several forces which can and will drive the future development of visualization, and especially CMV visualization. In this paper two cases are described which show the need for collaboration between research and industry to make fast progress. These contacts between the CMV visualization developers (research and industry) should preferably also include contacts with the data mining and machine learning community to enable future progress.

Acknowledgements

I wish to acknowledge Charl Botha and Rosane Minghim, whose collaboration and discussions are very much appreciated.

References

[1] Nadia Boukhelifa, Jonathan C. Roberts, and Peter Rodgers, A Coordination Model for Exploratory Multi-View Visualization, In Proceedings of the International Conference on Coordinated and Multiple Views in Exploratory Visualization, 2003.

[2] A.A. Lopes, R. Pinho, R. Minghim, and F.V. Paulovich, Visual Text Mining using Association Rules, Computers & Graphics Journal, Special Issue on Visual Analytics (to appear), 2007.

[3] Cliff Brunk, James Kelly, and Ron Kohavi, MineSet: An Integrated System for Data Mining, In The Third International Conference on Knowledge Discovery and Data Mining, 1997.

[4] F. V. Paulovich and R. Minghim, Text Map Explorer: a Tool to Create and Explore Document Maps, IV '06: Proceedings of the conference on Information Visualization, 2006, pp 245–251, Washington, DC, USA, IEEE Computer Society Press.

[5] G.P. Telles, R. Minghim, and F.V. Paulovich, Normalized Compression Distances for Visual Analysis of Document Collections, Computers & Graphics, Special Issue on Visual Analytics (to appear), 2007.

[6] Chris Weaver, Patterns of Coordination in Improvise Visualizations, Proceedings of the IS&T/SPIE Conference on Visualization and Data Analysis, San Jose, CA, January 2007.
