Knowledge Discovery and Case-Based Reasoning in Medical Applications with Time Series

Peter Funk, Markus Nilsson, Ning Xiong
Department of Computer Science and Electronics, Mälardalen University, SE 721 23 Västerås, Sweden
{peter.funk, markus.nilsson, ning.xiong}@mdh.se

Abstract. This paper discusses the role and integration of knowledge discovery in case based reasoning systems. The general view is that knowledge discovery is complementary to the task of knowledge retaining and it can be treated as a separate process outside the traditional CBR cycle. Unlike knowledge retaining which is mostly related to case-specific experience, knowledge discovery aims at the elicitation of new knowledge that is more general and valuable by improving the functionality of the different CBR sub-steps. The importance of knowledge discovery for CBR is exemplified by a real application scenario in stress medicine in which time series of patterns of breathing cycles are to be analyzed and classified. As single breathing patterns cannot convey reliable information, sequences of patterns are deemed more adequate evidence for diagnosis by experts. Hence it is advantageous if sequences of patterns can be identified in a CBR approach to a diagnosis and co-occurrence is discovered in categories of patients with similar status.

1 Introduction

Clinicians use both explicit knowledge obtained from guidelines and regulations, and implicit knowledge based on their own experience and that of the patients and other clinicians, and experience in the form of past cases [7]. Implicit knowledge may be knowledge that is known by many or all clinicians. But since medical case libraries also include case outcomes, they may be a valuable source of implicit knowledge not previously recognized by clinicians. A large proportion of medical research is directed towards discovering new causal relations, e.g. [16].

The main contribution of knowledge discovery (KD) is that it makes possible the recognition of previously unknown and potentially useful information. It may be defined as the non-trivial process of identifying novel, valid and potentially useful data patterns, and ideally, to also understand these data patterns [5]. Knowledge discovery is related to machine learning, statistics, databases and data visualization, and uses a variety of techniques such as statistical techniques, decision trees, decision nets, clustering techniques and neural nets.

Some CBR systems contain a proportionately large part that could be labeled as knowledge discovery; others contain no knowledge discovery, and the learning process is purely based on new experience in the form of new cases stored in the case library during the retain step. The retain step may include a number of sub-steps of which the learning of more general knowledge (e.g. for improved indexing), based on the new case, is one sub-step [1]. CBR could itself be classified as knowledge discovery, especially when it gains general knowledge from experience, e.g. learns new classes, prototypes or concepts [12].

We take the approach of treating knowledge discovery as a separate process, integrated into and extending the CBR cycle. In the CBR systems we have studied, it is easy to distinguish between learning based on new experience (new cases and their indexing) and knowledge discovery. Our extended definition of knowledge discovery in CBR systems is that it is learning that is not naturally associated with a new specific case. It may be argued that if the KD part is proportionately large, it may be separated from the CBR cycle. While the CBR cycle is often directly triggered by a new problem, the KD process is often naturally suitable as a separate background task. The case library as a whole and additional knowledge are the input to the KD process and the result is delivered back to the system, e.g. as in [15].

In an industrial CBR system diagnosing faults in industrial robots on the basis of sound patterns, the knowledge discovery process was performed manually, discovering new features and general knowledge that is used in the system to improve matching results [11]. The discovery of new features is now considered for automation in an off-line approach using all cases as the source for the discovery process. For the identification of breathing dysfunctions to diagnose stress, a CBR system is used to classify individual breathing cycles [9].
The knowledge discovery process in this application is too complex to be performed manually, but experts in the field are convinced that knowledge discovery would identify new and valuable co-occurrences in categories of patient measurement data, and in particular in the time series of classified breathing patterns. Such new knowledge would increase the usefulness of the CBR system in the diagnosis of patients and would encourage research, as experts can use the new knowledge to improve and refine existing models.

Based on these arguments we have extended the CBR cycle with a knowledge discovery process separate from the retain step. The KD part identifies new relations among time series of classified breathing cycles and uses the knowledge discovered to facilitate retrieval. A KD process integrated in the CBR cycle may also be advantageous since advancements in computing technology will enable more sophisticated approaches to the discovery of knowledge and the return of new general knowledge to the knowledge container (Fig. 1 contains a detailed description and illustration). The black arrow from the case library to knowledge discovery and that from knowledge discovery to retrieve knowledge show where knowledge discovery is performed in the example given in this paper. Input may be from any knowledge container and the knowledge discovered may also be returned to any knowledge container, including the case library (e.g. as prototype or stereotype cases). It may be desirable to validate and verify discovered knowledge automatically or manually, e.g. if the system is able to reclassify all the cases in the case library as effectively as or more effectively than the system without the discovered knowledge, it may be argued that it is safe to add the discovered knowledge.

Fig. 1 shows the five different knowledge containers [13] according to the CBR knowledge structure. Fig. 1 is structured to show the value of a separate knowledge discovery process discovering new knowledge that in our example is added to the general application knowledge used in the retrieval process. The retain step typically stores a new case in the case library or may modify some existing cases and usually contains a number of sub-steps [1]. It is the actual learning phase that typically differentiates CBR from other approaches. Learning is typically included in the CBR cycle and generally only contains learning that stems directly from the newly solved case.

[Figure 1. Diagram labels: Knowledge Discovery; Retrieve knowledge; Reuse knowledge; Revise knowledge; Retain knowledge; Maintenance; Case library (experience).]

Fig. 1. Case Library, knowledge containers and the knowledge discovery process.

The knowledge discovery process is most often too elaborate to fit into the CBR cycle naturally. Knowledge learned is general for the application and directly connected to one or more knowledge containers, not a case. The example showing the integration of CBR with KD is from the medical domain, in which the classification of Respiratory Sinus Arrhythmia (RSA) based on sensor readings is becoming increasingly important in physiological/psychophysiological medicine [16]. When clinicians diagnose patients, one important procedure is to classify RSA into 12 or more different categories, a task being automated with a case based reasoning system [9]. The full patient case is information-rich and cases store the time series of classified breathing cycles, today used manually by clinicians in making a final diagnosis. Time series of breathing cycles are used in the diagnosis process because clinicians know that classification based on a single breathing cycle is not sufficient for the reliable diagnosis of a breathing dysfunction. Classifying complex time series manually is tedious and often requires long experience, with no explicit rules or guidelines available, in the worst case leading to incorrect diagnoses by less experienced clinicians [13]. Cases containing time series have been explored in medical applications in particular but also in some industrial cases. The value of CBR in medical applications has also been investigated and confirmed in a number of research projects. Four successful CBR applications in medical classifications are: the Auguste project [6]; the CARE-PARTNER system [2]; the MNAOMIA system, a CBR system able to create hypotheses in the area of eating disorders [3], [4] and Schmidt and Gierl’s unnamed system for time series analysis and prediction of kidney function [14]. For a more extensive overview of state-of-the-art medical CBR systems see [8].

2 Knowledge Discovery as Complementary to Knowledge Retaining

Case-based reasoning uses domain knowledge to retrieve relevant cases from the case library. In complex applications, as many medical applications are, a large body of domain knowledge is often needed to enable the system to identify and retrieve the relevant cases. If a system performs weakly in retrieving appropriate cases, either the retrieval knowledge is insufficient or the cases do not include all the essential features necessary to retrieve the most relevant cases. If the retrieval knowledge is insufficient, it may be necessary to optimize the weighting or, in a more complex domain, to optimize the similarity functions [17]. If, however, essential features are missing or concealed in relations between other features, considerable effort may be needed to identify them. For some systems the separation of the KD makes the CBR system more transparent and reduces the complexity of research and implementation. It may even make it easier to apply CBR to applications for which it has not been suitable previously because of their complex nature and which today are only explored manually or intuitively by experts.

Fig. 2 shows the input to and output from the knowledge discovery process. The cases in the case library may stem from different case libraries. In the medical domain it may not normally be acceptable to distribute medical cases between hospitals without a medical reason, but it may be acceptable if they are to be used by a knowledge discovery process to obtain new knowledge to be made available to all CBR systems of the same kind. In our example the retrieval knowledge is input to the knowledge discovery process to determine if there are previously undiscovered co-occurrences. In a medical environment the knowledge containers of all the CBR systems may have the same content and share prototypical cases, but in general, they contain different cases.
[Figure 2. Diagram labels: Case library (experience), shown for several case libraries; Knowledge Container, shown for several containers; Previously known general domain knowledge; Knowledge Discovery Using Clustering, Statistics, Bayesian Nets, Graphs, Neural Nets, Evolutionary Algorithms; New non-trivial general domain knowledge not previously known.]

Fig. 2. Knowledge discovery and its integration with CBR

The discovered knowledge is returned to the knowledge container. If validation or verification of the new knowledge is needed, this may be performed before the new knowledge is used. In the next section we will outline a concrete knowledge discovery task for time series classification in a medical CBR system.
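The validation idea mentioned in the introduction (reclassify the cases in the case library with and without the discovered knowledge) can be sketched as follows. This is only a minimal illustration under stated assumptions: the leave-one-out nearest-neighbour scheme, the function names loo_accuracy and safe_to_add, and the two toy similarity functions are hypothetical and are not taken from the system described here.

```python
def loo_accuracy(cases, similarity):
    """Leave-one-out accuracy of 1-nearest-neighbour retrieval.

    cases: list of (series, diagnosis) pairs, at least two of them.
    similarity: function (series_a, series_b) -> number, higher is more similar.
    """
    correct = 0
    for i, (query, truth) in enumerate(cases):
        # Retrieve the most similar case, excluding the query case itself.
        others = [case for j, case in enumerate(cases) if j != i]
        best = max(others, key=lambda case: similarity(query, case[0]))
        if best[1] == truth:
            correct += 1
    return correct / len(cases)


def safe_to_add(cases, baseline_similarity, extended_similarity):
    """Accept discovered knowledge only if reclassification does not get worse."""
    return loo_accuracy(cases, extended_similarity) >= loo_accuracy(cases, baseline_similarity)


# Toy data (hypothetical): two diagnoses, series written as strings of class labels.
cases = [("0302", "x"), ("03002", "x"), ("0000", "y"), ("00000", "y")]

def baseline(a, b):
    # Assumed baseline: compares series length only.
    return -abs(len(a) - len(b))

def extended(a, b):
    # Assumed discovered feature: presence of RSA pattern 3 in the series.
    return 1 if ("3" in a) == ("3" in b) else 0

print(safe_to_add(cases, baseline, extended))
```

In practice the acceptance criterion would be chosen together with domain experts; the comparison above only captures the "as effectively as or more effectively" condition.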

3 An Example of knowledge discovery in time series

As noted in [9], classification of individual RSA patterns is one of the main procedures used by clinicians to classify RSA dysfunction. Clinicians also emphasize that classification based on a single RSA pattern is not sufficient for a reliable diagnosis of an RSA dysfunction. This is because RSA reflects the net effect of the complex interaction of the many different systems involved. The pattern classification system [9] identifies dysfunctional RSA patterns directly from sensor readings.

A clinical session usually lasts 18 minutes divided into a number of different phases (e.g. normal breathing, provoking stress, breathing deeply, etc.), with an average of 5-15 seconds per respiration period. These phases are handled as individual cases and each phase contains a series of classified RSA patterns (example in section 3.1). A series of breaths contains on average 40-80 respiration periods (inhalation-exhalation cycles). The pattern classification system eliminates most sensor noise, but some misclassifications caused by distortions in the sensor readings remain possible. In the following we will only refer to series or breathing series and not phases.

The ability of experienced clinicians to identify RSA series is based on their experience. They are able to explain them once they recognize such a series, but the knowledge is not so explicit that they are able to describe such a series in advance. It should also be noted that the complexity of the systems reflected in RSA, their behavior and their interaction are not fully understood and more theoretical work is needed [16].
For this reason we have developed a method:
• which permits the recognition of similarity of RSA sequences,
• which makes possible the identification of new RSA sequences of importance for the diagnosis of patients,
• which detects co-occurrence between RSA sequences and patient status, that may lead to the discovery of new causal relations as results of clinical experiments isolating the causal factors.

The method and relevant terminology are explained in the following sections.

3.1 Important Sequences of RSA patterns

Assume that each dysfunctional RSA pattern is assigned a number between 1 and 9 (assuming for simplicity that there are only 9, although in reality there are more than 10 different dysfunctional RSA patterns) and that one normal RSA pattern is assigned the number 0. A sequence of classified RSA patterns for a session can be illustrated by a series of numbers, one per breathing cycle (usually 40 to 80; in the example below some parts of the series have been omitted):

RSA series: [0003000001060003000240050003020030020000700009020000] (1)

For a reliable diagnosis it would be advantageous to be able to identify recurring sequences in the RSA series. Such sequences are exemplified in (2).

Significant sequences: [302], [3002] and [30002] (2)
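As a small illustration of what identifying the sequences in (2) within the series in (1) involves, the following sketch counts exact occurrences. It assumes, as in the simplification above, that each RSA class is written as a single digit so that a series can be held as a string; the function name count_sequences is hypothetical.

```python
# Series (1) and sequences (2) from the text, with one character per breathing cycle.
RSA_SERIES = "0003000001060003000240050003020030020000700009020000"
SIGNIFICANT = ["302", "3002", "30002"]


def count_sequences(series, sequences):
    """Count exact (possibly overlapping) occurrences of each sequence in a series."""
    counts = {}
    for seq in sequences:
        n = 0
        start = series.find(seq)
        while start != -1:
            n += 1
            # Continue one position further so overlapping hits are also counted.
            start = series.find(seq, start + 1)
        counts[seq] = n
    return counts


print(count_sequences(RSA_SERIES, SIGNIFICANT))
# {'302': 1, '3002': 1, '30002': 1}
```

Each of the three sequences occurs once in the shortened example series; exact matching of this kind treats them as unrelated, which is the limitation the similarity-based matching of section 3.2 addresses.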

The sequences in (2) can be classified by clinicians to indicate RSA dysfunction, especially if they recur a number of times during a series. If they occur in this particular order (an RSA sequence) this may be a strong indication of a dysfunction, but if these RSA patterns (3 and 2) occurred in a different order or in a random order, then a clinician may not regard them as an indication of a dysfunction. Hence a way to automatically recognize recurring sequences of possible importance would be of value to clinicians.

3.2 Detecting Important Sequences as Distinguishing Features

Since varying RSA sequences may be interpreted in the same way by a clinician, the matching must be similarity-based, e.g. whether there are three, four or five normal breathing cycles between patterns 3 and 2 may be of less significance. More complex sequences may also occur. One solution is to store these sequences as expressions with variations, e.g. [“3”, n * “0”, “2”], with n * denoting n repetitions of the label that follows and n ∈ {1..3}. This example expression captures the sequences in (2). It means that there is first an RSA pattern 3, followed by one, two or three normal breathing patterns, and finally an RSA pattern of class 2. When a clinician is performing a measurement session, a search for similar, but not exactly matching, sequences may be relevant and hence similarity-based matching is preferred. Such a match may indicate a variation of an RSA dysfunction or even a new type of RSA dysfunction not previously encountered.

In Fig. 3 a series of classified RSA patterns is given as input (from the left). This series is the result of a measurement signal classified by the HR3Modul, which has classified each RSA pattern in the measurement signal. “0” is a normal breathing cycle with no indication of dysfunction. The library of important expressions in Fig. 3 contains sequences of importance for classifying dysfunctions.
The library of expressions may, as mentioned previously, stem from experienced experts, but may also contain expressions automatically generated as described in section 3.3. The “identify sequences of relevance” process in the middle is the matching process, discovering similar expressions formalized in the expression library. In the resulting output series on the right, the identified sequences are underlined. Such identified relevant sequences can be considered as distinguishing features for depicting the case of the input pattern series.
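One possible way to realise the matching of expressions such as [“3”, n * “0”, “2”] with n ∈ {1..3} is to translate each expression into a regular expression. The sketch below is an assumption for illustration, not the matching process of the system described here: it assumes single-character class labels, represents an expression as a list of hypothetical (labels, min_repeat, max_repeat) triples, and uses a lookahead so that overlapping sequences are also reported.

```python
import re


def compile_expression(expression):
    """Translate a list of (labels, min_repeat, max_repeat) triples into a regex.

    labels is a string of alternative single-character class labels,
    e.g. "0" for a normal cycle or "01" for ("0" | "1").
    """
    parts = []
    for labels, low, high in expression:
        parts.append("[%s]{%d,%d}" % (re.escape(labels), low, high))
    # The lookahead keeps each match zero-width so overlapping hits are found.
    return re.compile("(?=(" + "".join(parts) + "))")


def find_sequences(series, expression):
    """Return (start, end, matched_text) for every match of the expression."""
    pattern = compile_expression(expression)
    return [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(series)]


# The expression ["3", n * "0", "2"], n in {1..3}, from the text.
EXPR_3_0_2 = [("3", 1, 1), ("0", 1, 3), ("2", 1, 1)]

# Input series as shown in Fig. 3.
print(find_sequences("0030002400500030200300200", EXPR_3_0_2))
# [(2, 7, '30002'), (14, 17, '302'), (19, 23, '3002')]
```

The returned positions correspond to the sequences that would be underlined in the output series; how overlapping hits are presented to the clinician is left open here.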

[Figure 3. Diagram labels: Library of important expressions (e.g. Expression a1: [“3”, n * “0”, “2”], n ∈ {1..3}); input: Series of classified RSA patterns, 0030002400500030200300200; process: Identify sequences of relevance; output: Identified relevant sequences (underlined in the output series).]

Fig. 3. Feature identification for a pattern series

The result will present the recognized sequences in the RSA pattern series to the clinician in a suitable way (sequences may overlap each other, so how the sequences are visualized for the clinician must be chosen carefully). This will help less experienced clinicians in making an overall diagnosis of the patient and also ease the workload of experienced clinicians. The library of important expressions may be a mixture of expressions formalized by clinical experts in the field and expressions produced through data mining as described in the following section.

3.3 Discovering new important RSA sequences from data examples

As can be seen from the previous section, the expression library plays a crucial role in the detection of dysfunctional RSA. A number of RSA sequences of importance in the diagnosis process are provided by clinicians with extensive experience. But there may be RSA sequences not yet discovered by clinicians that may indicate RSA dysfunction. Experts in the field state that the discovery of new sequences is important for improving the reliability of the diagnosis process. The association of relations with probabilities would also be valuable, e.g. the occurrence of a certain expression in 75% of all patients with a particular diagnosis and in only 50% of all other patients is very valuable information for clinicians and may lead to important discoveries once the causal relations have been verified.

New RSA sequences that may be important may be discovered using a data mining approach. Fig. 4 shows a model of how such discovery can be made. The input (the top in Fig. 4) is the hypothetical space of all plausible expressions. A procedure with some domain knowledge ranks the expressions, the most plausible first. In the next step, the search begins for relations between expressions and RSA series that belong to patients who have been diagnosed. Clinicians classify patients in 14 different classes.
Each expression can be classified in relation to all the patient classes. By inspecting all series from patients in the same class, and determining the percentage of those series in which the expression occurs, a value can be given to this relation. The expressions shown in the bottom right corner of Fig. 4 are followed by two values, these being the relations between the expression and the two most significant patient diagnoses. This percentage is also determined for expressions already known, since this will specify an exact value for the clinician’s experience.

[Figure 4. Diagram labels: The hypothetical space of all expressions; Generate; New expressions with potential importance; Important expressions already known (e.g. [“3”, n * “0”, “2”], n ∈ {1..3}); Classified RSA series from previously diagnosed patients; Search for co-occurrence relations between expressions and diagnosis; Known and new expressions with probabilities (e.g. Expression n4: ["4", n * ("0"|"1"), "2"], n ∈ {1..4}, diagnosis c1: 65%, diagnosis d1: 53%, …).]

Fig. 4. Discovery of new important expressions and their causal relations.
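The relation value described above, the percentage of series in a patient class in which an expression occurs, can be sketched as follows. The sketch assumes that an expression has already been translated into a regular expression string and that series are stored per diagnosis in a dictionary; the function names and the toy data are hypothetical.

```python
import re


def cooccurrence(pattern, series_by_diagnosis):
    """Percentage of series per diagnosis in which the pattern occurs at least once."""
    regex = re.compile(pattern)
    result = {}
    for diagnosis, series_list in series_by_diagnosis.items():
        if not series_list:
            continue  # no series for this class, so no percentage can be given
        hits = sum(1 for series in series_list if regex.search(series))
        result[diagnosis] = 100.0 * hits / len(series_list)
    return result


def top_relations(percentages, k=2):
    """The k diagnoses with the highest percentage, as shown per expression in Fig. 4."""
    return sorted(percentages.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy data (invented for illustration only, not clinical data).
toy = {
    "c1": ["00302", "0000", "3002", "000"],
    "d1": ["0302", "3002"],
}
percentages = cooccurrence("30{1,3}2", toy)
print(top_relations(percentages))
# [('d1', 100.0), ('c1', 50.0)]
```

A percentage of this kind expresses co-occurrence only; whether the relation is causal would still have to be established clinically.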

If the causal relationship is weak (below a threshold set by an expert in the field), the expression is discarded (this only applies to the expressions generated automatically). The process is a typical off-line approach, since searching through a space of all legal sequences will require much processor capacity. Much also depends on the quality of the ranking of expressions. The method is a knowledge discovery method and would help clinicians and experts in the field to accelerate the progress of research.

First of all we need to discover RSA sequences potentially indicating dysfunction. This can be done by analyzing a large number of RSA series from patients with a known diagnosis. Once an RSA sequence occurs frequently in different series, a data mining tool is able to discover a co-occurrence between the sequence and a diagnosis. Such a tool may use clustering methods and statistics, and identify all cases with this particular sequence and identify any relation to a specific diagnosis. If such a relation is established, it can be used to aid clinicians in their diagnosis process. Adding such a sequence, translated to an expression, to the expression library explained in section 3.2 will give such aid. An experienced clinician may use this discovery tool to perform research in the process of knowledge discovery. Since the number of series in the database may be limited, an experienced clinician may add personal experience by modifying the series (or more precisely, by modifying the expression identifying this series).
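The generate-and-filter loop described above might be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the candidate space is restricted to expressions of the form [a, n * “0”, b] with single-digit labels, no domain-knowledge ranking is applied, and an expression is kept when its highest per-diagnosis percentage reaches the threshold. The names generate_candidates and discover are hypothetical.

```python
import re
from itertools import product


def generate_candidates(labels="123456789", max_gap=3):
    """Yield regex strings for [a, n * "0", b] with n in 1..max_gap (assumed space)."""
    for a, b in product(labels, repeat=2):
        yield "%s0{1,%d}%s" % (a, max_gap, b)


def discover(series_by_diagnosis, threshold, labels="123456789", max_gap=3):
    """Keep candidates whose best per-diagnosis percentage is at least the threshold."""
    kept = {}
    for pattern in generate_candidates(labels, max_gap):
        regex = re.compile(pattern)
        best = None  # (diagnosis, percentage) with the highest percentage so far
        for diagnosis, series_list in series_by_diagnosis.items():
            if not series_list:
                continue
            hits = sum(1 for series in series_list if regex.search(series))
            share = 100.0 * hits / len(series_list)
            if best is None or share > best[1]:
                best = (diagnosis, share)
        if best is not None and best[1] >= threshold:
            kept[pattern] = best
    return kept


# Toy data (invented for illustration only).
toy_series = {
    "c1": ["00302000", "0030020", "000000"],
    "d1": ["0000", "00400020"],
}
kept = discover(toy_series, threshold=60.0)
print(kept)
```

With the toy data only the candidate corresponding to [“3”, n * “0”, “2”] survives a threshold of 60%. A realistic candidate space would be far larger, which is why the text treats the search as an off-line task.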

4 Summary and Conclusions

We have in this paper discussed the value of combining knowledge discovery and case-based reasoning in medical applications in which time series and patterns of events in these time series are relevant. We propose to treat knowledge discovery as a separate process, outside the traditional CBR cycle. In contrast to knowledge retaining, which is directly related to case-specific experience, the purpose of knowledge discovery is to discover new knowledge that is more general and, by adding this new knowledge, to improve the overall performance of the CBR system.

The approach is exemplified in a medical domain (diagnosis of stress) in which the diagnosis is based on time series of classified breathing patterns. Knowledge discovery is used to discover patterns in previously classified time series of breathing cycles. Single classified breathing patterns are not always sufficiently reliable for classification. New patterns that may have a causal correlation to different diagnoses are generated and thereafter evaluated against all classified sequences of time series. If there is a correlation between a pattern and a particular diagnosis, the pattern is saved and used for improved classification of new unclassified series of breathing patterns. We propose that the approach is general for applications in which time series of classified events are of relevance for classification and diagnosis. An implementation will validate the approach fully.

References

1. Aamodt, A. and Plaza, E.: Case-based reasoning: foundational issues, methodological variations and system approaches. Artificial Intelligence Com. 7 39-59 (1994)
2. Bichindaritz, I., Kansu, E. and Sullivan, K. M.: Case-based reasoning in care-partner: Gathering evidence for evidence-based medical practice. In: Advances in CBR: The Proceedings of the 4th European Workshop on Case Based Reasoning 334-345 (1998)
3. Bichindaritz, I.: Mnaomia: Improving case-based reasoning for an application in psychiatry. In: Artificial Intelligence in Medicine: Applications of Current Technologies, AAAI 14-20 (1996)
4. Bichindaritz, I.: Case-Based Reasoning adaptive to several cognitive tasks. In: International Conference on Case-Based Reasoning (ICCBR-95), Aamodt, A. and Veloso, M. (eds.), Sesimbra, Springer LNAI 1010 391-400 (1995)
5. Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P.: From data mining to knowledge discovery. In: Advances in Knowledge Discovery and Data Mining, MIT Press 1-36 (1996)
6. Marling, C. and Whitehouse, P.: Case-based reasoning in the care of Alzheimer’s disease patients. In: Case-Based Research and Development 702-715 (2001)
7. Montani, S. and Bellazzi, R.: Intelligent Knowledge Retrieval for Decision Support in Medical Applications. In: MEDINFO 2001, V. Patel et al. (eds.), IOS Press 498-502 (2001)
8. Nilsson, M. and Sollenborn, M.: Advancements and trends in medical case based reasoning: An overview of systems and system development. In: Proceedings of the 17th International FLAIRS Conference 178-183 (2004)
9. Nilsson, M. and Funk, P.: A Case-Based Classification of Respiratory Sinus Arrhythmia. In: Proceedings of the 7th European Conference on Case-Based Reasoning, Madrid, Springer 673-685 (2004)
10. Olsson, E., Funk, P. and Bengtsson, M.: Fault Diagnosis of Industrial Robots using Acoustic Signals and Case-Based Reasoning. In: Proceedings of the 7th European Conference on Case-Based Reasoning, Madrid, Springer 686-701 (2004)
11. Olsson, E., Funk, P. and Xiong, N.: Fault Diagnosis in Industry Using Sensor Readings and Case-Based Reasoning. Journal of Intelligent & Fuzzy Systems, vol. 15, ISSN 1064-1, IOS Press, December (2004)
12. Perner, P.: Are Case-Based Reasoning and Dissimilarity-Based Pattern Recognition two Sides of the same Coin? In: Machine Learning and Data Mining in Pattern Recognition, Springer Verlag 35-52 (2001)
13. Richter, M.: The knowledge contained in similarity measures. In: M. Veloso and A. Aamodt (eds.), Case-Based Reasoning Research and Development: Proceedings of the 1st International Conference on Case Based Reasoning, Sesimbra, Portugal (1995)
14. Schmidt, R. and Gierl, L.: Temporal abstractions and case-based reasoning for medical course data: Two prognostic applications. In: Machine Learning and Data Mining in Pattern Recognition, volume 2123 of Lecture Notes in Computer Science, 23-34 (2001)
15. Sollenborn, M. and Funk, P.: Category-Based Filtering in Recommender Systems for Improved Performance in Dynamic Domains. In: Proceedings of the 2nd International Conference on Adaptive Hypermedia and Adaptive Web Based Systems, Malaga, Spain, May 436-439 (2002)
16. von Schéele, B.: Classification Systems for RSA, ETCO2 and other physiological parameters, PBM Stressmedicine, Technical report, www.pbmstressmedicine.se, Sweden 1-8 (1999)
17. Xiong, N. and Funk, P.: A Novel Framework for Similarity Modeling in Case Based Reasoning. In: CI2005, International Conference on Computational Intelligence, Calgary, Canada (2005)