A CBR Solution for Missing Medical Data

Olga Vorobieva 1), Alexander Rumyantsev 2), Rainer Schmidt 1)

1) Institute for Medical Informatics and Biometry, University of Rostock, Germany
2) Pavlov State Medical University, St. Petersburg, Russia

Abstract. In this paper, we present a CBR approach that deals with missing data. The conversational ISOR system involves different knowledge sources, including medical experts. Its case base stores rules and formulae that support the restoration of numerous missing values. The task is to restore missing values in an observed medical data set. The presented method is applied to a set of physiological and biochemical measurements of dialysis patients. The measurements were taken at four time points during a year in which the patients participated in a specially developed physical training program. Restoring the missing values is a prerequisite for analysing the obtained data.

1 Introduction

Databases with many variables have specific problems. Since it is very difficult to get an overview of their content, a user usually does not know a priori how complete the data set is. Are any data missing? How many, and where are they located?

We have a set of observed medical data and want to extract as much information as possible from it. The extraction is planned stepwise. At each step we pose and solve a data-mining problem. The first problem, “why do some exceptional patients not fit a specific hypothesis”, was presented in our paper at last year's “CBR in the Health Science” workshop [1]. When solving this problem, we were confronted with further problems, especially with a missing data problem.

In this paper, we deal with data that are missing randomly and without any regularity. There can be different reasons for randomly missing data. Generally, the most frequent causes are input failures, misprints, lost data, and data that were not measured at all. Here we deal with a situation where the main cause is that many measurements did not take place. Since many data are missing, we want to fill in these empty spaces in our data set. We assume that the data set contains groups of interdependent variables, but a priori we do not know how many such groups there are, which variables are dependent, and in which way they are dependent. However, we intend to make use of all possible forms of dependency to restore missing data.

One possibility to obtain information from observed data is modelling, especially creating mathematical models. Our way of modelling is a combination of a relatively simple initial model with a rather demanding CBR approach [1]. It requires main factors (selected from the list of observed parameters) to set up an initial model, and additional data, especially from parameters not used in the model, to find an explanation for exceptional cases. Therefore, the more complete our observed database is, the easier it is to find explanations and the better the explanations should be. Missing data, in contrast, force a researcher either to shorten the sample by excluding patients with missing data or to select as main factors those parameters with few missing data, which are not necessarily the most important parameters. Data analysis methods are often valued according to their tolerance of missing data, as in [2].

In principle, there are two main approaches to the missing data problem. The first approach is the direct restoration of missing values. Statistical restoration of a missing value is usually based on non-missing values from other records. A missing value is substituted by some function of existing values, for example an average of some values. The second approach suggests methods that accept the absence of some data. Those methods differ in sophistication, from simply excluding cases with missing values up to rather sophisticated statistical models [3, 4]. Gediga and Düntsch propose to use CBR to restore missing data. Since their approach does not require any external information, they call it a non-invasive imputation method. Missing data are supposed to be replaced by the corresponding values of the most similar retrieved cases [5].

Why don’t we just apply statistical methods? Firstly, statistical methods require homogeneity of the sample. We have no reason to expect the set of our patients to be a homogeneous sample. Instead, we are going to find out the common tendency of the sample, identify all exceptions to the main tendency, and study every exception separately. Our observed data has the additional advantage that many variables were measured. From given values of many variables, missing values can sometimes be calculated or estimated. Therefore, to calculate missing values we do not use given values of other cases but given values of the actual case. Secondly, statistical methods are developed for a closed information system. This means that no external knowledge is considered besides that contained in the statistical model, whereas in ISOR external knowledge can be considered. In fact, ISOR was especially designed to make use of various external knowledge sources.

2 Dialysis and fitness

We deal with medical data observed on a group of dialysis patients. Dialysis means stress for a patient’s organism and has significant adverse effects. Fitness is the most readily available and a relatively cheap way of support. It is meant to improve the physiological condition of a patient and to compensate for negative dialysis effects. At our university clinic in St. Petersburg, a specially developed complex of physiotherapy exercises including simulators, walking, swimming etc. was offered to all dialysis patients, but only some of them actively participated, whereas some others participated but were not really active. The purpose of this fitness offer was to improve the physical condition of the patients and to increase their quality of life. One of the intended goals is to convince the patients of the positive effects of fitness and to encourage them to make an effort and to take up sports actively. This is important because dialysis patients usually feel sick, are physically weak, and do not want any additional physical load [6].

The theoretical hypothesis is that actively participating in the fitness program improves the physical condition of dialysis patients. Instead of reliable theoretical knowledge and intelligent experience, we have just this theoretical hypothesis and a set of measurements. In such situations the usual question is how well the measured data fit the theoretical hypothesis. To statistically confirm a hypothesis it is necessary that the majority of cases fit the hypothesis. Mathematical statistics determines the exact quantity of necessary confirmation [7]. However, usually a few cases do not satisfy the hypothesis. We examine them to find out why they do not satisfy the hypothesis. Our system, ISOR, offers a dialogue to guide the search for possible reasons in all components of its data system. These exceptional cases belong to the case base. This approach is justified by a certain mistrust of statistical models among doctors, because modelling results are usually unspecific and “average oriented” [8], which means a lack of attention to individual “imperceptible” features of concrete patients.

In our former work [1] we took first steps to combine classical statistical methods with Case-based Reasoning. A rather simple but easily interpretable statistical model and a limited number of the measured parameters were used. In [1] a binary model is described that involves just three selected main variables. It is reasonable to assume that such a simple model based on very few variables has some inaccuracies. However, it is supposed to express the general tendency, especially as we managed to overcome most model inaccuracies by applying CBR techniques and considering additional measured parameters from the observed data set.

The idea of combining CBR with other methods is not new. For example, CarePartner resorts to a multi-modal reasoning framework for the co-operation of CBR and Rule-based Reasoning (RBR) [9]. Another way of combining hybrid rule bases with CBR is discussed by Prentzas and Hatzilygeroudis [10]. The combination of CBR and model-based reasoning is discussed in [11]. Statistical methods are used within CBR mainly for retrieval and retention (e.g. [12, 13]). Arshadi proposes a method that combines CBR with statistical methods like clustering and logistic regression [14].

2.1 Data

For each patient a set of physiological parameters is measured. These parameters contain information about burned calories, maximal power, oxygen pulse (volume of oxygen consumption per heartbeat), lung ventilation and many others. Furthermore, there are biochemical parameters like haemoglobin and other laboratory measurements. All these parameters are supposed to be measured four times during the first year of participating in the fitness program. There is an initial measurement followed by a next one after three months, then after six months, and finally after a year. Since some parameters, e.g. the height of a patient, are supposed to remain constant within a year, they were measured just once. The other ones are regarded as factors with four grades; they are denoted as F0, the initial measurement of factor F, and F3, F6, and F12, the measurements of factor F after 3, 6, and 12 months. All performed measurements are stored in the observed database, which contains 150 records (one patient, one record) and 460 variables. 12 variables are constants; the other 448 variables represent 112 different parameters, each measured at four time points.

The factors cannot be considered completely independent of each other; there are different types of dependency among specific factors. Even a strict mathematical dependency can occur, e.g. in this triple: time of controlled training, work performed during this time, and average achieved power, expressed as Power = Work / Time. Less strict are relations between factors of a biochemical nature. For example, an increase of parathyroid hormone implies an increase of blood phosphorus. Those and many other relations that exist in principle between factors enable us to check recorded measurements and to fill in numerous missing data in the data set.

Unfortunately, not all planned measurements actually took place, some of them were carried out incorrectly, and some measurements were wrongly recorded or even lost. Therefore, the observed database contains many missing and definitely wrong data. It is necessary to note that measurements of dialysis patients differ essentially from measurements of non-dialysis patients, especially of healthy people, because dialysis destroys the natural relationships between physiological processes in an organism. In fact, for dialysis patients all physiological processes behave abnormally. Therefore, the correlation between parameters differs too. For statistics, this means difficulties in applying statistical methods based on correlation, and it limits the usage of a knowledge base developed for healthy people.

2.2 The role of ISOR

ISOR has been developed as a conversational system that helps a medical expert to find explanations for exceptional cases that do not fit the (mostly statistical) model. Its role is to organise an exchange of information among all “members” of the conversation. Two members are humans: a medical expert and the “case-based reasoner” (the system developer). The other “members” are various databases; the two main ones are the observed data set and the case base.

We did not attempt to create a medical knowledge base, because the list of observed variables is rather long and there are many possible relationships among the variables. A knowledge base that contains information about possible relations would be too large. Furthermore, we do not know in advance which of the variables required for the model and/or further research may have missing data. A complete knowledge base would have to contain all possible relations between the variables, whereas just a small part is required.

ISOR is a dialogue system that co-operates with a medical expert, who provides ideas, suggestions, and theoretical knowledge when they are urgently needed. So the expert can be seen as a source of external medical knowledge, but not in the traditional way as a knowledge provider who builds up a knowledge base. From the CBR point of view, the help of the medical expert is required at the adaptation stage. Since his help is considered “expensive”, the adaptation process should be as automatic as possible.

The second human member of the conversation is the program developer, called the “case-based reasoner”. Of course, this part of the conversation is not restricted to one person. The developer's role is to assist during the adaptation, especially to make verbal statements of the medical expert understandable for ISOR. This is not just a translation into predefined ISOR inputs; the “case-based reasoner” may have to make changes in the knowledge sources of the program.

3 Restoration of missing data

In ISOR, CBR is applied to restore missing data; the calculated values are filled into the observed database. The whole knowledge is contained in the case base, namely in the form of solutions of former cases.

3.1 Types of solutions

There are three types of numerical solutions: exact, estimated, and binary. Some examples and restoration formulas are shown in Table 1. All types of solutions are demonstrated by examples in section 3.2.

Table 1. Some examples of solutions and of restoration formulas. Abbreviations: BC = Breath consumption, BF = Breath frequency, BV = Breath volume, HT = Hematocrit, P = Phosphorus, PTH = Parathyroid hormone, PV = Plasma volume

| Missing parameter | Type of solution | Numeric solution (example) | Description | Parameters | Time points |
|---|---|---|---|---|---|
| PTH | Binary | 1 | If P(T) >= P(t) then PTH(T) >= PTH(t) else PTH(T) < PTH(t) | P, PTH | 0 and 6 |
| HT | Exact | 36.2 | HT = 100 * (1 - PV / (0.065 * Weight)) | PV, Weight | 6 |
| HT | Estimated | 29.1 | Y(6) = Y(3) * 0.66 + Y(12) * 0.33 | HT | 3 and 12 |
| WorkJ | Exact | 30447.1 | WorkJ = MaxPower * Time * 0.5 | Time, MaxPower | 12 |
| BC | Exact | 15.6 | BC = BF * BV | BF, BV | 12 |
| Oxygen pulse | Estimated | 10.29 | Linear regression | O2plus | 0 and 3 and 12 |
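The restoration formulas of Table 1 can be written as small functions. The following is only a minimal sketch and not the ISOR implementation: the function names are hypothetical, the hematocrit formula is read as HT = 100 * (1 - PV / (0.065 * Weight)), which is the reading that reproduces the 30% result of the first example in section 3.2, and the interpolation uses the exact weights 2/3 and 1/3 that Table 1 rounds to 0.66 and 0.33.

```python
def hematocrit_exact(pv: float, weight_kg: float) -> float:
    """Exact solution: HT = 100 * (1 - PV / (0.065 * Weight)), in percent."""
    return 100.0 * (1.0 - pv / (0.065 * weight_kg))


def estimate_at_6_months(y3: float, y12: float) -> float:
    """Estimated solution: linear interpolation between months 3 and 12.

    Assumption of this sketch: exact weights 2/3 and 1/3 instead of the
    rounded 0.66 and 0.33 given in Table 1.
    """
    return (2.0 * y3 + y12) / 3.0


def pth_trend(p_earlier: float, p_later: float) -> int:
    """Binary solution: 1 if PTH is assumed not to have decreased, else 0."""
    return 1 if p_later >= p_earlier else 0


def work_joules(max_power: float, time: float) -> float:
    """Exact solution: WorkJ = MaxPower * Time * 0.5."""
    return max_power * time * 0.5


def breath_consumption(bf: float, bv: float) -> float:
    """Exact solution: BC = BF * BV."""
    return bf * bv
```

With the values of the first example in section 3.2 (PV = 3.367, weight = 74 kg), hematocrit_exact returns approximately 30, in agreement with the text.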

When a missing value can be completely restored, the result is called an exact solution. Exact solutions are based on other parameters. A medical expert has defined them as specific relations between parameters while using ISOR, and the developer has translated them into input for ISOR. As soon as they have been used once, they are stored in the case base of ISOR and can be retrieved for further cases.

Since estimated solutions are usually based on domain independent interpolation, extrapolation, or regression methods, a medical expert is not involved. An estimated solution is not considered a full reconstruction but just an estimation.

A binary solution is a partial reconstruction of a missing value. Sometimes ISOR is not able to construct either an exact or an estimated solution, but the expert may draw a conclusion about an increase or decrease of the missing value. So, a binary solution expresses just the assumed trend. “1” means that the missing value should have increased since the last measurement, whereas “0” means that it should have decreased. Binary solutions are used in the qualitative models of ISOR [1].

3.2 Examples

We demonstrate by three typical examples how missing data are restored in the ISOR system.

First example: Exact solution. The value of hematocrit (HT) after six months is missing. Hematocrit is the proportion of the blood volume that consists of red blood cells, so hematocrit measurements are expressed as a percentage. The retrieved solution (the exact HT formula in Table 1) requires two additional parameters, namely plasma volume (PV) and the weight of the patient. For the query patient these values (measured after six months) are “weight = 74 kg and PV = 3.367”. Inserting these values into the formula HT = 100 * (1 - PV / (0.065 * Weight)) gives 0.065 * 74 = 4.81 and PV / 4.81 = 0.7, and thus a hematocrit value of 30%. This restoration is domain dependent; it combines three parameters in such a specific way that it cannot be applied to any other parameter. However, the formula can of course be rearranged in two other ways, so that it can also be applied to restore values of PV and of the weight of the patient. The formula contains specific medical knowledge that was once given as a case solution by an expert.

Second example: Estimated solution. The situation is the same as in the first example. The value of hematocrit that should have been measured after six months is missing. Unlike in the first example, the PV value that is required to apply the domain dependent formula is now also missing. Since no other solution for an exact calculation can be retrieved, ISOR attempts to generate an estimated solution. Of course, estimated solutions are not as good as exact ones, but they are acceptable. ISOR retrieves a domain independent formula (the estimated HT formula in Table 1) which states that a missing value after six months should be calculated as the sum of two-thirds of the value measured after three months and one-third of the value measured after twelve months. This general calculation can be used for many parameters.

Third example: Binary solution. The value of parathyroid hormone (PTH) after six months is missing and shall be restored. The retrieved solution involves the initial PTH measurement and the additional parameter phosphorus (P), namely the measurement after six months, P(6), and the initial measurement, P(0). Informally, the solution states that an increase of phosphorus goes along with an increase of PTH. More formally, the retrieved solution states:

If P(6) >= P(0) then PTH(6) >= PTH(0) else PTH(6) < PTH(0)

So, here a complete restoration of the missing PTH value is not possible but just a binary solution that indicates the trend, where “1” stands for an increase and “0” for a decrease.

3.3 Case-based Reasoning

In ISOR, cases are mainly used to explain further exceptional cases that do not fit the initial model. The restoration of missing data is a sort of secondary application. The solutions given by the medical expert are stored in the form of cases so that they can be retrieved for solving further missing data cases. Such a case stored in the case base has the following structure:

1. Name of the patient
2. Diagnosis
3. Therapy
4. Problem: missing value
5. Name of the parameter of the missing value
6. Measurement time point of the missing value
7. Formula of the solution (the “Description” column of Table 1)
8. Reference to the internal implementation of the formula
9. Parameters used in the formula
10. Solution: restored value
11. Type of solution (exact, estimated, or binary)

Since the number of stored cases is rather small, there is no real retrieval problem. The retrieval is performed by keywords. The four main keywords are: problem code (here: “missing value”), diagnosis, therapy, and time period. As an additional keyword, the parameter whose value is missing can be used. Solutions that are retrieved by using the additional parameter keyword are domain dependent. They contain medical knowledge that has been provided by the medical expert. The domain independent solutions are retrieved by using just the four main keywords. The flow of restoring missing values is shown in Figure 1.

What happens when the retrieval provides more than one solution? Though only very few solutions are expected to be retrieved at the same time, only one solution should be selected. At first ISOR checks whether the required parameter values of the retrieved solutions are available. A solution is accepted if all required values are available. If more than one solution is accepted, the expert selects one of them. If no solution is accepted, ISOR attempts to apply the one with the fewest required parameter values.

Each sort of solution has its specific adaptation. A numerical solution is just the result of a calculation according to a formula. This kind of adaptation is performed automatically. If all required parameter values are available, the calculation is performed and the query case receives its numerical solution.
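The case structure, the keyword retrieval, and the selection rule described in this section can be sketched as follows. This is a hedged illustration under assumptions, not the ISOR code: the class name Case, its field names, and the functions retrieve and select are hypothetical, and required parameters are represented simply as strings.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Case:
    patient: str
    diagnosis: str
    therapy: str
    problem: str                     # problem code, e.g. "missing value"
    parameter: str                   # parameter of the missing value
    time_point: int                  # measurement time point in months
    formula: str                     # textual formula of the solution
    implementation: str              # reference to the internal implementation
    required: List[str]              # parameters used in the formula
    restored_value: Optional[float]  # solution: restored value
    solution_type: str               # "exact", "estimated", or "binary"


def retrieve(case_base: List[Case], problem: str, diagnosis: str,
             therapy: str, time_point: int,
             parameter: Optional[str] = None) -> List[Case]:
    """Keyword retrieval: four main keywords plus the optional parameter."""
    hits = [c for c in case_base
            if (c.problem, c.diagnosis, c.therapy, c.time_point)
            == (problem, diagnosis, therapy, time_point)]
    if parameter is not None:
        hits = [c for c in hits if c.parameter == parameter]
    return hits


def select(candidates: List[Case], available: Set[str]) -> List[Case]:
    """Accept solutions whose required values are all available.

    If several are accepted, the expert chooses among the returned list.
    If none is accepted, fall back to the candidate that requires the
    fewest parameter values.
    """
    accepted = [c for c in candidates if set(c.required) <= available]
    if accepted:
        return accepted
    if not candidates:
        return []
    return [min(candidates, key=lambda c: len(c.required))]
```

Returning a list from select keeps the expert's choice outside the code, which matches the conversational character of the system.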

Fig. 1: Flow of restoration

The second kind of adaptation modifies a restoration formula. This kind of adaptation cannot be done entirely automatically; the expert is involved. When a (usually short) list of solutions is retrieved, ISOR at first checks whether all required values of the exact calculation formulae are available. If required parameter values are not available, there are three alternatives to proceed: first, to find an exact solution formula for which all required parameter values are available; second, to find an estimation formula; and third, to attempt to restore the required values too. Since the third alternative carries the danger of an endless loop, this process can be stopped manually by pressing a button in a dialogue menu. When required values are missing for an estimated solution, ISOR asks the expert. The expert can suggest an exact or an estimated solution. Of course, such an expert solution also has to be checked for the availability of the required values. However, the expert can even provide just a numerical solution, a value to replace the missing data, with or without an explanation of this suggested value.

Furthermore, adaptation can be differentiated according to its domain dependency. Domain dependent adaptation rules have to be provided by the expert and they are only applicable to specific parameters. Domain independent adaptation uses general mathematical formulae that can be applied to many parameters. Two or more adaptation methods can be combined.
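The three alternatives above can be sketched as a recursive procedure. This is an illustration under stated assumptions rather than the ISOR algorithm: the function restore and its arguments are hypothetical, the endless loop is avoided here by tracking the parameters currently being restored (the text describes a manual stop button instead), and an exact result is only built from given or exactly restored inputs.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# A formula is a list of required parameter names plus a function of them.
Formula = Tuple[List[str], Callable[..., float]]


def restore(name: str,
            values: Dict[str, Optional[float]],
            exact: Dict[str, List[Formula]],
            estimate: Dict[str, Formula],
            in_progress: Optional[Set[str]] = None
            ) -> Optional[Tuple[float, str]]:
    """Return (value, solution type), or None if the expert must be asked."""
    in_progress = set() if in_progress is None else in_progress
    if name in in_progress:      # cycle guard (assumption of this sketch)
        return None
    in_progress.add(name)
    try:
        # Alternatives 1 and 3: try each exact formula and, where an input
        # is missing, try to restore that input exactly as well.
        for required, fn in exact.get(name, []):
            args = []
            for r in required:
                v = values.get(r)
                if v is None:
                    sub = restore(r, values, exact, estimate, in_progress)
                    v = sub[0] if sub and sub[1] == "exact" else None
                if v is None:
                    break
                args.append(v)
            else:
                return fn(*args), "exact"
        # Alternative 2: fall back to a domain independent estimation.
        if name in estimate:
            required, fn = estimate[name]
            args = [values.get(r) for r in required]
            if all(a is not None for a in args):
                return fn(*args), "estimated"
        return None
    finally:
        in_progress.discard(name)
```

A None result corresponds to the situation in which ISOR turns to the expert.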

In ISOR a revision occurs. It is the attempt to find better solutions. An exact solution is obviously better than an estimated one. So, if a value has been restored by estimation and later on (for a later case) the expert has provided an appropriate exact formula, this formula should be applied to the former case too. Some estimation rules are better than others, so it may happen that later on a more appropriate rule is incorporated into ISOR. In principle, the more new solution methods are included in ISOR, the more of the formerly restored values ISOR attempts to revise.

Artificial cases. Since every piece of knowledge provided by a medical expert is supposed to be valuable, ISOR saves it for future use. If an expert solution cannot be used for adaptation of the query case (required values for this solution might be missing), the expert user can generate an artificial case. ISOR provides a special dialogue menu for this. Artificial cases have the same structure as real ones, and they are also stored in the case base.
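The revision step described in this section can be sketched as a simple comparison of solution types. The sketch is only an illustration: the names Restored, RANK, and revise are hypothetical, the text states only that exact solutions are better than estimated ones, and ranking binary solutions lowest is an assumption made here because they are partial reconstructions. Differences in quality between estimation rules are not modelled.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed ordering: only "exact beats estimated" is stated in the text.
RANK = {"binary": 0, "estimated": 1, "exact": 2}


@dataclass
class Restored:
    value: float
    solution_type: str  # "exact", "estimated", or "binary"


def revise(restored: Dict[str, Restored], key: str,
           candidate: Restored) -> bool:
    """Replace a formerly restored value if the candidate is better."""
    current = restored.get(key)
    if (current is None
            or RANK[candidate.solution_type] > RANK[current.solution_type]):
        restored[key] = candidate
        return True
    return False
```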

4 Results

Since ISOR is a dialogue system and the solutions are generated within a conversation mainly between the system and the user, the quality of the solutions depends not only on ISOR but also on the expert user. To test our method we deleted a random set of parameter values from the observed data set. Subsequently, we applied our method and attempted to restore the missing values of this simulated data set. So far, we can just summarise how and how many missing values could be restored (Table 2). Since no values were missing for those 12 parameters that were measured only once and remain constant, they are not considered in Table 2. More than half of the missing values (59 of 97) could be at least partly restored, nearly a third (29 of 97) could be completely restored, and about 58% of the restorations (34 of 59) occurred automatically. However, 39% of the missing values (38 of 97) could not be restored at all. The main reasons are that for some parameters no proper method is available and that specific additional parameter values are required that are sometimes also missing.

Table 2. Summary of the numbers of missing and restored values.

| Quantity | Count |
|---|---|
| Number of parameters | 112 |
| Number of values | 448 |
| Number of missing values | 97 |
| Number of completely restored values | 29 |
| Number of estimated values | 17 |
| Number of partly restored values (binary) | 13 |
| Number of automatically restored values | 34 |
| Number of values restored with expert assistance | 25 |
| Number of values that could not be restored | 38 |

5 Conclusion

In this paper, a CBR approach to the missing data problem is presented. An application of the ISOR system to the problem of fitness and dialysis patients is shown. A statistical model is combined with Case-based Reasoning. The statistical model supports the hypothesis that fitness can improve the physical condition of dialysis patients, whereas with the help of Case-based Reasoning the exceptional cases that contradict this hypothesis can be explained. Unfortunately, many data are missing. Since the model gets better the fewer data are missing, we attempt to fill in the missing data. This is done in a conversational process between a medical expert, ISOR, and the system developer. Since the time of a medical expert is valuable, we attempt to make as few demands on him as possible. He is asked only when absolutely necessary. We do not ask him to create a knowledge base, because that would be much too time consuming. So, the main work is done by CBR.

In ISOR, all main CBR steps are performed: retrieval, adaptation, and revision. Retrieval (usually of a list of solutions) is performed with the help of keywords. Adaptation is an interactive process between ISOR, a medical expert, and the system developer. In contrast to many CBR systems, revision plays an important role in ISOR. The whole knowledge is contained in the case base, namely as solutions of former cases. No further knowledge base is required; just the knowledge we really need is stored in the case base.

References

1. Vorobieva O, Rumyantsev A, Schmidt R: Incremental development of an explanation model for exceptional dialysis patients. In: Bichindaritz I, Montani S (eds.): Workshop CBR in the Health Science (2006) 170-178
2. McSherry D: Interactive Case-Based Reasoning in sequential diagnosis. Applied Intelligence 14 (1) (2001) 65-76
3. Little R, Rubin D: Statistical analysis with missing data. John Wiley & Sons (1987)
4. Fleiss J: The design and analysis of clinical experiments. John Wiley & Sons (1986)
5. Gediga G, Düntsch I: Maximum Consistency of Incomplete Data via Non-Invasive Imputation. Artificial Intelligence Review 19 (1) (2003) 93-107
6. Davidson AM, Cameron JS, Grünfeld J-P et al. (eds.): Oxford Textbook of Nephrology, Volume 3. Oxford University Press (2005)
7. Kendall MG, Stuart A: The advanced theory of statistics. 4th ed. Macmillan Publishing, New York (1979)
8. Hai GA: Logic of diagnostic and decision making in clinical medicine. Politheknica publishing, St. Petersburg (2002)
9. Bichindaritz I, Kansu E, Sullivan KM: Case-based Reasoning in Care-Partner. In: Smyth B, Cunningham P (eds.): Proc EWCBR-98, Springer, Berlin (1998) 334-345
10. Prentzas J, Hatzilygeroudis I: Integrating Hybrid Rule-Based with Case-Based Reasoning. In: Craw S, Preece A (eds.): Proc ECCBR 2002, Springer, Berlin (2002) 336-349
11. Shuguang L, Qing J, George C: Combining case-based and model-based reasoning: a formal specification. Proc APSEC'00 (2000) 416
12. Corchado JM, Corchado ES, Aiken J et al.: Maximum likelihood Hebbian learning based retrieval method for CBR systems. In: Ashley KD, Bridge DG (eds.): Proc ICCBR 2003, Springer, Berlin (2003) 107-121
13. Rezvani S, Prasad G: A hybrid system with multivariate data validation and Case-based Reasoning for an efficient and realistic product formulation. In: Ashley KD, Bridge DG (eds.): Proc ICCBR 2003, Springer, Berlin (2003) 465-478
14. Arshadi N, Jurisica I: Data Mining for Case-based Reasoning in high-dimensional biological domains. IEEE Transactions on Knowledge and Data Engineering 17 (8) (2005) 1127-1137