Mining Association Rules from a Pediatric Primary Care Decision Support System

Stephen M. Downs, MD, MS and Michael Y. Wallace, MS
Departments of Pediatrics and Biomedical Engineering
University of North Carolina at Chapel Hill

1067-5027/00/$5.00 © 2000 AMIA, Inc.

ABSTRACT

The purpose of this study was to apply an unsupervised data mining algorithm to a database containing data collected at the point of care for clinical decision support. The data set was taken from the Child Health Improvement Program (CHIP), a preventive services tracking and reminder system in use at the University of North Carolina. The database contains over 30,000 visits. We used a previously described pattern discovery algorithm to extract 2nd and 3rd order association rules from the data and reviewed the literature to see if the associations had been described before. The algorithm discovered 16 2nd order associations and 103 3rd order associations. The 3rd order associations contained no new information. The 2nd order associations demonstrated a covariance among a range of health risk behaviors. Additionally, the algorithm discovered that both tobacco smoke exposure and chronic cardiopulmonary disease are associated with failure on developmental screens. These relationships have been described before and have been attributed to underlying poverty. The work demonstrates the ability of unsupervised data mining by rule association on sparse clinical data to discover clinically important associations. However, many associations may be previously known or explained by confounding variables.

INTRODUCTION

One goal of clinical computing is to provide decision support and capture clinical data at the point of care. These data can then be analyzed in order to increase medical knowledge. The process of identifying novel and potentially useful patterns in data has been described as "data mining" and "knowledge discovery."[1] Data mining and knowledge discovery "close the loop" between clinical data capture and evidence-based decision support by facilitating the conversion of clinical data into evidence for future decision support.

One of the greatest challenges to mining clinical data is that data captured in support of clinical care, unlike data captured for research purposes, may be incomplete. This is especially true for data collected by elaborate decision support systems. In fact, the more a clinical decision support system guides clinical care, the more the data collected tend to be selective and, therefore, incomplete. This poses special challenges to data mining that are not inherent in more complete data sets like billing or research data. Wong and Wang have proposed a rule induction algorithm that tolerates up to 15% missing data.[2] We wanted to demonstrate the application of this algorithm to a large data set collected during routine use of a pediatric decision support system and to explore the ability of this algorithm to "discover" novel and non-trivial knowledge.

METHODS

The CHIP Database

The data mining algorithm was applied to data collected during routine use of the Child Health Improvement Program (CHIP). CHIP is a childhood preventive services tracking and reminder system that has been in use in the Pediatrics Clinics at the University of North Carolina since 1995. The system has also been successfully implemented in health department and private practice settings. CHIP contains a database of pediatric preventive services guidelines based on recommendations of the American Academy of Pediatrics, the US Preventive Services Task Force, and the Centers for Disease Control and Prevention. The guidelines are stored as over 150 prompts to the clinician and are prioritized so that the most important preventive services are delivered first. A patient database tracks demographics, preventive services and risk factors for patients seen in the clinic. At each clinic visit, CHIP compares the guideline database to the patient database and determines which preventive services the child is eligible to receive during the visit. CHIP prints the ten highest priority prompts on a tailored worksheet that is used during the clinic visit. Each prompt consists of a "stem" explaining the prompt and one to six check boxes that allow the clinician to document risk factors, assessments,
services provided or referrals. Each prompt is associated with a "pass condition." The pass condition is a pattern of boxes that, when checked, indicates that the preventive service has been satisfied and the prompt is no longer needed on subsequent worksheets. The worksheet is bar coded so that, at the end of the clinic session, data captured on the worksheet can be optically scanned into the computer. These data are used to tailor subsequent worksheets to the patient's needs. In the past five years, the CHIP system at the University of North Carolina has collected data from over 30,000 visits by more than 7000 children.

Preprocessing

CHIP's patient database consists of records for each visit, keyed on medical record number and date of visit. Each record stores information on which prompts were printed on the worksheet and which boxes were checked for each prompt. In order to apply the data mining algorithm, the database was transformed so that each record represents one patient. Fields representing each of the prompts contained in the guideline database indicate whether the patient ever failed the pass condition for that prompt. Since CHIP contains over 150 prompts and each worksheet contains only 10 prompts, most fields in the transformed database are empty.

Data Mining

We used a modified version of the data mining algorithm described by Wong and Wang.[2] Each binary variable was decomposed into two variable-value pairs, defined as primary events. The algorithm described by Wong considered all combinations of primary events from order 2 to order N, where N is the total number of primary events. For this pilot study, we considered only events up to order 3. For each event of order two or three, the algorithm compares the observed frequency of occurrence of the event, O, to the expected frequency of occurrence, E, based on the marginal probabilities of the component primary events, using the standardized residual, z:

$$z = \frac{O - E}{\sqrt{E}}$$

Because the standardized residual has a chi-squared distribution, it can be approximated by the normal distribution when adjusted by the sample variance, v, giving the adjusted standardized residual, d:

$$d = \frac{z}{\sqrt{v}}$$

where v is given by:

$$v = 1 - \frac{E}{N}$$

and N is the sample size. The probability that d will exceed any particular value by chance is given by standard tables for the z-statistic. This threshold can be set arbitrarily. For this pilot study, we considered compound events whose adjusted standardized residual exceeded 1.96 and which, therefore, had less than a 5% probability of occurring by chance.

Wong and Wang reported that their algorithm functioned well with up to 15% missing values. Because the CHIP system contains 150 variables but only 10 are used at each visit, the CHIP database is sparser. We modified the algorithm to calculate the residuals using only records for which the values of the events under consideration were not missing.

Variable                                 Values
Sex                                      M, F
Developmental screening                  Pass, fail
Poor weight gain                         True, false
Short stature                            True, false
Microcephaly                             True, false
Hearing risk assessment                  Pass, fail
Breast feeding                           Yes, no
Lead exposure risk assessment            Pos., neg.
Tuberculosis risk assessment             Pos., neg.
Tuberculosis skin test (PPD)             Pos., neg.
Anemia screening test                    Pos., neg.
Influenza risk                           True, false
Own infant/toddler car seat              True, false
Use infant safety seat correctly         True, false
Use toddler safety seat correctly        True, false
Have working smoke detectors             True, false
Varicella vaccine after age 12           True, false
Hearing screen                           Pass, fail
Vision screen                            Pass, fail
Over/underfed (0-6 weeks)                True, false
Prone sleep position (SIDS risk)         True, false
Blood lead screen                        Pos., neg.
Environmental tobacco smoke              True, false
Parental tobacco smoking cessation       True, false
Recurrent acute otitis media             True, false
Persistent otitis media with effusion    True, false

Table 1. 26 variables representing the 52 primary events evaluated by the algorithm.

Pruning the Search Space

The run time of an algorithm that exhaustively evaluates all possible combinations of primary events increases geometrically. Therefore, we used two strategies to limit the search space.

Following Wong,[2] we reasoned that rare or absent events are unlikely to participate in higher level events. So, primary events that occurred fewer than three times in the database were placed on a "negative event" list. Higher order events containing events on this list are not considered further in the algorithm. We also placed compound events with an adjusted standardized residual less than -1.96 on the negative event list, reasoning that these events and higher order events that contain them happen extremely rarely. Wong also found that higher order events rarely resulted in significant residuals. For this pilot, we chose to look only at second or third order events.

Variable Selection

Another strategy for limiting the run time for the algorithm was to select variables. A pediatrician on the research team reviewed all of the variables in the CHIP database and selected the set of 26 that he thought were most interesting (Table 1). Variables selected were those that dealt with the child's health status (e.g., growth parameters, vision or hearing screens, developmental assessment), risk assessments (e.g., lead, tuberculosis, tobacco smoke exposures) and parents' reported health behaviors (e.g., car seats, smoke detectors). We did not include variables that documented services or counseling provided by the physician without any assessment of the child (e.g., when to introduce solid foods).
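The residual test described under Data Mining, including the restriction to records in which both variables are present, can be sketched in Python for a second-order event. This is a minimal illustration and not the authors' implementation. The function name `adjusted_residual`, the record layout (a dictionary per patient with `None` for a missing field), and the toy counts are all assumptions made for the example. The variance term `v = 1 - E/N` is one reading of the variance formula in the text and should be checked against Wong and Wang [2] before any real use.

```python
from math import sqrt

def adjusted_residual(records, event_a, event_b):
    """Return (O, E, N, z, d) for a second-order event, or None if undefined.

    records: list of dicts mapping variable name -> value, None meaning missing.
    event_a, event_b: (variable, value) pairs, i.e. primary events.
    """
    (var_a, val_a), (var_b, val_b) = event_a, event_b
    # Use only records in which both variables were actually assessed.
    complete = [r for r in records
                if r.get(var_a) is not None and r.get(var_b) is not None]
    n = len(complete)
    if n == 0:
        return None
    count_a = sum(1 for r in complete if r[var_a] == val_a)
    count_b = sum(1 for r in complete if r[var_b] == val_b)
    observed = sum(1 for r in complete
                   if r[var_a] == val_a and r[var_b] == val_b)
    # Expected count under independence, from the marginal frequencies.
    expected = count_a * count_b / n
    if expected == 0 or expected == n:
        return None  # residual is undefined at these extremes
    z = (observed - expected) / sqrt(expected)  # standardized residual
    v = 1 - expected / n                        # ASSUMED form of the variance term
    d = z / sqrt(v)                             # adjusted standardized residual
    return observed, expected, n, z, d

def make(smoke, breastfed, count):
    # Hypothetical helper: repeat one toy patient record `count` times.
    return [{"smoke": smoke, "breastfed": breastfed}] * count

# Invented toy data, not CHIP data; the last group is missing one variable.
records = (make(True, False, 30) + make(True, True, 10) +
           make(False, False, 10) + make(False, True, 50) +
           make(True, None, 20))

observed, expected, n, z, d = adjusted_residual(
    records, ("smoke", True), ("breastfed", False))
significant = d > 1.96  # threshold used in the pilot study
```

In this toy data set the 20 records with a missing breast feeding value are excluded, so the residual is computed on the remaining 100 records rather than on all 120.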

Evaluating Results

We anticipated that some associations discovered by the algorithm would be trivial, and that some would be well-recognized associations. For associations not known to the pediatrician on the team, we conducted a literature review to determine if the associations had been described in the literature previously.

RESULTS

For the 26 variables considered by the algorithm, there were 52 primary events and 990 possible second order associations of which 16 had significant standardized residuals. The time needed to evaluate these associations was 979 seconds on a 450 MHz Pentium II CPU with 128 MB RAM.

For the same 52 primary events, there were 14,190 possible third order associations of which 103 had significant residuals. Running the algorithm to investigate third order events took 20.47 hours.

Associations Discovered

A pediatrician reviewed the associations discovered by the algorithm. The 16 second-order associations were classified into two types (Table 2). The first group related to various health behaviors and showed a high covariance among several "unhealthy" behaviors. These include not breast feeding, incorrect use of car seats and lack of smoke detectors, environmental tobacco smoke exposure in the home, and unsafe infant sleep positioning. For example, children whose parents did not use an automobile safety seat correctly were also less likely to have a smoke detector in the home. Parents who smoke are less likely to breast feed their child.

A second set of associations revealed predictors of clinical outcomes such as failed developmental screening and growth failure. These include exposure to environmental tobacco smoke and the presence of a chronic heart or lung disease (e.g., asthma) that requires annual influenza vaccination.

2nd Order Associations                                           Resid   Effct Size   Samp size

Health Behaviors
Smoke exposure and no breast feeding                             7.10    1.39         1648
Smoke exposure and no smoke detectors                            2.43    1.6          2189
Car seat misuse and no smoke detector                            4.61    2.5          1740
Car seat misuse and no breast feeding                            2.55    1.2          1712
No breast feeding and prone sleep position (SIDS risk)           2.08    1.2          632
Infant car seat misuse and toddler car seat misuse               3.11    1.7          678
Prone sleep position (SIDS risk) and smoke exposure              2.23    1.3          683
Prone sleep position (SIDS risk) and car seat misuse             2.52    1.7          637
Lead exposure risk and tobacco smoke exposure                    2.48    1.3          1179
Breast feeding and no tobacco smoke exposure                     5.60    1.2          1648

Clinical Outcomes
Chronic cardiopulmonary disease and failed developmental screen  2.51    2.5          787
Smoke exposure and poor growth                                   2.47    1.5          362
Smoke exposure and failed developmental screen                   2.27    1.4          3334
Lead exposure risk and failed developmental screen               2.24    1.8          1203
Microcephaly and hearing impairment risk                         2.51    1.6          263
Car seat misuse and failed developmental screen                  2.27    1.9          1797

Table 2. Second-order association rules derived by the algorithm. "Resid" is the adjusted standardized residual for the association. The "Effct Size" is the ratio of the observed to the expected frequency of the compound event. The "Samp size" is the number of records that contained (non-missing) values for both variables in the association.
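As a rough arithmetic check on the search-space sizes and run times reported in RESULTS, the short Python sketch below counts unordered pairs and triples of events and compares the growth in candidates with the growth in run time. The text does not say how the figures 990 and 14,190 were derived. They coincide with the number of pairs and triples that can be drawn from 45 items rather than 52; whether this means that seven primary events were excluded before enumeration (for example, through the negative event list) is only a guess and is labeled as such in the code.

```python
from math import comb

# Figures reported in RESULTS.
reported_pairs = 990
reported_triples = 14190
seconds_order2 = 979
seconds_order3 = 20.47 * 3600  # 20.47 hours expressed in seconds

# Unordered combinations drawn from all 52 primary events.
pairs_52 = comb(52, 2)
triples_52 = comb(52, 3)

# ASSUMPTION: 45 is chosen only because it reproduces the reported counts;
# the source does not state that 45 events entered the enumeration.
pairs_45 = comb(45, 2)
triples_45 = comb(45, 3)

# Growth from second-order to third-order evaluation.
candidate_growth = reported_triples / reported_pairs
runtime_growth = seconds_order3 / seconds_order2

print(f"pairs/triples from 52 events: {pairs_52}, {triples_52}")
print(f"pairs/triples from 45 events: {pairs_45}, {triples_45}")
print(f"candidate growth: {candidate_growth:.1f}x, "
      f"run time growth: {runtime_growth:.1f}x")
```

On these figures, run time grew considerably faster than the number of candidate events when moving from second-order to third-order associations, which is consistent with the decision to stop at order three in this pilot.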

All of the 103 third order associations contained the second order events already discovered. Therefore, the third order associations simply showed that the relationships described by the second order associations applied within as well as across subgroups. No relationships were found to be exclusive to any one subgroup.

Comparison to Previous Reports

We conducted a literature search to determine if previous investigators had described the associations between environmental tobacco smoke exposure or cardiopulmonary disease and developmental delays or among the various health behaviors.

When we investigated the association of tobacco smoke exposure with failure on the developmental screening test performed in the clinics, we found that several investigators had independently described this association. In a sample of 3- and 5-yr.-old children, Johnson found maternal smoking in the home was significantly and inversely related to IQ in children of normal birth weight and without neurological problems.[3] Several investigators attributed this relationship to prenatal smoking. Fried found that in a sample of children at 12 and 24 months of age, prenatal exposure to cigarette smoking was significantly associated with poorer language development and lower cognitive scores at both 36 and 48 months after statistically controlling for confounding factors.[4] Frydman compared two samples of children, aged 4 to 5, and aged 6 to 7 (40 children in total), whose mothers had smoked during pregnancy, with two samples of 40 children of the same ages whose mothers had not smoked and found a difference of more than 15 IQ points in favor of the children of nonsmoking mothers.[5] Trasti found that at 5 years, children of smokers had an increased risk of getting an IQ score below the median value of the population, but the risk was reduced when adjusted for maternal education (OR = 1.6, 95% CI: 0.9-3.7).[6]

We did not find studies that described a direct association between chronic cardiopulmonary disease (e.g., asthma) and developmental delay among otherwise healthy children. However, these problems may coexist in impoverished families.[7] In fact, Aber reviews the literature showing a high covariance among a range of health risks. For example, poverty is associated with increased neonatal and post-neonatal mortality rates, greater risk of injuries and asthma, and lower developmental scores in a range of tests at multiple ages.[7] However, the mechanisms by which these risks are modulated are not clear. The covariance of health behaviors discovered in this data set may help explain how many of these risks are associated.

DISCUSSION

In this study we demonstrated the feasibility of using data mining techniques to extract association rules from a data set obtained through routine operation of a preventive care decision support system. We found that, in this data set, association rules of order three did not contain knowledge that was not contained in the second order rules.

This data set suffers from the flaws characteristic of data captured at the point of care for the purpose of providing decision support. The data are sparse and inconsistently collected. Nonetheless, the algorithm extracted clinical knowledge about health risk behaviors and clinical outcomes that has been borne out in previous work. The direct association between chronic cardiopulmonary disease (e.g., asthma) and developmental delay among otherwise healthy children was a novel finding. However, the literature shows a high covariance among a range of health risks that may explain the coexistence of these problems in impoverished families.

The decision to calculate residuals using only records for which the values of the events under consideration were not missing was necessitated by the large amount of missing data. Many variables were missing more often than present because the CHIP decision support system did not print the prompts that assessed them. As a result, most associations were among missing values when these records were not eliminated.

Of course, in addition to finding valid, causal relationships in clinical data, data mining will also find all of the spurious and idiosyncratic relationships among the data in a particular data set. For this reason, results of any data mining procedure should be considered exploratory and hypothesis-generating only. Likewise, relationships found by unsupervised data mining algorithms may be confounded. The relationship between two variables may result from a third, unmeasured variable that affects both of the measured variables. This is illustrated in our study by the effect of poverty on both asthma frequency and developmental delays.
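The way a third variable can produce an association between two measured variables can be illustrated with a small worked example in Python. The counts below are invented for illustration and are not CHIP data; the stratum labels `poor` and `not_poor` and the function name `effect_size` are likewise hypothetical. The example uses the observed-to-expected ratio, which is how Table 2 defines effect size, and shows two variables that are independent within each stratum of a confounder yet appear associated when the strata are pooled.

```python
def effect_size(both, only_a, only_b, neither):
    """Ratio of observed to expected co-occurrence of events A and B."""
    n = both + only_a + only_b + neither
    # Expected co-occurrence under independence, from the marginal counts.
    expected = (both + only_a) * (both + only_b) / n
    return both / expected

# Invented counts per stratum: (A and B, A only, B only, neither).
# Within each stratum A and B are independent by construction.
strata = {
    "poor":     (64, 16, 16, 4),   # A and B each present in 80% of records
    "not_poor": (4, 16, 16, 64),   # A and B each present in 20% of records
}

# Pool the strata, as happens when the confounder is not measured.
pooled = tuple(sum(cell) for cell in zip(*strata.values()))

stratum_ratios = {name: effect_size(*cells) for name, cells in strata.items()}
pooled_ratio = effect_size(*pooled)

print(stratum_ratios)  # ratio within each stratum
print(pooled_ratio)    # ratio when the confounder is ignored
```

Here the ratio is 1.0 within each stratum but 1.36 in the pooled data, so an algorithm that never sees the stratifying variable would report an association that disappears once that variable is taken into account.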

To account for confounding variables, a data mining algorithm can be supervised. The qualitative relationships among variables can be specified manually based on existing domain knowledge. This can be accomplished, for example, by data mining with Bayesian belief networks.[8] Bayesian networks offer the advantage of allowing the user to specify known relationships among the variables so that the data mining algorithm takes these into account. However, this specification can be very laborious and permits the introduction of investigator bias. Unsupervised training of Bayesian networks is also possible, but requires very large datasets.

Association rule induction is an easily understood approach to discovering potentially important clinical knowledge even in data that are sparse and inconsistently collected. Though some findings will be spurious and many previously described, the technique offers the potential to discover completely novel associations.

Acknowledgments

This work was funded by grants from the Agency for Healthcare Research and Quality (R01 HS09507) and the Robert Wood Johnson Generalist Faculty Scholars Program. Mr. Wallace was funded by the National Library of Medicine (2T15 LM07071).

REFERENCES

1. Fayyad, U., Data mining and knowledge discovery: Making sense out of data. IEEE Expert, 1996. October: p. 20-25.
2. Wong, A. and Y. Wang, High-order pattern discovery from discrete-valued data. IEEE Transactions on Knowledge and Data Engineering, 1997. 9(6): p. 877-92.
3. Johnson, D.L., et al., Adult smoking in the home environment and children's IQ. Psychological Reports, 1999. 84(1): p. 149-54.
4. Fried, P.A. and B. Watkinson, 36- and 48-month neurobehavioral follow-up of children prenatally exposed to marijuana, cigarettes, and alcohol. Journal of Developmental & Behavioral Pediatrics, 1990. 11(2): p. 49-58.
5. Frydman, M., The smoking addiction of pregnant women and the consequences on their offspring's intellectual development. Journal of Environmental Pathology, Toxicology & Oncology, 1996. 15(2-4): p. 169-72.
6. Trasti, N., et al., Smoking in pregnancy and children's mental and motor development at age 1 and 5 years. Early Human Development, 1999. 55(2): p. 137-47.
7. Aber, J.L., et al., The effects of poverty on child health and development. [Review] [94 refs]. Annual Review of Public Health, 1997. 18: p. 463-83.
8. Heckerman, D., Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1997. 1: p. 79-119.