The Accuracy of Medicare's Hospital Claims Data: Progress Has Been Made, but Problems Remain

Elliott S. Fisher, MD, MPH, Fredrick S. Whaley, PhD, W. Mark Krushat, MPH, David J. Malenka, MD, Craig Fleming, MD, John A. Baron, MD, MSc, and David C. Hsia, JD, MD, MPH

Introduction

Administrative health care databases provide an increasingly accessible and widely used source of data for health care research.1-10 However, the accuracy of hospital discharge data remains uncertain. Studies conducted in the 1970s found that the coding of nonclinical data, such as age, gender, and dates of admission and discharge, was highly accurate but that diagnoses and procedures were less reliably coded.11-13 Recent studies continue to raise questions about the accuracy of diagnostic coding.14-16 The National Diagnosis Related Group (DRG) Validation Study, which provided the most comprehensive recent assessment of the accuracy of hospital discharge data, found an overall error rate of 20.8% in DRG assignment.17 However, because many individual DRGs include diverse clinical conditions,18 the findings could not be generalized to the coding of specific diagnoses or procedures. Also, the published results could not be directly compared with those of the Institute of Medicine (IOM) studies of 1977 and 1980, even though both the National DRG Validation Study and the IOM studies used similar methods. We reanalyzed the data from the National DRG Validation Study to address these issues.

Methods

Record Selection and Reabstraction

The methods of the National DRG Validation Study have been described in detail elsewhere.17,19 Briefly, the Office of the Inspector General, US Department of Health and Human Services, randomly sampled hospitals from each of three bed-size strata, excluding specialty hospitals and those in states not using prospective payment during the study period. In the second stage, a random sample of 30 Medicare discharges between October 1, 1984, and March 31, 1985, was selected from each of the 239 hospitals, and copies of the medical records were requested. Of the 7076 discharges selected, 7050 charts (99.6%) were obtained. As previously reported, the sample accurately represented the population of all Medicare beneficiaries in prospective payment system jurisdictions.17 Accredited records technicians, blinded to the coding in the original records, reviewed the medical records, selected the supportable diagnoses and procedures, and translated them into International Classification of Diseases, Ninth Edition, Clinical Modification (ICD-9-CM) numeric codes. Reviewers adhered strictly to ICD-9-CM and Uniform Hospital Discharge Data Set coding rules. The ICD-9-CM codes were entered into the GROUPER program to assign the proper DRG. A second blinded recoding of 5% of the sample by different medical records technicians, conducted to determine the reliability of the reabstraction process, revealed no significant discrepancies in DRG assignment (agreement = 0.95, κ = 0.86, Z = 2.12).

Elliott S. Fisher, David J. Malenka, Craig Fleming, and John A. Baron are with the Department of Medicine and the Department of Community and Family Medicine, Dartmouth Medical School, Hanover, NH. Elliott S. Fisher and Craig Fleming are also with the Veterans Affairs Medical Center, White River Junction, Vt. Fredrick S. Whaley is with the Department of Community and Family Medicine, Dartmouth Medical School. W. Mark Krushat and David C. Hsia are with the Office of Analysis and Inspection, Office of the Inspector General, Department of Health and Human Services. Requests for reprints should be sent to Elliott S. Fisher, MD, MPH, Center for Evaluative Clinical Sciences, Dartmouth Medical School, HB 7250, Strasenburgh Hall, Hanover, NH 03755-3862. This paper was submitted to the journal November 26, 1990, and accepted with revisions June 10, 1991.

American Journal of Public Health 243

Comparison with Institute of Medicine Study

We chose to compare our data with those from the IOM study of the accuracy of the National Hospital Discharge Survey (NHDS) data13 rather than the IOM study of Medicare data12 because the sampling methods of the former were similar to those of the National DRG Validation Study, and because the 1977 NHDS discharge abstract was more similar to current discharge abstracts than were the Medicare abstracts of the 1970s. Although the earlier study used ICDA-8 rather than the currently employed ICD-9-CM, the broad diagnostic groupings we present are identical. To assess overall agreement, we determined the proportion of all reabstracted records that had the same principal diagnosis coded on the original record and on the reabstracted record at the specified level of truncation (third or fourth digit). We used the same method to determine overall agreement for procedures (except that truncation was at either the second or third digit). To assess agreement within NHDS diagnostic categories, we determined the proportion of reabstracted records with a specified principal diagnosis that had the same principal diagnosis coded on the original record. Calculations were weighted according to our sampling weights to ensure comparability with the IOM study.

Diagnosis- and Procedure-Specific Analyses of Coding Accuracy

To analyze the accuracy of coding with respect to specific procedures and diagnostic conditions, three clinicians familiar with both health services research and ICD-9-CM coding rules aggregated codes into clinically distinct groups of diagnoses and procedures such as might be used to select patient cohorts for studies of the outcomes of medical hospitalizations or surgical treatments. Within each diagnosis and procedure group, we compared the codes as recorded on the computerized abstract submitted to the Health Care Financing Administration (original record) with the codes as determined by the Office of the Inspector General (OIG) reabstraction of the medical record (reabstracted record). We evaluated the coding accuracy of the current data using the concepts of diagnostic test evaluation.20 Within this framework, we consider the reabstracted record to be the "gold standard" for defining the presence of a diagnostic condition or the performance of a particular procedure. Thus, the sensitivity (true positive rate) is the conditional probability that a diagnosis (or procedure) within the specified group was coded on the original record, given that a diagnosis (or procedure) within the specified group was actually present on the reabstracted record. Specificity (true negative rate) is the conditional probability that a diagnosis or procedure was not present on the original record, given that it was not coded on the reabstracted record. Positive predictive value is the conditional probability that a diagnosis (or procedure) was actually present on the reabstracted record, given that it had been coded on the original record. To calculate 95% confidence intervals for sensitivity and positive predictive value, we used an exact method based on the two-tail binomial test.21 For the analyses presented in Tables 2 and 3, we calculated the point estimates using both weighted and unweighted data. Because the results were very similar and the use of the unweighted data allowed us to calculate confidence intervals, we present our findings based on the unweighted data.
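The accuracy measures and exact confidence intervals described above can be sketched in code. This is an illustrative reconstruction, not the study's original analysis: the 2×2 counts are hypothetical, and the function names (`accuracy_measures`, `clopper_pearson`) are our own labels for the quantities defined in the text and the exact two-tail binomial method of reference 21.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) two-sided confidence interval for a binomial
    proportion k/n, found by bisection on the binomial CDF (stdlib only)."""
    def bisect(f, target):  # f must be increasing in p on [0, 1]
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # Lower limit: p such that P(X >= k | p) = alpha/2
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper limit: p such that P(X <= k | p) = alpha/2
    upper = 1.0 if k == n else bisect(lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper

def accuracy_measures(tp: int, fp: int, fn: int, tn: int):
    """Sensitivity, specificity, and positive predictive value, treating the
    reabstracted record as the gold standard."""
    sensitivity = tp / (tp + fn)  # coded on original | present on reabstraction
    specificity = tn / (tn + fp)  # absent on original | absent on reabstraction
    ppv = tp / (tp + fp)          # present on reabstraction | coded on original
    return sensitivity, specificity, ppv

# Hypothetical counts for one diagnosis group among the reabstracted records
sens, spec, ppv = accuracy_measures(tp=90, fp=30, fn=10, tn=6920)
print(f"sensitivity={sens:.2f}, specificity={spec:.4f}, PPV={ppv:.2f}")
print("95% CI for a proportion of 90/100:", clopper_pearson(90, 100))
```

The bisection converges because the binomial cumulative distribution is monotone in p; for the small per-condition samples noted in the Discussion, this exact interval is preferable to a normal approximation.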

February 1992, Vol. 82, No. 2

Results

Comparison of 1977 with 1985

Overall, the percentage of agreement between the principal diagnosis on the reabstracted record and the original hospital record when codes were truncated at the third digit was 78.2% in 1985, compared with 73.2% in 1977. When codes were truncated at the fourth digit, we found 71.5% agreement, compared with 63.4% in the earlier study. For principal procedures, agreement at two digits was 76.2%, identical to the 76.2% reported in 1977, and agreement at three digits was 73.8% in 1985, compared with 71.4% in 1977. When grouped by principal diagnosis into the diagnostic categories used in the NHDS, the percentage of agreement between reabstracted records and original records varied widely in each study (Table 1). In all but two diagnostic groups (mental disorders and diseases of the respiratory system), the accuracy of the coding of principal diagnosis improved between 1977 and 1985, with the greatest percentage improvement occurring in the coding of neoplasms.

Accuracy of Diagnosis and Procedure Coding

Tables 2 and 3 present relevant measures of coding accuracy for selected diagnoses and procedures according to whether the condition or procedure was coded in the principal position or in any position on the discharge abstract. We present two measures of the accuracy of hospital coding: sensitivity, the proportion of cases with a given code present on the reabstracted record that were correctly coded by the hospital; and positive predictive value, the proportion of cases assigned a given code by the hospitals that the reabstraction process confirmed. Specificity is not presented because it was over 94% (generally over 99%) for all conditions and can be calculated from the data presented. The accuracy of diagnosis coding varied widely (Table 2). When position on the discharge abstract was ignored, sensitivity ranged from a low of 0.58 for peripheral vascular disease to a high of 0.97 for breast cancer. The positive predictive value for the selected conditions also varied widely, from 0.53 for peripheral vascular disease to 0.94 for hip fracture. A similar range in coding accuracy was found when the analysis was restricted to the principal diagnosis. The coding of major procedures was substantially more accurate than the coding of most diagnoses (Table 3). Sensitivity, when position was ignored, ranged from a low of 0.88 for cardiac catheterization to over 0.95 for 10 of the 15 procedures. Positive predictive value, for principal procedures, ranged from a low of 0.89 for cardiac catheterization to over 0.99 for seven procedures.
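The digit-truncation agreement used in the 1977/1985 comparison can be illustrated with a short sketch. The code pairs below are hypothetical examples, not records from the study; ICD-9-CM codes are treated as digit strings with the decimal point removed.

```python
def truncate(icd9_code: str, digits: int) -> str:
    """Keep only the first `digits` digits of an ICD-9-CM code (decimal removed),
    so '410.71' becomes '410' at three digits and '4107' at four."""
    return icd9_code.replace(".", "")[:digits]

def percent_agreement(original, reabstracted, digits):
    """Percentage of record pairs whose principal diagnoses match after truncation."""
    matches = sum(
        truncate(o, digits) == truncate(r, digits)
        for o, r in zip(original, reabstracted)
    )
    return 100.0 * matches / len(original)

# Hypothetical principal-diagnosis pairs (original abstract vs. OIG reabstraction)
original     = ["410.71", "820.21", "250.00", "486"]
reabstracted = ["410.91", "820.21", "250.01", "428.0"]
print(percent_agreement(original, reabstracted, digits=3))  # -> 75.0
print(percent_agreement(original, reabstracted, digits=4))  # -> 50.0
```

As in the Results, agreement can only fall (or stay equal) as more digits are compared, since a fourth-digit match requires a third-digit match.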

Discussion

Recent published research on the accuracy of Medicare's hospital discharge data has focused on reimbursement issues and has reported findings in the aggregate rather than for specific clinical conditions.17,22 The National DRG Validation Study reported 79.2% agreement in DRG assignment for discharges in 1985.17 We found similar overall accuracy at the three-digit coding level for principal diagnoses in 1985. However, these aggregate results obscured substantial variations in the coding accuracy of individual conditions and procedures. Several medical conditions, such as hip fracture, acute myocardial infarction, and most cancers, were accurately coded, as were virtually all major surgical procedures. Many clinical conditions, however, were coded with limited accuracy.

Several limitations of our study must be kept in mind. First, relatively small numbers of records were available for use in assessing the coding of many individual conditions. Consequently, some of our estimates have wide confidence limits, and it would be inappropriate to assume that the coding of conditions that we did not examine is either poor or excellent. Second, our study relied on data from the last 3 months of 1984 and the first 3 months of 1985, and recent work by Carter et al. suggests that coding accuracy has improved since 1985.22 Reabstraction of a national sample of records by the SuperPRO revealed that the proportion of cases on which the hospital and SuperPRO agreed on the weight of the DRG improved from 77% in 1985 to 86% in 1988.22 However, because aggregate results reveal little about the coding of individual conditions, and the relative accuracy of coding across individual conditions is unlikely to have changed substantially, our findings should prove useful even to researchers who plan to use only the more recent data.

The most important limitation of our study, however, is that the clinical accuracy of the coding cannot be directly inferred from our findings. Like previous studies, the National DRG Validation Study was designed to test the accuracy with which hospitals adhered to coding rules,23 rather than to assess the validity of the claims data as a source of information on patients' health status or medical treatments. Consequently, the findings we report do not directly answer the two major clinically relevant questions of whether a specific condition present in the patient was coded on the computerized abstract (sensitivity) and whether specific conditions coded on the abstract actually were present in the patient (positive predictive value). This distinction between coding conventions and clinical status may explain the low accuracy with which some medical conditions are coded. A diagnosis may have been present but not recorded on the computerized abstract either because there were five more serious conditions present or because the condition was judged not to contribute to the care provided to the patient.23 If a condition is coded by one abstraction team but not the other, it seems likely that the condition was documented in the patient's chart but that the two abstractors disagreed about whether it should be recorded among the listed diagnoses. Such misclassification could lead to an underestimate of both the sensitivity and the positive predictive value of the coded diagnoses.

This distinction between coding rules and clinical status also provides the basis for a caveat about our finding that the coding of acute myocardial infarction was relatively accurate. Although their study was based on a small sample of hospitals and was conducted prior to the implementation of the prospective payment system, Iezzoni et al. found that 9% of cases coded as acute myocardial infarction at nonteaching hospitals and 42% of cases at teaching hospitals did not meet clinical criteria for this condition.14 New rules have been implemented specifically to address the limitations of myocardial infarction coding, but the proper clinical interpretation of many coded diagnoses will remain uncertain until clear-cut clinical criteria for their use are established and promulgated. Surgical procedures, because of the precision with which they are defined and the accuracy with which they are coded, pose less difficulty in interpretation.

What are the implications of our findings? One common use of administrative databases is in research comparing outcomes across individual hospitals or among classes of hospitals.7,24 Our demonstration of variability in condition-specific coding accuracy and previously reported differences across types of hospitals in aggregate coding accuracy14,16,17 underline the importance of interpreting claims-based hospital comparisons with caution. However, the potential value of administrative health care data has been demonstrated in follow-up studies based on more complete and accurate data,25,26 and it justifies the increasing support for continued, if cautious, pursuit of claims-based studies.8

A second major use of claims databases is in population-based studies of the outcomes of specific conditions or treatments.8-10 The major advantages of such claims-based studies are their relative efficiency and their potential to avoid the possible selection bias inherent in institution-specific or other convenience samples. From a clinical perspective, such studies may offer valuable information about the short- and long-term outcomes of common treatments and conditions in the general population.10,27-29 Our findings demonstrate that major surgical procedures and medical conditions that are accurately coded, such as hip fracture and cancers, can be reliably identified through the claims data. However, for clinicians to apply the findings of claims-based research to help their patients understand the likelihood of relevant outcomes of specific treatments, the study population must be stratified according to the full range of clinical characteristics that may influence the outcome of treatment, including age, gender, severity of illness, functional status, and comorbidity.29 The current level of coding accuracy and diagnostic precision is clearly insufficient to achieve the level of detail that most clinicians would seek in such studies.

Further improvements in the precision and accuracy of hospital coding would facilitate health care research. However, achieving a high level of accuracy for all conditions will clearly require substantial effort and attention to the many causes of inaccurate coding that have long been recognized.17,30-32 In the short term, two steps should be considered to help overcome the constraints imposed by the variable accuracy of diagnosis coding and the uncertain clinical implications of diagnostic codes. First, studies similar in scope to the National DRG Validation Study should be repeated at regular intervals and should report not only coding accuracy (the degree to which hospitals adhere to Uniform Hospital Discharge Data Set coding rules) but also the clinical accuracy of the coding. Second, a mechanism should be established to facilitate investigators' access to medical records for cases identified through the claims data. Chart review would not only ensure proper case definition but would also allow precise case-mix measurement and accurate description of the specific interventions received by each patient.

The Medicare administrative databases are a valuable resource collected at great expense. Our findings show that some conditions may currently be studied using the claims data alone and that modest progress has been achieved in improving the accuracy of diagnostic coding. In the long run, improving overall accuracy would enhance claims-based research. In the short run, the implementation of steps to supplement the currently available data would provide investigators with an even more powerful tool for use in evaluating the outcomes of hospital care.

Acknowledgments

Support for this study was provided by the Agency for Health Care Policy and Research (grant R18 HS05745) and the National Institute on Aging (grant R01 AG07146-01). For their assistance with this project, the authors are indebted to Cathaleen A. Ahern, Barry L. Steeley, and Jane C. Tebbutt of the DHHS Office of Inspector General; Patricia E. Brooks, Anne B. Fagan, and Barton C. McCann of the US Health Care Financing Administration; Penni I. St. Hilaire of the US Public Health Service; Annette M. Delaney, Laurie H. Moore, Joseph S. Yarmus, and their staff at Baxter Health Data Institute; and John R. Boyle and Walter P. Weaver of the DHHS Office of the Assistant Secretary of Management and Budget. We would also like to thank John E. Wennberg, Wayne Ray, and an anonymous reviewer for their helpful comments.

References

1. Farmer ME, White LR, Brody JA. Race and sex differences in hip fracture incidence. Am J Public Health. 1984;74:1374-1380.
2. Kellie SE, Brody JA. Sex-specific and race-specific hip fracture rates. Am J Public Health. 1990;80:326-328.
3. Fisher ES, Baron JA, Malenka DJ, et al. Hip fracture incidence and mortality in New England. Epidemiology. 1991;2:116-122.
4. Ray WA, Griffin MR, Schaffner W, Baugh DK, Melton LJ. Psychotropic drug use and the risk of hip fracture. N Engl J Med. 1987;316:363-369.
5. McPherson K, Wennberg JE, Hovind OB, Clifford P. Small-area variations in the use of common surgical procedures: an international comparison of New England, England, and Norway. N Engl J Med. 1982;307:1310-1314.
6. Wennberg JE, Gittelsohn AM. Small area variations in health care delivery. Science. 1973;182:1102-1108.
7. Health Care Financing Administration. Medicare Hospital Mortality Information: 1986. Washington, DC: US Dept of Health and Human Services; 1988.
8. Roper WL, Winkenwerder W, Hackbarth GM, Krakauer H. Effectiveness in health care: an initiative to evaluate and improve medical practice. N Engl J Med. 1988;319:1197-1202.
9. Roos NP, Wennberg JE, Malenka DJ, et al. Mortality and reoperation after open and transurethral resection of the prostate for benign prostatic hyperplasia. N Engl J Med. 1989;320:1120-1124.
10. Wennberg JE, Roos N, Sola L, Schori A, Jaffe R. Use of claims data systems to evaluate health care outcomes: mortality and reoperation following prostatectomy. JAMA. 1987;257:933-936.
11. Institute of Medicine. Reliability of Hospital Discharge Abstracts. Washington, DC: National Academy of Sciences; 1977.
12. Institute of Medicine. Reliability of Medicare Hospital Discharge Records. Washington, DC: National Academy of Sciences; 1977.
13. Institute of Medicine. Reliability of National Hospital Discharge Survey Data. Washington, DC: National Academy of Sciences; 1980.
14. Iezzoni LI, Burnside S, Sickles L, Moskowitz MA, Sawitz E, Levine PA. Coding of acute myocardial infarction: clinical and policy implications. Ann Intern Med. 1988;109:745-751.
15. Jencks SF, Williams DK, Kay TL. Assessing hospital-associated deaths from discharge data: the role of length of stay and comorbidities. JAMA. 1988;260:2240-2246.
16. Greenfield S, Aronow HU, Elashoff RM, Watanabe D. Flaws in mortality data: the hazards of ignoring comorbid disease. JAMA. 1988;260:2253-2255.
17. Hsia DC, Krushat WM, Fagan AB, Tebbutt JA, Kusserow RP. Accuracy of diagnostic coding for Medicare patients under the prospective-payment system. N Engl J Med. 1988;318:352-355.
18. Iezzoni LI, Moskowitz MA. Clinical overlap among medical diagnosis-related groups. JAMA. 1986;255:927-929.
19. Delaney AM, Hsia DC. National DRG Validation Study: Final Report. Waltham, Mass: Baxter Health Data Institute; 1987.
20. Sox HC Jr. Probability theory in the use of diagnostic tests: an introduction to critical study of the literature. Ann Intern Med. 1986;104:60-66.
21. Conover WJ. Practical Nonparametric Statistics. 2nd ed. New York, NY: John Wiley & Sons Inc; 1986.
22. Carter GM, Newhouse JP, Relles DA. How much change in the case-mix index is DRG creep? Santa Monica, Calif: The RAND/UCLA/Harvard Center for Health Care Financing Policy Research; 1990. Publication R-3826-HCFA.
23. American Hospital Association. ICD-9-CM Coding Handbook for Entry-Level Coders, with Answers. Chicago, Ill; 1979:58.
24. Dubois RW, Brook RH, Rogers WH. Adjusted hospital death rates: a potential screen for quality of medical care. Am J Public Health. 1987;77:1162-1167.
25. Dubois RW, Moxley JH, Draper D, Brook RH. Hospital inpatient mortality: is it a predictor of quality? N Engl J Med. 1987;317:1674-1680.
26. Maerki SC, Luft HS, Hunt SS. Selecting categories of patients for regionalization: implications of the relationship between volume and outcome. Med Care. 1986;24:148-158.
27. Fisher ES, Malenka DJ, Solomon NA, Bubolz TA, Whaley FS, Wennberg JE. Risk of carotid endarterectomy in the elderly. Am J Public Health. 1989;79:1617-1620.
28. Whittle J, Steinberg EP, Anderson GF, Herbert R. Use of Medicare claims data to evaluate the outcomes of elderly patients undergoing lung resection for lung cancer. Chest. 1991;100:729-734.
29. Fisher ES. Helping patients choose treatments: an important role for administrative health care databases. Chest. 1991;100:595-596.
30. Demlo LK, Campbell PM. Improving hospital discharge data: lessons from the National Hospital Discharge Survey. Med Care. 1981;19:1030-1040.
31. Lloyd SS, Rissing JP. Physician and coding errors in patient records. JAMA. 1985;254:1330-1336.
32. Cohen BE, Pokras R, Meads MS, Krushat WM. How will diagnosis-related groups affect epidemiologic research? Am J Epidemiol. 1987;126:1-9.
