Physician Judgment in Clinical Settings: Methodological Influences and Cognitive Performance

Neal V. Dawson

CLIN. CHEM. 39/7, 1468-1477 (1993)

Case Western Reserve University at MetroHealth Medical Center, Cleveland, OH 44109-1998. Received December 2, 1992; accepted February 3, 1993.

Understanding the quality of physicians' intuitive judgments is essential in determining the appropriate use of their judgments in medical decision-making (vis-à-vis analytical or actuarial approaches). As part of this process, the quality of physicians' predictions must be assessed because prediction is fundamental to common clinical tasks: determining diagnosis, prognosis, and therapy; establishing monitoring intervals; performing screening and preventive maneuvers. Critical evaluation of predictive capabilities requires an assessment of the components of the prediction process: the data available for prediction, the method used for prediction, and the accuracy of prediction. Although variation in and uncertainty about the underlying data elements are often acknowledged as a source of inaccurate predictions, prediction also can be confounded by both methodological and cognitive limitations. During the past two decades, numerous factors have been recognized that may bias test characteristics (sensitivity and specificity). These same factors may also produce bias in intuitive judgments. The use of cognitive processes to simplify judgment tasks (e.g., the availability and representativeness heuristics) and the presence of certain biases in the judgment process (e.g., ego, regret) may present obstacles to accurate estimation of probabilities by physicians. Limitations on the intuitive use of information (cognitive biases) have been demonstrated in both medical and nonmedical decision-making settings. Recent studies have led to a deepening understanding of the advantages and disadvantages of intuitive and analytical approaches to decision making. Here, many aspects of the basis for this understanding are reviewed.

Indexing Terms: physician prediction · heuristics · cognitive bias · judgment limitations · factors affecting physicians' judgments

Associated with efforts to address our spiraling health-care costs has come an increasing emphasis on the development and evaluation of explicit clinical strategies. The recognition of widespread practice variation by Wennberg (1, 2) and others has led to an increasing proportion of health-care research funds and hospital resources being directed at outcomes assessment and improvement (e.g., 3-5). One of the explicit goals of current research into practice variation is the creation of clinical policies or guidelines, which, it is believed, will both diminish practice variation and enhance quality of care. Factors suspected of influencing practice variation include differences in physicians' styles of practice, unrecognized differences in clinical characteristics of groups of patients, incentives inherent in payment mechanisms, and differing perceptions among clinicians of the risks and benefits of specific therapeutic approaches. A better understanding of the role of physician perception and cognition in the genesis of practice variation is needed.

For centuries clinicians have been engaged in the familiar clinical tasks of making a diagnosis, assessing prognosis, determining therapy, and monitoring outcomes. Attempting to answer the questions "What's wrong with me?", "How will it affect me?", and "What can be done about it?" has always been the purview of those providing clinical care. Only more recently have the more population-oriented tasks become a regular activity of clinicians, especially primary-care providers. Such activities include screening for diseases or risk factors and the institution of programs for preventing specific diseases or undesirable outcomes.

Clinical Prediction

An activity that is common to all of the clinical tasks I have just reviewed, and that is essential to rational decision-making, is clinical prediction. The clinical prediction process can be divided into three components: the data that are available to use in prediction, the method used to actually make the prediction, and the accuracy of the prediction. Each of these three components can be further subdivided into important parts.


Data used for clinical prediction can come from a wide variety of sources: history, physical examination, laboratory tests, radiological procedures, surgical observations, pathology specimens, questionnaires, other observations, and other measurements. Any of these data can be viewed and analyzed as if they were "tests." Even when these data are collected in a valid and reproducible manner at the individual patient level, conclusions about the relationship between the "test" and the outcome of interest can be influenced greatly by both spectrum and some forms of bias (6, 7). "Spectrum" refers to the kinds of patients from whom predictive data are derived. For an individual clinician, it would represent his or her case mix. For a study, the spectrum would be determined by the sample selection process and the specific inclusion and exclusion criteria that were applied. Bias can creep in from multiple sources. For diagnostic tests, the most important sources of bias to examine are the processes by which the test is determined to be positive or negative and how the disease is determined to be present or absent (6).
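Because these "test" relationships recur throughout this article, it may help to fix the basic quantities in code. The following minimal sketch (the counts are invented for illustration) computes sensitivity, specificity, and the predictive values from a 2x2 table; note that the predictive values, unlike sensitivity and specificity, shift with prevalence, which is one reason spectrum matters.

```python
# Sketch: basic test characteristics from a 2x2 table.
# The counts below are invented for illustration only.

def test_characteristics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard metrics from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),          # P(test+ | disease present)
        "specificity": tn / (tn + fp),          # P(test- | disease absent)
        "prevalence":  (tp + fn) / (tp + fp + fn + tn),
        "ppv":         tp / (tp + fp),          # P(disease | test+)
        "npv":         tn / (tn + fn),          # P(no disease | test-)
    }

print(test_characteristics(tp=45, fp=15, fn=5, tn=135))
```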

Clinical predictions can be made in a variety of ways. The method by which the vast majority of predictions are made, whether in clinical practice or in everyday life, is human judgment. Methods of prediction that formally incorporate data from cross-sectional, cohort, and other longitudinal studies are sometimes referred to as "actuarial models."

The quality or accuracy of predictions is of obvious interest to clinicians and patients alike. The overall accuracy of a judgment or prediction can be influenced by prevalence, discrimination, calibration, and scatter. Prevalence (also called prior probability) is the occurrence rate of the disease or outcome in question in the sample being studied. The ability to discriminate reflects the capacity to distinguish those who have a given disease from those who do not, or to distinguish those who will develop a particular outcome from those who will not. Judgments that are calibrated demonstrate the ability to provide realistic probability estimates; e.g., if one predicted a 20% chance of survival, one would expect the observed frequency of survival among such patients to be 20%. (The assessment of calibration for intuitive judgments is conceptually similar to the measurement of calibration for a laboratory procedure in which assays are performed on samples with known concentrations of an analyte.) Scatter simply refers to the noisiness of the judgments or predictions.

We also need to be concerned about the qualities of our inaccurate predictions. It is common for false-positive and false-negative errors to have different implications for the clinical course of disease and for the care provided (8, 9). As such, it is important not only to determine the quantity of errors made but also to assess the qualitative aspects of correct and incorrect predictions. Assessing the risks, benefits, and costs of obtaining certain pieces of information can lead to important insights into the overall utility of such information.
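One way to make these four influences concrete is through the mean probability (Brier) score. Its exact covariance decomposition (in the spirit of the decomposition discussed by Yates, ref 24) separates terms that parallel prevalence, calibration-in-the-large, scatter, and a discrimination-like covariance. A sketch with invented predictions:

```python
# Sketch: decomposing the mean probability (Brier) score via the identity
# PS = Var(f) + Var(d) + (mean f - mean d)^2 - 2*Cov(f, d),
# where f = forecasts and d = 0/1 outcomes. Data are invented.
from statistics import mean

def brier_decomposition(forecasts, outcomes):
    f_bar, d_bar = mean(forecasts), mean(outcomes)
    var = lambda xs, m: mean((x - m) ** 2 for x in xs)
    cov = mean((f - f_bar) * (d - d_bar) for f, d in zip(forecasts, outcomes))
    ps = mean((f - d) ** 2 for f, d in zip(forecasts, outcomes))
    return {
        "brier": ps,
        "outcome_variance": var(outcomes, d_bar),   # driven by prevalence
        "forecast_variance": var(forecasts, f_bar), # scatter of the judgments
        "bias_squared": (f_bar - d_bar) ** 2,       # calibration-in-the-large
        "covariance": cov,                          # discrimination-like term
    }

preds  = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
actual = [1,   1,   0,   1,   0,   0]
print(brier_decomposition(preds, actual))
```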

Methodological Factors Influencing Test Characteristics

We all recognize the importance of prevalence in the interpretation of tests, but as was described earlier, how well a test performs also can be influenced by spectrum, that is, the kinds of subjects studied. Bias can occur both in the assessment of test results as being positive or negative and in the assessment of the disease state as being present or absent. Both spectrum and bias can affect test characteristics (sensitivity and specificity).

Spectrum. The sensitivity of a test should be evaluated in a broad range of patients who have the disease in question. Thus, in the diseased group, the pathological component of spectrum would reflect the extent and location of disease, the clinical component would reflect the severity and chronicity of symptoms, and comorbidity would reflect other diseases that may influence the false-negative rate (6). Specificity should be examined in a broad range of patients who do not have the disease in question (6). In the group of patients without this disease, one should examine the pathological features that lead to a positive test result, then look to see whether patients have been included who have these features but do not have the disease in question. For the clinical and comorbid components, one would look for clinical features or other diseases that tend to affect the false-positive rate.

Test characteristics and prevalence. Sensitivity and specificity are generally presumed to be prevalence independent (10, 11). Evidence is accumulating, however, that spectrum varies with the prevalence rates of some diseases (12-15). The covariation of prevalence and spectrum implies that, at the level of clinical testing, one cannot assume that the same numerical values for sensitivity and specificity will hold as the prevalence of the disease of interest changes.

Bias. In diagnostic test development and evaluation, the best way to reduce bias is to be certain that the test result and disease status are determined independently of one another. Bias in the evaluation of diagnostic tests can be placed into one of four categories: verification/work-up bias, diagnostic review bias, test review bias, and incorporation bias. Opportunities for all of these types of bias to occur are common in usual clinical practice. These biases may, therefore, contribute to inaccurate perceptions by the physician of relationships between "tests" and diseases or outcomes. In considering examples of clinical practices that predispose to these biases, we must clearly recognize that a practice that may be quite reasonable for the task of providing care may be unacceptable (methodologically) when the task is to evaluate the true test characteristics (sensitivity and specificity).

Work-up bias is a form of verification bias. It occurs when the results of the test being evaluated affect the subsequent clinical work-up. An example is when patients with negative stress tests receive a different subsequent evaluation (e.g., no cardiac catheterization) than patients who have positive stress tests. Work-up bias is avoided if the same evaluation is performed regardless of the result of the test being evaluated. Verification bias also can occur when study subjects are restricted to only those patients who have had definitive verification of disease status. This sounds counterintuitive, so let's look at a hypothetical example. If we were studying a new test for cirrhosis of the liver and only included patients who had liver biopsies, our study would be at risk for verification bias. Not all patients who actually have cirrhosis undergo liver biopsy. The ones who have biopsies may be more severely affected or otherwise unique. The test performance in this highly selected group may not be the same as in patients who are less severely affected. Verification bias may be avoided by performing the "gold-standard" test on all subjects who are at risk of having the disease in question. Verification bias can lead to underdiagnosis and tends to inflate estimates of sensitivity. It has variable effects on specificity. A simulation of this effect is sketched below.
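A small simulation can make the direction of this bias concrete. In the sketch below (all rates are invented), the gold standard is applied preferentially to test-positive patients, and the apparent sensitivity computed among verified patients exceeds the true value of 0.80, while the apparent specificity falls below its true value.

```python
# Sketch: simulating work-up/verification bias. Patients with a positive
# index test are more likely to receive the gold-standard work-up, so
# sensitivity computed only among verified patients is inflated.
# All parameters are invented for illustration.
import random

random.seed(1)
TRUE_SENS, TRUE_SPEC, PREV = 0.80, 0.90, 0.20
P_VERIFY_IF_POS, P_VERIFY_IF_NEG = 0.95, 0.30

tp = fp = fn = tn = 0
for _ in range(100_000):
    diseased = random.random() < PREV
    test_pos = random.random() < (TRUE_SENS if diseased else 1 - TRUE_SPEC)
    verified = random.random() < (P_VERIFY_IF_POS if test_pos else P_VERIFY_IF_NEG)
    if not verified:
        continue  # unverified patients never enter the study sample
    if diseased and test_pos:   tp += 1
    elif diseased:              fn += 1
    elif test_pos:              fp += 1
    else:                       tn += 1

print("apparent sensitivity:", tp / (tp + fn))  # noticeably above 0.80
print("apparent specificity:", tn / (tn + fp))  # noticeably below 0.90
```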

Diagnostic review bias can occur when the results of the test being evaluated affect the subjective review of the gold-standard test used to establish the diagnosis. This bias may be especially important in influencing clinicians' perceptions about the value of symptoms and signs as predictors of disease. For example, a physician hears rales in the area of the right middle lobe of an outpatient presenting with cough. He or she then reports this to the radiologist, who sees a questionable infiltrate in the same area. The radiologist may be more likely to believe that an infiltrate is truly present and "confirm" the radiological diagnosis of pneumonia. This bias can lead to either over- or underdiagnosis; it can be avoided by blinding the interpreter of the gold-standard test to the results of the test being evaluated.

Test review bias is the complement of diagnostic review bias; that is, the interpretation of the test being evaluated is influenced by knowledge of the true diagnosis. For example, a patient is admitted from the Emergency Department to the inpatient service with a roentgenogram in hand that shows consolidation of the right middle lobe. When the chest is examined by the inpatient physician, the atypical sound heard in the right middle lobe area may be more likely to be called "rales." This bias can also be avoided by blinding.

Incorporation bias occurs when the results of the test being evaluated are used as evidence to determine the presence or absence of disease. This would occur if we were examining the relationship between values for creatine kinase (CK) and the presence of myocardial infarction (MI) and used the cardiologist's final clinical diagnosis from chart review as the criterion for the presence of MI.(1) Obviously, CK values are routinely used as evidence of MI in usual clinical practice. Despite the circular nature of this logic, many studies that demonstrate incorporation bias can be found in the literature (16, 17). This observation is in no way meant to fault those who have chosen to struggle with methodologically difficult assessments (e.g., CK and MI); rather, it is meant to highlight the potential for this frequently unrecognized source of bias in studies of diagnostic tests.

(1) Nonstandard abbreviations: CK, creatine kinase; MI, myocardial infarction; ROC, receiver-operating characteristic; ECG, electrocardiogram.

Physicians' Judgment

We will now turn to the issue of human judgment, and specifically to what is known about physicians' judgment. Head-to-head comparisons of physicians' judgments and those derived from actuarial models, prediction rules, or decision aids will be examined. Studies outside of the clinical laboratory setting will be discussed, because most issues related to physicians' judgment have not been systematically studied in clinical chemistry laboratory settings.

Having "good" judgment means that our predictions regarding diagnosis, prognosis, or management are usually accurate. Being accurate could mean that we can correctly predict a continuously measured outcome, such as the length of time until a patient will begin to respond to therapy, or that we can correctly predict a discrete event, such as whether or not a patient actually has a given disease. At the extreme, having "good" judgment would suggest that no more than some minimum number of mistakes would be made.


Because biological endpoints are not perfectly predictable, 100% accuracy is not a reasonable standard. Uncertainty about or variation in the information available for making diagnoses, for prognostication, or for recommending therapy suggests that even optimal use of such information will lead to less-than-perfect prediction.

Many measures have been developed and used for gauging the accuracy of human judgments. In the present context, these measures tend to reflect two fundamental aspects of accuracy: the ability to discriminate patients with a given outcome from those without that outcome, and the agreement between predicted values and actual values, i.e., the degree of calibration. We are interested in discrimination because it tells us how well we can assign higher probability values to patients who have disease compared with those without disease. The assignment of realistic probability values, which is necessary for good calibration, is also important. When important thresholds exist (18), differences in calibration among physicians can lead to differences in testing and management strategies by physicians assessing the same or similar patients.

Two methods are commonly used for graphically displaying results of studies that have examined physicians' abilities to make medical predictions. For discrimination, receiver-operating characteristic (ROC) curves are often used (19-21). For the association between estimated and actual values, calibration curves are used (22, 23).

Graphical displays: ROC curves. Although several measures of discrimination exist (24), the ROC curve explicitly incorporates important features of tests (sensitivity, specificity) and is becoming increasingly familiar to clinicians. The ROC curve plots sensitivity, or true-positive rate, against 1 - specificity, or false-positive rate (see Figure 1). The 45° line denotes random discrimination. The closer a curve is to the left upper corner, the better the discrimination. The index of discrimination is the area under the ROC curve, which varies from 0.5 to 1.0. The higher the ROC area, the better the discrimination. Several methods are currently available to compare areas for statistical significance. Many clinically useful tests have moderate discrimination, with areas between 0.75 and 0.85. The curve shown in Figure 1 for electrocardiographic stress testing has an area of nearly 0.85. Points A through F represent different cutoffs for ST-segment depression that could be used to define a positive stress test. A very stringent cutoff value, such as the 2.5 mm of ST depression represented by point A, gives nearly 100% specificity but very low sensitivity. If lesser values of ST depression (point D) are used to define a positive test, sensitivity improves but at the expense of specificity. A cutoff of 1 mm of ST depression provides a sensitivity of about 65%; i.e., of all patients with coronary artery disease, 65% will have a positive test. This cutoff provides a specificity of about 85%; i.e., among all patients who do not have coronary artery disease, 85% would have a negative test. This corresponds to a false-positive rate of 15%. A test with better discrimination than stress testing would have a curve falling above the one shown here and would have a significantly larger area.

Fig. 1. ROC curve for the electrocardiographic exercise stress test, based on published data (Diamond GA, Forrester JS. N Engl J Med 1979;300:1350). Points A through F represent increasingly lax criteria for calling a given test result "positive" or abnormal (ST-segment depression cutoffs ranging from >=2.5 mm at point A to >=0.0 mm at point F). Sensitivity of the test is plotted against 1 - specificity (the false-positive rate). The closer a given ROC curve is to the left upper corner, the better the discrimination between true-positive and false-positive results. The diagonal line represents discrimination that is no better than chance ("50-50"). Reproduced with permission, Evaluation and the Health Professions 1990;13:57; © 1990, Sage Publications.
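For readers who want to reproduce such a curve from raw predictions, the following sketch (the scores and labels are invented) sweeps decision thresholds, exactly as points A through F sweep ST-depression cutoffs, and computes the area under the resulting ROC curve by the trapezoidal rule.

```python
# Sketch: building an ROC curve from scores and 0/1 labels, then computing
# its area by the trapezoidal rule. All data are invented.
def roc_points(scores, labels):
    """Sweep thresholds from strict to lax; return (fpr, tpr) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))
    pts.append((1.0, 1.0))
    return pts

def auc(points):
    """Trapezoidal area under the (fpr, tpr) polyline."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,    1,   0,    0,   1,   0]
pts = roc_points(scores, labels)
print("ROC points:", pts)
print("ROC area:", auc(pts))
```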

Graphical displays: calibration. The calibration curve plots the predicted values and actual values against one another; e.g., the predicted probability of disease might be plotted against the actual frequency of disease for each decile of probability (see Figure 2). The 45° line indicates perfect calibration. Points falling above the 45° line reflect underestimation; points falling below the line signify overestimation. Indices of calibration, such as the piecewise and Spiegelhalter's tests, can be used as statistical methods for comparing predicted probabilities with observed frequencies. The calibration curve shown in Figure 2 was reported by Christensen-Szalanski and Bushyhead (25) and demonstrates the relationship between physicians' subjective estimates of the probability of pneumonia in outpatients presenting with cough and the actual frequency of pneumonia on radiographic examination.

Fig. 2. Relationship between physicians' subjective probability estimates of pneumonia and the actual frequency of pneumonia (25). Reproduced with permission, Journal of Experimental Psychology: Human Perception and Performance 1981;4:930; © 1981, American Psychological Association.

When experienced physicians estimated the probability of pneumonia to be 90%, pneumonia was present only 20% of the time, demonstrating a large degree of overestimation.

Physicians' cognitive performance. The number of published studies looking critically at physicians' judgment is relatively small compared with many areas of investigation in medicine. However, the past 10-15 years have seen an increasing number of studies that have assessed the accuracy of physicians' diagnostic and prognostic judgments made during the clinical care of patients. I will briefly review the results of several such studies. Judgments about common problems such as streptococcal pharyngitis and cough have been studied in ambulatory areas. In hospital-practice settings, estimates of prognosis relative to patients surviving a given hospital stay and beyond have been examined. Only a few studies have evaluated the effects that factors such as level of experience have on the accuracy of physicians' judgments. A few studies have examined the performance of physicians' diagnostic or prognostic judgment as directly compared with the use of decision aids in the form of statistical models. The results of studies of the prediction of pneumonia, MI, hemodynamic status, and intensive-care-unit survival also will be reviewed.

The observation that experienced physicians have difficulty predicting streptococcal pharyngitis in adults was made by Poses et al. (26). They reported on 10 university health-service physicians who predicted the presence of streptococcal pharyngitis among 308 students presenting with acute sore throats. These experienced physicians demonstrated low discrimination, with an ROC area of 0.67. There was a consistent tendency for the physicians to overestimate the likelihood of a positive streptococcus culture; e.g., when the average estimate was 62%, the actual prevalence was only 8%. The likelihood of streptococcal pharyngitis was overestimated by physicians for 81% of patients.

Physician performance for several prediction tasks has been shown to be better than that observed for predicting pneumonia or streptococcal pharyngitis. Tierney et al. (27) reported on physicians' ability to estimate the probability of MI among 492 patients who presented to an emergency room with chest pain. Very good discrimination between those who actually had and did not have MIs was shown: the area under the ROC curve was 0.87. The cutoff point or operating position at which physicians chose admission to a coronary-care unit rather than admission to a medical floor led to 87% of all MIs being admitted to the former and 78% of all non-MIs being admitted to the latter. Physician calibration for predicting the probability of MI for emergency-room patients also was reported. Physician estimates of 70% probability were well calibrated. For the mid-range of predictions, physicians demonstrated a general tendency to overestimate the likelihood of MI.
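Calibration results like those in Figure 2 and the MI study reduce to a simple grouping computation: bin the predicted probabilities, then compare each bin's mean prediction with the observed outcome frequency. A minimal sketch with invented data:

```python
# Sketch: a calibration table comparing predicted probabilities with
# observed frequencies, binned as in a calibration curve. Data invented.
from collections import defaultdict

def calibration_table(preds, outcomes, n_bins=5):
    bins = defaultdict(list)
    for p, y in zip(preds, outcomes):
        b = min(int(p * n_bins), n_bins - 1)   # bin index for p in [0, 1]
        bins[b].append((p, y))
    for b in sorted(bins):
        pairs = bins[b]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        obs = sum(y for _, y in pairs) / len(pairs)
        print(f"predicted ~{mean_p:.2f}  observed {obs:.2f}  (n={len(pairs)})")

preds    = [0.9, 0.85, 0.7, 0.65, 0.5, 0.45, 0.3, 0.25, 0.1, 0.05]
outcomes = [1,   0,    1,   0,    1,   0,    0,   0,    0,   0]
calibration_table(preds, outcomes)
```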

Poses et al. (28) studied the ability of physicians to predict survival to hospital discharge for 256 patients admitted to an intensive-care unit. ROC curves were developed for the 25 residents, four critical-care attending physicians, and 56 primary attending physicians who participated. All demonstrated very good discrimination, with ROC areas of 0.83, 0.86, and 0.84, respectively. These differences are not statistically significant. The authors also reported on the relationship between estimated survival and actual survival rates (calibration). Residents tended to underestimate survival. Critical-care attending physicians were generally well calibrated. Primary physicians had a mixed performance of under- and overprediction of survival. One must temper this generally quite good performance with the recognition that there was often considerable disagreement among the three physician estimators. For almost 45% of patients, there was a disagreement of at least 20 percentage points between at least one of the three possible pairs being compared. For about 25% of patients, two such disagreements existed.

Why might physicians' predictions be inaccurate?

One possibility is that the events we are trying to predict are in fact unpredictable. It could be that the available clinical data simply do not correlate well with the outcome we are trying to predict. Alternatively, it could be that the event is predictable, but we are using the wrong cues from the clinical data, or that we are using the right cues but giving them the wrong weights.

Connors et al. (29) developed and validated a model to predict hemodynamic status in patients in the intensive-care unit by use of commonly available clinical and laboratory data. The validated model provided significantly better discrimination than did physicians' judgment (ROC area = 0.84 vs 0.70), suggesting that more predictive information was available than was used by the physicians.

Lens model analysis (30-34), depicted schematically in Figure 3, is a method for formally comparing the use of clinical information to make judgments with the "inherent" predictive capabilities of the available information. One can determine which clinical variables tend to be used by physicians when they make judgments (r_s in Figure 3). One can also see how these same variables relate empirically to the actual outcome (r_e). Statistical models can be created to represent both the judgment and the outcome. The technique allows the examination of how well the model captures physicians' judgment (R_s) and how well the actual outcome is predicted by the model (R_e). How well the physicians' judgment predicts the actual outcome (r_a) can also be assessed. In addition, one can compare the model of judgment with the model of outcome (G).

Fig. 3. Lens model analysis: the use of clinical information in judgments made by physicians is compared with the "inherent" predictive capabilities of such information (as captured by a mathematical model). Clinical variables sit at the center of the "lens," with physician judgment on one side and the actual outcome on the other. Reproduced with permission, Medical Decision Making 1989;9:247; © 1989, Hanley and Belfus, Publishers.

Applying lens model analysis to the prediction of hemodynamic status gave the following results (34). The model of the empirical relationship between the available clinical cues and hemodynamic status correlated significantly better with measured hemodynamic status than did the model of physicians' predictions (correlations = 0.67 and 0.42, respectively). Some of this difference related to the tendency for physicians to place too much weight on a few clinical variables, such as the clinical impression of the presence of congestive heart failure. In addition, physicians tended to ignore other potentially important information. Thirty percent of the explanatory power (explained variance) of the empirical model came from data from the laboratory, chest radiograph, and electrocardiogram (ECG). Physicians tended not to use this information in predicting hemodynamic status. Only 7% of the explanatory power of the physicians' model came from the laboratory, chest radiograph, and ECG.
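In code, a lens model analysis amounts to fitting two linear models over the same cue matrix, one of the judgment and one of the outcome, and comparing the resulting correlations. The following sketch uses simulated cues in which the "physician" over-weights cue 0 while the outcome is driven mostly by cue 1; all data and coefficients are invented.

```python
# Sketch: skeleton of a lens model analysis with simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
cues = rng.normal(size=(n, 3))                 # e.g., exam, lab, ECG findings
outcome  = 0.2*cues[:, 0] + 0.7*cues[:, 1] + rng.normal(scale=0.5, size=n)
judgment = 0.8*cues[:, 0] + 0.1*cues[:, 1] + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), cues])        # cue matrix with intercept
def fit_predict(y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

r = lambda a, b: np.corrcoef(a, b)[0, 1]
yhat_env, yhat_judge = fit_predict(outcome), fit_predict(judgment)
print("r_a (judgment vs outcome):              ", round(r(judgment, outcome), 2))
print("R_e (model fit to outcome):             ", round(r(yhat_env, outcome), 2))
print("R_s (model fit to judgment):            ", round(r(yhat_judge, judgment), 2))
print("G   (model of judgment vs model of outcome):", round(r(yhat_judge, yhat_env), 2))
```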

Earlier in this article we saw how poorly physicians were calibrated when they predicted pneumonia in outpatients (25). Speroff and I (35) plotted a calibration curve describing the performance of the same task by physicians at MetroHealth Medical Center in Cleveland, OH. It replicated (almost identically) the findings of Christensen-Szalanski and Bushyhead (25) that physicians tend to overestimate the probability of pneumonia in outpatients. We (35) also compared the discrimination of MetroHealth physicians for predicting pneumonia with the performance of a validated decision aid [data on physicians' discrimination were not available from the original study (25)]. The decision aid ROC area was 0.83, whereas that from physicians' estimates was 0.73. The clinically important difference in area between the two curves, 0.10, suggests that the decision aid may outperform physicians' judgment. Owing to the small number of patients in this sample who had pneumonia (n = 15), however, the power to detect a difference of 0.10 was limited.

Studies of physicians' estimates of prognosis for intensive-care-unit patients have generally shown very good physician discrimination. A typical example is the ROC curve shown in Figure 4, from a study of intensive-care-unit patients by Brannen et al. (36). In this study, physicians had significantly better discrimination than the APACHE II severity system for predicting mortality. The spectrum of patients in this study had important differences from the patient groups in which the APACHE system was developed and validated, however. The majority of patients in the Brannen study had acute respiratory failure. A current thrust for the APACHE system is to develop separate prognostic models for important subgroups of intensive-care-unit patients.

Fig. 4. ROC curves for in-hospital mortality predicted by physicians and by the APACHE II (Acute Physiology and Chronic Health Evaluation) score (36). Sensitivity is plotted against the false-positive rate; areas under the ROC curves ± SEM are 0.899 ± 0.038 (physician) and 0.796 ± 0.044 (APACHE II). Reproduced with permission, Archives of Internal Medicine 1989;149:1085; © 1989, American Medical Association.
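Where areas and their standard errors are reported, as in Figure 4, a rough significance comparison can be sketched with the Hanley-McNeil variance approximation. The version below treats the two curves as independent, which understates the precision available from paired designs such as this one, and the outcome counts are invented; it is an illustration only, not the analysis used in the cited study.

```python
# Sketch: comparing two ROC areas with the Hanley-McNeil standard-error
# approximation, assuming independent curves (a simplification).
import math

def se_auc(a, n_pos, n_neg):
    """Hanley-McNeil approximate SE of an ROC area."""
    q1 = a / (2 - a)
    q2 = 2 * a * a / (1 + a)
    return math.sqrt((a * (1 - a) + (n_pos - 1) * (q1 - a * a)
                      + (n_neg - 1) * (q2 - a * a)) / (n_pos * n_neg))

a1, a2 = 0.899, 0.796                    # e.g., physician vs APACHE II areas
se1 = se_auc(a1, n_pos=40, n_neg=160)    # hypothetical outcome counts
se2 = se_auc(a2, n_pos=40, n_neg=160)
z = (a1 - a2) / math.sqrt(se1 ** 2 + se2 ** 2)
print(f"z = {z:.2f}")
```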

A similar study performed by McClish and Powell (37) documented similar although less impressive differences between physician judgment and the APACHE II system. They also examined the relationship between predicted and observed mortality rates, which revealed better calibration by the APACHE II system. Physicians had a tendency to overestimate mortality rates to a greater extent than did the APACHE II system.

Many new insights into both physicians' and patients' decision-making are being gleaned from the SUPPORT study (38), a Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatment. SUPPORT is a study of nearly 10,000 seriously ill hospitalized adult patients who fall into one or more of nine disease categories. The study is being performed at five sites: Duke University, University of California-Los Angeles, Beth Israel Hospital in Boston, the Marshfield Clinic in Wisconsin, and MetroHealth Medical Center in Cleveland. Phase I was an observational study in which a prognostic model of survival was developed and validated, and issues important to decision-making by physicians, patients, and patients' surrogate decision-makers were recorded. Phase II is a randomized trial of prognostic information feedback and efforts to enhance patient-physician communication.

During Phase I, physicians provided estimates of survival for 2514 patients for intervals of both two and six months after study entry. In addition, physicians indicated the confidence they had in each estimate. For each patient, a prediction of survival was derived from a four-strata Cox proportional hazards model (39). The model considers physiological data from APACHE III, Glasgow Coma Score, patient's age, disease group, number of days spent in hospital before entry into the study, and comorbidity, especially that associated with cancer. A comparison of the frequency distributions of estimates of 2-month survival for physicians and the Cox model demonstrated that physicians tended to place their estimates in the three highest deciles of probability, in the 50-50 category, and in the lowest decile. The model had a much smoother distribution and, unlike the physicians, rarely placed patients in the very highest or lowest deciles (40). The physicians' and the model's abilities to discriminate between patients who would live and who would die by 2 months after study entry were almost identical, with ROC areas of 0.795 and 0.798, respectively. Across all probability estimates, the model provided very good calibration. Physicians were reasonably calibrated at very high and the lowest probability levels but had a tendency to underestimate survival in the midrange of estimates, an observation consistent with the phenomenon of "hanging crepe."

An extremely interesting observation has been made in these analyses. Specifically, when the physicians' estimate was considered with the model's estimate, it added significantly to the accuracy of the prediction. The model's estimate also added significantly to the accuracy of the physicians' estimate. This suggests that each estimate contains important prognostic information that is not considered by the other (39). In fact, the most accurate estimates tend to occur when both estimates are used together. This statistically combined estimate is the basis for the prognostic feedback provided to physicians in Phase II of SUPPORT.
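SUPPORT's exact combination method is not described here, but one standard way to combine two probability estimates is to refit a small logistic regression using the log-odds of each as predictors; positive weights on both then express the finding that each source adds information beyond the other. A sketch with simulated data:

```python
# Sketch: combining a "physician" estimate with a "model" estimate by
# logistic regression on their log-odds. All data are simulated; this is
# an illustration, not the SUPPORT procedure itself.
import numpy as np

rng = np.random.default_rng(2)
n = 500
true_logit = rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)
# Physician and model each observe the truth with independent noise:
phys_logit  = true_logit + rng.normal(scale=1.0, size=n)
model_logit = true_logit + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), phys_logit, model_logit])
w = np.zeros(3)
for _ in range(200):                       # plain gradient ascent on log-likelihood
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.5 * X.T @ (y - p) / n

print("weights (intercept, physician, model):", w.round(2))
# Both slope weights come out positive: each estimate carries information
# not contained in the other.
```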

As part of SUPPORT data-gathering, the patient's desire for information is assessed also. This scale measures the extent to which the patient wants information about his or her condition and treatment options. The scale contains seven items with a dichotomous agree-disagree response (41). Physicians were given this same scale and were asked to respond as they believed their patient would. The degree of agreement between physicians and patients was low, as assessed by an intraclass correlation coefficient of 0.14. When differences between matched scores were examined, physicians' ratings were consistently lower than patients' ratings; thus, physicians significantly underestimated their patients' desire for information. Soon after study entry, patients provided their own estimates of the likelihood that they would survive for 2 and 6 months. Patients who demonstrated a high desire for information had a greater degree of concordance with the physician estimate of 2-month survival (42). The Phase II intervention in SUPPORT is aimed at improving decision-making by providing physicians with enhanced prognostic estimates and facilitating better doctor-patient communication regarding patients' values and preferences as well as prognosis.
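An intraclass correlation of the kind reported above can be computed from matched physician-patient scores with a one-way random-effects ANOVA. The exact ICC variant used in SUPPORT is not specified here, and the paired summary scores below are invented, so this is only a sketch of the general computation.

```python
# Sketch: one-way random-effects intraclass correlation, ICC(1,1),
# for paired physician/patient scores. All scores are invented.
from statistics import mean

def icc_oneway(pairs):
    k = 2                                   # two ratings per patient
    n = len(pairs)
    grand = mean(v for pair in pairs for v in pair)
    # Between-target and within-target mean squares from one-way ANOVA:
    msb = k * sum((mean(pair) - grand) ** 2 for pair in pairs) / (n - 1)
    msw = sum((v - mean(pair)) ** 2 for pair in pairs for v in pair) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

pairs = [(6, 7), (3, 6), (4, 5), (2, 6), (5, 7), (4, 6), (3, 5)]
print(f"ICC(1,1) = {icc_oneway(pairs):.2f}")
```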

Judgment Limitations

We have thus far seen examples of physician-judgment tasks in which physicians' performance ranged from very poor to very good. What are some of the reasons that may be responsible for judgment errors? Numerous studies within and outside of medical settings have documented the limits of human information-processing capabilities (32, 43-47). Intuitive approaches to large amounts of information or complex decisions can lead to frequent errors. Faced with complexity or uncertainty, we tend to use cognitive shortcuts to try to simplify the decision-making chore. Familiar rules of thumb, or "heuristics" as they are called in the literature, are frequently helpful in simplifying the task (43). The disadvantage is that their use also can lead to unrecognized systematic errors in judgment. Several other tendencies have been described that do not involve the use of heuristics but can lead to inaccurate probability estimates or pose impediments to information synthesis. Intuitive processes that lead to systematic errors in judgment are labeled "cognitive bias." The fact that these limitations exist certainly does not indicate that we are cognitive cripples. Our ability to recognize when these errors may occur, however, does provide an opportunity for us to make better decisions. Table 1 lists nine factors that contribute to cognitive bias. The first six factors can contribute to inaccurate probability estimates. The remaining three factors can impede optimal information synthesis. All of these factors can be recognized in medical decision-making settings (47).

Table 1. Factors That Can Contribute to Cognitive Bias

Obstacles to accurate probability estimates
  Heuristics: availability; representativeness; anchoring and adjustment
  Biases: ego; hindsight; regret (value-induced)
Impediments to optimal synthesis of information
  Confirmatory bias; ignoring negative evidence; framing effects

Probability estimates: heuristics. The availability heuristic occurs when we use the ease with which instances come to mind as a proxy for the likelihood that they will occur. Cases that may come easily to a physician's mind include those that have occurred recently, those that are memorable because of their rarity or uniqueness, or those with which the physician has had a personal investment, either through personal or family illness or personally performed research. Although common events are easily recalled, not all easily recalled events are, in fact, common.

"Representativeness" is essentially the same as pattern recognition. Pattern recognition can be very useful in many areas of medicine, from making a diagnosis to interpreting ECGs or chest radiographs. Pattern recognition is not influenced by several factors that are known to influence actual probabilities, however. For example, it is not influenced by the prior probability of the disease or outcome of interest. Nor is it influenced by the fact that data obtained from a small sample may be an unreliable estimator of the underlying (population) characteristic (e.g., a single blood-pressure reading vs an individual's average blood pressure). Pattern recognition fails to account for the phenomenon of regression to the mean, events that occur by chance alone, and the degree to which the event in question is even predictable.

The "anchoring and adjustment" heuristic is used in circumstances in which an initial probability estimate is updated as new information becomes available. For example, when physicians clinically assess cardiac output, some predictive information comes from the history and additional information comes from the physical examination. Based on the initial impression from the history and physical, the likelihood that cardiac output is abnormal could be increased or decreased based on information from the ECG, chest radiograph, and laboratory data. The initial data provide the initial estimate; adjustment takes place after new data are available. The problem with this approach is that people tend to be too conservative when they adjust their initial estimate upward or downward, as if they were truly "anchored" to their initial estimate.
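The prior-probability blind spot of representativeness is easy to quantify with Bayes' theorem: the same positive finding means very different things at different prevalences. A sketch (the sensitivity and specificity values are invented):

```python
# Sketch: why ignoring prior probability misleads. The same "pattern match"
# (a positive result with fixed sensitivity and specificity) yields very
# different post-test probabilities at different prevalences.
def post_test_prob(prior, sens, spec):
    """Bayes' theorem for a positive result."""
    return (sens * prior) / (sens * prior + (1 - spec) * (1 - prior))

for prior in (0.01, 0.10, 0.50):
    p = post_test_prob(prior, sens=0.90, spec=0.90)
    print(f"prevalence {prior:4.0%} -> P(disease | positive) = {p:.2f}")
```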

Probability estimates: bias. Is your skill as a driver below average, average, or above average? Psychological research indicates that we tend to attribute our successes to skill and our failures to chance, or as many of us like to put it, "bad luck." Most people feel that they are above-average drivers in terms of skill and safety. Limited research in medical settings indicates that we may have the same kind of perceptions about ourselves as physicians. When ego bias is present, probability estimates are altered in a self-serving way. For example, in a study of perceptions of surgical mortality, most surgeons felt that the mortality rate for their own patients was lower than the mortality rate for the entire service (48). This assessment was undoubtedly true for some surgeons but obviously wasn't true for most of them.

Related to the influence of ego bias on probability estimation is its influence on the confidence with which judgments are made. Several recent studies have demonstrated that experts and nonexperts alike show great confidence in their judgments, even if these judgments can be shown to be incorrect (49). Several colleagues and I (50) have studied the relationship between confidence and accuracy of judgments in several intensive-care units in Cleveland. Interns, residents, and fellows made predictions about hemodynamic status (cardiac output, pulmonary capillary wedge pressure, and total systemic resistance) in intensive-care-unit patients. Each physician also provided a statement of how certain he or she was about the estimates. Certainty ranged from very uncertain to very certain. Fellows were significantly more confident in their predictions than were the interns and residents. Despite the fellows' high level of confidence in their estimates, they were no more accurate than were their less-experienced colleagues, even at the highest levels of confidence.

The relationship between physicians' confidence in their prognostic estimates and the accuracy of their estimates also was examined in the SUPPORT study (51). For the prediction of survival, however, increased confidence was significantly associated with increased accuracy. In fact, when confidence was very high (e.g., 9 or 10 on a scale of 0-10), physicians were more accurate than the Cox model of survival.

Physician confidence in prognostic estimates also has been shown to be significantly related to resource use (52). In the SUPPORT study, the intensity of resource use was measured by the average TISS (Therapeutic Intervention Scoring System) score, which was measured on study days 1, 3, 7, 14, and 25. Even after controlling for factors that are known to influence resource use, such as severity of illness and hospital length of stay, increased levels of resource use were significantly associated with both lower estimated survival rates and lower confidence in the estimates. Stating the relationship another way, resource use was lowest among patients who were thought to have the best chance of survival; however, regardless of the estimate of survival, if the physician was not confident in his or her estimate, more resources were used.

"Hindsight" bias represents the following phenomenon: after an event has occurred, we believe we could have predicted its occurrence more easily than is true if we had been asked to predict it beforehand (53). Hindsight bias may produce especially troublesome results in malpractice cases, where judgments are always made from a position of hindsight. It may also affect second opinions. When the second evaluator is aware of the first diagnosis made, he or she may be more likely to agree with that diagnosis than is warranted by the available data. Hindsight bias has been shown to occur in clinicopathological conferences (54). Physicians in the audience at these conferences were randomly divided into two groups. The first (foresight) group provided a differential diagnosis after all the clinical information was available but before the pathological diagnosis was discussed. The second (hindsight) group provided their differential diagnosis at the end of the conference and were asked to do so as if they did not know the final diagnosis. Compared with foresight physicians, the hindsight physicians were significantly more likely to give the correct diagnosis as first on their list of differential diagnoses (30% vs 50%, respectively).

"Regret" or "value-induced" bias is the "chagrin factor" in clinical medicine that Feinstein has written about (55). Anticipated regret can affect probability estimates. People tend to distort probability estimates by combining two steps in the decision-making process, i.e., allowing the undesirability of a given diagnosis or outcome to alter the estimate of its likelihood of occurrence. When we anticipate the regret we would feel after having missed a given diagnosis, we may overestimate the probability of its occurrence. For instance, we would likely regret missing significant coronary artery disease in a young man with atypical chest pain. When we see young men with atypical chest pain, we may tend to feel that coronary artery disease is more likely than it really is. This reason for inaccurate probability assessment may lead even the best physicians among us astray. Value-induced bias, rather than greed or other explanations, may be at the root of test ordering for patients who objectively can be shown to be at low risk for the disease in question.

Information Synthesis

Three additional factors can impede optimal information synthesis. The first is the tendency for us to seek only information that will confirm rather than disconfirm our hypotheses. Physicians as well as nonphysicians tend to seek information that can only be used to confirm hypotheses (56, 57). The confirmatory bias may influence the process of data interpretation as well. In an evaluation of hypothetical cases, medical students and practicing physicians were shown to use low-relevance information as being supportive of their own diagnostic hypotheses (58).

People have been shown to regularly ignore negative evidence, even though it must be used to make decisions optimally. It is as if intuitive human information processing is rather inept at using negative evidence, i.e., "normal" findings.

Physicians have been shown to use abnormal, but not normal, findings in diagnosing pneumonia in adult outpatients, even though both types of information are required for efficient diagnosis (59). Apparently it is easier to learn that the presence of moderate to severe cough increases the likelihood of pneumonia than it is to learn that the absence of rhinorrhea also increases the likelihood of pneumonia.

"Framing," which refers to the different ways the same information can be presented to people, has been shown to greatly influence, and sometimes reverse, preferences about medical therapies. This phenomenon was demonstrated by McNeil et al. (60) in a study that compared the attractiveness of surgery vs radiation therapy for the treatment of lung cancer. The participants were a large group of ambulatory patients (who did not have lung cancer), a group of physicians, and a group of graduate students. The attractiveness of surgery was substantially enhanced when the treatments were identified rather than not specifically identified, when the information consisted of life expectancy rather than cumulative probability, and when the problem was framed as the probability of living rather than in terms of the probability of dying. Framing effects have raised ethical concerns because the manner in which exactly the same information is presented can influence the choice of therapy for the majority of patients. This realization could lead to the active manipulation of choices by the person or persons presenting the information.

Illusion of Validity

We are often unaware of errors in judgment that we may have made in usual clinical practice. Ethical or pragmatic constraints often prevent the type of feedback that would be necessary to identify and correct such mistakes. This phenomenon has been called the "illusion of validity" (49, 61). It is especially problematic for judgments about diseases that have a variable course. We suspect the presence of the disease; we suggest a treatment; the patient improves, and we assume the improvement was due to our treatment. The risky or costly gold-standard test to confirm the diagnosis may not have been performed, or perhaps the gold-standard test simply doesn't exist. Well-designed studies of clinical decision-making may be our only opportunity to sort out and correct some of the misperceptions we have developed in usual clinical practice.

Human Judgment vs Analytical Approaches

Findings from a large body of literature summarized by Hammond (62) allow us to make some generalizations about the properties of intuitive as compared with analytical approaches to decision-making (see Table 2). We are generally not very aware of the intuitive processes we use, and we generally have little conscious control over them. For analytical approaches, the opposite is true: the methods to be used are explicit, and we have direct control over them. An answer from intuitive problem-solving can usually be obtained rapidly, whereas an analytical approach generally requires a slower, more methodical pace.

Table 2. Properties of Intuitive vs Analytical Approaches(a)

                      Intuition                          Analysis
Cognitive awareness   Low                                High
Conscious control     Low                                High
Rate                  Faster                             Slower
Reliability           Low                                High
Errors                Gaussian distribution              Few, but large
Confidence            High for answer, low for method    Low for answer, high for method

(a) Modified from reference 62; reproduced with permission, Concise Encyclopedia of Information Processing in Systems and Organizations 1990:309; © 1990, Pergamon Press.

A major advantage of analytical approaches is their reliability. When the same information is fed in, we are highly likely to obtain the same value repeatedly. When the same information is given to the same individual to assess intuitively at different times, or is given to several people at the same time, there is a lower likelihood that the same value will be repeatedly obtained.

When a validated analytical system is used in the appropriate patient population, there tend to be fewer errors than with intuitive approaches. When errors do occur, however, they may be very large. A data-entry error or the misplacement of a decimal point may provoke a large error. In aggregate, errors from human intuition tend to be randomly distributed about the true value. A moderate proportion of estimates are reasonably close to the true value, and very few are greatly in error. This random variability is probably why studies have shown that taking the average of several intuitive estimates tends to provide a better estimate than relying on any individual estimate (63).

Because we often have little insight into exactly how we derive an answer from intuition, our confidence in the method tends to be low. We often have great confidence in the answer, however. Because the process for analytical approaches is usually explicit, we tend to have good confidence in the method but may feel uncomfortable about the correctness of the answer.

The advantages and weaknesses of each approach may suggest ways in which each may be used in effective decision-making. Informal or intuitive decision-making holds the clear advantage when the context of the problem-solving task is broad, e.g., determining what the present illness really is and which diseases belong highest in the differential diagnosis. When the decision task becomes fairly narrow, appropriately validated formal quantitative techniques generally prevail over intuition (64, 65).
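The benefit of averaging noted above is easy to demonstrate: when individual estimates scatter roughly at random about the truth, their mean has a smaller expected error. A simulation sketch (all values are invented):

```python
# Sketch: averaging several noisy, roughly unbiased intuitive estimates
# yields a smaller expected error than any single estimate.
import random

random.seed(3)
TRUE_VALUE = 0.30

def estimate():
    """One clinician's noisy probability estimate, clipped to [0, 1]."""
    return min(max(random.gauss(TRUE_VALUE, 0.15), 0.0), 1.0)

trials = 10_000
solo_err  = sum(abs(estimate() - TRUE_VALUE) for _ in range(trials)) / trials
group_err = sum(abs(sum(estimate() for _ in range(5)) / 5 - TRUE_VALUE)
                for _ in range(trials)) / trials
print(f"mean error, single estimate: {solo_err:.3f}")
print(f"mean error, average of 5:    {group_err:.3f}")   # noticeably smaller
```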

Summary

The majority of clinical problems do not have validated quantitative approaches available to be used, so what can clinicians do to make more accurate quantitative estimates? The literature generally supports four factors as being important. People who have more experience usually will perform better than novices. People who have had specific training in the judgment task will tend to provide better estimates, especially if they have had accurate and timely feedback about prior decisions. The average of estimates from several individuals also will tend to provide a more accurate estimate. In addition, even the qualitative use of principles that are fundamental to decision analysis, such as explicitly structuring the problem and characterizing the information needed to make the decision, can produce better judgments.

In the spirit of broadening the base for decision-making in clinical chemistry, I would like to present a brief commercial message on behalf of the Society for Medical Decision Making. This international scientific society has been in existence for only the past decade. During this time, the Society's national meeting and journal, Medical Decision Making, have become important forums for the presentation and discussion of a wide variety of topics and techniques as they relate to decision-making in medicine. Topics include quantitative techniques such as decision analysis; psychological aspects of decision-making such as cognition, attitudes, and values; and health economics and health policy. New and improved methodologies for decision-making, computer-based decision-support systems, and legal and social issues as they influence medical decision-making are additional areas of investigation and discourse. The Society provides an important venue in which clinical chemists and other laboratorians can interact in stimulating and productive ways with quantitatively oriented clinicians and others with similar interests. These interactions should provide the necessary insights and expertise to approach important issues at the interface between clinical chemistry and clinical care.

basic science for clinical medicine. Boston: Little, Brown and Co., 1985:59-138. 10. Galen ES, Gambino SR. Beyond normality: the predictive value and efficiency of medical diagnoses. New York: John Wiley

This work was supported in part by a grant from the Northeast Ohio Chapter of the American Heart Association, the Cuyahoga County Hospital Foundation, the Kech Foundation, and the Robert Wood Johnson Foundation. I would like to thank my co-investiga-

NJ: Prentice Hall, 1990:45-58, 81. 24. Yates JF. External correspondence:

tors in the SUPPORT study for help in providing data for this report. I also thank Barbara Dawson and Vickie Stepka for their help in preparing the manuscript.

References
1. Wennberg JE. Which rate is right? N Engl J Med 1986;314:310-2.
2. Wennberg JE. Small area analysis and the medical care outcome problem. In: Sechrest L, Perrin E, Bunker J, eds. Research methodology: strengthening causal interpretations of nonexperimental data. Rockville, MD: Agency for Health Care Policy and Research, 1990:177-206.
3. Stewart AL, Hays RD, Ware JE Jr. The MOS short-form general health survey. Med Care 1988;26:724-35.
4. Grady ML, Schwartz HA, eds. Medical effectiveness research data methods. AHCPR Publ. No. 92-0056. Rockville, MD: Agency for Health Care Policy and Research, June 1992.
5. McEachern JE, Schiff L, Cogan O. How to start a direct patient care team. Qual Rev Bull 1992;18:190-200.
6. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926-30.
7. Begg CB. Biases in the assessment of diagnostic tests. Stat Med 1987;6:411-23.
8. Weinstein MC, Fineberg HV, Elstein AS, Frazier HS, Neuhauser D, Neutra RR, McNeil BJ, eds. Clinical decision analysis. Philadelphia: WB Saunders Co., 1980:122-4.
9. Sackett DL, Haynes RB, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. Boston: Little, Brown, 1985.
10. … and Sons, 1975:123-5.
11. Sox HC, Blatt MA, Higgins MC, Marton KI. Medical decision making. Boston: Butterworths, 1988:91-3.
12. Weiner DA, Ryan TJ, McCabe CH, Kennedy JW, Schloss M, Tristani F, et al. Exercise stress testing. Correlations among history of angina, ST-segment response, and prevalence of coronary artery disease in the Coronary Artery Surgery Study (CASS). N Engl J Med 1979;301:230-5.
13. Hlatky MA, Pryor DB, Harrell FE, Califf RM, Mark DB, Rosati RA. Factors affecting sensitivity and specificity of exercise electrocardiography: multivariable analysis. Am J Med 1984;77:64-71.
14. Dawson N, McLaren C, Osman M. Cogent subgroups threaten the internal validity and transportability of decision aids [Abstract]. Clin Res 1986;34:262.
15. Knottnerus JA. The effects of disease verification and referral on the relationship between symptoms and diseases. Med Decis Making 1987;7:139-48.
16. Eikman EA, Cameron JL, Coleman M, Natarajan TK, Dugal P, Wagner HN Jr. A test for patency of the cystic duct in acute cholecystitis. Ann Intern Med 1975;82:318-22.
17. Puleo PR, Guadagno PA, Roberts R, Scheel MV, Marian AJ, Churchill D, Perryman MB. Early diagnosis of acute myocardial infarction based on assay of subforms of creatine kinase-MB. Circulation 1990;82:759-64.
18. Pauker SG, Kassirer JP. The threshold approach to clinical decision making. N Engl J Med 1980;302:1109-17.
19. Swets JA. Signal detection in medical diagnosis. In: Jacquez JA, ed. Computer diagnosis and diagnostic methods. Springfield, IL: Charles C Thomas, 1972:8-28.
20. Swets JA. The relative operating characteristic in psychology. Science 1973;182:990-1000.
21. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;8:283-98.
22. Lichtenstein S, Fischhoff B, Phillips LD. Calibration of probabilities: the state of the art to 1980. In: Kahneman D, Slovic P, Tversky A, eds. Judgment under uncertainty: heuristics and biases. New York: Cambridge Univ. Press, 1982:306-34.
23. Yates JF. Judgment and decision making. Englewood Cliffs, NJ: Prentice Hall, 1990:45-58, 81.
24. Yates JF. External correspondence: decompositions of the mean probability score. Organ Behav Hum Perform 1982;30:132-56.
25. Christensen-Szalanski JJ, Bushyhead JB. Physicians' use of probabilistic information in a real clinical setting. J Exp Psychol Hum Percept Perform 1981;7:928-35.
26. Poses RM, Cebul RD, Collins M, Fager SS. The accuracy of experienced physicians' probability estimates for patients with sore throats. J Am Med Assoc 1985;254:925-9.
27. Tierney WM, Fitzgerald J, McHenry R, Roth BJ, Psaty B, Stump DL, Anderson FK. Physicians' estimates of the probability of myocardial infarction in emergency room patients with chest pain. Med Decis Making 1986;6:12-7.
28. Poses RM, Bekes C, Copare FJ, Scott WE. The answer to "What are my chances, Doctor?" depends on who is asked: prognostic agreement and inaccuracy for critically ill patients. Crit Care Med 1989;17:827-33.
29. Connors AF, Dawson NV, Speroff T, Khemka A, Veverka M. Successful assessment of cardiac function in the critically ill using clinical information [Abstract]. Am Rev Respir Dis 1989;139:A17.
30. Hammond KR, Hursch CJ, Todd FJ. Analyzing the components of clinical inference [Review]. Psychol Rev 1964;71:438-56.
31. Hursch C, Hammond KR, Hursch J. Some methodological considerations in multiple-cue probability studies [Review]. Psychol Rev 1964;71:42-60.
32. Schwartz S, Griffin T. Medical thinking: the psychology of medical judgment and decision making. New York: Springer-Verlag, 1986:15, 16, 196-8.
33. Wigton RS. Use of linear models to analyze physicians' decisions. Med Decis Making 1988;8:241-52.
34. Speroff T, Connors AF, Dawson NV. Lens model analysis of hemodynamic status in the critically ill. Med Decis Making 1989;9:243-52.
35. Dawson NV, Speroff T. Validated decision aid for outpatient pneumonia [Abstract]. Clin Res 1989;37:774.
36. Brannen AL, Godfrey LJ, Goetter WE. Prediction of outcome from critical illness: a comparison of clinical judgment with a prediction rule. Arch Intern Med 1989;149:1083-6.
37. McClish DK, Powell SH. How well can physicians estimate mortality in a medical intensive care unit? Med Decis Making 1989;9:125-32.
38. Murphy DJ, Cluff LE. SUPPORT: Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (study design). J Clin Epidemiol 1990;43(Suppl):1S-123S.
39. Knaus W, Harrell F, Lynn J, Connors A, Dawson N, Goldman L, et al. The SUPPORT prognostic model [Abstract]. Clin Res 1992;40:253.
40. Dawson NV, Speroff T, Connors A, Arkes H, Knaus WA, Harrell FE, et al. Use of the mean probability score: comparison of the SUPPORT prognostic model with physicians' subjective estimates of survival [Abstract]. Med Decis Making 1992;12:336.
41. Krantz DS, Baum A, Wideman MV. Assessment of preferences for self-treatment and information in health care. J Pers Soc Psychol 1980;39:977-90.
42. Speroff T, Dawson N, Connors AF, Coulton C, Arkes H, Youngner S, et al. Desire for information: influence on decision making of seriously ill patients [Abstract]. Med Decis Making 1992;12:336.
43. Tversky A, Kahneman D. Judgment under uncertainty: heuristics and biases. Science 1974;185:1124-31.
44. Arkes HR, Hammond KR, eds. Judgment and decision making: an interdisciplinary reader. New York: Cambridge University Press, 1986.
45. Elstein AS. Clinical judgment: psychological research and medical practice. Science 1976;194:696-700.
46. Dowie J, Elstein A, eds. Professional judgment: a reader in clinical decision making. New York: Cambridge University Press, 1988.
47. Dawson NV, Arkes HR. Systematic errors in medical decision making: judgment limitations. J Gen Intern Med 1987;2:183-7.
48. Detmer DE, Fryback DG, Gassner K. Heuristics and biases in medical decision making. J Med Educ 1978;53:682-3.
49. Einhorn HJ, Hogarth RM. Confidence in judgment: the persistence of the illusion of validity. Psychol Rev 1978;85:395-416.
50. Dawson N, Connors A, Speroff T, Khemka A, Shaw P, Arkes HR. Hemodynamic assessment in the critically ill: "Is physician confidence warranted?" Med Decis Making 1993;13 (in press).
51. Connors AF, Dawson NV, Speroff T, Arkes H, Knaus WA, Harrell FE, et al. Physicians' confidence in their estimates of the probability of survival: relationship to accuracy [Abstract]. Med Decis Making 1992;12:336.
52. Dawson NV, Speroff T, Connors AF, Goldman L, Califf R, Bellamy P, et al. Intensity of resource use in the care of the seriously ill [Abstract]. Clin Res 1992;40:579A.
53. Arkes HR, Wortmann RL, Saville P, Harkness AR. The hindsight bias among physicians weighing the likelihood of diagnoses. J Appl Psychol 1981;66:252-4.
54. Dawson NV, Arkes HR, Siciliano C, Blinkhorn R, Lakshmanan M, Petrelli M. Hindsight bias: an impediment to accurate probability estimation in clinicopathologic conferences. Med Decis Making 1988;8:259-64.
55. Feinstein AR. The "chagrin factor" and qualitative decision analysis. Arch Intern Med 1985;145:1257-9.
56. Eddy DM. Probabilistic reasoning in clinical medicine: problems and opportunities. In: Kahneman D, Slovic P, Tversky A, eds. Judgment under uncertainty: heuristics and biases. New York: Cambridge University Press, 1982:249-67.
57. Wolf FM, Gruppen LD, Billi JE. Differential diagnosis and the competing hypotheses heuristic: a practical approach to judgment under uncertainty and Bayesian probability. J Am Med Assoc 1985;253:2858-62.
58. Wallsten TS. Physician and medical student bias in evaluating diagnostic information. Med Decis Making 1981;1:145-64.
59. Christensen-Szalanski JJ, Bushyhead JB. Physicians' misunderstanding of normal findings. Med Decis Making 1983;3:169-75.
60. McNeil BJ, Pauker SG, Sox HC, Tversky A. On the elicitation of preferences for alternative therapies. N Engl J Med 1982;306:1259-62.
61. Bushyhead JB, Christensen-Szalanski JJ. Feedback and the illusion of validity in a medical clinic. Med Decis Making 1981;1:115-23.
62. Hammond KR. Intuitive and analytical cognition: information models. In: Sage A, ed. Concise encyclopedia of information processing in systems and organizations. Oxford: Pergamon Press, 1990:306-12.
63. Poses RM, Bekes C, Winkler RL, Scott WE, Copare FJ. Are two (inexperienced) heads better than one (experienced) head? Averaging house officers' prognostic judgments for critically ill patients. Arch Intern Med 1990;150:1874-8.
64. Blois MS. Information and medicine: the nature of medical descriptions. Berkeley: University of California Press, 1984:158-83.
65. Dawes RM, Faust D, Meehl PE. Clinical versus actuarial judgment. Science 1989;243:1668-74.

Discussion

Fred Lasky: I am reminded of a paper I saw about 10 years ago about the "hassle factor": physicians provided follow-up, made diagnoses, and ordered tests depending on the amount of hassle they perceived they or their patients would receive. I was thinking how that might apply in the examples you presented. Also, what are your thoughts on diagnosis of pneumonia or myocardial infarction, considering the long- and short-term impact from this hassle factor?

Neal Dawson: I can't comment specifically on the hassle factor, but I can give you an alternative hypothesis as to why this may occur. In studies looking at outpatients presenting with cough, where chest roentgenograms were taken for everyone who met certain criteria, physician sensitivity ranged from one-third to two-thirds; in other words, there were a lot of missed infiltrates. A single study has some very suggestive evidence that if you treat unrecognized infiltrates, people will have shorter courses, cough less, be sick less long, and miss less work than if you don't treat them. In that particular study, the only people who benefited from antibiotic use were people for whom the physicians chose not to order chest roentgenograms but who actually had infiltrates. This seems to be a "single synapse" sort of decision; that is, if you are sick enough to get a roentgenogram, you are sick enough to get antibiotics. It doesn't matter what the roentgenogram shows, most of the time. For pneumonia, we miss the diagnosis in a lot of people who actually have the disease, but we don't have any way to get adequate feedback to correct that suboptimal behavior. In the case of myocardial infarction, the test that carries the most weight in the predictive models is the electrocardiogram, and doctors are pretty good at reading those. For many things we predict, there are no single factors to look at that carry a lot of predictive weight. Some people sent home with myocardial infarctions will do fine, but some will not, so there is at least an opportunity to get feedback, because they may either be rehospitalized or die. The same may be true of predictions of survival: if you are taking care of these patients, you have some way of getting feedback about whether your estimate was correct. These are some of the factors that may lead to differences in how well clinicians do on each of those tasks.
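The feedback Dawson describes can be quantified. One common measure is the mean probability (Brier) score that several of the studies cited above use (refs. 24, 40). A minimal sketch follows, with invented estimates and outcomes rather than data from any study discussed here:

```python
# Mean probability (Brier) score: the average squared difference between
# a stated probability and the actual outcome (1 = survived, 0 = died).
# Lower is better. All numbers below are hypothetical illustrations.

def brier_score(probabilities, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

physician_estimates = [0.9, 0.7, 0.4, 0.8, 0.2]  # hypothetical P(survival) estimates
observed_outcomes   = [1,   1,   0,   0,   0]    # hypothetical observed outcomes

print(f"Mean probability score: {brier_score(physician_estimates, observed_outcomes):.3f}")
# A forecaster who always said 0.5 would score 0.25; perfect foresight scores 0.
```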

Noel Lawson: Is the laboratory component of error considered when you derive your decision algorithms? If so, it seems to me from the data we've seen that, because the laboratory component of error is relatively small and the physician component is somewhat larger, the laboratory may be quite satisfactory. Would you please address this?

Neal Dawson: We have not looked at that specifically, but I agree with you: the laboratory probably plays a very small role in contributing to the "noise" in the prediction. When these sorts of studies are done, it is important to look at the incremental value of the information, and we should assess it the same way the physician would ordinarily collect it. After the history, the physical, and some routine tests, by the time you get to the single specialized laboratory test, much of the predictive information may already have been captured by other variables. The incremental value of the laboratory test would certainly be important to examine. To address your specific point, I agree that the laboratory-associated variation would be "peanuts" in the context of the full process.
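One way to make this incremental-value assessment concrete is to compare a predictive model with and without the laboratory variable, added in the order the physician would acquire the data. The sketch below is only an illustration under assumed conditions: the data are synthetic and the variable names (history, physical, lab) are hypothetical, not drawn from any study mentioned here.

```python
# Incremental value of a lab test once history and physical findings
# are already in the model, using synthetic data for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
history = rng.normal(size=n)    # summary of history findings (hypothetical)
physical = rng.normal(size=n)   # summary of physical-exam findings (hypothetical)
# The lab value is correlated with the clinical picture, so much of its
# information is already captured by the other two variables.
lab = 0.8 * (history + physical) + rng.normal(scale=0.6, size=n)

logit = 1.2 * history + 1.0 * physical + 0.3 * lab
disease = rng.binomial(1, 1 / (1 + np.exp(-logit)))

clinical_only = np.column_stack([history, physical])
with_lab = np.column_stack([history, physical, lab])

auc_clinical = roc_auc_score(
    disease, LogisticRegression().fit(clinical_only, disease).predict_proba(clinical_only)[:, 1])
auc_with_lab = roc_auc_score(
    disease, LogisticRegression().fit(with_lab, disease).predict_proba(with_lab)[:, 1])

print(f"AUC, clinical data only: {auc_clinical:.3f}")
print(f"AUC, clinical + lab:     {auc_with_lab:.3f} (increment: {auc_with_lab - auc_clinical:.3f})")
```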

Mario Werner: Clinicians have three sources from which to take information: signs and symptoms, biophysical data, and laboratory data. Signs and symptoms on the whole are neither objective nor quantitative. Biophysical data are objective but on the whole not naturally quantitative. Laboratory data are both objective and quantitative. Dr. Dawson has presented problems that the clinician encounters in dealing with signs and symptoms, much of which appears to be caused by the fact that these findings are neither objective nor quantitative. With a database of other intrinsic properties, different problems might arise. It is this possibility that concerns the laboratory physician: medicine continues to rely on decision rules and strategies appropriate to a historical database, but today's database has different intrinsic properties. The question before us is, how can objective and quantitative data best be used? For example, myocardial infarction can be diagnosed with any one of the three available databases. In some cases signs and symptoms by themselves might suffice, but Dr. Dawson well described the problems when this is the sole database. With only biophysical findings, say, blood pressure and the electrocardiogram, as a database, the problems would become somewhat different, necessitating other decision rules and strategies. Finally, with laboratory data, entirely new problems arise, because we now have "hard" numbers, where analytical uncertainty to a considerable extent takes the place of the physician's intellectual uncertainty in the face of a subjective database. What we are attempting to define here is how much uncertainty in our data would be allowable, were we using our data properly.

Neal Dawson: I think we do ourselves a disservice when we draw a dichotomy between subjective and "objective" data. Science, despite protestations to the contrary, is not a totally value-free enterprise, nor a completely unsubjective one. Time and time again, it has been shown to be a human endeavor, like anything else we do. The question we should put forward, with respect to what kinds of data we should use to make predictions or clinical decisions, is: how good are they for the purpose we want them to serve? Are they reproducible? Are they valid or accurate? If they are, it doesn't really matter from what source they are derived. Other criteria can be applied. What is the cost of acquiring the data? One should not just send everyone to a laboratory or draw blood from everybody without any thought of the prior probability of certain diseases. Clinical information certainly can outperform laboratory tests for some purposes: for example, the activity of rheumatoid arthritis can be followed as well or better by use of questionnaires than by determination of the erythrocyte sedimentation rate. There are probably numerous ways of arriving at the same point with respect to information. In general, it is better to take into consideration the usual acquisition sequence. The history and physical, for example, are extremely valuable for getting the physician to "the right ballpark" for a diagnosis. Once a confined differential diagnosis is clear, then other things, such as laboratory tests and sometimes predictive instruments, can begin to sort out the differential diagnosis. These tools are important because they tend to use information more optimally than do our own judgments, but they are often not very good for making broad characterizations. We need to think about what we want these data to do for us; the source, in and of itself, is not very important. In choosing what data to use, we need to consider our goals. Some goals involve costs and the time it takes to get the data. Data from the history and the physical are routinely available and are going to continue to be available. So we should use the information from the history and physical examination maximally before we go on to the next step, which may involve extra cost or some additional risk. An appropriate use of that information should give you an appropriate prior probability to use as you go to the next test and ask, "What is the result?" and "How do I interpret it?"
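The role of the prior probability in Dawson's last point can be shown with Bayes' theorem: the same test result yields very different post-test probabilities at different pre-test probabilities. A minimal sketch follows, with arbitrary test characteristics chosen only for illustration:

```python
# Updating a pre-test (prior) probability with a test result via Bayes'
# theorem. The sensitivity and specificity values are arbitrary examples.

def post_test_probability(prior, sensitivity, specificity, positive=True):
    if positive:
        true_pos = prior * sensitivity
        false_pos = (1 - prior) * (1 - specificity)
        return true_pos / (true_pos + false_pos)
    false_neg = prior * (1 - sensitivity)
    true_neg = (1 - prior) * specificity
    return false_neg / (false_neg + true_neg)

# The same positive result means very different things at different priors:
for prior in (0.05, 0.30, 0.70):
    p = post_test_probability(prior, sensitivity=0.90, specificity=0.85)
    print(f"pre-test {prior:.0%} -> post-test {p:.0%}")
# pre-test 5% -> post-test 24%; 30% -> 72%; 70% -> 93%.
```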

Mario Werner: I agree with all you are saying, but I don't think you are addressing my point. Surely you would agree that there is a great difference in the quality of information between a patient saying, "I have chest pain," and going on to describe accurately its location, intensity, and duration as well as other properties, and a high serum creatine kinase measurement. These two pieces of information simply have totally different intrinsic properties and, therefore, should be used in different ways. This is not to say that one database is superior to the other; either one may be most useful in the appropriate context. For instance, with gastrointestinal disease, the history is paramount and the laboratory almost noncontributory. On the other hand, with impaired liver function, the laboratory is of prime importance, whereas history and physical findings are only contributory. What you have well described is the uncertainty attached to signs and symptoms as well as to biophysical findings, but we have a different type of uncertainty when the data are objective and quantitative. We can write formal algorithms for the use of signs and symptoms just as for the use of laboratory data, but sources of false positives and false negatives would interfere in completely different ways in the two situations.

Neal Dawson: Let me add that it is important to think in terms of the incremental value of any test. You could look at the incremental value of the physical examination once you know what is in the history. You can look at the incremental value of routine laboratory tests once you know the history and the physical. Some problems are best solved just in the laboratory, no question about it. But many clinical problems are, in fact, not of that nature. Having methods appropriate to the task at hand is very important.

The sorts of postulates you are putting forth can be modeled. We can put together in a decision-analysis framework all that is known about these things and compare how different approaches turn out. We can look at our assumptions, vary those assumptions, and do a sensitivity analysis. By the time we are done, we will most likely understand the problem much better. Even if the problem is not solved, we can often identify some key pieces of information. We may find that the decision hinges on two or three things, one of which we don't know very well, and may decide we need to learn more about that particular variable. There are many things we can haggle about with each other that probably contribute only a very small amount to the decision, i.e., their incremental contribution is probably pretty small. Discovering what consumes most of the predictive power is what we want to understand as we look at these problems.
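A minimal sketch of the modeling exercise Dawson describes follows: a toy expected-value model whose one uncertain input is swept across its range to find where the preferred strategy changes, in the spirit of the threshold approach of ref. 18. All probabilities and utilities below are invented for illustration and stand in for whatever a real model would contain.

```python
# One-way sensitivity analysis of a toy treat/no-treat decision model.
# Utilities (0 to 1) and the disease probability are assumed values.

def expected_utility_treat(p_disease, u_treated_disease=0.85, u_treated_well=0.95):
    # Treating everyone: the diseased do better, the well pay a small cost.
    return p_disease * u_treated_disease + (1 - p_disease) * u_treated_well

def expected_utility_no_treat(p_disease, u_untreated_disease=0.60, u_well=1.00):
    # Withholding treatment: the well are unharmed, the diseased do worse.
    return p_disease * u_untreated_disease + (1 - p_disease) * u_well

# Sweep the uncertain input, P(disease), and report where the preferred
# strategy switches: the treatment threshold.
previous = None
for i in range(0, 101):
    p = i / 100
    choice = "treat" if expected_utility_treat(p) > expected_utility_no_treat(p) else "no treat"
    if choice != previous:
        print(f"p(disease) = {p:.2f}: prefer {choice}")
        previous = choice
# With these assumed utilities the strategy switches near p = 0.17; if the
# decision barely moves as an input varies, that input matters little.
```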