CLIN. CHEM. 34/2, 250-256 (1988)

Use of Alternative Rules (Other than the 1_2s) for Evaluating Interlaboratory Performance Data

Sharon S. Ehrmeyer,1 Ronald H. Laessig,2 and Kathy Schell1

Previous studies have documented the ineffectiveness of using either the group mean ± 2 group standard deviations (SD) or the 1_2s rule as the standard of acceptable performance in evaluating interlaboratory proficiency testing (PT) data. Using computer simulation of PT data, we evaluated the efficiency of 244 alternatives to the 1_2s rule, all based on the PT population’s mean and SD. Using the traditional PT format, we determined the ability of each rule to correctly identify both good and deficient intralaboratory performance. The rules are based on results from one to five PT samples “analyzed” at the same time. Because the effectiveness of the criteria set for acceptable performance in a PT program is influenced by the population SD, each rule’s interlaboratory capabilities were examined for PT populations with interlaboratory SDs ranging from 1% through 10% of the population mean value. All rules achieve their maximum efficiency over a narrow range of interlaboratory SDs. For PT evaluations of intralaboratory performance to be optimally effective, selection of the rule must be based on the SD of the PT population.

Interlaboratory proficiency testing (PT) has become a routine and often mandated component of all competent laboratories’ programs of quality assurance (1). PT serves multiple purposes by providing: (a) information to participants about their own performance, (b) insight into the performance of other laboratories with similar methods and instruments (i.e., defining the “state of the art”), (c) identification of tests or instruments where remedial measures may be needed, and (d) information on individual laboratory performance that third-party agencies use for accreditation and (or) licensure (2-5).

The fundamental premise of regulatory or professionally mandated PT programs is: “If a laboratory performs acceptably in the program, it is assumed that it also analyzes patients’ samples correctly.” However, the basic question remains: “Does meeting interlaboratory performance criteria in PT programs portend acceptable routine laboratory performance?” At the most basic level, it is reasonable to assume that PT samples receive preferential treatment when accreditation or licensure is at issue. Most regulators and professional organizations accept performance, as measured by analysis of PT specimens, as indicative of the “best” that the laboratory can do, not necessarily of routine performance (6). The more fundamental question, however, must address the validity of using PT to certify performance. This application assumes that the criteria (rules) used for evaluation can differentiate between “good” and “deficient” intralaboratory performance with a degree of reliability sufficient for the intended purpose.

1 School of Allied Health Professions, Medical Technology Program, Center for Health Sciences, University of Wisconsin-Madison, Rm. 6169, 1300 University Avenue, Madison, WI 53706.
2 Departments of Pathology and Laboratory Medicine and Preventive Medicine, State Laboratory of Hygiene, University of Wisconsin-Madison.
Received June 1, 1987; accepted October 20, 1987.


Since the 1960s, PT programs such as those of the College of American Pathologists (CAP) and others that meet the requirements of the Clinical Laboratory Improvement Act (CLIA) (7-9) have routinely used the evaluation criterion based on the group mean and ± 2 group SD (1_2s) to “pass” or “fail” participant performance. The 1_2s rule in PT was designed to incorporate state-of-the-art performance into the interlaboratory evaluation process (10). A previous study quantified the effectiveness of the 1_2s rule in evaluating the performance of a laboratory analyzing one sample in a chemistry PT program (11). Other techniques to evaluate intralaboratory performance have been identified (12-18). We recognize a movement among PT providers to use “fixed limits,” and we have studied the implications of this approach in a separate paper (19). Preliminary studies (20-22) indicated that (a) the choice of the criterion selected for evaluating intralaboratory performance strongly influences a PT program’s ability to distinguish good and deficient intralaboratory performance and (b) currently used PT criteria do not consistently reflect actual intralaboratory performance. At a minimum, PT criteria should warn the laboratory (as well as the regulatory agencies) when intralaboratory performance is not meeting patients’ needs. Likewise, no useful purpose is served when good laboratories are failed or when deficient laboratories are passed. For this study, we devised 244 alternative rules and evaluated their “effectiveness,” i.e., their ability to correctly characterize intralaboratory performance. By making these rules available for use in PT, the profession will be able to accomplish the fundamental (regulatory) objective of correctly characterizing the intralaboratory performance of all participants.

Materials and Methods

Our previously described computer model (23) quantified the effect of intralaboratory performance characteristics, i.e., specified levels of internal imprecision (CV) and inaccuracy (bias), on the ability of conventional PT criteria to characterize intralaboratory performance correctly. This study is an attempt to develop alternative criteria (rules) encompassing state-of-the-art interlaboratory performance (i.e., group mean, SD) that more effectively reflect intralaboratory performance. All of the rules are based on a unit of convenience analogous to the Standard Deviation Interval (SDI) used (8) by the CAP PT programs (see below).

To evaluate the proposed rules, the model simulated a PT program consisting of 401 participating laboratories. Each laboratory in the PT program “analyzed” from one to five samples per “shipment.” PT programs frequently use multiple samples, but to date no studies have evaluated performance based on rules that consider the combined performance of multiple samples. In the computer model of the PT program, the result obtained by the test laboratory, $X_T$, is a function of a unique, assigned intralaboratory bias and CV combination that defines actual intralaboratory performance. $X_T$ is generated, based on the true value and the test laboratory’s CV and bias, by use of a gaussian random-number generator. The results for the other 400 participating laboratories in the PT population are generated similarly and are based on their individual bias and CV values. The population mean and interlaboratory SD are calculated. The model emulates data obtained from “sending” the group of 401 laboratories identical sets of samples for 400 surveys in a PT program. See reference 23 for a detailed description of its function.

Each test laboratory result is evaluated by comparing it with the group (interlaboratory) mean, $\bar{X}_g$, and the group SD, $SD_g$, by the formula:

$$\mathrm{SDI} = \frac{X_T - \bar{X}_g}{SD_g}$$

The test laboratory will pass PT or fail PT on the basis of how the chosen rule interprets the SDI obtained from one to five samples in the “shipment.” In conventional PT programs SDI ≤ 2.0 is used as a criterion of acceptability, the interlaboratory analog of the “Westgard 1_2s rule” (13).

To investigate the possibility of finding more-effective PT rules, we used an empirical approach for evaluating all possible criteria that could be devised from a PT program in which one to five samples (challenges) are used per shipment. Our approach deliberately used some rules so stringent and some so broad as to be trivial; however, this process ensures that the optimal rule is encountered. For example, if a theoretical PT program sends only one sample per shipment, the proposed rules would set the pass-fail criteria at N(SDI) where N varied from 0.5 to 5.0. While N = 0.5 is obviously overly stringent and N = 5.0 overly broad, the optimum value of N is clearly within these limits. The 1_2s rule is included in this group.

In all, 245 nonredundant rules were devised and evaluated. All use the SDI in some manner to compute limits of acceptable performance. The rules can be divided into four classes.

1. Single-sample rules (33 rules): This group includes conventional rules, such as the 1_2s rule. However, the one out of two results exceeding a specified SD, one out of three, etc. represent an extension of this approach. When the result from only one sample is evaluated, it may come from a set of from one to five samples, which may be identical or different. If the laboratory’s SDI on any one sample in the set exceeds (N)($SD_g$), where N is a constant that varies from 1.0 to 5.0 for different rules, it is declared to have failed to meet the PT criterion. For example, in the Tables and Figures that follow, the “1 of 5, SDI >2” rule is defined as one sample result out of the set of five samples that falls outside the SDI >2 limit.

2. Absolute rules (37 rules): These rules are based on adding the results from multiple samples. Intuitively, the use of multiple measurements will increase reliability. The test laboratory’s individual SDIs on shipments of two, three, four, or five samples are added without regard to sign. If the sum exceeds the value (N)($SD_g$), where N ranges from one to 10, the test laboratory’s results fail to meet the criterion. For example, a laboratory fails the “5 abs SDI >6” rule when the sum of the absolute (abs) values of the SDI from five samples exceeds (N)($SD_g$), which is 6.0. Theoretically, the absolute rules will not differentiate between random and systematic error.

3. Algebraic rules (37 rules): These rules also add results from multiple samples to increase their discriminatory power. The rules are similar to the absolute rules, except that the SDIs are added with regard to sign. If the sum on two, three, four, or five samples exceeds the (N)($SD_g$) limit, where N ranges from one to 10, the laboratory fails to meet the criterion. The “4 alg SDI >3” rule simulates sending four PT samples and determining if the algebraic (alg) sum of the four SDIs exceeds the limit of 3.0. Note that since the value of the argument, i.e., 3.0, can be either positive or negative, the algebraic rules really represent 74 rules; however, 37 of these are redundant. The algebraic rules should detect systematic error but be insensitive to random error, especially as the number of samples increases.

4. Pattern rules (138 rules): Theoretically, pattern rules should differentiate between systematic and random error. The “Westgard 2_2s” rule is the simplest example (13). Clearly, in an interlaboratory quality-control program, successive results >2.0 $SD_g$ from the mean and on the same side (both positive or negative) or on opposite sides (one positive and one negative) of the mean have different implications. The rule recognizes both the magnitude and the distribution of the laboratory’s error. The pattern rules can be based on three, four, or five samples. These rules require that two or more of the results of the individual test laboratory exceed a specified (N)($SD_g$) limit, and do so in a specific pattern. For example, if two samples were sent in the PT program, one version of the “2 of 2, SDI >3” rule would require that the SDI on both of the test laboratory results exceed the (3.0)($SD_g$) limit and that they lie on the same side of the group mean. A second version of the “2 of 2, SDI >3” rule requires that both exceed the limit, and fall on opposite sides of the mean. If “+” and “-” are used to designate outliers above and below the mean, these two rules can be conceptualized as: version one, “+ +” and “- -”; and version two, “+ -” and “- +”, respectively.

The pattern rules for three, four, and five samples present the opportunity for a geometric progression of combinations. The number of cases can be constrained by eliminating equivalent or redundant patterns. We do this by defining rules that are as follows: two of three, two of four, and two of five outliers falling on the same side or opposite sides of the mean as above. Likewise, for the three of three, three of four, three of five, four of four, four of five, and five of five rules, outliers can fall on the same or opposite sides of the mean. For example, with the rule in which three out of five results fall outside the stated limit on the “opposite sides” of the mean, all patterns that mix “+” and “-” outliers (e.g., “+ + -”) are judged to be equivalent, whereas “- - -” and “+ + +” are defined as the rule in which three out of five fall on the “same side” of the mean. An example of a rule following this pattern is “3 of 5, SDI >2s”; three out of a set of five results exceed the SDI >2 limit and fall on the same side of the mean. The combinations include 10 nonredundant same-side rules, with SDIs ranging from 1.0 to 4.0, for a total of 69 combinations; 69 opposite-side rules make up the remainder.
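The SDI calculation and the four rule classes can be sketched in Python. This is only an illustration under stated assumptions, not the authors' simulation program (reference 23): all function names are hypothetical, the group mean and SD are computed from the peer laboratories' results, single-sample and pattern limits are treated as two-sided, "m of k" is read as "at least m", and the algebraic sum is compared in absolute value because the text notes that its limit can be positive or negative.

```python
import random
import statistics


def simulate_result(true_value, bias_pct, cv_pct, rng):
    """One simulated analysis: Gaussian result with the lab's bias and CV (% of true value)."""
    mean = true_value * (1 + bias_pct / 100)
    sd = true_value * cv_pct / 100
    return rng.gauss(mean, sd)


def sdi(x_test, group_results):
    """SDI = (X_T - group mean) / group SD (group statistics from peers: an assumption)."""
    return (x_test - statistics.mean(group_results)) / statistics.stdev(group_results)


def fails_single(sdis, n):
    """'1 of k, SDI > n': fail if any one result lies beyond the limit (two-sided, assumed)."""
    return any(abs(s) > n for s in sdis)


def fails_absolute(sdis, n):
    """'k abs SDI > n': fail if the sum of |SDI| exceeds n."""
    return sum(abs(s) for s in sdis) > n


def fails_algebraic(sdis, n):
    """'k alg SDI > n': fail if the signed sum exceeds n in either direction."""
    return abs(sum(sdis)) > n


def fails_pattern(sdis, m, n, same_side=True):
    """'m of k, SDI > n' with outliers on the same side (s) or on both sides (d)."""
    above = sum(1 for s in sdis if s > n)
    below = sum(1 for s in sdis if s < -n)
    if same_side:
        return above >= m or below >= m
    return above + below >= m and above > 0 and below > 0


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    peers = [simulate_result(100, 0, 3, rng) for _ in range(400)]
    shipment = [sdi(simulate_result(100, 4, 3, rng), peers) for _ in range(5)]
    print("SDIs:", [round(s, 2) for s in shipment])
    print("1 of 5, SDI >2:", fails_single(shipment, 2))
    print("5 abs SDI >6:  ", fails_absolute(shipment, 6))
    print("5 alg SDI >5:  ", fails_algebraic(shipment, 5))
    print("3 of 5, SDI >2s:", fails_pattern(shipment, 3, 2, same_side=True))
```

Keeping each rule class as a separate predicate on a list of SDIs makes it easy to apply many (N, sample-count) combinations to the same simulated shipment, which is the kind of exhaustive comparison the text describes.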

Rule Effectiveness

Table 1 illustrates the computer model’s output for a single-sample rule where the population SD equals 6.0. The y-axis lists intralaboratory CVs in the test laboratory; the x-axis lists the intralaboratory biases. The tabular values indicate the percentage of times a laboratory with a particular bias and CV combination will fail a PT program when the 1_2s rule is used for evaluation. Obviously, a laboratory with intralaboratory CVs of 8%, 9%, and 10% (upper left) would be so imprecise that it could not produce reliable data, because of random error. Likewise, a laboratory with excessive bias (lower right) would also tend to produce consistently wrong, but precise, answers. Small biases and CVs (lower left) are consistent with good intralaboratory performance, large biases and (or) CVs with deficient performance.

Table 1. Percentage of Test Laboratory Results Failing a PT Program Using the 1_2s Rule When the Group Mean = 100 and the Group SD = 6.0

Rows: test laboratory’s intralaboratory CV, % of mean. Columns: test laboratory’s intralaboratory bias.

CV \ Bias     0    1    2    3    4    5    6    7    8    9   10
10           21   21   23   24   25   28   34   36   38   40   43
9            18   17   19   21   23   24   28   32   35   39   41
8            13   14   15   16   19   21   22   29   32   36   40
7             9   10   11   12   13   16   20   23   30   34   38
6             5    6    6    8   10   12   15   20   26   32   36
5             2    2    3    4    6    9   12   15   20   27   34
4             0    0    1    2    3    5    7   11   15   21   30
3             0    0    0    0    1    2    3    6    9   15   24
2             0    0    0    0    0    0    0    1    3    7   14
1             0    0    0    0    0    0    0    0    0    0    3

A rule’s effectiveness must be evaluated against two criteria: (a) its ability to characterize laboratories whose performance is “good,” by declaring them to pass PT, and (b) its ability to declare that those laboratories whose performance is “deficient” fail PT. For the purposes of this study, good intralaboratory performance is defined as the test laboratory having CV-bias combinations located in the lower left-hand sector (I) of Table 1. These combinations of CV and bias represent a laboratory with a routine error of ±10% or less (24). For the ensuing discussion, we will assume that a “good” laboratory, one producing medically useful data, controls its systematic and random error to the degree that 95% of its results will have a total relative error of ±10% or less. A “deficient” laboratory is one that has any of the 22 CV-bias combinations in the middle sector (II) of Table 1. It will produce results that exceed the 10% error limit more than 5% of the time (24). Laboratories with these CV-bias combinations are assumed to be performing at a level of competence below that required to yield medically useful results.

Ideally, when a PT program applies a particular rule, it should declare that the performance of laboratories with the combinations of CV and bias in sector I passes. Likewise, it should declare that those laboratories in sector II fail. We did not further evaluate the rules’ ability to detect large intralaboratory errors; i.e., we did not include those CV-bias combinations above sector II. Because these laboratories are making errors so gross, including them would make even an incompetent rule appear to be effective. Sectors I and II represent a rigorous test of any rule’s ability to identify good and deficient intralaboratory performance.
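The definition of "good" performance (95% of results within ±10% total relative error) can be illustrated with a short calculation. The sketch below assumes that a laboratory's relative error is normally distributed with mean equal to its bias and SD equal to its CV (both in % of the true value); the function names are hypothetical, and this analytic shortcut is not the authors' simulation, so it should not be expected to reproduce the exact sector boundaries of Table 1.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fraction_outside(limit_pct, bias_pct, cv_pct):
    """Fraction of results with relative error beyond +/- limit_pct,
    assuming error ~ Normal(mean=bias_pct, sd=cv_pct)."""
    if cv_pct == 0:
        # No random error: every result sits exactly at the bias.
        return 1.0 if abs(bias_pct) > limit_pct else 0.0
    upper = 1.0 - normal_cdf((limit_pct - bias_pct) / cv_pct)
    lower = normal_cdf((-limit_pct - bias_pct) / cv_pct)
    return upper + lower


def is_good(bias_pct, cv_pct, limit_pct=10.0, allowed=0.05):
    """'Good' = no more than 5% of results exceed the +/-10% total error limit."""
    return fraction_outside(limit_pct, bias_pct, cv_pct) <= allowed


if __name__ == "__main__":
    for bias, cv in [(0, 5), (2, 4), (0, 6), (6, 4)]:
        share = 100 * fraction_outside(10.0, bias, cv)
        label = "good" if is_good(bias, cv) else "deficient"
        print(f"bias={bias}%, CV={cv}%: {share:.1f}% of results beyond 10% -> {label}")
```

Because bias shifts the whole error distribution toward one limit, a laboratory with modest imprecision can still be deficient once its bias is large, which is why the sectors of Table 1 are defined on CV-bias combinations rather than on either quantity alone.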
One can characterize the effectiveness of any given rule in an interlaboratory PT program in a manner analogous to that suggested by Galen and Gambino (25) for evaluating the ability of a laboratory test to diagnose disease. In sector I, good laboratories passed by PT are the true negatives; those that fail are the false positives. To carry the analogy one step further: in sector II, the deficient laboratories correctly identified (i.e., those that fail PT) are the true positives, whereas those passed are the false negatives. Using the Galen/Gambino scheme, the net effectiveness of any rule can be characterized by its efficiency, which is its ability to identify both good and deficient laboratories correctly.

The values in Table 1 are the percentage of times a test laboratory with particular CV-bias combinations fails the PT program. Ideally, no laboratories in sector I should fail PT. Values greater than zero represent the percentage of false positives for that particular intralaboratory CV-bias combination. In sector II, ideally all laboratories with any of these CV-bias combinations should fail. The difference between the Table’s values and 100 represents the percentage of false negatives. Rather than evaluate each CV-bias combination, the model calculated the median value for the 24 and 22 cells in sectors I and II, respectively, and used these median values to compute efficiency:

$$\text{Efficiency} = 100 \times \frac{\text{true positives} + \text{true negatives}}{\text{total number of results}}$$

The ideal value for efficiency is 100%; it was calculated for 245 rules at 10 PT population SDs ranging from 1% to 10% of the mean value. Because efficiency is a function of the prevalence of true positives and true negatives in a PT population, efficiencies were calculated for prevalence rates for deficient laboratories of 10%, 5%, and 1%.
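The role of prevalence in the efficiency calculation can be made concrete with a small sketch. It assumes that efficiency is the prevalence-weighted combination of the percentage of good laboratories passed (sector I) and the percentage of deficient laboratories failed (sector II); the function name and argument names are hypothetical. Under this assumption a rule that passes every laboratory has an efficiency equal to the prevalence of good laboratories.

```python
def efficiency(pct_good_passed, pct_deficient_failed, prevalence_deficient):
    """Efficiency (%) = share of all laboratories classified correctly.

    pct_good_passed      -- % of good (sector I) laboratories that pass PT
    pct_deficient_failed -- % of deficient (sector II) laboratories that fail PT
    prevalence_deficient -- fraction of deficient laboratories in the population
    """
    prevalence_good = 1.0 - prevalence_deficient
    true_negatives = prevalence_good * pct_good_passed
    true_positives = prevalence_deficient * pct_deficient_failed
    return true_negatives + true_positives


if __name__ == "__main__":
    # A nondiscriminating rule: everyone passes, no deficient laboratory is caught.
    for prevalence in (0.10, 0.05, 0.01):
        print(prevalence, round(efficiency(100, 0, prevalence), 1))
    # A rule that passes 89% of good and fails 39% of deficient laboratories.
    print(round(efficiency(89, 39, 0.10), 1))
```

With few deficient laboratories in the population, the first term dominates, so a rule can score a high efficiency while detecting almost no deficient laboratories. This is why the percentage of deficient laboratories failed is worth reporting alongside efficiency.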

Results

Proficiency Testing Program Philosophy

In selecting a rule to evaluate laboratories’ performance, a PT program must choose between conflicting goals. Is it more important to identify deficient laboratories (analogous to setting very stringent limits), or is it more important to assure that good laboratories are not misidentified? As the computer model demonstrates, meeting both goals is not a realistic expectation. A nondiscriminating rule, e.g., one that declares that 100% of the laboratories in both sectors I and II pass, will have a limiting efficiency, dependent on the prevalence of good laboratories. For this study, we set the prevalence rates at 90%, 95%, and 99%, resulting in limiting values of efficiency of 90%, 95%, and 99%, respectively.

To illustrate this dilemma, Figure 1 displays the effectiveness of the familiar 1_2s rule when it is used to identify both good and deficient laboratories, where “good” is defined as obtaining results within ±10% of the true value 95% of the time. The rule’s effectiveness is strongly affected by the group SD of the PT survey population. The upper curved line, depicting the ability to identify correctly or “pass” good laboratories in sector I, indicates that at a population SD of 2, the 1_2s rule will identify only 70% of the good laboratories and fail 30%. This curve rises very steeply, so that at a population SD of 4, the 1_2s rule is capable of identifying 97% of the good laboratories. However, at the same time the rule’s ability to identify deficient laboratories correctly falls off precipitously with increases in the group SD. At a population SD of 4, only 20% of the deficient laboratories are failed. When the population SD is small, good laboratories, i.e., those well within the limits of medical usefulness, are penalized while performing quite adequately. When the SD is large, no laboratory, good or bad, is declared to fail the PT program. In all probability this poor performance characteristic of the 1_2s rule with respect to good laboratories is what has led various PT programs to consider adopting alternatives such as “fixed limits” to evaluate performance, especially as instrumentation has become better calibrated and more precise, resulting in smaller population SDs.

Fig. 1. The effectiveness of the 1_2s rule in correctly identifying good and deficient laboratories. (Curves: % of good labs identified as passing; % of bad labs identified as failing. x-axis: SD, percent of population mean.)

It is not practicable to list or graph the efficiencies of all 245 rules at 10 population SDs at three prevalences. To facilitate the ensuing discussion, we selected the best-performing rules for further study. In Table 2, up to 10 rules with efficiencies >90% at a 10% prevalence of deficient laboratories were selected at each population SD. The first five entries in each section list optimal rules for a PT program whose philosophy emphasizes accurately identifying the largest number of laboratories, whether good or deficient. These are the rules with the highest efficiency. At low prevalence rates, this means identifying the largest possible percentage of good laboratories correctly. The second group of entries in each section represents rules selected for an optimum combination of high efficiency and the ability to detect a high percentage of deficient laboratories correctly. For comparison purposes, similar data for the 1_2s (1 of 1, SDI >2) rule have been included in parentheses at each population SD.

Figure 2 illustrates the efficiency of a “best” rule selected from Table 2 for each population SD. The maximum for each curve represents the rules’ ability to identify the largest percentage of both good and deficient laboratories correctly. It also represents the point at which the most deficient laboratories are identified. The percentages of deficient laboratories identified for each rule are the numbers superimposed on the x-axis. The nature of the plots reflects the tendency of each rule to exhibit optimum performance (detecting good and deficient) over a narrow range of population SDs and to lose its ability to identify deficient laboratories as the population SD increases.

Discussion

A PT program that adopts the philosophy that it is more important to identify good laboratories correctly would simply use the rules with the highest efficiency. These are the first five rules reported in each group in Table 2. The maximum efficiency over a short range of population SDs reported for any rule is 94%, 97%, and 99.4% for prevalences of 10%, 5%, and 1%. Although rules with the highest efficiencies identify the greatest percentage of the good laboratories, they also fail to identify a significant percentage of deficient laboratories. The second group of rules were included in Table 2 to represent the best compromise between identifying a high percentage of both good and deficient laboratories. At least five general observations can be made:

1. The greatest difficulty lies in selecting rules for very small population SDs where even the “best” rules detect only a relatively small percentage of deficient laboratories. This corresponds to the problems caused by shrinking interlaboratory SDs observed currently in PT programs.

2. The choice of the “best” rules changes as a function of population SD. For small population SDs, the pattern rules are the most efficient. At intermediate population SDs, the single-sample rules are better. As the population SD increases, absolute and algebraic sum rules are the best, particularly the algebraic summing rules.

3. Those rules that involve multiple samples are generally the most efficient. Because laboratory error is usually a combination of CV and bias, these rules have the best opportunity to detect imprecision.

4. The magnitude of the combination (N)(SDI) that yields the best rule depends on the population SD. At population SDs of 1%, 2%, and 3%, single-sample and pattern rules based on large multiples of the SDI such as “1 of 4, SDI >4” and “3 of 5, SDI >4” are more effective. At large population SDs, rules based on smaller multiples are better. Because the definition of good performance (±10% total error) is invariant, the rule needs the tighter (smaller) values for (N)(SDI) at large population SDs to achieve discriminating power.

5. The 1_2s rule is never a good choice. Although it does reach a maximum efficiency of 91% at a population SD of 5 (Table 2), many other rules are more capable of characterizing laboratory performance. In addition, at smaller population SDs, it misidentifies good laboratories and fails them in PT programs.
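The dependence of the 1_2s rule on the population SD can be illustrated with a short analytic sketch. It assumes that a single result's relative error is normally distributed around the laboratory's bias with SD equal to its CV, and that the group mean sits at the true value, so the rule fails any result beyond ±2 population SDs. The function name and the two example laboratories are hypothetical, and the numbers are not the simulated values plotted in Figure 1.

```python
from statistics import NormalDist


def fail_rate_1_2s(bias_pct, cv_pct, population_sd_pct):
    """Percentage of single results failing the 1_2s rule (limit = 2 x population SD).

    All arguments are in % of the true value; cv_pct must be greater than zero.
    """
    limit = 2.0 * population_sd_pct
    lab_error = NormalDist(mu=bias_pct, sigma=cv_pct)
    inside = lab_error.cdf(limit) - lab_error.cdf(-limit)
    return 100.0 * (1.0 - inside)


if __name__ == "__main__":
    # Hypothetical laboratories: one within the +/-10% total error goal, one not.
    good_lab = (0.0, 4.0)       # (bias %, CV %)
    deficient_lab = (8.0, 4.0)
    print("pop SD   good lab failed, %   deficient lab failed, %")
    for sd in range(1, 11):
        good = fail_rate_1_2s(*good_lab, sd)
        poor = fail_rate_1_2s(*deficient_lab, sd)
        print(f"{sd:6d}   {good:18.1f}   {poor:23.1f}")
```

The printed rows show the same qualitative trade-off the text describes: at small population SDs the limit is so tight that a good laboratory fails often, and as the population SD grows the limit widens until even the deficient laboratory is rarely failed.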

Table 2. Best Rules to Evaluate Intralaboratory Performance at Three Prevalences of Deficient Laboratories

Rule | Effectiveness, %: Sector I(a) | Effectiveness, %: Sector II(a) | Efficiency, %, at prevalence of: 10% | 5% | 1%

100

18

92

96

$

100

10

96

PopulationSD = #{163}5 alg >4 SDI 1 of 3, SDI >1.5

S0I >2.5 d SDI >2.5 d

100

7

91 91

99 99

95

99

100

7

91

95

99

4 of 5, SDI >3.5 S 3 of 4, SDI >2.5 d 2 of 5, SDI >3 d 4 of 5, SD1 >2.5 d 4 of 4, SDI >4 $ (1 of 1, SDI >2)c

98 100 100 100

29 6 6 6

91 91 91

95 95 95

g 99 99

1 of 5, SDI >1.5 2 of 5, SDI >1 s 2 of 4, SDI >1 s

6

91 91

95

100 35

81

100 100 98 100

41

Rule

PopulationSD 4 of 5, 4 of 4, 3 of 5, 2 of 4,

1

b

501 >3.5

PopulationSD 2 of 5, 2 of 4, 3 of 5, 2 of 5, 1 of 3,

=

SDI >4

=

SD1 >3.5 s SDI >2.5 s SDI >4 $ SDI >4

1 of 5, SDI >4 1 of 4, SDI >4

98

54

94

96

98

99

40

93

96

98

#{163}5 abs >5 SDI

100

33

93

97

99

1 of 4, SDI >1.5

98

50 58

93 93

96 95

98

63 49 50 35 35 3

91 91 91 92 92 90

92 94 94 95 95 95

94 96 96 97 97 99)

99 100 98

46

94

96

98

37

94

97

99

52

100

96 97

98 99

100

30 24

93 93

92

96

99

98

41

92

95

97

97 99 99 100 100

39 34 33 1

91 93 92 92 90

94 96 96 96 95

96 98 98 99 99)

100

97

31 54

93 93

97 95

99 97

100 100

25 25

93 93

96 96

99 99

98

43

93

95

97

97

47

92

95

97

98 99

34 32

95 96 96

97 98 99

99 99

#{163}3 alg >2.5 SDI #{163}3 abs >3 SDI

40

35)

#{163}2 alg >2 SDI (1 of 1, SDI >2)

94 93

97 96

99 99

PopulationSD 2 of 5, SDI >1

92 92

95 96

97 99

92

95

97

92 92

94 95 95 96

1 of 5, SOt >1.5 #{163}5 aIg >4 SDI 1 014, SDI >1.5 #{163}2 abs >2 SDI

=

97 94 96 96 98 98 100

S

97 99

43

92

30 61

92

96

98

69

68

69

69)

>3

99

47

96

98

>3

100 100

38

94 94 94

94 93

97 97 96 97

99 99 98 99

PopulationSD = 8 2 of 5, SOl >1 s #{163}5 alg>3 SDI 15 alg>4 SOl 14 alg >3 SDI

93 92 92

94 94

13 alg >2 SDI 14 alg >2.5 SDI 12 alg >1.5 SDI

PopulationSD

=

91

3

1%

97

7

93 93 93 94

61 58

5%

6

71

2 of 4, SDI >3 $ 4 of 5, SDI >2 $ (1 of 1, SDI >2) SDI SDI SDI S0i

10%

94 95 95

2 of 5, SOl >3 s

1 of 4, 1 of 3, 2 of 5, 1 of 5, 1 of 5,

97

26 42 23 49

Sector

ii

Rule

95 37

2

SDI >3.5 s

Sector’

Efficiency, %, at prevalence of

#{163}4 alg>3 SDI #{163}3 abs >2.5 SDI #{163}3 alg >2.5 SDI 2 of 4, SDI >1 s 1 of 3, SDI >1.5 (1 of 1, SDI >2)

22

98

37 55

SDI >3.5

100

33

#{163}5 alg >7 SDI 2 of 5, SDI >2 s #{163}4 alg >6 SD1 #{163}5 alg>8 SDI

94

70

94 94

62 60

92 91 91

98

51

93

96

98

14 abs >3 SDI

#{163}4 abs >7 SDI

96

48

91

94

96

13 abs >2.5

SDI

100

24

92 92 92

(1 of 1, SDI >2)

89

39

84

87

89)

2 of 4, SDI >1 s (1 ofl, SDI >2)

100 100

20

92

96

99

0

90

95

99)

97

59

93

95

97

98 99 99 98

50 40 39 45

93 93 93 93

95 96 96 95

98 98 98 97

>2.5 $ >3

PopulationSD = 1 of 5, SOt >2.5

4

#{163}5 alg>6 SDI 1 of 4, SDI >2.5 #{163}5 abs>7 SDI 2 of 5, SDI >2 $ 2 of 5, SDI >1.5 s 1 of 3, SD1 >2 2of4,SDI >1.5 s #{163}4 alg >5 SDI #{163}3 alg >4 SDI (1 of 1, SOl >2)

PopulationSD = 5 2 of 5, SDI >1.5 s 1 of 4, SDI >2 1 of 5, SDI >2 #{163}5 abs >6 SDI #{163}5 alg >5 SOl

99 98 99

PopulationSD lofS,SDI>1 1 of 4, SDI >1

96

98 99

1 of 3, SDI >1 15 alg >3 SDI

91

92

94

15 abs >3 SDI

91 91 92 93 90

93 94 95 95 94

95 96 97 97 97)

15 14 13 14 13

100

42

94

98 99

53 34 40

94 93 93

97 96 97 96

100

29

93

94

62

95 96

53 49 47 43 21

100

97 98 98

=

alg >2.5 SDI abs >2.5 SDI abs >2 SOP alg >2.5 SDI alg>2 SDI (1 of 1, SOt >2)

100 100

39

94

97

99

PopulationSD

36 43

97 96

99

99

94 93

100

33 47

93 93

97 95

99 97

1 of 4, SOP >1 1 of 5, SDI >1 15 alg >2.5 SD1

54 50 49 48

91 91 92 92

93 94 95 95

g 96 97 97

98 95 96 97 97 97

98

=

1 of 3, SDI >1

#{163}4 abs >2.5 SDI 13 alg >1.5 SDI #{163}3 alg >3 SDI 14 alg >2 SDI 3 of 5, SDI >1 $ 15 abs >3 SDI #{163}4 alg >4 SDI 33 91 94 96 13 abs >2 SDI #{163}2 abs >2.5 SDI 11 91 96 99) 14 alg >2.5 SDI 100 (1 of 1, SDI >2) (1 of 1, SDI >2) #{163}5 abs >5 SDI

(a) Sector I refers to the lower-left-hand sector of Table 1; sector II, the middle sector of Table 1.
(b) s = outliers fall on the same side of the mean, d = outliers fall on different sides of the mean.
(c) For comparison, the results for the 1_2s rule are included in parentheses for each population SD.

9

95

61

92

93

95

98

42

92

95

97

97

34

91

94

96

99

33

92

99

31

92

96 96

98 98

100

0

90

95

99)

100

94

97

94

96

99 98

100

37 44 49 28

93 93

96 96

98 99

100

27

93

96

99

96

49 47 33 24 22 0

91 91 92 92 92 90

94 94 96 96 96 95

96 96 98 99 99 99)

10 99

98

96 99 100 100

100

Fig. 2. Best rules for given PT population SDs at a 10% prevalence of deficient laboratories. (x-axis: SD, percent of population mean.)

Intuitively, regulatory PT programs have as their goal the identification of deficient laboratories. This means tolerating misidentification of some good laboratories. The rules that evaluate multiple samples are generally best suited to this purpose; the two forms of “sum” rules become more effective at larger population SDs, whereas at smaller population SDs other types of rules would be a better choice. Obviously, any of these rules can detect a good percentage of deficient laboratories only for a specific and narrow range of population SDs. The prevalence of good laboratories determines the maximum efficiency that any rule can achieve. As Table 2 shows, the choice of rule is, generally, prevalence independent; the best rule at 10% is also the best rule at 1% prevalence.

Conclusions

The study of all possible variations of evaluation criteria yields several possible “best” choices for each population SD. The list of best rules in Table 2 indicates that the “sum” rules, which benefit from the ability to use multiple data points, appear frequently, but not to the exclusion of the single-sample rules. The pattern rules are sparsely represented and useful only at small population SDs.

The adequacy (or inadequacy) of the criteria now used to evaluate laboratory performance in PT programs has been the topic of multiple papers (11, 19, 21, 22). We believe that our current efforts shed some light on why any criterion seems to exhibit inadequate performance in some instances and not in others. The choice of the optimal rule depends on the program’s goal of identifying correctly either good or deficient laboratories. The ability to predict which rule has the optimum efficiency for a specific population SD has profound implications for PT programs. A PT program could “select” the best rule to evaluate performance based on the SD of the PT population or of a homogeneous subpopulation group. The 12s rule would not be one of those chosen because other rules exhibit better performance characteristics.

When PT is used in a regulatory setting, common sense demands that the criterion chosen to evaluate laboratory performance detect deficient laboratories. Given a knowledge of the best rules, immediate improvements can be made in PT programs’ ability to correctly evaluate laboratories’ performance. This study at least provides some information on the effect of the CAP Chemistry Resource Committee’s (26) announced intent to abandon the 12s rule and move toward a “fixed limit” approach. However, on the basis of these data, we believe that PT evaluations may be further improved by use of an analog of the Westgard multi-rule approach (27), and we have taken this under study.

References

1. Wilcox KR, Baynes TB, Crable JV. Laboratory management. In: Inhorn SL, ed. Quality assurance practices for health laboratories. Washington, DC: Am Public Health Assoc, 1977:3-426.
2. Dorsey DB. The evolution of proficiency testing in the USA. In: Proc 2nd Natl Conf on Proficiency Testing. Bethesda, MD: Information Services, 1975:8-9.
3. Forney JE, Blumberg JM, Brooke MM, et al. Laboratory evaluation and certification. Op. cit. (ref. 1):127-71.
4. Gilbert RK, Rosenbaum JM. Accuracy in interlaboratory quality control programs. Am J Clin Pathol 1979;72:260-4.
5. Eilers RJ. Total quality control for the medical laboratory: the role of the College of American Pathologists Survey Program. Am J Clin Pathol 1970;54:435-6.
6. Annino JS. What does laboratory “quality control” really control? N Engl J Med 1978;299:1130.
7. Elevitch FR, Noce PS, eds. Data recap 1970-1980. Skokie, IL: College of American Pathologists, 1981.

8. CAP Survey Manual. Skokie, IL: Coll of Am Pathologists interlaboratory comparison program, 1986:9-10,29.
9. U.S. Dept. of Health, Education, and Welfare, Public Health Service, Clinical Laboratories Improvement Act of 1967. Fed Reg 1968;33, No. 253 (F.R. Doc. 68-15586).
10. Standards Committee, Coll of Am Pathologists. Guidelines for evaluating laboratory performance in survey and proficiency testing programs. Am J Clin Pathol 1968;49:457-8.
11. Ehrmeyer SS, Laessig RH. An analysis of the use of the 12s rule to detect substandard performance in proficiency testing. Clin Chem 1987;33:788-91.
12. Henry RJ, Segalove M. The running of standards in clinical chemistry and the use of the control chart. J Clin Pathol 1952;5:305-11.
13. Westgard JO, Groth T, Aronsson T, et al. Performance characteristics of rules for internal quality control: probabilities for false rejection and error detection. Clin Chem 1977;23:1857-67.


14. Davies DL, Goldsmith PL. Statistical methods in research and production, 4th ed. New York: Hafner Pub. Co., 1972:342-3.
15. Duncan AJ. Quality control and industrial statistics, 4th ed. Homewood, IL: Richard D. Irwin, Inc., 1974:375-92.
16. Jardine AKS, MacFarlane JD, Greensted CS. Statistical methods for quality control. Bath, UK: Pittman Press, 1975:133.
17. Nelson LS. The Shewhart control chart - tests for special causes. J Qual Technol 1984;16:237-9.
18. Nelson LS. Interpreting Shewhart X control charts. J Qual Technol 1985;17:114-6.
19. Ehrmeyer SS, Laessig RH. An assessment of the use of fixed limits to characterize intralaboratory performance by proficiency testing. Clin Chem 1987;33:1901-2.
20. Ehrmeyer SS, Laessig RH, Garber CG. Monthly interlaboratory pH and blood gas survey: establishing accuracy based on interlaboratory performance. Am J Clin Pathol 1984;81:224-9.
21. Ehrmeyer SS, Laessig RH. Alternative statistical approach to evaluating interlaboratory performance. Clin Chem 1985;31:106-8.
22. Ehrmeyer SS, Laessig RH. Adequacy of interlaboratory precision criteria to measure intralaboratory performance. Clin Chem 1985;31:1352-4.
23. Ehrmeyer SS, Laessig RH. Interlaboratory proficiency testing programs: a computer model to assess their capability to correctly characterize intralaboratory performance. Clin Chem 1987;33:784-7.
24. Ehrmeyer SS, Laessig RH. The effect of intralaboratory bias and imprecision on laboratories’ ability to meet medical usefulness goals. Am J Clin Pathol 1988;89 (in press).
25. Galen RS, Gambino SR. The predictive value and efficiency of medical diagnosis. New York: John Wiley & Sons, 1975.
26. Hartmann AE. Target values and evaluation limits. In: CAP today. Skokie, IL: Coll of Am Pathologists, April 1987:11.
27. Westgard JO, Barry PL, Hunt MR, et al. A multi-rule Shewhart chart for quality control in clinical chemistry. Clin Chem 1981;27:493-501.