CLIN. CHEM. 34/2, 250-256 (1988)

Use of Alternative Rules (Other than the 12s) for Evaluating Interlaboratory Performance Data

Sharon S. Ehrmeyer,1 Ronald H. Laessig,2 and Kathy Schell1
Previous studies have documented the ineffectiveness of using either the group mean ± 2 group standard deviations (SD) or the 12s rule as the standard of acceptable performance in evaluating interlaboratory proficiency testing (PT) data. Using computer simulation of PT data, we evaluated the efficiency of 244 alternatives to the 12s rule, all based on the PT population's mean and SD. Using the traditional PT format, we determined the ability of each rule to correctly identify both good and deficient intralaboratory performance. The rules are based on results from one to five PT samples "analyzed" at the same time. Because the effectiveness of the criteria set for acceptable performance in a PT program is influenced by the population SD, each rule's interlaboratory capabilities were examined for PT populations with interlaboratory SDs ranging from 1% through 10% of the population mean value. All rules achieve their maximum efficiency over a narrow range of interlaboratory SDs. For PT evaluations of intralaboratory performance to be optimally effective, selection of the rule must be based on the SD of the PT population.

Interlaboratory proficiency testing (PT) has become a routine and often mandated component of all competent laboratories' programs of quality assurance (1). PT serves multiple purposes by providing: (a) information to participants about their own performance, (b) insight into the performance of other laboratories with similar methods and instruments (i.e., defining the "state of the art"), (c) identification of tests or instruments where remedial measures may be needed, and (d) information on individual laboratory performance that third-party agencies use for accreditation and (or) licensure (2-5). The fundamental premise of regulatory or professionally mandated PT programs is: "If a laboratory performs acceptably in the program, it is assumed that it also analyzes patients' samples correctly." However, the basic question remains: "Does meeting interlaboratory performance criteria in PT programs portend acceptable routine laboratory performance?" At the most basic level, it is reasonable to assume that PT samples receive preferential treatment when accreditation or licensure is at issue. Most regulators and professional organizations accept performance, as measured by analysis of PT specimens, as indicative of the "best" that the laboratory can do, not necessarily routine performance (6). The more fundamental question, however, must address the validity of using PT to certify performance. This application assumes that the criteria (rules) used for evaluation can differentiate between "good" and "deficient" intralaboratory performance with a degree of reliability sufficient for the intended purpose.
1School of Allied Health Professions, Medical Technology Program, Center for Health Sciences, University of Wisconsin-Madison, Rm. 6169, 1300 University Avenue, Madison, WI 53706.
2Departments of Pathology and Laboratory Medicine and Preventive Medicine, State Laboratory of Hygiene, University of Wisconsin-Madison.
Received June 1, 1987; accepted October 20, 1987.
CLINICAL CHEMISTRY, Vol. 34, No. 2, 1988
Since the 1960s, PT programs such as those of the College of American Pathologists (CAP) and others that meet the requirements of the Clinical Laboratory Improvement Act (CLIA) (7-9) have routinely used an evaluation criterion based on the group mean ± 2 group SDs (the 12s rule) to "pass" or "fail" participant performance. The 12s rule in PT was designed to incorporate state-of-the-art performance into the interlaboratory evaluation process (10). A previous study quantified the effectiveness of the 12s rule in evaluating the performance of a laboratory analyzing one sample in a chemistry PT program (11). Other techniques to evaluate intralaboratory performance have been identified (12-18). We recognize a movement among PT providers to use "fixed limits," and we have studied the implications of this approach in a separate paper (19). Preliminary studies (20-22) indicated that (a) the choice of the criterion selected for evaluating intralaboratory performance strongly influences a PT program's ability to distinguish good and deficient intralaboratory performance and (b) currently used PT criteria do not consistently reflect actual intralaboratory performance. At a minimum, PT criteria should warn the laboratory (as well as the regulatory agencies) when intralaboratory performance is not meeting patients' needs. Likewise, no useful purpose is served when good laboratories are failed or when deficient laboratories are passed. For this study, we devised 244 alternative rules and evaluated their "effectiveness," i.e., their ability to correctly characterize intralaboratory performance. By making these rules available for use in PT, the profession will be able to accomplish the fundamental (regulatory) objective of correctly characterizing the intralaboratory performance of all participants.
Materials and Methods

Our previously described computer model (23) quantified the effect of intralaboratory performance characteristics, i.e., specified levels of internal imprecision (CV) and inaccuracy (bias), on the ability of conventional PT criteria to characterize intralaboratory performance correctly. This study is an attempt to develop alternative criteria (rules) encompassing state-of-the-art interlaboratory performance (i.e., group mean, SD) that more effectively reflect intralaboratory performance. All of the rules are based on a unit of convenience analogous to the Standard Deviation Interval (SDI) used (8) by the CAP PT programs (see below).

To evaluate the proposed rules, the model simulated a PT program consisting of 401 participating laboratories. Each laboratory in the PT program "analyzed" from one to five samples per "shipment." PT programs frequently use multiple samples, but to date no studies have evaluated performance based on rules that consider the combined performance of multiple samples.

In the computer model of the PT program, the result obtained by the test laboratory, XT, is a function of a unique, assigned intralaboratory bias and CV combination that defines actual intralaboratory performance. XT is generated, based on the true value and the test laboratory's CV and bias, by use of a gaussian random-number generator. The results for the other 400 participating laboratories in the PT population are generated similarly and are based on their individual bias and CV values. The population mean and interlaboratory SD are calculated. The model emulates data obtained from "sending" the group of 401 laboratories identical sets of samples for 400 surveys in a PT program. See reference 23 for a detailed description of its function. Each test laboratory result is evaluated by comparing it with the group (interlaboratory) mean, Xg, and the group SD, SDg, by the formula:

SDI = (XT - Xg)/SDg
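The simulation and the SDI computation described above can be sketched as follows. This is a minimal illustration, not the authors' original model: the gaussian generation of results from an assigned bias and CV and the (XT - Xg)/SDg formula follow the text, but the function and variable names are our own, and we include the test laboratory in the group statistics as an assumption.

```python
import random

def simulate_sdi(true_value, peer_labs, test_bias, test_cv, seed=None):
    """Simulate one PT sample and return the test laboratory's SDI.

    peer_labs: list of (bias_pct, cv_pct) pairs for the other participants.
    test_bias, test_cv: the test laboratory's inaccuracy and imprecision,
    both expressed as a percentage of the true value.
    """
    rng = random.Random(seed)

    def one_result(bias_pct, cv_pct):
        # Systematic error shifts the mean; random error sets the SD.
        mean = true_value * (1 + bias_pct / 100)
        sd = true_value * cv_pct / 100
        return rng.gauss(mean, sd)

    x_t = one_result(test_bias, test_cv)               # test laboratory, XT
    results = [one_result(b, cv) for b, cv in peer_labs] + [x_t]

    n = len(results)
    x_g = sum(results) / n                             # group mean, Xg
    sd_g = (sum((x - x_g) ** 2 for x in results) / (n - 1)) ** 0.5

    return (x_t - x_g) / sd_g                          # SDI = (XT - Xg)/SDg

# Example: 400 unbiased peers with 2% CV; the test laboratory biased +5%.
peers = [(0.0, 2.0)] * 400
sdi = simulate_sdi(100.0, peers, test_bias=5.0, test_cv=2.0, seed=1)
```

A test laboratory whose bias is large relative to the peer group's scatter will, on average, produce SDIs well above zero, which is exactly what the rules below are designed to detect.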
The test laboratory will pass PT or fail PT on the basis of how the chosen rule interprets the SDIs obtained from one to five samples in the "shipment." In conventional PT programs SDI ≤ 2.0 is used as a criterion of acceptability, the interlaboratory analog of the "Westgard 12s rule" (13).

To investigate the possibility of finding more-effective PT rules, we used an empirical approach for evaluating all possible criteria that could be devised from a PT program in which one to five samples (challenges) are used per shipment. Our approach deliberately used some rules so stringent and some so broad as to be trivial; however, this process ensures that the optimal rule is encountered. For example, if a theoretical PT program sends only one sample per shipment, the proposed rules would set the pass-fail criteria at N(SDI), where N varied from 0.5 to 5.0. While N = 0.5 is obviously overly stringent and N = 5.0 overly broad, the optimum value of N is clearly within these limits. The 12s rule is included in this group. In all, 245 nonredundant rules were devised and evaluated. All use the SDI in some manner to compute limits of acceptable performance. The rules can be divided into four classes.

1. Single-sample rules (33 rules): This group includes conventional rules, such as the 12s rule. However, one out of two results exceeding a specified SDI, one out of three, etc. represent an extension of this approach. When the result from only one sample is evaluated, it may come from a set of from one to five samples, which may be identical or different. If the laboratory's SDI on any one sample in the set exceeds (N)(SDg), where N is a constant that varies from 1.0 to 5.0 for different rules, it is declared to have failed to meet the PT criterion. For example, in the Tables and Figures that follow, the "1 of 5, SDI >2" rule is defined as one sample result out of the set of five samples falling outside the SDI >2 limit.

2. Absolute rules (37 rules): These rules are based on adding the results from multiple samples. Intuitively, the use of multiple measurements will increase reliability. The test laboratory's individual SDIs on shipments of two, three, four, or five samples are added without regard to sign. If the sum exceeds the value (N)(SDg), where N ranges from one to 10, the test laboratory's results fail to meet the criterion. For example, a laboratory fails the "5 abs SDI >6" rule when the sum of the absolute (abs) values of the SDIs from five samples exceeds (N)(SDg), which is 6.0. Theoretically, the absolute rules will not differentiate between random and systematic error.

3. Algebraic rules (37 rules): These rules also add results from multiple samples to increase their discriminatory power. The rules are similar to the absolute rules, except that the SDIs are added with regard to sign. If the sum on two, three, four, or five samples exceeds the (N)(SDg) limit, where N ranges from one to 10, the laboratory fails to meet the criterion. The "4 alg SDI >3" rule simulates sending four PT samples and determining if the algebraic (alg) sum of the four SDIs exceeds the limit of 3.0. Note that since the value of the argument, i.e., 3.0, can be either positive or negative, the algebraic rules really represent 74 rules; however, 37 of these are redundant. The algebraic rules should detect systematic error but be insensitive to random error, especially as the number of samples increases.

4. Pattern rules (138 rules): Theoretically, pattern rules should differentiate between systematic and random error. The Westgard 22s rule is the simplest example (13). Clearly, in an interlaboratory quality-control program, successive results >2.0 SDg from the mean and on the same side (both positive or negative) or on opposite sides (one positive and one negative) of the mean have different implications. The rule recognizes both the magnitude and the distribution of the laboratory's error. The pattern rules can be based on three, four, or five samples. These rules require that two or more of the results of the individual test laboratory exceed a specified (N)(SDg) limit, and do so in a specific pattern. For example, if two samples were sent in the PT program, one version of the "2 of 2, SDI >3" rule would require that the SDI on both of the test laboratory results exceed the (3.0)(SDg) limit and that they lie on the same side of the group mean. A second version of the "2 of 2, SDI >3" rule requires that both exceed the limit and fall on opposite sides of the mean. If "+" and "-" are used to designate outliers above and below the mean, these two rules can be conceptualized as: version one, "+ +" and "- -", and version two, "+ -" and "- +", respectively. The pattern rules for three, four, and five samples present the opportunity for a geometric progression of combinations. The number of cases can be constrained by eliminating equivalent or redundant patterns. We do this by defining rules as follows: two of three, two of four, and two of five outliers falling on the same side or opposite sides of the mean, as above. Likewise, for the three of three, three of four, three of five, four of four, four of five, and five of five rules, outliers can fall on the same or opposite sides of the mean. For example, with the rule in which three out of five results fall outside the stated limit on "opposite sides" of the mean, "+ + -", "+ - -", etc. are all judged to be equivalent, whereas "+ + +" and "- - -" are defined as the rule in which three out of five fall on the "same side" of the mean. An example of a rule following this pattern is "3 of 5, SDI >2 s"; three out of a set of five results exceed the SDI >2 limit and fall on the same side of the mean. The combinations include 10 nonredundant same-side rules, with SDIs ranging from 1.0 to 4.0, for a total of 69 combinations; 69 opposite-side rules make up the remainder.
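The four rule classes above can be written as simple predicates over the set of SDIs from one shipment. This is a sketch with our own function names; each returns True when the laboratory fails the criterion, and the "opposite sides" pattern variant is interpreted, per our reading of the text, as requiring outliers on both sides of the mean.

```python
def fails_single(sdis, n_limit, k=1):
    """Single-sample rule: fail if at least k SDIs exceed n_limit in
    magnitude (e.g. k=1, n_limit=2 is the '1 of 5, SDI >2' rule)."""
    return sum(abs(s) > n_limit for s in sdis) >= k

def fails_absolute(sdis, n_limit):
    """Absolute rule: fail if the sum of |SDI| over the shipment exceeds
    n_limit (e.g. '5 abs SDI >6'). Blind to the sign of the errors."""
    return sum(abs(s) for s in sdis) > n_limit

def fails_algebraic(sdis, n_limit):
    """Algebraic rule: fail if the signed sum of SDIs crosses the limit
    in either direction (the positive and negative versions of the rule
    are combined here by taking the absolute value of the sum)."""
    return abs(sum(sdis)) > n_limit

def fails_pattern(sdis, n_limit, k, same_side=True):
    """Pattern rule: fail if at least k SDIs exceed n_limit in magnitude
    and the outliers fall on the same side of the group mean (or, when
    same_side is False, on opposite sides)."""
    outliers = [s for s in sdis if abs(s) > n_limit]
    if len(outliers) < k:
        return False
    pos = sum(s > 0 for s in outliers)
    neg = len(outliers) - pos
    if same_side:
        return pos >= k or neg >= k
    return pos > 0 and neg > 0
```

For example, `fails_pattern([2.5, 2.6, 0.1], 2, 2)` fails a laboratory under a "2 of 3, SDI >2 s" rule, while the same magnitudes with mixed signs would only fail the opposite-side variant.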
Rule Effectiveness

Table 1 illustrates the computer model's output for a single-sample rule where the population SD equals 6.0. The y-axis lists intralaboratory CVs in the test laboratory; the x-axis lists the intralaboratory biases. The tabular values indicate the percentage of times a laboratory with a particular bias and CV combination will fail a PT program when the 12s rule is used for evaluation. Obviously, a laboratory with intralaboratory CVs of 8%, 9%, and 10% (upper left) would be so imprecise that it could not produce reliable data, because of random error. Likewise, a laboratory with excessive bias (lower right) would also tend to produce consistently wrong, but precise, answers. Small biases and CVs (lower left) are consistent with good intralaboratory performance, large biases and (or) CVs with deficient performance.

Table 1. Percentage of Test Laboratory Results Failing a PT Program Using the 12s Rule When the Group Mean = 100 and the Group SD = 6.0

Test laboratory's intralaboratory CV, % of mean (rows) vs. intralaboratory bias, % of mean (columns):

 CV |   0   1   2   3   4   5   6   7   8   9  10
 10 |  21  21  23  24  25  28  34  36  38  40  43
  9 |  18  17  19  21  23  24  28  32  35  39  41
  8 |  13  14  15  16  19  21  22  29  32  36  40
  7 |   9  10  11  12  13  16  20  23  30  34  38
  6 |   5   6   6   8  10  12  15  20  26  32  36
  5 |   2   2   3   4   6   9  12  15  20  27  34
  4 |   0   0   1   2   3   5   7  11  15  21  30
  3 |   0   0   0   0   1   2   3   6   9  15  24
  2 |   0   0   0   0   0   0   0   1   3   7  14
  1 |   0   0   0   0   0   0   0   0   1   0   3

A rule's effectiveness must be evaluated against two criteria: (a) its ability to characterize laboratories whose performance is "good," by declaring that they pass PT, and (b) its ability to declare that those laboratories whose performance is "deficient" fail PT. For the purposes of this study, good intralaboratory performance is defined as the test laboratory having CV-bias combinations located in the lower left-hand sector (I) of Table 1. These combinations of CV and bias represent a laboratory with a routine error of ±10% or less (24). For the ensuing discussion, we will assume that a "good" laboratory, one producing medically useful data, controls its systematic and random error to the degree that 95% of its results will have a total relative error of ±10% or less. A "deficient" laboratory is one that has any of the 22 CV-bias combinations in the middle sector (II) of Table 1. It will produce results that exceed the 10% error limit more than 5% of the time (24). Laboratories with these CV-bias combinations are assumed to be performing at a level of competence below that required to yield medically useful results. Ideally, when a PT program applies a particular rule, it should declare that the performance of laboratories with the combinations of CV and bias in sector I passes. Likewise, it should declare that those laboratories in sector II fail. We did not further evaluate the rules' ability to detect large intralaboratory errors; i.e., we did not include those CV-bias combinations above sector II. Because these laboratories are making errors so gross, including them would make even an incompetent rule appear to be effective. Sectors I and II represent a rigorous test of any rule's ability to identify good and deficient intralaboratory performance.
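The sector assignment above reduces to a simple check on a laboratory's bias and CV. The following sketch uses the common gaussian approximation that 95% of results fall within |bias| + 1.96 CV of the true value; this formulation is ours, offered only as an illustration, since the paper's sector boundaries are taken from reference 24.

```python
def total_error_pct(bias_pct, cv_pct, z=1.96):
    """Relative error covering 95% of results, as a percent of the true
    value: systematic error plus z standard deviations of random error."""
    return abs(bias_pct) + z * cv_pct

def performance(bias_pct, cv_pct, limit_pct=10.0):
    """Label a CV-bias combination 'good' if 95% of results stay within
    ±limit_pct of the true value, else 'deficient'. (The combinations
    above sector II, excluded in the study, are not distinguished.)"""
    if total_error_pct(bias_pct, cv_pct) <= limit_pct:
        return "good"
    return "deficient"
```

Under this approximation a laboratory with 2% bias and 3% CV is "good" (total error 7.9%), while 6% bias with the same CV is "deficient" (11.9%).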
One can characterize the effectiveness of any given rule in an interlaboratory PT program in a manner analogous to that suggested by Galen and Gambino (25) for evaluating the ability of a laboratory test to diagnose disease. In sector I, good laboratories passed by PT are the true negatives; those that fail are the false positives. To carry the analogy one step further: in sector II, the deficient laboratories correctly identified (i.e., those that fail PT) are the true positives, whereas those passed are the false negatives. Using the Galen/Gambino scheme, the net effectiveness of any rule can be characterized by its efficiency, which is its ability to identify both good and deficient laboratories correctly.

The values in Table 1 are the percentage of times a test laboratory with particular CV-bias combinations fails the PT program. Ideally, no laboratories in sector I should fail PT. Values greater than zero represent the percentage of false positives for that particular intralaboratory CV-bias combination. In sector II, ideally all laboratories with any of these CV-bias combinations should fail. The difference between the Table's values and 100 represents the percentage of false negatives. Rather than evaluate each CV-bias combination, the model calculated the median value for the 24 and 22 cells in sectors I and II, respectively, and used these median values to compute efficiency:

Efficiency = 100 x (true positives + true negatives)/total number of results

The ideal value for efficiency is 100%; it was calculated for 245 rules at 10 PT population SDs ranging from 1% to 10% of the mean value. Because efficiency is a function of the prevalence of the number of true positives and true negatives in a PT population, efficiencies were calculated for prevalence rates for deficient laboratories of 10%, 5%, and 1%.
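The efficiency computation can be sketched as follows. The function name is ours, and the two failure probabilities stand in for the sector I and sector II medians described above.

```python
def efficiency_pct(p_fail_good, p_fail_deficient, prevalence_deficient):
    """Galen/Gambino efficiency of a PT rule, in percent.

    p_fail_good:          fraction of 'good' laboratories the rule fails
                          (sector I median) -> false-positive rate.
    p_fail_deficient:     fraction of 'deficient' laboratories it fails
                          (sector II median) -> true-positive rate.
    prevalence_deficient: fraction of deficient laboratories in the
                          PT population.
    """
    true_negatives = (1 - prevalence_deficient) * (1 - p_fail_good)
    true_positives = prevalence_deficient * p_fail_deficient
    return 100 * (true_positives + true_negatives)

# A rule that passes every laboratory has a limiting efficiency equal to
# the prevalence of good laboratories: 90% when 10% of labs are deficient.
assert abs(efficiency_pct(0.0, 0.0, 0.10) - 90.0) < 1e-9
```

This makes explicit why efficiency depends on prevalence: at a 1% prevalence of deficient laboratories, even a rule that detects none of them still scores 99%.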
Results

Proficiency Testing Program Philosophy

In selecting a rule to evaluate laboratories' performance, a PT program must choose between conflicting goals. Is it more important to identify deficient laboratories (analogous to setting very stringent limits), or is it more important to assure that good laboratories are not misidentified? As the computer model demonstrates, meeting both goals is not a realistic expectation. A nondiscriminating rule, e.g., one that declares that 100% of the laboratories in both sectors I and II pass, will have a limiting efficiency, dependent on the prevalence of good laboratories. For this study, we set the prevalence rates at 90%, 95%, and 99%, resulting in limiting values of efficiency of 90%, 95%, and 99%, respectively.

To illustrate this dilemma, Figure 1 displays the effectiveness of the familiar 12s rule when it is used to identify both good and deficient laboratories, where "good" is defined as obtaining results within ±10% of the true value 95% of the time. The rule's effectiveness is strongly affected by the group SD of the PT survey population. The upper curved line, depicting the ability to identify correctly or "pass" good laboratories in sector I, indicates that at a population SD of 2, the 12s rule will identify only 70% of the good laboratories and fail 30%. This curve rises very steeply, so that at a population SD of 4, the 12s rule is capable of identifying 97% of the good laboratories. However, at the same time the rule's ability to identify deficient laboratories correctly falls off precipitously with increases in the group SD. At a population SD of 4, only 20% of the deficient laboratories are failed. When the population SD is small, good laboratories, i.e., those well within the limits of medical usefulness, are penalized while performing quite adequately. When the SD is large, no laboratory, good or bad, is declared to fail the PT program. In all probability this poor performance characteristic of the 12s rule with respect to good laboratories is what has led various PT programs to consider adopting alternatives such as "fixed limits" to evaluate performance, especially as instrumentation has become better calibrated and more precise, resulting in smaller population SDs.

It is not practicable to list or graph the efficiencies of all 245 rules at 10 population SDs at three prevalences. To facilitate the ensuing discussion, we selected the best-performing rules for further study. In Table 2, up to 10 rules with efficiencies >90% at a 10% prevalence of deficient laboratories were selected at each population SD. The first five entries in each section list optimal rules for a PT program whose philosophy emphasizes accurately identifying the largest number of laboratories, whether good or deficient. These are the rules with the highest efficiency. At low prevalence rates, this means identifying the largest possible percentage of good laboratories correctly. The second group of entries in each section represents rules selected for an optimum combination of high efficiency and the ability to detect a high percentage of deficient laboratories correctly. For comparison purposes, similar data for the 12s (1 of 1, SDI >2) rule have been included in parentheses at each population SD.

Fig. 1. The effectiveness of the 12s rule in correctly identifying good and deficient laboratories
[Figure: "% of good labs identified as passing" and "% of bad labs identified as failing" plotted against SD, percent of population mean.]

Figure 2 illustrates the efficiency of a "best" rule selected from Table 2 for each population SD. The maximum for each curve represents the rules' ability to identify the largest percentage of both good and deficient laboratories correctly. It also represents the point at which the most deficient laboratories are identified. The percentages of deficient laboratories identified for each rule are the numbers superimposed on the x-axis. The nature of the plots reflects the tendency of each rule to exhibit optimum performance (detecting good and deficient) over a narrow range of population SDs and to lose its ability to identify deficient laboratories as the population SD increases.

Discussion

A PT program that adopts the philosophy that it is more important to identify good laboratories correctly would simply use the rules with the highest efficiency. These are the first five rules reported in each group in Table 2. The maximum efficiency over a short range of population SDs reported for any rule is 94%, 97%, and 99.4% for prevalences of 10%, 5%, and 1%. Although rules with the highest efficiencies identify the greatest percentage of the good laboratories, they also fail to identify a significant percentage of deficient laboratories. The second group of rules was included in Table 2 to represent the best compromise between identifying a high percentage of both good and deficient laboratories.

At least five general observations can be made:

1. The greatest difficulty lies in selecting rules for very small population SDs, where even the "best" rules detect only a relatively small percentage of deficient laboratories. This corresponds to the problems caused by shrinking interlaboratory SDs observed currently in PT programs.

2. The choice of the "best" rules changes as a function of population SD. For small population SDs, the pattern rules are the most efficient. At intermediate population SDs, the single-sample rules are better. As the population SD increases, absolute and algebraic sum rules are the best, particularly the algebraic summing rules.

3. Those rules that involve multiple samples are generally the most efficient. Because laboratory error is usually a combination of CV and bias, these rules have the best opportunity to detect imprecision.

4. The magnitude of the combination (N)(SDI) that yields the best rule depends on the population SD. At population SDs of 1%, 2%, and 3%, single-sample and pattern rules based on large multiples of the SDI, such as "1 of 4, SDI >4" and "3 of 5, SDI >4", are more effective. At large population SDs, rules based on smaller multiples are better. Because the definition of good performance (±10% total error) is invariant, the rule needs the tighter (smaller) values for (N)(SDI) at large population SDs to achieve discriminating power.

5. The 12s rule is never a good choice. Although it does reach a maximum efficiency of 91% at a population SD of 5 (Table 2), many other rules are more capable of characterizing laboratory performance. In addition, at smaller population SDs, it misidentifies good laboratories and fails them in PT programs.
Table 2. Best Rules to Evaluate Intralaboratory Performance at Three Prevalences of Deficient Laboratories

[Table: for each population SD from 1 through 10 (percent of the mean), up to 10 rules are listed with their effectiveness (% of sector I laboratories passed; % of sector II laboratories failed) and their efficiency at deficient-laboratory prevalences of 10%, 5%, and 1%. The first five entries at each SD are the highest-efficiency rules; the second group are the best compromises for also detecting deficient laboratories. Results for the 12s (1 of 1, SDI >2) rule are included in parentheses at each population SD.]

Sector I refers to the lower-left-hand sector of Table 1; sector II, to the middle sector of Table 1. s = outliers fall on the same side of the mean; d = outliers fall on different sides of the mean.
Fig. 2. Best rules for given PT population SDs at a 10% prevalence of deficient laboratories
[Figure: efficiency, %, plotted against population SD, percent of population mean; the percentage of deficient laboratories detected by each rule is superimposed on the x-axis.]
Intuitively, regulatory PT programs have as their goal the identification of deficient laboratories. This means tolerating misidentification of some good laboratories. The rules that evaluate multiple samples are generally best suited to this purpose; the two forms of "sum" rules become more effective at larger population SDs, whereas at smaller population SDs other types of rules would be a better choice. Obviously, any of these rules can detect a good percentage of deficient laboratories only for a specific and narrow range of population SDs. The prevalence of good laboratories determines the maximum efficiency that any rule can achieve. As Table 2 shows, the choice of rule is, generally, prevalence-independent; the best rule at 10% is also the best rule at 1% prevalence.
Conclusions

The study of all possible variations of evaluation criteria yields several possible "best" choices for each population SD. The list of best rules in Table 2 indicates that the "sum" rules, which benefit from the ability to use multiple data points, appear frequently, but not to the exclusion of the single-sample rules. The pattern rules are sparsely represented and useful only at small population SDs.

The adequacy (or inadequacy) of the criteria now used to evaluate laboratory performance in PT programs has been the topic of multiple papers (11, 19, 21, 22). We believe that our current efforts shed some light on why any criterion seems to exhibit inadequate performance in some instances and not in others. The choice of the optimal rule depends on the program's goal of identifying correctly either good or deficient laboratories. The ability to predict the rule that, for a specific population SD, has the optimum efficiency has profound implications for PT programs. A PT program could "select" the best rule to evaluate performance based on the PT population or homogeneous subpopulation group SD. The 12s rule would not be one of those chosen, because other rules exhibit better performance characteristics. When PT is used in a regulatory setting, common sense demands that the criterion chosen to evaluate laboratory performance detect deficient laboratories. Given a knowledge of the best rules, immediate improvements can be made in PT programs' ability to correctly evaluate laboratories' performance.

This study at least provides some information on the effect of the CAP Chemistry Resource Committee's announced intent (26) to abandon the 12s rule and move toward a "fixed limit" approach. However, on the basis of these data, we believe that PT evaluations may be further improved by use of an analog of the Westgard multi-rule approach (27), and we have taken this under study.

References

1. Wilcox KR, Baynes TB, Crable JV. Laboratory management. In: Inhorn SL, ed. Quality assurance practices for health laboratories. Washington, DC: Am Public Health Assoc, 1977:3-426.
2. Dorsey DB. The evolution of proficiency testing in the USA. In: Proc 2nd Natl Conf on Proficiency Testing. Bethesda, MD: Information Services, 1975:8-9.
3. Forney JE, Blumberg JM, Brooke MM, et al. Laboratory evaluation and certification. Op. cit. (ref. 1):127-71.
4. Gilbert RK, Rosenbaum JM. Accuracy in interlaboratory quality control programs. Am J Clin Pathol 1979;72:260-4.
5. Eilers RJ. Total quality control for the medical laboratory: the role of the College of American Pathologists Survey Program. Am J Clin Pathol 1970;54:435-6.
6. Annino JS. What does laboratory "quality control" really control? N Engl J Med 1978;299:1130.
7. Elevitch FR, Noce PS, eds. Data recap 1970-1980. Skokie, IL: College of American Pathologists, 1981.
8. CAP Survey Manual. Skokie, IL: Coll of Am Pathologists interlaboratory comparison program, 1986:9-10,29.
9. U.S. Dept. of Health, Education, and Welfare, Public Health Service. Clinical Laboratories Improvement Act of 1967. Fed Reg 1968;33, No. 253 (F.R. Doc. 68-15586).
10. Standards Committee, Coll of Am Pathologists. Guidelines for evaluating laboratory performance in survey and proficiency testing programs. Am J Clin Pathol 1968;49:457-8.
11. Ehrmeyer SS, Laessig RH. An analysis of the use of the 12s rule to detect substandard performance in proficiency testing. Clin Chem 1987;33:788-91.
12. Henry RJ, Segalove M. The running of standards in clinical chemistry and the use of the control chart. J Clin Pathol 1952;5:305-11.
13. Westgard JO, Groth T, Aronsson T, et al. Performance characteristics of rules for internal quality control: probabilities for false rejection and error detection. Clin Chem 1977;23:1857-67.
14. Davies OL, Goldsmith PL. Statistical methods in research and production, 4th ed. New York: Hafner Pub. Co., 1972:342-3.
15. Duncan AJ. Quality control and industrial statistics, 4th ed. Homewood, IL: Richard D. Irwin, Inc., 1974:375-92.
16. Jardine AKS, MacFarlane JD, Greensted CS. Statistical methods for quality control. Bath, UK: Pitman Press, 1975:133.
17. Nelson LS. The Shewhart control chart: tests for special causes. J Qual Technol 1984;16:237-9.
18. Nelson LS. Interpreting Shewhart X-bar control charts. J Qual Technol 1985;17:114-6.
19. Ehrmeyer SS, Laessig RH. An assessment of the use of fixed limits to characterize intralaboratory performance by proficiency testing. Clin Chem 1987;33:1901-2.
20. Ehrmeyer SS, Laessig RH, Garber CG. Monthly interlaboratory pH and blood gas survey: establishing accuracy based on interlaboratory performance. Am J Clin Pathol 1984;81:224-9.
21. Ehrmeyer SS, Laessig RH. Alternative statistical approach to
evaluating interlaboratory performance. Clin Chem 1985;31:106-8.
22. Ehrmeyer SS, Laessig RH. Adequacy of interlaboratory precision criteria to measure intralaboratory performance. Clin Chem 1985;31:1352-4.
23. Ehrmeyer SS, Laessig RH. Interlaboratory proficiency testing programs: a computer model to assess their capability to correctly characterize intralaboratory performance. Clin Chem 1987;33:784-7.
24. Ehrmeyer SS, Laessig RH. The effect of intralaboratory bias and imprecision on laboratories' ability to meet medical usefulness goals. Am J Clin Pathol 1988;89 (in press).
25. Galen RS, Gambino SR. The predictive value and efficiency of medical diagnosis. New York: John Wiley & Sons, 1975.
26. Hartmann AE. Target values and evaluation limits. In: CAP Today. Skokie, IL: Coll of Am Pathologists, April 1987:11.
27. Westgard JO, Barry PL, Hunt MR, et al. A multi-rule Shewhart chart for quality control in clinical chemistry. Clin Chem 1981;27:493-501.