Journal of Biopharmaceutical Statistics, 17: 201–213, 2007 Copyright © Taylor & Francis Group, LLC ISSN: 1054-3406 print/1520-5711 online DOI: 10.1080/10543400601177343

ISSUES WITH STATISTICAL RISKS FOR TESTING METHODS IN NONINFERIORITY TRIAL WITHOUT A PLACEBO ARM H. M. James Hung Division of Biometrics I, Office of Biostatistics, OTS/CDER, FDA, Silver Spring, MD, USA

Sue-Jane Wang and Robert O'Neill Office of Biostatistics, OTS/CDER, FDA, Silver Spring, MD, USA

Noninferiority trials without a placebo arm often require an indirect statistical inference for assessing the effect of a test treatment relative to the placebo effect or relative to the effect of the selected active control treatment. The indirect inference involves the direct comparison of the test treatment with the active control from the noninferiority trial and the assessment, via some type of meta-analysis, of the effect of the active control relative to a placebo from historical studies. The traditional within-noninferiority-trial Type I error rate cannot ascertain the statistical risks associated with the indirect inference, even though this error rate is the primary consideration under the frequentist statistical framework. Another kind of Type I error rate, known as the across-trial Type I error rate, needs to be considered in order that the statistical risks associated with the indirect inference can be controlled at a small level. Consideration of the two kinds of Type I error rates is also important for defining a noninferiority margin. For the indirect statistical inference, the practical utility of any method that controls only the across-trial Type I error rate at a fixed small level is limited.

Key Words: Across-trial or unconditional Type I error rate; Active-controlled; Fixed margin method; Noninferiority margin; Placebo-controlled; Within-trial or conditional Type I error rate; Synthesis method.

Received October 12, 2006; Accepted November 13, 2006

The views presented in this paper are not necessarily those of the U.S. Food and Drug Administration. Address correspondence to H. M. James Hung, Division of Biometrics I, OB/OTS/CDER, FDA, 10903 New Hampshire Ave, Bldg 22, Rm 4238, HFD-710, Mail Stop 4105, Silver Spring, MD 20993-0002, USA; E-mail: [email protected]

1. INTRODUCTION

The use of a noninferiority trial design that does not contain a placebo arm is controversial. The results from such a trial design can be difficult, if not impossible, to interpret with confidence. The essential problem with this design is that in the absence of a placebo arm, the ability to assert that a test treatment yields more benefits to the study patients than no application of the test treatment (i.e., placebo) depends entirely on the indirect inference coming from a combination of the direct comparison of the test treatment with the selected active control from the noninferiority trial and the assessment of the control's effect vs. the

placebo from the historical data. Such indirect inference is usually necessary to answer important questions, such as: how would the test treatment have fared against a placebo had the placebo been in the trial? How effective would the test treatment have been as compared to the active control? Both of these questions indeed refer to a placebo. There is abundant literature on active-controlled trials, noninferiority designs, and the related issues; see a selected list of references: Blackwelder (1982, 2002), Bucher et al. (1997), Chow and Shao (2006), CHMP (2006), CPMP (2000), D'Agostino et al. (2003), Department of Health and Human Services, Food and Drug Administration (1999), Dunnett and Tamhane (1997), Ebbutt and Frith (1998), Ellenberg and Temple (2000), Fleming (1987, 2000), Gould (1991), Hasselblad and Kong (2001), Hauschke (2005), Hauschke and Pigeot (2005), Holmgren (1999), Hung and Wang (2004), Hung et al. (2003, 2005), and the articles cited therein. By and large, the interpretability of a noninferiority trial without a placebo arm rests upon the reliability of the historical evidence selected, at least partially, subjectively for assessment of the control's effect; the trustworthiness of the meta-analysis; assay sensitivity in the historical trials and the noninferiority trial; the applicability of the constancy assumption that the control's effect in the historical trials does not change in the noninferiority trial; etc. In fact, as pointed out earlier (Hung et al., 2003), the difficulty begins with the frequently unclear trial objective, which dictates the formulation of the clinical/statistical hypothesis at issue and the associated noninferiority margin. Moreover, the appropriate scale (e.g., ratio of mortality risks or difference of mortality rates) on which the treatment effect is quantified is relevant to the acceptability and interpretability of the noninferiority margin. This issue is also controversial when defining a noninferiority margin.
Another controversial issue highlighted in the literature concerns the statistical risks associated with making such indirect statistical inference. This fundamental issue was raised and discussed earlier (Hung et al., 2003). In this work, we stipulate further the necessary considerations for the statistical risks of such an indirect statistical inference. Section 2 highlights the points to consider in formulating the trial objective of noninferiority. Two major classes of statistical methods in the frequentist statistical framework are discussed in Sections 3 and 4.

2. STATISTICAL FORMULATION OF NONINFERIORITY TRIAL OBJECTIVE

The statistical formulation of the objective of a noninferiority trial may seem obvious but, in fact, it is often unclear, particularly when the design arises in regulatory applications for the purpose of establishing the efficacy of an unproven agent (Hung et al., 2003, 2005; Tsong et al., 2003). In a number of recent public meetings, many clinical experts have challenged the statistical hypothesis formulation for demonstrating noninferiority objectives but offered no better alternative. The original notion of noninferiority in terms of effectiveness is that the effectiveness of the test medical product is not unacceptably less than that of the selected active control, and the acceptable degree of inferiority must be defined by a threshold known as a noninferiority margin. In terms of therapeutic effectiveness, as a minimum requirement, the noninferiority trial must be able to assert that the test treatment is efficacious in the sense that it would have been more effective than the placebo had a placebo been in the trial. In addition, for

certain clinical endpoints such as mortality, simply being superior to a placebo is often deemed not enough, and an additional objective of retaining at least a substantial portion of the effect of the active control is often required in regulatory applications, especially for active control treatments that have already demonstrated large effects. The intriguing relationships between these noninferiority objectives are extensively discussed elsewhere (Hung et al., 2003, 2005; Tsong et al., 2003). The notion of "not unacceptably inferior" generates the hypothesis of inferiority, such as H0: λT/λC ≥ δ, where λT/λC could be a ratio of mortality risks associated with the test treatment T and the active control C, and δ is the noninferiority margin beyond which the degree of inferiority is unacceptable. Retaining 100π% (0 ≤ π ≤ 1) of the active control's effect by the test treatment gives a further specification for the noninferiority margin δ. For instance, if the percent of effect to retain is strictly applied to a risk ratio scale, then the expression may be δ = π + (1 − π)(λP/λC), where λP/λC is the ratio of risks associated with the absent placebo and the active control. However, if the percent of effect to retain is applied to a log risk ratio scale, then the expression becomes δ = (λP/λC)^(1−π) = 1^π (λP/λC)^(1−π). In the former, the noninferiority margin is the weighted arithmetic mean of one (i.e., the control has null effect relative to the placebo) and the risk ratio λP/λC, while in the latter the margin is the weighted geometric mean. The two expressions can generate substantially different noninferiority margins when λC/λP is in the range of 0.20–0.80 (Hung et al., 2005). This is another point of controversy. Relative to the risk ratio scale, defining a treatment effect on the log scale seems arguably more sensible statistically and mathematically, for two reasons. First, suppose that the risk ratio λC/λP is 0.52, which amounts to a 48% reduction of risk by the control relative to the placebo. But, by inversion, λP/λC = 1.92, which amounts to a 92% increase in risk by the placebo relative to the control. The 48% decrease is not equal in magnitude to the 92% increase, and this asymmetry may cause difficulty in interpreting a treatment effect. Defining a treatment effect on the log risk ratio scale avoids the difficulty because log(λC/λP) = −log(λP/λC), so the effect of C vs. P differs from the effect of P vs. C only in sign. Second, the logarithmic transformation of ratio-type statistics is usually much better approximated by a Gaussian distribution than the ratio-type statistics themselves.

3. FIXED MARGIN METHOD

The various statistical formulations of the noninferiority hypothesis have given rise to two different classes of statistical methods of noninferiority analysis under the frequentist framework of hypothesis testing. A widely used method, often referred to as the fixed margin method, begins with setting a numerical value for the noninferiority margin δ in advance of conducting the noninferiority study, so that the noninferiority hypothesis to test is well defined prior to observing any results in the completed trial. Choice of δ is a difficult task in practice (Chow and Shao, 2006; Hauschke, 2005; Hung et al., 2003, 2005; Laster and Johnson, 2003; Laster et al., 2006; Ng, 2001; Tsong et al., 2003; Wang et al., 2003; Wiens, 2002). Depending on the set objective of the noninferiority trial, the margin is often chosen as a mixture of a statistical margin and a clinical margin. For the noninferiority trial without a placebo arm, the statistical margin is generally derived from empirical data, such as the historical trials that provide estimates of the effect of the selected positive

control. For example, for a percent retention statistical hypothesis, one conventional approach employs the worst limit of a 95% confidence interval of the historical estimate of the control's effect as an estimate of the control's effect in terms of λC/λP for the noninferiority trial, and then generates the statistical margin. Taking the smaller of the statistical margin and the clinical margin will then determine the noninferiority margin δ. Once the value of the margin is set prior to conduct of the noninferiority study, this margin is regarded as a fixed constant that defines the noninferiority statistical hypothesis. The next step is to calculate a 95% or higher confidence interval for the test product vs. the selected active control from the noninferiority trial. If this interval rules out the preset noninferiority margin δ, then noninferiority in terms of percent retention, so defined by the given margin, can be concluded. This method will be referred to as the 95NI–95H method, where the first 95 refers to the 95% confidence interval obtained from the noninferiority trial and the latter refers to employment of the 95% confidence interval from the historical trials for determining the margin δ. In principle, from regulatory considerations, when testing both superiority and noninferiority on only one study endpoint in the same active control trial, use of the 95% or higher confidence interval to rule out the margin in the noninferiority testing is essential, so that the same confidence interval is also used for testing superiority, which usually requires use of a 95% or higher confidence interval to rule out a null effect. Testing for both superiority and noninferiority can be controversial (CPMP, 2000; Dunnett and Tamhane, 1997; Hung and Wang, 2004; Morikawa and Yoshida, 1995; Wang et al., 1997).
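To make the mechanics concrete, here is a minimal sketch of a fixed margin check for a 100π% retention objective on the log risk ratio scale. The function name and inputs are hypothetical; the margin derivation (worst limit of the historical 95% confidence interval for the control's effect) follows the convention described above, as an illustration rather than the full regulatory procedure, which would also weigh a clinical margin.

```python
from math import exp, log, sqrt

def fixed_margin_95_95(hr_tc, se_tc, hr_cp_hist, se_cp_hist, pi=0.5):
    """Sketch of the 95NI-95H fixed margin method for 100*pi% retention on the
    log risk ratio scale. hr_tc, se_tc: NI-trial T-vs-C hazard ratio estimate
    and the SE of its log; hr_cp_hist, se_cp_hist: historical C-vs-P hazard
    ratio estimate and the SE of its log."""
    # Conservative (worst-limit) historical estimate of log(P/C): the lower
    # limit of the 95% CI for the control's effect.
    log_pc_worst = -log(hr_cp_hist) - 1.96 * se_cp_hist
    delta = exp((1 - pi) * log_pc_worst)        # fixed margin for T/C (log-scale retention)
    upper_tc = exp(log(hr_tc) + 1.96 * se_tc)   # upper 95% CI limit for T/C
    return upper_tc < delta, delta
```

Once δ is fixed this way before the trial, the within-trial Type I error rate of the comparison is controlled conditionally on the chosen margin.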
With the 95NI–95H method applied to one endpoint, the probability of falsely concluding noninferiority conditional on the selected margin and the probability of falsely concluding superiority are each no larger than a two-sided 5% level. However, it needs to be emphasized that the two-sided 5% level is the within-noninferiority-trial Type I error rate; that is, by repeating the noninferiority trial infinitely many times with the same given margin, the Type I error rate of falsely concluding noninferiority, so defined by the given margin, is about 5%. This is true conditional on the given margin. In truth, the true value of the margin δ is unknown, and there is no way to know whether the value chosen for the margin matches the unknown true value. Thus, the within-trial Type I error rate is not the only statistical consideration needed to quantify the statistical risks associated with the indirect inference in the noninferiority design, which also inherently makes another comparison to a placebo, because the value of δ to set is at best estimated from the historical data in the indirect inference. As described earlier (Hung et al., 2003), another statistical risk needs to be quantified that incorporates another kind of statistical error, namely that associated with the across-trial or unconditional Type I error. For the statistical margin, use of a less-than-95% confidence interval from historical data for estimating the control effect could be considered in some situations. How small this confidence level might need to be is a very difficult question in practice, because there are risks associated with not characterizing the active control effect well. One possible approach to balancing both types of statistical risks is examination of an "average" statistical risk of incorrectly concluding a specified percent retention, as well as that of falsely asserting that the test treatment is efficacious, namely, beating the absent placebo had it been in the noninferiority trial.
Since the noninferiority margin is derived using, at best, an estimate of the control effect from the historical trials, an “average” statistical risk

(referred to as the unconditional Type I error probability in Hung et al., 2003) can be calculated by averaging the within-trial Type I error rate of a false claim, conditional on the estimated margin, over the statistical distribution of the historical estimates of the control's effect. This unconditional Type I error rate is an across-trial Type I error rate pertaining to the statistical inference bridging the noninferiority trial for the test treatment vs. the selected active control and the historical trials for the control vs. a placebo. As an example, for the time to some adverse clinical event, let λT and λC represent the hazard rates of the events in the treatment and control groups, respectively. Let U be the upper limit of the 95% confidence interval for the hazard ratio λT/λC of the test treatment T to the selected control C. Let δ̃ denote the noninferiority margin associated with a specified level of percent retention, using the active control's effect estimated from historical trials. The unconditional or across-trial Type I error probability of falsely concluding noninferiority associated with the fixed margin method using the rejection region U < δ̃ is

P(U < δ̃ | H0: λT/λC ≥ δ) = ∫ P(U < δ̃ | H0: λT/λC ≥ δ, δ̃) dF(δ̃),

where F(δ̃) is the statistical distribution of δ̃. That is, this Type I error probability is an "average" Type I error probability of the rejection region over the distribution of the margin δ̃ estimated from the historical trials. It is worth emphasizing that δ is the true margin to rule out, the value of which cannot be obtained unless the entire patient population with the targeted diseases has been studied. In reality, the best we can have is the δ̃ derived as an estimate from historical trial data. That is, δ̃ is, at best, only an estimate for δ.
Assuming asymptotic normality applies, and if the constancy assumption holds, meaning that the δ from the historical studies applies to the noninferiority study, then for any 95NI–XH method the unconditional Type I error rate of falsely concluding retention of 100π% of the control's effect on the log scale is

P( log(λ̂T/λ̂C) + z0.025 σ̂TC < (1 − π)[log(λ̃P/λ̃C) − z(1−X/100)/2 σ̃CP] | log(λT/λC) = (1 − π) log(λP/λC) )
    = Φ( −[z0.025 + (1 − π) z(1−X/100)/2 √f] / √(1 + (1 − π)² f) ),

where ^ indicates "estimated from the noninferiority trial", ~ indicates "estimated from historical trials", σ denotes the standard error of the corresponding estimator, zα is the (1 − α)th percentile of the standard normal distribution with cumulative distribution function Φ, and f is the ratio of the total number of events in the noninferiority trial to the total number of events in the historical trials used. For other types of endpoints, f is the ratio of the variance of the estimated control's effect from historical trials to the variance of the estimated effect of the test treatment relative to the control treatment from the noninferiority trial. In fact, f is proportional to the ratio of the sample size of the noninferiority trial to the sample size of the historical studies. In the best scenario, when the constancy assumption holds, the unconditional or across-trial Type I error probability of falsely concluding percent retention is a function of f. As f approaches infinity, this error probability approaches (1 − X/100)/2. For example, this error probability is 10% for the 95NI–80H method and 2.5% for the 95NI–95H method in the situation that f is extremely large. The situation of an extremely large f is arguably unrealistic.
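The across-trial Type I error rate above is straightforward to evaluate numerically as a function of f. The sketch below assumes constancy and asymptotic normality, as in the displayed formula; the function name and the default of π = 0.5 are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

def across_trial_error(f, X=95, pi=0.5, alpha=0.025):
    """Across-trial Type I error rate of falsely concluding 100*pi% retention
    with the 95NI-XH fixed margin method, under constancy and normality.
    f = var(historical control-effect estimate) / var(NI-trial estimate)."""
    nd = NormalDist()
    z_ni = nd.inv_cdf(1 - alpha)             # 1.96: NI-trial critical value
    z_h = nd.inv_cdf(1 - (1 - X / 100) / 2)  # critical value of the X% historical CI
    num = z_ni + (1 - pi) * z_h * sqrt(f)
    den = sqrt(1 + (1 - pi) ** 2 * pow(sqrt(f), 2))
    return nd.cdf(-num / den)
```

As f grows without bound, the value tends to (1 − X/100)/2: about 10% for X = 80 and 2.5% for X = 95, matching the limits quoted above.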

Figure 1 Across-trial Type I error rate of falsely concluding a 50% retention for the 95NI–XH method [f = (number of events in noninferiority trial)/(number of events in historical trials)].

In most practical situations, there must be an upper limit on f. The across-trial Type I error rate of falsely concluding 50% retention on the log scale, as a function of f in the range of 0.5–10, is given in Fig. 1. This range probably covers almost all practical situations; the practical range of f could be much narrower. By similar arguments, the across-trial or unconditional Type I error probability associated with the rejection region U < δ̃ of falsely concluding that the test treatment is efficacious (i.e., the test product would have been more effective than the placebo had the placebo been in the noninferiority trial) can be calculated from

P(U < δ̃ | H0: λT/λP ≥ 1) = ∫ P(U < δ̃ | H0: λT/λP ≥ 1, δ̃) dF(δ̃).

Thus, under the constancy assumption, for any 95NI–XH method, the unconditional Type I error rate of falsely concluding that the test treatment beats an imputed placebo is

P( log(λ̂T/λ̂C) + z0.025 σ̂TC < (1 − π)[log(λ̃P/λ̃C) − z(1−X/100)/2 σ̃CP] | λT = λP )
    ≤ Φ( −[z0.025 + (1 − π)(h z0.025 + z(1−X/100)/2) √f] / √(1 + (1 − π)² f) ),

where h satisfies log(λ̃P/λ̃C) − h z0.025 σ̃CP ≥ 0. As f approaches infinity, this error probability of falsely concluding that the test treatment beats the absent placebo approaches a limit no larger than Φ(−(h z0.025 + z(1−X/100)/2)). In contrast,

Figure 2 Across-trial Type I error rate of falsely concluding that the test treatment beats placebo [f = (number of events in noninferiority trial)/(number of events in historical trials)].

as f approaches zero, this error probability can be as large as 1 − X/100. Since the placebo P is absent from the noninferiority trial, the calculation of this error probability is not trivial, and the level of confidence, expressed by h, at which the confidence interval can exclude the null control's effect in the historical evidence plays an important role in the calculation. For example, if the 99.99% confidence interval of the control's effect excludes the null value, λC/λP = 1, then the unconditional Type I error rate for a false efficacy assertion is small, on the order of 0.001, even for the 95NI–80H method that employs the worst limit of the 80% confidence interval of the historical estimate of the active control's effect (Fig. 2). If, however, in the historical trials the 95% confidence interval for the active control's effect barely excludes the null value, then the across-trial Type I error rate of a false efficacy assertion can quickly go up to a level > 0.025 for 95NI–XH methods with X < 95 (Fig. 3), and thus a 95% or higher confidence interval from the historical trials should be used to derive the fixed margin, considering that the assay sensitivity and the constancy assumptions are not verifiable and that the unavoidable differences between the noninferiority trial and the historical trials may be sufficiently large to undermine the interpretability of the results from such indirect statistical inferences. These average statistical risks defined above should also be evaluated assuming that the constancy assumption is violated to a plausible extent. Judging the extent to which the constancy assumption may be violated relies on subjective judgment about the worst possible scenario, which is an extremely difficult task. The fixed margin method requires that the noninferiority margin be fixed and specified in advance of conducting the noninferiority study in order to define the statistical hypothesis of noninferiority.
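The bound on the across-trial error of a false efficacy assertion can be sketched numerically in the same way. Here h is the quantity defined above (how strongly the historical confidence interval excludes a null control effect); the function name and default values are hypothetical, and the bound is taken as displayed above, under the constancy assumption.

```python
from math import sqrt
from statistics import NormalDist

def efficacy_error_bound(f, X=95, h=1.0, pi=0.5, alpha=0.025):
    """Upper bound on the across-trial Type I error rate of falsely concluding
    that the test treatment beats the absent placebo (95NI-XH method), under
    constancy. Larger h means stronger historical evidence against a null
    control effect, hence a smaller bound."""
    nd = NormalDist()
    z_ni = nd.inv_cdf(1 - alpha)
    z_h = nd.inv_cdf(1 - (1 - X / 100) / 2)
    num = z_ni + (1 - pi) * (h * z_ni + z_h) * sqrt(f)
    den = sqrt(1 + (1 - pi) ** 2 * f)
    return nd.cdf(-num / den)
```

The bound decreases in h for a given f, reflecting the role of the strength of the historical evidence noted above.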
If the noninferiority trial is properly conducted, the (conditional) Type I error rate of falsely concluding noninferiority is properly controlled, given the fixed noninferiority margin. If the margin is changed, particularly when the change is influenced by the noninferiority trial data, the fixed

Figure 3 Across-trial Type I error rate associated with falsely concluding that the test treatment beats an imputed placebo [f = (number of events in noninferiority trial)/(number of events in historical trials)].

margin method can be invalid in the sense that the within-trial Type I error rate is then improperly inflated (Hung and Wang, 2004; Wang et al., 1997).

4. SYNTHESIS METHOD

Another class of methods, often referred to as synthesis methods, relies on a direct combination of the estimates, and their variances, of the relative effect of the test treatment vs. the control and of the historical estimate of the control's effect relative to a placebo (Hasselblad and Kong, 2001; Holmgren, 1999; Rothmann et al., 2003; Snapinn, 2004; Wang and Hung, 2003; Wang et al., 2002, 2006). The main idea behind this methodology is that if the historical estimate of the control's effect is reasonably applicable to the noninferiority trial, then a potentially more statistically efficient method for testing the percent retention hypothesis can be constructed by integrating the historical estimate and its variance into the relevant statistics of the noninferiority trial. Because of such integration, a statistical test can be constructed by dividing a relevant combination of the estimate of the relative effect of the test product to the control and the historical estimate of the control's effect by the standard error of the combined estimate. For example, the retention test for 100π% retention on the log risk ratio scale is constructed by dividing the sum of the estimate of log(λT/λC) from the noninferiority trial and (1 − π) times the historical estimate of log(λC/λP) by the standard error of this sum; that is,

Z = [log(λ̂T/λ̂C) + (1 − π) log(λ̃C/λ̃P)] / √(σ̂²TC + (1 − π)² σ̃²CP)

A P-value can then be generated from the test. A sufficiently small P-value can indicate that the test product retains more than 100π% of the control's effect. Had

the placebo been in the noninferiority trial, the retention test Z using the estimate of log(λC/λP) from the noninferiority trial itself would have been the best choice for testing the hypothesis of effect retention in the so-called "Gold Standard" design (Hauschke and Pigeot, 2005). The Type I error probability associated with the rejection region Z ≤ −zα is an unconditional Type I error, but it is not an across-trial error, because all the estimates in the Z statistic are from the noninferiority trial; thus, it is not controversial, since the error rate is calculated by repeating the noninferiority trial only. The P-value can be validly interpreted. In this case, the fixed margin 95NI–XNI method would be a poor choice because it is unnecessarily conservative. When the placebo is absent from the noninferiority trial, the Type I error associated with the rejection region Z ≤ −zα is an across-trial Type I error and can only be evaluated by repeating the noninferiority trial and the historical trials infinitely often. Thus, this P-value is controversial in that it has a completely different meaning from that of the traditional P-value calculated by repeating only the noninferiority trial. The synthesis test method assumes that the historical estimate of the control's effect is unbiased for the control's effect in the noninferiority trial, and thus the method exerts proper control of the unconditional Type I error rate, as mentioned above, under the constancy assumption that the expectation of the historical estimate of the control's effect coincides with the true value of the control's effect in the noninferiority trial setting. However, the synthesis methods are very sensitive to the constancy assumption. As shown earlier (Wang et al., 2003), when the effect of the active control is smaller in the noninferiority trial than it would have been in the historical trials, the unconditional Type I error rate will inflate beyond the level at which the method aims to control.
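For illustration, the synthesis retention statistic and its one-sided P-value can be computed as below. The inputs are hypothetical hazard-ratio estimates (with standard errors of their logs); the statistic follows the Z displayed above, and a small P-value supports 100π% retention.

```python
from math import log, sqrt
from statistics import NormalDist

def synthesis_test(hr_tc, se_tc, hr_cp, se_cp, pi=0.5):
    """Synthesis test of 100*pi% retention on the log risk ratio scale.
    hr_tc, se_tc: NI-trial T/C hazard ratio estimate and SE of its log.
    hr_cp, se_cp: historical C/P hazard ratio estimate and SE of its log."""
    z = (log(hr_tc) + (1 - pi) * log(hr_cp)) / sqrt(se_tc ** 2 + (1 - pi) ** 2 * se_cp ** 2)
    p = NormalDist().cdf(z)  # one-sided P-value for the rejection region Z <= -z_alpha
    return z, p
```

Note that because the historical estimate enters the statistic directly, repeating only the noninferiority trial does not reproduce the sampling distribution of Z; that is precisely the across-trial interpretation problem discussed above.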
The inflation can be very large depending on the extent of bias, such as how much the control's effect deteriorates in the noninferiority study compared to the historical evidence. Another problem is that the synthesis test method does not always control the traditional conditional Type I error rate described above (Lawrence, 2005). Moreover, in contrast to the classical fixed margin method, the synthesis method that controls the unconditional Type I error rate at a fixed level does not provide a fixed noninferiority margin for assessing clinical relevance. For example, consider the retention test for testing 50% retention on the log risk ratio scale at the 2.5% level for the unconditional Type I error rate. Algebraic manipulations (Rothmann et al., 2003) show that

Z = [ln(λ̂T/λ̂C) + 0.5 ln(λ̃C0/λ̃P0)] / √(σ̂²tc + 0.25 σ̃²cp0) < −1.96
⇔ ln(λ̂T/λ̂C) + 1.96 σ̂tc < −0.5 ln(λ̃C0/λ̃P0) + 1.96 σ̂tc − 1.96 √(σ̂²tc + 0.25 σ̃²cp0).

Performing exponentiation on both sides leads to

exp[ln(λ̂T/λ̂C) + 1.96 σ̂tc] < δ*

Table 1 Relationship of the non-fixed margin δ* (α = 0.025) to the number of deaths in the noninferiority trial

Planned # of deaths in noninferiority trial    Non-fixed margin δ*
195                                            1.28
390                                            1.25
780                                            1.23
Very large                                     1.12

where

δ* = exp{ −0.5 ln(λ̃C0/λ̃P0) + 1.96 σ̂tc − 1.96 √(σ̂²tc + 0.25 σ̃²cp0) }.

The δ* looks like a margin for the risk ratio λT/λC, but it is not a fixed margin because it is a function of the statistical information in the noninferiority trial. One problem with the use of δ* to plan the noninferiority trial is as follows. Suppose that the endpoint is time to all-cause mortality and the historical estimate of the hazard ratio is 0.55, with 95% confidence interval (0.38, 0.80). Table 1 can be generated to illustrate the illogical consequence that a larger noninferiority trial is required to rule out a smaller noninferiority margin. For instance, a noninferiority trial with a very large number of deaths would need to rule out the margin of 1.12, as compared to a noninferiority trial with 195 deaths that needs to rule out the margin of 1.28. This is not sensible, because smaller trials get rewarded in the sense that it is easier for them to achieve noninferiority by use of a larger noninferiority margin. Thus, the results of the analysis using this method may be difficult to interpret, in the sense that clinical significance cannot be assessed.
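Table 1 can be approximately reproduced from the historical estimate above (hazard ratio 0.55 with 95% confidence interval 0.38–0.80). The sketch below assumes var(log HR) ≈ 4/(number of deaths) in the noninferiority trial under 1:1 randomization; that variance convention is an assumption, not stated in the text, so the computed margins match the table only to about two decimal places.

```python
from math import exp, log, sqrt

def non_fixed_margin(deaths_ni, hr_cp=0.55, ci_cp=(0.38, 0.80)):
    """Implied non-fixed margin (delta* above) for the 50%-retention synthesis
    test at the 2.5% level. Assumes var(log HR) ~ 4/deaths in the NI trial (an
    assumed convention) and a normal approximation for the historical log HR."""
    se_cp = (log(ci_cp[1]) - log(ci_cp[0])) / (2 * 1.96)  # SE of historical log HR
    se_tc = sqrt(4.0 / deaths_ni)                         # SE of NI-trial log HR
    return exp(-0.5 * log(hr_cp) + 1.96 * se_tc
               - 1.96 * sqrt(se_tc ** 2 + 0.25 * se_cp ** 2))
```

The margin shrinks as the trial grows (195 deaths gives about 1.27–1.28; a very large number of deaths gives about 1.12), which is the illogical reward to smaller trials noted above.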

5. DISCUSSION

Interpreting the results of a noninferiority trial without a concurrent placebo arm is very challenging when evaluating the effect of the test treatment. A fundamental issue with the statistical inference, as discussed in this paper, pertains to what statistical errors should be of major concern. The traditional within-trial Type I error should be of major concern once the noninferiority hypothesis is clearly defined, in the sense that an acceptable noninferiority margin can be determined and fixed throughout the noninferiority trial. However, as articulated above, in practice this Type I error probability is at best definable only conditionally on the margin derived using the control's effect estimated from the historical data. In fact, the true noninferiority margin relevant to the patient population at stake is usually unknown and at best might be known to be some explicit function of the true control's effect via, e.g., retention of a certain percent of the control's effect. The control's effect in the noninferiority study is at best estimable from the historical data when the constancy assumption that the control's effect is unchanged from the relevant

historical trial settings to the current noninferiority setting holds. Thus, the within-trial Type I error probability cannot possibly measure the statistical risk associated with any statistical inference referencing the absent placebo. The across-trial Type I error rate as defined above is another possible measure of the statistical risk associated with the noninferiority statistical inference referencing the absent placebo, when the constancy assumption is reasonable. If the statistical inference for the noninferiority trial is to provide an assertion for the test treatment related to a placebo arm, such as whether the test treatment is efficacious relative to the absent placebo or preserves a certain specified proportion of the control's effect, then both the within-trial and the across-trial Type I error rates must be controlled at a small level, though the former is arguably of first importance under the traditional frequentist statistical framework. The fixed margin methods can be devised to meet this condition. The synthesis methods at best can have the across-trial Type I error rate controlled at a small level only when the constancy assumption holds and can be justified. Theoretically, there is no assurance that the within-trial Type I error rate for any synthesis method can be properly controlled. More importantly, the synthesis method cannot generate a fixed margin, which is often necessary for assessment of clinical significance.

REFERENCES

Blackwelder, W. C. (1982). Proving the null hypothesis in clinical trials. Contr. Clinic. Trials 3:345–353.
Blackwelder, W. C. (2002). Showing a treatment is good because it is not bad: when does noninferiority imply effectiveness? Contr. Clinic. Trials 23:52–54.
Bucher, H. C., Guyatt, G. H., Griffith, L. E., Walter, S. D. (1997). The results of direct and indirect treatment comparisons in meta-analysis of randomized controlled trials. J. Clinic. Epidemiol. 50:683–691.
Chow, S. C., Shao, J. (2006). On noninferiority margin and statistical tests in active control trial. Statistics Med. 25:1101–1113.
Committee for Medicinal Products for Human Use (CHMP). (2006). Guideline on the choice of the noninferiority margin. Statistics Med. 25:1628–1638.
Committee for Proprietary Medicinal Products (CPMP). (2000). Points to consider on switching between superiority and noninferiority. (Available at http://www.eudra.org/emea.html).
D'Agostino, R. B., Massaro, J. M., Sullivan, L. (2003). Non-inferiority trials: design concepts and issues – the encounters of academic consultants in statistics. Statistics Med. 22:169–186.
Department of Health and Human Services, Food and Drug Administration [Docket No. 99D-3082]. (1999). International Conference on Harmonisation: Choice of Control Group in Clinical Trials (E10). Federal Register 64(185):51767–51780.
Dunnett, C. W., Tamhane, A. C. (1997). Multiple testing to establish superiority/equivalence of a new treatment compared with k standard treatments. Statistics Med. 16:2489–2506.
Ebbutt, A. F., Frith, L. (1998). Practical issues in equivalence trials. Statistics Med. 17:1691–1701.
Ellenberg, S. S., Temple, R. (2000). Placebo-controlled trials and active-control trials in the evaluation of new treatments – Part 2: Practical issues and specific cases. Ann. Inter. Med. 133:464–470.

212

HUNG ET AL.

Fleming, T. R. (1987). Treatment evaluation in active control studies. Cancer Treat. Rep. 71:1061–1064. Fleming, T. R. (2000). Design and interpretation of equivalence trials. Am. Heart J. 139: S171–S176. Gould, A. L. (1991). Another view of active-controlled trials. Contr. Clinic. Trials 12:474–485. Hasselblad, V., Kong, D. F. (2001). Statistical methods for comparison to placebo in activecontrol trials. Drug Inf. J. 35:435–449. Hauschke, D. (2005). Choice of delta: a special case. Drug Inf. J. 35:875–879. Hauschke, D., Pigeot, I. (2005). Establishing efficacy of a new experimental treatment in the Gold Standard design (with discussions). Biomet. J. 47:782–798. Holmgren, E. B. (1999). Establishing equivalence by showing that a prespecified percentage of the effect of the active control over placebo is maintained. J. Biopharm. Stat. 9: 651–659. Hung, H. M. J., Wang, S. J. (2004). Multiple testing of noninferiority hypotheses in active controlled trials. J. Biopharm. Stat. 14:327–335. Hung, H. M. J., Wang, S. J., Tsong, Y., Lawrence, J., O’Neill, R. T. (2003). Some fundamental issues with noninferiority testing in active controlled clinical trials. Statistical Med. 22:213–225. Hung, H. M. J., Wang, S. J., O’Neill, R. A. (2005). Regulatory perspective on choice of margin and statistical inference issue in noninferiority trials. Biomet. J. 47:28–36. International Conference on Harmonization: guidance on choice of control group and related design and conduct issues in clinical trials (ICH E-10), Food and Drug Administration, DHHS, July 2000. Jones, B., Jarvis, P., Lewis, J. A., Ebbutt, A. F. (1996). Trials to assess equivalence: the importance of rigorous methods. British Med. J. 313:36–39. Laster, L. L., Johnson, M. F. (2003). Non-inferiority trials: the at least as good as criterion. Statistics Med. 22:187–200. Laster, L. L., Johnson, M. F., Kotler, M. L. (2006). Non-inferiority trials: the at least as good as criterion with dichotomous data. Statistics Med. 
25:1115–1130. Lawrence, J. (2005). Some remarks about the analysis of active control studies. Biomet. J. 47:616–622. Morikawa, T., Yoshida, M. A (1995). Useful testing strategy in phase III trials: combined test of superiority and test of equivalence. J. Biopharm. Stat. 5:297–306. Ng, T. H. (2001). Choice of delta in equivalence testing. Drug Inf. J. 35:1517–1527. Pledger, G., Hall, D. B. (1990). Active control equivalence studies: do they address the efficacy issue? Statistical Issues in Drug Research and Development. New York: Marcel Dekker, pp. 226–238. Rohmel, J. (1998). Therapeutic equivalence investigations: statistical considerations. Statistics Med. 17:1703–1714. Rothmann, M., Li, N., Chen, G., Chi, G. Y. H., Temple, R. T., Tsou, H. H. (2003). Noninferiority methods for mortality trials. Statistics Med. 22:239–264. Simon, R. (1999). Bayesian design and analysis of active control clinical trials. Biometrics 55:484–487. Snapinn, S. M. (2004). Alternatives for discounting in the analysis of noninferiority trials. J. Biopharm. Stat. 14:263–273. Temple, R. (1987). Difficulties in evaluating positive control trials. Proceedings of the Biopharmaceutical Section of American Statistical Association. 1–7. Temple, R. (1996). Problems in interpreting active control equivalence trails. Account. Res. 4:267–275. Temple, R., Ellenberg, S. S. (2000). Placebo-controlled trials and active-control trials in the evaluation of new treatments – Part 1: Ethical and scientific issues. Ann. Internal Med. 133:455–463.

TESTING METHODS IN NONINFERIORITY TRIAL

213

Tsong, Y., Wang, S. J., Hung, H. M. J., Cui, L. (2003). Statistical issues on objective, design and analysis of noninferiority active controlled clinical trial. J. Biopharm. Stat. 13:29–42. Wang, S. J., Hung, H. M. J. (2003). Assessment of treatment efficacy in noninferiority trials. Contr. Clinic. Trials 24:147–155. Wang, S. J., Hung, H. M. J., Tsong, Y. (2002). Utility and pitfall of some statistical methods in active controlled clinical trials. Contr. Clinic. Trials 23:15–28. Wang, S. J., Hung, H. M. J., Tsong, Y. (2003). Noninferiority analysis in active controlled clinical trials. Encyclopedia of Biopharmaceutical Statistics. 2nd ed. New York: Marcel Dekker, pp. 674–677. Wang, Y. C., Chen, G., Chi, G. (2006). A ratio test in active control noninferiority trials with a time-to-event endpoint. J. Biopharm. Stat. 16:151–164. Wang, S. J., Hung, H. M. J., Tsong, Y., Cui, L., Nuri, W. (1997). Changing the study Objective in clinical trials. In: Proceedings of the Biopharmaceutical Section of American Statistical Association, pp. 64–69. Wiens, B. (2002). Choosing an equivalence limit for noninferiority or equivalence studies. Contr. Clinic. Trials 23:2–14. World Medical Association Declaration of Helsinki (1997). Recommendations guiding physicians in biomedical research involving human subject. J. Am. Med. Assoc. 277:925–926.