Testing for Treatment Effect in Clinical Trials with ...

Joint Statistical Meetings - Biometrics Section-to include ENAR & WNAR

TESTING FOR TREATMENT EFFECT IN CLINICAL TRIALS WITH MULTIDIMENSIONAL CATEGORICAL DATA WITH REPEATED MEASUREMENTS

M. Alosh and K. Fritsch
Division of Biometrics III, OB, OPaSS, CDER, FDA, Rockville, MD 20850*

* The views expressed herein are those of the authors and not necessarily of the FDA.

KEY WORDS: Longitudinal Data, Global Testing Procedure, Multidimensional Responses, Serial and Cross Correlation

Abstract

In the clinical trials setting, efficacy or safety assessments are frequently carried out on multidimensional endpoints, each of which is measured repeatedly over the course of the trial. For multiple endpoints evaluated at a specific time point, O’Brien (1984) proposed test statistics based on simple and weighted sums of the standardized single endpoints. In this paper we propose an extension of the O’Brien method for testing multiple endpoints to take into account the longitudinal aspect of the data collected over the course of the trial. We investigate the validity of the proposed extension via a simulation study by comparing the power and the Type I error rates for different sample sizes, numbers of endpoints, and different serial and cross correlation structures. Finally, we consider application of the proposed approach to adverse events data from a clinical trial for a topical product.

I. Introduction

Frequently in the clinical trials setting, multiple response variables are evaluated for treatment efficacy or safety. In addition, the response variables might be measured on different numerical scales and measurements may be taken repeatedly over the course of the trial. The need for multidimensional response assessment arises when the disease conditions are difficult to characterize by one response variable; consequently, multiple responses are needed for measuring different aspects of the condition. In some disease areas, such as anti-viral or photodamage trials, it may be difficult to select a single endpoint to completely characterize the unobservable true conditions. Multidimensional responses which could be thought of as alternative measures (surrogates) for the same latent condition are of fundamental interest.

Estimation of the treatment effect from multidimensional longitudinal data raises several statistical issues, as the responses are expected to be correlated with each other in addition to their serial correlations over time. This is in addition to the possibility that the endpoints may be evaluated on different numerical scales. One frequently used approach is to analyze treatment effect for each response variable separately. However, this might not be the case for complex disease trials. In addition, it should be noted that while this approach is simple and easy to implement, it is still not optimal, as proper analysis of the remaining endpoints should borrow strength by taking into account their cross and serial correlations.

As alternatives to commonly used methods such as the T² test and the Bonferroni method, O’Brien (1984) proposed a non-parametric rank sum test and two parametric tests (referred to as ordinary and generalized least squares (OLS and GLS, respectively)). The parametric tests are designed to be optimal against the alternative hypothesis that the standardized treatment differences are of equal magnitude and in the same direction among all endpoints. Pocock, Geller, and Tsiatis (1987) illustrated the applicability of O’Brien’s GLS statistics to any set of asymptotically normal statistics for continuous, binary or survival data. Other approaches to this problem have been considered by Sammel and Ryan (1996), Gray and Brookmeyer (1998), Lehmacher, Wassmer and Reitmeir (1991), and Follmann (1995). Zhang, Quan, and Ng (1997) proposed a global assessment procedure based on categorization of the individual endpoint to form an overall composite endpoint. Sankoh, Huque, Russell and D’Agostino (1999) provide an extensive review of the global testing procedures and their applications to clinical trials.

The objective of this paper is to extend O’Brien’s global testing procedure to take into account the longitudinal aspect of the data. In the following section we present our extension of O’Brien’s method to longitudinal data. In Section III, we investigate the validity of the proposed approach by comparing the power and Type I error rates of the various approaches via a simulation study. In the process, we introduce an algorithm for generating longitudinal, multidimensional, ordered categorical data. In Section IV we consider application of the proposed methodology to several expected adverse events from a clinical trial measured at fixed time points.

II. An extension of O’Brien’s global testing approach to multidimensional longitudinal responses

Suppose there are two treatment groups and K endpoints of interest. For each experimental unit in the study, consider a set of measurements recorded at time t. Specifically, let $Y_{ijkt}$ represent the measurement at time t (t = 1, 2, …, T) for the kth (k = 1, 2, …, K) endpoint for individual j (j = 1, 2, …, $n_i$) on treatment i (i = 1, 2). The vectors $Y_{ijt} = (Y_{ij1t}, Y_{ij2t}, \ldots, Y_{ijKt})$ are independently distributed with mean $\mu_{it} = (\mu_{i1t}, \mu_{i2t}, \ldots, \mu_{iKt})$ and covariance matrix $\Sigma_t$, with $\sigma_{kk'tt'} = \mathrm{cov}(Y_{ijkt}, Y_{i'j'k't'})$ for $i = i'$, $j = j'$. O’Brien proposed three methods for reducing the K-dimensional data on each subject to a single value, $S_{ijt}^{(\cdot)}$.

(a) The Ordinary Least Squares (OLS) and the Generalized Least Squares (GLS)

For these methods it is necessary to first express the data in common units, if they are on different scales. Let $Y^*_{ijkt}$ represent the standardized variables, that is:

$$Y^*_{ijkt} = (Y_{ijkt} - \bar{Y}_{..kt}) / S_{..kt}$$

where $\bar{Y}_{..kt} = (n_1 \bar{Y}_{1.kt} + n_2 \bar{Y}_{2.kt}) / (n_1 + n_2)$ and $S^2_{..kt} = [(n_1 - 1) S^2_{1.kt} + (n_2 - 1) S^2_{2.kt}] / (n_1 + n_2 - 2)$.

(i) For the OLS method calculate:

$$S_{ijt}^{(O)} = J' \, Y^*_{ijt}$$

(ii) For the GLS method calculate:

$$S_{ijt}^{(G)} = J' \, \hat{\Sigma}^{-1} \, Y^*_{ijt}$$

where $J' = (1, \ldots, 1)$, and $\hat{\Sigma}$ is the covariance matrix with elements

$$\hat{\sigma}_{kk't} = \sum_{i}\sum_{j} (Y^*_{ijkt} - \bar{Y}^*_{i.kt})(Y^*_{ijk't} - \bar{Y}^*_{i.k't}) \Big/ \sum_{i} (n_i - 1)$$

Basically, O’Brien’s parametric statistics are the weighted sums of the K standardized responses using either equal or unequal weights. The distinction between the two methods is that the GLS gives greater weights to endpoints that are not highly correlated with other endpoints while the OLS gives equal weights to all endpoints.

(b) The Rank Sum Test

For the Rank Sum test one replaces the data $Y_{ijkt}$ on the K endpoints by their ranks, $r_{ijkt}$. In other words, one first ranks the $N = (n_1 + n_2)$ subjects separately for each variable k, then computes the rank sum over the K endpoints for the jth subject in the ith treatment group, $S_{ijt}^{(R)} = r_{ij1t} + \cdots + r_{ijKt}$, and then applies the appropriate univariate test statistics.

The quantities $S_{ijt}^{(\cdot)}$ reduce the dimension of the data so that there is now one observation for each timepoint on each subject. Repeated measurement approaches, such as MANOVA (multivariate analysis of variance), can now be used to address pertinent clinical questions related to time trend, interaction and similarity of efficacy or safety behavior for the two treatment groups over time (see Section IV).

Although, as in O’Brien’s method, the approach can be used for any type of data, we restrict our simulation to ordered categorical data, as such data are encountered frequently in the clinical trials we have come across. One situation where ordered categorical data may arise is from an investigator’s global evaluation of some efficacy or safety endpoint. Global evaluations are often classified into categories such as clear, mild, moderate, or severe. Ordered categorical data may also arise from the categorization of an underlying continuous response variable.

III. Validation of the methodology (simulation study)

To address the validity of the proposed approach for handling multidimensional categorical responses with repeated measurements, we consider a simulation experiment comparing the Type I error rate and the power of the OLS, GLS and Rank Sum tests for various serial and cross correlation structures. As discussed previously, we limit our simulation study to ordered categorical data, as this is the most common type of data we encounter in clinical trials.
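As a rough illustration of the three reductions described in Section II (OLS, GLS and Rank Sum), the following sketch computes the per-subject scores at a single time point using only the Python standard library. The function names, the toy data, and the use of midranks for tied values are assumptions made for illustration; the paper does not specify how ties are ranked, and the sketch assumes every endpoint has non-zero pooled variance and a non-singular pooled covariance matrix.

```python
from statistics import mean

def standardize(y, groups):
    """Standardize each endpoint by its pooled mean and pooled within-group SD.
    y: one list of K endpoint values per subject; groups: 0/1 label per subject."""
    K, n = len(y[0]), [groups.count(g) for g in (0, 1)]
    out = [[0.0] * K for _ in y]
    for k in range(K):
        col = [[r[k] for r, g in zip(y, groups) if g == grp] for grp in (0, 1)]
        m = [mean(c) for c in col]
        grand = (n[0] * m[0] + n[1] * m[1]) / (n[0] + n[1])
        ss = sum((v - m[g]) ** 2 for g in (0, 1) for v in col[g])
        s = (ss / (n[0] + n[1] - 2)) ** 0.5  # assumes s > 0
        for j, r in enumerate(y):
            out[j][k] = (r[k] - grand) / s
    return out

def pooled_cov(ystar, groups):
    """Pooled within-group covariance matrix of the standardized endpoints."""
    K = len(ystar[0])
    means = [[mean(r[k] for r, g in zip(ystar, groups) if g == grp)
              for k in range(K)] for grp in (0, 1)]
    denom = len(ystar) - 2  # sum over groups of (n_i - 1)
    cov = [[0.0] * K for _ in range(K)]
    for r, g in zip(ystar, groups):
        d = [r[k] - means[g][k] for k in range(K)]
        for a in range(K):
            for b in range(K):
                cov[a][b] += d[a] * d[b] / denom
    return cov

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for q in range(c, n + 1):
                m[r][q] -= f * m[c][q]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][q] * x[q] for q in range(r + 1, n))) / m[r][r]
    return x

def ols_scores(ystar):
    """OLS score: unweighted sum of the standardized endpoints."""
    return [sum(r) for r in ystar]

def gls_scores(ystar, groups):
    """GLS score: J' Sigma^-1 Y*, computed via weights w solving Sigma w = J."""
    w = solve(pooled_cov(ystar, groups), [1.0] * len(ystar[0]))
    return [sum(wk * v for wk, v in zip(w, r)) for r in ystar]

def midranks(values):
    """Ranks 1..N with tied values sharing their average rank (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks, i = [0.0] * len(values), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for q in range(i, j + 1):
            ranks[order[q]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def rank_sum_scores(y):
    """Rank all subjects per endpoint, then sum the ranks within each subject."""
    cols = [midranks([r[k] for r in y]) for k in range(len(y[0]))]
    return [sum(c[j] for c in cols) for j in range(len(y))]

# Hypothetical toy data: 6 subjects, K = 2 endpoints, 3 subjects per arm.
y = [[0, 1], [1, 1], [2, 3], [1, 0], [3, 2], [2, 2]]
groups = [0, 0, 0, 1, 1, 1]
ystar = standardize(y, groups)
print(ols_scores(ystar), gls_scores(ystar, groups), rank_sum_scores(y))
```

Each function returns one score per subject, so the output at every time point can be passed to whichever univariate or repeated measures analysis is of interest.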

In generating data for our simulation experiment we attempt to mimic the actual clinical data with which we have been involved. Subjects are typically graded on c-point scales, with lower categories representing more favorable outcomes. The simulation assumes that each patient has an unobservable underlying latent disease condition with which the multiple observed responses are correlated. A subject’s responses on an endpoint over time are also assumed to be correlated.
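A generator of this kind can be sketched as follows, assuming the scheme used in this section: a latent category is drawn at the first time point, is carried forward to the next time point with probability γ (and redrawn otherwise), and each observed endpoint copies the latent category with probability β_k (and is drawn independently otherwise). The function and argument names are hypothetical, and the sketch uses only Python's standard `random` module.

```python
import random

def draw_category(probs, rng):
    """Draw a category 1..C with the given probabilities."""
    u, cum = rng.random(), 0.0
    for m, p in enumerate(probs, start=1):
        cum += p
        if u < cum:
            return m
    return len(probs)  # guard against rounding in the cumulative sum

def simulate_subject(p_latent, p_obs, beta, gamma, rng):
    """Return a T x K table of ordered categorical responses for one subject.

    p_latent[t], p_obs[t]: category probabilities at time index t for the
        latent and the observable variables.
    beta[k]: probability that endpoint k copies the latent category.
    gamma[t]: probability that the latent category at time index t repeats
        the one at t - 1 (gamma[0] is unused).
    """
    T, K = len(p_latent), len(beta)
    x = draw_category(p_latent[0], rng)
    rows = []
    for t in range(T):
        if t > 0 and rng.random() >= gamma[t]:
            x = draw_category(p_latent[t], rng)  # redraw w.p. 1 - gamma[t]
        rows.append([x if rng.random() < beta[k]
                     else draw_category(p_obs[t], rng)
                     for k in range(K)])
    return rows

# Hypothetical settings: 5 categories, T = 4 time points, K = 2 endpoints,
# identical category probabilities at every time point.
rng = random.Random(2024)
probs = [0.01, 0.04, 0.25, 0.45, 0.25]
subject = simulate_subject([probs] * 4, [probs] * 4,
                           beta=[0.2, 0.8], gamma=[0.0, 0.5, 0.5, 0.5], rng=rng)
print(subject)
```

Calling `simulate_subject` once per subject in each arm, with arm-specific probabilities, gives one simulated trial; the random seed is fixed here only so that the sketch is repeatable.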


The following model is used to generate random longitudinal, multidimensional, ordered categorical data. Let $X_{ijt}$ be the unobserved (latent) disease condition for subject j (j = 1, 2, …, $n_i$) on treatment i (i = 1, 2) at time t (t = 1, 2, …, T). Let $Y_{ijkt}$ represent the kth (k = 1, 2, …, K) surrogate endpoint for $X_{ijt}$. Let $\beta_k$ (k = 1, 2, …, K) be parameters defining the correlation between the observed responses, $Y_{ijkt}$, and the latent categories, $X_{ijt}$. Let $\gamma_t$ (t = 1, …, T−1) be parameters defining the correlation between adjacent observations in time for the same endpoint (i.e. $Y_{ijkt}$ and $Y_{ijk(t+1)}$).

Let $p_{mit}$ and $p^*_{mit}$ represent probabilities associated with falling into class m on treatment i at time t for the latent and observable variables, respectively. To simulate random data, generate the latent disease conditions with:

(for t = 1)

$$X_{ij1} = m \quad \text{w.p. } p_{mi1} \quad \text{for } m = 1, \ldots, C;\ i = 1, 2$$

(for t = 2, …, T)

$$X_{ijt} = \begin{cases} X_{ij(t-1)} & \text{w.p. } \gamma_t \\ Z_{ijt} & \text{w.p. } (1 - \gamma_t) \end{cases}$$

where $Z_{ijt} = m$ w.p. $p_{mit}$ for m = 1, …, C; i = 1, 2.

From the latent disease conditions generate the observable responses with:

$$Y_{ijkt} = \begin{cases} X_{ijt} & \text{w.p. } \beta_k \\ W_{ijkt} & \text{w.p. } (1 - \beta_k) \end{cases}$$

where $W_{ijkt} = m$ w.p. $p^*_{mit}$ for m = 1, …, C; i = 1, 2.

Then it can be verified that the correlation between any two endpoints/time points is:

$$\rho(Y_{ijkt}, Y_{ij(k+r)(t+s)}) = \Big( \prod_{m=0}^{s-1} \gamma_{t+m} \Big) \cdot \beta_k \cdot \beta_{(k+r)}$$

Once the set of $Y_{ijkt}$ values has been generated, an overall composite measure based on O’Brien’s approach for each time point can be computed.

For the simulation experiment, data were simulated using various values of $\beta_k$ and $\gamma_t$ to compare different correlation structures. The three test statistics proposed by O’Brien (OLS, GLS, and Rank Sum) were computed for each data set. Error rates and power were compared using all post-baseline data with a repeated measures analysis under our extension of O’Brien’s method. Table 1 compares the Type I error rates for the analysis using repeated measurements (T = 4) for n = 20, 50; K = 2, 5; γ = 0.2, 0.5; and $\beta_k$ = 0.2, 0.8, utilizing both equal and unequal correlations across endpoints. Under the null hypothesis, the active and control treatments follow the same distribution. The error rates are based on 1000 simulations and nominal α = 0.05.

The results of Table 1 suggest the following. The GLS error rates appear to be consistently larger than their OLS and rank-sum counterparts. The error rates for the rank-sum and OLS methods are generally similar, and are closer to the nominal α than the GLS rates.

Table 2 compares the power of the tests for the three procedures. The alternative hypothesis represents a shift of 20% of the probability in the active arm to the next lower category at each time point for each endpoint. For 4 time points and baseline (t = 1) probabilities (0.01, 0.04, 0.25, 0.45, 0.25) the probabilities at time t = 4 would be (0.059, 0.162, 0.325, 0.326, 0.128).

The results of Table 2 suggest the following. The power for the OLS and GLS are similar and each has greater power than the ranks. The power tends to decrease as either the serial or cross correlations increase.

IV. Application

We consider a clinical trial of a topical drug product comparing the active drug against its vehicle with regard to 3 adverse events (AEs). The 3 adverse events under investigation are burning, erythema, and peeling. Each event was rated on a 4-point scale from 0 to 3, as none, mild, moderate, or severe. Assessment of these AEs at baseline and Weeks 2, 5, 8 and 11 was specified in the protocol.

Figure 1 presents plots of the mean score of each of the AEs over the course of the trial as well as the plot of the corresponding OLS score (total) for the global test statistics.

Figure 1 shows that the average score for some of the AEs, such as peeling, increases up to a certain time point (Week 5) for the active drug arm, and then declines thereafter. A repeated measures approach might capture such a pattern better than analyzing each time point individually.


Table 1: Error Rates under Null Hypothesis (Repeated Measures)

| n | K | Cross Correlations (β1, …, βK) | Serial Corr. (γ) | OLS | GLS | RankSum |
|---|---|---|---|---|---|---|
| 20 | 2 | (0.2, 0.2) | 0.2 | 0.043 | 0.058 | 0.044 |
| | | | 0.5 | 0.049 | 0.058 | 0.046 |
| | | (0.8, 0.8) | 0.2 | 0.047 | 0.056 | 0.040 |
| | | | 0.5 | 0.055 | 0.069 | 0.047 |
| | | (0.2, 0.8) | 0.2 | 0.056 | 0.064 | 0.058 |
| | | | 0.5 | 0.046 | 0.062 | 0.047 |
| | 5 | (0.2, 0.2, 0.2, 0.2, 0.2) | 0.2 | 0.054 | 0.102 | 0.056 |
| | | | 0.5 | 0.059 | 0.089 | 0.045 |
| | | (0.8, 0.8, 0.8, 0.8, 0.8) | 0.2 | 0.048 | 0.091 | 0.046 |
| | | | 0.5 | 0.050 | 0.084 | 0.048 |
| | | (0.2, 0.2, 0.8, 0.8, 0.8) | 0.2 | 0.053 | 0.086 | 0.051 |
| | | | 0.5 | 0.050 | 0.083 | 0.052 |
| 50 | 2 | (0.2, 0.2) | 0.2 | 0.045 | 0.050 | 0.043 |
| | | | 0.5 | 0.038 | 0.043 | 0.045 |
| | | (0.8, 0.8) | 0.2 | 0.042 | 0.043 | 0.045 |
| | | | 0.5 | 0.045 | 0.046 | 0.052 |
| | | (0.2, 0.8) | 0.2 | 0.057 | 0.067 | 0.054 |
| | | | 0.5 | 0.047 | 0.052 | 0.043 |
| | 5 | (0.2, 0.2, 0.2, 0.2, 0.2) | 0.2 | 0.036 | 0.055 | 0.036 |
| | | | 0.5 | 0.048 | 0.058 | 0.048 |
| | | (0.8, 0.8, 0.8, 0.8, 0.8) | 0.2 | 0.046 | 0.067 | 0.045 |
| | | | 0.5 | 0.045 | 0.056 | 0.047 |
| | | (0.2, 0.2, 0.8, 0.8, 0.8) | 0.2 | 0.051 | 0.061 | 0.043 |
| | | | 0.5 | 0.046 | 0.069 | 0.044 |

Table 2: Power under Alternative Hypothesis (Repeated Measures)

| n | K | Cross Correlations (β1, …, βK) | Serial Corr. (γ) | OLS | GLS | RankSum |
|---|---|---|---|---|---|---|
| 20 | 2 | (0.2, 0.2) | 0.2 | 0.844 | 0.825 | 0.802 |
| | | | 0.5 | 0.815 | 0.797 | 0.741 |
| | | (0.8, 0.8) | 0.2 | 0.523 | 0.511 | 0.465 |
| | | | 0.5 | 0.291 | 0.308 | 0.257 |
| | | (0.2, 0.8) | 0.2 | 0.728 | 0.722 | 0.667 |
| | | | 0.5 | 0.561 | 0.560 | 0.499 |
| | 5 | (0.2, 0.2, 0.2, 0.2, 0.2) | 0.2 | 0.992 | 0.984 | 0.980 |
| | | | 0.5 | 0.982 | 0.975 | 0.965 |
| | | (0.8, 0.8, 0.8, 0.8, 0.8) | 0.2 | 0.568 | 0.585 | 0.507 |
| | | | 0.5 | 0.325 | 0.407 | 0.280 |
| | | (0.2, 0.2, 0.8, 0.8, 0.8) | 0.2 | 0.799 | 0.877 | 0.744 |
| | | | 0.5 | 0.600 | 0.755 | 0.515 |
| 50 | 2 | (0.2, 0.2) | 0.2 | 0.999 | 0.999 | 0.997 |
| | | | 0.5 | 0.994 | 0.993 | 0.990 |
| | | (0.8, 0.8) | 0.2 | 0.909 | 0.895 | 0.862 |
| | | | 0.5 | 0.609 | 0.558 | 0.515 |
| | | (0.2, 0.8) | 0.2 | 0.985 | 0.983 | 0.979 |
| | | | 0.5 | 0.915 | 0.898 | 0.874 |
| | 5 | (0.2, 0.2, 0.2, 0.2, 0.2) | 0.2 | 1.000 | 1.000 | 1.000 |
| | | | 0.5 | 1.000 | 1.000 | 1.000 |
| | | (0.8, 0.8, 0.8, 0.8, 0.8) | 0.2 | 0.921 | 0.918 | 0.890 |
| | | | 0.5 | 0.656 | 0.659 | 0.584 |
| | | (0.2, 0.2, 0.8, 0.8, 0.8) | 0.2 | 0.999 | 0.999 | 0.996 |
| | | | 0.5 | 0.949 | 0.989 | 0.910 |
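The alternative hypothesis behind Table 2 moves 20% of each category's probability to the next lower category at every post-baseline time point. A minimal sketch of that arithmetic follows; the function name `shift_down` is hypothetical, and the reading that the shift is applied cumulatively at t = 2, 3 and 4 is an interpretation, chosen because it reproduces the t = 4 probabilities quoted in the text.

```python
def shift_down(probs, fraction):
    """Move `fraction` of each category's probability to the next lower category.

    The lowest category has nowhere to shift to, so it only gains probability.
    """
    out = list(probs)
    for m in range(1, len(probs)):
        moved = fraction * probs[m]
        out[m] -= moved
        out[m - 1] += moved
    return out

p = [0.01, 0.04, 0.25, 0.45, 0.25]  # baseline (t = 1) probabilities
for t in range(2, 5):               # apply the shift at t = 2, 3, 4
    p = shift_down(p, 0.20)

print([round(v, 3) for v in p])     # [0.059, 0.162, 0.325, 0.326, 0.128]
```

The unrounded values at t = 4 are 0.05912, 0.16168, 0.3248, 0.3264 and 0.128, which still sum to 1.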


[Figure 1: four panels (Erythema, Burning, Peeling, Total), each plotting the Mean score (axis from 0 to 0.5) against Week (0, 2, 5, 8, 11), with one line for b and one for v.]

Figure 1. Plots of the mean scores for the individual AEs over time and their total score (OLS) over the course of the trial. b represents the active treatment and v represents vehicle.

Table 3 presents the p-values for each of the individual AEs and the total score (OLS) at each time point using Fisher’s exact test. Table 4 presents the p-values for the repeated measurement approaches for the three statistics considered here using analysis of variance.

Table 3 – P-values for the test of treatment effect at each timepoint for the individual AEs and total score

| | Week 2 | Week 5 | Week 8 | Week 11 |
|---|---|---|---|---|
| Erythema | 0.559 | 0.196 | 0.373 | 0.042 |
| Peeling | 0.130 | 0.057 | 0.159 | 0.662 |
| Burning | ‡ | ‡ | ‡ | ‡ |
| Total (OLS) | 0.267 | 0.038 | 0.216 | 0.195 |

* p-value cannot be calculated due to 0 frequency
‡ For burning, the source gives a p-value of 1.0 at one time point and * at the other three; which column holds the 1.0 is not recoverable.

Table 4 – P-values for the test of treatment effect using repeated measures analysis and O’Brien statistics (Weeks 2, 5, 8, 11)

| | OLS | GLS* | Ranks |
|---|---|---|---|
| p-value | 0.020 | 0.022 | 0.016 |

* GLS analysis uses erythema and peeling only, as the covariance matrix would be singular if burning is included.

The results of Table 3 show that, if an adjustment for testing multiple time points is taken into account, neither the individual AEs nor the OLS statistic is statistically significant at individual time points. However, the repeated measurement approach in Table 4 shows significant differences regardless of the method applied.

V. Conclusion

By extending the global testing procedure methodology to take into account repeated measurements, one can make full use of the data and capture aspects that might not be observed when examining data at an individual time point. The results of our simulation experiment for the repeated measures approach show that the OLS and the ranks give better control of the Type I error rate than the GLS. In terms of power, the OLS and GLS are comparable and their power exceeds that of the ranks.

References

Follmann, D. (1995). Multivariate tests for multiple endpoints in clinical trials. Statistics in Medicine, 14, 1163-1175.

Gray, S. and Brookmeyer, R. (1998). Estimating a treatment effect from multidimensional longitudinal data. Biometrics, 54, 976-988.

Lehmacher, W., Wassmer, G. and Reitmeir, P. (1991). Procedures for two-sample comparisons with multiple endpoints controlling the experimentwise error rate. Biometrics, 47, 511-521.

O’Brien, P. (1984). Procedures for comparing samples with multiple endpoints. Biometrics, 40, 1079-1087.

Pocock, S., Geller, N. and Tsiatis, A. (1987). The analysis of multiple endpoints in clinical trials. Biometrics, 43, 487-498.

Sammel, M. and Ryan, L. (1996). Latent variable models with fixed effects. Biometrics, 52, 650-663.

Sankoh, A., Huque, M., Russell, H. K. and D’Agostino, R. (1999). Global two-group multiple endpoint adjustment methods applied to clinical trials. Drug Information Journal, 33, 119-140.