Psychological Reports, 1995, 76, 1011-1017. © Psychological Reports 1995

TESTS THAT ARE ROBUST AGAINST VARIANCE HETEROGENEITY IN k x 2 DESIGNS WITH UNEQUAL CELL FREQUENCIES¹

JOHN E. OVERALL AND ROBERT S. ATLAS
University of Texas Medical School at Houston

JANET M. GIBSON
Grinnell College, Grinnell, Iowa

Summary.-Heterogeneity of variance produces serious bias in conventional analysis of variance tests of significance when cell frequencies are unequal. Welch in 1938 and 1947 proposed an adjusted t test for the difference between two means when cell frequencies and population variances are both unequal. This article describes two ways to use the Welch t to evaluate the significance of the main effect for two treatments across k levels of a concomitant factor in a two-way design. Monte Carlo results document the bias in conventional analysis of variance tests and the stable and appropriately conservative results from applications of the Welch t to evaluation of treatment effects in the two-way design.

¹This work was supported in part by Grant 5R01 MH 32457 from DHHS/NIMH. Requests for reprints may be addressed to John E. Overall, Department of Psychiatry and Behavioral Sciences, The University of Texas Medical School, P.O. Box 20708, Houston, Texas 77225.

The analysis of variance F tests are often characterized as robust against departures from assumptions; however, several authors have emphasized the disruptive effects of heterogeneity of variance in designs with unequal cell frequencies (Glass, Peckham, & Sanders, 1972). Whereas most previous work on this problem has concentrated on simple one-way designs, this article is concerned with the more complex problem of testing the main effect for two treatments across k levels of a concomitant variable. In the two-way design, the p values associated with conventional F tests can be excessively large or excessively small, depending on whether the larger population variances coincide with the smaller or larger sample sizes.

It is not difficult to appreciate why usual analysis of variance tests are biased in the face of correlation between sample sizes and population variances. The mean square error, based on the assumption of homogeneity of variance, is calculated by summing squared deviations about the respective cell means. In the case of unequal cell variances, if the larger dispersion occurs in the cells with larger numbers of observations, the pooled error would be inflated by adding in a greater number of the larger deviations. In turn, the inflated error results in a conservatively biased test of significance for differences between treatment means. If the smaller variances are associated
with the larger cell sample sizes, the pooled error would emphasize the smaller variances. That results in a nonconservative bias in tests of significance for treatment effects.

Welch (1947) has proposed a modification of the Student t test that minimizes problems created by the simultaneous presence of unequal sample sizes and unequal variances. The statistic can be used to test the significance of any 1 df contrast in a multiple-group design. The purpose of this article is to emphasize the serious problem attendant to use of customary analysis of variance calculations in a k x 2 design with unequal cell frequencies and heterogeneity of variance and to illustrate two ways in which the Welch t can be used to obtain valid tests of significance for the main effect of two treatments across k levels of a second factor.

METHOD

Welch (1938) proposed an adjusted t statistic for testing the difference between two means when the population variances differ. Adapted for use with sample variance estimates, the test statistic can be defined as follows:

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} $$

with

$$ df = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{s_1^4}{n_1^2(n_1 - 1)} + \dfrac{s_2^4}{n_2^2(n_2 - 1)}} \qquad [1] $$
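As an illustration, Eq. 1 can be computed directly from group summary statistics. The following is a minimal Python sketch and not code from the original study; the function name `welch_t` and the example numbers are assumptions chosen only for illustration.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Return Welch's t and its approximate df (Eq. 1).

    var1 and var2 are unbiased sample variance estimates (divisor n - 1).
    """
    v1 = var1 / n1  # estimated sampling variance of the first mean
    v2 = var2 / n2  # estimated sampling variance of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical example: the smaller group (n = 5) has the larger variance.
t_stat, df = welch_t(10.0, 4.0, 5, 8.0, 1.0, 15)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

For these hypothetical values the approximate df are about 4.7, close to the n - 1 = 4 of the small, more variable group and far below the pooled value n1 + n2 - 2 = 18. With equal variances and equal sample sizes the formula returns the pooled df.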

Subsequently, Welch (1947) generalized "Student's problem" to test any 1 df contrast among several means when the several population variances differ.
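A 1 df contrast of this kind can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' program: it assumes the familiar approximate-df form in which each group contributes c_i^2 s_i^2 / n_i to the variance of the contrast, and the function name `welch_contrast`, the flat ordering of the cells, and the example values are all hypothetical.

```python
import math


def welch_contrast(means, variances, sizes, weights):
    """Welch-type t and approximate df for the contrast sum(c_i * mean_i)."""
    estimate = sum(c * m for c, m in zip(weights, means))
    # Each term is the estimated variance contribution c_i^2 * s_i^2 / n_i.
    terms = [c * c * v / n for c, v, n in zip(weights, variances, sizes)]
    se2 = sum(terms)
    t = estimate / math.sqrt(se2)
    df = se2 ** 2 / sum(term ** 2 / (n - 1) for term, n in zip(terms, sizes))
    return t, df


# Hypothetical 2 x 2 layout, cells listed as
# (level 1, treatment 1), (level 1, treatment 2), (level 2, treatment 1), (level 2, treatment 2).
means = [10.0, 8.0, 12.0, 11.0]
variances = [4.0, 1.0, 4.0, 1.0]
sizes = [5, 15, 5, 15]
k = 2
# Weights +1/k and -1/k give the difference between unweighted marginal means.
weights = [1 / k, -1 / k, 1 / k, -1 / k]
t_stat, df = welch_contrast(means, variances, sizes, weights)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

With weights of +1 and -1 on two groups the function reduces to the two-sample statistic of Eq. 1. Weights of +1/k and -1/k over the cells of a k x 2 layout express the treatment main effect as a single contrast among the 2k cell means.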

We are indebted to the reviewer of a preliminary version of this manuscript for pointing out that the single summation notation used by Welch (1947) does not imply that the proposed solution should not generalize to testing the difference between marginal means in a two-way design. Expressed in terms of observed means and variance estimates, the following adaptation of Welch's nonseries approximation formula (p. 32) provides a test of the main effect for two treatments across k levels of a concomitant factor.

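With $s_{ij}^2$ and $n_{ij}$ denoting the variance estimate and the sample size in the cell at level i of treatment j,

$$ t = \frac{\bar{X}_{.1} - \bar{X}_{.2}}{\dfrac{1}{k}\sqrt{\sum_{i=1}^{k}\sum_{j=1}^{2} s_{ij}^2/n_{ij}}} $$

with

$$ df = \frac{\left(\sum_{i=1}^{k}\sum_{j=1}^{2} s_{ij}^2/n_{ij}\right)^2}{\sum_{i=1}^{k}\sum_{j=1}^{2} \dfrac{s_{ij}^4}{n_{ij}^2(n_{ij} - 1)}} \qquad [3] $$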

where $\bar{X}_{.1}$ and $\bar{X}_{.2}$ are the unweighted means of observed cell means across levels in the two treatment conditions.

A second way Welch's t can be used to obtain a superior test of the main effect for two treatments across levels of a second factor in an unbalanced design is to calculate the t statistic at each of the k levels of the design separately (Eq. 1), obtain the associated one-sided p values, transform the p values to unit normal z scores using an inverse normal distribution function (or simple table look-up), and then combine the z scores as follows:

$$ \bar{z} = \frac{\sum_{i=1}^{k} z_i}{\sqrt{k}} \qquad [4] $$

where $\bar{z}$ is itself evaluated as a unit normal z score (Rosenthal, 1978). If a two-sided test of the main effect for treatments is required, the one-sided p values related to the directional null hypothesis that is most contradicted by the preponderance of the data should be used in the calculations, but the significance of $\bar{z}$ should then be evaluated against the α/2 critical value. It should be noted that this is only one of several ways that separate Welch's t tests from the k levels of a two-way design can be combined (Rosenthal, 1978). A somewhat more powerful unbiased test combines the natural logarithms of one-sided p values to obtain a composite statistic that is distributed as chi squared with 2k degrees of freedom (Fisher, 1970). Again, if a two-sided test of the main effect for treatments is required, one-sided p values associated with the directional null hypothesis that is most strongly contradicted by the combined results should be used in the calculations, but the significance of the composite $\chi^2$ should then be evaluated against the α/2 critical value in the null distribution for chi squared with 2k degrees of freedom. Even though there are still other "meta-analysis" tests that could be adapted to combine results from Welch's t across k levels of a second factor, we concentrate on results obtained from evaluating $\bar{z}$ of Eq. 4 as a unit normal z score because that test is more nearly equivalent to the analysis of variance test of main effect for treatments across levels when cell variances are not unequal.

Six designs in which unequal cell frequencies and unequal population variances combine to produce bias in the usual analysis of variance tests are considered. Case I involves individual levels of the k x 2 design in which both sample sizes and population variances are equal, but sample sizes and population variances are negatively correlated across the different levels of the concomitant factor. That is, the levels with small but equal sample sizes have the larger corresponding equal population variances, and the levels with larger equal sample sizes have smaller corresponding variances. Because this is a design in which sample sizes vary proportionately across levels, the
unweighted means and least squares analysis of variance calculations produce identical results.

Case II also involves equal sample sizes and equal population variances within levels, but the sample sizes and variances differ between levels in a positively correlated fashion. Levels with larger sample sizes also have larger population variances. Again, the proportionate variation in sample sizes across levels renders unweighted means and least squares analysis of variance equivalent.

Case III represents a design in which fewer subjects are systematically allocated to one treatment than to the other across all levels of the concomitant variable. Such designs are not infrequent in drug vs placebo comparisons. Unequal population variances can result from individual differences in the response to one treatment which are not present for the other. In Case III, smaller population variances correspond to the treatment with the larger sample sizes. Case IV has the same proportionate allocation of subjects across levels, but the larger population variances are associated with the larger sample sizes. In both cases, the proportionate allocation of subjects between treatments at the different levels results in a proportionate cell frequency design for which unweighted means and least squares analysis of variance produce equivalent results.

Case V involves unequal and disproportionate sample sizes with negative correlation between population variances and the sample sizes. Case VI involves unequal and disproportionate sample sizes with positive correlation between population variances and the unequal sample sizes. In the face of unequal and disproportionate sample sizes, the unweighted means and least squares analysis of variance calculations do not produce exactly the same results; however, the unweighted means calculations provide a good approximation to SAS Type III sums of squares, i.e., the default option, and to the solutions available in BMDP and SPSS MANOVA programs.
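The conventional unweighted means test of the treatment main effect referred to above can be sketched from cell summary statistics. This is a hedged illustration rather than the authors' program: the function name `unweighted_means_f`, the list-of-pairs input layout, and the example values are assumptions, and the calculation uses the usual pooled within-cell error term for a two-treatment contrast.

```python
def unweighted_means_f(cell_means, cell_vars, cell_sizes):
    """Conventional unweighted means F for the treatment main effect in a k x 2 design.

    Each argument is a list of k (treatment 1, treatment 2) pairs.
    Returns (F, numerator df, error df).
    """
    k = len(cell_means)
    # Difference between the unweighted marginal means of the two treatments.
    diff = sum(m1 - m2 for m1, m2 in cell_means) / k
    # Pooled within-cell mean square, which assumes homogeneous variances.
    ss_error = sum((n - 1) * v
                   for v_pair, n_pair in zip(cell_vars, cell_sizes)
                   for v, n in zip(v_pair, n_pair))
    df_error = sum(n - 1 for n_pair in cell_sizes for n in n_pair)
    ms_error = ss_error / df_error
    inv_n = sum(1.0 / n for n_pair in cell_sizes for n in n_pair)
    f = diff ** 2 / (ms_error * inv_n / k ** 2)
    return f, 1, df_error


# Hypothetical summary statistics following the Case I pattern:
# small levels (n = 5) have variance 4, large levels (n = 15) have variance 1.
cell_means = [(1.0, 0.0)] * 4
cell_vars = [(4.0, 4.0), (1.0, 1.0), (4.0, 4.0), (1.0, 1.0)]
cell_sizes = [(5, 5), (15, 15), (5, 5), (15, 15)]
f_stat, df_num, df_err = unweighted_means_f(cell_means, cell_vars, cell_sizes)
print(f"F({df_num}, {df_err}) = {f_stat:.2f}")
```

For these hypothetical values the pooled error implies a variance of about 0.111 for the difference between marginal means, whereas summing the separate cell terms s²/n and dividing by k² gives about 0.217. The pooled estimate is roughly half as large, which is the mechanism behind the nonconservative bias described for negative relations between sample sizes and variances.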
We have chosen to use the unweighted means calculations for the Monte Carlo work reported because of the computational efficiency required for the tens of thousands of analyses.

RESULTS

To illustrate the bias in conventional analysis of variance and to evaluate the protection against that bias afforded by the Welch (1947) t test, we consider a 4 x 2 randomized blocks or treatments x levels design. The greater the disparity between cell sample sizes and/or population variances, the greater the distortion in conventional analysis of variance results. Thus, it is important to consider differences that are of realistic magnitude. Accordingly, sample sizes which differ between levels or between cells within levels by a 3:1 ratio were considered. Population variances that differ in a 4:1 ratio
(2:1 standard deviation ratio) were considered. In studies where levels result from post hoc stratification on a concomitant variable, cell sample size differences larger than 3:1 are common. Standard deviations that differ 2:1 are not uncommon, although such observed differences clearly suggest true population differences. The unequal sample sizes (3:1) and unequal population variances (4:1) were correlated positively or negatively as described in Cases I through VI above.

TABLE 1
RELATIVE FREQUENCIES OF TYPE I ERRORS

                                  Cell sizes (n1, n2) at Levels 1-4       α =
Case and test                     1       2       3       4          .10      .05      .02      .01
Case I: Negative Relation         5, 5    15, 15  5, 5    15, 15
  Conventional ANOVA                                                 .23898   .16056   .09504   .06499
  Welch's t, individual levels
  Welch's t, across levels
Case II: Positive Relation        5, 5    15, 15  5, 5    15, 15
  Conventional ANOVA
  Welch's t, individual levels
  Welch's t, across levels
Case III: Negative Relation       5, 15   5, 15   5, 15   5, 15
  Conventional ANOVA
  Welch's t, individual levels
  Welch's t, across levels
Case IV: Positive Relation        5, 15   5, 15   5, 15   5, 15
  Conventional ANOVA
  Welch's t, individual levels
  Welch's t, across levels
Case V: Negative Relation         5, 15   15, 5   5, 15   15, 5
  Conventional ANOVA
  Welch's t, individual levels
  Welch's t, across levels
Case VI: Positive Relation        5, 15   15, 5   5, 15   15, 5
  Conventional ANOVA
  Welch's t, individual levels
  Welch's t, across levels
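A sampling experiment of the kind summarized in Table 1 can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' program: it uses NumPy's random generator rather than the IMSL generator, 2,000 rather than 10,000 data sets to keep the run short, and the function name `simulate_type1`, the seed, and the dictionary keys are hypothetical. The Welch calculations follow Eqs. 1, 3, and 4 as given in the Method section, and two-sided tests are used throughout.

```python
import numpy as np
from scipy import stats


def simulate_type1(sizes, sds, n_reps=2000, alpha=0.05, seed=0):
    """Estimate Type I error rates for three tests of the treatment main effect.

    sizes and sds are k x 2 arrays of cell sample sizes and population SDs.
    All population means are zero, so every rejection is a Type I error.
    """
    rng = np.random.default_rng(seed)
    sizes = np.asarray(sizes)
    sds = np.asarray(sds, dtype=float)
    k = sizes.shape[0]
    rejections = {"anova": 0, "welch_across": 0, "welch_levels": 0}
    for _ in range(n_reps):
        means = np.empty((k, 2))
        variances = np.empty((k, 2))
        for i in range(k):
            for j in range(2):
                y = rng.normal(0.0, sds[i, j], sizes[i, j])
                means[i, j] = y.mean()
                variances[i, j] = y.var(ddof=1)
        diff = means[:, 0].mean() - means[:, 1].mean()
        v = variances / sizes  # estimated sampling variance of each cell mean

        # Conventional unweighted means ANOVA with pooled error.
        df_err = (sizes - 1).sum()
        ms_err = ((sizes - 1) * variances).sum() / df_err
        f = diff ** 2 / (ms_err * (1.0 / sizes).sum() / k ** 2)
        rejections["anova"] += int(stats.f.sf(f, 1, df_err) < alpha)

        # Single Welch t across the k levels (Eq. 3).
        t_across = diff / (np.sqrt(v.sum()) / k)
        df_across = v.sum() ** 2 / (v ** 2 / (sizes - 1)).sum()
        p_across = 2 * stats.t.sf(abs(t_across), df_across)
        rejections["welch_across"] += int(p_across < alpha)

        # Welch t at each level (Eq. 1), combined as a unit normal z (Eq. 4).
        t_lev = (means[:, 0] - means[:, 1]) / np.sqrt(v.sum(axis=1))
        df_lev = v.sum(axis=1) ** 2 / (v ** 2 / (sizes - 1)).sum(axis=1)
        z = stats.norm.isf(stats.t.sf(t_lev, df_lev))  # one-sided p values to z scores
        z_bar = z.sum() / np.sqrt(k)
        rejections["welch_levels"] += int(2 * stats.norm.sf(abs(z_bar)) < alpha)
    return {name: count / n_reps for name, count in rejections.items()}


# Case I pattern: small levels (n = 5) have variance 4, large levels (n = 15) have variance 1.
sizes = [[5, 5], [15, 15], [5, 5], [15, 15]]
sds = [[2.0, 2.0], [1.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
rates = simulate_type1(sizes, sds)
print(rates)
```

Under this negative relation the conventional test would be expected to reject well above the nominal rate, while the two Welch procedures should stay much closer to it. Exact values depend on the seed and the number of replications, so the sketch is meant to show the procedure rather than to reproduce the tabled figures.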

Sampling experiments for the six conditions of unequal sample sizes and unequal variances were accomplished using normally distributed data produced by the IMSL normal random deviate generator. Ten thousand hypothetical data sets with no true treatment effect present were analyzed to provide estimates of Type I error probabilities under each condition. Each
data set was subjected to conventional unweighted means analysis of variance, to a single Welch's t calculated across the k levels (Eq. 3), and to Welch's t calculated separately at each of the k levels (Eqs. 1 and 4). The actual relative frequencies of Type I errors are shown in Table 1 for nominal alpha levels 0.10, 0.05, 0.02, and 0.01. The pattern of sample sizes for each 4 x 2 design is shown on the left, and the direction of correlation between sample sizes and population variances (which differed in a 4:1 ratio) is indicated for each case.

DISCUSSION

The bias in conventional analysis of variance tests is quite pronounced even in the face of moderate differences in cell sample sizes and population variances. When the larger population variances are associated with larger sample sizes, the analysis of variance results are seriously conservative. When the smaller population variances are associated with the larger sample sizes, the analysis of variance results are seriously nonconservative. The bias appears so great that one would be advised not to use conventional analysis of variance to analyze data from unequal cell frequency designs if the observed cell variances even suggest heterogeneity.

The aim of this article has not been simply to emphasize the considerable inadequacy of conventional analysis of variance tests of significance in cases where both sample sizes and population variances differ. Two acceptable alternatives for testing the significance of the marginal effect of two treatments across k levels of a concomitant factor have been presented. While this is a limited case, the k x 2 design is often used to control for the possible effects of nuisance variables which are of little intrinsic interest in their own right. The well-known Mantel-Haenszel (1959) procedure is used to control for the effects of nuisance variables in much the same way in the analysis of categorical data.
Even if one is not concerned about the significance of effects from a concomitant factor, the consistency of treatment effects across levels of the k x 2 design may be a matter of concern. The conventional analysis of variance test of interaction effects suffers the same problems attendant tests of main effects. For k > 2, Welch's t does not provide directly a comprehensive test of interactions. Nevertheless, although it has not been a concern of this article, the variance among Welch t statistics across k levels can be evaluated for significance. If one is going to convert t to z for purpose of calculating the $\bar{z}$ test of marginal effect for treatments (Eq. 4), the sum of squared deviations of the z scores can be evaluated for significance as $\chi^2$ with k - 1 degrees of freedom (Rosenthal & Rubin, 1979).

Both procedures for using Welch's (1947) t statistic to test the marginal treatment effect across levels of a concomitant factor appear highly superior
to a conventional two-way analysis of variance which does not take into consideration the unequal variances. The Welch test calculated directly across the k levels of the two-way design (Eq. 3) appears superior in its close approximations to the specified alpha levels. The $\bar{z}$ test which combines results from Welch t tests at the separate levels of the k x 2 design appears modestly sensitive to the positive or negative direction of correlation between the unequal small sample sizes and the differential variances within the separate levels. To discount the possibility that these mild distortions arose from the method of combining results from Welch tests at the separate levels (Eq. 4), we did further analyses to confirm that Equation 1, when applied to the small unequal sample sizes in two groups with variances differing in a 4:1 ratio, was responsible for the bias. When the larger variance was associated with the smaller sample size, Equation 1 produced actual Type I error inflation at the individual levels of the k x 2 design approximating the magnitude shown in Table 1 for the composite $\bar{z}$ test. However, when the same degrees of imbalance were retained, but sample sizes were increased fivefold, the actual Type I error probabilities converged on the specified alpha levels.

In summary, the Welch procedure provides an effective correction for a serious problem that arises when unequal variances combine with unequal sample sizes to render conventional analysis of variance inappropriate. Considering the documented magnitude of the problem and the fact that one seldom, if ever, has basis for confidence in the true homogeneity of (population) variances, routine use of a Welch-type correction might be recommended for the analysis of k x 2 designs with unequal cell frequencies. It is the correlation between unequal cell frequencies and unequal variances, not simply heterogeneity of variance, that exaggerates the problem.

REFERENCES

FISHER, R. A. (1970) Statistical methods for research workers. (14th ed.) New York: Hafner.
GLASS, G. V., PECKHAM, P. D., & SANDERS, J. R. (1972) Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Review of Educational Research, 42, 237-287.
MANTEL, N., & HAENSZEL, W. (1959) Statistical aspects of the analysis of data from retrospective studies. Journal of the National Cancer Institute, 22, 719-748.
ROSENTHAL, R. (1978) Combining results from independent studies. Psychological Bulletin, 85, 185-193.
ROSENTHAL, R., & RUBIN, D. B. (1979) Comparing significance levels of independent studies. Psychological Bulletin, 86, 1165-1168.
WELCH, B. L. (1938) The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350-362.
WELCH, B. L. (1947) The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34, 28-35.

Accepted April 17, 1995.
