Current Directions in Psychological Science, 21(6), 409–412. © The Author(s) 2012. Reprints and permission: sagepub.com/journalsPermissions.nav. DOI: 10.1177/0963721412459512. http://cdps.sagepub.com

Replication-Extension Studies

Douglas G. Bonett
Iowa State University

Abstract

Replication-extension studies combine results from prior studies with results from a new study specifically designed to replicate and extend the results of the prior studies. Replication-extension studies have many advantages over the traditional single-study designs used in psychology: Formal assessments of replication can be obtained, effect sizes can be estimated with greater precision and generalizability, misleading findings from prior studies can be exposed, and moderator effects can be assessed.

Keywords: effect size, meta-analysis, moderator analysis, prior information, research integrity

Replication-extension studies combine and compare results from one or more prior studies with results from a new study. The new study is specifically designed to replicate and extend the results of the prior studies. One advantage of replication-extension studies over traditional single-study designs is the potential to obtain a more precise estimate of some effect-size measure. (Effect-size measures, such as correlations and mean differences, reflect the magnitude of the relation between an explanatory variable and an outcome variable.) Estimating effect sizes with greater precision will become more important in the future as psychological researchers move away from reporting results dichotomously as "significant" or "nonsignificant" and toward reporting confidence intervals for effect sizes. If confidence intervals are reported, the main findings may need to be described in a more restrained and tentative tone, because the confidence intervals may be very wide. Wide confidence intervals indicate that the results are not as definitive as a dichotomous significance-testing conclusion would imply. Specifically, a wide confidence interval might indicate that the effect size is very small or, possibly, very large. Replicating a prior study and combining the new results with those of one or more prior studies can provide a narrower, and hence more informative, confidence interval for the effect size of interest.
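To make the precision gain concrete, the following Python sketch computes a large-sample 95% confidence interval for a mean difference from a single new study and for the simple average of that difference and a prior study's difference. All summary statistics are hypothetical, and the equal-weight averaging shown here is only a generic illustration of the idea, not the specific interval methods discussed later in this article.

import math

# Hypothetical summary statistics (mean, SD, n per group); these are
# invented numbers, not results from any actual study.
prior_a = (520.0, 60.0, 30)   # prior study, treatment A
prior_b = (495.0, 55.0, 30)   # prior study, treatment B
new_a = (515.0, 58.0, 40)     # new replication, treatment A
new_b = (493.0, 62.0, 40)     # new replication, treatment B

def mean_diff(group_a, group_b):
    """Mean difference (A - B) and its estimated variance,
    without assuming equal group variances."""
    (ma, sda, na), (mb, sdb, nb) = group_a, group_b
    return ma - mb, sda**2 / na + sdb**2 / nb

z = 1.96  # critical value for a 95% large-sample interval

d_prior, v_prior = mean_diff(prior_a, prior_b)
d_new, v_new = mean_diff(new_a, new_b)

# Interval based on the new study alone
print(f"new study only: {d_new:.1f} +/- {z * math.sqrt(v_new):.1f} ms")

# Interval for the equal-weight average of the two study effect sizes;
# the variance of the average of two independent estimates is (v1 + v2)/4.
d_avg = (d_prior + d_new) / 2
se_avg = math.sqrt(v_prior + v_new) / 2
print(f"prior + new:    {d_avg:.1f} +/- {z * se_avg:.1f} ms")
# With these invented numbers, the combined interval is noticeably
# narrower than the interval from the new study alone.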

Increasing the generalizability of statistical results is another advantage of replication-extension studies over traditional single-study designs. This advantage may not be obvious to researchers who are not fully aware of a fundamental limitation of statistical inference: Statistical inference is a delicate mechanism for generalizing information from a sample to the specific study population from which the sample was randomly selected (or assumed to be randomly selected). In most psychological studies, the study population is a small population of undergraduate students enrolled in certain introductory courses at a particular university. Inferential statistical methods use information from the random sample to make specific types of statements about the effect size in the study population. These statements are usually in the form of a hypothesis-testing result (e.g., "The population mean reaction time is greater under treatment A than under treatment B") or a confidence-interval result (see Note 1) (e.g., "The population mean reaction time under treatment A is 10.3 to 40.5 ms greater than under treatment B"), and both types of statements are made with a specified degree of uncertainty (i.e., the level of confidence for the interval estimate or the probability of a Type I error for the hypothesis test). Nonstatistical arguments are then required to generalize from the study population to a larger target population of greater interest. In studies of psychophysiological and basic cognitive processes, these nonstatistical arguments are often so simple and obvious that they are not stated in the research report. However, in studies of higher-order cognitive processes or social behavior, the results for a particular study population may not automatically generalize to the desired target population; in these cases, it is essential that the limitations of the statistical inferential results be clearly stated in the research report. Effect-size results that might vary across certain types of study populations should be replicated in other study populations to determine the range of populations for which the claimed results can be extended.

Corresponding Author: Douglas G. Bonett, Department of Psychology, University of California, Santa Cruz, CA 95064 E-mail: [email protected]

If a replication-extension study suggests that the effect sizes are similar across two or more study populations, the effect-size estimates can be combined to obtain a more precise estimate that will statistically generalize to the multiple study populations. Results that apply to a wide variety of study populations usually have greater scientific value than results that apply only to a single study population.

Finding similar effect-size values across study populations can be a desirable outcome of a replication-extension study, but finding interesting differences in effect sizes across study populations can be equally important. An effect size that differs substantially across two or more study populations suggests the existence of one or more moderator variables (i.e., variables that influence the magnitude of the effect size). An important use of replication-extension studies is to assess the degree to which the effect size depends on certain aspects of the research design (e.g., laboratory setting, participant instructions, experimental materials) or varies across study populations that differ in terms of specific demographic features. The discovery of moderator variables can suggest new and more general theories that can then be examined in future studies.

Replication evidence is the gold standard by which scientific claims are evaluated, yet replication research is rare in psychology. Perhaps the lack of such research is a byproduct of psychology's traditional emphasis on significance testing combined with the common practice of describing statistical results as if they automatically generalize to a large target population. If one study reports a "significant" effect and implies that the finding applies to a large target population, it may seem that there is nothing to be gained by replicating the study. However, when the results are presented in the form of a confidence interval, along with a more accurate description of the specific population to which the results apply, a replication study that extends the results to another study population and obtains a more precise estimate of the effect size should be viewed as an important scientific contribution.

Another reason for the lack of replication research in psychology might be that original research is perceived as being more important than replication research. This concern can be addressed with replication-extension studies that not only provide replication evidence but also extend the results of prior studies in new and theoretically important directions.

If replication-extension studies were to become standard practice, fraud and questionable research practices would be discouraged, because researchers would know that their results could be included in a future replication-extension design. Although studies with falsified data are rare, questionable research practices are all too common (John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011). Studies that used falsified data or questionable research practices will be exposed if replication results show effect sizes substantially smaller than those reported in the original study.


Statistical Methods for Replication-Extension Studies

Meta-analytic statistical methods that combine and compare results from two or more studies, such as those described in Borenstein, Hedges, Higgins, and Rothstein (2009), could be used to analyze summary results from replication-extension studies. However, the classical meta-analytic methods, known as fixed-effect and random-effects methods, rely on strong and unrealistic assumptions: for instance, that effect sizes are homogeneous across study populations (for fixed-effect methods) or that the effect sizes are a random sample from a normally distributed superpopulation (for random-effects methods). Alternative statistical models have been developed for combining and comparing summary results from multiple studies (Bonett, 2008, 2009, 2010; Bonett & Price, 2012) that do not assume effect-size homogeneity or a random sample of effect sizes from a normally distributed superpopulation. These alternative statistical methods are computationally simple and have been developed for several different types of parameters, such as Pearson correlations, Spearman correlations, and partial correlations (Bonett, 2008); linear contrasts of means and unstandardized and standardized mean differences for both independent-samples and paired-samples designs (Bonett, 2009); alpha reliabilities (Bonett, 2010); and linear contrasts of proportions and measures of association, agreement, and diagnostic test performance for dichotomous responses (Bonett & Price, 2012). These alternative methods are ideally suited to the analysis of replication-extension studies, for which prior studies are deliberately selected on the basis of overall quality, characteristics that provide information regarding possible moderator effects, and appropriateness for combining with the results of a new study.

One purpose of a replication-extension study is to assess the effect of factors that moderate the magnitude of the effect size. The classical meta-analytic approach to moderator assessment relies almost exclusively on significance-testing methods. These significance-testing methods are routinely misused; specifically, "nonsignificant" results are incorrectly interpreted as support for the null hypothesis of no moderator effect, and "significant" results are incorrectly interpreted as evidence of an important moderator effect. Confidence intervals have the advantage of providing useful information regarding both the form and the magnitude of moderator effects, and it is now possible to use confidence-interval methods to assess a wide range of moderator effects.

To assist researchers in using the alternative statistical methods for replication-extension studies, I have developed free open-source SAS programs (written in PROC IML and organized in a single Microsoft Word file), which are available upon request. These SAS programs implement all of the methods given in Bonett (2008, 2009, 2010) and Bonett and Price (2012). The SAS programs are easy to use: the user simply copies the appropriate code from the Word file, pastes it into the SAS command editor, and then replaces the example data with the data to be analyzed.
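As an illustration of moderator assessment with a confidence interval rather than a significance test, the following Python sketch compares a Pearson correlation across two study populations using the standard Fisher z transformation. The correlations and sample sizes are hypothetical, and this textbook large-sample approach is shown only as a sketch; it is not one of the interval estimators of Bonett (2008) or the SAS programs described above.

import math

# Hypothetical results from two study populations (invented numbers)
r1, n1 = 0.45, 120   # study population 1
r2, n2 = 0.25, 150   # study population 2

# Fisher z transformation; the large-sample variance of the transformed
# correlation is approximately 1 / (n - 3).
z1, var1 = math.atanh(r1), 1.0 / (n1 - 3)
z2, var2 = math.atanh(r2), 1.0 / (n2 - 3)

# 95% confidence interval for the difference in the Fisher-z metric
diff = z1 - z2
se = math.sqrt(var1 + var2)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

print(f"difference of transformed correlations: {diff:.3f}")
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
# An interval that clearly excludes zero points to a possible moderator;
# a wide interval that includes zero is simply inconclusive.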

Example

Burger (2009) conducted a study that replicated and extended a version of Milgram's famous 1963 study of obedience, in which participant "teachers" were led to believe they were giving near-fatal electrical shocks of 450 volts to participant "learners." Milgram's original study could not be replicated today because of current standards concerning the ethical treatment of human subjects. However, Burger noticed that for a later study, which used male participants, Milgram had reported the proportion of participant teachers who were willing to administer 150-volt shocks. Burger was able to obtain approval to conduct a 150-volt version of Milgram's study at his university. Burger extended Milgram's study by using both male and female participants, and further extended it to include a "modeled-refusal" condition in which participants were paired with a "teacher" confederate who refused to continue administering shocks past 90 volts.

Table 1 presents a diagram of Burger's replication-extension study, with each "P" representing a population parameter value. In the Milgram and Burger studies, the parameters are the proportions of all participants in the study populations who would have been willing to administer shocks of 150 volts. Burger used a traditional test of significance to compare P1 with (P2 + P3)/2, although a comparison of P1 and P2 would have been a more appropriate measure of replication. After obtaining a nonsignificant result, Burger concluded that "average Americans react to this laboratory situation today much the same way they did 45 years ago" (p. 1). Reporting a confidence interval for P1 − P2 would be the preferred method of assessing a replication of Milgram's finding.

Burger reported a nonsignificant difference in population proportions between his one-teacher and two-teacher conditions and concluded that there was no difference between these two conditions. This conclusion was unwarranted because failure to reject a null hypothesis cannot be used as evidence that the null hypothesis is true (though this is a common type of reporting error in psychological research). If Burger had computed a 95% confidence interval for (P2 + P3)/2 − (P4 + P5)/2 using one of the methods given in Bonett and Price (2012), he would have found that the interval included zero and was wide. Instead of concluding that there was "no difference" between the two conditions, Burger should have stated that the results were "inconclusive," because the difference in population proportions could have been small or, possibly, quite large.

Burger also found a nonsignificant effect of gender and concluded that men and women did not differ in their rates of obedience. Again, failure to reject the null hypothesis of a zero gender effect does not imply that the null hypothesis is true. If Burger had computed a 95% confidence interval for the effect of gender, (P2 + P4)/2 − (P3 + P5)/2, it would have included zero and it would have been wide.

Table 1. Population Parameters in Milgram (1963) and Burger (2009)

                    One-teacher condition      Two-teachers condition
Study               Male        Female         Male        Female
Milgram (1963)      P1          —              —           —
Burger (2009)       P2          P3             P4          P5

Note: P = population parameter.

Table 2. Population Parameters in Milgram (1963) and Hypothetical Replication Studies

                        One-teacher condition      Two-teachers condition
Study                   Male        Female         Male        Female
Milgram (1963)          P1          —              —           —
Replication Study 1     P2          P3             —           —
Replication Study 2     P4          P5             P6          P7

Note: P = population parameter.

Burger needed a larger sample size to show that his findings were similar to those of Milgram and that the effects of gender and modeled refusal were small.

In retrospect, it would have been better if Burger had conducted a simpler study designed to replicate Milgram's 150-volt finding with an extension to include only gender as an additional factor. With this simplified design, Burger's sample of 70 participants would have provided more definitive replication information and a more precise estimate of the gender effect size. A future study could then attempt to replicate this gender effect, obtain additional evidence about the replicability of Milgram's finding, and also extend the design to include a modeled-refusal condition.

Table 2 presents a schematic for the suggested series of replication-extension studies. In Study 1, a confidence interval for P1 − P2 would provide replication evidence for Milgram's finding. If the confidence interval suggested that the value of P1 − P2 was small, a confidence interval for (P1 + P2)/2 − P3 would provide an assessment of the effect of gender using data from both Study 1 and Milgram's study. In Study 2, a confidence interval for (P4 + P2)/2 − P1 would provide additional evidence for the replicability of Milgram's finding. A confidence interval for (P2 − P3) − (P4 − P5) would provide replication information about the effect of gender. Assuming the gender effect in Study 1 was replicated in Study 2, a confidence interval for (P2 + P3 + P4 + P5)/4 − (P6 + P7)/2 would provide information about the direction and magnitude of the modeled-refusal effect based on all available data from Study 1 and Study 2.

This line of research would not end with Study 2. For instance, a third study might replicate the effects of gender and modeled refusal using college students in another country to assess a possible moderating effect of cultural differences. Further, factors shown to have small effect sizes could be dropped from future studies. For instance, if the results of Study 3 suggested that the effect of modeled refusal was small, a fourth study might partially replicate Study 3 by dropping the modeled-refusal condition and adding age as a factor by comparing college students with older adults.
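To illustrate why a wide confidence interval should be read as inconclusive rather than as evidence of no effect, and why a larger sample gives more definitive replication information, the following Python sketch computes a 95% adjusted Wald interval for a difference between two independent proportions at two per-condition sample sizes. The obedience rates and sample sizes are hypothetical and are not data from Milgram (1963) or Burger (2009), and the adjusted Wald interval is only a simple stand-in for the methods in Bonett and Price (2012).

import math

def diff_prop_ci(y1, n1, y2, n2, z=1.96):
    """Adjusted Wald 95% CI for p1 - p2: one extra success and one
    extra failure are added to each group before estimating."""
    p1 = (y1 + 1) / (n1 + 2)
    p2 = (y2 + 1) / (n2 + 2)
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    d = p1 - p2
    return d - z * se, d + z * se

# Hypothetical obedience rates of about 80% and 70% in two conditions
for n in (35, 350):   # per-condition sample sizes
    lo, hi = diff_prop_ci(round(0.8 * n), n, round(0.7 * n), n)
    print(f"n = {n:3d} per condition: 95% CI for p1 - p2 = "
          f"[{lo:+.2f}, {hi:+.2f}]")
# With the small n, the interval is wide and includes zero (inconclusive);
# with the larger n, it is narrow enough to say something definite about
# the size of the difference.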

Conclusion

Psychological research methods have evolved in important ways over the past 100 years, and replication-extension studies are a logical next step. Replication-extension studies support a cumulative research process whereby results are subjected to repeated replication checks, modifications of theories are proposed, and those modifications are subsequently checked through replication and further modified. The use of replication-extension designs will improve the quality of published psychological research and will enhance psychology's scientific status.

Recommended Reading

Mischel, W. (2008). The toothbrush problem. APS Observer, 21(11). Retrieved from http://www.psychologicalscience.org/index.php/publications/observer/2008/december-08/the-toothbrush-problem.html. A discussion of the causes and ramifications of psychological research that focuses excessively on novel theories.

Mischel, W. (2009). Becoming a cumulative science. APS Observer, 22(1). Retrieved from http://www.psychologicalscience.org/index.php/publications/observer/2009/january-09/becoming-a-cumulative-science.html. A discussion of how current practice in psychology is undermining the building of a cumulative science.

Medin, D. L. (2012). Rigor without rigor mortis: The APS Board discusses research integrity. APS Observer, 25(2). Retrieved from http://www.psychologicalscience.org/index.php/publications/observer/2012/february-12/scientific-rigor.html. An article that summarizes ways to improve the quality of psychological research.

Neuliep, J. W. (1991). Replication research in the social sciences. Newbury Park, CA: Sage. A book that discusses the importance of replication research and gives several examples of published replication studies.

Declaration of Conflicting Interests

The author declared that he had no conflicts of interest with respect to his authorship or the publication of this article.

Note

1. To interpret a 100(1 − α)% confidence interval computed for a sample of size n, consider the fact that 100(1 − α)% of all possible samples of size n from a specific study population will produce intervals that capture the unknown parameter value (e.g., a population correlation or a population mean difference). If a researcher takes one random sample from the study population, the researcher can be 100(1 − α)% confident that the computed confidence interval contains the unknown parameter value.
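The coverage interpretation in this note can be checked with a small simulation. The following Python sketch draws repeated samples from a known normal population and records how often a large-sample 95% interval for the mean captures the true value; the population parameters and sample size are arbitrary choices made for the demonstration.

import random
import statistics

random.seed(1)
true_mean, true_sd, n, reps = 100.0, 15.0, 50, 10_000

covered = 0
for _ in range(reps):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    # large-sample 95% confidence interval for the population mean
    if m - 1.96 * se <= true_mean <= m + 1.96 * se:
        covered += 1

print(f"coverage over {reps:,} samples: {covered / reps:.3f}")  # near .95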

References

Bonett, D. G. (2008). Meta-analytic interval estimation for bivariate correlations. Psychological Methods, 13, 173–189.

Bonett, D. G. (2009). Meta-analytic interval estimation for standardized and unstandardized mean differences. Psychological Methods, 14, 225–238.

Bonett, D. G. (2010). Varying coefficient meta-analytic methods for alpha reliability. Psychological Methods, 15, 368–385.

Bonett, D. G., & Price, R. M. (2012). Meta-analysis methods for 2 × 2 contingency tables (Preprint #2012-3). Retrieved from http://www.stat.iastate.edu/preprint/preprint.html

Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. New York, NY: Wiley.

Burger, J. M. (2009). Replicating Milgram: Would people still obey today? American Psychologist, 64, 1–11.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532.

Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal and Social Psychology, 67, 371–378.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.