Prev Sci (2011) 12:103–117 DOI 10.1007/s11121-011-0217-6

Replication in Prevention Science

Jeffrey C. Valentine & Anthony Biglan & Robert F. Boruch & Felipe González Castro & Linda M. Collins & Brian R. Flay & Sheppard Kellam & Eve K. Mościcki & Steven P. Schinke

Published online: 4 May 2011. © Society for Prevention Research 2011

J. C. Valentine (*), University of Louisville, 309 College of Education, Louisville, KY 40205, USA. e-mail: [email protected]
A. Biglan, Oregon Research Institute, Eugene, OR, USA
R. F. Boruch, University of Pennsylvania, Philadelphia, PA, USA
F. G. Castro, Arizona State University, Tempe, AZ, USA
L. M. Collins, Pennsylvania State University, University Park, PA, USA
B. R. Flay, Oregon State University, Corvallis, OR, USA
S. Kellam, American Institutes for Research, Washington, DC, USA
E. K. Mościcki, American Psychiatric Institute for Research and Education, Arlington, VA, USA
S. P. Schinke, Columbia University, New York, NY, USA

Abstract Replication research is essential for the advancement of any scientific field. In this paper, we argue that prevention science will be better positioned to help improve public health if (a) more replications are conducted; (b) those replications are systematic, thoughtful, and conducted with full knowledge of the trials that have preceded them; and

(c) state-of-the-art techniques are used to summarize the body of evidence on the effects of the interventions. Under real-world demands it is often not feasible to wait for multiple replications to accumulate before making decisions about intervention adoption. To help individuals and agencies make better decisions about intervention utility, we outline strategies that can be used to help understand the likely direction, size, and range of intervention effects as suggested by the current knowledge base. We also suggest structural changes that could increase the amount and quality of replication research, such as the provision of incentives and a more vigorous pursuit of prospective research registers. Finally, we discuss methods for integrating replications into the roll-out of a program and suggest that strong partnerships with local decision makers are a key component of success in replication research. Our hope is that this paper can highlight the importance of replication and stimulate more discussion of the important elements of the replication process. We are confident that, armed with more and better replications and state-of-the-art review methods, prevention science will be in a better position to positively impact public health.

Keywords Replication . Reproducibility . Systematic Review . Meta-Analysis . Effectiveness

The Society for Prevention Research (SPR) created a task force on standards of evidence because it believed that prevention scientists should attempt to speak with one voice about the standards of research that are most appropriate for assisting with the selection of programs1 and policies. These standards are available on the Internet2 and a detailed discussion of them is provided in Flay et al. (2005). The standards of evidence established by SPR call for replications of studies that evaluate the efficacy and effectiveness of prevention programs. However, the Flay et al. article does not fully address the role of replication in prevention research or the ways in which replication should influence decisions regarding the suitability of programs and policies for dissemination. In particular, many investigators wanted guidance about how to answer what appeared to be a straightforward question: “Does Study B replicate Study A?” This paper examines these issues.

1 In this paper we use the terms program, intervention, and treatment interchangeably.
2 See http://www.preventionresearch.org/StandardsofEvidencebook.pdf.

Two considerations had a major impact on this document. First, replication is an ongoing process of assembling a body of empirical evidence that speaks to the authenticity, robustness, and size of an effect. This consideration led to reframing and expanding the question from “Does Study B replicate Study A?” to the broader “What does the available evidence say about the size of the effect attributable to Intervention A?” Often the replication question must be addressed based on only a single replication study, even if tentatively, until additional studies can be completed. This is so because for many preventive interventions, having even one replication represents a major step forward. However, there is an ever-increasing volume of scientific evidence available on prevention, and we hope the future will continue to bring an increase in the rate at which replication studies are carried out. Moreover, as in all areas of scientific inquiry, more information is always better. Thus, we suggest that it is most productive to take the long view and consider all available evidence when determining the extent to which a program has been shown to replicate. This raises the critical question of how to arrive at an evaluation of the information from all available studies.
Methods are needed that can incorporate information from as few as two studies, where one is an original study and one is a replication, to a much larger set of studies. We review different approaches for accomplishing this. One approach we find particularly useful is to borrow some basic principles from meta-analysis. Most behavioral scientists think of meta-analysis primarily within the context of reviews of large numbers of studies, but more generally it is a well-established procedure for synthesizing information, even from as few as two studies—exactly what is needed when making decisions about replication.

The second point we noted is that, when stated in adequately precise terms, questions concerning replication often vary considerably in what is actually of interest. For example, sometimes the question is whether a program that was shown to be efficacious in one population shows similar efficacy in a different population; sometimes the question is whether the program is robust to changes in how it is implemented; and sometimes the questions motivating a replication study are not specifically stated or involve an unsystematic combination of replications across several different dimensions. Because of this complexity, we did not feel prepared to issue standards of replication that would be comparable in specificity to the standards of evidence outlined in Flay et al. (2005). Instead, this article explores the reasons for conducting replication research, describes the types of research that can be considered replications, and considers factors that could influence the extent of replication research.

We approached these issues from a public health perspective. The ultimate goal of prevention research is to affect the well-being of a population. Whether and to what extent replication research is needed and the relative merit of a particular type of replication should be evaluated in terms of the degree to which it contributes to affecting outcomes in entire populations. This should not be taken to imply, however, that prevention research is atheoretical or that replication is separable from the development of an empirically based understanding of human behavior. As we explain below, replication is fundamental to our confidence in empirical relations and contributes to our understanding of the factors relevant to effective and reliable prevention efforts. We hope that this article serves as the beginning of a broader discussion of the role of replication in prevention science.

The Context for Replication Research

That research results ought to be reproducible is one of the foundations of all science. Reproducibility implies that (a) the specifics of study design and implementation are reported at a level of detail that allows other researchers to competently repeat the experiment, and (b) the results of studies of the same phenomenon ought to be equivalent. A person unfamiliar with this process may expect that if a study is conducted and then exactly repeated, the results of the second study should be identical to those of the first. According to this view, if the results of the studies are not identical, then it indicates a problem in one or perhaps both of the studies. However, this is not the view of scientists even in the so-called “hard” sciences. Hedges (1987), for example, noted that techniques for combining the results of multiple experiments have a long history in physics; these techniques were made necessary by the fact that well-implemented experiments in physics can yield results that differ from one another.

Perhaps the most pervasive reason for differences in study results is that it is very difficult to exactly repeat even a simple experiment (Schmidt 2009). Authors, for example, often cannot report every detail of an experimental procedure, in part because they are not fully aware of some of the details (i.e., some exist as implicit knowledge; Shadish et al. 2002, pp. 4-5). Therefore, these details will not be reported and might be enacted differently in a replication trial. Even very subtle differences in environmental conditions, sample characteristics, and procedures can lead to differences in results. Social researchers face similar problems but are working with less precise measurement methods and probably greater variability in sample characteristics, environmental conditions, and procedures. Further, perhaps due to the historical misunderstandings of the probability statements arising from tests of the null hypothesis (including the so-called replication fallacy; see Gigerenzer 1993), researchers have often operationally defined “equivalent results” as “both studies resulted in the rejection of the null hypothesis.” For a variety of reasons that we will address later in this paper, this is a problematic definition that often makes it more difficult to identify effective treatments.

In addition to the general scientific context, we embed our work on replications in the current trend toward empirically based practice occurring in many fields, including medicine, public health, social welfare, and education. In general, interventions are more likely to be accurately labeled as effective if they have been thoroughly tested, especially if these investigations have occurred across diverse populations and settings. In addition, many interventions targeting similar problems share conceptually similar approaches even if the specifics of the treatment are somewhat different. Such replications are useful for assessing the utility of higher-order principles that can be applied across interventions. As an example, see Stice et al. (2008), whose meta-analysis suggested that dissonance-based approaches to the prevention of eating disorders were more effective than non-dissonance-based interventions.
Replications are therefore a vital part of efforts both to identify effective preventive interventions and to spur development of new interventions based on theoretical and empirical considerations. Our effort to refine the scientific principles relevant to replication occurs in the context of significant progress in the identification of influences that disrupt normative human development. This progress was followed by the design and evaluation of various interventions that proved efficacious in preventing these developmental problems (e.g., Kellam et al. 1999). These developments have spawned concerted efforts to demonstrate the efficacy and then the effectiveness of interventions and then to disseminate them widely. Projects of this sort include the What Works Clearinghouse (WWC) of the U.S. Department of Education’s Institute of Education Sciences (IES),3 the Blueprints Project (Elliott and Mihalic 2004), and the Substance Abuse and Mental Health Services Administration’s National Registry of Evidence-Based Prevention Programs (NREPP).4 Each of these exploits information from replications to make decisions regarding intervention effectiveness. In addition, the work of the Cochrane Collaboration5 (in medicine) and its sibling organization, the Campbell Collaboration6 (in the social sciences), extends this idea by committing to the production of high-quality systematic reviews. We view the further understanding of replication research as a critical step in the evolution of prevention science and practice.

3 See http://ies.ed.gov/ncee/wwc/
4 See http://www.nrepp.samhsa.gov/

Types of Replication Research

Intentional replications can be done for a variety of reasons. Researchers interested in testing the possibility that effects found in one study were due to chance (i.e., whether the presumed cause and the presumed effect do in fact covary) would conduct a statistical replication, in which all fixed effects of the original study are kept the same and a new random sample is drawn for all random effects (Hunter 2001). Note that this type of replication research is a close analogue to the idea of statistical conclusion validity (Shadish et al. 2002). Researchers interested in testing whether relations observed in one study would generalize to conditions not observed in that study would conduct a generalizability replication. This is conducted like a statistical replication except that the target of generalization is changed (e.g., participants are drawn from a new population), and is analogous to Shadish et al.’s conception of external validity. Researchers interested in the effects of variations in program implementation would conduct an implementation replication (Flay 1986). This is a special case of generalizability replication and is conducted like a statistical replication except that some implementation details are either different or are explicitly compared. Finally, researchers interested in testing causal mechanisms underlying an intervention would conduct a theory development replication. This kind of replication is again conducted like a statistical replication, except with changes that are theoretically relevant. For example, researchers might collect data on a putative mediator of the intervention to further understand how the intervention works. Theory development replications are analogous to Shadish et al.’s conception of construct validity, but in some cases might be related to internal validity as well (e.g., if the goal of a replication test is to tease apart the effects of potentially confounding moderators).
These kinds of replications are relatively rare, in part because even simple interventions can be hard to reproduce exactly. In addition, for a variety of reasons, replications are often undertaken with less than full knowledge of the trials that have preceded them. Because of these complexities, the most common replications are probably best termed ad hoc replications, in that they may not be conceptualized as replications per se and vary from one another in multiple known and unknown ways. For example, in Tobler et al.’s (2000) now classic review of school-based drug prevention programs, 47 studies were characterized as testing comprehensive life skills programs. While these studies shared commonalities at an abstract level (e.g., they all taught knowledge and refusal skills), they also differed at an abstract level (e.g., many programs focused on affect but some did not) and at a specific level (in terms of the actual content taught, program intensity, fidelity of implementation, and so on).

When changes to study conditions are not systematic or covary with other changes, the results of a replication study are more difficult to interpret. However, this is not to suggest that ad hoc or unsystematic replications are of little value. Indeed, variation in the specifics of studies—such as the implementation, setting, and participant sampling—is most likely to be a source of consternation when studies are examined one at a time (as in comparing the results of one study to the results of another) and in narrative fashion. However, when multiple, somewhat different studies are treated as data points in a second round of scientific investigation, they can contribute to more confident, general, and properly contextualized guides to decision making if they are considered as part of a rigorous review of the evidence (as they were in Tobler et al.’s work). In other words, the methods used for drawing inferences about the effects of an intervention can make ad hoc replications either more or less difficult to interpret.

5 See http://www.cochrane.org
6 See http://www.campbellcollaboration.org

Important Considerations for Interpreting Replication Research

A natural question concerns the conditions under which it can legitimately be claimed that an intervention’s effects have been replicated. As an example, assume that a study of a preventive intervention has been carried out, and that this study has found positive and statistically significant effects that are large enough to be substantively meaningful. A replication is then carried out. What pattern of results in the replication study would be required for the researchers to conclude that the results of the original study have been replicated? Before attempting to address this question, we raise several points for consideration.

Can One Study be Considered a Replicate of Another?

While the focus of this paper is on methods for determining whether results from one study have been replicated in a second, an equally important and even more challenging issue that we cannot fully address concerns whether it even makes sense to pose this question. That is, there is no reason to ask whether studies have obtained similar results unless those studies are replicates, in some sense of the term. Determining this is not easy. As Hume (1739/1740) argued about identity in general, whether an evolving program will seem to be “the same” or “different” relative to its initial form will depend on one’s perspective. If we focus on the gradual nature of changes, we are more likely to see the program as “the same” as it was before. If, however, we focus on the beginning and end points, we may be more likely to see the programs as “different.” This implies that there is often no single correct answer to the question of whether one program can be considered a replicate of another program, highlighting the difficulty of this judgment.

As much as judgment will always play a role in answering this question, we do believe two strategies might be helpful. First, given sufficient understanding of the theoretical concerns relevant to a particular intervention, one strategy would be to use this understanding as a basis for determining whether critical program elements have changed. If the critical program elements are largely unchanged, then it makes more sense to label subsequent versions of the intervention as “the same” as the original. An alternative view is given by Glass (2000), who argued that the question of identity is actually empirical as opposed to logical. As long as two interventions seem to belong to the same general class of interventions, Glass would argue that the programs are effectively “the same” if their effects on the dependent variable are “the same.” An even stronger version of this view is that one might consider programs to be “the same” if they affect mediating variables in a similar way. Suppose a preventive intervention is found to increase school commitment, and school commitment is found to increase graduation rates.
If this same pattern of results were found in a subsequent study, then it might make more sense to label the second study a replicate of the first (despite any changes to that intervention over time). Of course, if one accepts the logic of this view it is still necessary to determine whether results are similar across studies. We turn now to strategies for addressing this point.

Inferential Framework

When thinking about the question of whether an intervention’s effects have been replicated, perhaps the most important point of emphasis is that the degree to which confident inferences can be made varies as a function of the amount of information that can be brought to bear on the question. Inferences about replication can be made with as few as two studies, but only within a very weak inferential framework. This means that any inferences will be highly tentative, and their value for decision making limited. It is for this reason that we echo Hunter’s (2001) call for a dramatic increase in the number of replications conducted and published. Furthermore, it suggests that policymakers, scholars, and administrators should avoid premature adoption or rejection of interventions that appear to be effective or ineffective within a weak inferential framework. By contrast, due to critical needs for specific interventions to address public health problems, we recognize that there will be times when a decision must be made even though the evidence is limited to a study or two. We believe that in these cases ongoing evaluation should be conducted within such dissemination efforts (Flay et al. 2005). Given that we believe that decisions based on some evidence are likely to be better than decisions based on no evidence, we present some options for approaching evidence when it is limited.


Statistical Framework

In statistics, there is a well-articulated and long-standing tradition for thinking about replications (e.g., Cochran and Cox 1957). Given assumptions about the true population effect size and variance, probabilities of a successful replication can be precisely stated for statistical, generalizability, implementation, and theory development replications. Unfortunately, we have seen that these kinds of replications are relatively rare, and that the most common type of replication involves multiple planned and unplanned differences in study details. This means that understanding the results of replication studies in the real world is often not straightforward.

Important Background Assumptions

Regardless of the synthesis method ultimately used, two critical assumptions are required before further considering the question of whether results have been replicated in a subsequent study. First, we must assume that all studies that have been conducted on an intervention are available for analysis or, failing that, that study availability is unrelated to study outcomes. This assumption is closely tied to the well-known phenomenon of publication bias, which occurs because study authors are less likely to submit—and journal editors and reviewers are less likely to accept for publication—studies that lack statistically significant effects on their primary outcomes. The essential problem is that, all else being equal, less statistical significance means smaller intervention effects. Any method of drawing conclusions about interventions may be biased if studies are censored in this manner. Further, the assumption about study availability extends to the outcomes that were measured in these studies. For a variety of reasons, studies often present results only on measured outcomes that had statistically significant effects. If measured outcomes within studies are censored due to the lack of statistical significance, this can also lead to incorrect inferences (Williamson et al. 2005).

The second assumption is that the quality of the study design and its execution are comparable in the studies being examined. Generally speaking, it is probably inappropriate to use results from one study with relatively good inferential properties and one with relatively weak inferential properties to support inferences about the effects of a particular intervention. For example, as articulated more fully in Flay et al. (2005), due to the strong plausibility of measured and unmeasured group differences on baseline variables in nonequivalent group designs, randomized experiments are preferred for studying the effects of interventions. The studies should be roughly similar in other ways as well (e.g., attrition rates).

Reframing the Question About Replication

We started this section with a fairly common scenario: An intervention is found to have positive effects in one study, and a second study is carried out in order to replicate the finding. The question that follows is whether the results of the second study replicated the results of the first. Unfortunately, this formulation of the question ultimately can prove to be misleading because it focuses attention on the statistical conclusions arising from the two studies. As we noted in the introduction to this paper, an alternate and perhaps better formulation asks, “What does the available evidence say about the size of the effect attributable to Intervention A?” The advantage of this formulation is that it avoids thinking in dichotomous terms about replications (most notably, with respect to the statistical significance of the findings in each study), and instead focuses on the magnitude (i.e., effect size) and likely range (i.e., confidence interval) of intervention effects.

This reframing of the replication question also serves to focus attention on research methods specifically designed to help synthesize the results of multiple studies. Systematic reviewing is now the state of the art in reviewing research. In highlighting the relation between replication research and clinical practice, the Institute of Medicine (2008) wrote, “Systematic reviews provide an essential bridge between the body of research evidence and the development of clinical guidance” (p. 83). A good systematic review involves a variety of techniques designed to enhance the transparency, objectivity, and ultimately the validity of the review process, characteristics notably absent from traditional literature review procedures (e.g., Cooper and Dorr 1995; Egger et al. 2001). After formulating a research question, a systematic review involves a thorough and systematic search for relevant studies (placing inferences at less risk of publication bias), a structured method for extracting data from those studies, a thorough appraisal of the quality of the evidence ideally based on rules set before data collection began, and a synthesis, often statistical, of the evidence itself.
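The final, statistical synthesis step can be made concrete with a small sketch. Assuming two studies that report standardized mean differences with standard errors (all numbers below are hypothetical, not drawn from any study discussed in this paper), a fixed-effect, inverse-variance weighted summary and its confidence interval can be computed as follows:

```python
import math

def fixed_effect_summary(effects, std_errs):
    """Inverse-variance weighted average of effect sizes, with its
    standard error and 95% confidence interval."""
    weights = [1 / se ** 2 for se in std_errs]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci

# Hypothetical numbers: original study d = 0.40 (SE = 0.15),
# replication d = 0.25 (SE = 0.20).
d, se, (lo, hi) = fixed_effect_summary([0.40, 0.25], [0.15, 0.20])
print(f"pooled d = {d:.2f}, SE = {se:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The pooled estimate is simply the precision-weighted average of the two studies, and its confidence interval is narrower than either study's interval alone, which is the sense in which even a single replication sharpens the estimate of an intervention's effect.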

Statistical Options for Assessing the Results of a Small Number of Studies

In the sections below, we consider several different statistical options for assessing the results of a small number of studies. In the discussion, we assume that deferring a decision about the effects of the intervention until more studies can be conducted is not a viable option. We emphasize situations in which only the original study and a single replication are available, because currently this is a scenario faced by many researchers and policy makers. We also discuss how the various options are expected to perform when additional studies are available. Options with good statistical properties will perform better (e.g., will have smaller standard errors, and be more likely to result in the correct conclusion) with more information, all else being equal. As a reminder, the validity of any of these inferential methods depends on the critical background assumptions presented above.

Vote Counting Based on Statistical Significance

Perhaps the most common approach to addressing the question of whether study results have been replicated is to examine the statistical significance of the studies’ primary results. For example, assume Study A found statistically significant results for an intervention. If Study B found statistically significant results on the same outcome that were in the same direction as the original study, Study B could reasonably be said to have replicated the results of Study A. This approach is known as “vote counting.” A variation of the vote counting approach, and one that is particularly arbitrary, involves claiming that an intervention “works” if there are at least two good studies demonstrating that the intervention is more effective than some alternative.
For example, Kirby (2001) reviewed studies of interventions targeting teen pregnancy rates, and labeled the interventions as having “strong evidence of success” if, among other characteristics, the interventions had been found to be effective in at least two studies conducted by independent research teams. Similarly, the Collaborative for Academic, Social, and Emotional Learning (2003) assigned the highest ratings of effectiveness to interventions that demonstrated positive results in at least two studies. Lists of “effective programs” selected using a vote counting criterion are quite common.

Vote counting is based on a pervasive but incorrect belief about the interpretation of the probability values arising from tests of statistical significance. Gigerenzer (1993), using published examples, and Oakes (1986), through a survey, found that probability values are often taken as a direct statement of replicability. Construing a probability statement as a statement of replicability is incorrect. When conducting a test of statistical significance for the mean difference between two intervention conditions, the probability value is the chance of observing a mean difference at least as large as the one observed, given a true null hypothesis (Cohen 1994). Another way to think about the probability value is that it represents the confidence with which we can state that we have correctly identified the direction of the effect. The relationship between the probability value in one study and the likelihood of successful replication in even an exactly replicated second study is therefore not straightforward. If a study rejects the null hypothesis at p=.05, for example, that does not mean that the next statistical replication has a 95% chance of rejecting the null hypothesis (if the population and sample effect sizes are very similar, the probability is actually closer to 50%; see Greenwald et al. 1996; Valentine et al. 2010).

The limitations of vote counting based on statistical significance are therefore well known: The majority of studies must have statistically significant results in order for the claim to be made that an intervention “works.” Unfortunately, in most circumstances when using vote counting it is unacceptably probable that studies will not reach the same statistical conclusion, even if they are estimating the same population parameter (e.g., if the intervention really is effective). For example, if two independent studies are conducted with statistical power of .80 (meaning that each has an 80% chance of correctly rejecting a false null hypothesis), in only 64% of cases will both studies result in a correct rejection of the null hypothesis.
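The arithmetic behind these claims is easy to verify directly. A minimal sketch, using the standard normal approximation to a z-test (the power values are illustrative, not taken from any particular study):

```python
import math

def normal_cdf(x):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rejection_prob(true_z, crit_z=1.96):
    """Chance that a z-test rejects at the two-sided .05 level, given the
    expected z for the true effect (the negligible lower-tail term is ignored)."""
    return 1.0 - normal_cdf(crit_z - true_z)

# A study that just reached p = .05 has an observed z of 1.96. If the true
# effect equals the observed one, an exact replication rejects only about
# half the time, not 95% of the time.
print(rejection_prob(1.96))  # 0.5

# Vote counting requires both of two independent studies to reject:
for power in (0.80, 0.50):
    print(f"power {power}: both reject in {power ** 2:.0%} of cases")
```

Because the two studies are independent, the joint rejection probability is just the product of their individual powers, which is what drives vote counting's high error rate.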
If both studies are conducted with statistical power of .50, then in only 25% of cases will both studies result in a correct rejection of the null hypothesis. As such, because studies are typically not highly powered, in most current real-world contexts, requiring studies to reach the same statistical conclusion is an approach with an unacceptably high error rate (by failing to detect real intervention effects when they exist). In fact, Hedges and Olkin (1985) demonstrated the counterintuitive result that, in many situations common in social research (i.e., interventions with moderate effects investigated in studies with moderate statistical power), vote counting based on statistical significance can actually have less statistical power the more studies are available.

In addition, the statistical conclusions reached in individual studies are to some extent dependent on the assumptions employed when conducting the statistical tests. For example, researchers analyzing one study may choose to conduct a two-tailed test, while those analyzing a second study may choose to conduct a one-tailed test. Even if all other aspects of the studies are exactly the same, this difference in analysis could lead the researchers to different statistical conclusions. For all of these reasons, comparing the statistical significance of the results of studies—while intuitively appealing and simple to implement—is a seriously limited inferential technique that is not well-suited to identifying effective interventions.

Comparing the Directions of the Effects

Another inferential technique that is simple to implement but limited is comparing the directions of the effects observed in both studies. In this case, for example, the researchers would determine that the results “replicated” if both studies produced positive effects. Note that direction is considered without reference to other information such as the size or statistical significance of the intervention’s effect. This would mean that an intervention with no effects on participants could be labeled “effective” if by chance both studies resulted in positive effects, and “harmful” if by chance both studies resulted in negative effects. Most readers will recognize that the essential problem with this approach is that it is a very blunt way to approach the question; with only two studies, this limitation is clear. The approach is, however, consistent with approaches taken by statisticians considering the problem of replicability (e.g., Killeen 2005). Furthermore, unlike vote counting based on statistical significance, the statistical power of this approach does improve as the amount of information increases, even if statistical power is low in the individual studies. This happens because a fair amount of information is contained in the proportion of results that come out in one direction or another.
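The information carried by directions alone can be quantified with a simple sign (binomial) test. This sketch is illustrative and assumes independent studies whose effects each have a 50% chance of being positive under the null:

```python
from math import comb

def sign_vote_p(n_positive, n_studies):
    # One-sided binomial probability of seeing at least n_positive of n_studies
    # effects in the positive direction if direction were pure chance (p = .5)
    return sum(comb(n_studies, k) for k in range(n_positive, n_studies + 1)) / 2 ** n_studies

# With only two studies, both positive occurs 25% of the time by chance alone,
# but 10 of 12 positive results would be hard to attribute to chance:
print(sign_vote_p(2, 2))              # 0.25
print(round(sign_vote_p(10, 12), 3))  # 0.019
```

The contrast between the two calls shows why the approach is blunt with two studies yet gains power as studies accumulate.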
Bushman and Wang (2009) outlined procedures for conducting a vote count of the directions of the effect that takes into account the sample sizes of the individual studies. With a sufficient number of studies, this weighted vote count of directions can provide a reasonable approximation of the underlying population effect size. However, in most real-world situations, where the number of studies is small, its utility for answering questions about replication will be limited.

Comparability of Effect Sizes and the Role of Confidence Intervals

Yet another way of thinking about the question of whether study results have been replicated is to consider the effect sizes in the studies. If the effect sizes are judged to be comparable, then the study results could be said to have been replicated. One could, for example, conduct a statistical test of equivalence. This test requires the scholar to be able to specify “how close is close enough” (that is, to determine the point at which effects are practically the same), and there is usually no non-arbitrary way of doing this. Readers are referred to the work of Seaman and Serlin (1998) for additional information.

Confidence intervals express the likely range of a population effect and, as such, provide information that tests of statistical significance do not. One way to make effect size comparability operational, without invoking an arbitrary rule for deciding how close is close enough, would be to determine whether the mean from an attempted replication fell within the confidence interval of the mean from an original study. Unlike the two vote counting methods, this approach uses much of the statistical information available in the two studies because it takes into account the estimate of the effect size and its uncertainty from one study along with the estimate of the effect size from a second study. Cumming and Maillardet (2006) have shown that this approach has comparatively good inferential properties: On average, 83% of means from a statistical replication should fall within the confidence interval from the original study (for ad hoc replications, the percentage of replication means that will fall within an original confidence interval cannot be specified without invoking additional assumptions). One therefore could argue that an intervention’s effects have been replicated if the effect size from the replication study fell within the first study’s confidence interval. A limitation of this inferential method is that it can be easily applied only when there are just two studies. The statistical technique known as homogeneity analysis can also be used to address the question of whether study effect sizes are comparable (Shadish and Haddock 1994).
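Both the capture check and the 83% figure are easy to operationalize. The sketch below uses hypothetical effect sizes of our own choosing; the closed-form capture rate assumes two equal-sized studies, the setting analyzed by Cumming and Maillardet (2006):

```python
import math

def ci_capture(d_original, half_width, d_replication):
    # Does the replication's effect size fall inside the original study's 95% CI?
    return (d_original - half_width) <= d_replication <= (d_original + half_width)

# For equal-sized studies, the difference of the two sample means is normal with
# standard error sqrt(2) * SE, so the capture probability is
# P(|Z| < 1.96 / sqrt(2)) = erf(1.96 / 2), roughly 83%:
capture_rate = math.erf(1.96 / 2)
print(round(capture_rate, 3))  # 0.834

# Hypothetical pair: original d = 0.80 with CI half-width 0.61, replication d = 0.42
print(ci_capture(0.80, 0.61, 0.42))  # True
```

The closed form makes clear why the rate is 83% rather than 95%: the replication mean carries its own sampling error on top of the original interval's.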
While the statistical power of this test can be low when there are few studies, it is closely related to the idea of determining that effect sizes are comparable if one falls within the confidence interval of the other, and it can be used when more than two studies are available.

Combining Effects Using Techniques Borrowed from Fixed Effect Meta-Analysis

All of the approaches considered so far treat replication as a question of whether the results of two studies “agree” in some sense. As we indicated earlier, another way to frame this question would be to ask, “What does the available evidence say about the size of the effect attributable to Intervention A?” This is the approach taken in meta-analysis when analyzing a mean weighted effect size. This technique has several advantages relative to the other inferential options discussed so far. Specifically, in meta-analysis, effects from larger studies are assigned proportionally more weight than effects from smaller studies. Further, meta-analysis focuses attention on both the weighted average effect size and its confidence interval. These issues are addressed less well or not at all by the other inferential techniques we have discussed. This suggests that applying techniques borrowed from fixed effect meta-analysis for combining effect size estimates can be useful in replication research.

In a fixed effect meta-analysis, all studies are presumed to be estimating the same population parameter and to yield sample statistics (i.e., study effect sizes) that differ from one another only because of random sampling error. Inferences arising from a fixed effect meta-analysis are conditional on the studies observed (Hedges and Vevea 1998) and are not influenced by, for example, studies that could have been done but are not in the analysis. These properties make a fixed effect meta-analysis approach attractive for combining effect size estimates produced by the four systematic replication types, namely statistical, generalizability, implementation, and theory development replications. Notably absent from this list are ad hoc replications. Given that ad hoc replications are the most common kind of replication, this is a serious limitation. In addition, the fixed effect meta-analysis approach will provide a relatively poor estimate of the mean effect size, indicated by a very large confidence interval, in most cases when there are few studies.

Combining Effects Using Techniques Borrowed from Random Effects Meta-Analysis

In the random effects meta-analytic approach, effects are presumed not to share the same underlying effect size. Instead, these effects are assumed to vary due to known and unknown characteristics of the studies. Stated more formally, studies yield sample statistics that vary from one another due to random sampling error and to other between-study differences.
Compared to the fixed effect approach, inferences arising from a random effects meta-analysis are not conditioned as tightly on the observed studies (Hedges and Vevea 1998). One important implication of the choice between a fixed and a random effects meta-analysis is that the confidence intervals arising from a random effects approach will often be larger, and will never be smaller, than those that would have been generated by a fixed effect approach. Although statistical power can be higher in a random effects approach relative to a fixed effect approach, it is usually lower, and often non-trivially so. This is a major problem when there are few studies, since power is likely to be very low.

A related problem concerns the estimation of the extent to which individual study level effects vary from one another. Essentially, if study effects differ by no more than would be expected given sampling error alone, a random effects approach reduces to a fixed effect approach. If study level effect sizes differ by more than would be expected given sampling error, then this information (the between-studies variance component) adds uncertainty to the estimation of the weighted average effect size. However, the estimation of whether effect sizes differ by more than would be expected given sampling error alone is poor when there are few studies. Despite these limitations, a random effects meta-analytic approach may be the most appropriate strategy for dealing with ad hoc replications.

Using Multiple Inferential Strategies

Because all of the alternatives for addressing the question of replication will be limited when few studies are available, one approach is to employ all of them when considering the state of the cumulative evidence generated from ad hoc replications. We provide several examples of how this might be done below. One caution: In many cases, it is not reasonable to expect all of the ways of thinking about the question of replication to agree with one another. However, valuable information can be gained from examining the nature and pattern of agreements as a whole.

Table 1 presents simulated data illustrating how all of these strategies might be used. For the purposes of this example, assume that an intervention effect of δ=.10 is the smallest effect size that could be considered important in this research area and for this outcome; observed effects smaller than this would be considered too small to be meaningful. The simulation data were randomly generated from a distribution that has a population effect size of δ=.20, which would indicate a small but real effect for the intervention. Sample sizes range from 10 to 44 per group. These cases were chosen as exemplars because they illustrate a range of inferential challenges. As can be seen in the discussion below, some of the resulting judgments are relatively easy to make, while others are much more difficult.

Case 1

Two very small studies were conducted.
Not surprisingly, neither study had statistically significant results, and the effects from the two studies were in opposite directions. The effect size from the first study did not fall within the 95% confidence interval of the second study. Further, neither a fixed effect nor a random effects meta-analytic approach revealed statistically significant effects. It is therefore difficult to argue that the intervention being studied is likely to have positive effects on participants given the available evidence. One caution, however, is that both the fixed and the random effects approaches resulted in very large confidence intervals around the weighted average effect size. As such, the most reasonable conclusion is that we know very little about the effects of this intervention, rather than that there is evidence that the intervention is ineffective.
Table 1 Example of strategies for assessing the results of multiple studies

| Case | Study 1 d±95% CI | Study 2 d±95% CI | Do the directions agree? | Pattern of statistical significance | Study 2 effect within Study 1 CI? | Fixed effect analysis | Random effects analysis |
|------|------------------|------------------|--------------------------|-------------------------------------|-----------------------------------|-----------------------|-------------------------|
| 1 | 0.81±0.91 | -0.80±0.91 | No | Neither significant | No | d=.004±.64, p=.99 | d=.005±1.58, p>.99 |
| 2 | 0.80±0.61 | 0.42±0.60 | Yes | Study 1 only | Yes | d=.61±.42, p=.006 | d=.61±.42, p=.006 |
| 3 | 1.27±0.62 | 0.60±0.58 | Yes | Both significant | No | d=.91±.42, p<.001 | |
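The fixed and random effects columns of Table 1 can be reproduced with a short inverse-variance sketch. This is a simplified illustration using the DerSimonian-Laird estimator for the between-studies variance; p-value computation is omitted for brevity:

```python
import math

def meta_summary(effects, half_widths, z=1.96):
    # Inverse-variance weights from 95% CI half-widths (SE = half_width / z)
    ws = [(z / hw) ** 2 for hw in half_widths]
    fixed_d = sum(w * e for w, e in zip(ws, effects)) / sum(ws)
    fixed_hw = z / math.sqrt(sum(ws))

    # DerSimonian-Laird estimate of the between-studies variance tau^2,
    # built from Cochran's Q homogeneity statistic
    q = sum(w * (e - fixed_d) ** 2 for w, e in zip(ws, effects))
    c = sum(ws) - sum(w ** 2 for w in ws) / sum(ws)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)

    # Random effects weights add tau^2 to each study's sampling variance
    ws_r = [1 / ((hw / z) ** 2 + tau2) for hw in half_widths]
    random_d = sum(w * e for w, e in zip(ws_r, effects)) / sum(ws_r)
    random_hw = z / math.sqrt(sum(ws_r))
    return fixed_d, fixed_hw, random_d, random_hw

# Case 2 of Table 1: Q falls below its degrees of freedom, so tau^2 = 0 and
# the random effects summary collapses to the fixed effect one:
f_d, f_hw, r_d, r_hw = meta_summary([0.80, 0.42], [0.61, 0.60])
print(round(f_d, 2), round(f_hw, 2))  # 0.61 0.43
```

The computed half-width comes out near .43, matching the ±.42 entry in Table 1 up to rounding, and the fixed and random summaries coincide exactly as the table shows.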


discussion of them is provided in Flay et al. (2005). The standards of evidence established by SPR call for replications of studies that evaluate the efficacy and effectiveness of prevention programs. However, the Flay et al. article does not fully address the role of replication in prevention research or the ways in which replication should influence decisions regarding the suitability of programs and policies for dissemination. In particular, many investigators wanted guidance about how to answer what appeared to be a straightforward question: “Does Study B replicate Study A?” This paper examines these issues.

Two considerations had a major impact on this document. First, replication is an ongoing process of assembling a body of empirical evidence that speaks to the authenticity, robustness, and size of an effect. This consideration led to reframing and expanding the question from “Does Study B replicate Study A?” to the broader “What does the available evidence say about the size of the effect attributable to Intervention A?” Often the replication question must be addressed based on only a single replication study, even if tentatively, until additional studies can be completed. This is so because for many preventive interventions, having even one replication represents a major step forward. However, there is an ever-increasing volume of scientific evidence available on prevention, and we hope the future will continue to bring an increase in the rate at which replication studies are carried out. Moreover, as in all areas of scientific inquiry, more information is always better. Thus, we suggest that it is most productive to take the long view and consider all available evidence when determining the extent to which a program has been shown to replicate. This raises the critical question of how to arrive at an evaluation of the information from all available studies.
Methods are needed that can incorporate information from as few as two studies, where one is an original study and one is a replication, to a much larger set of studies. We review different approaches for accomplishing this. One approach we find particularly useful is to borrow some basic principles from meta-analysis. Most behavioral scientists think of meta-analysis primarily within the context of reviews of large numbers of studies, but more generally it is a well-established procedure for synthesizing information, even from as few as two studies—exactly what is needed when making decisions about replication.

The second point we noted is that, when stated in adequately precise terms, questions concerning replication often vary considerably in what is actually of interest. For example, sometimes the question is whether a program that was shown to be efficacious in one population shows similar efficacy in a different population; sometimes the question is whether the program is robust to changes in how it is implemented; and sometimes the questions motivating a replication study are not specifically stated


or involve an unsystematic combination of replications across several different dimensions. Because of this complexity, we did not feel prepared to issue standards of replication that would be comparable in specificity to the standards of evidence outlined in Flay et al. (2005). Instead, this article explores the reasons for conducting replication research, describes the types of research that can be considered replications, and considers factors that could influence the extent of replication research.

We approached these issues from a public health perspective. The ultimate goal of prevention research is to affect the well-being of a population. Whether and to what extent replication research is needed and the relative merit of a particular type of replication should be evaluated in terms of the degree to which it contributes to affecting outcomes in entire populations. This should not be taken to imply, however, that prevention research is atheoretical or that replication is separable from the development of an empirically based understanding of human behavior. As we explain below, replication is fundamental to our confidence in empirical relations and contributes to our understanding of the factors relevant to effective and reliable prevention efforts. We hope that this article serves as the beginning of a broader discussion of the role of replication in prevention science.

The Context for Replication Research

That research results ought to be reproducible is one of the foundations of all science. Reproducibility implies that (a) the specifics of study design and implementation are reported at a level of detail that allows other researchers to competently repeat the experiment, and (b) the results of studies of the same phenomenon ought to be equivalent. A person unfamiliar with this process may expect that if a study is conducted and then exactly repeated, the results of the second study should be identical to those of the first. According to this view, if the results of the studies are not identical, then it indicates a problem in one or perhaps both of the studies. However, this is not the view of scientists even in the so-called “hard” sciences. Hedges (1987), for example, noted that techniques for combining the results of multiple experiments have a long history in physics; these techniques were made necessary by the fact that well-implemented experiments in physics can yield results that differ from one another.

Perhaps the most pervasive reason for differences in study results is that it is very difficult to exactly repeat even a simple experiment (Schmidt 2009). Authors, for example, often cannot report every detail of an experimental procedure, in part because they are not fully aware of some of the details (i.e., some exist as implicit knowledge; Shadish et al. 2002, pp. 4-5). Therefore, these details will not be reported and might be enacted differently in a replication trial. Even very subtle differences in environmental conditions, sample characteristics, and procedures can lead to differences in results. Social researchers face similar problems but are working with less precise measurement methods and probably greater variability in sample characteristics, environmental conditions, and procedures. Further, perhaps due to historical misunderstandings of the probability statements arising from tests of the null hypothesis (including the so-called replication fallacy; see Gigerenzer 1993), researchers have often operationally defined “equivalent results” as “both studies resulted in the rejection of the null hypothesis.” For a variety of reasons that we address later in this paper, this is a problematic definition that often makes it more difficult to identify effective treatments.

In addition to the general scientific context, we embed our work on replications in the current trend toward empirically based practice occurring in many fields, including medicine, public health, social welfare, and education. In general, interventions are more likely to be accurately labeled as effective if they have been thoroughly tested, especially if these investigations have occurred across diverse populations and settings. In addition, many interventions targeting similar problems share conceptually similar approaches even if the specifics of the treatment are somewhat different. Such replications are useful for assessing the utility of higher-order principles that can be applied across interventions. As an example, see Stice et al. (2008), whose meta-analysis suggested that dissonance-based approaches to the prevention of eating disorders were more effective than non-dissonance-based interventions.
Replications are therefore a vital part of efforts both to identify effective preventive interventions and to spur development of new interventions based on theoretical and empirical considerations. Our effort to refine the scientific principles relevant to replication occurs in the context of significant progress in the identification of influences that disrupt normative human development. This progress was followed by the design and evaluation of various interventions that proved efficacious in preventing these developmental problems (e.g., Kellam et al. 1999). These developments have spawned concerted efforts to demonstrate the efficacy and then the effectiveness of interventions and then to disseminate them widely. Projects of this sort include the What Works Clearinghouse (WWC) of the U.S. Department of Education’s Institute of Education Sciences (IES),3 the Blueprints Project (Elliott and Mihalic 2004), and the Substance Abuse and Mental Health Services Administration’s National Registry of Evidence-Based Prevention Programs (NREPP).4 Each of these exploits information from replications to make decisions regarding intervention effectiveness. In addition, the work of the Cochrane Collaboration5 (in medicine) and its sibling organization the Campbell Collaboration6 (in the social sciences) extends this idea by committing to the production of high-quality systematic reviews. We view the further understanding of replication research as a critical step in the evolution of prevention science and practice.

3 See http://ies.ed.gov/ncee/wwc/
4 See http://www.nrepp.samhsa.gov/

Types of Replication Research

Intentional replications can be done for a variety of reasons. Researchers interested in testing the possibility that effects found in one study were due to chance (i.e., whether the presumed cause and the presumed effect do in fact covary) would conduct a statistical replication, in which all fixed effects of the original study are kept the same and a new random sample is drawn for all random effects (Hunter 2001). Note that this type of replication research is a close analogue to the idea of statistical conclusion validity (Shadish et al. 2002).

Researchers interested in testing whether relations observed in one study would generalize to conditions not observed in that study would conduct a generalizability replication. This is conducted like a statistical replication except that the target of generalization is changed (e.g., participants are drawn from a new population), and is analogous to Shadish et al.’s conception of external validity.

Researchers interested in the effects of variations on program implementation would conduct an implementation replication (Flay 1986). This is a special case of generalizability replication and is conducted like a statistical replication except that some implementation details are either different or are explicitly compared.

Finally, researchers interested in testing the causal mechanisms underlying an intervention would conduct a theory development replication. This kind of replication is again conducted like a statistical replication, except with changes that are theoretically relevant. For example, researchers might collect data on a putative mediator of the intervention to further understand how the intervention works. Theory development replications are analogous to Shadish et al.’s conception of construct validity, but in some cases might be related to internal validity as well (e.g., if the goal of a replication test is to tease apart the effects of potentially confounding moderators).
These kinds of replications are relatively rare, in part because even simple interventions can be hard to reproduce exactly. In addition, for a variety of reasons, replications are often undertaken with less than full knowledge of the trials that have preceded them. Because of these complexities, the most common replications are probably best termed ad hoc replications, in that they may not be conceptualized as replications per se and vary from one another in multiple known and unknown ways. For example, in Tobler et al.’s (2000) now classic review of school-based drug prevention programs, 47 studies were characterized as testing comprehensive life skills programs. While these studies shared commonalities at an abstract level (e.g., they all taught knowledge and refusal skills), they also differed at an abstract level (e.g., many programs focused on affect but some did not) and at a specific level (in terms of the actual content taught, program intensity, fidelity of implementation, and so on).

When changes to study conditions are not systematic or covary with other changes, the results of a replication study are more difficult to interpret. However, this is not to suggest that ad hoc or unsystematic replications are of little value. Indeed, variation in the specifics of studies—such as the implementation, setting, and participant sampling—is most likely to be a source of consternation when studies are examined one at a time (as in comparing the results of one study to the results of another) and in narrative fashion. However, when multiple, somewhat different studies are treated as data points in a second round of scientific investigation, they can contribute to more confident, general, and properly contextualized guides to decision making if they are considered as part of a rigorous review of the evidence (as they were in Tobler et al.’s work). In other words, the methods used for drawing inferences about the effects of an intervention can make ad hoc replications either more or less difficult to interpret.

5 See http://www.cochrane.org
6 See http://www.campbellcollaboration.org

Important Considerations for Interpreting Replication Research A natural question concerns the conditions under which an intervention’s effects can legitimately claim to have been replicated. As an example, assume that a study of a preventive intervention has been carried out, and that this study has found positive and statistically significant effects that are large enough to be substantively meaningful. A replication is then carried out. What pattern of results in the replication study would be required for the researchers to conclude that the results of the original study have been replicated? Before attempting to address this question, we raise several points for consideration. Can One Study be Considered a Replicate of Another? While the focus of this paper is on methods for determining whether results from one study have been replicated in a second, an equally important and even more challenging

Prev Sci (2011) 12:103–117

issue that we cannot fully address concerns whether it even makes sense to pose this question. That is, there is no reason to ask whether studies have obtained similar results unless those studies are replicates, in some sense of the term. Determining this is not easy. As Hume (1739/1740) argued about identity in general, whether an evolving program will seem to be “the same” or “different” relative to its initial form will depend on one’s perspective. If we focus on the gradual nature of changes, we are more likely to see the program as “the same” as it was before. If, however, we focus on the beginning and end points we may be more likely to see the programs as “different.” This implies that there is often no single correct answer to the question of whether one program can be considered a replicate of another program, highlighting the difficulty of this judgment. As much as judgment will always play a role in answering this question, we do believe two strategies might be helpful. First, given sufficient understanding of the theoretical concerns relevant to a particular intervention, one strategy would be to use this as a basis for determining whether critical program elements have changed. If the critical program elements are largely unchanged then it makes more sense to label subsequent versions of the intervention as “the same” as the original. An alternative view is given by Glass (2000), who argued that the question of identity is actually empirical as opposed to logical. As long as two interventions seem to belong to the same general class of interventions, Glass would argue that programs are effectively “the same” if their effects on the dependent variable are “the same.” An even stronger version of this view is that one might consider programs to be “the same” if they affect mediating variables in a similar way. Suppose a preventive intervention is found to increase school commitment, and school commitment is found to increase graduation rates. 
If this same pattern of results were found in a subsequent study, then it might make more sense to label the second study a replicate of the first (despite any changes to that intervention over time). Of course, if one accepts the logic of this view it is still necessary to determine whether results are similar across studies. We turn now to strategies for addressing this point. Inferential Framework When thinking about the question of whether an intervention’s effects have been replicated, perhaps the most important point of emphasis is that the degree to which confident inferences can be made varies as a function of the amount of information that can be brought to bear on the question. Inferences about replication can be made with as few as two studies, but only within a very weak inferential

Prev Sci (2011) 12:103–117

framework. This means that any inferences will be highly tentative, and their value for decision making limited. It is for this reason that we echo Hunter’s (2001) call for a dramatic increase in the number of replications conducted and published. Furthermore, it suggests that policymakers, scholars, and administrators should avoid premature adoption or rejection of interventions that appear to be effective or ineffective within a weak inferential framework. At the same time, we recognize that, due to critical needs for specific interventions to address public health problems, there will be times when a decision must be made even though the evidence is limited to a study or two. We believe that in these cases ongoing evaluation should be conducted as part of such dissemination efforts (Flay et al. 2005). Because decisions based on some evidence are likely to be better than decisions based on no evidence, we present some options for approaching evidence when it is limited.


Statistical Framework

In statistics, there is a well-articulated and long-standing tradition for thinking about replications (e.g., Cochran and Cox 1957). Given assumptions about the true population effect size and variance, probabilities of a successful replication can be precisely stated for statistical, generalizability, implementation, and theory development replications. Unfortunately, we have seen that these kinds of replications are relatively rare, and that the most common type of replication involves multiple planned and unplanned differences in study details. This means that understanding the results of replication studies in the real world is often not straightforward.

Important Background Assumptions

Regardless of the synthesis method ultimately used, two critical assumptions are required before further considering the question of whether results have been replicated in a subsequent study. First, we must assume that all studies that have been conducted on an intervention are available for analysis or, failing that, that study availability is unrelated to study outcomes. This assumption is closely tied to the well-known phenomenon of publication bias, which occurs because study authors are less likely to submit—and journal editors and reviewers are less likely to accept for publication—studies that lack statistically significant effects on their primary outcomes. The essential problem is that, all else being equal, studies lacking statistical significance tend to be those with smaller intervention effects. Any method of drawing conclusions about interventions may be biased if studies are censored in this manner. Further, the assumption about study availability extends to the outcomes that were measured in these studies. For a variety of reasons, studies often present results only on measured outcomes that had statistically significant effects. If measured outcomes within studies are censored due to the lack of statistical significance, this can also lead to incorrect inferences (Williamson et al. 2005). The second assumption is that the quality of the study design and its execution are comparable across the studies being examined. Generally speaking, it is probably inappropriate to use results from one study with relatively good inferential properties and one with relatively weak inferential properties to support inferences about the effects of a particular intervention. For example, as articulated more fully in Flay et al. (2005), due to the strong plausibility of measured and unmeasured group differences on baseline variables in nonequivalent group designs, randomized experiments are preferred for studying the effects of interventions. The studies should also be roughly similar in other ways (e.g., attrition rates).

Reframing the Question About Replication

We started this section with a fairly common scenario: An intervention is found to have positive effects in one study, and a second study is carried out in order to replicate the finding. The question that follows is whether the results of the second study replicated the results of the first. Unfortunately, this formulation of the question can ultimately prove misleading because it focuses attention on the statistical conclusions arising from the two studies. As we noted in the introduction to this paper, an alternate and perhaps better formulation asks, “What does the available evidence say about the size of the effect attributable to Intervention A?” The advantage of this formulation is that it avoids thinking in dichotomous terms about replications (most notably, with respect to the statistical significance of the findings in each study), and instead focuses on the magnitude (i.e., effect size) and likely range (i.e., confidence interval) of intervention effects.

This reframing of the replication question also serves to focus attention on research methods specifically designed to help synthesize the results of multiple studies. Systematic reviewing is now the state of the art in reviewing research. In highlighting the relation between replication research and clinical practice, the Institute of Medicine (2008) wrote, “Systematic reviews provide an essential bridge between the body of research evidence and the development of clinical guidance” (p. 83). A good systematic review involves a variety of techniques designed to enhance the transparency, objectivity, and ultimately the validity of the review process, characteristics notably absent from traditional literature review procedures (e.g., Cooper and Dorr 1995; Egger et al. 2001). After formulating a research question, a systematic review involves a thorough and systematic search for relevant studies (placing inferences at less risk of publication bias), a structured method for


extracting data from those studies, a thorough appraisal of the quality of the evidence, ideally based on rules set before data collection began, and a synthesis, often statistical, of the evidence itself.

Statistical Options for Assessing the Results of a Small Number of Studies

In the sections below, we consider several different statistical options for assessing the results of a small number of studies. In the discussion, we assume that deferring a decision about the effects of the intervention until more studies can be conducted is not a viable option. We emphasize situations in which only the original study and a single replication are available, because currently this is a scenario faced by many researchers and policy makers. We also discuss how the various options are expected to perform when additional studies are available. Options with good statistical properties will perform better (e.g., will have smaller standard errors, and be more likely to result in the correct conclusion) with more information, all else being equal. As a reminder, the validity of any of these inferential methods depends on the critical background assumptions presented above.

Vote Counting Based on Statistical Significance

Perhaps the most common approach to addressing the question of whether study results have been replicated is to examine the statistical significance of the studies’ primary results. For example, assume Study A found statistically significant results for an intervention. If Study B found statistically significant results on the same outcome that were in the same direction as the original study, Study B could reasonably be said to have replicated the results of Study A. This approach is known as “vote counting.” A variation of the vote counting approach, and one that is particularly arbitrary, involves declaring that an intervention “works” if there are at least two good studies demonstrating that the intervention is more effective than some alternative.
For example, Kirby (2001) reviewed studies of interventions targeting teen pregnancy rates, and labeled the interventions as having “strong evidence of success” if, among other characteristics, the interventions had been found to be effective in at least two studies conducted by independent research teams. Similarly, the Collaborative for Academic, Social, and Emotional Learning (2003) assigned the highest ratings of effectiveness to interventions that demonstrated positive results in at least two studies. Lists of “effective programs” selected using a vote counting criterion are quite common. Vote counting is based on a pervasive but incorrect belief about the interpretation of the probability values arising


from tests of statistical significance. Gigerenzer (1993), using published examples, and Oakes (1986), through a survey, found that probability values are often taken as a direct statement of replicability. Construing a probability statement as a statement of replicability is incorrect. When conducting a test of statistical significance for the mean difference between two intervention conditions, the probability value is the chance of observing a mean difference at least as large as the one observed, given a true null hypothesis (Cohen 1994). Another way to think about the probability value is that it represents the confidence with which we can state that we have correctly identified the direction of the effect. The relationship between the probability values in one study and the likelihood of successful replication in even an exactly replicated second study is therefore not straightforward. If a study rejects the null hypothesis at p=.05, for example, that does not mean that the next statistical replication has a 95% chance of rejecting the null hypothesis (if the population and sample effect sizes are very similar the probability is actually closer to 50%; see Greenwald et al. 1996; Valentine et al. 2010). The limitations of vote counting based on statistical significance are therefore well-known: The majority of studies must have statistically significant results in order for the claim to be made that an intervention “works.” Unfortunately, in most circumstances when using vote counting it is unacceptably probable that studies will not reach the same statistical conclusion, even if they are estimating the same population parameter (e.g., if the intervention really is effective). For example, if two independent studies are conducted with statistical power of .80 (meaning that both have an 80% chance of correctly rejecting a false null hypothesis), in only 64% of cases will both studies result in a correct rejection of the null hypothesis. 
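This joint-significance arithmetic, and the roughly 50% replication probability noted above for a result observed at p = .05, can be checked directly. A minimal sketch in Python (standard library only; the power values are the ones used in the text):

```python
from statistics import NormalDist

def p_both_significant(power: float) -> float:
    """Probability that two independent studies, each run with the given
    statistical power, both correctly reject a false null hypothesis."""
    return power ** 2

# Probability that an exact statistical replication of a result observed
# at p = .05 (two-sided, z = 1.96) is itself significant, assuming the
# true effect equals the observed effect: the replication's z-statistic
# is then centred on 1.96, so about half of replications fall short.
p_replicate = 1 - NormalDist().cdf(1.96 - 1.96)

print(round(p_both_significant(0.80), 2))  # 0.64: both significant in only 64% of cases
print(p_replicate)                         # 0.5
```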
If both studies are conducted with statistical power of .50, then in only 25% of cases will both studies result in a correct rejection of the null hypothesis. As such, because studies are typically not highly powered, in most current real-world contexts requiring studies to reach the same statistical conclusion is an approach with an unacceptably high error rate (by failing to detect real intervention effects when they exist). In fact, Hedges and Olkin (1985) demonstrated the counterintuitive result that, in many situations common in social research (i.e., interventions with moderate effects investigated in studies with moderate statistical power), vote counting based on statistical significance can actually have less statistical power the more studies are available. In addition, the statistical conclusions reached in individual studies are to some extent dependent on the assumptions employed when conducting the statistical tests. For example, researchers analyzing one study may choose to conduct a two-tailed test, while those analyzing a second


study may choose to conduct a one-tailed test. Even if all other aspects of the studies are exactly the same, this difference in analysis could lead the researchers to different statistical conclusions in their studies. For all of these reasons, comparing the statistical significance of the results of studies—while intuitively appealing and simple to implement—is a seriously limited inferential technique that is not well-suited to identifying effective interventions.

Comparing the Directions of the Effects

Another inferential technique that is simple to implement but limited is comparing the directions of the effects observed in both studies. In this case, for example, the researchers would determine that the results “replicated” if both studies produced positive effects. Note that direction is considered without reference to other information such as the size or statistical significance of the intervention’s effect. This would mean that an intervention with no effects on participants could be labeled “effective” if by chance both studies resulted in positive effects, and “harmful” if by chance both studies resulted in negative effects. The essential problem with this approach is that it is a very blunt instrument, and with only two studies this limitation is clear. The approach is, however, consistent with approaches taken by statisticians considering the problem of replicability (e.g., Killeen 2005). Furthermore, unlike vote counting based on statistical significance, the statistical power of this approach does improve as the amount of information increases, even if power is low in the individual studies. This happens because a fair amount of information is contained in the proportion of results that come out in one direction or another.
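Why direction-only information accumulates can be illustrated with a simple one-sided sign test on effect directions (an unweighted sketch; the study counts are hypothetical):

```python
from math import comb

def sign_test_p(n_positive: int, n_studies: int) -> float:
    """One-sided sign-test p-value: the probability of observing at least
    n_positive positive effects out of n_studies if the intervention truly
    had no effect (so each direction is equally likely by chance)."""
    tail = sum(comb(n_studies, k) for k in range(n_positive, n_studies + 1))
    return tail / 2 ** n_studies

# With only two studies, agreement in direction is weak evidence:
print(sign_test_p(2, 2))    # 0.25: two positive effects arise by chance 1 time in 4
# The same 80% proportion of positive effects becomes much stronger
# evidence as the number of studies grows:
print(sign_test_p(8, 10))   # ~0.055
print(sign_test_p(16, 20))  # ~0.006
```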
Bushman and Wang (2009) outlined procedures for conducting a vote count of the directions of the effect that take into account the sample sizes of the individual studies. With a sufficient number of studies, this weighted vote count of directions can provide a reasonable approximation of the underlying population effect size. However, in most real-world situations, where the number of studies is small, its utility for answering questions about replication will be limited.

Comparability of Effect Sizes and the Role of Confidence Intervals

Yet another way of thinking about the question of whether study results have been replicated is to consider the effect sizes in the studies. If the effect sizes are judged to be comparable, then the study results could be said to have been replicated. One could, for example, conduct a statistical test of equivalence. This test requires the scholar


to be able to specify “how close is close enough” (that is, to determine the point at which effects are practically the same), and there is usually no non-arbitrary way of doing this. Readers are referred to the work of Seaman and Serlin (1998) for additional information. Confidence intervals express the likely range of a population effect and, as such, provide information that tests of statistical significance do not. One way to make effect size comparability operational, without invoking an arbitrary rule for deciding how close is close enough, would be to determine whether the mean from an attempted replication fell within the confidence interval of the mean from an original study. Unlike the two vote counting methods, this approach uses much of the statistical information available in the two studies because it takes into account the estimate of the effect size and its uncertainty from one study along with the estimate of the effect size from a second study. Cumming and Maillardet (2006) have shown that this approach has comparatively good inferential properties: On average, 83% of means from a statistical replication should fall within the confidence interval from the original study (for ad hoc replications, the percentage of replication means that will fall within an original confidence interval cannot be specified without invoking additional assumptions). One could therefore argue that an intervention has been replicated if the effect size from the replication study fell within the first study’s confidence interval. A limitation of this inferential method is that it is easily applied only when exactly two studies are available. The statistical technique known as homogeneity analysis can also be used to address the question of whether study effect sizes are comparable (Shadish and Haddock 1994).
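The confidence-interval check described above is straightforward to compute. A minimal sketch (the effect sizes and standard errors here are hypothetical):

```python
def replication_in_ci(d1: float, se1: float, d2: float, z: float = 1.96) -> bool:
    """Does the replication's effect size (d2) fall within the original
    study's 95% confidence interval (d1 +/- z * se1)?"""
    return d1 - z * se1 <= d2 <= d1 + z * se1

# Hypothetical standardized mean differences and standard errors:
print(replication_in_ci(0.80, 0.31, 0.42))  # True: 0.42 lies within (0.19, 1.41)
print(replication_in_ci(1.27, 0.32, 0.60))  # False: 0.60 falls below (0.64, 1.90)
```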
While the statistical power of the homogeneity test can be low when there are few studies, it is closely related to the idea of determining that effect sizes are comparable if one falls within the confidence interval of the second, and it can be used when there are more than two available studies.

Combining Effects Using Techniques Borrowed from Fixed Effect Meta-Analysis

All of the strategies we have considered so far approach replication from the standpoint of asking whether the results of two studies “agree” in some sense. As we indicated earlier, another way to frame this question would be to ask, “What does the available evidence say about the size of the effect attributable to Intervention A?” This is the approach taken in meta-analysis when analyzing a mean weighted effect size. This technique has several advantages relative to the other inferential options discussed so far. Specifically, in meta-analysis, effects from larger studies are assigned proportionally more weight than effects from smaller studies. Further, meta-analysis focuses attention


on both the weighted average effect size and its confidence interval. These issues are addressed less well or not at all by the other inferential techniques we have discussed. This suggests that applying techniques borrowed from fixed effect meta-analysis for combining effect size estimates can be useful in replication research. In a fixed effect meta-analysis, all studies are presumed to be estimating the same population parameter and yield sample statistics (i.e., study effect sizes) that differ from one another only because of random sampling error. Inferences arising from a fixed effect meta-analysis are conditional on the studies observed (Hedges and Vevea 1998), and are not influenced by, for example, studies that could have been done but are not in the analysis. These properties make a fixed effect meta-analysis approach attractive for combining effect size estimates produced by the four systematic replication types, namely statistical, generalizability, implementation, and theory development replications. Notably absent from this list are ad hoc replications. Given that ad hoc replications are the most common kind of replication, this is a serious limitation. In addition, the fixed effect meta-analysis approach will provide a relatively poor estimate of the mean effect size, indicated by a very large confidence interval, in most cases when there are few studies.

Combining Effects Using Techniques Borrowed from Random Effects Meta-Analysis

In the random effects meta-analytic approach, studies are presumed not to share the same underlying population effect size. Instead, these effects are assumed to vary due to known and unknown characteristics of the studies. Stated more formally, studies yield sample statistics that vary from one another due to random sampling error and to other between-study differences.
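Both combination strategies can be sketched in a few lines. The following is a minimal inverse-variance implementation (DerSimonian-Laird is one common estimator of the between-studies variance, used here for illustration; the study values are hypothetical):

```python
def meta_analyze(effects, ses, random_effects=False):
    """Inverse-variance weighted mean effect size with a 95% CI.
    With random_effects=True, a DerSimonian-Laird estimate of the
    between-studies variance (tau^2) is added to each study's variance."""
    k = len(effects)
    w = [1.0 / se ** 2 for se in ses]
    mean_fe = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    if random_effects:
        # Q statistic and DerSimonian-Laird tau^2 estimate
        q = sum(wi * (d - mean_fe) ** 2 for wi, d in zip(w, effects))
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (k - 1)) / c)
        w = [1.0 / (se ** 2 + tau2) for se in ses]
    mean = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    se_mean = (1.0 / sum(w)) ** 0.5
    return mean, (mean - 1.96 * se_mean, mean + 1.96 * se_mean)

# Two hypothetical studies with equal standard errors:
d, ci = meta_analyze([0.80, 0.42], [0.31, 0.31])
print(round(d, 2), [round(x, 2) for x in ci])  # 0.61 [0.18, 1.04]
# Here the two effects differ by less than expected from sampling error
# alone (Q < k - 1), so tau^2 = 0 and the random effects result reduces
# to the fixed effect result:
d_re, ci_re = meta_analyze([0.80, 0.42], [0.31, 0.31], random_effects=True)
print(round(d_re, 2), [round(x, 2) for x in ci_re])  # 0.61 [0.18, 1.04]
```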
Compared to the fixed effect approach, inferences arising from a random effects meta-analysis approach are not conditioned as tightly on the observed studies (Hedges and Vevea 1998). One important implication of the choice between a fixed and a random effects meta-analysis approach is that the confidence intervals arising from a random effects approach will often be larger and will never be smaller than the confidence intervals that would have been generated by a fixed effect approach. Although statistical power can be higher in a random effects approach relative to a fixed effect approach, it is usually lower, and often non-trivially so. This is a major problem when there are few studies, since power is likely to be very low. A related problem concerns the estimation of the extent to which individual study-level effects vary from one another. Essentially, if study effects differ by no more than would be expected given sampling error alone, a random effects approach reduces to a fixed effect approach. If study-level effect sizes differ by more than would be


expected given sampling error, then this information (the between-studies variance component) adds uncertainty to the estimation of the weighted average effect size. However, estimates of whether effect sizes differ by more than would be expected given sampling error alone are poor when there are few studies. Despite these limitations, a random effects meta-analytic approach may be the most appropriate strategy for dealing with ad hoc replications.

Using Multiple Inferential Strategies

Because all of the approaches to addressing the question of replication will be limited when few studies are available, one option is to employ all of them when considering the state of the cumulative evidence generated from ad hoc replications. We provide several examples of how this might be done below. One caution: in many cases it is not reasonable to expect all of the ways of thinking about the question of replication to agree with one another. However, valuable information can be gained from examining as a whole the nature and pattern of agreements. Table 1 presents simulated data illustrating how all of the ways of thinking about whether research results have been replicated might be used. For the purposes of this example, assume that an intervention effect of δ=.10 is the smallest effect size that could be considered to be important in this research area and for this outcome; observed effects smaller than this would be considered to be too small to be meaningful. The simulation data were randomly generated from a distribution that has a population effect size of δ=.20, which would indicate a small but real effect for the intervention. Sample sizes range from 10 to 44 per group. These cases were chosen as exemplars because they illustrate a range of inferential challenges. As can be seen in the discussion below, some of the resulting judgments are relatively easy to make, while others are much more difficult.

Case 1

Two very small studies were conducted.
Not surprisingly, neither study had statistically significant results, and the effects from the two studies were in opposite directions. The effect size from the first study did not fall within the 95% confidence interval of the second study. Further, neither a fixed effect nor a random effects meta-analytic approach revealed statistically significant effects. It is therefore difficult to argue that the intervention being studied is likely to have positive effects on participants given the available evidence. One caution, however, is that both the fixed and the random effects approaches resulted in very large confidence intervals around the weighted average effect size. As such, the most reasonable conclusion is that we know very little about the effects of this intervention, rather than that there is evidence that the



Table 1 Example of strategies for assessing the results of multiple studies

| Case | Study 1 d±CI | Study 2 d±CI | Do the studies agree about the direction of the effect? | What is the pattern of statistical significance? | Is the effect size from the second study within the CI of the first study? | What are the results of a fixed effect analysis? | What are the results of a random effects analysis? |
|---|---|---|---|---|---|---|---|
| 1 | 0.81±0.91 | -0.80±0.91 | No | Neither study significant | No | d=.004±.64, p=.99 | d=.005±1.58, p>.99 |
| 2 | 0.80±0.61 | 0.42±0.60 | Yes | Study 1 only | Yes | d=.61±.42, p=.006 | d=.61±.42, p=.006 |
| 3 | 1.27±0.62 | 0.60±0.58 | Yes | Both studies significant | No | d=.91±.42, p | |