
Hierarchical Linear Modeling: A Review of Methodological Issues and Applications

John Ferron Melinda R. Hess Kris Y. Hogarty Robert F. Dedrick Jeffery D. Kromrey Thomas R. Lang John Niles

University of South Florida

Paper presented at the 2004 annual meeting of the American Educational Research Association, San Diego, April 12-16.

INTRODUCTION

Researchers in education and many other fields (e.g., psychology, sociology) are frequently confronted with data that are hierarchical or multilevel in nature. For example, in the context of school organizations, students are nested in classes, classes are nested in schools, schools are nested in school districts, and so on. In longitudinal research, repeated observations are nested within individuals (i.e., units), and these individuals may be nested within groups. The pervasiveness of multilevel data has led to a proliferation of statistical methods, referred to under a number of names including hierarchical linear modeling (HLM; Bryk & Raudenbush, 1992), multilevel modeling, mixed linear modeling, and growth curve modeling, and a parallel increase in the number of applications of these methods to educational problems.

The complexity of these multilevel methods provides potential for misuse and confusion, which may act as barriers for applied researchers attempting to use these methods. Careful consideration of the methodological nuances of multilevel analyses is critical, as misuses may produce statistical artifacts that influence statistical inference and cloud interpretation. A close examination of some of the more common methods employed across various disciplines, as well as an exploration of recent research trends, can serve both to inform the practice of research and to broaden our understanding of the various methodologies and critical issues facing practitioners.

Given this context, it is appropriate to pause and critically analyze the methodological issues inherent in multilevel modeling techniques. Similar methodological reviews have been conducted with more commonly used techniques such as analysis of variance, multivariate analysis of variance, analysis of covariance, path analysis, and structural equation modeling. Keselman et al. (1998) note that one consistent finding of methodological research reviews is that a substantial gap often exists between the methods recommended in the statistical research literature and the techniques actually practiced by applied researchers (Goodwin & Goodwin, 1985b; Ridgeway, Dunston & Qian, 1993). Methodological reviews can serve to identify issues, controversies, and current trends, as well as provide direction to applied researchers. In addition, these reviews may help bridge the gap between statistical theory and application.


Purpose & Intended Audience

The purpose of this review is to provide an overview of the methodological landscape and the critical issues surrounding multilevel modeling and to report on the current application and reporting of multilevel analyses in education and related fields. Because a single review cannot include every methodological consideration or technical nuance, it is important to clarify that our intent is to focus primarily on what might be termed 'traditional' hierarchical linear modeling. That is, we consider linear models of continuous outcomes in which the random effects are assumed to be normally distributed. This allows consideration of applications where individuals are nested in contexts (e.g., students nested in schools) and applications where observations are nested within individuals (e.g., growth curve models). Models in which the outcome is represented by binary, count, or ordinal data are not considered (see Raudenbush & Bryk, 2002, for discussions of these types of applications). Nor did we venture beyond these modeling techniques to explore related methods such as multilevel structural equation modeling (SEM), hierarchical linear measurement modeling, or applications of item response theory (IRT).

This review is intended to be useful to several distinct audiences in the research arena. For example, practitioners whose inquiries in applied areas include multilevel models should benefit from this treatment of the published literature in the field. A careful consideration of one's research methods, designs, and habits of reporting, in contrast to those evidenced in the field in general, will tend to suggest areas for refinement of techniques and serve as a source for professional reflection. Similarly, manuscript reviewers and journal editors, who serve the critical roles of both gatekeepers and navigators for authors in the reporting of research results, should gain from this examination of multilevel models. A critical appraisal of the strengths and limitations in the reporting of results from such models is intended to sharpen the eyes of these scholars and lead to improvements in the corpus of research to be published. In addition, research methodologists and others who serve as technical consultants in large-scale research projects may glean additional distinctions and insights from the technical issues raised in this paper. A tremendous amount of methodological work is needed to advance our understanding of the limits and possibilities of hierarchical models, and this review of the published literature highlights several areas needing focused attention. Further, university professors and teachers of research methods and applied statistics may find that the issues raised here and the resources cited will hone their craft and enhance the content of their curricula. Finally, students of educational research (the scholars of tomorrow) will find a readable introduction to hierarchical models, together with citations of sources that will provide logical entries into the growing body

of technical work in this area. Our treatment of hierarchical models as they are 'practiced' by those publishing in education and related fields is intended to both guide and inspire researchers at multiple levels of experience and interest.

Organization of the Review

To provide an overview of the methodological nuances and critical issues surrounding multilevel modeling, literature in an array of disciplines (education, medicine, public health, psychology, business, chemistry, physics, biology, statistics, and mathematics) was reviewed. Electronic databases (e.g., ERIC, PsycINFO) were searched using a variety of keywords (e.g., hierarchical linear modeling, mixed linear models, nested designs, and multilevel modeling), and additional articles were gleaned from the reference lists of selected articles. Lastly, key issues were identified from the reference lists of pivotal articles, by exploring technical software manuals, and by monitoring online conversations on certain listservs.

Based on these reviews, four broad issues were identified. These issues are explored during the initial phase of this paper and served as the framework for the coding protocol used to analyze the hierarchical linear modeling applications in the literature. These issues include: (a) model development and specification, including issues of centering, selection of predictors, specification of covariance structure, fit indices, generalizability, and checks on specification; (b) data considerations, including distributional assumptions, outliers, measurement error for predictors and outcomes, power, and missing data; (c) estimation procedures, including maximum likelihood, restricted maximum likelihood, Bayesian estimation, and alternative procedures such as bootstrapping; and (d) hypothesis testing and statistical inference, including inferences about variance parameters and fixed effects.

Following the explication of these issues, we describe the development of the framework and coding protocol used as a lens for our analysis of the literature and the search strategies employed in this review. The results of our review of applications are then presented, along with recommendations for improved reporting practice.

Critical Issues in Hierarchical Modeling

Model Development & Specification

Model development is a central component of inquiry in many disciplines, with notably different approaches evidenced among researchers both within and across disciplines. According to Dickmeyer (1989), a patchwork of styles and worldviews with respect to model development exists in the educational research community. Models are typically built to allow researchers to test theories or hypotheses, to manipulate and test changes in simplified systems, and to allow for the exploration of relationships between variables that in some way characterize a complex system. Support for various models may emerge from the extant literature and research in a particular field. Other models may be the product of developing theories employing exploratory analyses that take more data-driven approaches.

To aid discussion of the particular specification issues confronted by those using hierarchical models, we first review the basic notation and terminology in the context of an application. Consider a team of educational researchers who wish to study the relative effectiveness of two reading programs. These researchers randomly assign participating classrooms to one of the two programs, and gather reading comprehension data both prior to and following the use of the program. A level-1 model could be developed to model the reading comprehension of students within a class as a function of their reading comprehension prior to the study. More specifically,

yij = β0j + β1j (prior reading)ij + rij ,    (1)

where yij is the reading comprehension of the ith child in the jth classroom, β0j is the intercept of the regression equation predicting reading comprehension at the end of the study in the jth classroom, β1j is the regression coefficient indexing the relationship between reading comprehension at the end of the study and reading comprehension before the study in the jth classroom, and rij is the error, which is assumed to be normally distributed with a covariance of Σ. A level-2 model could then be used to examine the relative effectiveness of the two programs, and whether the relative effectiveness of the two programs depended on the prior levels of reading comprehension. The level-2 model would use program to predict the intercepts and slopes of the level-1 model.

β0j = γ00 + γ01 programj + u0j ,    (2)
β1j = γ10 + γ11 programj + u1j ,    (3)

where programj is a dummy coded variable indicating whether the jth classroom received Program A (coded 0) or Program B (coded 1), and u0j and u1j are the errors, which are assumed to be normally distributed with a covariance of Γ. Although models are frequently described by using equations for each level, it is possible to combine all the equations into one. By substituting the second level model for β0j and β1j in the first level model, a combined model is obtained,

yij = γ00 + γ01 programj + γ10 (prior reading)ij + γ11 programj × (prior reading)ij + u0j + u1j (prior reading)ij + rij .    (4)

In the combined model it becomes clear why γ11 is referred to as a cross-level interaction. It should also be noted that the combined model has the same form as the mixed linear model,


y = Xγ + Zu + ε ,    (5)

where y is a vector of outcome data, γ is a vector of fixed effects, X and Z are known model matrices, u is a vector of random effects, and ε is a vector of errors (Henderson, 1975).

Centering. Centering of the level-1 and level-2 predictors has important implications for interpreting the results and therefore is an important consideration in specifying the statistical model. In our example, suppose the level-1 predictor, prior reading comprehension, was measured on a scale ranging from 200 to 800. If the predictor were kept in its natural metric, γ00 would be the predicted reading comprehension for a student in Program A (coded 0) who had a prior reading comprehension of zero. Since a prior reading comprehension of zero is not possible, the coefficient is difficult to interpret. The effect of the program, γ01, would be interpreted as the difference in the effectiveness of the two programs when prior reading comprehension was zero; again, not a particularly informative value. Prior reading comprehension would need to be scaled or centered to make the interpretation of the coefficients more meaningful. Although centering is used outside the context of multilevel modeling, it is particularly important in multilevel modeling because the level-1 coefficients become outcomes to be explained in higher-level models (e.g., level 2).

One approach to scaling the predictor variable is to subtract the grand mean of the predictor variable from each score (xij − x̄··). Using grand-mean centering with our example, γ00 would become the predicted reading comprehension for a student from Program A who had a mean level of prior reading comprehension. The effect of the program, γ01, would become the difference in the effectiveness of the two programs for students with a mean level of prior reading comprehension.

A second approach to scaling the predictor variable is to subtract the level-2 unit mean of the predictor variable from each score (xij − x̄·j).
Using group-mean centering with our example, γ00 would become the predicted reading comprehension for a student in Program A whose prior reading comprehension was at the mean of her class. The effect of the program, γ01, would become the difference in effectiveness of the two programs for students who are at their classroom's mean level of prior reading comprehension.

A third approach to scaling a predictor variable is to subtract a theoretically meaningful value from each score (xij − specific value). This approach is similar to grand-mean centering in that a constant is subtracted from each score. The β0j is interpreted as the expected outcome for

individuals who score at the specific value that has been set by the researcher. For example, in a growth curve model examining change in achievement across grades 6, 7, and 8, a researcher may center the grade predictor at grade 8. In this case, β0j is interpreted as the expected value of the outcome for a student in 8th grade.

In the case of level-2 predictor variables, Raudenbush and Bryk (2002) have noted that the choice of the scale metric (e.g., natural metric, grand-mean centered) is less critical. However, when interaction terms are included at level 2, Raudenbush and Bryk (2002) have suggested that grand-mean centering has the advantage of reducing multicollinearity.

Selection of predictors. The selection of predictors is a critical aspect of the design of a study. According to Little, Lindenberger, and Nesselroade (1999), the issue of variable selection is directly related to the quality of the research design and the value of the results. In hierarchical modeling, variable selection can be complicated because predictors can be selected for each level of the model, and interactions between predictors can be considered at either level or across levels. In addition, the process of variable selection can take many forms. In some instances, the selection of predictors is established prior to looking at the data, while in others the data help guide selection decisions. Inclusion may be based partially on significance tests, effect sizes, or fit indices. In the reading comprehension example, one can imagine the researchers using the research goals and a priori considerations to select prior reading achievement as a predictor at the first level, and program as a predictor at the second level.

The interpretation of the regression coefficients is influenced by the degree to which one believes the regression coefficients are unbiased estimates of effects. In our illustrative example, γ01 can be interpreted as a program effect. Since classrooms were randomly assigned to programs, we would anticipate that program was not related to any other classroom-level variables, and consequently we would anticipate an unbiased estimate of effect. If we had not randomly assigned classrooms to programs, our ability to argue that γ01 was an unbiased estimate of effect would depend on our ability to argue that all relevant variables were included in the model, or that the set of predictors for the model was correctly specified.

Researchers who wish to include all relevant variables, but who are unsure whether particular variables need to be included, may let the data help them decide which variables to include. For example, a researcher may start with a model like the one we have previously described and then add variables one at a time, keeping only those that are statistically significant. Other researchers may start with a fuller set of variables, and eliminate ones that do not seem to be affecting the

results. When the data are used to guide the selection of predictors, the researcher increases the odds of capitalizing on chance, which heightens the need for replication.

Specification of covariance structure. The most notable difference between hierarchical models and more common regression models is that hierarchical models have more error terms and consequently more flexibility in defining the covariance structure. This greater flexibility leads to two distinct advantages. First, it allows researchers to be more flexible in the questions they ask about the covariance structure. In applications like our example, researchers can ask questions about the degree to which the outcome varies within classrooms relative to the degree to which it varies across classrooms. In growth curve modeling applications, researchers can ask to what extent initial levels (intercepts) vary across participants and to what extent growth rates (slopes) vary across participants. Second, the degree to which the standard errors for the regression coefficients are unbiased depends on the degree to which the covariance structure is correctly specified. Having the flexibility to model a more complex covariance structure improves the chances of correct specification, which leads to better estimates of the standard errors of the regression coefficients, which in turn leads to more accurate confidence intervals and more valid statistical tests.

The covariance structure for the first-level model, Σ, is often assumed to be σ²I in applications where students are nested in contexts. In repeated measures contexts this assumption is more questionable, since errors that are close in time may be correlated. A wide variety of alternative structures have been discussed, including first-order autoregressive, banded, unstructured, Toeplitz, banded Toeplitz, and first-order autoregressive plus a diagonal (Wolfinger, 1993). With so many options available, researchers are left with questions about how best to specify the covariance structure of the first-level model.

Questions also arise as to how best to specify the covariance structure of the second-level model. In the previous example, which is relatively simple, there are alternative specifications for Γ depending on whether one wants to let both intercepts (β0j) and slopes (β1j) randomly vary, and whether or not one wants to allow for covariance between the errors in predicting the intercepts and slopes. As the number of predictors in the first-level model increases, the potential size of the Γ matrix also increases. With more elements there are more variance parameters that could be estimated. The question becomes which coefficients should be allowed to randomly vary, and, if the answer is more than one, which of the possible covariances between errors should be estimated.
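To make these alternative specifications of Γ concrete, the sketch below fits the two-program reading example with (a) only randomly varying intercepts and (b) randomly varying intercepts and slopes with their covariance estimated. This is our own illustration, not part of the original study: the data are simulated, all variable names and values are hypothetical, and we use Python's statsmodels (one of several packages that fit such models).

```python
# A hypothetical sketch of alternative level-2 covariance structures for the
# reading-program example; data are simulated, not from any actual study.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
J, n = 40, 20                                    # classrooms, students per class
program = np.repeat(rng.permutation(np.repeat([0, 1], J // 2)), n)
cls = np.repeat(np.arange(J), n)
prior = rng.normal(500, 80, J * n)               # prior reading, 200-800 scale
prior_c = prior - prior.mean()                   # grand-mean centered
u0 = np.repeat(rng.normal(0, 4.0, J), n)         # random intercept deviations
u1 = np.repeat(rng.normal(0, 0.03, J), n)        # random slope deviations
y = (520 + 6 * program + (0.5 + u1) * prior_c
     + 0.02 * program * prior_c + u0 + rng.normal(0, 10, J * n))
df = pd.DataFrame({"y": y, "program": program, "prior_c": prior_c, "cls": cls})

# Gamma with one element: only the intercepts beta_0j vary randomly.
m_int = smf.mixedlm("y ~ program * prior_c", df,
                    groups=df["cls"]).fit(reml=False)
# Full 2x2 Gamma: intercepts and slopes vary, with their covariance estimated.
m_slp = smf.mixedlm("y ~ program * prior_c", df, groups=df["cls"],
                    re_formula="~prior_c").fit(reml=False)
print(m_int.aic, m_slp.aic)   # information criteria for comparing structures
```

The second model estimates three level-2 (co)variance parameters where the first estimates one, illustrating the trade-off discussed above between flexibility and estimation difficulty. Note that statsmodels reports AIC in the common smaller-is-better form (−2 log L + 2q), which differs from the SAS-style expressions given later in this paper.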

One could generally divide the covariance parameters into three categories: those that are assumed to be zero and not estimated, those that are assumed to be non-zero and thus estimated, and those that the researcher is less sure about. If researchers routinely leave out all questionable variance parameters, they run the risk of leaving out needed parameters, biasing their standard errors, and jeopardizing their inferences. If researchers routinely add in all questionable parameters, they may estimate a model that is overly complex, which increases the chance that they will encounter estimation problems. For example, the estimation may not converge, or a variance component may be inadmissible (e.g., a variance less than zero or a covariance that implies a correlation exceeding 1.0). Even when estimation seems smooth, estimating many parameters that are equal to zero will negatively affect the precision in estimating the other parameters in the model. Consequently, one would ideally estimate only the needed parameters.

Fit Indices. With the growing recognition of the importance of selecting an appropriate covariance structure, several methods have been developed that allow researchers to use the data to help decide which covariance structure to estimate. As it is often not possible to know the underlying structure in advance, researchers will often examine multiple structures and rely on fit indices to select among possible covariance structures (Singer, 1998; Wolfinger, 1993). Among the indices commonly used are Akaike's Information Criterion (Akaike, 1974) and Schwarz's Bayesian Criterion (Schwarz, 1978). Akaike's Information Criterion is given by:

AIC = log(L) − q ,    (6)

where log(L) is the log likelihood and q is the number of covariance parameters. Schwarz's Bayesian Criterion is given by:

SBC = log(L) − (q log(N − p))/2 ,    (7)

where N is the number of observations and p is the number of fixed-effect parameters.

Both AIC and SBC start with the log likelihood value and then penalize for the number of covariance parameters estimated, with SBC employing a stiffer penalty. For each of these indices, values closer to zero represent better fit, so typically the model with the value closest to zero is selected. This approach, however, does not always lead to identification of the correct covariance model, especially when data are somewhat limited. For example, with repeated measures data, it is difficult to correctly select the covariance structure when the series length is short (Ferron, Dailey, & Yi, 2002; Keselman, Algina, Kowalchuk, & Wolfinger, 1998). Furthermore,

misspecification can affect estimation and inference (Ferron, Dailey, & Yi, 2002; Lange & Laird, 1989).

Other work in this area has focused not on a single estimate, but rather on a 'confidence set of models' (Shimodaira, 1998). Instead of using the minimum AIC, Shimodaira proposed the use of an 'interval' estimate of the best model. This author was quick to note that the confidence set approach is not intended to replace the use of an obtained point estimate of the minimum AIC, but rather provides supplemental information on model selection. This approach employs a series of pairwise analyses in which a standardized difference of AIC is calculated for every pair of models. Potential models are compared to the model evidencing the best estimate of sample fit, and those models that are not observed to differ by a statistically significant amount become part of a set of models for consideration.

Generalizability and Sensitivity. The degree to which the findings of a particular analysis are generalizable, as well as how sensitive the findings are to characteristics of the data, should be a concern of all researchers. Limitations of a particular sample, the nature of the data, and the techniques employed all affect the breadth and depth of the inferences made. These issues can, at least in part, be examined using a variety of statistical methods and techniques to help ascertain the strength of the findings. Such techniques include cross-validation, sensitivity analysis, replication and extension of previous research, and internal replication.

Although typically associated with more traditional statistical methods such as regression analysis, cross-validation is also useful in HLM analyses, providing further evidence of validity and model soundness. The primary purpose of cross-validation is to provide a check of model integrity and generalizability. This model 'check' is accomplished by using one set of data, sometimes referred to as the screening (or training) set, and a second set of data that may be called the calibration (or test) set. The screening set is used to estimate the model, and the calibration set is then applied to determine how well the model predicts outcomes beyond the screening data. This process allows the calculation of a magnitude of generalization error based on how well the calibration data set fits the model identified by the training data. Depending on the data structure and sample size, cross-validation may be conducted using various strategies. These include the holdout method (also called the data-splitting method), the dual-sample method, k-fold cross-validation, and leave-one-out cross-validation (LOO). The procedure can be further complicated when a researcher does not believe, for theoretical, conceptual, or data-based reasons, that a single cross-validation process is sufficient and thus engages in double cross-validation. Double cross-validation is nothing more than doing cross-validation twice and then

using a combined equation or model. Depending on the statistical analysis being employed (e.g., regression or HLM), this process may vary in complexity and applicability.

Another means of addressing generalizability and sensitivity issues is to conduct a sensitivity analysis. This type of analysis examines the impact of data anomalies (e.g., extreme data values, distribution irregularities) on model fit and parameter estimates. Bayesian techniques such as Gibbs sampling, as well as other strategies and algorithms, can be used to examine the impact of extreme observations at either level one or level two of the model (Seltzer, Novak, Choi, & Lim, 2002). Other techniques, such as data transformations (e.g., logarithmic, square root), can be effective in addressing issues such as nonnormal distributions. The degree to which model fit and parameter estimates remain constant when data issues such as these are controlled for is critical in determining how 'sensitive' a given model is to fluctuations or peculiarities in the data.

As with virtually all other techniques and methods of data analysis, HLM can be used both for replication and extension of other studies and for internal replication. The replication and extension of other studies can be done in a multitude of ways. HLM can be a complementary method used to examine a sample of a population previously analyzed using other statistical techniques such as regression analysis. The degree to which HLM is more robust in accounting for issues such as lack of independence among observations makes it well suited to replicate previous research with very similar populations, either to help strengthen the inferences made from that research or to identify possible areas of concern that were not identifiable with more traditional analyses. Furthermore, replication can be conducted within a given study by analyzing subsets of the data independently and subsequently examining the degree to which the results are similar. Depending on the method(s) used, as well as the data source(s), replication efforts often serve to enhance either the external or internal validity of the findings reported and conclusions reached.

These are just a few of the means that researchers might consider when conducting HLM analyses. These techniques and approaches have the potential to enhance the credibility, validity, and generalizability of findings, depending on the focus, purpose, and resources of the study. A careful and considered selection among these analyses will enhance the integrity of the findings of a research study.

Data Considerations

Distributional assumptions. All inferential statistical tools are based on a set of core assumptions. Provided the assumptions are met, the method will function as planned or intended. These underlying assumptions are often not satisfied, however, and it is common knowledge

that under some data-analytic conditions certain procedures will not produce the desired results. According to Keselman et al. (1998), "the applied researcher who routinely adopts a traditional procedure without giving thought to its associated assumptions may unwittingly be filling the literatures with nonreplicable results" (p. 351).

For a hierarchical linear model, distributional assumptions are made about the errors at each level in the model. The first-level errors, the rij in Equation 1, are assumed to be independently and normally distributed with a covariance of Σ. Lack of normality can lead to biases in the standard errors at both levels, and thus introduces questions about the validity of statistical tests and the accuracy of reported confidence intervals. The normality assumption is not realistic for certain types of outcome variables (e.g., binary, multinomial, and ordinal outcomes), and in these cases it is generally recognized that hierarchical generalized linear models are more appropriate. When one has a continuous outcome variable, the normality assumption of hierarchical linear models may be reasonable, but even here the assumption may not hold. Researchers can assess normality by examining the distribution of the level-1 residuals. The distributions can be examined separately for each level-2 unit, or by pooling across the level-2 units. If evidence of nonnormality is found, the researcher may wish to consider transforming the outcome variable.

Also implicit in the assumptions about the first-level errors is that the variance of the errors is the same for each level-2 unit. If the variances are not homogeneous, but vary randomly, it does not appear that the fixed effects or standard errors are biased (Kasim & Raudenbush, 1998), but if the variances vary as a function of the level-1 or level-2 predictors, there may be more serious consequences (Raudenbush & Bryk, 2002). A researcher can examine the homogeneity assumption by examining the variance of the level-1 residuals for each level-2 unit. The researcher could then look for units with variances that are notably different from the others, or test whether the differences among the variances are greater than what could be attributed to sampling error. Researchers could also examine the correlations between the variance estimates and the values of the level-2 predictors.

Distributional assumptions are also made about the level-2 errors, the u0j and u1j from Equations 2 and 3. These errors are assumed to be normally distributed with a covariance of Γ. Checking the normality of these errors is a bit more complicated, since the outcomes of the level-2 model are not directly observed, but a procedure for estimating the skewness and kurtosis of the random effects has been presented (Teuscher, Herrendorfer, & Guiard, 1994).
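The level-1 residual checks described above can be sketched as follows. This is an illustrative example of our own, using simulated data and Python's statsmodels; the model, variable names, and the flagging cutoff are arbitrary choices, not prescriptions.

```python
# A hypothetical sketch: checking level-1 residual normality and the
# homogeneity of residual variance across level-2 units (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
J, n = 30, 25                                 # classrooms, students per class
cls = np.repeat(np.arange(J), n)
x = rng.normal(0, 1, J * n)
u0 = np.repeat(rng.normal(0, 2, J), n)        # random intercept deviations
y = 50 + 3 * x + u0 + rng.normal(0, 5, J * n)
df = pd.DataFrame({"y": y, "x": x, "cls": cls})

res = smf.mixedlm("y ~ x", df, groups=df["cls"]).fit()
df["r"] = res.resid                           # level-1 residuals

# Pooled check: skewness, kurtosis, and a normality test of all residuals.
print(stats.skew(df["r"]), stats.kurtosis(df["r"]))
print(stats.shapiro(df["r"]).pvalue)

# Homogeneity check: residual variance per level-2 unit; flag units whose
# variance is far from the median (the 4x cutoff is purely illustrative).
v = df.groupby("cls")["r"].var()
flagged = v[(v > 4 * v.median()) | (v < v.median() / 4)]
print(flagged)
```

In practice these numerical summaries would be paired with the graphical checks the paper mentions (e.g., normal quantile plots of the pooled residuals, or separate plots per level-2 unit), and any flagged units would be inspected rather than automatically removed.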

13 It is well known that model-based statistical inference is dependent upon the scrupulous attention to the assumed models, which necessarily includes the distributional assumptions underlying a particular model. This is, of course, necessary if a researcher hopes to find a suitable model or models that fit the data well. Although a number of researchers have investigated these issues in the past, according to Ghosh and Rao (1994), the literature on diagnostics for mixed linear models involving random effects is not as extensive as the literature with respect to the treatment of standard regression diagnostics. Recently, however, Jiang (2001) advanced a technique using goodness-of-fit tests to examine the distributional assumptions with regard to mixed linear models. Outliers. As with other statistical methods, researchers should screen their data for outliers. These outlying observations may arise from data entry errors (e.g., a 27 that should have been a 72), an inaccurate assessment of a student (e.g., a 0 used to indicate achievement for an absent student), failure to identify a missing data code (e.g., a missing value entered as a 999), failure to screen out participants who fall outside the inclusion parameters for the study (e.g., a score from a student who was not part of the school during the focal time period for the study), or simply from an individual who is different from the others in the sample. As with other analyses, illegitimate outliers (e.g., data entry mistakes) can distort analyses and should be corrected. With legitimate outliers (e.g., a score that is atypical but truly part of the population being considered), the researcher needs to be aware of their presence and influence on the results. When the influence is substantial, ameliorative strategies may need to be considered. 
Initially the researcher may wish to look for univariate outliers by inspecting box plots or by examining, in standard deviation units, the distance from the mean of the smallest and largest observations. Although these univariate checks are helpful, the researcher should also examine the residuals at each level of the model (e.g., Raudenbush & Bryk, 2002). As an example, consider a study that examined students nested within classrooms. At the first level, one could look for outlying students: individual scores that are far from expectation given the class's regression equation. At the second level, one could look for outlying classes, where an outlying class is one with an atypical regression coefficient. In addition to the examination of residuals, one may wish to consider simulation-based methods (Longford, 2001). Measurement error for predictors and outcomes. Most measures in educational studies contain error. Consequently, it is likely that the predictors and outcome variables used in educational applications of hierarchical modeling will contain measurement error. These errors, if not accounted for, can bias estimates of variance parameters, variance ratios such as the intraclass correlation, fixed effects, and the standard errors of fixed effects (Woodhouse, Yang, Goldstein, &

Rasbash, 1996). Consequently, it is important for educational researchers to consider the reliability of the data used in their applications of hierarchical linear models. In situations where measurement error is anticipated, there are methods for specifying and adjusting for the measurement error (Longford, 1993; Woodhouse et al., 1996). Power. Considerable work has investigated the power of statistical tests of treatment effects in multilevel data. Sample size formulas have been provided for obtaining given levels of power in experiments where the level-2 units have been randomized (Donner, Birkett, & Buck, 1981; Hsieh, 1988). Power calculations are also available through web-based calculators and specialized software. Optimal allocation of units among levels (e.g., fewer large groups versus more small groups) has been considered (Raudenbush, 1997; Snijders & Bosker, 1993). The level of randomization also affects power: randomizing level-2 units yields less power than randomizing level-1 units (Donner, Birkett, & Buck, 1981; Hsieh, 1988). Missing Data. It is not uncommon for missing data to occur on one or more variables within an empirical investigation. Missing data may adversely affect analyses, interpretations, and conclusions. Collins, Schafer, and Kam (2001) indicate that missing data may bias parameter estimates, inflate Type I and Type II error rates, and distort the coverage of confidence intervals. Further, because a loss of data is almost always associated with a loss of information, concerns arise with regard to reductions in statistical power. Unfortunately, researchers' recommendations for managing missing data are not in complete agreement (Guertin, 1968; Beale & Little, 1975; Gleason & Staelin, 1975; Frane, 1976; Kim & Curry, 1977; Santos, 1981; Basilevsky, Sabourin, Hum, & Anderson, 1985; Raymond & Roberts, 1987).
Many studies that have examined missing data treatments are not comparable due to the various methods used, the stratification categories (number of variables, sample size, proportion of missing data, and degree of multicollinearity), and the criteria used to measure effectiveness (Anderson, Basilevsky, & Hum, 1983). Contemporary discussions of missing data and their treatment can be confusing and at times counterintuitive. For example, the term ignorable, introduced by Little and Rubin (1987), was not intended to suggest that some aspect of the missing data could be ignored, but rather to specify the circumstances under which the missing data mechanism can be ignored. Similarly, the phrase missing at random should not be taken to mean that the missingness arises from a random process external or unrelated to the other variables under study (Collins et al., 2001). According to Heitjan and Rubin (1991), missing data can take many forms, and missing values are part of the more general concept of coarsened data. This general category of missing

values results when data are grouped, aggregated, rounded, censored, or truncated, resulting in a partial loss of information. The major classifications of missing data mechanisms are best explained by the relationships among the variables under investigation. Rubin (1987) identified three general processes that can produce missing data. First, data that are missing purely due to chance are considered missing completely at random (MCAR). Specifically, data are missing completely at random if the probability of a missing response is completely independent of all other measured or unmeasured characteristics under examination. Accordingly, analyses of such data will yield unbiased estimates of the population parameters under investigation. Second, data that are classified as missing at random (MAR) do not depend on the missing value itself, but may depend on other variables that are measured for all participants under study. Last, and most problematic statistically, are data missing not at random (MNAR). This type of missingness, also referred to as nonignorable missing data, is directly related to the value that would have been observed for a particular variable. A commonly encountered situation in which data would be classified as MNAR arises when respondents in certain income or age strata fail to respond to questions about income or age. Given the nature of the data typically analyzed using hierarchical linear modeling, it is not surprising that missing data become pertinent to inquiry of this nature (Roy & Lin, 2002). Missing data may occur at the different levels of a model, and the loss of multiple data points across time may be inevitable due to attrition or mortality. It is also not uncommon to face a combination of these challenges when examining longitudinal outcomes.
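The three mechanisms can be made concrete with a small simulation. All values and probabilities below are invented for illustration; the point is only that the observed mean is essentially unbiased under MCAR but biased under MNAR, where large values are disproportionately lost.

```python
import random

random.seed(1)

# Toy illustration of the three missingness mechanisms for an income item.
incomes = [random.gauss(50_000, 15_000) for _ in range(2000)]
ages = [random.choice(["young", "old"]) for _ in incomes]

# MCAR: every response is missing with the same probability.
mcar = [x if random.random() > 0.2 else None for x in incomes]

# MAR: missingness depends only on an observed covariate (age group),
# not on the income value itself.
mar = [x if a == "young" or random.random() > 0.4 else None
       for x, a in zip(incomes, ages)]

# MNAR: high earners are less likely to respond, so missingness depends
# on the value that would have been observed.
mnar = [x if x < 65_000 or random.random() > 0.8 else None for x in incomes]

def observed_mean(xs):
    seen = [x for x in xs if x is not None]
    return sum(seen) / len(seen)

# The MNAR observed mean is biased downward; the MCAR mean is not.
print(round(observed_mean(mcar)), round(observed_mean(mnar)))
```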
The careful researcher must be concerned not only with nonignorable nonresponses but with missing covariates as well (Roy & Lin, 2002). Estimation There is no single agreed-upon way to estimate the parameters in a hierarchical linear model. Several methods of estimation can be employed, including maximum likelihood (ML), restricted maximum likelihood (REML), and Bayesian estimation (Raudenbush & Bryk, 2002; Kreft & de Leeuw, 1998). These methods can be carried out using many different algorithms. For example, ML estimation may be accomplished using the EM algorithm, the Newton-Raphson algorithm, the Fisher scoring algorithm, or iterative generalized least squares (IGLS), while Bayesian estimation may be accomplished using the Gibbs sampler. In addition, these algorithms have been programmed into many different software packages. Thus one researcher may accomplish REML estimation using the EM algorithm programmed into HLM, while another may accomplish REML estimation using restricted iterative generalized least squares (RIGLS)

using MLn, while a third may accomplish REML using the Newton-Raphson algorithm programmed into Proc MIXED within SAS. Maximum likelihood estimation. The principle behind ML estimation is to select the parameter estimates that maximize the likelihood of the data. We consider how likely it is that we would have obtained the data under each of many different values for the fixed effects (γs) and variance parameters (elements in Σ and Γ), and then pick the values for which the likelihood is greatest. This involves an iterative algorithm that steps through possible values until the likelihood reaches its maximum, at which point the algorithm is said to have converged. The goal for computational statisticians is to develop an algorithm that converges fairly quickly across a wide range of applications. If the algorithm meanders through the possibilities too slowly it may not converge in the time allocated, and if it moves too quickly it may miss the maximum and fail to converge. Since the desirable properties of maximum likelihood estimators are not realized when convergence fails, the objective for applied researchers is to select an algorithm that will converge given their data and time constraints. Maximum likelihood estimation is currently available through a variety of algorithms and software packages. It can be accomplished using the EM algorithm (Dempster, Laird, & Rubin, 1977), which is implemented in the software package HLM (Raudenbush, Bryk, Cheong, & Congdon, 2000); the Newton-Raphson algorithm (Lindstrom & Bates, 1988), which is implemented in Proc MIXED (SAS, 2000); the Fisher scoring algorithm (Longford, 1987), which is implemented in VARCL; or iterative generalized least squares (IGLS; Goldstein, 1986), which is implemented in MLn.
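The iterative logic described above can be caricatured in one dimension: Newton-Raphson climbing the log-likelihood of a normal mean with known variance. This deliberately simple stand-in (invented data, one parameter) is not how the packages above are implemented, but the step-until-converged structure is the same.

```python
# One-parameter caricature of iterative likelihood maximization.
data = [4.2, 5.1, 3.8, 6.0, 4.9]

def score(mu):
    # First derivative of the normal log-likelihood in mu (sigma fixed at 1).
    return sum(x - mu for x in data)

def hessian(mu):
    # Second derivative: constant -n for this simple likelihood.
    return -len(data)

mu = 0.0                       # starting value
for _ in range(100):
    update = score(mu) / hessian(mu)
    mu -= update               # Newton-Raphson step uphill
    if abs(update) < 1e-10:    # convergence criterion
        break

print(round(mu, 2))  # converges to the sample mean
```

For this toy likelihood the maximizer is the sample mean; real mixed-model algorithms take analogous steps over many fixed effects and variance parameters at once.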
The EM algorithm has the advantage that it will always converge if given enough time, but the disadvantage that it may take a relatively long time to do so (Draper, 1995). If convergence is reached and the estimated variance/covariance matrices are positive definite (i.e., the variances are positive and the absolute values of the implied correlations do not exceed 1.0), then the estimators have some desirable properties. The estimates of the fixed effects (γs) are unbiased (Kacker & Harville, 1981, 1984), and the estimates of the variance parameters (elements in Σ and Γ) are asymptotically unbiased; that is, the bias disappears as the sample size gets large (Raudenbush & Bryk, 2002). The estimates of the fixed effects and variance parameters also tend to be asymptotically efficient, which implies that when the sample size is large the maximum likelihood estimates show minimum variance from sample to sample (Raudenbush & Bryk, 2002). Finally, as sample size increases the sampling distributions of the estimates become approximately normal, which facilitates the construction of confidence intervals and statistical tests

(Raudenbush & Bryk, 2002). Note that these properties hold for relatively large sample sizes, where what is considered large is heavily influenced by the number of upper-level units. For example, if one studies students who are nested in classes, then many classes must be sampled if one wishes to obtain these desirable properties. Restricted maximum likelihood estimation (REML). In REML, maximum likelihood estimates are obtained for the variance parameters (elements in Σ and Γ). These values are then used to obtain generalized least squares estimates of the fixed effects (γs). The REML estimates of the variance parameters may be considered preferable to ML estimates because REML takes into account the uncertainty in the fixed effects (γs) when the variance parameters are estimated. Since the uncertainty in the fixed effects is more pronounced with smaller sample sizes, one may expect the difference between these methods to be greater when sample sizes are smaller. A few empirical studies have found differences between ML and REML estimates under a variety of conditions (Kreft & de Leeuw, 1998), but these studies do not support a uniform recommendation of one method over the other. As with ML estimates, REML estimates can be obtained from a variety of software packages (e.g., HLM, SAS Proc MIXED, MLn, VARCL) and through a variety of algorithms (e.g., EM, Newton-Raphson, Fisher scoring, and RIGLS), and have been shown to have desirable properties under many conditions. Again under general conditions, the estimates of the fixed effects (γs) are unbiased (Kacker & Harville, 1981, 1984), the estimates of the variance parameters (elements in Σ and Γ) are asymptotically unbiased (Raudenbush & Bryk, 2002), the estimates of the fixed effects and variance parameters are asymptotically efficient (Raudenbush & Bryk, 2002), and as sample size increases the sampling distributions of the estimates become approximately normal (Raudenbush & Bryk, 2002).
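The ML/REML distinction can be seen in miniature in the variance of a single sample: ML divides the sum of squares by n, while REML in effect divides by n − 1 because it accounts for the estimation of the mean (the fixed effect). This textbook analogy is not drawn from the sources cited above; it is only meant to convey why the two methods diverge in small samples.

```python
# Simplest possible case: estimating a variance around an estimated mean.
data = [4.0, 6.0, 5.0, 7.0, 3.0]
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)

ml_variance = ss / n          # ML: biased downward in small samples
reml_variance = ss / (n - 1)  # REML analogue: unbiased

print(ml_variance, reml_variance)
```

With only five observations the two estimates differ noticeably (2.0 versus 2.5); as n grows, n and n − 1 become indistinguishable, mirroring the large-sample agreement of ML and REML.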
Consequently, both ML and REML are often recommended when sample sizes are large. When sample sizes are smaller, and particularly when the data are unbalanced, the behavior of both ML and REML becomes questionable, which may lead researchers to consider alternatives. In addition, inferences about the fixed effects (e.g., confidence intervals for the γs) assume the variance parameters are estimated without error, an assumption that also becomes increasingly questionable when sample sizes are not large. Bayesian estimation. With Bayesian estimation (Lindley & Smith, 1972) one can acknowledge the uncertainty in the estimates of the variance parameters when the fixed effects are estimated. Consequently, Bayesian estimation provides an appealing option for researchers working with smaller data sets. This form of estimation can be accomplished using Markov chain Monte Carlo algorithms such as the Gibbs sampler, which is implemented in the software

BUGS. Although Bayesian estimation is appealing in some circumstances, it also has some drawbacks. Prior distributions must be specified, and this specification may conflict with some researchers' desire not to let prior beliefs influence the results of their analyses (Raudenbush & Bryk, 2002). In addition, the algorithms are not as readily available, having been implemented in only a few software packages, and they are computationally intensive, making them impractical for large data sets. Alternative estimation methods. Since none of the estimation methods is entirely satisfactory across all data conditions that may be encountered in research, statisticians continue to work on the development of alternatives. Bootstrapping has been presented as one option for dealing with the bias in the variance estimates and standard errors that results from using ML or REML estimation with samples that are not large and normal. Bootstrapping is available in MLwiN, and both parametric (Meijer, van der Leeden, & Busing, 1995) and nonparametric (Carpenter, Goldstein, & Rasbash, 1999) versions have been discussed. Another alternative stems from the motivation to restrict the influence of outlying observations. Robust ML and robust REML estimation methods have been proposed and show promise (Richardson & Welsh, 1995), but as far as we know they have not been programmed into readily available hierarchical modeling software. Hypothesis Testing and Statistical Inference The estimation method will produce point estimates of each parameter in the hierarchical model. These point estimates are often valuable in addressing particular research questions, but additional information is often needed to aid the researcher in making inferences. This additional information may take the form of confidence intervals for parameters of interest or hypothesis tests of these parameters.
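One generic route to such intervals and tests is the nonparametric bootstrap mentioned above, which resamples whole level-2 units. The sketch below uses hypothetical data and a simple grand mean in place of a refitted hierarchical model; real implementations (e.g., in MLwiN) refit the full model on each resample.

```python
import random
import statistics

random.seed(7)

# Hypothetical clustered data: four level-2 units of three observations.
clusters = {
    "a": [10, 12, 11], "b": [20, 22, 19],
    "c": [15, 14, 16], "d": [30, 29, 31],
}
ids = list(clusters)

def grand_mean(sample_ids):
    # The statistic of interest; a real analysis would refit the model here.
    values = [v for cid in sample_ids for v in clusters[cid]]
    return sum(values) / len(values)

# Resample whole clusters with replacement and recompute the statistic.
replicates = []
for _ in range(500):
    resample = [random.choice(ids) for _ in ids]
    replicates.append(grand_mean(resample))

# The spread of the replicates serves as a bootstrap standard error.
boot_se = statistics.stdev(replicates)
print(round(boot_se, 2))
```

Resampling clusters rather than individual observations preserves the within-unit dependence that motivates hierarchical modeling in the first place.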
When considering the options available it is convenient to distinguish among inferences about variance parameters (elements in Σ and Γ), inferences about fixed effects (γs), and inferences about the random level-1 coefficients (e.g., β0j). Inferences about variance parameters. A researcher may be interested in creating a confidence interval (CI) for a variance parameter. The simplest approach is to use the standard error of the variance parameter estimate, which is computed from the inverse of the information matrix. By adding and subtracting 1.96 times the standard error of the parameter estimate, one can create a 95% CI, assuming a normal sampling distribution. This approach, however, has limitations, especially when the sample size is small or the variance parameter is near zero (e.g., Littell, Milliken, Stroup, & Wolfinger, 1996; Raudenbush & Bryk, 2002). Under these conditions the variance parameter estimate will tend to have a skewed sampling distribution,

making symmetric intervals based on the standard error unrealistic. Under these conditions researchers should turn to other options, including the Satterthwaite approach (Littell, Milliken, Stroup, & Wolfinger, 1996), bootstrapping (Meijer, van der Leeden, & Busing, 1995; Carpenter, Goldstein, & Rasbash, 1999), a method based on local asymptotic approximations (Stern & Welsh, 2000), and, if the data are balanced, the approach proposed by Yu and Burdick (1995). For researchers wishing to test hypotheses regarding variance components, again a variety of choices are available. The simplest is to conduct a z-test by dividing the estimate by its standard error. Although this approach is asymptotically valid, it, like the standard-error-based CIs noted previously, becomes questionable when the sampling distribution cannot be assumed normal. A more appealing option is the likelihood ratio χ2 test (e.g., Littell, Milliken, Stroup, & Wolfinger, 1996). This test requires the user to estimate two models, one with and one without the questionable variance parameter(s). The difference in the log-likelihoods obtained from these analyses is then used to construct a statistic that in large samples follows a χ2 distribution. Note that this method can be used for single-parameter or multiple-parameter tests. Additional alternatives include an approximate χ2 test described by Raudenbush and Bryk (2002), bootstrapping (Meijer, van der Leeden, & Busing, 1995; Carpenter, Goldstein, & Rasbash, 1999), a likelihood ratio test based on a local asymptotic approximation (Stern & Welsh, 2000), and exact tests that have been established for some contexts (Christensen, 1996; Ofversten, 1993). Finally, it should be noted that in addition to point estimates, confidence intervals, and statistical tests, researchers should consider whether combining variance estimates or forming variance ratios could help to answer the research questions.
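Two of the quantities discussed above are simple to compute once the models have been fit. The log-likelihoods and variance components below are invented for illustration only.

```python
import math

# (1) Likelihood ratio test for a single variance component: estimate the
# model with and without the component and compare log-likelihoods.
loglik_full = -1520.4     # model including the random-slope variance
loglik_reduced = -1524.9  # model without it
lr = 2 * (loglik_full - loglik_reduced)

# Survival function of a chi-square with 1 df via the complementary error
# function (exact for this special case, so no statistics library is needed).
p_value = math.erfc(math.sqrt(lr / 2))

# (2) A variance ratio: the intraclass correlation from the two variance
# components of an unconditional two-level model.
tau00 = 12.5    # between-unit variance (level 2)
sigma2 = 37.5   # within-unit variance (level 1)
icc = tau00 / (tau00 + sigma2)

print(round(lr, 1), round(p_value, 4), icc)
```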
For example, one may be interested in the explained variance (R2) at one or more levels of the model (Snijders & Bosker, 1994), the intraclass correlation (e.g., Raudenbush & Bryk, 2002), or the reliability of estimators (Raudenbush & Bryk, 2002). Those interested in creating confidence intervals for variance ratios are referred to the statistical literature (Lee & Seeley, 1996). As far as we know the methods described there have not been implemented in the hierarchical linear modeling software programs. Inferences about fixed effects. A researcher interested in making inferences about fixed effects may wish to construct confidence intervals for the effects of interest. A 95% CI could be constructed around the point estimate by adding and subtracting 1.96 times the standard error. This of course assumes a normal sampling distribution, which can be demonstrated asymptotically, but which becomes questionable for smaller samples. Consequently, one would

typically substitute for the 1.96 a t-value with ν degrees of freedom. Several methods for defining the degrees of freedom have been given (Giesbrecht & Burns, 1985; Kenward & Roger, 1997), and some software packages (e.g., Proc MIXED) allow different definitions to be specified. An alternative to assuming an approximate t-distribution is to construct the confidence intervals by bootstrapping. Hypothesis tests can also be conducted using t- or F-tests with approximate degrees of freedom. Again, different definitions have been suggested, and thus researchers need to be clear about the method used for obtaining the degrees of freedom for these tests. Several alternatives to these approximate tests have been discussed, including a test based on a Bartlett-corrected likelihood ratio statistic (Zucker, Lieberman, & Manor, 2000), a permutation test (Reboussin & DeMets, 1996), and bootstrapping. Inferences about random level-1 coefficients. Researchers may also be interested in estimating the random level-1 coefficients and making inferences about them. For example, a researcher interested in estimating the effect of prior reading achievement on end-of-year reading achievement may wish to obtain a separate effect estimate for each classroom. One approach would be to estimate the level-1 model separately for each classroom using standard ordinary least squares (OLS) estimation, in which case standard methods are available for constructing confidence intervals and testing hypotheses about the coefficients. The drawback of the OLS approach is that each estimate is based on relatively few observations, only those from the classroom of interest, thus leaving a lot of room for sampling error. An alternative is to obtain empirical Bayes estimates, which consider all the available information.
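In essence, an empirical Bayes estimate is a precision-weighted average of the unit-specific OLS estimate and the overall average. The sketch below uses invented numbers; the "reliability" weight is the ratio of true-slope variance to the total variance of the classroom's OLS slope.

```python
# Empirical Bayes shrinkage of one classroom's OLS slope toward the
# overall average slope.  All quantities are hypothetical.
tau2 = 0.40          # variance of true slopes across classrooms
error_var = 0.60     # sampling variance of this classroom's OLS slope
ols_slope = 1.50     # slope estimated from this classroom alone
grand_slope = 0.90   # average slope across all classrooms

# Reliability: how much of the OLS estimate's variability reflects true
# slope differences rather than sampling error.
reliability = tau2 / (tau2 + error_var)

# The EB estimate pulls the OLS slope toward the grand slope; the less
# reliable the OLS estimate, the stronger the pull.
eb_slope = reliability * ols_slope + (1 - reliability) * grand_slope
print(eb_slope)  # pulled from 1.50 toward 0.90
```

With a reliability of 0.4, the classroom's estimate of 1.50 is shrunk to 1.14, illustrating the bias-for-precision trade described next.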
Empirical Bayes estimates tend to pull the effect estimates toward the overall average estimate by an amount that depends on the uncertainty in the effect estimate being considered and the variability among the effect estimates. This process biases the estimates, but leaves us with values that tend to be closer to the parameter values (i.e., a smaller expected mean squared error) than those based on OLS estimation (Raudenbush & Bryk, 2002). For empirical Bayes estimates the standard errors can be computed and used to create confidence intervals or z-tests of statistical significance. These methods assume a normal sampling distribution, and thus may be unrealistic unless there is a large number of level-2 units (Raudenbush & Bryk, 2002). METHOD Coding Protocol To analyze the articles representing multilevel applications, we developed a coding framework based in large part on the issues identified during the first phase of our review of the

literature. Within each area (i.e., model development and specification, data considerations, estimation, and hypothesis testing and inference), specific questions were devised to guide our review. The current issues and critical questions were organized into a checklist that was refined through a series of pilot tests. In these pilot tests, members of the research team independently analyzed the same application article using the checklist; members then came together as a group to check the consistency of the responses and to discuss coding decisions and possible alterations of the checklist. A codebook was developed during these meetings to document the coding process in more detail. The final version of the checklist, which was used to code each of the articles, is provided as an appendix. Searching Strategies for Applications To describe the current application and reporting of multilevel analyses in the field of education, prominent educational and behavioral research journals were initially selected for examination. We examined the same set of journals used in the methodological research review published in Review of Educational Research by Keselman et al. (1998). It was deemed appropriate to begin with this set of journals because they publish empirical research, represent different subdisciplines in education, and are highly regarded in the fields of education and psychology. Additionally, we relied on the expertise of our research team to identify other well-known publications that might provide similar applications of multilevel modeling. For this phase of our review, all issues of each volume of the chosen journals published from 1998 to 2002 were hand searched for evidence of the use of hierarchical linear modeling techniques. That is, our research team did not rely solely on article titles and abstracts to determine whether to include or exclude a particular study.
Description of the Sample Of the identified articles, 20 have been reviewed at this time. The largest proportion (40%) came from the most recent year considered for this study, 2002 (see Figure 1), and only one study (5%) was from the earliest year considered, 1998. The remainder of the sample (55%) was distributed almost equally over the middle three years, 1999-2001. The sample was drawn from 10 peer-reviewed journals that are fairly prominent in the social sciences (see Table 1). Studies from four of the journals (the American Educational Research Journal, the Journal of Educational Research, the Journal of Personality and Social Psychology, and the Journal of Applied Psychology) accounted for the majority of the sample (60%), with each supplying three studies (15%). Four of the remaining six journals each contributed one article, while the other two journals each contributed two.


Figure 1. Distribution of Sample Studies Based on Year Published

Table 1
Journals and Years of Sample Studies

Journal                                        2002  2001  2000  1999  1998  Total
American Educational Research Journal             0     1     2     0     0      3
Child Development                                 0     0     0     1     0      1
Journal of Educational Psychology                 1     1     0     0     0      2
Sociology of Education                            0     0     1     1     0      2
Journal of Applied Psychology                     2     1     0     0     0      3
Journal of Educational Research                   3     0     0     0     0      3
Journal of Personality and Social Psychology      0     0     1     1     1      3
Reading Research Quarterly                        1     0     0     0     0      1
Cognition and Instruction                         0     0     0     1     0      1
Developmental Psychology                          1     0     0     0     0      1
Total                                             8     3     4     4     1     20

RESULTS Study Characteristics Before turning to the four central issues that were identified as important in the analysis and presentation of multilevel models, we took care to examine a host of characteristics germane to the set of articles examined. For this investigation, we thought that it would be prudent to articulate the types of studies being examined (e.g., individuals nested in contexts versus

repeated measures), the rationale provided by authors for employing HLM methods, the study design, sampling, the average number of units at varying levels of the model, and the description provided regarding the distribution of level-1 units across level-2 units. The studies were typically nonexperimental (85%) and often did not use probability sampling (65%). They covered a wide range of applications. Half the studies used two-level models in which individuals were nested in contexts. Two studies (10%) involved three-level models in which individuals were nested in contexts that were themselves nested in higher-level contexts, while the remaining eight studies (40%) involved repeated measures data. Almost all of the studies (90%) explicitly stated a rationale for using hierarchical modeling, but the level of detail in the rationales varied greatly. The studies also differed widely in the amount of data used in the analysis: the number of level-2 units ranged from 19 to 1406, and the average number of level-1 units per level-2 unit ranged from a low of just over 2 to a high of 160. Model Development & Specification Model development is a central component of inquiry in many disciplines, with notably different approaches evidenced among researchers both within and across disciplines. In our examination of this critical component, we considered a host of aspects related to divergent approaches to the development and specification of multilevel statistical models. A considerable amount of variability was evident in the number of models examined by researchers and in how clearly the number of models was communicated. For example, in only 45% of the articles reviewed were we able to determine with confidence the number of models analyzed. For this subset of articles (n=9), the number of models examined ranged from 4 to 430 with a median of 9 (M = 51, SD = 126).
For the set of published articles that we scrutinized, baseline models (i.e., unconditional models) were frequently investigated as part of data analysis (n=9, 45%). For 11 of the articles, we could not determine with confidence whether baseline models were examined (see Table 2). It was also common to encounter studies that examined more than one set of predictor variables for each of the dependent variables under investigation (n=15, 75%); these studies employed between two and six sets of predictors. In all of the studies that we examined, the predictors were selected based at least partially on a priori considerations, and in most of these cases, strong support was provided by the literature base and empirical research. In six cases there was evidence that predictor variables were selected, in part, on the basis of significance tests for the individual predictors. With respect to the subset of researchers who explored multiple sets of predictors, the exact number of sets could not be accurately determined for approximately 35% of the studies. Further, four of the studies reported level-2 statistical interactions, while nine reported cross-level interactions. During

our examination of how researchers typically specify the covariance structure underlying the data, we observed that for approximately two-thirds of the studies there was no clear discussion of this issue. In these instances, it appeared that software defaults were used in the analyses. Although centering has important implications for interpreting the results of the statistical modeling, 40% (n=8) of the studies provided no discussion of centering at level 1, and 60% (n=12) provided no discussion of centering at level 2. When centering was used at level 1, researchers used grand-mean (30%), group-mean (15%), or other approaches (25%). Other approaches to centering level-1 variables included centering from the last time point, coding from a given point in time, centering from time of loss, or some form of standardization. Grand-mean centering was reported for the eight studies that reported the use of centering at level 2.

Table 2
Model Development and Specification

Characteristic                                                            N    Percent (%)
Examination of baseline models
  Yes                                                                    10        50
  No                                                                      3        15
  Unable to determine                                                     7        35
Selection of predictors: based at least partially on
  A priori considerations                                                20       100
  Significance tests for individual predictors                            6        30
  Effect size for individual predictors                                   1         5
  Fit statistics (e.g., AIC or SBC)                                       0         0
More than one set of predictor variables for each DV
  Yes, but exact number could not be determined with confidence           7        35
  Yes, number of sets of predictors could be determined                   8        40
  Could not be determined with confidence                                 1         5
  No                                                                      4        20
Interactions examined
  Level 1                                                                 1         5
  Level 2                                                                 4        20
  Across level                                                            9        45
  No interactions                                                         6        30
Selection of covariance structure
  Not discussed, unclear, and/or appears that defaults were used         13        65
  Established a priori, prior to looking at the data                      4        20
  Based partially on LRTs or significance tests for individual
    variance components                                                   7        35
  Based partially on fit statistics (e.g., AIC or SBC)                    0         0
Centering
  Level 1
    No discussion of centering                                            8        40
    Grand mean                                                            6        30
    Group mean                                                            3        15
    Other                                                                 5        25
  Level 2
    No discussion of centering                                           12        60
    Grand mean                                                            8        40
Note: Counts may exceed 100% if multiple methods were applied (i.e., selection of predictors, centering, selection of variance structure).

When we critiqued the extent to which models were well communicated, we observed that 35% of the studies did not explicitly communicate the nature and number of the models, yet we were able to glean this information through close scrutiny of the text, tables, and footnotes (Table 3). Only 10% (n=2) provided explicit statements of the number of models examined, while for the remaining 55% we could not determine this information with any degree of confidence. Given the complexity and number of models run, researchers tended to use multiple approaches to reporting the results. The most prominent method of communicating fixed effects was verbal description (n=20), followed closely by lists of estimated effects (n=19) and series of regression equations (70%). The most common methods of communicating the estimated variance structure were lists of parameters (55%) and verbal descriptions (75%). Eight of the studies communicated the variance parameters through equations. None of the articles included matrix representations of these relationships. As researchers we are keenly interested in the extent to which our results are generalizable. To examine the critical issue of generalizability, we considered a broad range of evidence. For example, we looked for both sensitivity analyses and traditional cross-validation methods as evidence of generalizability. Further, we also counted both the replication or extension of previous research and internal replications (e.g., between-group differences).
None of the studies addressed the possibility of capitalizing on chance in model development by employing cross-validation analyses. However, six studies provide evidence of internal replication and three studies provided evidence of sensitivity analysis, while a single study reported replication/extension of previous research.

Table 3
Model Communication

Characteristic                                                      N   Percent (%)
Communication of models presented
  Not explicitly stated, but could be determined from the
  information provided in the text, footnotes, etc.                 7        35
  Explicit statement of the number of models examined               2        10
  Could not be determined with confidence                          11        55
Communication of fixed effects
  Equation representation                                          14        70
  List of estimated effects                                        19        95
  Verbal description                                               20       100
Communication of variance structure
  Equation representation                                           8        40
  List of estimated effects                                        11        55
  Verbal description                                               15        75
Generalizability
  Sensitivity analysis                                              3        15
  Internal replication                                              6        30
  Replication                                                       1         5

Data Considerations
Because inferences in multilevel models are based on an analysis of the covariances between and within the nested units, the consideration of distributional assumptions, outliers, statistical power, and missing data is critical to obtaining credible results. The treatment of these data considerations in the 20 articles reviewed is summarized in Table 4. Despite recent advances in statistical power analysis for multilevel models, none of the studies examined included an explicit discussion of statistical power in the study design or interpretation of results. Similarly, only three of the articles (15%) provided evidence of outlier screening, and only one article described a consideration of the potential impact of measurement error on the resulting models. Conversely, 90% of the studies (n = 18) provided some discussion of missing data, and six of the 17 studies that acknowledged missing data (35%) considered the randomness of the missingness. Details on the treatment of missing data in the analysis were less prevalent. Ten of the studies used listwise deletion for missing data at level 1 and two studies used a simple imputation procedure. For missing data at level 2, eight of the studies used listwise deletion, two used imputation, and two used other procedures (i.e., selecting a proxy variable with less missing data and incorporating a missingness indicator vector in the analysis). Even when the nature of the missingness was discussed, the articles generally provided little insight into the overall impact of the missing data treatment on the resulting estimates.

Multilevel models require assumptions about the errors at each level of the analysis, and a consideration of the tenability of these assumptions is important in assessing the credibility of the results. Of the 20 studies examined, only four (20%) discussed normality of the level-1 residuals and three (15%) discussed variance homogeneity of these residuals. Only two studies provided details about the results of checking these residuals and about corrective action taken. Finally, four studies (20%) mentioned the assumption of residual normality at level 2, but only one of these provided details on the extent to which this assumption was met.
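To make the two most common level-1 treatments tallied above concrete, the following sketch contrasts listwise deletion with simple mean imputation on a toy nested data set of our own invention (the school IDs and scores are hypothetical, not drawn from the reviewed studies):

```python
# Illustrative sketch (toy data, not from the reviewed studies): two common
# treatments of level-1 missing data -- listwise deletion, which drops any
# record with a missing value, and simple mean imputation, which fills each
# missing score with the mean of the observed scores.

records = [
    ("s1", 72.0), ("s1", None), ("s1", 68.0),
    ("s2", 80.0), ("s2", 75.0), ("s2", None),
]  # (school_id, score); None marks a missing score

# Listwise deletion: keep only complete level-1 records.
deleted = [(g, y) for g, y in records if y is not None]

# Simple mean imputation: replace each missing score with the observed mean.
observed = [y for _, y in records if y is not None]
grand_mean = sum(observed) / len(observed)
imputed = [(g, y if y is not None else grand_mean) for g, y in records]

print(len(deleted))   # 4 records survive deletion
print(len(imputed))   # all 6 records retained after imputation
```

As the review notes, the two treatments can yield different samples (and thus different estimates), which is why reporting the treatment matters.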

Table 4
Data Considerations

Characteristic                                      N   Percent (%)
Statistical Power                                   0         0
Discussion of Missing Data                         18        90
Randomness of Missing Data                          6        35¹
Treatment of Missing Data at Level 1
  Listwise Deletion                                10        67²
  Imputation                                        2        13²
Treatment of Missing Data at Level 2
  Listwise Deletion                                 8       100³
  Imputation                                        2        25³
  Other                                             2        25³
Discussion of Outliers                              3        15
Screening for Outliers                              1         5
Treatment of Imperfect Measurement                  1         5
Assumptions
  Level 1 Residual Normality                        4        20
  Level 1 Residual Homogeneity of Variance          3        15
  Level 2 Residual Normality                        4        20

¹ Percent based on 17 papers that acknowledged missing data.
² Percent based on 15 papers that acknowledged Level 1 missing data.
³ Percent based on 8 papers that acknowledged Level 2 missing data.

Estimation and Testing
Each article was examined for details about the analysis performed. In particular, articles were examined for information regarding the software used and the general estimation techniques, including the estimation method, the algorithm, whether convergence problems were encountered, and whether matrices were positive definite. Additionally, studies were examined for details regarding the methods employed for drawing inferences about variance parameters and fixed effects. In general, limited information regarding estimation and testing was provided, including the type of software used. Only 40% of the studies explicitly stated the software used in the analysis (Table 5). Six of these eight indicated the use of a version of Bryk and Raudenbush's HLM software and two indicated the use of SAS. Other available software for these types of analyses (e.g., MLwiN, Mplus) was not explicitly noted.

Table 5
Software Identified for Use in HLM Studies

Characteristic                                             N   Percent (%)   Details
No information about software used                        12        60        —
Information about program used, no information
  regarding date/version                                   2        10        SAS Proc Mixed
Information about program used as well as
  date or version used                                     6        30        HLM

General information about model estimation methods, algorithms, convergence issues, and whether matrices were positive definite was virtually non-existent. As Table 6 illustrates, few of the studies examined thus far provided any information on these issues. Because there are multiple ways to estimate hierarchical models, and evidence that these different methods can lead to different results and potentially to estimation problems, it is important that authors provide detail about the estimation process if other researchers are to critically evaluate or replicate the analyses.

Table 6
Model Estimation Considerations

Characteristic                               N      %
Estimation Method Stated                     3     15
Estimation Algorithm Stated                  0      0
Convergence Addressed                        0      0
Positive Definite Matrices Addressed         0      0
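The sensitivity to estimation method noted above can be seen even in the simplest possible case: for a single sample, the maximum likelihood (ML) variance estimate divides the sum of squares by n, while the restricted maximum likelihood (REML) estimate divides by n - 1, so the two methods need not agree. A minimal sketch with made-up data:

```python
# Minimal sketch (our own illustration, not an example from the reviewed
# articles): ML and REML variance estimates differ even for one sample,
# because ML divides the sum of squared deviations by n and REML by n - 1.

data = [4.0, 6.0, 8.0, 10.0]  # hypothetical observations
n = len(data)
mean = sum(data) / n
ss = sum((y - mean) ** 2 for y in data)

var_ml = ss / n          # maximum likelihood estimate
var_reml = ss / (n - 1)  # restricted maximum likelihood estimate

print(var_ml, var_reml)  # the REML estimate is larger
```

In hierarchical models the analogous gap appears in the variance-component estimates, which is one reason a bare statement of the software default is not enough for replication.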

Estimates of the variances and covariances of model parameters varied across the studies. Variance estimates tended to be provided more often than estimates of the covariance between intercept and slope errors (Table 7). In 75% of the studies, one could not determine whether this covariance had been estimated. The large percentage of articles containing incomplete information was somewhat surprising. For other types of statistical models (e.g., multiple regression or structural equation modeling) it is expected that a complete listing of the estimated parameters will be given. It seems reasonable to expect the same in hierarchical linear modeling, at least for the models that are presented and interpreted.

Table 7
Frequency of Reporting Variance and Covariance Estimates

                                           Estimated but   Insufficient        Not Applicable
                              Provided     Not Provided    Information Given   Since Not Estimated
Error Variance of
  Intercepts                  10 (50%)      8 (40%)         2 (10%)             0 (0%)
Error Variance of Regression
  Coefficient or Slope         7 (35%)      6 (30%)         3 (15%)             4 (20%)
Covariance Between the
  Intercept and Slope Errors   1 (5%)       0 (0%)         15 (75%)             4 (20%)
First Level Error Variance     8 (40%)     12 (60%)         0 (0%)              0 (0%)
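The intercept error variance and first-level error variance tallied in Table 7 are often summarized together as the intraclass correlation, the share of outcome variance lying between groups. A hedged sketch with hypothetical variance components (the values of tau00 and sigma2 below are invented for illustration):

```python
# Hedged sketch with made-up variance components: the intercept error
# variance (tau00) and the level-1 error variance (sigma2) from a fitted
# two-level model combine into the intraclass correlation,
# ICC = tau00 / (tau00 + sigma2).

tau00 = 2.5    # hypothetical between-group (intercept) variance
sigma2 = 7.5   # hypothetical within-group (level-1) variance

icc = tau00 / (tau00 + sigma2)
print(icc)  # 0.25: a quarter of the outcome variance is between groups
```

This is one reason complete reporting of the variance components matters: without both values, a reader cannot recover this basic summary of the nesting.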

The degree to which variance information and fixed-effect information was reported also fluctuated across studies, as did how such information was reported. Table 8 summarizes what type of variance information was reported in the studies and how that information was reported. Significance tests (n = 12) were the most prevalent method of reporting additional information about the variances. Only six of the studies reported the method used for these significance tests (a chi-square analysis), and none of the articles specified the type of chi-square analysis used. Again we observe the tendency not to provide enough detail for thorough critique or replication. None of the studies used confidence intervals to gauge the precision of variance estimates, although four studies (20%) provided such intervals for fixed-effect estimates. For fixed effects, significance tests and point estimates were widely reported (95% and 100% of the time, respectively). However, although information on fixed effects tended to be reported more often than variance estimates, the details of how inferences were made were often not included. Only eight of the studies indicated the type of test used (e.g., t test), and of these none reported the method for determining the degrees of freedom.

Table 8
Additional Information on Variance and Fixed Effects Reported

Characteristic                                           N   Percent (%)
Additional variance information providedᵃ
  None                                                   7        35
  Standard Errors                                        2        10
  Confidence Intervals                                   0         0
  Significance Tests                                    12        60
  Reliabilities                                          2        10
  Intraclass Correlations                                6        30
  Explained Variance                                     5        25
Method used for CIs or Significance Tests for
Variance Parameters
  Not Applicable                                         8        40
  Not Stated                                             6        30
  SE/z-estimate                                          0         0
  Chi-Square                                             6        30
  Other                                                  0         0
Fixed effect information providedᵃ
  None                                                   1         5
  Standard Errors                                       12        60
  Confidence Intervals                                   4        20
  Significance Tests                                    19        95
  Point Estimates                                       20       100
  Other                                                  6        30
Method used for CIs or Significance Tests for
Fixed Effects
  Not Stated                                            12        60
  Likelihood Ratio                                       0         0
  T or F test                                            8        40
  Other                                                  0         0
Level-1 Parameter Information Provided
  None                                                  20       100
  Estimates Provided, Method Not Stated                  0         0
  OLS or EB Estimates                                    0         0
  Statistical Tests for OLS or EB Estimates              0         0
  CIs for OLS or EB Estimates                            0         0
  Other                                                  0         0

ᵃ Counts may exceed 100% if multiple methods were applied.
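The fixed-effect confidence intervals tallied above, which only four studies reported, are straightforward to compute once an estimate and its standard error are in hand. A sketch of a large-sample Wald interval, using invented numbers (the estimate and standard error below are hypothetical, not taken from any reviewed study):

```python
# Sketch of a large-sample 95% Wald interval for a fixed effect:
# estimate +/- z * SE. The numbers are hypothetical illustrations.

estimate = 0.42   # hypothetical fixed-effect estimate
se = 0.10         # hypothetical standard error
z = 1.96          # large-sample 95% critical value

lower, upper = estimate - z * se, estimate + z * se
print(round(lower, 3), round(upper, 3))  # 0.224 0.616
```

In small samples a t critical value with appropriately determined degrees of freedom would replace z, which is precisely the detail the reviewed articles tended to omit.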

CONCLUSIONS AND RECOMMENDATIONS
The results presented here should be viewed as preliminary, and although we offer some conclusions and recommendations, it is important to note that these are offered tentatively at this point. We have reviewed only 20 of the articles published in the selected journals between 1998 and 2002, which is about one quarter of the hierarchical modeling articles published in these journals. After reviewing the remainder of the articles, we will be able to make more precise statements about analysis and reporting practices. It should also be noted that questions could be raised about the reliability of the coding. Each article was reviewed independently by all members of the research team and then discussed at a team meeting, at which time a master checklist was created. There were many items for which 100% agreement was obtained (e.g., was there a statement of the statistical software used?); other items, however, involved greater levels of inference and sometimes led to disagreements (e.g., how many models were estimated?). These disagreements were resolved through discussion, and a codebook was developed to capture the coding process in more detail and to facilitate coding efforts. We anticipate estimating the reliability of all coding decisions for a sample of the remaining articles, and then using these estimates to guide the number of coders used to examine the remaining articles. When the reliability has been estimated and more articles have been coded, it will be possible to offer less tentative conclusions and recommendations. Even keeping in mind the preliminary nature of our results, there seem to be some relatively clear problems in the reporting of HLM analyses.
There is often not enough information for a reader to critique the reported analyses on technical grounds, even when the writers have done an admirable job of discussing the critical methodological issues of sampling, research procedures, and measurement. With this in mind, we suggest the following reporting guidelines for hierarchical modeling.

1. Provide a clear description of the process used to arrive at the model(s) presented. This should include discussion of how the predictors were selected, how the covariance structure was chosen, and a statement of how many models were examined. Readers can more carefully consider the presented models if they understand the process from which the models were generated.

2. Explicitly state whether centering was or was not used, and if it was, provide details on which variables were centered and how they were centered. Without knowledge of centering decisions, readers cannot easily interpret the regression coefficients.

3. Explicitly state whether specification checks were made, whether distributional assumptions were considered, and whether data were screened for outliers. If such checks were made, state both the method used and what was found. Without this sort of information it is easier to question the credibility of the results.

4. State whether the data were complete. If they were not, describe the missingness and attempt to provide insight into its possible effects on the results.

5. Provide details on the analysis methods, including a statement of the software used, the method of estimation, whether convergence was obtained, and whether all variance estimates were admissible. It is important for authors to list the version of the software used in case bugs found later in a specific version call the interpretation into question. The other details are helpful for interpreting the parameter estimates.

6. For any interpreted model, provide a complete list of all parameter estimates. In addition to providing critical information for interpreting the results, this helps to communicate the precise model estimated.

7. Provide either standard errors or interval estimates of the parameters of interest. This recommendation is consistent with the general reporting guidelines provided by the APA Task Force on Statistical Inference (Wilkinson & Task Force on Statistical Inference, 1999). Statistical significance tests provide limited inferential information, and can be difficult to interpret when large numbers of tests have been conducted, as was typical in the reviewed applications.
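To illustrate why the centering decisions in recommendation 2 matter for interpretation, the following sketch (toy data of our own invention) contrasts grand-mean and group-mean centering of a level-1 predictor:

```python
# Sketch of two centering options (toy data, our own illustration):
# grand-mean centering subtracts the overall mean of a predictor;
# group-mean centering subtracts each group's own mean, so the same raw
# score can take different centered values under the two schemes.

data = {"schoolA": [1.0, 3.0], "schoolB": [5.0, 7.0]}  # hypothetical x values

all_x = [x for xs in data.values() for x in xs]
grand_mean = sum(all_x) / len(all_x)  # 4.0

grand_centered = {g: [x - grand_mean for x in xs] for g, xs in data.items()}
group_centered = {
    g: [x - sum(xs) / len(xs) for x in xs] for g, xs in data.items()
}

print(grand_centered["schoolA"])  # [-3.0, -1.0]
print(group_centered["schoolA"])  # [-1.0, 1.0]
```

Because the centered predictors differ, the intercept and slope estimates answer different questions under the two schemes, which is why a bare coefficient table without a centering statement is ambiguous.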
We recognize that it would also be helpful if we provided some concrete guidelines regarding the conduct of hierarchical modeling. Unfortunately, the models are complex, and the methodological decisions regarding their implementation are best made only after careful consideration of a particular application. For example, whether group-mean centering makes for a good recommendation depends on the application being considered, as does whether restricted maximum likelihood is the best method of estimation. We hope that providing reporting guidelines will heighten awareness of the technical issues among researchers and reviewers involved in a particular application. This in turn may lead to careful examination of, and critical dialog about, the issues within the context of the application, which may facilitate improvements in applied practice.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.

Anderson, A. B., Basilevsky, A., & Hum, D. P. (1983). Missing data. In P. H. Rossi, J. D. Wright, & A. B. Anderson (Eds.), Handbook of survey research (pp. 415-494). New York: Academic Press.

Basilevsky, A., Sabourin, D., Hum, D., & Anderson, A. (1985). Missing data estimators in the general linear model: An evaluation of simulated data as an experimental design. Communications in Statistics, 14, 371-394.

Beale, E. M. L., & Little, R. J. A. (1975). Missing values in multivariate analysis. Journal of the Royal Statistical Society, Series B, 37, 129-145.

Bremer, R. H. (1993). Choosing and modeling your mixed linear model. Communications in Statistics – Theory and Methods, 22, 3491-3521.

Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park: Sage Publications.

Burstein, Kim, & Delandshere (1989).

Carpenter, J., Goldstein, H., & Rasbash, J. (1999). A non-parametric bootstrap for multilevel models. Multilevel Modeling Newsletter, 11, 2-5.

Christensen, R. (1996). Exact tests for variance components. Biometrics, 52, 309-314.

Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Dickmeyer, N. (1989). Metaphor, model, and theory in education research. Teachers College Record, 91, 151-160.

Donner, A., Birkett, N., & Buck, C. (1981). Randomization by cluster: Sample size requirements and analysis. American Journal of Epidemiology, 114, 906-914.

Draper, D. (1995). Inference and hierarchical modeling in the social sciences. Journal of Educational and Behavioral Statistics, 20(2), 115-147.

Ferron, J., Dailey, R., & Yi, Q. (2002). Effects of misspecifying the first-level error structure in two-level models of change. Multivariate Behavioral Research, 37, 379-403.

Frane, J. W. (1976). Some simple procedures for handling missing data in multivariate analysis. Psychometrika, 41, 409-415.

Giesbrecht, F., & Burns, J. (1985). Two-stage analysis based on a mixed model: Large-sample asymptotic theory and small-sample simulation results. Biometrics, 41, 477-486.

Gleason, T. C., & Staelin, R. (1975). A proposal for handling missing observations. Psychometrika, 40, 229-252.

Goldstein, H. (1986). Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika, 73, 43-56.

Goodwin, L. D., & Goodwin, W. L. (1985b). Statistical techniques in AERJ articles, 1979-1983: The preparation of graduate students to read the educational literature. Educational Researcher, 14, 5-11.

Ghosh, M., & Rao, J. (1994). Small area estimation: An appraisal. Statistical Science, 9, 55-93.

Guertin, W. H. (1968). Comparison of three methods of handling missing observations. Psychological Reports, 22, 896.

Heitjan, D. F., & Rubin, D. B. (1991). Ignorability and coarse data. Annals of Statistics, 19, 2244-2253.

Hopkins, K. D. (1982). The unit of analysis: Group means versus individual observations. American Educational Research Journal, 19, 5-18.

Hsieh, F. Y. (1988). Sample size formulae for intervention studies with the cluster as unit of randomization. Statistics in Medicine, 7, 1195-1201.

Jiang, J. (2001). Goodness-of-fit tests for mixed model diagnostics. The Annals of Statistics, 29, 1137-1164.

Kackar, R., & Harville, D. (1981). Unbiasedness of two-stage estimation and prediction procedures for mixed linear models. Communications in Statistics – Theory and Methods, A10, 1249-1261.

Kackar, R., & Harville, D. (1984). Approximations for standard errors of estimators of fixed and random effects in mixed linear models. Journal of the American Statistical Association, 79, 853-862.

Kasim, R., & Raudenbush, S. (1998). Application of Gibbs sampling to nested variance components models with heterogeneous within-group variance. Journal of Educational and Behavioral Statistics, 23, 93-116.

Kenward, M., & Roger, J. (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53, 983-997.

Keselman, H. J., Algina, J., Kowalchuk, R. K., & Wolfinger, R. D. (1998). A comparison of two approaches for selecting covariance structures in the analysis of repeated measurements. Communications in Statistics: Simulation, 27, 591-604.

Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., Kowalchuk, R. K., Lowman, L. L., Petoskey, M. D., Keselman, J. C., & Levin, J. R. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 350-386.

Kim, J., & Curry, J. (1977). The treatment of missing data in multivariate analysis. Sociological Methods and Research, 6, 215-240.

Kreft, I. G. G. (1995). Hierarchical linear models: Problems and prospects. Journal of Educational and Behavioral Statistics, 20(2), 109-133.

Kreft, I. G. G., & de Leeuw, J. (1998). Introducing multilevel models. London: Sage.

Kromrey, J. D., & Dickinson, W. B. (1996). Detecting unit of analysis problems in nested designs: Statistical power and Type I error rates of the F test for groups-within-treatments effects. Educational and Psychological Measurement, 56, 215-231.

Lange, N., & Laird, N. M. (1989). The effect of covariance structure on variance estimation in balanced growth-curve models with random parameters. Journal of the American Statistical Association, 84, 241-247.

Lee, Y., & Seely, J. (1996). Computing the Wald interval for a variance ratio. Biometrics, 52, 1486-1491.

Lindley, D., & Smith, A. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society, Series B, 34, 1-41.

Lindstrom, M., & Bates, D. (1988). Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83, 1014-1022.

Little, T. D., Lindenberger, U., & Nesselroade, J. R. (1999). On selecting indicators for multivariate measurement and modeling with latent variables: When "good" indicators are bad and "bad" indicators are good. Psychological Methods, 4, 192-211.

Little, R. J., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley & Sons.

Littell, R., Milliken, G., Stroup, W., & Wolfinger, R. (1996). SAS system for mixed models. Cary, NC: SAS Institute Inc.

Longford, N. (1987). A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika, 74, 817-827.

Longford, N. (1993). Regression analysis of multilevel data with measurement error. British Journal of Mathematical and Statistical Psychology, 46, 301-311.

Longford, N. (2001). Simulation-based diagnostics in random-coefficient models. Journal of the Royal Statistical Society, Series A, 164, 259-273.

Meijer, E., van der Leeden, R., & Busing, F. (1995). Implementing the bootstrap for multilevel models. Multilevel Modeling Newsletter, 7, 7-11.

Ofversten, J. (1993). Exact tests for variance components in unbalanced mixed linear models. Biometrics, 49, 45-57.

Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2, 173-185.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods. Newbury Park: Sage Publications.

Raudenbush, S., Bryk, A., Cheong, Y., & Congdon, R. (2000). HLM 5: Hierarchical linear and nonlinear modeling. Chicago: Scientific Software International.

Raymond, M. R., & Roberts, D. M. (1987). A comparison of methods for treating incomplete data in selection research. Educational and Psychological Measurement, 47, 13-26.

Reboussin, D. M., & DeMets, D. L. (1996). Exact permutation inference for two sample repeated measures data. Communications in Statistics – Theory and Methods, 25, 2223-2238.

Richardson, A., & Welsh, A. (1995). Robust restricted maximum likelihood in mixed linear models. Biometrics, 51, 1429-1439.

Ridgeway, V. G., Dunston, P. J., & Qian, G. (1993). A methodological analysis of teaching and learning strategy research at the secondary school level. Reading Research Quarterly, 28, 335-349.

Roy, J., & Lin, X. (2002). Analysis of multivariate longitudinal outcomes with nonignorable dropouts and missing covariates: Changes in methadone treatment practices. Journal of the American Statistical Association, 97, 40-52.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons.

Santos, R. (1981). Effects of imputation on regression coefficients. Proceedings of the Section on Survey Research Methods, American Statistical Association, 141-145.

SAS Institute Inc. (2000). SAS/Proc Mixed (Version 8) [Computer program]. Cary, NC: SAS Institute Inc.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Seltzer, M., Novak, J., Choi, K., & Lim, N. (2002). Sensitivity analysis for hierarchical models employing t level-1 assumptions. Journal of Educational and Behavioral Statistics, 27, 181-222.

Shimodaira, H. (1998). An application of multiple comparison techniques to model selection. Annals of the Institute of Statistical Mathematics, 50, 1-13.

Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 23, 323-355.

Snijders, T., & Bosker, R. (1993). Standard errors and sample sizes for two-level research. Journal of Educational Statistics, 18, 237-259.

Snijders, T., & Bosker, R. (1994). Modeled variance in two-level models. Sociological Methods and Research, 22, 342-363.

Stern, S., & Welsh, A. (2000). Likelihood inference for small variance components. Canadian Journal of Statistics, 28, 517-532.

Teuscher, F., Herrendorfer, G., & Guiard, G. (1994). The estimation of skewness and kurtosis of random effects in the linear model. Biometrical Journal, 36, 661-672.

Wada, Y., & Kashiwagi, N. (1990). Selecting statistical models with information statistics. Journal of Dairy Science, 73, 3575-3582.

Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Wolfinger, R. (1993). Covariance structure selection in general mixed models. Communications in Statistics – Simulation, 22, 1079-1106.

Woodhouse, G., Yang, M., Goldstein, H., & Rasbash, J. (1996). Adjusting for measurement error in multilevel analysis. Journal of the Royal Statistical Society, Series A, 159, 201-212.

Yu, Q., & Burdick, R. (1995). Confidence intervals on variance components in regression models with balanced (Q-1)-fold nested error structure. Communications in Statistics – Theory and Methods, 24, 1151-1167.

Zucker, D., Lieberman, O., & Manor, O. (2000). Improved small sample inference in the mixed linear model: Bartlett correction and adjusted likelihood. Journal of the Royal Statistical Society, Series B, 62, 827-838.


Title: ___________________________________________________________________________________
Author(s): _______________________________________________________________
Journal, Year, Vol (Number), pgs: ___________________________________________
Study Characteristics (place holder)
Is there an appendix provided with technical details? ______ Yes ______ No

1. What best describes the study type?
   ____ a. individuals nested in contexts
   ____ b. growth curves
   ____ c. individuals nested in contexts within contexts
   ____ d. growth curves nested within contexts
   ____ e. other, describe: ___________________________
   Page(s):

2. Is a rationale (and/or advantage(s)) provided for using HLM methods in the study?
   ____ a. no
   ____ b. yes: _____________________________________
   Page(s):

3. What is the study design?
   ____ a. nonexperimental
   ____ b. experimental
   Page(s):

4. Thoroughly describe the data set, including scope (national, etc.) if known.
   Data set: _________________________________________
   Page(s):

5. What type of sampling was used?
   ____ a. nonprobability
   ____ b. probability
   ____ c. mixed—describe: ___________________________
   Page(s):

6. How many level 1 units per level 2 unit? (e.g., average number of students per school in nested designs, number of observations per student in growth curve designs) ________________
   Page(s):

7. How many level 2 units? (e.g., number of schools in a nested design, number of students in a growth curve design, etc.) __________________
   Page(s):

8. How well was the distribution of level 1 units across level 2 units addressed?
   ____ a. not described
   ____ b. minimal description, e.g., only average number of students per school or only number of observations per student, with no further information
   ____ c. described partially, e.g., one may know the maximum number of observations per student
   ____ d. described fully, so that it is clear how many level 1 units there are for each level 2 unit
   Page(s):

Comments/Notes

9. Fill out the following tables by listing the outcome(s) and predictors modeled.

Outcome table (one row per outcome; columns: Outcome, Type, Nature, Normality, Outliers, Reliability, Validity):
   Type:        1 = achievement; 2 = other (specify)
   Nature:      1 = continuous; 2 = binary; 3 = count; 4 = ordinal; 5 = multinomial
   Normality:   0 = not discussed; 1 = normal; 2 = nonnormal
   Outliers:    0 = not discussed; 1 = no; 2 = yes
   Reliability: 0 = not discussed; 1 = estimated from data set; 2 = other
   Validity:    0 = not discussed; 1 = validity evidence gathered using this data; 2 = other

Predictor table (one row per predictor; columns: Predictor, Nature, Normality, Outliers, Reliability, Validity):
   Nature:      1 = continuous; 2 = binary; 3 = count; 4 = ordinal; 5 = multinomial
   Normality:   0 = not discussed; 1 = normal; 2 = nonnormal
   Outliers:    0 = not discussed; 1 = no; 2 = yes
   Reliability: 0 = not discussed; 1 = estimated from data set; 2 = other
   Validity:    0 = not discussed; 1 = validity evidence gathered using this data; 2 = other

MODEL SPECIFICATION

10. How many models are examined in the study? ____________________________________________
    Page(s):

11. How well were the number of models presented in this study communicated?
    ____ a. not explicit, but can be determined from information provided in text, tables, footnotes, etc.
    ____ b. explicit statement(s) of number of models examined
    ____ c. cannot be determined with confidence
    Page(s):

12. Were baseline models run?
    ____ a. no
    ____ b. yes
    ____ c. cannot be determined with confidence
    Page(s):

13. How were the predictors selected? NOTE: If different methods were used for different models in the study, please list all methods used.
    ____ a. not discussed or unclear
    ____ b. based at least partially on a priori considerations
    ____ c. based at least partially on significance tests for individual predictors
    ____ d. based at least partially on effect size for individual predictors
    ____ e. based at least partially on fit statistics like AIC or SBC
    ____ f. other: ____________________________________
    Page(s):

14. Was there more than one set of predictors for each dependent variable?
    ____ a. no
    ____ b. yes, but exact number of sets could not be determined
    ____ c. yes, number of different sets of predictors: _________
    ____ d. cannot be determined with confidence
    Page(s):

15. Were interactions examined and communicated? Check ALL that apply.
    ____ a. no
    ____ b. yes, level 1 interaction(s)
    ____ c. yes, level 2 interaction(s)
    ____ d. yes, across-level interaction(s)
    ____ e. cannot be determined with confidence
    Page(s):

16. How was the covariance structure of the model(s) specified? NOTE: If different methods were used for different models in the study, please list all methods used.
    ____ a. not discussed, unclear, and/or appears defaults were used
    ____ b. established prior to looking at the data
    ____ c. based partially on fit statistics like the AIC or SBC
    ____ d. based partially on LRTs or significance tests for individual variance components
    ____ e. other: ____________________________________
    Page(s):

17. Was there centering of variables at level 1?
    ____ a. no discussion of centering
    ____ b. no centering
    ____ c. grand mean centering
    ____ d. group mean centering
    ____ e. other: ___________________________________ (e.g., growth curve centered at last point)
    Page(s):

18. Was there centering of variables at level 2?
    ____ a. no discussion of centering
    ____ b. no centering
    ____ c. grand mean centering
    ____ d. group mean centering
    ____ e. other: ___________________________________ (e.g., growth curve centered at last point)
    Page(s):

Comments/Notes

19. How were the fixed effects (regression coefficients) in the model communicated (check all that apply)?
_____ a. series of regression equations
_____ b. single mixed model equation
_____ c. list of estimated effects
_____ d. verbal description
Page(s):

20. How were variance structures in the model communicated (check all that apply)?
_____ a. not mentioned
_____ b. equation representation
_____ c. list of estimated variance parameters
_____ d. verbal description
_____ e. other: __________________________________
Page(s):

21. Is there evidence of generalizability?
_____ a. no
_____ b. sensitivity analysis
_____ c. cross-validation
_____ d. replication/extension of previous research (explicit)
_____ e. internal replication, e.g., between group differences
_____ f. other: ______________________________________
Page(s):

DATA

22. Was power considered?
_____ a. no discussion of power
_____ b. general guidelines considered
_____ c. power analysis conducted
_____ d. other: __________________________________
Page(s):

23. Was there missing data?
_____ a. no missing data (skip to #27)
_____ b. no discussion of completeness of data (skip to #27)
_____ c. missing data noted at level 1, e.g., attrition, absence during testing, failure to complete instruments, etc.
_____ d. missing data noted at level 2, e.g., attrition, absence during testing, failure to complete instruments, etc.
_____ e. other: ___________________________________
Page(s):

24. If missing data were discussed, were relationships among missingness and other variables discussed?
_____ a. no
_____ b. yes __________________________________________
Page(s):

25. If there was missing data at level 1, how was it handled?
_____ a. not applicable
_____ b. not discussed
_____ c. listwise deletion
_____ d. imputation. Type: _______________________
_____ e. other: __________________________________
Page(s):

26. If there was missing data at level 2, how was it handled?
_____ a. not applicable
_____ b. not discussed
_____ c. listwise deletion
_____ d. imputation. Type: _______________________
_____ e. other: __________________________________
Page(s):

27. Were outliers present?
_____ a. not discussed
_____ b. no
_____ c. yes
Page(s):

28. What method was used to screen for outliers?
_____ a. not discussed
_____ b. can't tell
_____ c. univariate methods
_____ d. simulation diagnostics
_____ e. residuals
_____ f. other: _______________________________________
Page(s):

Comments/Notes

29. How was imperfect measurement handled?
_____ a. not discussed
_____ b. consequences considered
_____ c. other: _______________
Page(s):

30. Using the chart, indicate if distributional assumptions of the model were considered.
Considered: 0 = not discussed, 1 = considered
Evidence of Violation: 0 = not discussed, 1 = examined and no violation found, 2 = examined and violation found
Action Taken (if violated): 0 = ignored, 1 = consequences considered, 2 = corrective action taken
Assumptions:
Level-1 residuals ~ N
Level-1 residuals have equal variance for each level-2 unit
Level-2 residuals ~ N

ESTIMATION AND TESTING

31. What software package/version was used? Please list name and version or year
_____ a. not given
_____ b. package stated
_____ c. package and version or year stated
Software Information: __________________
Page(s):

32. What method of estimation was used?
_____ a. not given
_____ b. given: _____________________________________
Page(s):

33. What estimation algorithm was used?
_____ a. not given
_____ b. given: _____________________________________
Page(s):

34. Were any convergence problems encountered?
_____ a. not mentioned
_____ b. no
_____ c. yes __________________________________________
Page(s):

35. Were any of the covariance matrices not positive definite?
_____ a. not mentioned
_____ b. no
_____ c. yes __________________________________________
Page(s):

36. For which variance/covariance parameters were estimates provided?

36-a. error variance of the intercepts (τ00)
_____ a. not applicable since not estimated
_____ b. provided
_____ c. estimated but not provided
_____ d. insufficient information provided
Page(s):

36-b. error variance of each regression coefficient or slope (e.g., τ11, τ22)
_____ a. not applicable since not estimated
_____ b. provided
_____ c. estimated but not provided
_____ d. insufficient information provided
Page(s):

Comments/Notes

36-c. covariance between the intercept and slope errors (e.g., τ12, τ23)
_____ a. not applicable since not estimated
_____ b. provided
_____ c. estimated but not provided
_____ d. insufficient information provided
Page(s):

36-d. first level error variance (typically σ2, but could be both σ2 and ρ)
_____ a. provided
_____ b. estimated but not provided
Page(s):

37. What additional variance parameter information is provided? Check all that apply
_____ a. none
_____ b. SEs
_____ c. confidence intervals
_____ d. significance tests
_____ e. reliabilities
_____ f. ICCs
_____ g. explained variance
Page(s):

38. If CIs or significance tests were reported for variance parameters, what method was used?
_____ a. not applicable
_____ b. not stated
_____ c. SE/z-estimate
_____ d. χ2, type (if given): ____________________
_____ e. exact, bootstrap, other: __________________
Page(s):

39. What fixed effect parameter information is provided? Check all that apply.
_____ a. none
_____ b. SEs
_____ c. confidence intervals
_____ d. significance tests
_____ e. point estimates __________________________
_____ f. other (e.g., effect size): _____________________
Page(s):

40. If CIs or significance tests were reported for fixed effects, what method was used?
_____ a. not stated
_____ b. likelihood ratio
_____ c. t or F test, degree of freedom method NOT stated
_____ d. t or F test, degree of freedom method IS stated
_____ e. permutation, bootstrap, other
Page(s):

41. What level-1 parameter information is provided? Please check all that apply.
_____ a. none
_____ b. estimates provided, but method not stated
_____ c. OLS or EB estimates
_____ d. statistical tests for OLS or EB estimates
_____ e. CIs for OLS or EB estimates
_____ f. other (e.g., just equations): ___________________
Page(s):

42. Is there something else in this study that needs to be ‘captured’ that hasn’t been addressed in other items?
____________________________________________
____________________________________________

Hierarchical Linear Modeling: A Review of Methodological Issues and Applications

John Ferron Melinda R. Hess Kris Y. Hogarty Robert F. Dedrick Jeffery D. Kromrey Thomas R. Lang John Niles

University of South Florida

Paper presented at the 2004 annual meeting of the American Educational Research Association, San Diego, April 12-16.

2 Hierarchical Linear Modeling: A Review of Methodological Issues and Applications INTRODUCTION Researchers in education and many other fields (e.g., psychology, sociology) are frequently confronted with data that are hierarchical or multilevel in nature. For example, in the context of school organizations, students are nested in classes, classes are nested in schools, schools are nested in school districts, etc. In longitudinal research, repeated observations are nested within individuals (i.e., units) and these individuals may be nested within groups. The pervasiveness of multilevel data has led to a proliferation of statistical methods, referred to under a number of names including hierarchical linear modeling (HLM, Bryk & Raudenbush, 1992), multi-level modeling, mixed linear modeling, or growth curve modeling, and a parallel increase in the number of applications of these methods to educational problems. The complexity of these multilevel methods provides potential for misuse and confusion, which may act as barriers for applied researchers attempting to use these methods. Careful consideration of the methodological nuances of multilevel analyses is critical, as misuses may result in statistical artifacts that may potentially influence statistical inference and cloud interpretation. A close examination of some of the more common methods employed across various disciplines, as well as an exploration of recent research trends can serve to both inform the practice of research as well as to broaden our understanding of the various methodologies and critical issues facing practitioners. Given this context, it is appropriate to pause and critically analyze the methodological issues inherent in multilevel modeling techniques. Similar methodological reviews have been conducted with more commonly used techniques such as analysis of variance, multivariate analysis of variance, analysis of covariance, path analysis, and structural equation modeling. Keselman et al. 
(1998) note that one consistent finding of methodological research reviews is that a substantial gap often exists between the methods that are recommended in the statistical research literature and those techniques that are actually practiced by applied researchers (Goodwin & Goodwin, 1985b; Ridgeway, Dunston & Qian, 1993). Methodological reviews can serve to identify issues, controversies, and current trends as well as provide direction to applied researchers. In addition, these reviews may help bridge the gap between statistical theory and application.

3

Purpose & Intended Audience The purpose of this review is to provide an overview of the methodological landscape and the critical issues surrounding multilevel modeling and to report on the current application and reporting of multilevel analyses in education and related fields. Because a single review cannot include every methodological consideration or technical nuance, it is important to clarify that our intent is to focus primarily on what might be termed ‘traditional’ hierarchical linear modeling. That is, we consider linear models of continuous outcomes where the random effects are assumed normally distributed. This allows consideration of applications where individuals are nested in contexts (e.g., students nested in schools), and applications where observations are nested within individuals (e.g., growth curve models). Models in which the outcome is represented by binary, count, or ordinal data are not considered (see Raudenbush & Bryk, 2002 for discussions of these types of applications). Nor did we venture beyond these modeling techniques to explore related methods such as multilevel structural equation modeling (SEM), hierarchical linear measurement modeling or applications of item response theory (IRT). This review is intended to be useful to several distinct audiences in the research arena. For example, practitioners whose inquiries in applied areas include multilevel models should benefit from this treatment of the published literature in the field. A careful consideration of one’s research methods, designs and habits of reporting in contrast to those evidenced in the field in general will tend to suggest areas for refinement of techniques and serve as a source for professional reflection. Similarly, manuscript reviewers and journal editors who serve the critical roles of both gatekeepers and navigators for authors in the reporting of research results should gain from this examination of multilevel models. 
A critical appraisal of the strengths and limitations in the reporting of results from such models is intended to sharpen the eyes of these scholars and lead to improvements in the corpus of research to be published. In addition, research methodologists and others who serve as technical consultants in large-scale research projects may glean additional distinctions and insights from the technical issues raised in this paper. A tremendous amount of methodological work is needed to advance our understanding of the limits and possibilities of hierarchical models, and this review of the published literature highlights several areas needing focused attention. Further, university professors and teachers of research methods and applied statistics may find that the issues raised here and the resources cited will hone their craft and enhance the content of their curricula. Finally, students of educational research (the scholars of tomorrow) will find a readable introduction to hierarchical models, together with citations of sources that will provide logical entries into the growing body

4 of technical work in this area. Our treatment of hierarchical models as they are ‘practiced’ by those publishing in education and related fields is intended to both guide and inspire researchers at multiple levels of experience and interest. Organization of the Review To provide an overview of the methodological nuances and critical issues surrounding multilevel modeling, literature in an array of disciplines (education, medicine, public health, psychology, business, chemistry, physics, biology, statistics and math) was reviewed. Electronic databases (e.g., ERIC, PsychInfo, etc.) were searched using a variety of keywords (e.g., hierarchical linear modeling, mixed linear models, nested designs, and multilevel modeling) and additional articles were gleaned from the reference lists of select articles. Lastly, key issues were identified from the reference list of pivotal articles, by exploring technical software manuals, and by monitoring online conversations on certain Listservs. Based on these reviews, four broad issues were identified. These issues are explored during the initial phase of this paper and served as the framework for the coding protocol used to analyze the hierarchical linear modeling applications in the literature. These issues include: (a) model development and specification, which include issues of centering, selection of predictors, specification of covariance structure, fit indices, generalizability and checks on specification; (b) data considerations including distributional assumptions, outliers, measurement error for predictors and outcomes, power, and missing data; (c) estimation procedures including maximum likelihood, restricted maximum likelihood, Bayesian estimation, and alternative procedures such as bootstrapping; and (d) hypothesis testing and statistical inference including inferences about variance parameters and fixed effects. 
Following the explication of these issues, we describe the development of the framework and coding protocol used as a lens for our analysis of the literature and the search strategies employed in this review. The results of our review of applications are then presented, along with recommendations for improved reporting practice. Critical Issues in Hierarchical Modeling Model Development & Specification Model development is a central component of inquiry in many disciplines, with notably different approaches evidenced among researchers both within and across disciplines. According to Dickmeyer (1989), a patchwork of styles and worldviews with respect to model development exists in the educational research community. Models are typically built to allow researchers to test theories or hypotheses, to manipulate and test changes in simplified systems, and to allow for the exploration of relationships between variables that in some way characterize

5 a complex system. Support for various models may emerge from the extant literature and research in a particular field. Other models may be the product of developing theories employing exploratory types of analyses utilizing more data driven approaches. To aid discussion of the particular specification issues confronted by those using hierarchical models we first review the basic notation and terminology in the context of an application. Consider a team of educational researchers who wish to study the relative effectiveness of two reading programs. These researchers randomly assign participating classrooms to one of the two programs, and gather reading comprehension data both prior to and following the use of the program. A level-1 model could be developed to model the reading comprehension of students within a class as a function of their reading comprehension prior to the study. More specifically,

yij

= β0 j +

β1 j prior readingij

+ rij

(1)

where yij is the reading comprehension of the ith child in the jth classroom, β0j is the intercept of the regression equation predicting reading comprehension at the end of the study in the jth classroom, β1j is the regression coefficient indexing the relationship between reading comprehension at the end of the study and reading comprehension before the study in the jth classroom, and rij is the error, which is assumed to be normally distributed with a covariance of Σ. A level-2 model could then be used to examine the relative effectiveness of the two programs, and whether the relative effectiveness of the two programs depended on the prior levels of reading comprehension. The level-2 model would use program to predict the intercepts and slopes of the level-1 model.

β0 j β1 j

= γ 00 = γ 10

+ γ 01 program j + γ 11 program j

+ u0 j + u1 j ,

(2) (3)

where programj is a dummy coded variable indicating whether the jth classroom received Program A (coded 0) or Program B (coded 1), and u0j and u1j are the errors, which are assumed to be normally distributed with a covariance of Γ. Although models are frequently described by using equations for each level, it is possible to combine all the equations into one. By substituting the second level model for β0j and β1j in the first level model, a combined model is obtained,

yij = γ 00 + γ 01 program j + γ 10 prior readij + γ 11 program j * prior read ij + u0 j + u1 j prior read ij + rij (4) In the combined model it becomes clear why γ 11 is referred to as a cross-level interaction. It should also be noted that the combined model has the same form as the mixed linear model,

6

y = Xγ + Zu + ε ,

(5)

where y is a vector of outcome data, γ is a vector of fixed effects, X and Z are known model matrices, u is a vector of random effects, and ε is a vector of errors (Henderson, 1975). Centering. Centering of the level-1 and level-2 predictors has important implications for interpreting the results and therefore is an important consideration in specifying the statistical model. In our example, suppose the level-1 predictor, prior reading comprehension, was measured on a scale ranging from 200 to 800. If the predictor was kept in its natural metric, γ 00 would be the predicted reading comprehension for a student in Program A (coded 0) who had a prior reading comprehension of zero. Since a prior reading comprehension of zero is not possible, the coefficient is difficult to interpret. The effect of the program, γ 01 , would be interpreted as the difference in the effectiveness of the two programs when prior reading comprehension was zero. Again, a value that is not particularly informative. Prior reading comprehension would need to be scaled or centered to make the interpretation of the coefficients more meaningful. Although centering is used outside the context of multilevel modeling, it is particularly important in multilevel modeling because the level-1 coefficients become outcomes to be explained in higher level models (e.g., level 2). One approach to scaling the predictor variable is to subtract the grand mean of the predictor variable from each score ( xij - x⋅⋅ ). Using grand mean centering with our example, γ 00 would become the predicted reading comprehension for a student from Program A, who had a mean level of prior reading comprehension. The effect of the program, γ 01 , would become the difference in the effectiveness of the two programs for students with a mean level of prior reading comprehension. A second approach to scaling the predictor variable is to subtract the level-2 unit mean of the predictor variable from each score ( xij - x⋅ j ). 
Using group-mean centering with our example,

γ 00 , would become the predicted reading comprehension for a student in Program A, who’s prior reading comprehension was at the mean of her class. The effect of the program, γ 01 , would become the difference in effectiveness of the two programs for students who are at their classroom’s mean level of prior reading comprehension. A third approach to scaling a predictor variable is to subtract a theoretically meaningful value from each score ( xij - Specific Value). This approach is similar to grand-mean centering in that a constant is subtracted from each score. The β 0 j is interpreted as the expected outcome for

7 individuals who score at the specific value that has been set by the researcher. For example, in a growth curve model examining change in achievement from grades 6, 7, 8, a research may center the grade predictor at grade 8. In this case, β 0 j is interpreted as the expected value of the outcome for a student in 8th grade. In the case of level-2 predictor variables, Raudenbush and Bryk (2002) have noted that the choice of the scale metric (e.g., natural metric, grand mean centered) is less critical. However, when interaction terms are included at level-2, Raudenbush and Bryk (2002) have suggest that grand mean centering has the advantage of reducing multicolinearity. Selection of predictors. The selection of predictors is a critical aspect of the design of a study. According to Little, Lindenberger and Nesselroade (1999) the issue of variable selection is directly related to the quality of the research design and the value of the results. In hierarchical modeling, variable selection can be complicated since predictors can be selected for each level of the model, and interactions between predictors can be considered at either level or across levels. In addition, the process of variable selection can take many forms. In some instances, the selection of predictors is established prior to looking at the data, while in others the data help guide selection decisions. Inclusion may be based partially on significance tests, effect sizes, or fit indices. In the reading comprehension example, one can imagine the researchers using the research goals and a priori considerations to select prior reading achievement as a predictor at the first level, and program as a predictor at the second level. Interpreting the regression coefficients is influenced by the degree to which one believes the regression coefficients are unbiased estimates of effects. In our illustrative example, γ 01 can be interpreted as a program effect. 
Since classrooms were randomly assigned to programs, we would anticipate that program was not related to any other classroom level variables, and consequently we would anticipate an unbiased estimate of effect. If we had not randomly assigned classrooms to program, our ability to argue γ 01 was an unbiased estimate of effect would depend on our ability to argue that all relevant variables were included in the model, or that the set of predictors for the model was correctly specified. Researchers who wish to include all relevant variables, but who are unsure if particular variables need to be included, may let the data help them decide which variables to include. For example, a researcher may start with a model like we have previously described and then add variables one at a time, keeping only those that are statistically significant. Other researchers may start with a fuller set of variables, and eliminate ones that do not seem to be affecting the

8 results. When the data are used to guide the selection of predictors the researcher increases the odds of capitalizing on chance, which heightens the need for replication. Specification of covariance structure. The most notable difference between hierarchical models and more common regression models is that hierarchical models have more error terms and consequently more flexibility in defining the covariance structure. This greater flexibility leads to two distinct advantages. First, it allows researchers to be more flexible in the questions they ask about the covariance structure. In applications like our example, researchers can ask questions about the degree to which the outcome varies within classrooms relative to the degree to which it varies across classrooms. In growth curve modeling applications, researchers can ask to what extent initial levels (intercepts) vary across participants and to what extent growth rates (slopes) vary across participants. Second, the degree to which the standard errors for the regression coefficients are unbiased depends on the degree to which the covariance structure is correctly specified. Having the flexibility to model a more complex covariance structure improves the chances of correct specification, which leads to better estimates of the standard errors of the regression coefficients, which in turn leads to more accurate confidence intervals and/or more valid statistical tests. The covariance structure for the first-level model, Σ, is often assumed to be σ2I in applications where students are nested in contexts. In repeated measures contexts this assumption is more questionable since errors that are close in time may be correlated. A wide variety of alternative structures have been discussed including first-order autoregressive, banded, unstructured, toeplitz, banded toeplitz, and first-order autoregressive plus a diagonal (Wolfinger, 1993). 
With so many options available, researchers are left with questions about how to best specify the covariance structure of the first-level model. Questions also arise as to how to best specify the covariance structure of the second-level model. In the previous example, which is relatively simple, there are alternative specifications for Γ depending on whether one wanted to let both intercepts ( β 0 j ) and slopes ( β1 j ) randomly vary and whether or not one wanted to allow for covariance between the errors in predicting the intercepts and slopes. As the number of predictors in the first-level model increases the potential size of the Γ matrix also increases. With more elements there are more variance parameters that could be estimated. The question becomes which coefficients should be allowed to randomly vary, and if the answer is more than one, which of the possible covariances between errors should be estimated.

9 One could generally divide the covariance parameters into three categories: those that are assumed to be zero and not estimated, those that are assumed to be non-zero and thus estimated, and those that the researcher is less sure about. If researchers routinely leave out all questionable variance parameters, they run the risk of leaving out needed parameters, biasing their standard errors, and jeopardizing their inferences. If researchers routinely add in all questionable parameters, they may estimate a model that is overly complex, which increases the chance that they will encounter estimation problems. For example the estimation may not converge or a variance component may be inadmissible (e.g., a variance less than zero or a covariance that implies the correlation would exceed 1.0). Even when estimation seems smooth, estimating many parameters that are equal to zero will negatively affect the precision in estimating the other parameters in the model. Consequently, one would ideally only estimate the needed parameters. Fit Indices. With the growing recognition of the importance of the selection of an appropriate covariance structure, several methods have been developed that allow researchers to use the data to help make decisions about which covariance structure to estimate. As it is often not possible to know the underlying structure in advance, researchers will often examine multiple structures and rely on fit indices to select among possible covariance structures (Singer, 1998; Wolfinger, 1993). Among the indices commonly used are Akaike’s Information Criterion (Akaike, 1974) and Schwartz’s Bayesian Criterion (Schwartz, 1978). Akaike’s Information Criterion is given by:

AIC = log(L) – q

(6)

where q is the number of covariance parameters. Schwartz’s Bayesian Criterion is given by:

SBC = log(L) – (qlog(N – p))/2

(7)

Both AIC and SBC start with the log likelihood value and then penalize for the number of covariance parameters estimated, with SBC employing a stiffer penalty. For each of these indices values closer to zero represent better fit, so typically the model with the value closest to zero is selected. This approach, however, does not always lead to identification of the correct covariance model, especially when data are somewhat limited. For example, with repeated measures data, it is difficult to correctly select the covariance structure when the series length is short (Ferron, Dailey, & Yi, 2002; Keselman, Algina, Kowalchuk, & Wolfinger, 1998). Furthermore,

10 misspecification can affect estimation and inference (Ferron, Dailey, & Yi, 2002; Lange & Laird, 1989). Other work in this area has focused not on a single estimate, but rather a ‘confidence set of models’ (Shimodaira, 1998). Instead of using the minimum AIC, Shimodaira proposed the use of an ‘interval’ estimate of the best model. This author was quick to note that the confidence set approach in not intended to replace the use of an obtained point estimate of the minimum AIC, but rather provides supplemental information on model selection. This approach employs a series of pairwise analyses in which a standardized difference of AIC is calculated for every pair of models. Potential models are compared to the model evidencing the best estimate of sample fit, and those models that are not observed to differ by a statistically significant amount become part of a set of models for consideration. Generalizability and Sensitivity. The degree to which the findings of a particular analysis are generalizable, as well as how sensitive the findings are to characteristics of the data, should be a concern of all researchers. Limitations of a particular sample, the nature of the data, as well as techniques employed, all impact the breadth and depth of the inferences made. These issues can, at least in part, be examined to help ascertain the strength of findings using a variety of statistical methods and techniques. Such techniques include cross-validation, sensitivity analysis, replication and extension of previous research, and internal replication. Typically associated with more traditional statistical methods such as regression analysis, the use of a technique such as cross-validation is a useful technique in HLM analyses that provides further evidence of validity and model soundness. The primary purpose of crossvalidation is to provide a check of model integrity and generalizability. 
This model ‘check’ is accomplished through using one set of data, sometimes referred to as the screening (or training) set and one set of data that may be called the calibration (or test) set. The screening set is used to estimate the model and then the calibration set is applied to the model to determine how well the model was able to predict the degree of fitness relative to the screening data set. This process allows the calculation of a magnitude of generalization error based on how well the calibration data set fits the model identified by the training data. Depending on the data structure and sample size, cross-validation may be conducted using various strategies. These include the holdout method (also called the data splitting method), dual-sample method, k-fold cross validation and the leave-one-out cross validation (LOO). The procedure can be further complicated when a researcher might not believe, for either theoretical, conceptual, or data-based reasons, that a single cross-validation process is sufficient and thus engages in double cross validation. Double cross-validation is nothing more than doing cross-validation twice and then

using a combined equation or model. Depending on the statistical analysis being employed (e.g., regression or HLM), this process may vary in complexity and applicability. Another means of addressing generalizability and sensitivity issues is through the conduct of a sensitivity analysis. This type of analysis examines the impact of data anomalies (e.g., extreme data values, distribution irregularities) on model fit and parameter estimates. Bayesian techniques such as Gibbs sampling, as well as other strategies and algorithms, can be used to examine the impact of extreme observations at either level one or level two of the model (Seltzer, Novak, Choi, & Lim, 2002). Other techniques, such as data transformations (e.g., logarithmic, square root), can be effective in addressing issues such as nonnormal distributions. The degree to which model fit and parameter estimates remain constant when data issues such as these are controlled is critical in determining how ‘sensitive’ a given model is to fluctuations or peculiarities in the data. As with virtually all other techniques and methods of data analysis, HLM can be used both to replicate and extend other studies and for internal replication. The replication and extension of other studies can be done in a multitude of ways. HLM can be a complementary method used to examine a sample of a population previously analyzed using other statistical techniques such as regression analysis. Because HLM is more robust in accounting for issues such as lack of independence among observations, it is well suited to replicate previous research with very similar populations, either to help strengthen the inferences made from that research or to identify possible areas of concern that were not identifiable with more traditional analyses.
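Returning to the transformation strategy mentioned above: the effect of a log transformation on a skewed outcome can be illustrated with a small simulation (hypothetical data; `skewness` here is the simple moment-based sample estimate):

```python
import math, random

def skewness(xs):
    """Moment-based sample skewness: mean cubed standardized deviation."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / s) ** 3 for x in xs) / n

rng = random.Random(1)
raw = [math.exp(rng.gauss(0, 1)) for _ in range(5000)]  # lognormal: strongly right-skewed
logged = [math.log(x) for x in raw]                     # log transform recovers normality
```

Here `skewness(raw)` is large and positive, while `skewness(logged)` is near zero, which is the sense in which the transformation addresses the nonnormal distribution.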
Furthermore, replication can be conducted within a given study by analyzing subsets of the data independently and subsequently examining the degree to which the results are similar. Depending on the method(s) used, as well as the data source(s), replication efforts often serve to enhance either the external or internal validity of the findings reported and conclusions reached. These are just a few of the means that researchers might consider using when conducting HLM analyses. These techniques and approaches have the potential to enhance the credibility, validity, and generalizability of a study's findings, depending on the focus, purpose, and resources of the study. Careful and considered selection among these analyses will enhance the integrity of the findings of a research study.

Data Considerations

Distributional assumptions. All inferential statistical tools are based on a set of core assumptions. Provided the assumptions are met, the method will function as planned or intended. These underlying assumptions are often not satisfied, and it is common knowledge

that under some data-analytic conditions certain procedures will not produce the desired results. According to Keselman et al. (1998), “the applied researcher who routinely adopts a traditional procedure without giving thought to its associated assumptions may unwittingly be filling the literatures with nonreplicable results” (p. 351). For a hierarchical linear model, distributional assumptions are made about the errors at each level in the model. The first-level errors, the rij in equation 1, are assumed to be independently and normally distributed with a covariance of Σ. Lack of normality can lead to biases in the standard errors at both levels, and thus introduces questions about the validity of statistical tests and the accuracy of reported confidence intervals. The normality assumption is not realistic for certain types of outcome variables (e.g., binary, multinomial, and ordinal outcomes), and in these cases it is generally recognized that hierarchical generalized linear models are more appropriate. When one has a continuous outcome variable, the normality assumption of hierarchical linear models may be reasonable, but even here the assumption may not hold. Researchers can assess normality by examining the distribution of the level-1 residuals. The distributions can be examined separately for each level-2 unit, or by pooling across the level-2 units. If evidence of nonnormality is found, the researcher may wish to consider transforming the outcome variable. Also implicit in the assumptions about the first-level errors is that the variance of the errors is the same for each level-2 unit. If the variances are not homogeneous but vary randomly, it does not appear that the fixed effects or standard errors are biased (Kasim & Raudenbush, 1998), but if the variances vary as a function of the level-1 or level-2 predictors, there may be more serious consequences (Raudenbush & Bryk, 2002).
A researcher can assess the homogeneity assumption by examining the variance of the level-1 residuals for each level-2 unit. The researcher could then look for units with variances that are notably different from the others, or test whether the differences among the variances are greater than what could be attributed to sampling error. Researchers could also examine the correlations between the variance estimates and the values of the level-2 predictors. Distributional assumptions are also made about the level-2 errors, the u0j and u1j from equations 2 and 3. These errors are assumed to be normally distributed with a covariance of Γ. Checking the normality of these errors is a bit more complicated, since the outcomes of the level-2 model are not directly observed, but a procedure for estimating the skewness and kurtosis of the random effects has been presented (Teuscher, Herrendorfer, & Guiard, 1994).
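The level-1 variance check described above can be sketched with simulated residuals (a toy illustration, not a formal test; unit E is generated with an inflated error variance, and a simple flag marks units whose residual variance far exceeds the median):

```python
import random, statistics

rng = random.Random(2)
# hypothetical level-1 residuals for five level-2 units; unit "E" is
# simulated with a much larger error standard deviation
resid = {g: [rng.gauss(0, 1) for _ in range(50)] for g in "ABCD"}
resid["E"] = [rng.gauss(0, 3) for _ in range(50)]

med = statistics.median(statistics.variance(r) for r in resid.values())
flagged = [g for g, r in resid.items() if statistics.variance(r) > 4 * med]
```

A formal comparison would test whether the spread of these variances exceeds sampling error, but even this crude screen isolates the heterogeneous unit.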

It is well known that model-based statistical inference depends upon scrupulous attention to the assumed model, which necessarily includes the distributional assumptions underlying that model. This is, of course, necessary if a researcher hopes to find a suitable model or models that fit the data well. Although a number of researchers have investigated these issues in the past, according to Ghosh and Rao (1994) the literature on diagnostics for mixed linear models involving random effects is not as extensive as the literature on standard regression diagnostics. Recently, however, Jiang (2001) advanced a technique using goodness-of-fit tests to examine the distributional assumptions of mixed linear models.

Outliers. As with other statistical methods, researchers should screen their data for outliers. These outlying observations may arise from data entry errors (e.g., a 27 that should have been a 72), an inaccurate assessment of a student (e.g., a 0 used to indicate achievement for an absent student), failure to identify a missing data code (e.g., a missing value entered as a 999), failure to screen out participants who fall outside the inclusion parameters for the study (e.g., a score from a student who was not part of the school during the focal time period for the study), or simply from an individual who is different from the others in the sample. As with other analyses, illegitimate outliers (e.g., data entry mistakes) can distort analyses and should be corrected. With legitimate outliers (e.g., a score that is atypical but truly part of the population being considered), the researcher needs to be aware of their presence and influence on the results. When the influence is substantial, ameliorative strategies may need to be considered.
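A minimal univariate screen of the kind used in such checks, flagging values far from the mean in standard-deviation units (the scores are hypothetical, with an unrecoded missing-value code of 999 planted among them):

```python
import statistics

scores = [51, 48, 55, 49, 52, 47, 53, 50, 999]  # 999: unrecoded missing-value code
m = statistics.mean(scores)
s = statistics.stdev(scores)
z = [(x - m) / s for x in scores]
# flag observations more than 2.5 standard deviations from the mean
outliers = [x for x, zi in zip(scores, z) if abs(zi) > 2.5]
```

Note that a single extreme value inflates the standard deviation and can mask other outliers, which is one reason box plots and model-based residual checks are valuable complements to this screen.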
Initially, the researcher may wish to look for univariate outliers by inspecting box plots or by examining the distance from the mean, in standard deviation units, for the smallest and largest observations. Although these univariate checks are helpful, the researcher should also consider examining the residuals at each level of the model (e.g., Raudenbush & Bryk, 2002). As an example, consider a study that examined students nested within classrooms. At the first level, one could look for outlying students: individual scores that are far from expectation given the class’s regression equation. At the second level, one could look for outlying classes, where an outlying class is one that has an atypical regression coefficient. In addition to the examination of residuals, one may wish to consider simulation-based methods (Longford, 2001).

Measurement error for predictors and outcomes. Most measures in educational studies contain error. Consequently, it is likely that the predictors and outcome variables used in educational applications of hierarchical modeling will contain measurement error. These errors, if not accounted for, can bias estimates of variance parameters, variance ratios (like the intraclass correlation), fixed effects, and the standard errors of fixed effects (Woodhouse, Yang, Goldstein, &

Rashbash, 1996). Consequently, it is important for educational researchers to consider the reliability of the data used in their applications of hierarchical linear models. In situations where measurement error is anticipated, there are methods for specifying and adjusting for the measurement error (Longford, 1993; Woodhouse et al., 1996).

Power. Considerable work has investigated the power of statistical tests of treatment effects in multilevel data. Sample size formulas have been provided for obtaining given levels of power in experiments where the 2nd-level units have been randomized (Donner, Birkett, & Buck, 1981; Hsieh, 1988). Power calculations are also available through web-based calculators and specialized software. Optimal allocation of units among levels (e.g., fewer large groups versus more small groups) has been considered (Raudenbush, 1997; Snijders & Bosker, 1993). The level of randomization has also been found to impact power, such that randomization of the 2nd-level units leads to less power than randomization of the 1st-level units (Donner, Birkett, & Buck, 1981; Hsieh, 1988).

Missing Data. It is not uncommon for missing data to occur on one or more variables within an empirical investigation. Missing data may adversely affect data analyses, interpretations, and conclusions. Collins, Schafer, and Kam (2001) indicate that missing data may bias parameter estimates, inflate Type I and Type II error rates, and degrade the coverage of confidence intervals. Further, because a loss of data is almost always associated with a loss of information, concerns arise with regard to reductions in statistical power. Unfortunately, researchers’ recommendations for managing missing data are not in complete agreement (Guertin, 1968; Beale & Little, 1975; Gleason & Staelin, 1975; Frane, 1976; Kim & Curry, 1977; Santos, 1981; Basilevsky, Sabourin, Hum, & Anderson, 1985; Raymond & Roberts, 1987).
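Before leaving the topic of power: the cost of randomizing at the 2nd level can be illustrated with the standard design-effect approximation, 1 + (m − 1)ρ, where m is the cluster size and ρ the intraclass correlation. This is a textbook rule of thumb, not a formula drawn from the studies cited above, and the numbers below are hypothetical:

```python
def design_effect(m, icc):
    """Variance inflation from clustering: 1 + (m - 1) * icc."""
    return 1 + (m - 1) * icc

def effective_n(k, m, icc):
    """Effective number of independent observations for k clusters of size m."""
    return k * m / design_effect(m, icc)

# 40 classrooms of 25 students with ICC = .10 carry far less information
# than 1000 independent students: 1000 / 3.4, roughly 294 effective cases
n_eff = effective_n(40, 25, 0.10)
```

This is why adding clusters typically buys more power than adding members within clusters.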
Many studies that have examined missing data treatments are not comparable due to the various methods used, the stratification categories (number of variables, sample size, proportion of missing data, and degree of multicollinearity), and the criteria used to measure effectiveness (Anderson, Basilevsky, & Hum, 1983). Contemporary discussion of missing data and their treatment can often be confusing and at times may appear somewhat counterintuitive. For example, the term ignorable, introduced by Little and Rubin (1987), was not intended to convey that a particular aspect of missing data could be ignored, but rather to specify the circumstances under which the missing data mechanism is ignorable. Additionally, when one speaks of data missing at random, these words should not convey the notion that the missingness derives from a random process external or unrelated to other variables under study (Collins et al., 2001). According to Heitjan and Rubin (1991), missing data can take many forms, and missing values are part of a more general concept of coarsened data. This general category of missing

values results when data are grouped, aggregated, rounded, censored, or truncated, resulting in a partial loss of information. The major classifications of missing data mechanisms can best be explained by the relationships among the variables under investigation. Rubin (1987) identified three general processes that can produce missing data. First, data that are missing purely due to chance are considered to be missing completely at random (MCAR). Specifically, data are missing completely at random if the probability of a missing response is completely independent of all other measured or unmeasured characteristics under examination. Accordingly, analyses of data of this nature will result in unbiased estimates of the population parameters under investigation. Second, data that are classified as missing at random (MAR) do not depend on the missing value itself, but may depend on other variables that are measured for all participants under study. Lastly, and most problematic statistically, are data missing not at random (MNAR). This type of missingness, also referred to as nonignorable missing data, is directly related to the value that would have been observed for a particular variable. A commonly encountered situation in which data would be classified as MNAR arises when respondents in a certain income or age stratum fail to provide responses to questions of this nature. Given the nature of the data typically analyzed using hierarchical linear modeling, it is not surprising that the issue of missing data becomes pertinent to inquiry of this nature (Roy & Lin, 2002). Missing data may occur at the different levels of a model, or the loss of multiple data points across time may be unavoidable due to attrition or mortality. It is also not uncommon to face a combination of these challenges when examining longitudinal outcomes.
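The consequences of the MCAR and MNAR mechanisms can be made concrete with a small simulation (hypothetical income data; under MCAR the observed mean stays unbiased, while MNAR deletion of high values biases it downward):

```python
import random, statistics

rng = random.Random(3)
income = [rng.gauss(50, 10) for _ in range(20000)]  # hypothetical incomes (thousands)

# MCAR: each value has the same 30% chance of being missing
mcar = [x for x in income if rng.random() > 0.3]

# MNAR: high earners (x > 60) withhold their response 80% of the time,
# so missingness depends on the value that would have been observed
mnar = [x for x in income if not (x > 60 and rng.random() < 0.8)]

full_mean = statistics.mean(income)
```

The mean of `mcar` tracks `full_mean` closely, while the mean of `mnar` falls well below it, illustrating why nonignorable missingness is the most problematic case.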
The careful researcher must be concerned not only with nonignorable nonresponses but with missing covariates as well (Roy & Lin, 2002).

Estimation

There is no single agreed-upon way to estimate the parameters in a hierarchical linear model. Several methods of estimation can be employed, including maximum likelihood (ML), restricted maximum likelihood (REML), and Bayesian estimation (Raudenbush & Bryk, 2002; Kreft & De Leeuw, 1998). These methods of estimation can be carried out using many different algorithms. For example, ML estimation may be accomplished using the EM algorithm, the Newton-Raphson algorithm, the Fisher scoring algorithm, or iterative generalized least squares (IGLS), while Bayesian estimation may be accomplished using the Gibbs sampler. In addition, these algorithms have been programmed into many different software packages. Thus one researcher may accomplish REML estimation using the EM algorithm programmed into HLM, while another may accomplish REML estimation using restricted iterative generalized least squares (RIGLS)

using MLn, while a third may accomplish REML using the Newton-Raphson algorithm programmed in Proc MIXED within SAS.

Maximum likelihood estimation. The principle behind ML estimation is to select parameter estimates that maximize the likelihood of the data. We consider how likely it is that we would have obtained the data for each of many different values of the fixed effects (γs) and variance parameters (elements in Σ and Γ), and then pick the values for which the likelihood is greatest. This involves an iterative algorithm that steps through possible values until the likelihood reaches its maximum. When the maximum is reached, the algorithm is said to have converged. The goal for computational statisticians is to develop an algorithm that converges fairly quickly across a wide range of applications. If the algorithm meanders through the possibilities too slowly, it may not converge in the time allocated, and if the algorithm moves too quickly, it may miss the maximum and fail to converge. Since the desirable properties of maximum likelihood estimators are not realized when convergence fails, the objective for applied researchers is to select an algorithm that will converge given their data and time constraints. Maximum likelihood estimation is currently available through a variety of algorithms and software packages. It can be accomplished using the EM algorithm (Dempster, Laird, & Rubin, 1977), which is implemented in the software package HLM (Raudenbush, Bryk, Cheong, & Congdon, 2000); the Newton-Raphson algorithm (Lindstrom & Bates, 1988), which is implemented in Proc MIXED (SAS, 2000); the Fisher scoring algorithm (Longford, 1987), which is implemented in VARCL; or iterative generalized least squares (IGLS; Goldstein, 1986), which is implemented in MLn.
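The idea of stepping through candidate values until the likelihood peaks can be illustrated with a toy one-parameter problem: the ML estimate of the variance of a mean-zero normal sample, found by a crude grid search and checked against the known closed-form solution. (This is a pedagogical sketch; real HLM algorithms such as EM or Newton-Raphson search a multiparameter surface far more efficiently.)

```python
import math, random

rng = random.Random(5)
data = [rng.gauss(0, 2) for _ in range(1000)]  # hypothetical sample, true variance 4
ss = sum(x * x for x in data)

def loglik(var):
    """Log-likelihood of the sample under N(0, var)."""
    n = len(data)
    return -0.5 * n * math.log(2 * math.pi * var) - ss / (2 * var)

# crude iterative search: step through candidate variances, keep the best
candidates = [0.5 + 0.01 * i for i in range(1000)]
ml_var = max(candidates, key=loglik)

closed_form = ss / len(data)  # the analytic ML solution for this toy problem
```

The grid maximizer lands within one grid step of the closed-form answer, which is all "convergence" means here.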
The EM algorithm has the advantage that it will always converge if given enough time, but the disadvantage that it may take a relatively long time to converge (Draper, 1995). If convergence is achieved and the estimated variance/covariance matrices are positive definite (i.e., the variances are positive and the absolute values of the implied correlations do not exceed 1.0), then the estimators have some desirable properties. The fixed effects (γs) are unbiased (Kacker & Harville, 1981, 1984), and the estimates of the variance parameters (elements in Σ and Γ) are asymptotically unbiased; that is, the bias disappears as the sample size gets large (Raudenbush & Bryk, 2002). The estimates of the fixed effects and variance parameters also tend to be asymptotically efficient, which implies that when the sample size is large, the maximum likelihood estimates show minimum variance from sample to sample (Raudenbush & Bryk, 2002). Finally, as the sample size increases, the sampling distributions of the estimates become approximately normal, which facilitates the construction of confidence intervals and statistical tests

(Raudenbush & Bryk, 2002). Note that these properties hold for relatively large sample sizes, where what is considered large is heavily influenced by the number of upper-level units. For example, if one studies students who are nested in classes, then many classes must be sampled if one wishes to obtain these desirable properties.

Restricted maximum likelihood estimation (REML). In REML, maximum likelihood estimates are obtained for the variance parameters (elements in Σ and Γ). These values are then used in obtaining generalized least squares estimates of the fixed effects (γs). The REML estimates of the variance parameters may be considered preferable to ML estimates because REML takes into account uncertainty in the fixed effects (γs) when the variance parameters are estimated. Since the uncertainty in the fixed effects is more pronounced with smaller sample sizes, one may suspect that the difference between these methods would tend to be greater when sample sizes are smaller. A few empirical studies have found differences between ML and REML estimates under a variety of conditions (Kreft & de Leeuw, 1998), but these studies do not lead to a uniform recommendation of one method over the other. As with ML estimates, REML estimates can be obtained from a variety of software packages (e.g., HLM, SAS Proc MIXED, MLn, VARCL) and through a variety of algorithms (e.g., EM, Newton-Raphson, Fisher scoring, and RIGLS), and have been shown to have desirable properties under many conditions. Again under general conditions, the fixed effects (γs) are unbiased (Kacker & Harville, 1981, 1984), the estimates of the variance parameters (elements in Σ and Γ) are asymptotically unbiased (Raudenbush & Bryk, 2002), the estimates of the fixed effects and variance parameters are asymptotically efficient (Raudenbush & Bryk, 2002), and as the sample size increases, the sampling distributions of the estimates become approximately normal (Raudenbush & Bryk, 2002).
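The simplest analogue of the ML/REML distinction is variance estimation for a single sample: ML divides the sum of squares by n, ignoring that the mean was estimated, while the REML-style correction divides by n − 1. A small simulation (hypothetical data; this analogy is illustrative, not the full mixed-model algebra) shows the small-sample bias that REML removes:

```python
import random

rng = random.Random(4)
true_var = 4.0
n, reps = 5, 20000
ml_est, reml_est = [], []
for _ in range(reps):
    x = [rng.gauss(0, true_var ** 0.5) for _ in range(n)]
    m = sum(x) / n
    ss = sum((xi - m) ** 2 for xi in x)
    ml_est.append(ss / n)          # ML: ignores uncertainty in the estimated mean
    reml_est.append(ss / (n - 1))  # REML analogue: adjusts for it

ml_mean = sum(ml_est) / reps      # averages near 4 * (n-1)/n = 3.2 (biased low)
reml_mean = sum(reml_est) / reps  # averages near the true value, 4.0
```

With n = 5 the ML estimator is noticeably biased downward; as n grows the two estimators converge, matching the large-sample equivalence noted above.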
Consequently, both ML and REML are often recommended under large-sample conditions. When sample sizes are smaller, and particularly when the data are unbalanced, the functioning of both ML and REML becomes questionable, which may lead researchers to consider alternatives. In addition, inferences about the fixed effects (e.g., confidence intervals for the γs) assume that the variance estimates contain no error. This assumption also becomes increasingly questionable when sample sizes are not large.

Bayesian estimation. With Bayesian estimation (Lindley & Smith, 1972) one can acknowledge the uncertainty in the estimates of the variance parameters when the fixed effects are estimated. Consequently, Bayesian estimation provides an appealing option for researchers working with smaller data sets. This form of estimation can be accomplished using Markov chain Monte Carlo algorithms like the Gibbs sampler, which is implemented in the software

BUGS. Although Bayesian estimation is appealing in some circumstances, it also has some drawbacks. Prior distributions must be specified, but this specification may conflict with some researchers’ desire not to let prior beliefs influence the results of their analyses (Raudenbush & Bryk, 2002). In addition, the algorithms are not as readily available, as they have been implemented in only a couple of software packages, and they are very computationally intensive, making them impractical for large data sets.

Alternative Estimation Methods. Since none of the estimation methods is entirely satisfactory across all data conditions that may be encountered in research, statisticians continue to work on the development of alternatives. Bootstrapping has been presented as one option to deal with the bias in the variance estimates and standard errors that results from using ML or REML estimation with samples that are not large and normal. Bootstrapping is available in MLwiN, and both parametric (Meijer, van der Leeden, & Busing, 1995) and nonparametric (Carpenter, Goldstein, & Rashbash, 1999) versions have been discussed. Another alternative stems from the motivation to restrict the influence of outlying observations. Robust ML and robust REML estimation methods have been proposed and show promise (Richardson & Welsh, 1995), but as far as we know they have not been programmed into readily available hierarchical modeling software.

Hypothesis Testing and Statistical Inference

The estimation method will produce point estimates of each parameter in the hierarchical model. These point estimates are often valuable in addressing particular research questions, but additional information is often provided to aid the researcher in making inferences. This additional information may take the form of confidence intervals for parameters of interest or hypothesis tests of these parameters.
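One widely used such test for variance parameters is the likelihood ratio chi-square, which compares the maximized log-likelihoods of two nested fits. A sketch in Python (the log-likelihood values are hypothetical; the chi-square(1) upper tail is computed with the standard library's `math.erfc`):

```python
import math

def lr_test_df1(loglik_full, loglik_reduced):
    """Likelihood ratio chi-square test (1 df) comparing nested models."""
    stat = 2 * (loglik_full - loglik_reduced)
    p = math.erfc(math.sqrt(stat / 2))  # chi-square(1) upper-tail probability
    # caveat: when a variance is tested at its boundary of zero, this naive
    # p-value is conservative; a 50:50 chi-square mixture is often used instead
    return stat, p

# hypothetical log-likelihoods from fits with and without a random slope
stat, p = lr_test_df1(-1234.5, -1239.8)
```

With these hypothetical values the statistic is 10.6 and the p-value falls well below .01, so the extra variance parameter would be retained.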
When considering the options available, it becomes convenient to distinguish among inferences made about variance parameters (elements in Σ and Γ), inferences made about fixed effects (γs), and inferences made about the random level-1 coefficients (e.g., β0j).

Inferences about variance parameters. A researcher may be interested in creating a confidence interval (CI) for a variance parameter. The simplest approach would be to make use of the standard error of the variance parameter estimate, which is computed from the inverse of the information matrix. By adding and subtracting 1.96 times the standard error of the parameter estimate, one can create a 95% CI, assuming a normal sampling distribution. This approach, however, has limitations, especially when the sample size is small or the variance parameter is near zero (e.g., Littell, Milliken, Stroup, & Wolfinger, 1996; Raudenbush & Bryk, 2002). Under these conditions the variance parameter will tend to have a skewed sampling distribution,

making symmetric intervals based on the standard error unrealistic. Under these conditions researchers should turn to other options, including the Satterthwaite approach (Littell, Milliken, Stroup, & Wolfinger, 1996), bootstrapping (Meijer, van der Leeden, & Busing, 1995; Carpenter, Goldstein, & Rashbash, 1999), a method based on local asymptotic approximations (Stern & Welsh, 2000), and, if the data are balanced, the approach proposed by Yu and Burdick (1995). For researchers wishing to test hypotheses regarding variance components, again a variety of choices are available. The simplest would be to conduct a z-test by dividing the estimate by its standard error. Although this approach is asymptotically valid, it, like the standard-error-based CIs noted previously, becomes questionable when the sampling distribution cannot be assumed normal. A somewhat more appealing option is to use a likelihood ratio χ2 test (e.g., Littell, Milliken, Stroup, & Wolfinger, 1996). This test requires the user to estimate two models, one with and one without the questionable variance parameter(s). The difference in the log likelihoods obtained in these analyses is then used to construct a statistic that in large samples follows a χ2 distribution. Note that this method can be used for single-parameter or multiple-parameter tests. Additional alternatives include an approximate χ2 test described by Raudenbush and Bryk (2002), bootstrapping (Meijer, van der Leeden, & Busing, 1995; Carpenter, Goldstein, & Rashbash, 1999), a likelihood ratio test based on a local asymptotic approximation (Stern & Welsh, 2000), and exact tests that have been established for some contexts (Christensen, 1996; Ofversten, 1993). Finally, it should be noted that in addition to point estimates, confidence intervals, and statistical tests, researchers should consider whether combining variance estimates and/or forming variance ratios could help to answer the research questions.
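One such ratio, the intraclass correlation for a random-intercept model, is simply the between-unit variance over the total variance (the variance components below are hypothetical):

```python
def intraclass_correlation(tau00, sigma2):
    """Proportion of total variance lying between level-2 units:
    tau00 / (tau00 + sigma2) for a two-level random-intercept model."""
    return tau00 / (tau00 + sigma2)

# hypothetical estimates: between-class variance .25, within-class variance .75
icc = intraclass_correlation(0.25, 0.75)  # .25: a quarter of the variance is between classes
```

Confidence intervals for such ratios are less straightforward than the point estimate, as the surrounding discussion notes.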
For example, one may be interested in the explained variance (R2) at one or more levels of the model (Snijders & Bosker, 1994), the intraclass correlation (e.g., Raudenbush & Bryk, 2002), or the reliability of estimators (Raudenbush & Bryk, 2002). Those interested in creating confidence intervals for variance ratios are referred to the statistical literature (Lee & Seeley, 1996). As far as we know, the methods described there have not been implemented in hierarchical linear modeling software programs.

Inferences about fixed effects. A researcher interested in making inferences about fixed effects may wish to construct confidence intervals for the effects of interest. A 95% CI could be constructed around the point estimate by adding and subtracting 1.96 times the standard error. This of course assumes a normal sampling distribution, which can be demonstrated asymptotically, but which becomes questionable for smaller samples. Consequently, one would

typically substitute a t-value with ν degrees of freedom for the 1.96. Several methods for defining the degrees of freedom have been given (Giesbrecht & Burns, 1985; Kenward & Roger, 1997), and some software packages (e.g., Proc MIXED) allow different definitions to be specified. An alternative to assuming an approximate t-distribution is to turn to bootstrapping to construct the confidence intervals. Hypothesis tests can also be conducted by using t- or F-tests with the approximate degrees of freedom. Again, different definitions have been suggested, and thus researchers need to be clear about the method used for obtaining the degrees of freedom for these tests. Several alternatives to these approximate tests have been discussed. These include a test based on a Bartlett-corrected likelihood ratio statistic (Zucker, Lieberman, & Manor, 2000), a permutation test (Reboussin & DeMets, 1996), and bootstrapping.

Inferences about random level-1 coefficients. Researchers may also be interested in estimating the random level-1 coefficients and making inferences about these coefficients. For example, a researcher who is interested in estimating the effects of prior reading achievement on end-of-year reading achievement may wish to obtain a separate effect estimate for each classroom. One approach would be to estimate the level-1 model separately for each classroom using standard ordinary least squares (OLS) estimation methods, in which case standard methods are available for constructing confidence intervals and testing hypotheses about the coefficients. The drawback of the OLS approach is that each estimate is based on relatively few observations, only those from the classroom of interest, thus leaving a lot of room for sampling error. An alternative is to obtain empirical Bayes estimates, which consider all the available information.
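For the simplest random-intercept case, the empirical Bayes estimate is a weighted combination of a group's own OLS estimate and the grand mean, with weight λj = τ/(τ + σ²/nj). A sketch with hypothetical variance components (the function and numbers are illustrative, not drawn from any of the reviewed studies):

```python
def empirical_bayes(ols_est, n_j, tau, sigma2, grand_mean):
    """Shrink a group's OLS estimate toward the overall mean; the
    reliability lam grows with group size and between-group variance."""
    lam = tau / (tau + sigma2 / n_j)
    return lam * ols_est + (1 - lam) * grand_mean

# a small class (n=5) is pulled strongly toward the grand mean of 6,
# while a large class (n=100) keeps most of its own estimate of 10
small = empirical_bayes(10.0, 5, tau=1.0, sigma2=4.0, grand_mean=6.0)
large = empirical_bayes(10.0, 100, tau=1.0, sigma2=4.0, grand_mean=6.0)
```

The shrinkage is strongest exactly where the OLS estimate is least reliable, which is the trade of bias for reduced mean square error described below.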
Empirical Bayes estimates tend to pull the effect estimates toward the overall average estimate by an amount that depends on the uncertainty in the effect estimate being considered and the variability in the effect estimates. This process biases the estimates, but leaves us with values that tend to be closer to the parameter values (i.e., a smaller expected mean square error) than those based on OLS estimation (Raudenbush & Bryk, 2002). For empirical Bayes estimates the standard errors can be computed and used to create confidence intervals or z-tests of statistical significance. These methods assume a normal sampling distribution, and thus may be unrealistic unless there is a large number of level-2 units (Raudenbush & Bryk, 2002).

METHOD

Coding Protocol

To analyze the articles representing multilevel applications, we developed a coding framework based in large part on the issues identified during the first phase of our review of the

literature. Within each area (i.e., model development and specification, data considerations, estimation, and hypothesis testing and inference), specific questions were devised to guide our review. The current issues and critical questions were organized into a checklist that was refined through a series of pilot tests. In these pilot tests, members of the research team independently analyzed the same application article using the checklist; members then came together as a group to check the consistency of the responses and to discuss coding decisions and possible alterations of the checklist. A codebook, which facilitated coding efforts, was developed during these meetings to capture the coding process in more detail. The final version of the checklist, which was used to code each of the articles, is provided as an appendix.

Searching Strategies for Applications

To describe the current application and reporting of multilevel analyses in the field of education, prominent educational and behavioral research journals were initially selected for examination. We examined the same set of journals used in the methodological research review published in Review of Educational Research by Keselman et al. (1998). It was deemed appropriate to begin with this set of journals, as these journals publish empirical research, represent different subdisciplines in education, and are highly regarded in the fields of education and psychology. Additionally, we relied on the expertise of our research team to identify other well-known publications that might provide similar applications of multilevel modeling. For this phase of our review, all issues of each volume of the chosen journals published between 1998 and 2002 were hand searched for evidence of the employment of hierarchical linear modeling techniques. That is, our research team did not rely solely on article titles and abstracts to make our determination to include or exclude a particular study.
Description of the Sample

Of the identified articles, 20 have been reviewed at this time. The largest proportion (40%) came from the most recent year considered for this study, 2002 (see Figure 1), and only one study (5%) was from the earliest time point considered, 1998. The remainder of the sample (55%) was distributed almost equally over the middle three years, 1999-2001. The sample was drawn from 10 peer-reviewed journals that are fairly prominent in the social sciences (see Table 1). Studies from four of the journals (the American Educational Research Journal, the Journal of Educational Research, the Journal of Personality and Social Psychology, and the Journal of Applied Psychology) accounted for the majority of the sample, 60%, with each supplying three studies (15%) that were used in this analysis. Four of the remaining six journals were each the source of one article, while each of the remaining two journals contributed two articles.


Figure 1. Distribution of Sample Studies Based on Year Published

Table 1
Journals and Years of Sample Studies

JOURNAL                                        2002  2001  2000  1999  1998  Total
American Educational Research Journal             0     1     2     0     0      3
Child Development                                 0     0     0     1     0      1
Journal of Educational Psychology                 1     1     0     0     0      2
Sociology of Education                            0     0     1     1     0      2
Journal of Applied Psychology                     2     1     0     0     0      3
Journal of Educational Research                   3     0     0     0     0      3
Journal of Personality and Social Psychology      0     0     1     1     1      3
Reading Research Quarterly                        1     0     0     0     0      1
Cognition and Instruction                         0     0     0     1     0      1
Developmental Psychology                          1     0     0     0     0      1
Total                                             8     3     4     4     1     20

RESULTS

Study Characteristics

Before turning to the four central issues that were identified as important in the analysis and presentation of multilevel models, we took care to examine a host of characteristics germane to the set of articles examined. For this investigation, we thought it prudent to articulate the types of studies being examined (e.g., individuals nested in contexts versus repeated measures), the rationale provided by authors for employing HLM methods, the study design, the sampling, the average number of units at varying levels of the model, and the description provided regarding the distribution of level-1 units across level-2 units.

The studies were typically nonexperimental (85%) and often did not use probability sampling (65%). They covered a wide range of applications. Half the studies used two-level models in which individuals were nested in contexts. Two studies (10%) involved three-level models in which individuals were nested in contexts that were in turn nested in higher-level contexts, while the remaining eight studies (40%) involved repeated measures data. Almost all of the studies (90%) explicitly stated a rationale for using hierarchical modeling, but the level of detail in the rationales varied greatly. The studies also differed widely in the amount of data used in the analysis: the number of level-2 units ranged from 19 to 1406, and the average number of level-1 units per level-2 unit ranged from a low of just over 2 to a high of 160.

Model Development & Specification

Model development is a central component of inquiry in many disciplines, with notably different approaches evidenced among researchers both within and across disciplines. In our examination of this critical component, we considered a host of aspects related to divergent approaches to the development and specification of multilevel statistical models. A considerable amount of variability was evidenced in the number of models examined by researchers and in how clearly the number of models was communicated. For example, in only 45% of the articles reviewed were we able to determine with confidence the number of models analyzed. For this subset of articles (n = 9), the number of models examined ranged from 4 to 430, with the median number equal to 9 (M = 51, SD = 126).
For the set of published articles that we scrutinized, baseline models (i.e., unconditional models) were frequently investigated as part of data analysis (n = 9, 45%). For 11 of the articles, we could not determine with confidence whether baseline models were examined (see Table 2). It was also common to encounter studies that examined more than one set of predictor variables for each of the dependent variables under investigation (n = 15, 75%). For these studies, researchers employed between two and six sets of predictors. In all of the studies that we examined, the predictors were selected based at least partially on a priori considerations. In most of these cases, strong support was provided by the literature base and empirical research. In six cases there was evidence that predictor variables were selected, in part, on the basis of significance tests for the individual predictors. With respect to the subset of researchers who explored multiple sets of predictors, the exact number of sets could not accurately be determined for approximately 35% of the studies. Further, four of the studies reported level-2 statistical interactions, while nine reported across-level interactions.

During our examination of how researchers typically specify the covariance structure underlying the data, we observed that for approximately two-thirds of the studies there was no clear discussion of this issue. In these instances, it appeared that software defaults were used in the analyses. Although centering has important implications for interpreting the results of statistical modeling, 40% (n = 8) of the studies provided no discussion of centering at level 1 and 60% (n = 12) provided no discussion of centering at level 2. When centering was used at level 1, researchers used grand mean centering (30%), group mean centering (15%), or other approaches (25%). Other types of centering for level-1 variables included centering on the last time point, coding from a given point in time, centering from the time of loss, and some form of standardization. Grand mean centering was reported for the eight studies that reported the use of centering at level 2.
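The distinction between grand mean and group mean centering can be made concrete with a small sketch. The data below are hypothetical (students nested in schools), chosen only to show the two transformations:

```python
import pandas as pd

# Hypothetical two-level data: students (level 1) nested in schools (level 2).
df = pd.DataFrame({
    "school": ["A", "A", "A", "B", "B", "B"],
    "ses":    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Grand mean centering: subtract the overall mean from every observation.
df["ses_grand"] = df["ses"] - df["ses"].mean()

# Group mean centering: subtract each school's own mean from its observations,
# so the centered predictor carries only within-school variation.
df["ses_group"] = df["ses"] - df.groupby("school")["ses"].transform("mean")

print(df)
```

Under grand mean centering the between-school differences in `ses` remain in the predictor; under group mean centering they are removed, which is why the two choices yield regression coefficients with different interpretations.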

Table 2
Model Development and Specification

Characteristic                                                        N    Percent (%)
Examination of baseline models
  Yes                                                                10        50
  No                                                                  3        15
  Unable to determine                                                 7        35
Selection of predictors: based at least partially on
  A priori considerations                                            20       100
  Significance tests for individual predictors                        6        30
  Effect sizes for individual predictors                              1         5
  Fit statistics (e.g., AIC or SBC)                                   0         0
More than one set of predictor variables for each DV
  Yes, but exact number could not be determined with confidence       7        35
  Yes, number of sets of predictors could be determined               8        40
  Could not be determined with confidence                             1         5
  No                                                                  4        20
Interactions examined
  Level 1                                                             1         5
  Level 2                                                             4        20
  Across level                                                        9        45
  No interactions                                                     6        30
Selection of covariance structure
  Not discussed or unclear, and/or appears that defaults were used   13        65
  Established a priori, prior to looking at the data                  4        20
  Based partially on LRTs or significance tests for individual
    variance components                                               7        35
  Based partially on fit statistics (e.g., AIC or SBC)                0         0
Centering
  Level 1
    No discussion of centering                                        8        40
    Grand mean                                                        6        30
    Group mean                                                        3        15
    Other                                                             5        25
  Level 2
    No discussion of centering                                       12        60
    Grand mean                                                        8        40

Note: Counts may exceed 100% if multiple methods were applied (i.e., selection of predictors, centering, selection of variance structure).

When we critiqued the extent to which models were well communicated, we observed that 35% of the studies did not explicitly communicate the nature and number of the models, although we were able to glean this information through close scrutiny of the text, tables, and footnotes (see Table 3). Of the remaining studies, only 10% (n = 2) provided explicit statements of the number of models examined, while for the other 55% we could not determine this information with any degree of confidence. Given the complexity and number of models run, researchers tended to use multiple approaches to reporting the results. The most prominent method of communicating fixed effects was verbal description (n = 20), followed closely by lists of estimated effects (n = 19) and series of regression equations (70%). The most common methods of communicating the estimated variance structure were lists of parameters (55%) and verbal descriptions (75%). Eight of the studies examined communicated the variance parameters through the use of equations. None of the articles included matrix representations of these relationships.

As researchers we are keenly interested in the extent to which our results are generalizable. To examine the critical issue of generalizability, we considered a broad range of evidence. For example, we looked for both sensitivity analyses and traditional cross-validation methods as evidence of generalizability. Further, we also considered both the replication or extension of previous research and internal replications (e.g., between-group differences).
None of the studies addressed the possibility of capitalizing on chance in model development by employing cross-validation analyses. However, six studies provided evidence of internal replication and three studies provided evidence of sensitivity analysis, while a single study reported a replication/extension of previous research.
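The baseline (unconditional) models discussed above are typically fit to estimate the intraclass correlation, the share of outcome variance that lies between level-2 units. A minimal sketch, using hypothetical variance estimates rather than values from any reviewed study:

```python
# Unconditional (baseline) model: Y_ij = gamma_00 + u_0j + r_ij.
# Its two variance components determine the intraclass correlation (ICC).
# The values below are hypothetical, for illustration only.

tau_00 = 0.30   # between-group (level-2 intercept) variance, Var(u_0j)
sigma2 = 0.90   # within-group (level-1 residual) variance, Var(r_ij)

icc = tau_00 / (tau_00 + sigma2)
print(f"ICC = {icc:.2f}")  # here 0.25: 25% of the variance is between groups
```

A non-negligible ICC is one common rationale authors give for moving from ordinary regression to a hierarchical model.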

Table 3
Model Communication

Characteristic                                                        N    Percent (%)
Communication of models presented
  Not explicitly stated, but could be determined from the
    information provided in the text, footnotes, etc.                 7        35
  Explicit statement of the number of models examined                 2        10
  Could not be determined with confidence                            11        55
Communication of fixed effects
  Equation representation                                            14        70
  List of estimated effects                                          19        95
  Verbal description                                                 20       100
Communication of variance structure
  Equation representation                                             8        40
  List of estimated effects                                          11        55
  Verbal description                                                 15        75
Generalizability
  Sensitivity analysis                                                3        15
  Internal replication                                                6        30
  Replication                                                         1         5

Data Considerations

Because inferences in multilevel models are based on an analysis of the covariances between and within the nested units, the consideration of distributional assumptions, outliers, statistical power, and missing data is critical to obtaining credible results. The results of the analysis of the treatment of such data considerations in the 20 articles reviewed are presented in Table 4. Despite recent advances in statistical power analysis for multilevel models, none of the studies examined included an explicit discussion of statistical power in the study design or the interpretation of results. Similarly, only three of the articles (15%) provided evidence of outlier screening, and only one article described a consideration of the potential impact of measurement error on the resulting models. Conversely, 90% of the studies (n = 18) provided some discussion of missing data, and six of the 17 studies that acknowledged missing data (35%) included a consideration of the randomness of the missingness. However, details on the treatment of missing data in the analysis were less prevalent. Ten of the studies used listwise deletion for missing data at level 1 and two studies used a simple imputation procedure. For missing data at level 2, eight of the studies used listwise deletion, two studies used imputation, and two studies used other procedures (i.e., selecting a proxy variable with less missing data and incorporating a missingness indicator vector in the analysis). Even when the nature of the missingness was discussed, the articles generally provided little insight into the overall impact of the missing data treatment on the resulting estimates.

Multilevel models require assumptions about the errors at each level of the analysis, and a consideration of the tenability of these assumptions is important in assessing the credibility of the results. Of the 20 studies examined, only four (20%) discussed normality of the level-1 residuals and three (15%) discussed variance homogeneity of these residuals. Only two studies provided details about the results of checking the residuals along with any corrective action taken. Finally, four studies (20%) mentioned the assumption of residual normality at level 2, but only one of these provided details on the extent to which this assumption was met.
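The two most common level-1 missing data treatments found in the review, listwise deletion and simple imputation, can be contrasted in a short sketch. The data frame below is hypothetical and illustrates only the mechanics, not any reviewed study's procedure:

```python
import numpy as np
import pandas as pd

# Hypothetical level-1 data with one missing predictor value.
df = pd.DataFrame({
    "school": ["A", "A", "B", "B"],
    "score":  [10.0, 12.0, 9.0, 11.0],
    "ses":    [1.0, np.nan, 2.0, 3.0],
})

# Listwise deletion: drop any level-1 record with a missing value
# (the most common treatment in the reviewed studies).
listwise = df.dropna()

# Simple (single) mean imputation: replace the missing value with the
# observed mean of the variable. This ignores the nesting and understates
# uncertainty, which is one reason its impact deserves discussion.
imputed = df.fillna({"ses": df["ses"].mean()})

print(len(listwise), imputed.loc[1, "ses"])  # 3 rows kept; imputed value 2.0
```

Neither approach is innocuous: listwise deletion changes the sample (and can unbalance the level-1 counts per level-2 unit), while single imputation distorts the variance structure the model is trying to estimate.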

Table 4
Data Considerations

Characteristic                                        N    Percent (%)
Statistical Power                                     0         0
Discussion of Missing Data                           18        90
Randomness of Missing Data                            6        35 (1)
Treatment of Missing Data at Level 1
  Listwise Deletion                                  10        67 (2)
  Imputation                                          2        13 (2)
Treatment of Missing Data at Level 2
  Listwise Deletion                                   8       100 (3)
  Imputation                                          2        25 (3)
  Other                                               2        25 (3)
Discussion of Outliers                                3        15
Screening for Outliers                                1         5
Treatment of Imperfect Measurement                    1         5
Assumptions
  Level 1 Residual Normality                          4        20
  Level 1 Residual Homogeneity of Variance            3        15
  Level 2 Residual Normality                          4        20

(1) Percent based on 17 papers that acknowledged missing data. (2) Percent based on 15 papers that acknowledged level-1 missing data. (3) Percent based on 8 papers that acknowledged level-2 missing data.
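The level-1 residual checks tallied in Table 4 (normality and homogeneity of variance) are simple to carry out once residuals are in hand. A sketch with simulated residuals (all data hypothetical; the tests shown are standard choices, not the ones the reviewed authors necessarily used):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical level-1 residuals from three level-2 units (e.g., schools).
resid_by_group = [rng.normal(0.0, 1.0, size=30) for _ in range(3)]
all_resid = np.concatenate(resid_by_group)

# Normality of the level-1 residuals (Shapiro-Wilk test).
w_stat, p_norm = stats.shapiro(all_resid)

# Homogeneity of residual variance across level-2 units (Levene's test).
lev_stat, p_homog = stats.levene(*resid_by_group)

print(f"normality p = {p_norm:.3f}, homogeneity p = {p_homog:.3f}")
```

Graphical checks (normal Q-Q plots of residuals, residuals plotted by group) are a useful complement, since formal tests can be insensitive with small numbers of units.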

Estimation and Testing

Each article was examined for details about the analysis performed. In particular, articles were examined for information regarding the software utilized as well as general estimation techniques, including the estimation method used, the algorithm used, whether or not convergence problems were encountered, and whether matrices were positive definite. Additionally, studies were examined for details regarding the methods employed for drawing inferences about variance parameters and fixed effects.

In general, limited information regarding methods of estimation and testing was provided, including the type of software used in the analysis. Only 40% of the studies explicitly stated the type of software used (Table 5). Six of these eight indicated the use of a version of Bryk and Raudenbush's HLM software and two indicated the use of SAS software. Other available software for these types of analyses (e.g., MLwiN, Mplus) was not explicitly noted.

Table 5
Software Identified for Use in HLM Studies

                                                            N    Percent (%)    Details
No information about software used                         12        60
Information about program used, no information
  regarding date/version                                    2        10         SAS Proc Mixed
Information about program used as well as
  date or version used                                      6        30         HLM
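The reviewed studies that named software used HLM or SAS Proc Mixed; purely as an illustration of the kind of model being fit, an analogous two-level random-intercept model can be estimated in Python with statsmodels (all data simulated, all names hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulate two-level data: 40 students in each of 10 schools,
# with a school-level random intercept (illustrative values only).
n_schools, n_per = 10, 40
school = np.repeat(np.arange(n_schools), n_per)
u0 = rng.normal(0.0, 0.5, n_schools)[school]          # level-2 intercept error
ses = rng.normal(0.0, 1.0, n_schools * n_per)          # level-1 predictor
score = 50.0 + 2.0 * ses + u0 + rng.normal(0.0, 1.0, n_schools * n_per)
df = pd.DataFrame({"school": school, "ses": ses, "score": score})

# Random-intercept model; reml=True requests restricted maximum likelihood
# (the default in most HLM software), reml=False gives full ML.
model = smf.mixedlm("score ~ ses", df, groups=df["school"])
result = model.fit(reml=True)
print(result.fe_params)         # fixed effects (intercept and slope)
print(result.cov_re)            # estimated level-2 intercept variance
```

Reporting which estimator (REML vs. full ML) and which program produced such estimates is exactly the detail the reviewed articles tended to omit.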

General information about model estimation methods, algorithms, convergence issues, and whether matrices were positive definite was virtually nonexistent. As Table 6 illustrates, few of the studies examined thus far provided any information on these issues. Because there are multiple ways to estimate hierarchical models, and evidence that these different methods can lead to different results and potentially to estimation problems, it is important that authors provide detail about the estimation process if other researchers are to be able to critically evaluate or replicate the analyses.

Table 6
Model Estimation Considerations

                                              N    Percent (%)
Estimation Method Stated                      3        15
Estimation Algorithm Stated                   0         0
Convergence Addressed                         0         0
Positive Definite Matrices Addressed          0         0

The reporting of variance and covariance estimates of model parameters varied across the studies. Variance estimates tended to be provided more often than estimates of the covariance between intercept and slope errors (Table 7). In 75% of the studies, one could not determine whether this covariance had been estimated. The large percentage of articles containing incomplete information was somewhat surprising. For other types of statistical models (e.g., multiple regression or structural equation modeling), it is expected that a complete listing of the estimated parameters will be given. It seems reasonable to expect the same in hierarchical linear modeling, at least for the models that are presented and interpreted.

Table 7
Frequency of Reporting Variance and Covariance Estimates

                                      Provided    Estimated but    Insufficient         Not Applicable
                                                  Not Provided     Information Given    Since Not Estimated
Error variance of intercepts          10 (50%)       8 (40%)           2 (10%)              0 (0%)
Error variance of regression
  coefficient or slope                 7 (35%)       6 (30%)           3 (15%)              4 (20%)
Covariance between the intercept
  and slope errors                     1 (5%)        0 (0%)           15 (75%)              4 (20%)
First-level error variance             8 (40%)      12 (60%)           0 (0%)               0 (0%)

The degree to which variance information and fixed effect information was reported, as well as how such information was reported, also fluctuated across studies. Table 8 summarizes the type of variance information reported in the studies as well as how that information was reported. Significance tests (n = 12) were the most prevalent method of reporting additional information regarding variances. Only six of the studies reported the method they used for these significance tests, a chi-square analysis, and none of the articles specified the type of chi-square analysis used. Again we observed the tendency not to provide enough detail for thorough critique or replication. None of the studies used confidence intervals to gauge the precision of variance estimates, although four studies (20%) provided such information for fixed effect estimates. For fixed effects, significance tests and point estimates were widely reported (95% and 100% of the time, respectively). However, although information on fixed effects tended to be reported more often than variance estimates, the details of how inferences were made were often not included. Only eight of the studies indicated the type of test used (e.g., t test), and of these none reported the method for determining the degrees of freedom.

Table 8
Additional Information on Variance and Fixed Effects Reported

                                                          N    Percent (%)
Additional variance information provided (a)
  None                                                    7        35
  Standard errors                                         2        10
  Confidence intervals                                    0         0
  Significance tests                                     12        60
  Reliabilities                                           2        10
  Intraclass correlations                                 6        30
  Explained variance                                      5        25
Method used for CIs or significance tests for
variance parameters
  Not applicable                                          8        40
  Not stated                                              6        30
  SE/z-estimate                                           0         0
  Chi-square                                              6        30
  Other                                                   0         0
Fixed effect information provided (a)
  None                                                    1         5
  Standard errors                                        12        60
  Confidence intervals                                    4        20
  Significance tests                                     19        95
  Point estimates                                        20       100
  Other                                                   6        30
Method used for CIs or significance tests for
fixed effects
  Not stated                                             12        60
  Likelihood ratio                                        0         0
  t or F test                                             8        40
  Other                                                   0         0
Level-1 parameter information provided
  None                                                   20       100
  Estimates provided, method not stated                   0         0
  OLS or EB estimates                                     0         0
  Statistical tests for OLS or EB estimates               0         0
  CIs for OLS or EB estimates                             0         0
  Other                                                   0         0

(a) Counts may exceed 100% if multiple methods were applied.
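The chi-square tests for variance parameters tallied above are often implemented as likelihood-ratio comparisons of nested fits. A hedged sketch with hypothetical deviance values (not taken from any reviewed study):

```python
from scipy import stats

# Likelihood-ratio test for a variance component, using hypothetical
# deviances (-2 log-likelihoods) from two nested model fits.
deviance_restricted = 1250.4   # model without the random slope
deviance_full = 1243.9         # model adding the random slope (+2 parameters:
                               # slope variance and intercept-slope covariance)

lr = deviance_restricted - deviance_full
df_diff = 2
p_value = stats.chi2.sf(lr, df_diff)
print(f"LR = {lr:.1f}, p = {p_value:.3f}")  # LR = 6.5, p = 0.039

# Note: because a variance cannot be negative, the naive chi-square
# p-value is conservative at the boundary; a mixture of chi-square
# distributions is often recommended instead.
```

Reporting which variant of the test was used (and on which deviance, ML or REML) is precisely the detail the reviewed articles tended to leave out.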

CONCLUSIONS AND RECOMMENDATIONS

The results presented here should be viewed as preliminary, and although we will offer some conclusions and recommendations, it is important to note that they are offered tentatively at this point. We have reviewed only 20 of the articles published in the selected journals between 1998 and 2002, which is about one-quarter of the hierarchical modeling articles published in these journals. After reviewing the remainder of the articles, we will be able to make more precise statements about analysis and reporting practices. It should also be noted that questions could be raised about the reliability of the coding. Each article was reviewed independently by all members of the research team and then discussed at a team meeting, at which time a master checklist was created. There were many items for which 100% agreement was obtained (e.g., was there a statement of the statistical software used?); other items, however, involved greater levels of inference and sometimes led to disagreements (e.g., how many models were estimated?). These disagreements were resolved through discussion, and a codebook, which facilitated coding efforts, was developed to capture the coding process in more detail. We anticipate estimating the reliability of all coding decisions for a sample of the remaining articles, and then using these estimates to guide the number of coders used to examine the remaining articles. When the reliability has been estimated and more articles have been coded, it will be possible to offer less tentative conclusions and recommendations. Even keeping in mind the preliminary nature of our results, there seem to be some relatively clear problems in the reporting of HLM analyses.
There is often not enough information for a reader to technically critique the reported analyses, even when the writers have done an admirable job of discussing the critical methodological issues of sampling, research procedures, and measurement. With this in mind, we suggest the following reporting guidelines for hierarchical modeling.

1. Provide a clear description of the process used to arrive at the model(s) presented. This should include discussion of how the predictors were selected, how the covariance structure was chosen, and a statement of how many models were examined. Readers can more carefully consider the presented models if they understand the process from which the models were generated.

2. Explicitly state whether centering was or was not used, and if it was, provide details on which variables were centered and how they were centered. Without knowledge of centering decisions, readers cannot easily interpret the regression coefficients.

3. Explicitly state whether there were specification checks, whether distributional assumptions were considered, and whether data were screened for outliers. If such checks were made, state both the method used and what was found. Without this sort of information it is easier to question the credibility of the results.

4. State whether the data were complete. If they were not, describe the missingness and attempt to provide insight into its possible effects on the results.

5. Provide details on the analysis methods, including a statement of the software used, the method of estimation, whether convergence was obtained, and whether all variance estimates were admissible. It is important for authors to list the version of the software used in case bugs in a specific version are found at a later date that may call the interpretation into question. The other details are helpful for interpreting the parameter estimates.

6. For any interpreted model, provide a complete list of all parameter estimates. In addition to providing critical information for interpreting the results, this helps to communicate the precise model estimated.

7. Provide either standard errors or interval estimates of the parameters of interest. This recommendation is consistent with the general reporting guidelines provided by the APA Task Force on Statistical Inference (Wilkinson & Task Force on Statistical Inference, 1999). Statistical significance tests provide limited inferential information and can be difficult to interpret when large numbers of tests have been conducted, as was typical in the reviewed applications.
We recognize that it would also be helpful if we provided some concrete guidelines regarding the conduct of hierarchical modeling. Unfortunately, the models are complex, and the methodological decisions regarding their implementation are best made only after careful consideration of a particular application. For example, whether group mean centering makes for a good recommendation depends on the application being considered, as does whether restricted maximum likelihood is the best method of estimation. We hope that providing some reporting guidelines will heighten awareness of some of the technical issues among researchers and reviewers involved in a particular application. This in turn may lead to careful examination and critical dialogue about the issues within the context of the application, which may facilitate improvements in applied practice.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.
Anderson, A. B., Basilevsky, A., & Hum, D. P. (1983). Missing data. In P. H. Rossi, J. D. Wright, & A. B. Anderson (Eds.), Handbook of survey research (pp. 415-494). New York: Academic Press.
Basilevsky, A., Sabourin, D., Hum, D., & Anderson, A. (1985). Missing data estimators in the general linear model: An evaluation of simulated data as an experimental design. Communications in Statistics, 14, 371-394.
Beale, E. M. L., & Little, R. J. A. (1975). Missing values in multivariate analysis. Journal of the Royal Statistical Society, Series B, 37, 129-145.
Bremer, R. H. (1993). Choosing and modeling your mixed linear model. Communications in Statistics - Theory and Methods, 22, 3491-3521.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park: Sage Publications.
Burstein, Kim, & Delandshere, (1989).
Carpenter, J., Goldstein, H., & Rasbash, J. (1999). A non-parametric bootstrap for multilevel models. Multilevel Modelling Newsletter, 11, 2-5.
Christensen, R. (1996). Exact tests for variance components. Biometrics, 52, 309-314.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Dickmeyer, N. (1989). Metaphor, model, and theory in education research. Teachers College Record, 91, 151-160.
Donner, A., Birkett, N., & Buck, C. (1981). Randomization by cluster: Sample size requirements and analysis. American Journal of Epidemiology, 114, 906-914.
Draper, D. (1995). Inference and hierarchical modeling in the social sciences. Journal of Educational and Behavioral Statistics, 20(2), 115-147.
Ferron, J., Dailey, R., & Yi, Q. (2002). Effects of misspecifying the first-level error structure in two-level models of change. Multivariate Behavioral Research, 37, 379-403.
Frane, J. W. (1976). Some simple procedures for handling missing data in multivariate analysis. Psychometrika, 41, 409-415.

Giesbrecht, F., & Burns, J. (1985). Two-stage analysis based on a mixed model: Large-sample asymptotic theory and small-sample simulation results. Biometrics, 41, 477-486.
Ghosh, M., & Rao, J. (1994). Small area estimation: An appraisal. Statistical Science, 9, 55-93.
Gleason, T. C., & Staelin, R. (1975). A proposal for handling missing observations. Psychometrika, 40, 229-252.
Goldstein, H. (1986). Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika, 73, 43-56.
Goodwin, L. D., & Goodwin, W. L. (1985b). Statistical techniques in AERJ articles, 1979-1983: The preparation of graduate students to read the educational literature. Educational Researcher, 14, 5-11.
Guertin, W. H. (1968). Comparison of three methods of handling missing observations. Psychological Reports, 22, 896.
Heitjan, D. F., & Rubin, D. B. (1991). Ignorability and coarse data. Annals of Statistics, 19, 2244-2253.
Hopkins, K. D. (1982). The unit of analysis: Group means versus individual observations. American Educational Research Journal, 19, 5-18.
Hsieh, F. Y. (1988). Sample size formulae for intervention studies with the cluster as unit of randomization. Statistics in Medicine, 7, 1195-1201.
Jiang, J. (2001). Goodness-of-fit tests for mixed model diagnostics. The Annals of Statistics, 29, 1137-1164.
Kackar, R., & Harville, D. (1981). Unbiasedness of two-stage estimation and prediction procedures for mixed linear models. Communications in Statistics - Theory and Methods, 10, 1249-1261.
Kackar, R., & Harville, D. (1984). Approximations for standard errors of estimators of fixed and random effects in mixed linear models. Journal of the American Statistical Association, 79, 853-862.
Kasim, R., & Raudenbush, S. (1998). Application of Gibbs sampling to nested variance components models with heterogeneous within-group variance. Journal of Educational and Behavioral Statistics, 20, 93-116.
Kenward, M., & Roger, J. (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53, 983-997.
Keselman, H. J., Algina, J., Kowalchuk, R. K., & Wolfinger, R. D. (1998). A comparison of two approaches for selecting covariance structures in the analysis of repeated measurements. Communications in Statistics: Simulation, 27, 591-604.

Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., Kowalchuk, R. K., Lowman, L. L., Petoskey, M. D., Keselman, J. C., & Levin, J. R. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 350-386.
Kim, J., & Curry, J. (1977). The treatment of missing data in multivariate analysis. Sociological Methods and Research, 6, 215-240.
Kreft, I. G. G. (1995). Hierarchical linear models: Problems and prospects. Journal of Educational and Behavioral Statistics, 20(2), 109-133.
Kreft, I. G. G., & de Leeuw, J. (1998). Introducing multilevel models. London: Sage.
Kromrey, J. D., & Dickenson, W. B. (1996). Detecting unit of analysis problems in nested designs: Statistical power and Type I error rates of the F test for groups-within-treatments effects. Educational and Psychological Measurement, 56, 215-231.
Lange, L., & Laird, N. M. (1989). The effect of covariance structure on variance estimation in balanced growth-curve models with random parameters. Journal of the American Statistical Association, 84, 241-247.
Lee, Y., & Seely, J. (1996). Computing the Wald interval for a variance ratio. Biometrics, 52, 1486-1491.
Lindley, D., & Smith, A. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society, Series B, 34, 1-41.
Lindstrom, M., & Bates, D. (1989). Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83, 1014-1022.
Little, T. D., Lindenberger, U., & Nesselroade, J. R. (1999). On selecting indicators for multivariate measurement and modeling with latent variables: When "good" indicators are bad and "bad" indicators are good. Psychological Methods, 4, 192-211.
Little, R. J., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley & Sons.
Littell, R., Milliken, G., Stroup, W., & Wolfinger, R. (1996). SAS system for mixed models. Cary, NC: SAS Institute Inc.
Longford, N. (1987). A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika, 74, 817-827.
Longford, N. (1993). Regression analysis of multilevel data with measurement error. British Journal of Mathematical and Statistical Psychology, 46, 301-311.
Longford, N. (2001). Simulation-based diagnostics in random-coefficient models. Journal of the Royal Statistical Society, Series A, 164, 259-273.

Meijer, E., van der Leeden, R., & Busing, F. (1995). Implementing the bootstrap multilevel model. Multilevel Modelling Newsletter, 7, 7-11.
Ofversten, J. (1993). Exact tests for variance components in unbalanced mixed linear models. Biometrics, 49, 45-57.
Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2, 173-185.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods. Newbury Park: Sage Publications.
Raudenbush, S., Bryk, A., Cheong, Y., & Congdon, R. (2000). HLM 5: Hierarchical linear and nonlinear modeling. Chicago: Scientific Software International.
Raymond, M. R., & Roberts, D. M. (1987). A comparison of methods for treating incomplete data in selection research. Educational and Psychological Measurement, 47, 13-26.
Reboussin, D. M., & DeMets, D. L. (1996). Exact permutation inference for two sample repeated measures data. Communications in Statistics - Theory and Methods, 25, 2223-2238.
Richardson, A., & Welsh, A. (1995). Robust restricted maximum likelihood in mixed linear models. Biometrics, 51, 1429-1439.
Ridgeway, V. G., Dunston, P. J., & Quian, G. (1993). A methodological analysis of teaching and learning strategy research at the secondary school level. Reading Research Quarterly, 28, 335-349.
Roy, J., & Lin, X. (2002). Analysis of multivariate longitudinal outcomes with nonignorable dropouts and missing covariates: Changes in methadone treatment practices. Journal of the American Statistical Association, 97, 40-52.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons.
Santos, R. (1981). Effects of imputation on regression coefficients. Proceedings of the Section on Survey Research Methods, American Statistical Association, 141-145.
SAS Institute Inc. (2000). SAS/Proc Mixed (Version 8) [computer program]. Cary, NC: SAS Institute Inc.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Seltzer, M., Novak, J., Choi, K., & Lim, N. (2002). Sensitivity analysis for hierarchical models employing t level-1 assumptions. Journal of Educational and Behavioral Statistics, 27, 181-222.
Shimodaira, H. (1998). An application of multiple comparison techniques to model selection. Annals of the Institute of Statistical Mathematics, 50, 1-13.

38 Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 24, 323-355. Snijder, T., & Bosker, R. (1993). Standard errors and sample sizes for two-level research. Journal of Educational Statistics, 18, 237-259. Snijder, T., & Bosker, R. (1994). Modeled variance in two-level models. Sociological Methods and Research, 22, 342-363. Stern, S., & Welsh, A. (2000). Likelihood inference for small variance components. Canadian Journal of Statistics, 28, 517-532. Teuscher, F., Herrendorfer, G., & Guiard, G. (1994). The estimation of skewness and kurtosis of random effects in the linear model. Biometrical Journal, 36, 661-672. Wada, Y. & Kashiwagi, N. (1990). Selecting statistical models with information statistics. Journal of Dairy Science, 73, 3575-3582. Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604. Wolfinger, R. (1993). Covariance structure selection in general mixed models. Communication Statistics – Simulation, 22, 1079-1106. Woodhouse, G., Yang, M., Goldstein, H., & Rashbash, J. (1996). Adjusting for measurement error in multilevel analysis. Journal of the Royal Statistical Society-A, 159, 201-212. Yu, Q., & Burdick, R. (1995). Confidence-intervals on variance components in regression-models with balanced (Q-1)-Fold nested error structure. Communications in Statistics-Theory and Methods, 24, 1151-1167. Zucker, D., Lieberman, O., & Manor, O. (2000). Improved small sample inference in the mixed linear model: Bartlett correction and adjusted likelihood. Journal of the Royal Statistical SocietyB, 62, 827-838.

Title: ___________________________________________________________________________________
Author(s): _______________________________________________________________
Journal, Year, Vol (Number), pgs: ___________________________________________

Study Characteristics (place holder)

Is there an appendix provided with technical details? ______ Yes ________ No

1. What best describes the study type?
_____ a. individuals nested in contexts
_____ b. growth curves
_____ c. individuals nested in contexts within contexts
_____ d. growth curves nested within contexts
_____ e. other, describe: ___________________________
Page(s):

2. Is a rationale (and/or advantage(s)) provided for using HLM methods in the study?
_____ a. no
_____ b. yes: _____________________________________
Page(s):

3. What is the study design?
_____ a. nonexperimental
_____ b. experimental
Page(s):

4. Thoroughly describe the data set, including scope (national, etc.) if known.
Data set: __________________________________________
Page(s):

5. What type of sampling was used?
_____ a. nonprobability
_____ b. probability
_____ c. mixed—describe: _________________________
Page(s):

6. How many level 1 units per level 2 unit? (e.g., average number of students per school in nested designs, number of observations per student in growth curve designs) ________________
Page(s):

7. How many level 2 units? (e.g., number of schools in a nested design, number of students in a growth curve design, etc.) __________________
Page(s):

8. How well was the distribution of level 1 units across level 2 units addressed?
_____ a. not described
_____ b. minimal description, e.g., only average number of students per school or only number of observations per student with no further information
_____ c. described partially, e.g., one may know the maximum number of observations per student
_____ d. described fully so that it is clear how many level 1 units there are for each level 2 unit
Page(s):

Comments/Notes

9. Fill out the following tables by listing the outcome(s) and predictors modeled.

Outcome table (one row per outcome), with coding keys for each column:
Outcome: 1 = achievement; 2 = other (specify)
Nature: 1 = continuous; 2 = binary; 3 = count; 4 = ordinal; 5 = multinomial
Normality: 0 = not discussed; 1 = normal; 2 = nonnormal
Outliers: 0 = not discussed; 1 = no; 2 = yes
Reliability: 0 = not discussed; 1 = estimated from data set; 2 = other
Validity: 0 = not discussed; 1 = validity evidence gathered using this data; 2 = other

Predictor table (one row per predictor), with coding keys for each column:
Nature: 1 = continuous; 2 = binary; 3 = count; 4 = ordinal; 5 = multinomial
Normality: 0 = not discussed; 1 = normal; 2 = nonnormal
Outliers: 0 = not discussed; 1 = no; 2 = yes
Reliability: 0 = not discussed; 1 = estimated from data set; 2 = other
Validity: 0 = not discussed; 1 = validity evidence gathered using this data; 2 = other
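The coding keys in the tables above amount to a small fixed vocabulary. A minimal sketch (Python; the variable and field names are our own, not part of the instrument) shows how a coded outcome record might be represented and checked against the keys:

```python
# Hypothetical encoding of the item-9 outcome coding keys as dictionaries.
# Field names are illustrative, not part of the original instrument.
OUTCOME_KEYS = {
    "type": {1: "achievement", 2: "other (specify)"},
    "nature": {1: "continuous", 2: "binary", 3: "count",
               4: "ordinal", 5: "multinomial"},
    "normality": {0: "not discussed", 1: "normal", 2: "nonnormal"},
    "outliers": {0: "not discussed", 1: "no", 2: "yes"},
    "reliability": {0: "not discussed", 1: "estimated from data set", 2: "other"},
    "validity": {0: "not discussed",
                 1: "validity evidence gathered using this data", 2: "other"},
}

def validate(record):
    """Return True if every coded field uses a value defined in its key."""
    return all(record[field] in key for field, key in OUTCOME_KEYS.items())

# Example: a continuous achievement outcome with nothing else discussed.
coded = {"type": 1, "nature": 1, "normality": 0, "outliers": 0,
         "reliability": 1, "validity": 0}
```

A structure like this would let coded records be tallied or cross-checked automatically across reviewers.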

MODEL SPECIFICATION

10. How many models are examined in the study? ____________________________________________
Page(s):

11. How well was the number of models examined communicated?
_____ a. not explicit but can be determined from information provided in text, tables, footnotes, etc.
_____ b. explicit statement(s) of number of models examined
_____ c. cannot be determined with confidence
Page(s):

12. Were baseline models run?
_____ a. no
_____ b. yes
_____ c. cannot be determined with confidence
Page(s):

13. How were the predictors selected? NOTE: If different methods were used for different models in the study, please list all methods used.
_____ a. not discussed or unclear
_____ b. based at least partially on a priori considerations
_____ c. based at least partially on significance tests for individual predictors
_____ d. based at least partially on effect size for individual predictors
_____ e. based at least partially on fit statistics like AIC or SBC
_____ f. other: ____________________________________
Page(s):

14. Was there more than one set of predictors for each dependent variable?
_____ a. no
_____ b. yes, but exact number of sets could not be determined
_____ c. yes, number of different sets of predictors: _________
_____ d. cannot be determined with confidence
Page(s):

15. Were interactions examined and communicated? Check ALL that apply.
_____ a. no
_____ b. yes, level 1 interaction(s)
_____ c. yes, level 2 interaction(s)
_____ d. yes, across level interaction(s)
_____ e. cannot be determined with confidence
Page(s):

16. How was the covariance structure of the model(s) specified? NOTE: If different methods were used for different models in the study, please list all methods used.
_____ a. not discussed, unclear, and/or appears defaults were used
_____ b. established prior to looking at the data
_____ c. based partially on fit statistics like the AIC or SBC
_____ d. based partially on LRTs or significance tests for individual variance components
_____ e. other: ____________________________________
Page(s):

17. Was there centering of variables at level 1?
_____ a. no discussion of centering
_____ b. no centering
_____ c. grand mean centering
_____ d. group mean centering
_____ e. other: ___________________________________ (e.g., growth curve centered at last point)
Page(s):

18. Was there centering of variables at level 2?
_____ a. no discussion of centering
_____ b. no centering
_____ c. grand mean centering
_____ d. group mean centering
_____ e. other: ___________________________________ (e.g., growth curve centered at last point)
Page(s):

Comments/Notes

19. How were the fixed effects (regression coefficients) in the model communicated (check all that apply)?
_____ a. series of regression equations
_____ b. single mixed model equation
_____ c. list of estimated effects
_____ d. verbal description
Page(s):

20. How were variance structures in the model communicated (check all that apply)?
_____ a. not mentioned
_____ b. equation representation
_____ c. list of estimated variance parameters
_____ d. verbal description
_____ e. other: __________________________________
Page(s):

21. Is there evidence of generalizability?
_____ a. no
_____ b. sensitivity analysis
_____ c. cross-validation
_____ d. replication/extension of previous research (explicit)
_____ e. internal replication, e.g., between group differences
_____ f. other: ______________________________________
Page(s):

DATA

22. Was power considered?
_____ a. no discussion of power
_____ b. general guidelines considered
_____ c. power analysis conducted
_____ d. other: __________________________________
Page(s):

23. Was there missing data?
_____ a. no missing data (skip to #27)
_____ b. no discussion of completeness of data (skip to #27)
_____ c. missing data noted at level 1, e.g., attrition, absence during testing, failure to complete instruments, etc.
_____ d. missing data noted at level 2, e.g., attrition, absence during testing, failure to complete instruments, etc.
_____ e. other: ___________________________________
Page(s):

24. If missing data were discussed, were relationships between missingness and other variables discussed?
_____ a. no
_____ b. yes: __________________________________________
Page(s):

25. If there was missing data at level 1, how was it handled?
_____ a. not applicable
_____ b. not discussed
_____ c. listwise deletion
_____ d. imputation. Type: _______________________
_____ e. other: __________________________________
Page(s):

26. If there was missing data at level 2, how was it handled?
_____ a. not applicable
_____ b. not discussed
_____ c. listwise deletion
_____ d. imputation. Type: _______________________
_____ e. other: __________________________________
Page(s):

27. Were outliers present?
_____ a. not discussed
_____ b. no
_____ c. yes
Page(s):

28. What method was used to screen for outliers?
_____ a. not discussed
_____ b. can't tell
_____ c. univariate methods
_____ d. simulation diagnostics
_____ e. residuals
_____ f. other: _______________________________________
Page(s):
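As a concrete illustration of two of the missing-data options listed in items 25-26 (an illustration of ours, not part of the instrument), the sketch below contrasts listwise deletion with simple overall-mean imputation for level-1 records nested in level-2 units; all names and values are hypothetical:

```python
# Level-1 records (students) nested in level-2 units (schools);
# one student has a missing outcome. Illustrative sketch only.
records = [
    {"school": "A", "score": 10.0},
    {"school": "A", "score": None},   # missing level-1 outcome
    {"school": "B", "score": 8.0},
    {"school": "B", "score": 12.0},
]

def listwise_delete(rows):
    """Drop any level-1 record with a missing outcome."""
    return [r for r in rows if r["score"] is not None]

def mean_impute(rows):
    """Replace missing outcomes with the overall mean of observed scores."""
    observed = [r["score"] for r in rows if r["score"] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, score=mean if r["score"] is None else r["score"])
            for r in rows]
```

Note that listwise deletion shrinks the level-1 sample within a cluster, while naive mean imputation ignores the nesting entirely; both behaviors are exactly what items 25-26 ask coders to record.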

Comments/Notes

29. How was imperfect measurement handled?
_____ a. not discussed
_____ b. consequences considered
_____ c. other: _______________
Page(s):

30. Using the chart, indicate if distributional assumptions of the model were considered.

Coding keys:
Considered: 0 = not discussed; 1 = considered
Evidence of Violation: 0 = not discussed; 1 = examined and no violation found; 2 = examined and violation found
Action Taken (if violated): 0 = ignored; 1 = consequences considered; 2 = corrective action taken

Assumptions (one row each, coded on the three keys above):
Level-1 residuals ~ N
Level-1 residuals have equal variance for each level-2 unit
Level-2 residuals ~ N

ESTIMATION AND TESTING

31. What software package/version was used? Please list name and version or year.
_____ a. not given
_____ b. package stated
_____ c. package and version or year stated
Software Information: __________________
Page(s):

32. What method of estimation was used?
_____ a. not given
_____ b. given: _____________________________________
Page(s):

33. What estimation algorithm was used?
_____ a. not given
_____ b. given: _____________________________________
Page(s):

34. Were any convergence problems encountered?
_____ a. not mentioned
_____ b. no
_____ c. yes: __________________________________________
Page(s):

35. Were any of the covariance matrices not positive definite?
_____ a. not mentioned
_____ b. no
_____ c. yes: __________________________________________
Page(s):

36. For which variance/covariance parameters were estimates provided?

36-a. error variance of the intercepts (τ00)
_____ a. not applicable since not estimated
_____ b. provided
_____ c. estimated but not provided
_____ d. insufficient information provided
Page(s):

36-b. error variance of each regression coefficient or slope (e.g., τ11, τ22)
_____ a. not applicable since not estimated
_____ b. provided
_____ c. estimated but not provided
_____ d. insufficient information provided
Page(s):

Comments/Notes

36-c. covariance between the intercept and slope errors (e.g., τ12, τ23)
_____ a. not applicable since not estimated
_____ b. provided
_____ c. estimated but not provided
_____ d. insufficient information provided
Page(s):

36-d. first level error variance (typically σ2, but could be both σ2 and ρ)
_____ a. provided
_____ b. estimated but not provided
Page(s):

37. What additional variance parameter information is provided? Check all that apply.
_____ a. none
_____ b. SEs
_____ c. confidence intervals
_____ d. significance tests
_____ e. reliabilities
_____ f. ICCs
_____ g. explained variance
Page(s):

38. If CIs or significance tests were reported for variance parameters, what method was used?
_____ a. not applicable
_____ b. not stated
_____ c. SE/z-estimate
_____ d. χ2, type (if given): ____________________
_____ e. exact, bootstrap, other: __________________
Page(s):

39. What fixed effect parameter information is provided? Check all that apply.
_____ a. none
_____ b. SEs
_____ c. confidence intervals
_____ d. significance tests
_____ e. point estimates: __________________________
_____ f. other (e.g., effect size): _____________________
Page(s):

40. If CIs or significance tests were reported for fixed effects, what method was used?
_____ a. not stated
_____ b. likelihood ratio
_____ c. t or F test, degree of freedom method NOT stated
_____ d. t or F test, degree of freedom method IS stated
_____ e. permutation, bootstrap, other
Page(s):

41. What level-1 parameter information is provided? Please check all that apply.
_____ a. none
_____ b. estimates provided, but method not stated
_____ c. OLS or EB estimates
_____ d. statistical tests for OLS or EB estimates
_____ e. CIs for OLS or EB estimates
_____ f. other (e.g., just equations): ___________________
Page(s):

42. Is there something else in this study that needs to be 'captured' that hasn't been addressed in other items?
____________________________________________
____________________________________________
Page(s):
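Among the variance-parameter summaries item 37 asks coders to record is the intraclass correlation. For a two-level random-intercept model, the ICC is the standard ratio τ00 / (τ00 + σ2), i.e., the proportion of total outcome variance lying between level-2 units. A minimal sketch (the variance components here are made up for illustration):

```python
def icc(tau00, sigma2):
    """Intraclass correlation for a two-level random-intercept model:
    the share of total variance that lies between level-2 units."""
    return tau00 / (tau00 + sigma2)

# Example with made-up variance components: a quarter of the
# variance is between schools, three quarters within.
rho = icc(0.25, 0.75)
```

Reporting this quantity alongside the raw variance components, as item 37 anticipates, gives readers an immediately interpretable summary of the clustering.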