Advanced Statistics: Missing Data in Clinical Research—Part 2: Multiple Imputation
Craig D. Newgard, MD, MPH, Jason S. Haukoos, MD, MS

Abstract
In part 1 of this series, the authors describe the importance of incomplete data in clinical research, and provide a conceptual framework for handling incomplete data by describing typical mechanisms and patterns of censoring, and detailing a variety of relatively simple methods and their limitations. In part 2, the authors will explore multiple imputation (MI), a more sophisticated and valid method for handling incomplete data in clinical research. This article will provide a detailed conceptual framework for MI, comparative examples of MI versus naive methods for handling incomplete data (and how different methods may impact subsequent study results), plus a practical user’s guide to implementing MI, including sample statistical software MI code and a deidentified precoded database for use with the sample code.
ACADEMIC EMERGENCY MEDICINE 2007; 14:669–678 © 2007 by the Society for Academic Emergency Medicine
Keywords: missing data, bias, clinical research, imputation, multiple imputation, statistical analysis

In part 1 of this series, we provide a conceptual overview of incomplete or missing data, discuss the patterns and mechanisms of censoring, and describe several relatively simple methods of handling incomplete data and their limitations.1 In this article, we will discuss multiple imputation (MI) as one potential solution to the problem of censored data. The objectives of this article are 1) to provide a conceptual framework and practical user’s guide for using MI, 2) to demonstrate examples of various methods of handling incomplete data (including MI) under different scenarios using real data and how such methods may impact study results, and 3) to provide sample MI code (using SAS software) and a publicly available, deidentified database to be imputed. As described in part 1, we will continue to refer to values as ‘‘censored’’ or ‘‘incomplete’’ if they were not observed in the original data set and to use the phrase ‘‘missing values’’ to refer to the actual values that would have been present had they been observed.1

Single imputation methods have substantial limitations (i.e., the generation of inappropriately small variances and potentially biased estimates) that limit their use as potential methods of handling censored data. Several valid methods for handling incomplete data have been described, including maximum likelihood (ML) estimation, Bayesian estimation, and MI. However, MI is the only technique that is computationally straightforward, versatile, relatively easy to apply, and increasingly available in standard statistical software programs such as SAS (SAS Institute, Cary, North Carolina), S-plus (Insightful Corporation, Seattle, WA), R (R Foundation, Vienna, Austria), and Stata (StataCorp, College Station, TX). Given these benefits, we have opted to describe the use of MI in detail. Since its initial development for survey research,2,3 MI has been successfully used in many different research settings, including clinical trials,4–6 longitudinal studies,4–9 studies where drop-out depends on an unobserved treatment effect,8 right and interval censored data (double censored data),10 epidemiologic analyses,11–13 registries,14,15 probability samples,16 clustered data,9,17,18 and others. When MI has been compared with alternative methods of handling incomplete data (e.g., complete-case analysis, single imputation, missing indicator regression), MI has been shown to generate less biased estimates with more statistical efficiency.4–7,13–15,19–21 Despite the increasing availability of appropriate incomplete data methods, naive methods for handling incomplete data are still widely prevalent, even among clinical trials published in top-tier medical journals.22

From the Center for Policy and Research in Emergency Medicine, Department of Emergency Medicine, Oregon Health & Science University (CDN), Portland, OR; and Department of Emergency Medicine, Denver Health Medical Center, and the Department of Preventive Medicine and Biometrics, University of Colorado Health Sciences Center (JSH), Denver, CO. Received June 30, 2006; revision received September 13, 2006; accepted November 23, 2006. Presented in part at the SAEM annual meeting, San Francisco, CA, May 2006. Series editor: Roger J. Lewis, MD, PhD, Harbor–UCLA Medical Center, Torrance, CA. Contact for correspondence and reprints: Craig D. Newgard, MD, MPH; e-mail: [email protected].

© 2007 by the Society for Academic Emergency Medicine doi: 10.1197/j.aem.2006.11.038

ISSN 1069-6563 PII ISSN 1069-6563583


CONCEPTUAL OVERVIEW OF MI

The general basis of MI is to use observed values to generate a range of plausible values (i.e., imputations) for each censored value, based on existing correlations between variables. The imputations are intended to represent a plausible range of values that approximate the missing value, had it not been missing. The variability of values within this range allows the uncertainty in the imputation process to be quantified and integrated into the analysis. When certain assumptions are met, the MI process allows for the inclusion of all subjects in the analysis despite having censored data, while also accounting for the uncertainty inherent in the imputation process. Multiple imputation is an extension of single imputation, where each censored value is replaced by a set of m > 1 simulated values (generally 5–10) that exist in m complete data sets.3 Each of the m complete data sets is then analyzed using standard analytic methods, and the results from each data set are combined using a set of rules that appropriately account for the uncertainty (variance) in the MI process (Figure 1).3 These methods allow for estimates of parameters, standard errors, and confidence intervals that incorporate the uncertainty inherent in the imputation process.3,23 Multiply imputed data are not intended to be used to assess individual subjects included in a data set, but rather to allow one to draw inferences (e.g., point estimates and confidence intervals) for the aggregate sample that approximates the true values in the population from which the sample was drawn.24

Assumptions Required for MI

The most important assumption required for MI pertains to the underlying mechanism of censoring in the data. MI will generate valid results when the underlying pattern of
censoring is ‘‘ignorable.’’3 Such a situation exists when data are either missing completely at random (MCAR; where the probability that data are censored does not depend on observed or unobserved values) or missing at random (MAR; where the probability of being censored is entirely explained by observed values).23 Nonignorable censoring is said to exist when both observed and missing values are required to explain the pattern of censoring (i.e., a ‘‘missing not at random’’ [MNAR] mechanism).23 Other valid methods for handling incomplete data (e.g., ML estimation and Bayesian modeling) require the same assumption. Appropriate selection of variables for the data model (discussed below) will help increase the likelihood of having an MAR mechanism of censoring, despite the inability to directly test whether the mechanism is MAR versus MNAR. MCAR is least plausible and generally not a realistic assumption. Although biased results may result from MI when data have an MNAR mechanism, some studies have suggested that even under such situations, MI generates less biased estimates than naive methods of handling the same censored values.5,20 Most MI methods also assume that data conform to a multivariate normal distribution (i.e., variables are normally distributed), although MI methods may be robust to deviations from this assumption.25 To meet this assumption, a continuous variable with a skewed distribution should be transformed (e.g., log-transformed, squared, or polynomial) to approximate a normal distribution. Some MI software will allow for log-linear or other mixed models that do not require variables to be normally distributed.26,27 The censored data must also represent ‘‘item nonresponse’’ (i.e., all subjects have at least some portion of observed values), because MI does not accommodate ‘‘unit nonresponse’’ (i.e., subjects with no observed values for any variables).24
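The transformation step described above can be illustrated with a short sketch. The code below is not from the authors' materials; it is a minimal Python illustration with simulated, hypothetical data, showing that a right-skewed variable becomes approximately symmetric after log transformation, using a simple moment-based skewness estimate:

```python
import math
import random
import statistics

random.seed(1)

# A right-skewed variable: log-normal draws stand in for, say, a
# skewed laboratory value (hypothetical data for illustration only).
raw = [math.exp(random.gauss(0, 1)) for _ in range(2000)]
logged = [math.log(v) for v in raw]

def skewness(values):
    """Moment-based sample skewness: E[(X - mean)^3] / sd^3."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

# The raw variable is strongly right-skewed; the log-transformed
# variable is far closer to symmetric (skewness near zero).
print(skewness(raw), skewness(logged))
```

In practice one would inspect the distribution (histogram or skewness statistic) before and after candidate transformations and retain the one that best approximates normality.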

Figure 1. Overview of multiple imputation process with sample database (n = 38,167).

ACAD EMERG MED, July 2007, Vol. 14, No. 7, www.aemj.org

The Data Model

The first step in MI is to generate a probability model that relates the complete data set (Y), consisting of both observed values (Yobs) and missing values (Ymis), to a set of parameters.3,23 The goal is to use Yobs to generate a predictive distribution for Ymis, specified as p(Ymis | Yobs). This process is accomplished using Bayesian methods. In brief, a parameter of interest is identified and the prior distribution is specified (a noninformative prior distribution is typically used) based on Yobs. The prior distribution is combined with a likelihood function to generate the Ymis posterior distribution (posterior probability). The imputation process is iterative, so these steps are repeated many times, refining the posterior distribution until the predictive Ymis distribution stabilizes and the process converges.20,23 A common modeling technique applied in many MI software programs is a Markov chain Monte Carlo simulation,28 while other software programs allow the MI data model to be specified using flexible sequential regression models (e.g., linear, logistic, Poisson, mixed log-linear, polytomous, and proportional hazards models).24,26,27,29 Detailed description of the different methods available for MI data models and simulation is beyond the scope of this article but can be found in several useful references.3,23,25,26,28 After the conditional distribution for Ymis has been created, multiple random draws are generated from this posterior distribution (equivalent to the m number of complete data sets), thus producing the MIs for each subject with originally censored values. Each draw from the posterior distribution generates different estimates for the censored values, allowing integration of both the variability inherent in the imputation process as well as an approximation of the missing value. The process of generating a range of plausible values for each subject with censored data distinguishes MI from single imputation.
After the m random draws have been performed, one is left with m complete data sets, each with slightly different values for variables that were previously censored. These m data sets can now be stored and used in subsequent analyses.3,23
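The draw step can be made concrete with a toy sketch. This is not the authors' SAS code or a full Bayesian MI implementation; it is a simplified Python illustration (hypothetical data) of stochastic regression imputation, in which each of m = 5 data sets replaces a censored y with a regression prediction plus random noise. A full MI procedure would also draw the regression parameters themselves from their posterior distribution, so treat this only as a sketch of the idea:

```python
import random
import statistics

random.seed(7)

# Toy data: x fully observed; y censored (None) for some subjects.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 2.0, None, 3.9, None, 6.2, 7.1, None]

# Fit a simple regression of y on x using the complete cases.
pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
xs = [a for a, _ in pairs]
mx = statistics.mean(xs)
my = statistics.mean(b for _, b in pairs)
slope = sum((a - mx) * (b - my) for a, b in pairs) / sum((a - mx) ** 2 for a in xs)
intercept = my - slope * mx
resid_sd = statistics.stdev(b - (intercept + slope * a) for a, b in pairs)

# Draw m complete data sets: each censored y is replaced by a
# prediction plus random noise, so the m data sets differ slightly.
m = 5
imputed_sets = []
for _ in range(m):
    y_complete = [yi if yi is not None
                  else intercept + slope * xi + random.gauss(0, resid_sd)
                  for xi, yi in zip(x, y)]
    imputed_sets.append(y_complete)
```

Each pass through the loop yields one completed data set; in practice these m data sets would each be analyzed and the results pooled with Rubin's combining rules.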

Selection of Variables for the Data Model: The Inclusive Strategy

In practice, MI software will handle the process of generating the posterior distribution p(Ymis | Yobs) after the user specifies which variables are to be included in the data model. The MI data model requires selection of specific types of variables, including 1) target variables (those to be included in the planned final analysis, including the outcome measure30); 2) auxiliary variables (variables not intended for the final analysis but that are highly predictive of target variables or associated with the censoring of target variables); and 3) sample design variables that describe special aspects of how the sample was generated (e.g., clusters and strata).23,31 Sample design variables may not be present in certain data sets (e.g., simple random samples) but are important when working with clustered data, probability samples, or other data sets with unique sample design features. Examples of these three types of variables are included in the sample data set provided in Appendix A (available as
an online Data Supplement at http://www.aemj.org/cgi/content/full/j.aem.2006.11.038/DC1). It is critical to preserve the associations among variables and to generate a data model that accurately portrays the conditional distribution of missing values. Appropriate selection of variables for the MI data model will allow an MAR mechanism of censoring to be approximated, resulting in valid estimates with MI. However, having an overly restrictive selection of variables for the MI data model may disrupt associations among variables and generate bias due to an inadequate ability of observed values to explain the censoring of values (i.e., allowing an MNAR mechanism to exist). Previous research has shown that an inclusive strategy (i.e., including all variables possibly important in preserving the relationships between variables) results in less bias than a restrictive strategy, with minimal loss in statistical efficiency.31 Because variables to be included in the final analysis (i.e., target variables) must be included in the MI data model, selection of additional variables (i.e., auxiliary variables) requires more thoughtful consideration to ensure an adequate data model. Auxiliary variables are included in MI models solely to improve the performance of the MI process, and generally will not be used in subsequent analyses. In a simulation study, Collins et al. showed that the inclusion of auxiliary variables either correlated with target variables and/or predictive of censoring of target variables reduced bias and variance and improved statistical efficiency.31 Because it may be difficult to assess which auxiliary variables are associated with censoring of target variables, the imputer generally selects these variables based on demonstrated correlation/association with target variables, and on plausible associations with censoring.
Even when such terms are not predictive of target values or censoring, there is minimal or no penalty for including them (i.e., no reduction in statistical efficiency).31 While there is a recognized balance between terms that add useful information to the data model, versus terms that add no information and risk statistical inefficiency, it is better to err on the side of being more inclusive during variable selection. Selection of appropriate auxiliary variables ideally should be addressed early in study development (i.e., during the design stage), because critically important auxiliary variables may not otherwise be collected24 and the integration of MI methods for handling incomplete data will also alter sample size requirements.32 Several additional points deserve mention regarding the selection of terms for the data model. Although it may seem counterintuitive, inclusion of the outcome variable in the MI data model is necessary to reduce bias in imputing missing predictor variables.30 This strategy has been shown to reduce bias in regression coefficients for predictor terms regardless of the mechanism of censoring and the use of different MI software programs.30 Also, when coding variables for the data model, one should remember that continuous variables generally provide more information than categorical variables. When possible, continuous covariates should not be categorized before inclusion in the data model (even if such coding is desired for the final analysis), because such a practice results in a decay of information for the variable.
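The selection heuristics above can be sketched in code. The snippet below is a hypothetical Python illustration (toy data and made-up variable names, not the authors' example) of the two screens an imputer might run on a candidate auxiliary variable: its correlation with a target variable among complete cases, and its association with an indicator of that target's censoring:

```python
import statistics

# Toy sample: 'age' is a candidate auxiliary variable; 'sbp' is a
# target variable with some censored (None) values. Hypothetical data.
age = [25, 40, 33, 60, 52, 47, 71, 38]
sbp = [118, None, 121, 140, None, 132, 150, None]

def pearson(u, v):
    """Pearson correlation coefficient for two equal-length lists."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

# Screen 1: correlation with the target among complete cases.
obs = [(a, b) for a, b in zip(age, sbp) if b is not None]
r_target = pearson([a for a, _ in obs], [b for _, b in obs])

# Screen 2: association with the missingness indicator (1 = censored).
miss = [1 if b is None else 0 for b in sbp]
r_missing = pearson(age, miss)
```

A candidate scoring high on either screen is a useful auxiliary variable; per the inclusive strategy, retaining it carries little penalty even when neither association turns out to be strong.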


Determining the Number of Imputed Values/Data Sets to Generate

Selection of the number of imputations (m) is based on the desired ‘‘relative efficiency’’ of MI estimates3,33 and is approximated using the following formula:

relative efficiency = (1 + λ/m)^(-1),

where λ is the rate of missing data (Table 1). In selecting the number of imputations to perform, relative efficiency is measured against a situation of perfect efficiency (100%). In most instances, m = 10 provides a high rate of efficiency (i.e., ≥95%) and is unlikely to produce any notable change in precision of results compared with 100% efficiency.3

Analyzing Multiply Imputed Data

Once the m imputed data sets have been created, the data in each data set can be analyzed using standard statistical procedures. This is one of the enticing features of using MI, because most univariate and multivariable statistical procedures can be performed following the imputation process, analyzing each imputed data set separately. The results are subsequently combined using the following mathematical rules developed by Rubin.3 After completion of statistical analyses for each imputed data set, parameter estimates (Qi) are averaged over the m number of data sets:

Q̄m = (1/m) Σ Qi,    (1)

where Qi is the point estimate from each imputed data set. The within-imputation variance (Ūm) is represented by the average of the m complete data (i.e., imputed data sets with no censored values) variances,

Ūm = (1/m) Σ Ui,    (2)

where Ui is the variance for each imputed data set. The between-imputation variance (Bm) is

Bm = [1/(m − 1)] Σ (Qi − Q̄m)².    (3)

The total variance (Tm) is generated by combining the within-imputation and between-imputation variances,

Tm = Ūm + (1 + 1/m) Bm.    (4)

Table 1
Relative Efficiency of Multiple Imputation with an Increasing Number of Imputations and Proportion of Censored Data

No. of                 Proportion of Censored Data (λ)
Imputations (m)     10%     30%     50%     70%     90%
 3                  0.97    0.91    0.86    0.81    0.77
 5                  0.98    0.94    0.91    0.88    0.85
10                  0.99    0.97    0.95    0.93    0.92
20                  1.00    0.99    0.98    0.97    0.96
30                  1.00    0.99    0.98    0.98    0.97
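The relative efficiency formula is easy to verify directly. The following short Python sketch (not part of the original article's code) reproduces representative entries of Table 1:

```python
# Relative efficiency of an MI estimate based on m imputations when a
# fraction lam of the data are missing: RE = (1 + lam/m)^(-1).
def relative_efficiency(lam, m):
    return 1.0 / (1.0 + lam / m)

# Reproduce representative Table 1 entries (rounded to two decimals).
print(round(relative_efficiency(0.5, 10), 2))  # 0.95
print(round(relative_efficiency(0.9, 3), 2))   # 0.77
```

Note how quickly the formula saturates: even with 90% censoring, 10 imputations already recover 92% of the efficiency of infinitely many imputations.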




An interval estimate (e.g., 95% confidence interval) is then generated by

Q̄m ± tν(α/2) Tm^(1/2),    (5)

where tν(α/2) represents the upper and lower 100(α/2) percentage point of the Student’s t-distribution with ν degrees of freedom,

ν = (m − 1)[1 + Ūm/((1 + 1/m) Bm)]².    (6)
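Rubin's combining rules translate directly into code. The sketch below is a minimal Python illustration with made-up point estimates and variances (not data from this study):

```python
# Rubin's combining rules for m point estimates Q_i with
# within-imputation variances U_i (Equations 1-6).
def combine(estimates, variances):
    m = len(estimates)
    q_bar = sum(estimates) / m                                 # Eq. 1
    u_bar = sum(variances) / m                                 # Eq. 2
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)     # Eq. 3
    t = u_bar + (1 + 1 / m) * b                                # Eq. 4
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2        # Eq. 6
    return q_bar, t, df

# Toy results from m = 5 imputed data sets.
q_bar, t, df = combine([1.02, 0.98, 1.05, 0.97, 1.03],
                       [0.04, 0.05, 0.04, 0.05, 0.04])

# With large df, a 95% interval uses t = 1.96 (Eq. 5).
ci = (q_bar - 1.96 * t ** 0.5, q_bar + 1.96 * t ** 0.5)
```

With these toy inputs the between-imputation variance is small relative to the within-imputation variance, so the total variance is only mildly inflated by the imputation process.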

When the degrees of freedom are large, tν(α/2) = 1.96 for a 95% confidence interval. Of note, degrees of freedom are not based on sample size, but rather on within- and between-imputation variances, plus the number of imputed data sets.

Special Situations with MI

Imputer versus Analyst. Because there are two steps to integrating MI into an analysis (i.e., the imputation process followed by the statistical analysis), it is possible that different people will be in charge of each step. Although this is feasible, it is not ideal, because there are unique features of the analysis that should be included in the imputation process (e.g., selection of variables, interaction terms, and sample design features). Failure to include such information in the missing data procedure could adversely impact subsequent study results,25,33 and selection of variables for the imputation model may differ depending on the variables of interest for the final analysis.33,34 The imputation process should be designed with the target analysis in mind to maximize statistical efficiency and validity of subsequent results.

Coding ‘‘Missing.’’ Before starting the MI procedure, one must carefully consider what is actually ‘‘missing.’’ While there are clearly values that are truly censored, one should also consider recorded values that are known to be inaccurate or clinically nonsensible (e.g., a spontaneous respiratory rate in an intubated patient). Although many of these values or discrepancies may be clarified during the data cleaning process when checked against other variables or when the option to recheck the source data is available, some discrepancies are likely to persist. MI is one option to consider in such situations by multiply imputing values that are discrepant.
It is important to note that handling discrepant data should be guided by a clear, objective plan set forth before evaluation of the data to avoid arbitrary decisions regarding discrepant values and the introduction of bias into the results. Another possibility is the inclusion of linked variables, where the presence of one term is required to have the presence of another. For example, if air bag presence and air bag deployment are both included in the MI data model, it is possible that some observations could be imputed to have no air bag present but to have air bag deployment. Such discrepancies should be checked after the imputation process and, if present, one may consider splitting the sample on such a term to avoid this situation (discussed below).
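A post-imputation consistency check of this kind can be automated. The snippet below is a hypothetical Python illustration (invented records and field names) that flags imputed observations with air bag deployment but no air bag present:

```python
# After imputation, check linked variables for logical consistency:
# air bag deployment requires air bag presence. Hypothetical rows.
imputed = [
    {"airbag_present": 1, "airbag_deployed": 1},
    {"airbag_present": 0, "airbag_deployed": 0},
    {"airbag_present": 0, "airbag_deployed": 1},  # discrepant
    {"airbag_present": 1, "airbag_deployed": 0},
]

discrepant = [r for r in imputed
              if r["airbag_deployed"] == 1 and r["airbag_present"] == 0]
print(len(discrepant))  # 1
```

If such records appear, one remedy discussed in the text is to split the sample on the linked term and impute within each subsample.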


Probability Samples. Probability samples are commonly used in survey research and are becoming increasingly common in nationally representative, public access databases. Because it is important to preserve relationships between variables in the MI data model, sample design variables (i.e., variables that describe clusters and strata) should be included in the imputation process. Although this area of imputation is still being developed, two potential options for including such design factors are fixed effects and random effects models.16,23,31 If there are an adequate number of observations in each stratum, either fixed effects or random effects models can be used to account for individual clusters within separate strata.16 That is, separate MI models are generated for each stratum within the sample, with clustering accounted for within each stratified MI analysis (Appendix B, available as an online Data Supplement at http://www.aemj.org/cgi/content/full/j.aem.2006.11.038/DC2). The imputed data from each stratified MI model are then combined into one data set after the imputation process. Failure to include key study design features can introduce severe bias into subsequent results.16

Effect Modification. If detection of effect modification (i.e., interaction terms) is an important part of the planned analysis, interaction terms should be accounted for in the MI data model because failure to include such terms may weaken such higher-order relationships between variables.
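The split-then-recombine pattern described above for stratified samples (which also underlies the interaction ‘‘chains’’ and split sample MI discussed next) can be sketched as follows. This is a deliberately simplified Python stand-in with toy data: each stratum is ‘‘imputed’’ by drawing around its own mean rather than by a full MI model per stratum:

```python
import random
import statistics

random.seed(3)

# Toy records: (sex, value); value is None when censored. Hypothetical data.
records = [("F", 10.2), ("F", None), ("F", 9.8), ("F", 10.5),
           ("M", 14.9), ("M", 15.3), ("M", None), ("M", 14.6)]

def impute_stratum(rows):
    """One 'chain': impute within a single stratum by drawing around
    the stratum's own mean (a simple stand-in for running the full
    MI procedure separately within each stratum)."""
    obs = [v for _, v in rows if v is not None]
    mu, sd = statistics.mean(obs), statistics.stdev(obs)
    return [(g, v if v is not None else random.gauss(mu, sd))
            for g, v in rows]

# Split by the categorical term, impute each stratum, then recombine.
completed = []
for level in ("F", "M"):
    completed += impute_stratum([r for r in records if r[0] == level])
```

Because each stratum is imputed from its own distribution, the systematic difference between strata is preserved in the recombined data set.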
There are three options for including interaction terms in the MI data model, including 1) creating the interaction term(s) following MI (least efficient), 2) creating the interaction term(s) before imputation and including the term(s) in the MI data model (moderate efficiency), or 3) separating the data set by levels (strata) of one categorical variable included in the interaction (provided one of the variables in the interaction is categorical) and imputing using separate ‘‘chains’’ of MI models (most efficient).35 These parallel ‘‘chains’’ of MI data models simply refer to splitting the sample based on levels of one categorical term included in the interaction and then conducting the standard MI process within each subsample. Following the MI process, the imputed data sets are recombined and the interaction term is generated. Although most efficient, this last option is only viable if one of the main effect terms in the interaction is categorical (i.e., interactions involving two continuous variables would not work), if such a categorical term has few or no censored values, and if there are an adequate number of subjects in each level of the categorical variable.

Split Sample MI. In some cases, there may be a situation where certain subjects are believed to be inherently different from the rest of the sample, such that a data model that includes all subjects may not be able to fully realize such differences in missing values (i.e., an MNAR mechanism would exist), resulting in invalid estimates for some subjects. Rubin first raised this concept in the setting of survey respondents and nonrespondents.3 Another example would be a study examining the association between vital signs and outcomes among children and adults. Children have different physiology and are subject to different medical decision-making (relative to
adults), so it may be more appropriate to impute values for children separately from those of adults. In such a situation, one can perform separate MI data models (i.e., children vs. adults) and recombine the data following MI to proceed with the statistical analysis, generating more valid estimates. Such an approach has been used successfully in epidemiologic analyses.12

MI versus ML

MI and ML estimation are closely related but have some notable differences. They are both valid methods for handling incomplete data and have the same basic assumptions (i.e., ignorable mechanism of censoring and multivariate normality). Similar to MI, ML relies on the probability model, uses a complete data likelihood function to link missing values to model parameters, and generally requires inclusion of auxiliary variables.20,31 The critical differences between ML and MI are as follows: 1) ML methods are model specific (MI methods are not) and will change based on different statistical analyses; 2) ML is computationally intensive and higher rates of censoring produce slower rates of convergence; 3) ML does not fill in missing values, but rather accounts for them during the analysis (i.e., ML is integrated in one step during the analysis); and 4) ML is more efficient than MI, although this difference is generally small provided an adequate number of imputations (m) are used in MI.3,31,36 Recognizing these differences, ML is a very reasonable option for handling incomplete data that can produce valid results when used appropriately.

How Do You Know MI Generated Valid Estimates?

While it is not possible to be certain how closely the MI estimates compare with the ‘‘truth’’ (i.e., if we knew the true values, there would be no need for a missing data procedure), there are several tactics that may be used to assess how well the MI process worked. First, the imputed values should be examined.
Most MI programs will produce summary statistics for imputed values (e.g., proportion imputed in each category of a categorical variable, means, and standard deviations for continuous variables). Reviewing this output is helpful in assessing whether unexpected findings are present. A large number of imputations for one value or clustered in an unusual distribution may suggest a limitation in the data model. The imputer must be familiar with the data as well as potential reasons why data are missing to allow insight into interpreting imputed results and to assess whether the data model produced plausible values. Similarly, results from subsequent statistical analyses should be assessed. If known associations between predictors and a given outcome are not present (or are opposite to expected associations), the MI procedure, the data model, auxiliary variables, and the sample itself should be examined to resolve such inconsistencies. If the MI data model is flawed in its ability to capture the censoring mechanism (i.e., to generate an MAR mechanism), then subsequent analyses using such imputed values will also be flawed.14,37 Finally, sensitivity analyses can be very helpful in further assessing how well the MI process may (or may not) have worked and how robust the results are to the assumptions required for MI. For example, one of the
previously imputed data sets (i.e., no censored values) could be used to recreate certain patterns, mechanisms, and proportions of censored values and then imputed and analyzed and compared with the ‘‘true’’ results to allow assessment of the robustness of results to permutations in missingness. Such simulations, performed under a variety of censored data situations (i.e., MCAR, MAR, MNAR) and different MI data models, are important in assessing the ability of the MI procedure to generate valid results. Consistent results that remain qualitatively similar despite different underlying patterns and mechanisms of censored data will increase confidence in the results. Problems with any of these checks would indicate the need to reexamine the MI data model, the variables included in the model, and the adequacy of information in the sample for using MI methods (i.e., whether there are enough auxiliary variables in the data set to fulfill the requirement for a MAR mechanism).
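The sensitivity-analysis idea above can be sketched in a few lines. The following Python illustration uses simulated data, and mean imputation stands in for the full MI procedure purely to keep the example short: censor a known fraction of a complete variable completely at random, ‘‘impute,’’ and compare the resulting estimate with the known truth:

```python
import random
import statistics

random.seed(11)

# Start from a complete, known "truth" (simulated data).
truth = [random.gauss(100, 15) for _ in range(500)]

# Censor roughly 30% of values completely at random (MCAR).
censored = [v if random.random() > 0.3 else None for v in truth]

# "Impute" with the observed mean (a stand-in for the MI procedure).
obs = [v for v in censored if v is not None]
mu = statistics.mean(obs)
imputed = [v if v is not None else mu for v in censored]

# Compare the post-imputation estimate with the known truth.
error = abs(statistics.mean(imputed) - statistics.mean(truth))
```

Repeating this under MCAR, MAR, and MNAR censoring patterns, and with different imputation models, shows how robust the final estimates are to the assumptions of the missing data procedure.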

How Much Missing Data Can Be Tolerated without Using a Missing Data Procedure?

Although some investigators have suggested that naive methods for handling censored values may be reasonable when only a ‘‘small portion’’ of censored data is present,33 generating a general guideline for what constitutes a ‘‘small portion’’ is quite difficult. The amount of bias and reduction in precision incurred by ignoring missing values depends on several factors, including the proportion of complete cases available for analysis (relative to the total number of cases in the sample), the underlying pattern and mechanism of censoring, and the extent to which complete and incomplete cases differ.23 The planned analysis should also be considered, because multivariable methods are influenced by the proportion of
censored data for all variables included in the analysis (each of which is generally different from the rest), which often results in dramatic reductions in sample size (Figure 2). Even with a relatively small proportion of censored data (e.g., 3%) there can be substantial bias and reduction in precision, to the point of reversing the direction of effect for certain variables.15 A conservative strategy is to always include an a priori plan for handling incomplete data regardless of the proportion of censored values, because there is no proportion of censored data under which valid results can be guaranteed, and discerning the underlying mechanism of censoring as well as inherent differences between complete and incomplete cases is generally not possible.

EXAMPLES

To illustrate the concepts discussed in this paper, we have included several examples using a publicly available, probability sampled, nationally representative data set (National Automotive Sampling System Crashworthiness Data System [NASS CDS]) of motor vehicle crashes (MVCs). All drivers 16 years or older involved in MVCs with passenger vehicles and light trucks and included in NASS CDS during 1995–2003 (N = 38,167) are included in these examples. To create the MI data model, we included 11 target variables (i.e., ten predictors, plus the outcome variable), eight auxiliary variables, and two sampling design variables (clusters and strata). The distributions of all continuous variables were checked for normality, with log conversion of those variables with skewed distributions. To allow for comparison to the ‘‘truth’’ (i.e., results from the same analyses using a data set with no missing values), we used a previously imputed (i.e., complete) NASS CDS data set, which was then modified with

Figure 2. The effect of increasing the number of variables in a multivariable logistic regression model on results for restraint use in motor vehicle crashes (n = 38,167). In this example, the proportion of cases with censored restraint data remains fixed at 30%, with a missing at random mechanism of censoring. Restraint use was included in all models (outcome = Injury Severity Score ≥16). Additional variables included factors known to be associated with serious injury (e.g., passenger space intrusion, ΔV, steering wheel deformity, and so on). Results for the ‘‘truth’’ represent identical analyses in an otherwise identical data set with no censored values. Vertical bars are 95% confidence intervals. CC = complete case.


proportions of censored data for all target and auxiliary variables identical to that from the original NASS CDS database. The results from these analyses are hypothetical in that the data set used to represent the ‘‘truth’’ is based on a previously imputed NASS data set. Such an approach was required to allow comparison to results with no censored data and to provide an opportunity to demonstrate how different features of censored data (i.e., proportion of censoring, mechanism of censoring, sample size, and type of variable to be imputed) impact subsequent results. While we believe the generation of this hypothetical complete data set is unlikely to bias the results of these examples, it is important to recognize that the MI process may not perform similarly with different data sets, having varying levels and availability of explanatory variables, different sample sizes, and varying mechanisms of censoring. We focus on one categorical variable (restraint use) and one continuous variable (change in velocity [ΔV]) included in these models to demonstrate how different types of variables can be affected by different methods of handling incomplete data. For consistency, most examples use the same ten-variable logistic regression model using a binary outcome variable coded as ‘‘yes’’ for an Injury Severity Score ≥16 and ‘‘no’’ for an Injury Severity Score