DEALING WITH MISSING DATA IN RANDOMISED CONTROL TRIALS
Ania Filus, PhD, Parenting & Family Support Centre, University of Queensland
WHY DO MISSING DATA MATTER?
Most estimation procedures were not designed to handle missing data
Loss of power due to smaller sample size
Loss of external validity: the sample does not reflect what you really want to measure
Loss of internal validity: biased estimates of statistics due to incorrect variances or covariances
WHAT TO DO?
Best situation: avoid missing data issues through careful study design, recruitment and retention. The best thing about missing data is not to have any!
Realistic situation: despite all your attempts you will probably have missing data anyway.
SO WHAT IF YOU END UP WITH MISSING DATA?
You want to deal with missingness in a way that will give you:
Unbiased parameter estimates, e.g., b-weights
Good estimates of variability, e.g., standard errors
The best statistical power
THE QUESTION YOU SHOULD ALWAYS HAVE IN THE BACK OF YOUR MIND:
What is the best from the purely statistical point of view? versus What is the best for your specific study, sample and analyses you want to perform?
HOW TO CHOOSE THE OPTIMAL STRATEGY?
From the statistical point of view, the way you treat your missing data will depend mostly on the amount of missingness and its mechanism
AMOUNT OF DATA MISSING
If 5% or less is missing, any substitution method should be adequate
However, if these missing cases are very different from the cases with complete data, substantial bias can still result from even a small amount of missing data
There are no rules of thumb regarding when there is too much missingness
However, cases with lots of missing values may not be adequately sampled in the first place
However, variables with lots of missing values may not be adequately measured in the first place
MECHANISM OF MISSINGNESS
Mechanisms describe how the likelihood of a missing value on a variable Y relates to the data, if at all
Three mechanisms:
Missing Completely At Random (MCAR): ignorable
Missing At Random (MAR): ignorable
Missing Not At Random (MNAR): not ignorable
EXAMPLE
Suppose that we measure gender and the number of chocolate truffles eaten weekly...
MISSING COMPLETELY AT RANDOM
The probability that the data on the variable Y is missing (M) is unrelated to the observed data and unrelated to the values of Y
Example: the dog ate the response sheets
[Diagram: observed data, M, missing data]
MISSING AT RANDOM
The probability that the data on the variable Y is missing (M) is unrelated to the values of Y, once the observed data are taken into account, but is related to the observed data
Example: the probability that the chocolate data is missing varies according to gender but not according to the number of truffles eaten
[Diagram: observed data, M, missing data]
MISSING NOT AT RANDOM
The probability that the data on the variable Y is missing (M) is related to the values of Y itself
Example: the probability that the chocolate data is missing varies according to the number of truffles eaten within each gender
[Diagram: observed data, M, missing data]
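The three mechanisms can be made concrete with a small simulation of the truffle example. This is a minimal sketch: the population means, the missingness probabilities and the helper name `observed_mean` are illustrative assumptions, not values from any study. It compares the mean of the truffle counts that remain observed with the mean of the full data under each mechanism.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: gender and weekly truffle count (illustrative numbers only).
n = 20000
gender = [random.choice(["F", "M"]) for _ in range(n)]
truffles = [random.gauss(6.0 if g == "F" else 4.0, 2.0) for g in gender]


def observed_mean(p_missing):
    """Mean truffle count after deleting each value with probability p_missing(g, y)."""
    kept = [y for g, y in zip(gender, truffles) if random.random() >= p_missing(g, y)]
    return statistics.fmean(kept)


full_mean = statistics.fmean(truffles)
# MCAR: missingness is unrelated to gender and to the truffle count.
mcar_mean = observed_mean(lambda g, y: 0.3)
# MAR: missingness depends on the observed variable (gender) only.
mar_mean = observed_mean(lambda g, y: 0.5 if g == "F" else 0.1)
# MNAR: missingness depends on the truffle count itself.
mnar_mean = observed_mean(lambda g, y: 0.6 if y > 5.0 else 0.1)

print(f"full {full_mean:.2f}  MCAR {mcar_mean:.2f}  MAR {mar_mean:.2f}  MNAR {mnar_mean:.2f}")
```

Under MCAR the observed mean stays close to the full-data mean. Under MAR the raw observed mean drifts, but the means within each gender remain unbiased, which is what MAR-based methods exploit. Under MNAR the bias persists even within gender.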
WHY IS THE MECHANISM IMPORTANT?
Mechanisms influence the optimal strategy for working with missing values
MCAR: the best scenario. Even simple approaches can yield unbiased results
MAR: the less ideal scenario. More advanced methods are required, but they can yield unbiased results
MNAR: the worst scenario. No approach can help with the biased results
BUT DON’T WORRY TOO MUCH
We can test the MCAR assumption by examining the data
But we cannot test the MAR and MNAR assumptions, because we don’t know what the missing values are
A common approach: unless a missing value is clearly MNAR, we can assume MAR and apply methods that require this assumption
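One informal way to examine the data for departures from MCAR is to check whether the missingness rate differs across levels of a fully observed variable. The sketch below is an assumption-laden illustration, not a formal MCAR test: the function name `missing_rate_z`, the sample size and the missingness probabilities are all hypothetical. It computes a two-proportion z statistic for the missingness rate by gender.

```python
import math
import random

random.seed(2)


def missing_rate_z(groups, is_missing):
    """Two-proportion z statistic comparing missingness rates between two groups."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    counts = {label: [0, 0] for label in labels}  # [n_missing, n_total]
    for g, m in zip(groups, is_missing):
        counts[g][0] += int(m)
        counts[g][1] += 1
    (m1, n1), (m2, n2) = counts[labels[0]], counts[labels[1]]
    pooled = (m1 + m2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (m1 / n1 - m2 / n2) / se


gender = [random.choice(["F", "M"]) for _ in range(2000)]
# Hypothetical missingness indicators for the truffle variable.
mcar_flags = [random.random() < 0.2 for _ in gender]
mar_flags = [random.random() < (0.35 if g == "F" else 0.05) for g in gender]

z_mcar = missing_rate_z(gender, mcar_flags)
z_mar = missing_rate_z(gender, mar_flags)
print(f"z under MCAR: {z_mcar:.2f}   z when missingness depends on gender: {z_mar:.2f}")
```

A large absolute z suggests that missingness is related to the observed variable, so MCAR is doubtful. A small z does not prove MCAR, and no check of this kind can separate MAR from MNAR.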
SOME TRADITIONAL METHODS
Traditional methods can lead to substantial bias, some of them even when MCAR holds:
Deletion methods
Mean imputation
Regression imputation
Last Observation Carried Forward
Although easy to apply, they can do more harm than good!
SOME TRADITIONAL METHODS
Deletion methods
Listwise deletion eliminates all cases with missing values, resulting in a complete data set
Pairwise deletion eliminates cases on an analysis-by-analysis basis
Reduced N, and therefore reduced power
Can lead to substantial bias if the MCAR assumption is not satisfied, i.e. if the remaining sample is not representative of the entire population
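The difference between the two deletion methods is easiest to see on a toy data set. The following is a minimal sketch with made-up values and hypothetical helper names (`listwise`, `pairwise`); it only counts how many cases each method keeps.

```python
# Toy data set: None marks a missing value (values are made up for illustration).
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 2.0, "y": None, "z": 1.0},
    {"x": 3.0, "y": 4.0, "z": None},
    {"x": 4.0, "y": 5.0, "z": 6.0},
    {"x": None, "y": 6.0, "z": 5.0},
]


def listwise(rows):
    """Keep only cases with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise(rows, a, b):
    """Keep cases that are complete on the two variables used in one analysis."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]


complete = listwise(rows)
pair_n = {(a, b): len(pairwise(rows, a, b)) for a, b in [("x", "y"), ("x", "z"), ("y", "z")]}

print(len(complete))  # 2
print(pair_n)         # {('x', 'y'): 3, ('x', 'z'): 3, ('y', 'z'): 3}
```

Listwise deletion keeps 2 of the 5 cases for every analysis, while pairwise deletion keeps 3 cases per pair of variables, but a different 3 each time, so each statistic is based on a different subsample.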
SOME TRADITIONAL METHODS
Arithmetic Mean Imputation
Filling in the missing values with the arithmetic mean of the available cases
Although convenient, the estimates are severely biased:
Artificially reduced variance and standard deviation
Artificially reduced strength of associations between the variables
Simulation studies suggest that mean imputation is possibly the worst missing data handling method available.
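The shrinkage of the variance and of the associations can be reproduced in a few lines. This is a minimal sketch on simulated data: the correlation of 0.6, the 40% MCAR missingness rate and the helper `pearson` are assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(3)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


n = 5000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [0.6 * xi + random.gauss(0.0, 0.8) for xi in x]  # population correlation 0.6
missing = [random.random() < 0.4 for _ in range(n)]   # MCAR, 40% missing on y

# Replace every missing y with the mean of the observed y values.
mean_obs = statistics.fmean(yi for yi, m in zip(y, missing) if not m)
y_imputed = [mean_obs if m else yi for yi, m in zip(y, missing)]

sd_full, sd_imputed = statistics.stdev(y), statistics.stdev(y_imputed)
r_full, r_imputed = pearson(x, y), pearson(x, y_imputed)
print(f"SD: full {sd_full:.2f}, mean-imputed {sd_imputed:.2f}")
print(f"r:  full {r_full:.2f}, mean-imputed {r_imputed:.2f}")
```

Even though the data are MCAR here, both the standard deviation of y and its correlation with x come out smaller after mean imputation, because every imputed value sits exactly at the mean.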
SOME TRADITIONAL METHODS
Regression Imputation (conditional mean imputation)
A regression equation estimated from the complete cases predicts the missing values
Imputed values fall directly on the regression line
Reduced variability of the data
Increased associations between variables
Although considered better than mean imputation, it produces biased estimates even when the MCAR assumption is met: the predicted values are ‘too good’
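Why the predicted values are ‘too good’ can be shown with simulated data. The sketch below makes the same illustrative assumptions as any toy simulation (a true correlation of 0.6, 40% MCAR missingness, hypothetical helper `pearson`); it fits a simple regression on the complete cases and places every imputed value on the fitted line.

```python
import math
import random
import statistics

random.seed(5)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


n = 5000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [0.6 * xi + random.gauss(0.0, 0.8) for xi in x]  # population correlation 0.6
missing = [random.random() < 0.4 for _ in range(n)]   # MCAR, 40% missing on y

# Fit y = intercept + slope * x on the complete cases only.
obs = [(xi, yi) for xi, yi, m in zip(x, y, missing) if not m]
mx = statistics.fmean(xi for xi, _ in obs)
my = statistics.fmean(yi for _, yi in obs)
slope = sum((xi - mx) * (yi - my) for xi, yi in obs) / sum((xi - mx) ** 2 for xi, _ in obs)
intercept = my - slope * mx

# Imputed values fall exactly on the regression line (no residual noise).
y_imputed = [intercept + slope * xi if m else yi for xi, yi, m in zip(x, y, missing)]

r_full, r_imputed = pearson(x, y), pearson(x, y_imputed)
sd_full, sd_imputed = statistics.stdev(y), statistics.stdev(y_imputed)
print(f"r:  full {r_full:.2f}, regression-imputed {r_imputed:.2f}")
print(f"SD: full {sd_full:.2f}, regression-imputed {sd_imputed:.2f}")
```

Because the imputed points carry no residual scatter, the correlation is inflated and the standard deviation is deflated relative to the full data.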
SOME TRADITIONAL METHODS
Last Observation Carried Forward
Imputes missing repeated measures with the observation that immediately precedes dropout.
Conventional wisdom is that LOCF yields a conservative estimate of treatment group differences at the end of the study
However, simulation studies suggest that LOCF can magnify the differences between the groups at the end of the study
It is likely to produce biased parameter estimates even if the MCAR assumption is met
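The mechanics of LOCF are simple enough to write out directly. This is a minimal sketch; the function name `locf` and the example scores are hypothetical.

```python
def locf(series):
    """Replace each None with the most recent observed value.

    Values missing before the first observation stay None.
    """
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical participant who drops out after the third assessment.
scores = [24, 20, 17, None, None]
print(locf(scores))  # [24, 20, 17, 17, 17]
```

The filled-in series assumes that the participant would have stayed exactly at the last observed score, which is why LOCF can distort group differences whenever scores would have kept changing after dropout.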
WHAT THEN IF NOT TRADITIONAL METHODS?
Use modern methods:
EM
Multiple Imputation
Full Information Maximum Likelihood
These methods model the missing data mechanism and then proceed to make a proper likelihood-based analysis, either via the method of maximum likelihood or using Bayesian methods.
NEW METHODS
EM algorithm
Available in SPSS, NORM, Stata
Based on maximum likelihood estimation
E step: we estimate means, variances and covariances from the available cases. We use these estimates to create regression equations to predict the missing data
M step: we use the completed data (with imputed missing values) to recalculate the means, variances and covariances. These are substituted back into the E step
Iteration continues until the model converges (stabilizes)
Although appealing, as a single imputation method it leads to decreased standard errors and biased parameter estimates
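The E and M steps can be sketched for the simplest case: two variables, x fully observed and y partly missing. Everything specific here is an assumption made for illustration (bivariate normal data, the simulated parameter values, the MAR missingness rule, the function name `em_bivariate`). Note that the E step adds the residual variance to the expected squared value of a missing y, which is what separates EM from simply repeating regression imputation.

```python
import random
import statistics

random.seed(4)

n = 4000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y_true = [2.0 + 0.8 * xi + random.gauss(0.0, 0.6) for xi in x]  # true mean of y is 2.0
# MAR: y is missing more often when the fully observed x is positive.
y = [None if random.random() < (0.7 if xi > 0 else 0.1) else yi
     for xi, yi in zip(x, y_true)]


def em_bivariate(x, y, max_iter=200, tol=1e-8):
    """EM estimates of the mean and variance of y and its covariance with x."""
    n = len(x)
    mu_x = statistics.fmean(x)
    s_xx = statistics.pvariance(x, mu_x)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    # Starting values from the complete cases.
    mu_y = statistics.fmean([yi for _, yi in obs])
    s_yy = statistics.pvariance([yi for _, yi in obs])
    s_xy = sum((xi - mu_x) * (yi - mu_y) for xi, yi in obs) / len(obs)
    for _ in range(max_iter):
        # Regression of y on x implied by the current estimates.
        beta = s_xy / s_xx
        alpha = mu_y - beta * mu_x
        resid_var = s_yy - beta * s_xy
        # E step: expected y, y^2 and x*y for each case.
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = alpha + beta * xi
                eyy = ey * ey + resid_var
            else:
                ey, eyy = yi, yi * yi
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M step: recompute the estimates from the expected sums.
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = abs(new_mu_y - mu_y) + abs(new_s_yy - s_yy) + abs(new_s_xy - s_xy)
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return mu_y, s_yy, s_xy


em_mu_y, em_s_yy, em_s_xy = em_bivariate(x, y)
cc_mean = statistics.fmean([yi for yi in y if yi is not None])
print(f"complete-case mean of y: {cc_mean:.2f}   EM mean of y: {em_mu_y:.2f}")
```

In this simulation the complete-case mean of y is pulled downwards because high-x cases are under-represented, while the EM estimate recovers a value close to the true mean by borrowing information from x.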
NEW METHODS
Full Information Maximum Likelihood
Available in all SEM software
FIML does not fill in the missing data values
ML uses the observed data to search for the parameters that yield the highest log likelihood (i.e., best fit to the observed data)
Considered the state of the art in handling missing data
However, your model has to be good in the first place: M.I. have not yet been developed for estimators other than maximum likelihood
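The idea that each case contributes only what was observed can be written down as a log-likelihood function. The sketch below assumes a bivariate normal model with x always observed and y sometimes missing; the function names, the toy data and the parameter values are hypothetical, and a real FIML analysis would hand such a function to an optimiser inside SEM software rather than evaluate it by hand.

```python
import math


def normal_logpdf(value, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)


def observed_loglik(data, mu_x, mu_y, s_xx, s_yy, s_xy):
    """Observed-data log likelihood for (x, y) pairs where y may be None."""
    beta = s_xy / s_xx
    cond_var = s_yy - beta * s_xy
    total = 0.0
    for xi, yi in data:
        # Every case contributes the marginal density of x.
        total += normal_logpdf(xi, mu_x, s_xx)
        if yi is not None:
            # Complete cases also contribute the density of y given x.
            total += normal_logpdf(yi, mu_y + beta * (xi - mu_x), cond_var)
    return total


# Toy data: two cases have no y value and are still used through x.
data = [(0.0, 1.0), (1.0, 2.1), (2.0, None), (3.0, 3.9), (4.0, None)]

good = observed_loglik(data, mu_x=2.0, mu_y=3.0, s_xx=2.0, s_yy=2.2, s_xy=2.0)
bad = observed_loglik(data, mu_x=2.0, mu_y=0.0, s_xx=2.0, s_yy=2.2, s_xy=2.0)
print(f"log likelihood: plausible parameters {good:.2f}, implausible parameters {bad:.2f}")
```

No value is imputed at any point. The incomplete cases still inform the estimates of the mean and variance of x, and through the covariance they indirectly inform the estimates for y.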
NEW METHODS
Multiple Imputation
Available in NORM, SAS, SPSS, LISREL 8.5+, Mplus
MI creates multiple copies of the data (e.g., 20 or more), each of which has a different set of plausible replacement values.
Each imputed complete dataset is analysed by standard methods
Results are combined to produce estimates and confidence intervals that incorporate missing-data uncertainty
Considered the ‘state of the art’ of missing data methods
However, it does not support analysis of variance so far!
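The combining step is usually done with Rubin's rules: average the estimates, average their variances, and add the between-imputation variance inflated by 1 + 1/m. The following is a minimal sketch of that pooling step only; the function name `pool_rubin` and the five example estimates are made up, and creating the imputed data sets is left to the imputation software.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical b-weights and squared standard errors from m = 5 imputed data sets.
b_weights = [1.0, 1.2, 0.8, 1.1, 0.9]
squared_se = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled_b, pooled_se = pool_rubin(b_weights, squared_se)
print(f"pooled b = {pooled_b:.2f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is larger than the 0.2 reported within any single imputed data set, because the spread of the estimates across imputations is counted as extra uncertainty due to the missing data.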
WHAT SHOULD I DO?
First steps
Make every effort to avoid missing data (?)
If you have missing data, make every effort to understand why your data is missing: collect more auxiliary variables (?)
WHAT SHOULD I DO NEXT?
What is the best from the purely statistical point of view? versus What is the best for your specific study, sample and analyses you want to perform?
Software used & available (?)
Planned statistical analysis (?)
Sample size (?)
REMEMBER
The only thing we know for sure about a missing data point is that it is not there, and there is no magic statistic that can change that.
So the only thing we can do is to estimate the extent to which missing data have influenced the inferences we wish to draw
RUN SENSITIVITY ANALYSIS
Always run a sensitivity analysis: how do your substantive results depend on how you handled missing data?
Do a complete case analysis
Test several better missing data methods
Compare results: if the results don’t change, you are free to select the option that best suits you and the research questions
COMMERCIAL STATISTICAL PACKAGES
Methods covered: EM, Multiple Imputation, FIML
Packages listed: SAS (MI), SPSS (EM), AMOS, HLM, LISREL, EQS, Mplus
FREE STATISTICAL PACKAGES
Methods covered: EM, Multiple Imputation, FIML
Packages listed: Amelia, IVEware, Stata (ice), Mx
FOR FURTHER QUESTIONS CONTACT
[email protected]