strategies for handling missing data in randomised ...

DEALING WITH MISSING DATA IN RANDOMISED CONTROL TRIALS

Ania Filus, PhD Parenting & Family Support Centre University of Queensland

WHY DO MISSING DATA MATTER?
 Most estimation procedures were not designed to handle missing data
 Loss of power due to smaller sample size
 Loss of external validity – the sample does not reflect what you really want to measure
 Loss of internal validity – biased estimates of statistics due to incorrect variances or covariances

WHAT TO DO?
Best situation - avoid missing data issues through careful study design, recruitment and retention. The best thing about missing data is not to have any!

Realistic situation - despite all your attempts you will probably have missing data anyway.

SO WHAT IF YOU END UP WITH MISSING DATA?
You want to deal with missingness in a way that will give you:
 Unbiased parameter estimates
   e.g., b-weights
 Good estimates of variability
   e.g., standard errors
 The best statistical power

THE QUESTION YOU SHOULD ALWAYS HAVE IN THE BACK OF YOUR MIND:

What is best from a purely statistical point of view?
versus
What is best for your specific study, sample and the analyses you want to perform?

HOW TO CHOOSE THE OPTIMAL STRATEGY?
 From the statistical point of view, the way you treat your missing data will depend mostly on the amount of missingness and its mechanism

AMOUNT OF DATA MISSING
 If 5% or less is missing, any substitution method should be adequate
   However, if these missing cases are very different from the cases with complete data, substantial bias can still result from even a small amount of missing data
 There are no rules of thumb regarding when there is too much missingness
   However, cases with lots of missing values may not be adequately sampled in the first place
   However, variables with lots of missing values may not be adequately measured in the first place
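A quick way to see how much is missing, both per variable and per case, is sketched below. The data and the function names are hypothetical and only illustrate the bookkeeping; `None` stands for a missing value.

```python
# Hypothetical toy data: None marks a missing value.
rows = [
    {"gender": "F", "truffles": 3},
    {"gender": "M", "truffles": None},
    {"gender": "F", "truffles": 5},
    {"gender": None, "truffles": None},
    {"gender": "M", "truffles": 2},
]

def missing_by_variable(rows):
    """Proportion of missing values for each variable (column)."""
    names = rows[0].keys()
    return {v: sum(r[v] is None for r in rows) / len(rows) for v in names}

def missing_by_case(rows):
    """Proportion of missing values for each case (row)."""
    return [sum(val is None for val in r.values()) / len(r) for r in rows]

by_variable = missing_by_variable(rows)
by_case = missing_by_case(rows)
print(by_variable)
print(by_case)
```

Variables or cases with a very high proportion are the ones worth a second look before any imputation is attempted.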

MECHANISM OF MISSINGNESS
 Three mechanisms
   Missing Completely At Random (MCAR) – ignorable
   Missing At Random (MAR) – ignorable
   Missing Not At Random (MNAR) – not ignorable
 Mechanisms describe how the probability of a missing value on a variable Y relates to the data, if at all

EXAMPLE
 Suppose that we measure gender and the number of chocolate truffles eaten weekly...

MISSING COMPLETELY AT RANDOM
 The probability that data on variable Y are missing (M) is unrelated to the observed data and unrelated to the values of Y itself
 Example: the dog ate the response sheets
[Diagram labels: observed data, M, missing data]

MISSING AT RANDOM
 The probability that data on variable Y are missing (M) is unrelated to the values of Y itself once the observed data are taken into account, but is related to the observed data
 Example: the probability that the chocolate data are missing varies according to gender but not according to the number of truffles eaten
[Diagram labels: observed data, M, missing data]

MISSING NOT AT RANDOM
 The probability that data on variable Y are missing (M) is related to the values of Y itself
 Example: the probability that the chocolate data are missing varies according to the number of truffles eaten within each gender
[Diagram labels: observed data, M, missing data]
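The three mechanisms can be made concrete with a small simulation of the truffle example. This is only a sketch: the population means, the missingness probabilities and all names are assumptions chosen for illustration, not values from any study.

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical population: gender and weekly truffle count (means are assumptions).
people = []
for _ in range(20000):
    gender = random.choice(["F", "M"])
    truffles = random.gauss(6.0 if gender == "F" else 3.0, 2.0)
    people.append((gender, truffles))

def p_missing(mechanism, gender, truffles):
    """Probability that the truffle count is missing for one person."""
    if mechanism == "MCAR":
        return 0.3                              # same for everyone
    if mechanism == "MAR":
        return 0.5 if gender == "F" else 0.1    # depends on observed gender only
    if mechanism == "MNAR":
        return 0.6 if truffles > 5 else 0.1     # depends on the value itself
    raise ValueError(mechanism)

def observed_means(mechanism):
    """Mean truffle count among respondents: overall and within each gender."""
    kept = [(g, t) for g, t in people
            if random.random() >= p_missing(mechanism, g, t)]
    result = {"all": mean(t for _, t in kept)}
    for gender in ("F", "M"):
        result[gender] = mean(t for g, t in kept if g == gender)
    return result

full = {"all": mean(t for _, t in people),
        "F": mean(t for g, t in people if g == "F"),
        "M": mean(t for g, t in people if g == "M")}
results = {m: observed_means(m) for m in ("MCAR", "MAR", "MNAR")}

print("full data", full)
for mechanism, means in results.items():
    print(mechanism, means)
```

Under these assumptions the MCAR respondents reproduce the full-data means. Under MAR the overall mean is off but the within-gender means are not. Under MNAR even the within-gender means are distorted.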

WHY IS THE MECHANISM IMPORTANT?
Mechanisms influence the optimal strategy for working with missing values
 MCAR
   The best scenario
   Even simple approaches can yield unbiased results
 MAR
   The less ideal scenario
   More advanced methods are required, but they can yield unbiased results
 MNAR
   The worst scenario
   No approach can fix the biased results

BUT DON’T WORRY TOO MUCH
 We can test the MCAR assumption by examining the data
 But we cannot test the MAR and MNAR assumptions – because we don’t know what the missing values are
 A common approach – unless a missing value is clearly MNAR, we can assume MAR and apply methods that require this assumption
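One simple way to examine the data for departures from MCAR is to check whether missingness on one variable depends on another, observed variable. The sketch below uses a Pearson chi-square statistic on a 2x2 table of gender by missingness; the counts are hypothetical. A large statistic is evidence against MCAR, but a small one does not prove MCAR, and the check says nothing about MAR versus MNAR.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: rows = gender, columns = (truffle data missing, observed).
women_missing, women_observed = 50, 50
men_missing, men_observed = 10, 90

stat = chi_square_2x2(women_missing, women_observed, men_missing, men_observed)

CRITICAL_5_PERCENT_DF1 = 3.841  # chi-square critical value, 1 df, alpha = 0.05
mcar_rejected = stat > CRITICAL_5_PERCENT_DF1
print(round(stat, 2), mcar_rejected)
```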

SOME TRADITIONAL METHODS
 Traditional methods can lead to substantial bias, even when MCAR holds:
   Deletion methods
   Mean imputation
   Regression imputation
   Last Observation Carried Forward
 Although easy to apply, they can do more harm than good!

SOME TRADITIONAL METHODS
Deletion methods
 Listwise deletion eliminates all cases with missing values, resulting in a complete data set
 Pairwise deletion eliminates cases on an analysis-by-analysis basis
 Reduced N, and therefore reduced power
 Can lead to substantial bias if the MCAR assumption is not satisfied – if the remaining sample is not representative of the entire population
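The difference between the two deletion methods can be sketched as follows. The variables and values are hypothetical; `None` marks a missing value.

```python
# Hypothetical toy data set with three variables.
rows = [
    {"age": 30, "stress": 4, "truffles": 3},
    {"age": 41, "stress": None, "truffles": 5},
    {"age": None, "stress": 2, "truffles": 1},
    {"age": 25, "stress": 5, "truffles": None},
    {"age": 52, "stress": 3, "truffles": 4},
]

def listwise(rows):
    """Keep only cases that are complete on every variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, var_a, var_b):
    """Keep cases that are complete on the two variables of one analysis."""
    return [(r[var_a], r[var_b]) for r in rows
            if r[var_a] is not None and r[var_b] is not None]

print(len(listwise(rows)))                     # cases left for every analysis
print(len(pairwise(rows, "age", "truffles")))  # cases left for this pair only
```

Pairwise deletion keeps more cases per analysis, but each analysis is then based on a different subsample.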

SOME TRADITIONAL METHODS
Arithmetic Mean Imputation
 Filling in the missing values with the arithmetic mean of the available cases
 Although convenient, the estimates are severely biased:
   Artificially reduced variance and standard deviation
   Artificially reduced strength of associations between the variables
 Simulation studies suggest that mean imputation is possibly the worst missing data handling method available.
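The shrinkage of variance and correlation under mean imputation can be seen even on a tiny example. The data are hypothetical, and `pearson_r` is a small helper written for this sketch.

```python
from math import sqrt
from statistics import mean, variance

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy data: x is complete, None marks a missing y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, None, 8.0, None, 6.0]

observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
obs_x = [xi for xi, _ in observed]
obs_y = [yi for _, yi in observed]

# Mean imputation: every missing y gets the mean of the observed y values.
fill = mean(obs_y)
y_imputed = [fill if yi is None else yi for yi in y]

var_observed = variance(obs_y)
var_imputed = variance(y_imputed)
r_observed = pearson_r(obs_x, obs_y)
r_imputed = pearson_r(x, y_imputed)
print(var_observed, var_imputed)
print(r_observed, r_imputed)
```

In this example both the variance of y and its correlation with x are smaller after imputation than among the observed cases.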

SOME TRADITIONAL METHODS
Regression Imputation (conditional mean imputation)
 A regression equation predicts the missing values from the complete cases
 Imputed values fall directly on the regression line
   Reduced variability of the data
   Increased associations between variables
 Although considered better than mean imputation, it produces biased estimates even when the MCAR assumption is met - the predicted values are ‘too good’
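A minimal sketch of regression imputation on hypothetical data shows why the imputed values are "too good": they sit exactly on the fitted line, so the correlation goes up. The helper names are made up for this example.

```python
from math import sqrt
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for y on x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy data: x is complete, None marks a missing y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, None, 8.0, None, 6.0]

obs_x = [xi for xi, yi in zip(x, y) if yi is not None]
obs_y = [yi for yi in y if yi is not None]

# Fit on the complete cases, then replace each missing y by its prediction.
intercept, slope = fit_line(obs_x, obs_y)
y_imputed = [intercept + slope * xi if yi is None else yi
             for xi, yi in zip(x, y)]

r_observed = pearson_r(obs_x, obs_y)
r_imputed = pearson_r(x, y_imputed)
print(r_observed, r_imputed)
```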

SOME TRADITIONAL METHODS
Last Observation Carried Forward
 Imputes missing repeated measures with the observation that immediately precedes dropout.
 Conventional wisdom is that LOCF yields a conservative estimate of treatment group differences at the end of the study
 However, simulation studies suggest that LOCF can magnify the differences between the groups at the end of the study
 It is likely to produce biased parameter estimates even if the MCAR assumption is met
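Mechanically, LOCF is nothing more than the following. The function name and the scores are hypothetical, and `None` marks a missed assessment.

```python
def locf(series):
    """Replace each None with the most recent observed value before it."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical symptom scores at four assessments; dropout after the second.
print(locf([10, 8, None, None]))
```

The participant is treated as if nothing changed after dropout, which is exactly the assumption that can bias end-of-study comparisons.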

WHAT THEN IF NOT TRADITIONAL METHODS?
 Use modern methods
   EM
   Multiple Imputation
   Full Information Maximum Likelihood
 These methods model the missing data mechanism and then proceed to make a proper likelihood-based analysis, either via the method of maximum likelihood or using Bayesian methods.

NEW METHODS
EM algorithm
 Available in SPSS, NORM, Stata
 Based on maximum likelihood estimation
 E step: we estimate means, variances and covariances from the available cases. We use these estimates to create regression equations to predict the missing data
 M step: we use the complete data (with imputed missing values) to recalculate the means, variances and covariances. These are fed back into the E step.
 Iteration continues until the model converges (stabilizes)
 Although appealing, as a single imputation method it leads to underestimated standard errors and biased parameter estimates
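The E and M steps can be sketched for the simplest case: two variables, X fully observed and Y partly missing, assuming bivariate normality and MAR. This is an illustrative sketch, not the routine used by any of the packages named above. Unlike the simplified description on the slide, the E step here also adds the residual variance to the expected squares of the missing values, which is what the standard EM algorithm for normal data does.

```python
def em_bivariate(xs, ys, iterations=200):
    """EM estimates of mean(Y), var(Y) and cov(X, Y) when X is complete and
    some Y values are None. Assumes bivariate normality and MAR."""
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n   # ML variance (divide by n)

    # Starting values from the complete cases; covariance starts at zero.
    obs_y = [y for y in ys if y is not None]
    mu_y = sum(obs_y) / len(obs_y)
    s_yy = sum((y - mu_y) ** 2 for y in obs_y) / len(obs_y)
    s_xy = 0.0

    for _ in range(iterations):
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy
        # E step: expected y, y^2 and x*y for every case.
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                ey = mu_y + slope * (x - mu_x)    # regression prediction
                eyy = ey ** 2 + resid_var         # plus residual variance
            else:
                ey, eyy = y, y ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey
        # M step: recompute the parameters from the expected sums.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return mu_y, s_yy, s_xy

# Hypothetical toy data: None marks a missing y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, None, 8.0, None, 6.0]
mu_y, s_yy, s_xy = em_bivariate(xs, ys)
print(mu_y, s_yy, s_xy)
```

A fixed number of iterations is used for brevity; a real implementation would stop once the estimates change by less than a tolerance.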

NEW METHODS
Full Information Maximum Likelihood
 Available in all SEM software
 FIML does not fill in the missing data values
 ML uses the observed data to search for the parameters that yield the highest log likelihood (i.e., best fit to the observed data)
 Considered the state of the art in handling missing data
 However, your model has to be good in the first place – M.I. have not yet been developed for estimators other than maximum likelihood
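The idea that FIML uses whatever each case has observed, without filling anything in, can be sketched for a bivariate normal model. A case with both values contributes the joint density, and a case with Y missing contributes only the marginal density of X. The function names and the parameter tuple are assumptions for this illustration; SEM packages apply the same principle to the full model-implied mean vector and covariance matrix.

```python
from math import log, pi

def case_loglik(x, y, mu_x, mu_y, s_xx, s_yy, s_xy):
    """Log-likelihood contribution of one case under a bivariate normal model.
    If y is None, only the marginal density of x is used."""
    if y is None:
        return -0.5 * (log(2 * pi * s_xx) + (x - mu_x) ** 2 / s_xx)
    det = s_xx * s_yy - s_xy ** 2
    dx, dy = x - mu_x, y - mu_y
    quad = (s_yy * dx ** 2 - 2 * s_xy * dx * dy + s_xx * dy ** 2) / det
    return -0.5 * (2 * log(2 * pi) + log(det) + quad)

def observed_loglik(xs, ys, params):
    """Sum of casewise contributions; params = (mu_x, mu_y, s_xx, s_yy, s_xy)."""
    return sum(case_loglik(x, y, *params) for x, y in zip(xs, ys))

# Hypothetical toy data and candidate parameter values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, None, 8.0, None, 6.0]
print(observed_loglik(xs, ys, (3.5, 5.0, 3.0, 5.0, 2.5)))
```

An optimiser would then search for the parameter values that maximise `observed_loglik`.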

NEW METHODS
Multiple Imputation
 Available in NORM, SAS, SPSS, LISREL 8.5+, Mplus
 MI creates multiple copies of the data (e.g., 20 or more), each of which has a different set of plausible replacement values.
 Each imputed complete dataset is analysed by standard methods
 Results are combined to produce estimates and confidence intervals that incorporate missing-data uncertainty
 Considered the ‘state of the art’ among missing data methods
 However, it does not support analysis of variance so far!
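The combining step is usually done with Rubin's rules: the pooled estimate is the average across imputed datasets, and its variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). A minimal sketch follows; the function name and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.
    estimates: the m point estimates; variances: their squared standard errors."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)           # average sampling variance
    between = variance(estimates)      # variance of estimates across imputations
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)

# Hypothetical results from m = 3 imputed datasets.
pooled, pooled_se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, pooled_se)
```

The between-imputation term is what carries the missing-data uncertainty that single imputation ignores.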

WHAT SHOULD I DO?
First steps
 Make every effort to avoid missing data (?)
 If you have missing data, make every effort to understand why your data are missing - collect more auxiliary variables (?)

WHAT SHOULD I DO NEXT?
What is best from a purely statistical point of view?
versus
What is best for your specific study, sample and the analyses you want to perform?
 Software used & available (?)
 Planned statistical analysis (?)
 Sample size (?)

REMEMBER
 The only thing we know for sure about a missing data point is that it is not there – and there is no magic statistic that can change that.
 So the only thing we can do is to estimate the extent to which missing data have influenced the inferences we wish to draw

RUN SENSITIVITY ANALYSIS
 Always run a sensitivity analysis: how do your substantive results depend on how you handled missing data?
   Do a complete case analysis
   Try several better missing data methods
   Compare results – if the results don’t change, you are free to select the option that best suits you and the research questions

COMMERCIAL STATISTICAL PACKAGES
[Table: availability of EM, Multiple Imputation and FIML in SAS (MI), SPSS (EM), AMOS, HLM, LISREL, EQS and Mplus]

FREE STATISTICAL PACKAGES
[Table: availability of EM, Multiple Imputation and FIML in Amelia, IVEware, Stata (ice) and Mx]

FOR FURTHER QUESTIONS CONTACT [email protected]