Missing Data

HOW TO HANDLE IT?

Anuj Vijay Bhatia FPRM 14 Institute of Rural Management Anand

Research Methodology

NON RESPONSE ERROR

NON RESPONSE ERROR
• Non-response occurs when the respondent has not replied to the mail, did not find time to give the interview, or cannot be contacted. There can be many such reasons for non-response.
• A high rate of non-response is serious. The research may lose:
  • Credibility
  • Acceptability
  • Accuracy and professional soundness
• The methodology used should be described completely.
• It is the researcher's responsibility to establish external validity.
• An appropriate sample size and an acceptable response rate must be achieved.

• Non-response error exists to the extent that subjects included in the sample fail to provide usable responses.
• Research marked by high non-response loses validity and reliability.
• Many research articles:
  • Do not mention non-response as a threat to external validity.
  • Do not attempt to control for non-response error.
  • Do not refer to the literature on handling non-response.
• Non-response limits the ability of the researcher to generalize.

SAMPLING PROCEDURES AND NONRESPONSE
• In survey research, the ability to generalize is critical.
• There is a risk that non-respondents will be systematically different from respondents.
• Response rates are higher (often 100%) when purposive or convenience sampling is used.
• However, when probability sampling is used, response rates are low.
• The ability to generalize is limited when purposive or convenience sampling is used.
• In that case the threat to validity comes not from the response rate but from the non-representative sampling procedure.
• To assess external validity, ask: would your results be the same if a 100% response rate had been achieved?

A SIMPLE LOGIC
• Suppose the population is divided into two strata: the respondents (r) and the non-respondents, whose data are missing (m). Suppose we want to determine $\bar{Y}$, the overall population mean.

$$\bar{Y} = W_r \bar{Y}_r + W_m \bar{Y}_m$$

• $\bar{Y}_r$ and $\bar{Y}_m$ are the means of respondents and non-respondents respectively. $W_r$ and $W_m$ are the weights, i.e. the share of the population in each stratum, so $W_r + W_m = 1$.
• If the survey fails to collect data from non-respondents, it will produce an estimate equal to $\bar{Y}_r$.
• The bias is the difference between $\bar{Y}_r$ and $\bar{Y}$:

$$\bar{Y}_r - \bar{Y} = \bar{Y}_r - (W_r \bar{Y}_r + W_m \bar{Y}_m) = \bar{Y}_r (1 - W_r) - W_m \bar{Y}_m = W_m (\bar{Y}_r - \bar{Y}_m)$$

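A small numeric illustration of the bias expression, bias = W_m × (mean of respondents − mean of non-respondents), may help. The figures below are made up (75% response rate, respondent mean 4.0, non-respondent mean 3.0), and the function name `nonresponse_bias` is hypothetical.

```python
def nonresponse_bias(w_m, mean_r, mean_m):
    """Bias of the respondent-only estimate: W_m * (mean_r - mean_m)."""
    return w_m * (mean_r - mean_m)


# Hypothetical figures, chosen only for illustration.
w_r, w_m = 0.75, 0.25          # population shares; they sum to 1
mean_r, mean_m = 4.0, 3.0      # stratum means

# True population mean combines both strata.
population_mean = w_r * mean_r + w_m * mean_m

# A survey that only hears from respondents reports mean_r instead.
bias = nonresponse_bias(w_m, mean_r, mean_m)

print("population mean:", population_mean)
print("respondent-only estimate:", mean_r)
print("bias:", bias)
```

The bias shrinks when either the non-response share W_m is small or the two groups have similar means, which is why both the response rate and the respondent/non-respondent difference matter.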
CONTROLLING NON-RESPONSE ERROR
• Control begins at the design and implementation stage.
• Appropriate sampling protocols and procedures should be used to maximize participation.
• Ensure that the response rate is high enough to conclude that non-response is not a threat to external validity.
• If required, use additional procedures to establish that non-response is not a threat to external validity.

RECOMMENDATIONS FOR HANDLING NON-RESPONSE
Methods for Handling Non-Response
1. Comparison of Early to Late Respondents
2. Using "Days to Respond" as a Regression Variable
3. Compare Respondents to Non-Respondents
4. Compare Respondents on Characteristics Known a Priori
5. Ignore Non-Response as a Threat to External Validity

METHODS FOR HANDLING NON-RESPONSE
Method 1: Comparison of Early to Late Respondents
• Extrapolation based on statistical inference.
• Operationally define "late respondents".
  • For example, the last wave of respondents can be treated as the late respondents.
• Compare early and late respondents on the key variables of interest.
• If there is no difference, the results can be generalized to the larger population.
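The early-versus-late comparison could be sketched as a two-sample test on one key variable. The example below is an assumption-laden illustration: the data are invented, Welch's unequal-variance t statistic is one reasonable choice of test rather than the one prescribed by the slides, and `welch_t` is a hypothetical helper. Only the standard library is used, so the p-value has to be looked up from a t table or a statistics package.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Squared standard errors of the two group means.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Invented scores on one key variable of interest.
early = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2]   # early waves
late = [4.0, 3.7, 4.3, 4.1, 3.8, 4.0]    # last wave, treated as "late"

t, df = welch_t(early, late)
print(f"t = {t:.3f}, df = {df:.1f}")
```

A small |t| relative to the critical value for the reported degrees of freedom is read as "no detectable difference between early and late respondents" on that variable; the comparison would be repeated for each key variable.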

Method 2: Using "Days to Respond" as a Regression Variable
• "Days to respond" is coded as a continuous variable and used as an independent variable (IV) in the regression equation.
• The primary variables of interest are regressed on "days to respond".
• If the coefficient is not statistically significant, assume that respondents do not differ from non-respondents.
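As a rough sketch of Method 2, the code below regresses one variable of interest on "days to respond" with ordinary least squares and reports the t statistic of the slope. The data are invented and `regress_on_days` is a hypothetical helper; judging significance still requires comparing the t statistic with a critical value for n − 2 degrees of freedom.

```python
import math


def regress_on_days(days, scores):
    """Simple OLS of scores on days; returns (slope, intercept, t of slope)."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares gives the standard error of the slope.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(days, scores))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, slope / se_slope


# Invented data: days taken to respond and a key outcome score.
days = [1, 2, 2, 4, 6, 9, 13, 20]
scores = [4.2, 3.9, 4.1, 4.0, 4.3, 3.8, 4.1, 4.0]

slope, intercept, t_slope = regress_on_days(days, scores)
print(f"slope = {slope:.4f}, t = {t_slope:.3f}, df = {len(days) - 2}")
```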

Method 3: Compare Respondents to Non-Respondents
• Sample the non-respondents and work extra diligently to obtain their responses, then compute the differences between the two groups.
• Responses should be obtained from at least 20% of the non-respondents.
• If fewer than 20% are obtained, Method 1 or 2 should be used on the combined results.

Method 4: Compare Respondents on Characteristics Known a Priori
• Compare respondents to the population on characteristics known in advance.
• Describe similarities and differences.

Method 5: Ignore Non-Response as a Threat to External Validity
• If above methods are you can choose to ignore.

MISSING DATA IN QUANTITATIVE RESEARCH

FOOD FOR THOUGHT
• What is certain in life?
  • Death
  • Taxes
• What is certain in research?
  • Measurement error
  • Missing data
• Missing data can be:
  • Due to preventable errors, mistakes, or lack of foresight by the researcher
  • Due to problems outside the control of the researcher
  • Deliberate, intended, or planned by the researcher to reduce cost or respondent burden
  • Due to differential applicability of some items to subsets of respondents
  • Etc.

MISSING DATA
• Non-response vs. missing data
• Missing data: valid values on one or more variables are not available for analysis.
• The researcher's primary concern is to identify the patterns and relationships underlying the missing data.
• We need to understand the process leading to missing data in order to take the appropriate course of action.
• Common in social research.
• More acute in experiments and surveys.
• The best approach is to avoid it through planning and conscientious data collection.
• It is not uncommon to have some level of missing data.

PRIMARY PROBLEMS Lost data  Reduces Statistical Power  Meaningfully diminishes sample size

Bias Parameter Estimates  Correlations biased downwards  Predictor scores affected  Restrict Variance  Central Tendency Biased

TECHNIQUES TO DEAL WITH MISSING DATA
Simple Techniques
• Listwise Deletion
• Pairwise Deletion
• Mean Substitution
• Regression Imputation
• Hot-Deck Imputation

Maximum Likelihood and Related Methods
• Maximum Likelihood
• Expectation Maximization
• Repeated Measures and Time Series Designs

LISTWISE DELETION
• Eliminates all cases with missing data on any predictor or criterion.
• Sacrifices a large amount of data.
• Decreases statistical power.
• May introduce bias in parameter estimates.
• Default option in many statistical packages.
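A minimal sketch of listwise deletion, assuming each case is a dictionary and a missing value is coded as `None` (both are illustrative conventions, as are the variable names and figures):

```python
def listwise_delete(rows):
    """Keep only cases with no missing value on any variable."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Invented cases; None marks a missing value.
rows = [
    {"age": 34, "income": 52.0, "satisfaction": 4},
    {"age": 29, "income": None, "satisfaction": 5},
    {"age": None, "income": 61.0, "satisfaction": 3},
    {"age": 45, "income": 58.0, "satisfaction": None},
    {"age": 51, "income": 70.0, "satisfaction": 4},
]

complete = listwise_delete(rows)
print(len(rows), "cases before,", len(complete), "after")
```

In this toy data only 3 of 15 values are missing, yet 3 of the 5 cases are dropped, which illustrates how quickly listwise deletion sacrifices data.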

PAIRWISE DELETION
• Drops a case only from those statistics that "need" the missing information.
• Preserves a great deal more information than listwise deletion.
• Interpretation becomes difficult.
• May lead to mathematically inconsistent correlations.
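The sketch below illustrates pairwise deletion for correlations: each correlation uses every case that has both of the two variables observed, so different cells of the correlation matrix rest on different subsamples. The data, the `None` coding for missing values, and the helper names are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlation(rows, var_a, var_b):
    """Correlation using only cases observed on both variables; returns (r, n)."""
    pairs = [
        (r[var_a], r[var_b])
        for r in rows
        if r[var_a] is not None and r[var_b] is not None
    ]
    xs, ys = zip(*pairs)
    return pearson(xs, ys), len(pairs)


# Invented data; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 5.0},
    {"x": 2.0, "y": None, "z": 4.0},
    {"x": 3.0, "y": 3.5, "z": None},
    {"x": 4.0, "y": 5.0, "z": 3.0},
    {"x": 5.0, "y": 4.5, "z": None},
    {"x": 6.0, "y": None, "z": 1.0},
]

r_xy, n_xy = pairwise_correlation(rows, "x", "y")
r_xz, n_xz = pairwise_correlation(rows, "x", "z")
r_yz, n_yz = pairwise_correlation(rows, "y", "z")
print(f"x-y: r={r_xy:.2f} (n={n_xy})")
print(f"x-z: r={r_xz:.2f} (n={n_xz})")
print(f"y-z: r={r_yz:.2f} (n={n_yz})")
```

Here the x-y and x-z correlations each use four cases while y-z uses only two, which is the source of the interpretation difficulties and potential inconsistencies noted above.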

MEAN SUBSTITUTION
• Uses the variable's mean in place of missing data.
• Allows the rest of the individual's data to be used.
• Preserves data.
• Easy to use.
• Attenuates variance and covariance estimates.
• Useful when correlations between variables are low and less than 10% of the data are missing.
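A short sketch of mean substitution and of the variance attenuation it causes. The scores are invented and `None` is assumed to mark a missing value.

```python
import statistics

# Invented scores on one variable; None marks a missing value.
scores = [2.0, 4.0, None, 6.0, None, 8.0]

observed = [s for s in scores if s is not None]
fill_value = statistics.mean(observed)

# Replace every missing score with the mean of the observed scores.
imputed = [fill_value if s is None else s for s in scores]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print("mean used for substitution:", fill_value)
print("variance of observed scores:", round(var_observed, 3))
print("variance after substitution:", round(var_imputed, 3))
```

The mean is unchanged, but the imputed values add cases without adding any spread, so the sample variance falls.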

REGRESSION IMPUTATION
• Estimate missing data based on other variables in the data set.
• Advantages:
  • Preserves data
  • Better than listwise and pairwise deletion
  • Preserves the deviation from the mean
  • Does not attenuate correlations the way mean substitution does
• Variants:
  • Simple regression strategy
    • Only one iteration
    • Estimate the relationships among variables, then estimate the missing data
  • Stepwise/iterative regression
    • Isolate a few key variables and prepare a correlation matrix.
    • Estimate the regression equation and predict missing values.

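The simple (one-iteration) regression strategy for imputation can be sketched as follows: fit a line on the complete cases, then predict the missing values from it. This is a minimal illustration with one predictor; the data, the `None` coding for missing values, and the name `regression_impute` are assumptions.

```python
def regression_impute(x, y):
    """Fill missing y values (None) with predictions from a line fitted on complete cases."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(complete)
    mean_x = sum(xi for xi, _ in complete) / n
    mean_y = sum(yi for _, yi in complete) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in complete)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in complete)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Observed values are kept; only missing ones are predicted.
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]


# Invented data: x is fully observed, y has two missing values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, None, 11.8]

print(regression_impute(x, y))
```

Unlike mean substitution, each imputed value reflects the case's own position on the predictor, so deviations from the mean are preserved. The imputed points sit exactly on the fitted line, however, so they carry no residual variation.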
HOT-DECK IMPUTATION
• Replace the missing value with an actual score from a similar case in the current data set.
• Hot-deck? What is so hot about it?
• What is cold-deck then?
• Missing values are replaced with a reasonable estimate from a similar individual.
• Accurate: real values are imputed.
• May not distort distributions.
• Helpful when data are missing in patterns.
• There is little literature backing the accuracy claim.
• Problematic when there are a large number of classification variables.
• Categorizing variables sacrifices information.
• Estimating standard errors is difficult.
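One common way to define a "similar case" is to group cases into cells by a few classification variables and draw a donor at random from the same cell. The sketch below assumes that approach; the variables, data, `None` coding, and the name `hot_deck_impute` are illustrative, and other donor-selection rules exist.

```python
import random


def hot_deck_impute(rows, target, class_vars, seed=0):
    """Fill missing target values with an observed value from a donor in the same cell."""
    rng = random.Random(seed)
    # Build donor pools: observed target values grouped by classification cell.
    donors = {}
    for row in rows:
        if row[target] is not None:
            key = tuple(row[v] for v in class_vars)
            donors.setdefault(key, []).append(row[target])
    filled = []
    for row in rows:
        new_row = dict(row)          # leave the original data untouched
        if new_row[target] is None:
            pool = donors.get(tuple(row[v] for v in class_vars))
            if pool:                 # stays missing when the cell has no donor
                new_row[target] = rng.choice(pool)
        filled.append(new_row)
    return filled


# Invented cases; None marks a missing income.
rows = [
    {"region": "north", "sector": "farm", "income": 40.0},
    {"region": "north", "sector": "farm", "income": None},
    {"region": "north", "sector": "farm", "income": 44.0},
    {"region": "south", "sector": "dairy", "income": 55.0},
    {"region": "south", "sector": "dairy", "income": None},
    {"region": "south", "sector": "craft", "income": None},
]

for row in hot_deck_impute(rows, "income", ["region", "sector"]):
    print(row)
```

The last case has no donor in its cell and stays missing, which shows why many classification variables become a problem: the more cells there are, the more of them are empty.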

MAXIMUM LIKELIHOOD
• Assumption: the observed data are a sample drawn from a multivariate normal distribution.
• Parameters are estimated from the available data, and the missing scores are then estimated from the parameters just estimated.
• The missing values are predicted using the conditional distribution given the variables on which data are available.
• ML provides explicit modeling of the imputation process that is open to scientific analysis and critique.
• More accurate than listwise deletion and better than ad hoc approaches like mean substitution.
• However, the differences may be small, and the distributional assumptions of this method are relatively strict.

EXPECTATION MAXIMIZATION
• Uses the Expectation Maximization (EM) algorithm.
• Iterates through the process of estimating missing data.
• The first iteration estimates the missing data and then estimates the parameters using the ML method.
• The second iteration re-estimates the missing data based on the new parameter estimates and then recalculates the parameter estimates.
• This process continues until the parameter estimates converge.
• Produces less biased, more accurate estimates.
• Open to scientific analysis and critique.
• Lengthy and complex.
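The iteration described above can be sketched for the simplest case: two variables assumed bivariate normal, with x fully observed and y partly missing. This is a teaching sketch under those assumptions, not a general-purpose implementation; the data and the name `em_bivariate` are invented, and `None` marks a missing y.

```python
def em_bivariate(x, y, tol=1e-10, max_iter=1000):
    """EM estimates of mean/variance of y and cov(x, y) when some y are missing."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n   # fixed: x is complete
    observed = [yi for yi in y if yi is not None]
    mu_y = sum(observed) / len(observed)           # starting values
    s_yy = sum((yi - mu_y) ** 2 for yi in observed) / len(observed)
    s_xy = 0.0
    for iteration in range(1, max_iter + 1):
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy            # Var(y | x) under current estimates
        # E-step: expected y and y^2 for each case given current parameters.
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = mu_y + slope * (xi - mu_x)
                eyy = ey ** 2 + resid_var
            else:
                ey, eyy = yi, yi ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M-step: ML estimates from the expected sufficient statistics.
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = max(abs(new_mu_y - mu_y), abs(new_s_yy - s_yy), abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return {"mu_y": mu_y, "s_yy": s_yy, "s_xy": s_xy, "iterations": iteration}


# Invented data: y is missing for three cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, None, 5.1, None, 6.8, None]

result = em_bivariate(x, y)
print(result)
```

The E-step adds the residual variance to the expected squares, so the variance estimate is not attenuated the way it would be if predicted values were simply plugged in as though they were real data.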

REPEATED MEASURES AND TIME SERIES DESIGN
• The problem of missing data is more severe.
• Listwise deletion: more data are lost because of the repeated measures.
• Additional data are collected on the same measures at different times.
• This gives an opportunity to use strongly correlated variables to impute missing data.
• Linear regression and the subject mean can be used to predict missing values, but the estimates may be biased.
• Interpolation and extrapolation can produce relatively unbiased estimates.
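As an illustration of interpolation in a repeated-measures design, the sketch below fills a subject's interior gaps by linear interpolation between the nearest observed time points. The data and the name `interpolate_series` are assumptions; gaps at the start or end of the series are left missing here because filling them would require extrapolation.

```python
def interpolate_series(times, values):
    """Linearly interpolate interior missing values (None) in one subject's series."""
    filled = list(values)
    observed = [i for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is not None:
            continue
        before = [j for j in observed if j < i]
        after = [j for j in observed if j > i]
        if before and after:
            lo, hi = before[-1], after[0]
            # Position of the missing time point between its two observed neighbours.
            weight = (times[i] - times[lo]) / (times[hi] - times[lo])
            filled[i] = values[lo] + weight * (values[hi] - values[lo])
    return filled


# Invented measurements for one subject at five occasions.
times = [0, 2, 4, 6, 8]
values = [10.0, None, 20.0, 26.0, None]

print(interpolate_series(times, values))
```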

LEVELS OF MISSINGNESS
• The data can be missing at three levels:
1. Item-level missingness
2. Construct-level missingness
3. Person-level missingness

(Adopted from: Newman, D. A., (2014). Missing Data: Five Practical Guidelines, Sage Publications.)

MECHANISMS OF MISSING DATA
Data can be missing randomly or systematically.

Random Missingness:
• Missing Completely at Random (MCAR)

Systematic Missingness:
• Missing at Random (MAR)
• Missing Not at Random (MNAR)

• MCAR (Missing Completely at Random)
  • The probability that a variable value is missing depends neither on the observed data values nor on the missing data values.
  • P(missing | complete data) = P(missing)

• MAR (Missing at Random)
  • The probability that a variable value is missing partly depends on other data that are observed in the dataset but does not depend on any of the values that are missing.
  • P(missing | complete data) = P(missing | observed data)

• MNAR (Missing Not at Random)
  • The probability that a variable value is missing depends on the missing data values themselves.
  • P(missing | complete data) ≠ P(missing | observed data)

(Adopted from: Newman, D. A., (2014). Missing Data: Five Practical Guidelines, Sage Publications.)
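The three mechanisms can be made concrete with a small simulation. The sketch below is an illustration under invented assumptions: x is always observed, y = x + noise may be missing, and the missingness probabilities (0.5, or 0.8 versus 0.2) are arbitrary. The function name `simulate` is hypothetical.

```python
import random


def simulate(mechanism, n=20000, seed=1):
    """Return (mean of all y, mean of observed y) under one missingness mechanism."""
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)            # always observed
        y = x + rng.gauss(0, 1)        # may be missing
        if mechanism == "MCAR":
            p_missing = 0.5                        # depends on nothing
        elif mechanism == "MAR":
            p_missing = 0.8 if x > 0 else 0.2      # depends only on observed x
        elif mechanism == "MNAR":
            p_missing = 0.8 if y > 0 else 0.2      # depends on the value of y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() >= p_missing:
            observed_y.append(y)
    return sum(all_y) / len(all_y), sum(observed_y) / len(observed_y)


for mechanism in ("MCAR", "MAR", "MNAR"):
    full_mean, observed_mean = simulate(mechanism)
    print(f"{mechanism}: full mean {full_mean:.3f}, observed mean {observed_mean:.3f}")
```

Under MCAR the observed mean tracks the full-data mean. Under MAR the raw observed mean is shifted, but because missingness depends only on the observed x, the shift can be corrected by conditioning on x. Under MNAR it cannot be corrected from the observed data alone.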

BIAS AND INACCURATE STANDARD ERRORS

CHOOSING MISSING DATA TREATMENTS

(Adopted from: Newman, D. A., (2014). Missing Data: Five Practical Guidelines, Sage Publications.)

A FOUR STEP PROCESS FOR IDENTIFYING MISSING DATA AND APPLYING REMEDIES

STEP 1: DETERMINE THE TYPE OF MISSING DATA
• Is it under the control of the researcher?
• Is it ignorable?
• Ignorable missing data
  • Expected
  • Remedies not needed
  • Allowances for missing data are inherent in the technique
  • Missing data are operating at random
• Non-ignorable missing data
  • Known to the researcher: some remedies are available if the process is random
  • Unknown missing data process: harder to identify, but remedies are available
  • Whether the process is known or unknown: proceed to the next step

STEP 2: DETERMINE THE EXTENT OF MISSING DATA
• Determine the extent of missing data:
  • Patterns for individual variables, individual cases, and overall.
  • Is it low enough not to affect the results?
  • Is it random?
• If sufficiently low: apply any remedy.
• If not low: determine the randomness before applying a remedy.
• Assessing the extent and pattern of missing data:
  • Tabulate:
    • Number of cases with missing data
    • Percentage of variables with missing data in each case
  • Look for non-random patterns.
  • Also determine the number of cases with no missing data (100% complete).
  • Is the amount of missing data high enough to create bias? (Rule of Thumb 1)
  • Can deletion be used? (Rule of Thumb 2)
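The tabulation described above can be sketched in a few lines. The data, the `None` coding for missing values, and the name `missing_summary` are assumptions made for illustration.

```python
def missing_summary(rows):
    """Percent missing per variable, percent missing per case, and count of complete cases."""
    variables = list(rows[0].keys())
    n_cases = len(rows)
    by_variable = {
        v: 100.0 * sum(r[v] is None for r in rows) / n_cases for v in variables
    }
    by_case = [
        100.0 * sum(r[v] is None for v in variables) / len(variables) for r in rows
    ]
    complete_cases = sum(pct == 0 for pct in by_case)
    return by_variable, by_case, complete_cases


# Invented cases; None marks a missing value.
rows = [
    {"age": 34, "income": 52.0, "rating": 4, "tenure": 3},
    {"age": 29, "income": None, "rating": 5, "tenure": 1},
    {"age": None, "income": None, "rating": 3, "tenure": 7},
    {"age": 45, "income": 58.0, "rating": 4, "tenure": 2},
]

by_variable, by_case, complete_cases = missing_summary(rows)
print("percent missing by variable:", by_variable)
print("percent missing by case:", by_case)
print("complete cases:", complete_cases)
```

These three summaries feed directly into the rules of thumb: the per-variable and per-case percentages indicate candidates for deletion, and the number of complete cases shows whether enough data remain if no imputation is done.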

RULE OF THUMB 1: HOW MUCH MISSING DATA IS TOO MUCH?
• Missing data under 10% can generally be ignored when it occurs in a random fashion.
• The number of cases with no missing data should be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) for the missing data.

RULE OF THUMB 2: DELETION BASED ON MISSING DATA
• Variables with as little as 15% missing data are candidates for deletion.
• Higher levels of missingness, such as 20-30%, can often be remedied.
• Deletion of a large amount of data should be justifiable.
• Cases with missing data on dependent variables are typically deleted to avoid an artificial increase in the relationship with the independent variables.
• When deleting a variable, ensure that a highly correlated variable is available to represent the intent of the original variable.
• Always perform the analysis both with and without the deleted cases or variables to identify any marked differences.

STEP 3: DIAGNOSE THE RANDOMNESS OF THE MISSING DATA PROCESSES
• The degree of randomness determines the appropriate level of remedy.
• Level of randomness
  • Random: MCAR
    • Observed values of Y are truly a random sample of all Y values.
    • There is no underlying process that tends to bias the observed data.
    • Cases with missing data are indistinguishable from cases with complete data.
  • Non-random: MAR
    • Missingness of Y depends on X but not on Y.
    • Observed values of Y represent a random sample of Y for each value of X.
    • Cannot be generalized.
• Diagnostic tests for level of randomness
  • Form two groups, cases with and without missing data, and compare them with a t-test.
  • Overall test of randomness for MCAR.

STEP 4: SELECT THE IMPUTATION METHOD

RULE OF THUMB 3: IMPUTATION OF MISSING DATA
Under 10%
• Any imputation method can be applied.

10% - 20%
• For MCAR: hot-deck case substitution and regression imputation
• For MAR: model-based methods

Over 20%
• For MCAR: regression method
• For MAR: model-based method
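Rule of Thumb 3 can be written down as a small lookup. The sketch below is only a restatement of the rule in code; the function name `suggest_imputation` is hypothetical, and the treatment of the exact boundaries (10% falls in the middle band, 20% also in the middle band) is an assumption, since the rule does not say where they belong.

```python
def suggest_imputation(pct_missing, mechanism):
    """Map percent missing and mechanism ('MCAR' or 'MAR') to the rule-of-thumb suggestion."""
    if mechanism not in ("MCAR", "MAR"):
        raise ValueError("mechanism must be 'MCAR' or 'MAR'")
    if not 0 <= pct_missing <= 100:
        raise ValueError("pct_missing must be between 0 and 100")
    if pct_missing < 10:
        return "any imputation method"
    if pct_missing <= 20:          # boundary handling is an assumption
        if mechanism == "MCAR":
            return "hot-deck case substitution or regression imputation"
        return "model-based methods"
    if mechanism == "MCAR":
        return "regression method"
    return "model-based methods"


for pct, mech in [(5, "MAR"), (15, "MCAR"), (15, "MAR"), (30, "MCAR"), (30, "MAR")]:
    print(f"{pct}% missing, {mech}: {suggest_imputation(pct, mech)}")
```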

REFERENCES
1. Dooley, L. M., & Lindner, J. R. (2003). The handling of nonresponse error. Human Resource Development Quarterly, 14(1), 99-110.
2. Roth, P. L. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47(3), 537-560.
3. Blair, E., & Zinkhan, G. M. (2006). Nonresponse and generalizability in academic research. Journal of the Academy of Marketing Science, 34(1), 4-7.
4. Newman, D. A. (2014). Missing data: Five practical guidelines. Organizational Research Methods, 17(4), 372-411.
5. Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis (6th ed.). New Jersey: Pearson Education.