Most real-world datasets suffer from nonresponse; that is, they contain missing values.
Ideally, analysts should decide how to deal with missing data before moving on to analysis.
One needs to make assumptions and ask many questions, for example,
As a Bayesian, one could treat the missing values as parameters and estimate them simultaneously with the analysis, but even in that case, one must still ask the same questions.
Ask as many questions as possible to help you figure out the most plausible assumptions!
Simplest approach: complete/available-case analyses -- delete cases with missing data. Often problematic because:
it is sometimes simply not feasible (small n, large p problem) -- when we have a small number of observations but a large number of variables, we cannot afford to throw away data, even when the proportion of missing data is small.
information loss -- even without the small n, large p problem, we still lose information when we delete cases.
biased results -- because the missing data mechanism is rarely random, features of the observed data can differ substantially from those of the missing data.
More principled approach: impute the missing data (in a statistically proper fashion) and analyze the imputed data.
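To see how quickly complete-case deletion becomes infeasible, here is a small sketch (the dataset, the 5% missingness rate, and the dimensions are made up for illustration): even a modest per-cell missingness rate wipes out a large share of rows when there are many variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 10                        # few observations, many variables
X = rng.normal(size=(n, p))

# Make roughly 5% of all entries missing completely at random
mask = rng.random((n, p)) < 0.05
X[mask] = np.nan

# Complete-case analysis: drop every row with at least one missing entry
complete = X[~np.isnan(X).any(axis=1)]
print(f"rows kept: {complete.shape[0]} of {n}")
```

With 5% of cells missing independently and p = 10 variables, a row survives deletion with probability 0.95^10 ≈ 0.60, so roughly 40% of an already small sample is thrown away.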
Loss of power due to the smaller sample size
Any analysis must make an untestable assumption about the missing data
Some popular analyses with missing data get biased standard errors
Some popular analyses with missing data are inefficient
Approach by design:
minimize amount of missing data
reduce the impact of missing data
A suitable method of analysis would:
However, we can never be sure about what the correct assumption is ⇒ sensitivity analyses are essential!
Start by knowing:
Principled approach to missing data:
Just because a method is simple to use does not make it plausible; some analysis methods are simple to describe but have complex and/or implausible assumptions.
Unit nonresponse: the individual has no values recorded for any of the variables. For example, when participants do not complete a survey questionnaire at all.
Item nonresponse: the individual has values recorded for at least one variable, but not all variables.
|                  | X1 | X2 | Y  |
|------------------|----|----|----|
| Complete cases   | ✓  | ✓  | ✓  |
| Item nonresponse | ✓  | ✓  | ❓ |
|                  | ❓ | ❓ | ✓  |
|                  | ❓ | ✓  | ✓  |
| Unit nonresponse | ❓ | ❓ | ❓ |
Data are said to be missing completely at random (MCAR) if the reason for missingness does not depend on the values of the observed data or missing data.
For example, suppose a survey questionnaire is printed double-sided, with questions 16-20 (which ask about income) on the back of a page.
Then, the values for questions 16-20 for those people who did not respond would be missing completely at random if they simply did not realize the pages were double-sided; they had no reason to ignore those questions.
This is rarely plausible in practice!
Data are said to be missing at random (MAR) if the reason for missingness may depend on the values of the observed data but not the missing data (conditional on the values of the observed data).
Using our previous example, suppose age is also recorded for all participants.
Then, the values for questions 16-20 for those people who did not respond would be missing at random if younger people are more likely than older people to skip those income-related questions, since age is observed for everyone.
This is the most commonly assumed mechanism in practice!
Data are said to be missing not at random (MNAR or NMAR) if the reason for missingness depends on the actual values of the missing (unobserved) data.
Continuing with our previous example, suppose again that questions 16-20 ask about income.
Then, the values for questions 16-20 for those people who did not respond would be missing not at random if people who earn more money are less likely to respond to those income-related questions than people who earn less.
This is usually the case in real analysis, but analysis can be complex!
So how can we tell the type of mechanism we are dealing with?
In general, we don't know!!!
Rare that data are MCAR (unless planned beforehand)
Possible that data are MNAR
Compromise: assume data are MAR if we include enough variables in model for the missing data indicator \boldsymbol{R}.
Let's attempt to answer these questions via simulations.
Set n = 10,000. For i=1,\ldots,n, generate
Next, set y_i missing whenever r_i = 1.
Set different values for \boldsymbol{\theta} = (\theta_0, \theta_1, \theta_2) to reflect MCAR, MAR and MNAR.
Let's use the R script here.
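The linked R script carries out the actual simulation; the following Python sketch illustrates the same idea. The data-generating model below (x standard normal, y equal to x plus noise, and a logistic model for the missingness indicator r) is an assumption for illustration, not necessarily what the script uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Assumed data-generating model (the linked R script defines the real one):
x = rng.normal(size=n)          # fully observed covariate
y = x + rng.normal(size=n)      # outcome, subject to missingness

def simulate_missing(theta0, theta1, theta2):
    """Draw r_i ~ Bernoulli(logit^{-1}(theta0 + theta1*x_i + theta2*y_i))
    and set y_i missing whenever r_i = 1."""
    p_miss = 1 / (1 + np.exp(-(theta0 + theta1 * x + theta2 * y)))
    r = rng.binomial(1, p_miss)
    return np.where(r == 1, np.nan, y)

y_mcar = simulate_missing(-1.0, 0.0, 0.0)  # depends on neither x nor y
y_mar  = simulate_missing(-1.0, 1.0, 0.0)  # depends on observed x only
y_mnar = simulate_missing(-1.0, 0.0, 1.0)  # depends on the missing y itself

# The complete-case mean of y is roughly unbiased under MCAR (true mean 0),
# but clearly biased downward under MNAR, where large y go missing more often.
print(np.nanmean(y_mcar), np.nanmean(y_mnar))
```

Varying \boldsymbol{\theta} this way makes the three mechanisms concrete: setting \theta_1 = \theta_2 = 0 gives MCAR, \theta_2 = 0 with \theta_1 \neq 0 gives MAR, and \theta_2 \neq 0 gives MNAR.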