class: center, middle, inverse, title-slide # IDS 702: Module 5.1 ## Introduction to missing data ### Dr. Olanrewaju Michael Akande --- ## Motivation - Most real world datasets often suffer from nonresponse, that is, they contain missing values. -- - Ideally, analysts should first decide on how to deal with missing data before moving on to analysis. -- - One needs to make assumptions and ask tons of questions, for example, + why are the values missing? + what is the pattern of missingness? + what is the proportion of missing values in the data? -- - As a Bayesian, one could treat the missing values as parameters and estimate them simultaneously with the analysis, but even in that case, one must still ask the same questions. -- - Ask as many questions as possible to help you figure out the most plausible assumptions! --- ## Motivation - Simplest approach: complete/available case analyses -- delete cases with missing data. Often problematic because: + it is just not feasible sometimes (small `\(n\)` large `\(p\)` problem) -- when we have a small number of observations but a large number of variables, we simply can not afford to throw away data, even when the proportion of missing data is small. -- + information loss -- even when we do not have the small `\(n\)`, large `\(p\)` problem, we still lose information when we delete cases. -- + biased results -- because the missing data mechanism is rarely random, features of the observed data can be completely different from the missing data. -- - More principled approach: impute the missing data (in a statistically proper fashion) and analyze the imputed data. --- ## Why should we care? - <font color="red">Loss of power</font> due to the the smaller sample size + can't regain lost power -- - Any analysis must make an <font color="red">untestable assumption</font> about the missing data + wrong assumption `\(\Rightarrow\)` <font color="red">biased estimates</font> -- - Some popular analyses with missing data get <font color="red">biased standard errors</font> + resulting in wrong p-values and confidence intervals -- - Some popular analyses with missing data are <font color="red">inefficient</font> + so that confidence intervals are wider than they need be --- ## What to do: loss of power Approach by design: - minimize amount of missing data + good communications with participants, for example, patients in clinical trial, participants in surveys and censuses, etc + follow up as much as possible; make repeated attempts using different methods -- - reduce the impact of missing data + collect reasons for missing data + collect information predictive of missing values --- ## What to do: analysis - A suitable method of analysis would: + make the correct (or plausible) assumption about the missing data + give an unbiased estimate (under that assumption) + give an unbiased standard error (so that p-values and confidence intervals are correct) + be efficient (make best use of the available data) -- - However, we can never be sure about what the correct assumption is `\(\Rightarrow\)` sensitivity analyses are essential! --- ## How to approach the analysis? - Start by knowing: + extent of missing data + pattern of missing data (e.g. is `\(X_1\)` always missing whenever `\(X_2\)` is also missing?) + predictors of missing data and of outcome -- - Principled approach to missing data: + identify a plausible assumption (through discussions between you as a data scientist and your clients) + choose an analysis method that's valid under that assumption -- - Just because a method is simple to use does not make it plausible; some analysis methods are simple to describe but have complex and/or implausible assumptions. --- ## Types of nonresponse (missing data) - .hlight[Unit nonresponse]: the individual has no values recorded for any of the variables. For example, when participants do not complete a survey questionnaire at all. - .hlight[Item nonresponse]: the individual has values recorded for at least one variable, but not all variables. <table> <caption>Unit nonresponse vs item nonresponse</caption> <tr> <th> </th> <th height="30px" colspan="3">Variables</th> </tr> <tr> <th> </th> <td height="30px" style="text-align:center" width="100px"> X<sub>1</sub> </td> <td style="text-align:center" width="100px"> X<sub>2</sub> </td> <td style="text-align:center" width="100px"> Y </td> </tr> <tr> <td height="30px" style="text-align:left"> Complete cases </td> <td style="text-align:center"> ✓ </td> <td style="text-align:center"> ✓ </td> <td style="text-align:center"> ✓ </td> </tr> <tr> <td rowspan="3"> Item nonresponse </td> <td rowspan="3" style="text-align:center"> ✓ </td> <td height="30px" style="text-align:center"> ✓ </td> <td style="text-align:center"> ❓ </td> </tr> <tr> <td height="30px" style="text-align:center"> ❓ </td> <td style="text-align:center"> ❓ </td> </tr> <tr> <td height="30px" style="text-align:center"> ❓ </td> <td style="text-align:center"> ✓ </td> </tr> <tr> <td height="30px" style="text-align:left"> Unit nonresponse </td> <td style="text-align:center"> ❓ </td> <td style="text-align:center"> ❓ </td> <td style="text-align:center"> ❓ </td> </tr> </table> --- ## Types of Missing Data Mechanism - Data are said to be .hlight[missing completely at random (MCAR)] if the reason for missingness does not depend on the values of the observed data or missing data. -- - For example, suppose - you handed out a double-sided survey questionnaire of 20 questions to a sample of participants; - questions 1-15 were on the first page but questions 16-20 were at the back; and - some of the participants did not respond to questions 16-20. -- - Then, the values for questions 16-20 for those people who did not respond would be .hlight[missing completely at random] if they simply did not realize the pages were double-sided; they had no reason to ignore those questions. -- - This is rarely plausible in practice! --- ## Types of Missing Data Mechanism - Data are said to be .hlight[missing at random (MAR)] if the reason for missingness may depend on the values of the observed data but not the missing data (conditional on the values of the observed data). -- - Using our previous example, suppose - questions 1-15 include demographic information such as age and education; - questions 16-20 include income related questions; and - once again, some of the participants did not respond to questions 16-20. -- - Then, the values for questions 16-20 for those people who did not respond would be .hlight[missing at random] if younger people are more likely not to respond to those income related questions than old people, where age is observed for all participants. -- - This is the most commonly assumed mechanism in practice! --- ## Types of Missing Data Mechanism - Data are said to be .hlight[missing not at random (MNAR or NMAR)] if the reason for missingness depends on the actual values of the missing (unobserved) data. -- - Continuing with our previous example, suppose again that - questions 1-15 include demographic information such as age and education; - questions 16-20 include income related questions; and - once again, some of the participants did not respond to questions 16-20. -- - Then, the values for questions 16-20 for those people who did not respond would be .hlight[missing not at random] if people who earn more money are less likely to respond to those income related questions than old people. -- - This is usually the case in real analysis, but analysis can be complex! --- ## Types of missing data mechanisms: how to tell in practice? So how can we tell the type of mechanism we are dealing with? -- In general, we don't know!!! -- - Rare that data are MCAR (unless planned beforehand) -- - Possible that data are MNAR -- - Compromise: assume data are MAR if we include enough variables in model for the missing data indicator `\(\boldsymbol{R}\)`. --- ## Why should we care? - <div class="question"> Why should we care in practice? What does bias really mean here? How exactly does using only the complete cases affect our results for the three mechanisms? </div> - Let's attempt to answer these questions via simulations. - Set `\(n = 10,000\)`. For `\(i=1,\ldots,n\)`, generate + `\(x_i \overset{iid}{\sim} N(2, 1); \ \ \ y_i|x_i \overset{iid}{\sim} N(-1 + 2 x_{i}, \sigma^2=5^2)\)` + `\(r_i|y_i, x_i \sim \textrm{Bernoulli}(\pi_i); \ \ \ \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \theta_0 + \theta_1 y_i + \theta_2 x_i\)` - Next, set `\(y_i\)` missing whenever `\(r_i = 1\)`. - Set different values for `\(\boldsymbol{\theta} = (\theta_0, \theta_1, \theta_2)\)` to reflect MCAR, MAR and MNAR. - Let's use the R script [here](https://ids-702-f20.github.io/Course-Website/slides/MissingDataSim.R). --- class: center, middle # What's next? ### Move on to the readings for the next module!