Missing Data | Types, Explanation, & Imputation
Missing data, or missing values, occur when you don’t have data stored for certain variables or participants. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons.
In any dataset, there are usually some missing data. In quantitative research, missing values appear as blank cells in your spreadsheet.
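As a minimal sketch of what those blank cells look like once the spreadsheet is loaded into software, the Python snippet below counts missing values with pandas. The choice of pandas, the column names, and the values are all assumptions made for illustration; the article itself doesn’t prescribe a tool.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; np.nan stands in for a blank spreadsheet cell.
df = pd.DataFrame({
    "participant": [1, 2, 3, 4, 5],
    "age": [34, np.nan, 29, 41, np.nan],
    "income": [52000, 48000, np.nan, 61000, 58000],
})

# Number of missing values per variable.
missing_per_column = df.isna().sum()

# Share of participants with at least one missing value.
share_incomplete = df.isna().any(axis=1).mean()

print(missing_per_column)
print(share_incomplete)
```

Counting missing values per variable and per participant is a reasonable first step before deciding why the data might be missing.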
Types of missing data
Missing data are errors because your data don’t represent the true values of what you set out to measure.
The reason for the missing data is important to consider, because it helps you determine the type of missing data and what you need to do about it.
There are three main types of missing data.
|Type|Definition|
|---|---|
|Missing completely at random (MCAR)|Missing data are randomly distributed across the variable and unrelated to other variables.|
|Missing at random (MAR)|Missing data are not randomly distributed but they are accounted for by other observed variables.|
|Missing not at random (MNAR)|Missing data systematically differ from the observed values.|
Missing completely at random
When data are missing completely at random (MCAR), the probability of any particular value being missing from your dataset is unrelated to anything else.
The missing values are randomly distributed, so they can come from anywhere in the whole distribution of your values. These MCAR data are also unrelated to other unobserved variables.
Data are often considered MCAR if they seem unrelated to specific values or other variables. In practice, it’s hard to meet this assumption because “true randomness” is rare.
When data are missing due to equipment malfunctions or lost samples, they are typically treated as MCAR, provided the malfunction or loss is unrelated to the values being measured.
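To make the MCAR idea concrete, here is a small simulation sketch in Python with NumPy. The distribution, sample size, and 20% loss rate are illustrative assumptions, not figures from any study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements for 10,000 participants.
values = rng.normal(loc=50, scale=10, size=10_000)

# MCAR: every value has the same 20% chance of being lost,
# regardless of its size or of anything else.
is_missing = rng.random(values.size) < 0.20
observed = values[~is_missing]

# Under MCAR, the observed mean should stay close to the full mean.
full_mean = values.mean()
observed_mean = observed.mean()
print(round(full_mean, 2), round(observed_mean, 2))
```

Because the loss is unrelated to the values, the observed data behave like a smaller random sample of the full data.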
Missing at random
Data missing at random (MAR) are not actually missing at random; this term is a bit of a misnomer.
This type of missing data systematically differs from the data you’ve collected, but it can be fully accounted for by other observed variables.
The likelihood of a data point being missing is related to another observed variable but not to the specific value of that data point itself.
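The sketch below simulates this situation in Python. It assumes a hypothetical dataset in which income is more often missing for older participants, while age group is always observed; the group means and missingness rates are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical data: older participants earn more on average.
older = rng.random(n) < 0.5
income = rng.normal(loc=np.where(older, 60, 40), scale=5)

# MAR: the chance of a missing income depends only on the observed
# age group (50% for older, 10% for younger), not on income itself.
p_missing = np.where(older, 0.5, 0.1)
missing = rng.random(n) < p_missing

df = pd.DataFrame({"older": older, "income": income})
df["income_obs"] = df["income"].mask(missing)

# Overall, the observed mean is pulled toward the younger group...
overall_gap = df["income_obs"].mean() - df["income"].mean()

# ...but within each age group, observed and true means agree closely.
group_gap = (df.groupby("older")["income_obs"].mean()
             - df.groupby("older")["income"].mean())
print(round(overall_gap, 2))
print(group_gap.round(2))
```

The overall gap shows why MAR data are not random in the everyday sense, while the near-zero gaps within groups show how an observed variable can account for the missingness.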
Missing not at random
Data missing not at random (MNAR) are missing for reasons related to the values themselves.
This type of missing data is important to look for because you may lack data from key subgroups within your sample. Your sample may not end up being representative of your population.
For example, in long-term medical studies, some participants may drop out because they become more and more unwell as the study continues. Their data are MNAR because dropout is related to the very health outcomes you’re measuring, so your final dataset may include mostly healthier individuals, and you miss out on important data.
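A rough simulation of this dropout pattern, written in Python with NumPy, is sketched below. The health scores, the logistic dropout rule, and its constants are assumptions chosen only to illustrate the mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical health scores (higher = healthier).
health = rng.normal(loc=50, scale=10, size=10_000)

# MNAR: the lower the score, the more likely the participant drops out.
# The logistic form and its constants are illustrative assumptions.
p_dropout = 1 / (1 + np.exp((health - 45) / 5))
dropped = rng.random(health.size) < p_dropout

true_mean = health.mean()
observed_mean = health[~dropped].mean()

# The remaining sample looks healthier than the full sample really is.
print(round(true_mean, 2), round(observed_mean, 2))
```

Here the observed mean overstates the true mean, and no other variable in the data can explain the difference away.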
Are missing data problematic?
Missing data are problematic because, depending on the type, they can sometimes bias your results. This means your results may not be generalizable outside of your study because your data come from an unrepresentative sample.
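In practice, you can often consider two types of missing data ignorable because the missingness doesn’t depend on the missing values themselves, either outright or once related observed variables are taken into account:
- MCAR data
- MAR data
For MCAR data, the likelihood of a data point being missing has nothing to do with the value itself or with any other variable. For MAR data, the same holds once you account for the observed variables that explain the missingness. So, after any such adjustment, it’s unlikely that your missing values are significantly different from your observed values.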
On the flip side, you have a biased dataset if the missing data systematically differ from your observed data. Data that are MNAR are called non-ignorable for this reason.
How to prevent missing data
Missing data often come from attrition, non-response, or poorly designed research protocols. When designing your study, it’s good practice to make it easy for your participants to provide data.
Here are some tips to help you minimize missing data:
- Limit the number of follow-ups
- Minimize the amount of data collected
- Make data collection forms user friendly
- Use data validation techniques
- Offer incentives
After you’ve collected data, it’s important to store them carefully, with multiple backups.
How to deal with missing values
To tidy up your data, your options usually include accepting, removing, or recreating the missing data.
You should consider how to deal with each case of missing data based on your assessment of why the data are missing.
- Are these data missing for random or non-random reasons?
- Are the data missing because they represent zero or null values?
- Was the question or measure poorly designed?
Your data can be accepted, or left as is, if they’re MCAR or MAR. However, MNAR data may need more complex treatment.
The most conservative option involves accepting your missing data: you simply leave these cells blank.
It’s best to do this when you believe you’re dealing with MCAR or MAR values. When you have a small sample, you’ll want to conserve as much data as possible because any data removal can affect your statistical power.
You might also recode all missing values with labels of “N/A” (short for “not applicable”) to make them consistent throughout your dataset.
These actions help you retain data from as many research subjects as possible with few or no changes.
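As a sketch of consistent recoding in Python with pandas, the example below maps several placeholder entries to a single missing-value marker. Note the swap: it uses pandas’ built-in marker (NaN) rather than the text label “N/A”, because pandas then recognizes the cells as missing. The dataset and the list of placeholders are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data in which missing answers were entered inconsistently.
raw = pd.DataFrame({
    "participant": [1, 2, 3, 4],
    "score": ["12", "", "n/a", "15"],
    "group": ["A", "missing", "B", "-"],
})

# Treat every placeholder as the same missing-value marker (NaN).
placeholders = ["", "n/a", "missing", "-"]
clean = raw.replace(placeholders, np.nan)

# Numeric columns can then be converted; NaN cells stay missing.
clean["score"] = pd.to_numeric(clean["score"])
print(clean)
```

No participant is removed and no value is invented; only the way missingness is recorded changes.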
You can remove missing data from analyses using listwise or pairwise deletion.
Listwise deletion means deleting data from all cases (participants) who have data missing for any variable in your dataset. You’ll have a dataset that’s complete for all participants included in it.
A downside of this technique is that you may end up with a much smaller and/or a biased sample to work with. If significant amounts of data are missing from some variables or measures in particular, the participants who provide those data might significantly differ from those who don’t.
Your sample could be biased because it doesn’t adequately represent the population.
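Listwise deletion can be sketched in Python with pandas’ `DataFrame.dropna()`, which by default drops every row containing a missing value. The dataset below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in three different cases.
df = pd.DataFrame({
    "participant": [1, 2, 3, 4, 5],
    "age": [34, np.nan, 29, 41, 37],
    "income": [52, 48, np.nan, 61, 58],
    "stress": [3, 4, 2, np.nan, 5],
})

# Listwise deletion: drop every participant with any missing value.
complete_cases = df.dropna()

print(len(df), "->", len(complete_cases))
```

In this toy example, three single missing values cost three of the five participants, which shows how quickly the sample can shrink.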
Pairwise deletion lets you keep more of your data by excluding a case only from the analyses that involve its missing values. It conserves more of your data because all other available data from that case are still included.
It also means that you have an uneven sample size for each of your variables. But it’s helpful when you have a small sample or a large proportion of missing values for some variables.
When you perform analyses with multiple variables, such as a correlation, only cases (participants) with complete data for every variable in that analysis are included.
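The following Python sketch illustrates pairwise deletion with pandas, whose `DataFrame.corr()` excludes missing values pair by pair. The variables and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per variable.
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, 37, 52],
    "income": [52, 48, np.nan, 61, 58, 70],
    "stress": [3, 4, 2, np.nan, 5, 1],
})

# Per-variable sample sizes: each variable keeps all of its observed values.
n_per_variable = df.count()

# corr() excludes missing values pair by pair, so each correlation
# uses the cases that are complete for those two variables.
pairwise_corr = df.corr()

# Number of cases behind each pairwise correlation.
observed = df.notna().astype(int)
n_per_pair = observed.T @ observed

print(n_per_variable)
print(pairwise_corr.round(2))
print(n_per_pair)
```

Each correlation here rests on four cases, whereas listwise deletion would leave only three cases for every analysis.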
Imputation means replacing a missing value with another value based on a reasonable estimate. You use other data to recreate the missing value for a more complete dataset.
You can choose from several imputation methods.
In hot-deck imputation, you replace each missing value with an observed value from a similar case or participant (a so-called “donor”) within your dataset. Similarity between a case and its donor is judged from their data for other variables.
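A simplified hot-deck sketch in Python is shown below. It assumes that “similar” means closest in age and that the single nearest donor is used; practical hot-deck procedures often draw at random from a pool of similar donors instead. The data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing for two participants.
df = pd.DataFrame({
    "age": [25, 31, 38, 44, 53, 60],
    "income": [30.0, np.nan, 45.0, 50.0, np.nan, 70.0],
})

# Donors are the cases in the same dataset with an observed income.
donors = df.dropna(subset=["income"])
imputed = df.copy()

for idx in df.index[df["income"].isna()]:
    # Pick the donor whose age is closest to this case's age.
    distance = (donors["age"] - df.loc[idx, "age"]).abs()
    donor_idx = distance.idxmin()
    imputed.loc[idx, "income"] = donors.loc[donor_idx, "income"]

print(imputed)
```

Every imputed income is a value that was actually observed in the dataset, which is the defining feature of the hot-deck approach.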
Alternatively, in cold-deck imputation, you replace missing values with existing values from similar cases from other datasets. The new values come from an unrelated sample.
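Cold-deck imputation can be sketched in the same way, except that the donor values come from a separate dataset. In the Python example below, both datasets, the `region` matching variable, and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Current study (hypothetical), with two missing incomes.
study = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "income": [41.0, np.nan, np.nan, 55.0],
})

# Hypothetical external dataset (for example, an earlier survey)
# that supplies the donor values.
external = pd.DataFrame({
    "region": ["north", "south", "east"],
    "income": [43.0, 38.0, 57.0],
})

# For each region, look up the donor value from the external dataset.
donor_by_region = external.set_index("region")["income"]

# Fill only the missing incomes; observed values are left untouched.
imputed = study.copy()
imputed["income"] = study["income"].fillna(study["region"].map(donor_by_region))

print(imputed)
```

Matching on a single category is a deliberate simplification; the key point is that the replacement values originate outside the current sample.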
Use imputation carefully
Imputation involves a trade-off: you keep a complete dataset, but the values you fill in are estimates rather than observations.
Although you retain all of your data, this method can create bias and lead to inaccurate results. You can never know for sure whether the replaced value accurately reflects what would have been observed or answered. That’s why it’s best to apply imputation with caution.
Frequently asked questions about missing data
- How do I deal with missing data?
To tidy up your missing data, your options usually include accepting, removing, or recreating the missing data.
- Acceptance: You leave your data as is
- Listwise or pairwise deletion: You remove cases (participants) with missing data from all analyses (listwise) or only from the analyses that involve their missing values (pairwise)
- Imputation: You use other data to fill in the missing data
- What are the types of missing data?
There are three main types of missing data.
Missing completely at random (MCAR) data are randomly distributed across the variable and unrelated to other variables.
Missing at random (MAR) data are not randomly distributed but they are accounted for by other observed variables.
Missing not at random (MNAR) data systematically differ from the observed values.