Multiple Imputation

[This article was first published on Analysis on StatsNotebook - Simple. Powerful. Reproducible., and kindly contributed to R-bloggers].

This tutorial is based on R and StatsNotebook, a graphical interface for R.

Missing data are the norm rather than the exception in most areas of research. Excluding observations with missing data reduces statistical power and can bias model estimates. Multiple imputation fills in each missing value several times, based on the available data, to produce multiple completed datasets. It can increase statistical power and reduce the bias due to missing data.

StatsNotebook provides a simple interface for multiple imputation using the mice package. By default, numeric variables are imputed using predictive mean matching, and categorical variables are imputed using multinomial logistic regression (for categorical variables with 3 or more levels) or binary logistic regression (for categorical variables with 2 levels).

In this tutorial, we will use the built-in substance dataset. This dataset can be loaded into StatsNotebook using the instructions here. It is a simulated dataset on the effects of a family intervention during adolescence on engagement with deviant peer groups, experimentation with drugs and risk of substance use disorder in young adulthood. See Causal Mediation Analysis for an example based on this dataset.

In this dataset,

  1. dev_peer represents engagement with deviant peer groups and it was coded as “0: No” and “1: Yes”;
  2. sub_exp represents experimentation with drugs and it was coded as “0: No” and “1: Yes”;
  3. fam_int represents participation in family intervention during adolescence and it was coded as “0: No” and “1: Yes”;
  4. sub_disorder represents diagnosis of substance use disorder in young adulthood and it was coded as “0: No” and “1: Yes”;
  5. conflict represents level of family conflict. It will be used as a covariate in this analysis.

Two variables, .imp and .id, will be added to the dataset on successful imputation. The .imp variable is the imputation number, with zero indicating the original dataset. The .id variable is a unique identifier for each observation in the dataset.

Prior to imputing missing data, all categorical variables need to be specified as categorical (i.e. as factor variables in R). See Converting variable type for a step-by-step guide.
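In plain R, this conversion amounts to wrapping each categorical column in factor(). The following is a minimal sketch on a small made-up data frame; only the variable names and the 0/1 coding come from this tutorial, and the values are invented for illustration.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  dev_peer = c(0, 1, NA, 1, 0),
  sub_exp  = c(1, NA, 0, 0, 1),
  conflict = c(2.3, 3.1, NA, 1.8, 2.9)
)

# Convert the 0/1 codes to factors so that they are treated as categorical.
categorical_vars <- c("dev_peer", "sub_exp")
toy[categorical_vars] <- lapply(
  toy[categorical_vars],
  factor,
  levels = c(0, 1),
  labels = c("No", "Yes")
)

# dev_peer and sub_exp are now factors; conflict stays numeric.
str(toy)
```

Missing values stay missing after the conversion, so the imputation step can still fill them in.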

To impute missing data,

  1. Click Analysis at the top
  2. Click Imputation and select Multiple imputation from the menu
  3. In the left panel, select all variables that we want to include in our imputation. Variables with no missing data can also be included as information from these variables will be used to impute missing data in other variables.
[Figure: Multiple imputation in StatsNotebook]
  4. Expand the Passive imputation panel if we need to include interaction terms in the imputation. In this example, we do not include any interaction.
  5. Expand the Analysis Setting panel to specify the number of imputations.
    • As a rule of thumb, the number of imputations should be roughly equal to the percentage of missing data in the dataset (for example, about 20 imputations when roughly 20% of the data are missing).
[Figure: Imputation setting]
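To apply the rule of thumb for the number of imputations, we first need the percentage of missing data. The following is a minimal sketch on made-up data. It reads "percentage of missing data" as the percentage of observations with at least one missing value, which is one possible interpretation, and the floor of 5 imputations is an assumption borrowed from the default in mice.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  conflict = c(2.3, 3.1, NA, 1.8, 2.9, 2.2, NA, 3.4, 2.7, 1.9),
  dev_peer = factor(c("No", "Yes", "No", NA, "Yes",
                      "No", "No", "Yes", "No", "Yes"))
)

# Percentage of missing values in each variable.
pct_missing_by_var <- 100 * colSums(is.na(toy)) / nrow(toy)

# Percentage of observations with at least one missing value.
pct_incomplete_rows <- 100 * sum(!complete.cases(toy)) / nrow(toy)

# Rule of thumb: about one imputation per percentage point,
# with an assumed minimum of 5.
m_suggested <- max(5, ceiling(pct_incomplete_rows))

pct_missing_by_var
pct_incomplete_rows
m_suggested
```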

The only output from StatsNotebook is a set of diagnostic plots from the imputation model. Each line traces one imputation across iterations. The lines in all plots should be freely intermingled, with no clear trend. Non-convergence is indicated by clearly separated lines.

[Figure: Imputation output]
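The same kind of diagnostic plot can be produced directly in R. The sketch below uses nhanes, a small example dataset shipped with mice, in place of the substance dataset, and calls the serial mice() function. The values of m, maxit and seed are arbitrary choices for illustration.

```r
library(mice)

# nhanes is an example dataset shipped with mice (not the substance data).
# maxit sets the number of iterations shown on the x-axis of the plots.
imp <- mice(nhanes, m = 5, maxit = 20, seed = 123, printFlag = FALSE)

# Trace plots of the mean and standard deviation of the imputed values,
# with one line per imputation.
plot(imp)
```

If the lines show a trend or stay apart, increasing maxit is a common first step.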

The following is the code generated by StatsNotebook.

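library(mice)

# Start from the default imputation formulas, then set one formula per variable
formulas <- make.formulas(currentDataset)

formulas$gender = gender ~ conflict + dev_peer + sub_exp + fam_int + sub_disorder
formulas$conflict = conflict ~ gender + dev_peer + sub_exp + fam_int + sub_disorder
formulas$dev_peer = dev_peer ~ gender + conflict + sub_exp + fam_int + sub_disorder
formulas$sub_exp = sub_exp ~ gender + conflict + dev_peer + fam_int + sub_disorder
formulas$fam_int = fam_int ~ gender + conflict + dev_peer + sub_exp + sub_disorder
formulas$sub_disorder = sub_disorder ~ gender + conflict + dev_peer + sub_exp + fam_int

# Default imputation method for each variable, chosen from its type
meth <- make.method(currentDataset)

# Run the imputations in parallel
# n.core is the number of CPU cores to use (adjust it to the cores available);
# n.imp.core is the number of imputations run on each core
imputedDataset <- parlmice(currentDataset,
  method = meth,
  formulas = formulas,
  m = 20,
  n.core = 15,
  n.imp.core = 2)

# Diagnostic (convergence) plots
plot(imputedDataset)
# Stack the original data (.imp = 0) and the imputed datasets in long format
currentDataset <- complete(imputedDataset, action = "long", include = TRUE)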
"Chan, G. and StatsNotebook Team (2020). StatsNotebook. (Version 0.1.0) [Computer Software]. Retrieved from https://www.statsnotebook.io"
"R Core Team (2020). The R Project for Statistical Computing. [Computer software]. Retrieved from https://r-project.org"
"Buuren, S. v. and K. Groothuis-Oudshoorn (2010). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software: 1-68."


The top section specifies which variables are used to impute each variable. By default, StatsNotebook uses all other selected variables as predictors.

formulas <- make.formulas(currentDataset)

formulas$gender = gender ~ conflict + dev_peer + sub_exp + fam_int + sub_disorder
formulas$conflict = conflict ~ gender + dev_peer + sub_exp + fam_int + sub_disorder
formulas$dev_peer = dev_peer ~ gender + conflict + sub_exp + fam_int + sub_disorder
formulas$sub_exp = sub_exp ~ gender + conflict + dev_peer + fam_int + sub_disorder
formulas$fam_int = fam_int ~ gender + conflict + dev_peer + sub_exp + sub_disorder
formulas$sub_disorder = sub_disorder ~ gender + conflict + dev_peer + sub_exp + fam_int
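To see what make.formulas() returns and how a single formula can be changed, here is a minimal sketch using nhanes, an example dataset shipped with mice, rather than the substance dataset. Dropping hyp as a predictor of bmi is purely illustrative, and the serial mice() function is used here instead of parlmice.

```r
library(mice)

# nhanes (shipped with mice) has four variables: age, bmi, hyp and chl.
formulas <- make.formulas(nhanes)

# By default, each variable is predicted by all the other variables.
default_bmi <- formulas$bmi
default_bmi

# Override one imputation model, e.g. drop hyp as a predictor of bmi.
formulas$bmi <- bmi ~ age + chl

# The edited list of formulas is passed on to the imputation function.
imp <- mice(nhanes, formulas = formulas, m = 2, seed = 123, printFlag = FALSE)
```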

After specifying which variables are used to impute each variable, we use the following line of code to specify the imputation methods. By default, predictive mean matching is used for numeric variables, binary logistic regression is used for categorical variables with two levels, and multinomial logistic regression is used for categorical variables with three or more levels.

meth <- make.method(currentDataset)
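To check which method has been chosen for each variable, we can print the object returned by make.method(). The sketch below uses a made-up data frame with one numeric variable, one two-level factor and one three-level factor. The variable names and values are hypothetical, and the switch to "norm" (Bayesian linear regression in mice) is shown only as an example of overriding a default.

```r
library(mice)

# Hypothetical toy data; the values are made up for illustration only.
set.seed(1)
toy <- data.frame(
  score  = c(rnorm(18), NA, NA),
  binary = factor(c(rep(c("No", "Yes"), 9), NA, NA)),
  group  = factor(c(rep(c("a", "b", "c"), 6), NA, NA))
)

# Default method per variable: pmm (numeric), logreg (two-level factor),
# polyreg (unordered factor with three or more levels).
meth <- make.method(toy)
meth

# A default can be overridden by name before running the imputation.
meth_custom <- meth
meth_custom["score"] <- "norm"
```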

After the setup, the function parlmice, which runs mice in parallel across CPU cores, is used to impute the missing data.

imputedDataset <- parlmice(currentDataset,
  method = meth,
  formulas = formulas,
  m = 20,
  n.core = 15, 
  n.imp.core = 2)
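The remaining lines of the generated code plot the diagnostics and stack the imputed datasets with complete(). The following is a minimal sketch of that last step. It swaps parlmice for the serial mice() function so that it does not depend on the number of available cores, and it uses nhanes2 (an example dataset shipped with mice) as a stand-in for currentDataset. The values of m and seed are arbitrary.

```r
library(mice)

# nhanes2 ships with mice; it stands in for currentDataset here.
meth <- make.method(nhanes2)
formulas <- make.formulas(nhanes2)

# Serial imputation with mice(); m is the number of imputed datasets.
imp <- mice(nhanes2, method = meth, formulas = formulas,
            m = 5, seed = 123, printFlag = FALSE)

# Stack the original data (.imp == 0) and the imputed datasets in long format.
long <- complete(imp, action = "long", include = TRUE)

# 25 rows for each of .imp = 0, 1, ..., 5
table(long$.imp)

# The original data (.imp == 0) still contain missing values.
anyNA(long[long$.imp == 0, ])
```

Keeping the original data as .imp = 0 makes it possible to compare observed and imputed values in the same data frame.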

