Announcing the simputation package: make imputation simple

September 13, 2016

(This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers)

I am happy to announce that my simputation package appeared on CRAN this weekend. The package aims to simplify missing value imputation. In particular, it offers standardized interfaces that

• make it easy to define both the imputation method and the imputation model;
• impute multiple variables at once;
• allow grouping of the data by categorical variables;
• all fit in the magrittr not-a-pipeline.

A few examples

To start with an example, let us first create a data set with some missing values.

``````dat <- iris
# empty a few fields
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA
head(dat, 10)
``````
``````##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1            NA         3.5          1.4         0.2  setosa
## 2            NA         3.0          1.4         0.2  setosa
## 3            NA          NA          1.3         0.2  setosa
## 4           4.6          NA          1.5         0.2  setosa
## 5           5.0          NA          1.4         0.2  setosa
## 6           5.4          NA          1.7         0.4  setosa
## 7           4.6          NA          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2    <NA>
## 9           4.4         2.9          1.4         0.2    <NA>
## 10          4.9         3.1          1.5         0.1    <NA>
``````

Below, we first impute `Sepal.Width` and `Sepal.Length` by regression on `Petal.Width` and `Species`. After this we impute `Species` with a decision tree model (CART), using all other variables as predictors (including the ones just imputed).

``````library(magrittr)    # load the %>% operator
library(simputation)
imputed <- dat %>%
  impute_lm(Sepal.Width + Sepal.Length ~ Petal.Width + Species) %>%
  impute_cart(Species ~ .)
head(imputed, 10)
``````
``````##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1      4.979844    3.500000          1.4         0.2  setosa
## 2      4.979844    3.000000          1.4         0.2  setosa
## 3      4.979844    3.409547          1.3         0.2  setosa
## 4      4.600000    3.409547          1.5         0.2  setosa
## 5      5.000000    3.409547          1.4         0.2  setosa
## 6      5.400000    3.561835          1.7         0.4  setosa
## 7      4.600000    3.485691          1.4         0.3  setosa
## 8      5.000000    3.400000          1.5         0.2  setosa
## 9      4.400000    2.900000          1.4         0.2  setosa
## 10     4.900000    3.100000          1.5         0.1  setosa
``````
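For readers who want to see what regression imputation amounts to, here is a minimal base R sketch of the idea: fit the model on the complete records, then predict the missing ones. It is an illustration only, not a description of simputation's internals, and the names `fit` and `miss` are made up for the example.

```r
# Regression imputation by hand (illustrative sketch, not simputation's code)
dat <- iris
dat[1:3, "Sepal.Length"] <- NA

# lm() drops incomplete records by default (na.action = na.omit)
fit <- lm(Sepal.Length ~ Petal.Width + Species, data = dat)

# replace the missing values with the model predictions
miss <- is.na(dat$Sepal.Length)
dat$Sepal.Length[miss] <- predict(fit, newdata = dat[miss, ])

head(dat, 3)
```

The `impute_lm` call above expresses the same idea in a single step, for several variables at once.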

The package is fairly tolerant of failed imputations. For example, if one of the predictors is missing, the field simply remains unimputed, and if one of the models cannot be fitted, only a warning is issued (not shown here).

``````dat %>% impute_lm(Sepal.Length ~ Sepal.Width + Species) %>% head(3)
``````
``````##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     5.076579         3.5          1.4         0.2  setosa
## 2     4.675654         3.0          1.4         0.2  setosa
## 3           NA          NA          1.3         0.2  setosa
``````

So here, the third `Sepal.Length` value could not be imputed since the predictor `Sepal.Width` is missing.
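A quick way to see what was left unimputed is to count the remaining missing values per column. The sketch below does that and then chains a second method as a fallback. Using `impute_median` with `Species` as the grouping variable is just a choice made for this illustration, not a recommendation.

```r
library(magrittr)
library(simputation)

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA

# First attempt: regression. Record 3 lacks the predictor Sepal.Width,
# so its Sepal.Length is expected to stay NA.
step1 <- dat %>% impute_lm(Sepal.Length ~ Sepal.Width + Species)
colSums(is.na(step1))

# Fallback: impute what is left with the median per Species
step2 <- step1 %>% impute_median(Sepal.Length ~ Species)
colSums(is.na(step2))
```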

It is possible to split data into groups before estimating the imputation model and predicting missing values. There are two ways. The first is to use the `|` operator to specify grouping variables.

``````# We first need to complete 'Species'. Here, we use sequential
# hot deck after sorting by Petal.Length
dat %<>% impute_shd(Species ~ Petal.Length)
# Now impute Sepal.Length by regressing on
# Sepal.Width, computing a model for each Species.
dat %>% impute_lm(Sepal.Length ~ Sepal.Width | Species) %>% head(3)
``````
``````##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     5.067813         3.5          1.4         0.2  setosa
## 2     4.725677         3.0          1.4         0.2  setosa
## 3           NA          NA          1.3         0.2  setosa
``````

The second way is to use the `group_by` command from dplyr:

``````dat %>% dplyr::group_by(Species) %>%
  impute_lm(Sepal.Length ~ Sepal.Width) %>%
  head(3)
``````
``````## Source: local data frame [3 x 5]
## Groups: Species [1]
##
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##
## 1     5.067813         3.5          1.4         0.2  setosa
## 2     4.725677         3.0          1.4         0.2  setosa
## 3           NA          NA          1.3         0.2  setosa
``````

Note: by using `group_by`, we also transformed the data.frame to a tibble, which not only sounds funny when you pronounce it (tibble, TIBBLE, tibble? tibbebbebbebble) but is also pretty useful.

Supported methods and how to specify them

Currently, the package supports the following methods:

• Model based (optionally add [non-]parametric random residual)
    • linear regression
    • robust linear regression
    • CART models
    • Random forest
• Donor imputation (including various donor pool specifications)
    • k-nearest neighbour (based on Gower's distance)
    • sequential hotdeck (LOCF, NOCB)
    • random hotdeck
    • Predictive mean matching
• Other
    • (groupwise) median imputation (optional random residual)
    • Proxy imputation (copy from other variable)
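To illustrate the optional random residual for model-based methods, here is a small sketch using the `add_residual` argument of `impute_lm`. It assumes the option values `"none"` (the default), `"normal"` and `"observed"`; please check `?impute_lm` for your installed version.

```r
library(simputation)
set.seed(2016)  # the residuals are random, so fix the seed

dat <- iris
dat[1:3, 1] <- NA

# deterministic: impute the predicted value
det <- impute_lm(dat, Sepal.Length ~ Petal.Width + Species)
# parametric: add a residual drawn from a normal distribution
nrm <- impute_lm(dat, Sepal.Length ~ Petal.Width + Species,
                 add_residual = "normal")
# non-parametric: add a residual sampled from the observed model residuals
obs <- impute_lm(dat, Sepal.Length ~ Petal.Width + Species,
                 add_residual = "observed")

# compare the three imputations of the first three records
cbind(none = det[1:3, 1], normal = nrm[1:3, 1], observed = obs[1:3, 1])
```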

Any call to one of the `impute_` functions looks as follows:

``````impute_<model>(data, formula [, <model-specific options>])
``````

and the formula always has the following form:

``````<imputed variables> ~ <model specification> [|<grouping variables>]
``````

The parts in square brackets are optional.
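As a sketch of how the parts of the formula fit together, the call below imputes two variables at once, uses one predictor, and fits a separate model for each group. The choice of variables is only for illustration.

```r
library(simputation)

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA

# formula parts: <imputed variables> ~ <model specification> | <grouping variables>
imputed <- impute_lm(dat, Sepal.Length + Sepal.Width ~ Petal.Width | Species)

# without the optional grouping part, one model is fitted on all records
ungrouped <- impute_lm(dat, Sepal.Length + Sepal.Width ~ Petal.Width)

head(imputed, 7)
```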

Please see the package vignette for more examples and details, or `?simputation::impute_` for an overview of all imputation functions.

Happy imputing!
