# Imputing missing values in parallel using {furrr}

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today I saw this tweet on my timeline:

and as a heavy `{purrr}` user, as well as the happy owner of a 6-core AMD Ryzen 5 1600X cpu, I was very excited to try out `{furrr}`. For those unfamiliar with `{purrr}`, you can read some of my previous blog posts on it here, here or here.

To summarize very quickly: `{purrr}` contains so-called higher order functions, which are functions that take other functions as argument. One such function is `map()`. Consider the following simple example:

``numbers <- seq(1, 10)``

If you want the square root of this numbers, you can of course simply use the `sqrt()` function, because it is vectorized:

``sqrt(numbers)``
``````##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
##  [8] 2.828427 3.000000 3.162278``````

But in a lot of situations, the solution is not so simple. Sometimes you have to loop over the values. This is what we would need to do if `sqrt()` was not vectorized:

``````sqrt_numbers <- rep(0, 10)

for(i in length(numbers)){
sqrt_numbers[i] <- sqrt(numbers[i])
}``````

First, you need to initialize a container, and then you have to populate the `sqrt_numbers` list with the results. Using, `{purrr}` is way easier:

``````library(tidyverse)
map(numbers, sqrt)``````
``````## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.236068
##
## [[6]]
## [1] 2.44949
##
## [[7]]
## [1] 2.645751
##
## [[8]]
## [1] 2.828427
##
## [[9]]
## [1] 3
##
## [[10]]
## [1] 3.162278``````

`map()` is only one of the nice functions that are bundled inside `{purrr}`. Mastering `{purrr}` can really make you a much more efficient R programmer. Anyways, recently, I have been playing around with imputation and the `{mice}` package. `{mice}` comes with an example dataset called `boys`, let’s take a look at it:

``````library(mice)

data(boys)

brotools::describe(boys) %>%
select(variable, type, n_missing, everything())``````
``````## # A tibble: 9 x 12
##   variable type    n_missing   mean    sd mode      min   max   q25 median
##
## 1 age      Numeric         0   9.16  6.89    0.0350  21.2  1.58   10.5
## 2 bmi      Numeric        21  18.1   3.05   11.8     31.7 15.9    17.4
## 3 hc       Numeric        46  51.5   5.91   33.7     65.0 48.1    53.0
## 4 hgt      Numeric        20 132    46.5    50.0    198   84.9   147
## 5 tv       Numeric       522  11.9   7.99    1.00    25.0  4.00   12.0
## 6 wgt      Numeric         4  37.2  26.0     3.14   117   11.7    34.7
## 7 gen      Factor        503  NA    NA      NA       NA   NA      NA
## 8 phb      Factor        503  NA    NA      NA       NA   NA      NA
## 9 reg      Factor          3  NA    NA    south NA       NA   NA      NA
## # ... with 2 more variables: q75 , n_unique ``````

In the code above I use the `describe()` function from my personal package to get some summary statistics of the `boys` dataset (you can read more about this function here). I am especially interested in the number of missing values, which is why I re-order the columns. If I did not re-order the columns, it would not appear in the output on my blog.

We see that some columns have a lot of missing values. Using the `mice` function, it is very easy to impute them:

``````start <- Sys.time()
imp_boys <- mice(boys, m = 10, maxit = 100, printFlag = FALSE)
end <- Sys.time() - start

print(end)``````
``## Time difference of 3.290611 mins``

Imputation on a single core took around 3 minutes on my computer. This might seem ok, but if you have a larger data set with more variables, 3 minutes can become 3 hours. And if you increase `maxit`, which helps convergence, or the number of imputations, 3 hours can become 30 hours. With a 6-core CPU this could potentially be brought down to 5 hours (in theory). Let’s see if we can go faster, but first let’s take a look at the imputed data.

The `mice()` function returns a `mids` object. If you want to look at the data, you have to use the `complete()` function (careful, there is also a `complete()` function in the `{tidyr}` package, so to avoid problems, I suggest you explicitely call `mice::complete()`):

``````imp_boys <- mice::complete(imp_boys, "long")

brotools::describe(imp_boys) %>%
select(variable, type, n_missing, everything())``````
``````## # A tibble: 11 x 12
##    variable type   n_missing   mean    sd mode      min   max   q25 median
##
##  1 age      Numer…         0   9.16  6.89    0.0350  21.2  1.58  10.5
##  2 bmi      Numer…         0  18.0   3.03   11.8     31.7 15.9   17.4
##  3 hc       Numer…         0  51.6   5.89   33.7     65.0 48.3   53.2
##  4 hgt      Numer…         0 131    46.5    50.0    198   83.0  146
##  5 tv       Numer…         0   8.39  8.09    1.00    25.0  2.00   3.00
##  6 wgt      Numer…         0  37.1  26.0     3.14   117   11.7   34.6
##  7 .id      Factor         0  NA    NA    3     NA       NA   NA     NA
##  8 .imp     Factor         0  NA    NA    1     NA       NA   NA     NA
##  9 gen      Factor         0  NA    NA    G1    NA       NA   NA     NA
## 10 phb      Factor         0  NA    NA    P1    NA       NA   NA     NA
## 11 reg      Factor         0  NA    NA    south NA       NA   NA     NA
## # ... with 2 more variables: q75 , n_unique ``````

As expected, no more missing values. The “long” argument inside `mice::complete()` is needed if you want the `complete()` function to return a long dataset. Doing the above “manually” using `{purrr}` is possible with the following code:

``````start <- Sys.time()
imp_boys_purrr <- map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)``````
``## Time difference of 3.393966 mins``

What this does is map the function `~mice(data = boys, m = ., maxit = 100, printFlag = FALSE)` to a list of `1`s, and creates 10 imputed data sets. `m = .` means that `m` will be equal to whatever is inside the list we are mapping our function over, so `1`, then `1` then another `1` etc…. It took around the same amount of time as using `mice()` directly.

`imp_boys_purrr` is now a list of 10 `mids` objects. We thus need to map `mice::complete()` to `imp_boys_purrr` to get the data:

``imp_boys_purrr_complete <- map(imp_boys_purrr, mice::complete)``

Now, `imp_boys_purrr_complete` is a list of 10 datasets. Let’s map `brotools::describe()` to it:

``map(imp_boys_purrr_complete, brotools::describe)``
``````## [[1]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.03   11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.7   5.90   33.7     65.0 48.3   53.1   56.0
## 4 hgt      Numeric 131    46.5    50.0    198   84.0  146    175
## 5 tv       Numeric   8.35  8.00    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.0     3.14   117   11.7   34.7   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[2]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.03   11.8     31.7 15.9   17.5   19.5
## 3 hc       Numeric  51.6   5.88   33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.5  145    175
## 5 tv       Numeric   8.37  8.02    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.0     3.14   117   11.9   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA
## 8 phb      Factor   NA    NA    P2    NA       NA   NA     NA     NA
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[3]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.1   3.04   11.8     31.7 15.9   17.5   19.5
## 3 hc       Numeric  51.6   5.87   33.7     65.0 48.5   53.3   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.0  145    175
## 5 tv       Numeric   8.46  8.14    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.1     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[4]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.1   3.02   11.8     31.7 15.9   17.5   19.4
## 3 hc       Numeric  51.7   5.93   33.7     65.0 48.5   53.4   56.0
## 4 hgt      Numeric 131    46.5    50.0    198   82.9  145    175
## 5 tv       Numeric   8.45  8.11    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.0     3.14   117   11.7   34.7   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[5]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.03   11.8     31.7 15.9   17.5   19.5
## 3 hc       Numeric  51.6   5.91   33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.0  146    175
## 5 tv       Numeric   8.21  8.02    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.0     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[6]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.05   11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.7   5.89   33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.5    50.0    198   83.0  146    175
## 5 tv       Numeric   8.44  8.24    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.0     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[7]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.1   3.04   11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.6   5.88   33.7     65.0 48.2   53.2   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.5  146    175
## 5 tv       Numeric   8.47  8.15    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.1     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[8]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.04   11.8     31.7 15.9   17.4   19.4
## 3 hc       Numeric  51.6   5.85   33.7     65.0 48.2   53.3   56.0
## 4 hgt      Numeric 131    46.5    50.0    198   83.0  146    175
## 5 tv       Numeric   8.36  8.06    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.1     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[9]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.05   11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.6   5.90   33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.9  146    175
## 5 tv       Numeric   8.57  8.25    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.1     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[10]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.04   11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.6   5.89   33.7     65.0 48.3   53.1   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.0  146    175
## 5 tv       Numeric   8.49  8.18    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.1     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA
## # ... with 2 more variables: n_missing , n_unique ``````

Before merging this 10 datasets together into one, it would be nice to have a column with the id of the datasets. This can easily be done with a variant of `purrr::map()`, called `map2()`:

``imp_boys_purrr <- map2(.x = seq(1,10), .y = imp_boys_purrr_complete, ~mutate(.y, imp_id = as.character(.x)))``

`map2()` applies a function, say `f()`, to 2 lists sequentially: `f(x_1, y_1)`, then `f(x_2, y_2)`, etc… So here I map `mutate()` to create a new column, `imp_id` in each dataset. Now let’s bind the rows and take a look at the data:

``````imp_boys_purrr <- bind_rows(imp_boys_purrr)

imp_boys_purrr %>%
brotools::describe() %>%
select(variable, type, n_missing, everything())``````
``````## # A tibble: 10 x 12
##    variable type   n_missing   mean    sd mode      min   max   q25 median
##
##  1 age      Numer…         0   9.16  6.89    0.0350  21.2  1.58  10.5
##  2 bmi      Numer…         0  18.0   3.04   11.8     31.7 15.9   17.4
##  3 hc       Numer…         0  51.6   5.89   33.7     65.0 48.3   53.2
##  4 hgt      Numer…         0 131    46.5    50.0    198   83.0  146
##  5 tv       Numer…         0   8.42  8.11    1.00    25.0  2.00   3.00
##  6 wgt      Numer…         0  37.1  26.0     3.14   117   11.7   34.6
##  7 imp_id   Chara…         0  NA    NA    1     NA       NA   NA     NA
##  8 gen      Factor         0  NA    NA    G1    NA       NA   NA     NA
##  9 phb      Factor         0  NA    NA    P1    NA       NA   NA     NA
## 10 reg      Factor         0  NA    NA    south NA       NA   NA     NA
## # ... with 2 more variables: q75 , n_unique ``````

You may ask yourself why I am bothering with all this. This will become apparent now. We can now use the code we wrote to get our 10 imputed datasets using `purrr::map()` and simply use `furrr::future_map()` to parallelize the imputation process:

``library(furrr)``
``## Loading required package: future``
``````plan(multiprocess)

start <- Sys.time()
imp_boys_future <- future_map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)``````
``## Time difference of 33.73772 secs``

Boooom! Much faster! And simply by loading `{furrr}`, then using `plan(multiprocess)` to run the code in parallel (if you forget that, the code will run on a single core) and using `future_map()` instead of `map()`.

Let’s take a look at the data:

``````imp_boys_future_complete <- map(imp_boys_future, mice::complete)

imp_boys_future <- map2(.x = seq(1,10), .y = imp_boys_future_complete, ~mutate(.y, imp_id = as.character(.x)))

imp_boys_future <- bind_rows(imp_boys_future)

imp_boys_future %>%
brotools::describe() %>%
select(variable, type, n_missing, everything())``````
``````## # A tibble: 10 x 12
##    variable type   n_missing   mean    sd mode      min   max   q25 median
##
##  1 age      Numer…         0   9.16  6.89    0.0350  21.2  1.58  10.5
##  2 bmi      Numer…         0  18.0   3.04   11.8     31.7 15.9   17.4
##  3 hc       Numer…         0  51.6   5.89   33.7     65.0 48.4   53.2
##  4 hgt      Numer…         0 131    46.5    50.0    198   83.0  146
##  5 tv       Numer…         0   8.35  8.09    1.00    25.0  2.00   3.00
##  6 wgt      Numer…         0  37.1  26.0     3.14   117   11.7   34.6
##  7 imp_id   Chara…         0  NA    NA    1     NA       NA   NA     NA
##  8 gen      Factor         0  NA    NA    G1    NA       NA   NA     NA
##  9 phb      Factor         0  NA    NA    P1    NA       NA   NA     NA
## 10 reg      Factor         0  NA    NA    south NA       NA   NA     NA
## # ... with 2 more variables: q75 , n_unique ``````

So imputation went from 3.4 minutes (around 200 seconds) to 30 seconds. How cool is that? If you want to play around with `{furrr}` you must install it from Github, as it is not yet available on CRAN:

``devtools::install_github("DavisVaughan/furrr")``

If you are not comfortable with `map()` (and thus `future_map()`) but still want to impute in parallel, there is this very nice script here to do just that. I created a package around this script, called parlMICE (the same name as the script), to make installation and usage easier. You can install it with like so:

``devtools::install_github("b-rodrigues/parlMICE")``

If you found this blog post useful, you might want to follow me on twitter for blog post updates.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.