# Imputing missing values in parallel using {furrr}

**Econometrics and Free Software**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today I saw this tweet on my timeline:

For those of us that just can't wait until RStudio officially supports parallel purrr in #rstats, boy have I got something for you.

— Davis Vaughan (@dvaughan32) April 13, 2018

Introducing `furrr`, parallel purrr through the use of futures. Go ahead, break things, you know you want to:https://t.co/l9z1UC2Tew

and as a heavy `{purrr}`

user, as well as the happy owner of a 6-core AMD Ryzen 5 1600X cpu, I was very excited to try out `{furrr}`

. For those unfamiliar with `{purrr}`

, you can read some of my previous blog posts on it here, here or here.

To summarize very quickly: `{purrr}`

contains so-called higher order functions, which are functions that take other functions as argument. One such function is `map()`

. Consider the following simple example:

numbers <- seq(1, 10)

If you want the square root of this numbers, you can of course simply use the `sqrt()`

function, because it is vectorized:

sqrt(numbers) ## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 ## [8] 2.828427 3.000000 3.162278

But in a lot of situations, the solution is not so simple. Sometimes you have to loop over the values. This is what we would need to do if `sqrt()`

was not vectorized:

sqrt_numbers <- rep(0, 10) for(i in length(numbers)){ sqrt_numbers[i] <- sqrt(numbers[i]) }

First, you need to initialize a container, and then you have to populate the `sqrt_numbers`

list with the results. Using, `{purrr}`

is way easier:

library(tidyverse) map(numbers, sqrt) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 1.414214 ## ## [[3]] ## [1] 1.732051 ## ## [[4]] ## [1] 2 ## ## [[5]] ## [1] 2.236068 ## ## [[6]] ## [1] 2.44949 ## ## [[7]] ## [1] 2.645751 ## ## [[8]] ## [1] 2.828427 ## ## [[9]] ## [1] 3 ## ## [[10]] ## [1] 3.162278

`map()`

is only one of the nice functions that are bundled inside `{purrr}`

. Mastering `{purrr}`

can really make you a much more efficient R programmer. Anyways, recently, I have been playing around with imputation and the `{mice}`

package. `{mice}`

comes with an example dataset called `boys`

, let’s take a look at it:

library(mice) data(boys) brotools::describe(boys) %>% select(variable, type, n_missing, everything()) ## # A tibble: 9 x 12 ## variable type n_missing mean sd mode min max q25 median #### 1 age Numeric 0 9.16 6.89 0.0350 21.2 1.58 10.5 ## 2 bmi Numeric 21 18.1 3.05 11.8 31.7 15.9 17.4 ## 3 hc Numeric 46 51.5 5.91 33.7 65.0 48.1 53.0 ## 4 hgt Numeric 20 132 46.5 50.0 198 84.9 147 ## 5 tv Numeric 522 11.9 7.99 1.00 25.0 4.00 12.0 ## 6 wgt Numeric 4 37.2 26.0 3.14 117 11.7 34.7 ## 7 gen Factor 503 NA NA NA NA NA NA ## 8 phb Factor 503 NA NA NA NA NA NA ## 9 reg Factor 3 NA NA south NA NA NA NA ## # ... with 2 more variables: q75 , n_unique

In the code above I use the `describe()`

function from my personal package to get some summary statistics of the `boys`

dataset (you can read more about this function here). I am especially interested in the number of missing values, which is why I re-order the columns. If I did not re-order the columns, it would not appear in the output on my blog.

We see that some columns have a lot of missing values. Using the `mice`

function, it is very easy to impute them:

start <- Sys.time() imp_boys <- mice(boys, m = 10, maxit = 100, printFlag = FALSE) end <- Sys.time() - start print(end) ## Time difference of 3.290611 mins

Imputation on a single core took around 3 minutes on my computer. This might seem ok, but if you have a larger data set with more variables, 3 minutes can become 3 hours. And if you increase `maxit`

, which helps convergence, or the number of imputations, 3 hours can become 30 hours. With a 6-core CPU this could potentially be brought down to 5 hours (in theory). Let’s see if we can go faster, but first let’s take a look at the imputed data.

The `mice()`

function returns a `mids`

object. If you want to look at the data, you have to use the `complete()`

function (careful, there is also a `complete()`

function in the `{tidyr}`

package, so to avoid problems, I suggest you explicitely call `mice::complete()`

):

imp_boys <- mice::complete(imp_boys, "long") brotools::describe(imp_boys) %>% select(variable, type, n_missing, everything()) ## # A tibble: 11 x 12 ## variable type n_missing mean sd mode min max q25 median #### 1 age Numer… 0 9.16 6.89 0.0350 21.2 1.58 10.5 ## 2 bmi Numer… 0 18.0 3.03 11.8 31.7 15.9 17.4 ## 3 hc Numer… 0 51.6 5.89 33.7 65.0 48.3 53.2 ## 4 hgt Numer… 0 131 46.5 50.0 198 83.0 146 ## 5 tv Numer… 0 8.39 8.09 1.00 25.0 2.00 3.00 ## 6 wgt Numer… 0 37.1 26.0 3.14 117 11.7 34.6 ## 7 .id Factor 0 NA NA 3 NA NA NA NA ## 8 .imp Factor 0 NA NA 1 NA NA NA NA ## 9 gen Factor 0 NA NA G1 NA NA NA NA ## 10 phb Factor 0 NA NA P1 NA NA NA NA ## 11 reg Factor 0 NA NA south NA NA NA NA ## # ... with 2 more variables: q75 , n_unique

As expected, no more missing values. The “long” argument inside `mice::complete()`

is needed if you want the `complete()`

function to return a long dataset. Doing the above “manually” using `{purrr}`

is possible with the following code:

start <- Sys.time() imp_boys_purrr <- map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE)) end <- Sys.time() - start print(end) ## Time difference of 3.393966 mins

What this does is map the function `~mice(data = boys, m = ., maxit = 100, printFlag = FALSE)`

to a list of `1`

s, and creates 10 imputed data sets. `m = .`

means that `m`

will be equal to whatever is inside the list we are mapping our function over, so `1`

, then `1`

then another `1`

etc…. It took around the same amount of time as using `mice()`

directly.

`imp_boys_purrr`

is now a list of 10 `mids`

objects. We thus need to map `mice::complete()`

to `imp_boys_purrr`

to get the data:

imp_boys_purrr_complete <- map(imp_boys_purrr, mice::complete)

Now, `imp_boys_purrr_complete`

is a list of 10 datasets. Let’s map `brotools::describe()`

to it:

map(imp_boys_purrr_complete, brotools::describe) ## [[1]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 #### 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.03 11.8 31.7 15.9 17.4 19.5 ## 3 hc Numeric 51.7 5.90 33.7 65.0 48.3 53.1 56.0 ## 4 hgt Numeric 131 46.5 50.0 198 84.0 146 175 ## 5 tv Numeric 8.35 8.00 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.2 26.0 3.14 117 11.7 34.7 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing , n_unique ## ## [[2]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## ## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.03 11.8 31.7 15.9 17.5 19.5 ## 3 hc Numeric 51.6 5.88 33.7 65.0 48.3 53.2 56.0 ## 4 hgt Numeric 131 46.6 50.0 198 83.5 145 175 ## 5 tv Numeric 8.37 8.02 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.1 26.0 3.14 117 11.9 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P2 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing , n_unique ## ## [[3]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## ## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.1 3.04 11.8 31.7 15.9 17.5 19.5 ## 3 hc Numeric 51.6 5.87 33.7 65.0 48.5 53.3 56.0 ## 4 hgt Numeric 131 46.6 50.0 198 83.0 145 175 ## 5 tv Numeric 8.46 8.14 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.2 26.1 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing , n_unique ## ## [[4]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## ## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.1 3.02 11.8 31.7 15.9 17.5 19.4 ## 3 hc Numeric 51.7 5.93 33.7 65.0 48.5 53.4 56.0 ## 4 hgt Numeric 131 46.5 50.0 198 82.9 145 175 ## 5 tv Numeric 8.45 8.11 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.2 26.0 3.14 117 11.7 34.7 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing , n_unique ## ## [[5]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## ## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.03 11.8 31.7 15.9 17.5 19.5 ## 3 hc Numeric 51.6 5.91 33.7 65.0 48.3 53.2 56.0 ## 4 hgt Numeric 131 46.6 50.0 198 83.0 146 175 ## 5 tv Numeric 8.21 8.02 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.1 26.0 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing , n_unique ## ## [[6]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## ## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.05 11.8 31.7 15.9 17.4 19.5 ## 3 hc Numeric 51.7 5.89 33.7 65.0 48.3 53.2 56.0 ## 4 hgt Numeric 131 46.5 50.0 198 83.0 146 175 ## 5 tv Numeric 8.44 8.24 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.1 26.0 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing , n_unique ## ## [[7]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## ## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.1 3.04 11.8 31.7 15.9 17.4 19.5 ## 3 hc Numeric 51.6 5.88 33.7 65.0 48.2 53.2 56.0 ## 4 hgt Numeric 131 46.6 50.0 198 83.5 146 175 ## 5 tv Numeric 8.47 8.15 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.2 26.1 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing , n_unique ## ## [[8]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## ## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.04 11.8 31.7 15.9 17.4 19.4 ## 3 hc Numeric 51.6 5.85 33.7 65.0 48.2 53.3 56.0 ## 4 hgt Numeric 131 46.5 50.0 198 83.0 146 175 ## 5 tv Numeric 8.36 8.06 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.2 26.1 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing , n_unique ## ## [[9]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## ## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.05 11.8 31.7 15.9 17.4 19.5 ## 3 hc Numeric 51.6 5.90 33.7 65.0 48.3 53.2 56.0 ## 4 hgt Numeric 131 46.6 50.0 198 83.9 146 175 ## 5 tv Numeric 8.57 8.25 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.1 26.1 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing , n_unique ## ## [[10]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## ## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.04 11.8 31.7 15.9 17.4 19.5 ## 3 hc Numeric 51.6 5.89 33.7 65.0 48.3 53.1 56.0 ## 4 hgt Numeric 131 46.6 50.0 198 83.0 146 175 ## 5 tv Numeric 8.49 8.18 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.1 26.1 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing , n_unique

Before merging this 10 datasets together into one, it would be nice to have a column with the id of the datasets. This can easily be done with a variant of `purrr::map()`

, called `map2()`

:

imp_boys_purrr <- map2(.x = seq(1,10), .y = imp_boys_purrr_complete, ~mutate(.y, imp_id = as.character(.x)))

`map2()`

applies a function, say `f()`

, to 2 lists sequentially: `f(x_1, y_1)`

, then `f(x_2, y_2)`

, etc… So here I map `mutate()`

to create a new column, `imp_id`

in each dataset. Now let’s bind the rows and take a look at the data:

imp_boys_purrr <- bind_rows(imp_boys_purrr) imp_boys_purrr %>% brotools::describe() %>% select(variable, type, n_missing, everything()) ## # A tibble: 10 x 12 ## variable type n_missing mean sd mode min max q25 median #### 1 age Numer… 0 9.16 6.89 0.0350 21.2 1.58 10.5 ## 2 bmi Numer… 0 18.0 3.04 11.8 31.7 15.9 17.4 ## 3 hc Numer… 0 51.6 5.89 33.7 65.0 48.3 53.2 ## 4 hgt Numer… 0 131 46.5 50.0 198 83.0 146 ## 5 tv Numer… 0 8.42 8.11 1.00 25.0 2.00 3.00 ## 6 wgt Numer… 0 37.1 26.0 3.14 117 11.7 34.6 ## 7 imp_id Chara… 0 NA NA 1 NA NA NA NA ## 8 gen Factor 0 NA NA G1 NA NA NA NA ## 9 phb Factor 0 NA NA P1 NA NA NA NA ## 10 reg Factor 0 NA NA south NA NA NA NA ## # ... with 2 more variables: q75 , n_unique

You may ask yourself why I am bothering with all this. This will become apparent now. We can now use the code we wrote to get our 10 imputed datasets using `purrr::map()`

and simply use `furrr::future_map()`

to parallelize the imputation process:

library(furrr) ## Loading required package: future plan(multiprocess) start <- Sys.time() imp_boys_future <- future_map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE)) end <- Sys.time() - start print(end) ## Time difference of 33.73772 secs

Boooom! Much faster! And simply by loading `{furrr}`

, then using `plan(multiprocess)`

to run the code in parallel (if you forget that, the code will run on a single core) and using `future_map()`

instead of `map()`

.

Let’s take a look at the data:

imp_boys_future_complete <- map(imp_boys_future, mice::complete) imp_boys_future <- map2(.x = seq(1,10), .y = imp_boys_future_complete, ~mutate(.y, imp_id = as.character(.x))) imp_boys_future <- bind_rows(imp_boys_future) imp_boys_future %>% brotools::describe() %>% select(variable, type, n_missing, everything()) ## # A tibble: 10 x 12 ## variable type n_missing mean sd mode min max q25 median #### 1 age Numer… 0 9.16 6.89 0.0350 21.2 1.58 10.5 ## 2 bmi Numer… 0 18.0 3.04 11.8 31.7 15.9 17.4 ## 3 hc Numer… 0 51.6 5.89 33.7 65.0 48.4 53.2 ## 4 hgt Numer… 0 131 46.5 50.0 198 83.0 146 ## 5 tv Numer… 0 8.35 8.09 1.00 25.0 2.00 3.00 ## 6 wgt Numer… 0 37.1 26.0 3.14 117 11.7 34.6 ## 7 imp_id Chara… 0 NA NA 1 NA NA NA NA ## 8 gen Factor 0 NA NA G1 NA NA NA NA ## 9 phb Factor 0 NA NA P1 NA NA NA NA ## 10 reg Factor 0 NA NA south NA NA NA NA ## # ... with 2 more variables: q75 , n_unique

So imputation went from 3.4 minutes (around 200 seconds) to 30 seconds. How cool is that? If you want to play around with `{furrr}`

you must install it from Github, as it is not yet available on CRAN:

devtools::install_github("DavisVaughan/furrr")

If you are not comfortable with `map()`

(and thus `future_map()`

) but still want to impute in parallel, there is this very nice script here to do just that. I created a package around this script, called parlMICE (the same name as the script), to make installation and usage easier. You can install it with like so:

devtools::install_github("b-rodrigues/parlMICE")

If you found this blog post useful, you might want to follow me on twitter for blog post updates.

**leave a comment**for the author, please follow the link and comment on their blog:

**Econometrics and Free Software**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.