**Econometrics and Free Software**, and kindly contributed to R-bloggers)

Today I saw this tweet on my timeline:

For those of us that just can't wait until RStudio officially supports parallel purrr in #rstats, boy have I got something for you.

Introducing `furrr`, parallel purrr through the use of futures. Go ahead, break things, you know you want to:https://t.co/l9z1UC2Tew

— Davis Vaughan (@dvaughan32) April 13, 2018

and as a heavy `{purrr}`

user, as well as the happy owner of a 6-core AMD Ryzen 5 1600X cpu, I was very excited to try out `{furrr}`

. For those unfamiliar with `{purrr}`

, you can read some of my previous blog posts on it here, here or here.

To summarize very quickly: `{purrr}`

contains so-called higher order functions, which are functions that take other functions as argument. One such function is `map()`

. Consider the following simple example:

`numbers <- seq(1, 10)`

If you want the square root of this numbers, you can of course simply use the `sqrt()`

function, because it is vectorized:

`sqrt(numbers)`

```
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
## [8] 2.828427 3.000000 3.162278
```

But in a lot of situations, the solution is not so simple. Sometimes you have to loop over the values. This is what we would need to do if `sqrt()`

was not vectorized:

```
sqrt_numbers <- rep(0, 10)
for(i in length(numbers)){
sqrt_numbers[i] <- sqrt(numbers[i])
}
```

First, you need to initialize a container, and then you have to populate the `sqrt_numbers`

list with the results. Using, `{purrr}`

is way easier:

```
library(tidyverse)
map(numbers, sqrt)
```

```
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.236068
##
## [[6]]
## [1] 2.44949
##
## [[7]]
## [1] 2.645751
##
## [[8]]
## [1] 2.828427
##
## [[9]]
## [1] 3
##
## [[10]]
## [1] 3.162278
```

`map()`

is only one of the nice functions that are bundled inside `{purrr}`

. Mastering `{purrr}`

can really make you a much more efficient R programmer. Anyways, recently, I have been playing around with imputation and the `{mice}`

package. `{mice}`

comes with an example dataset called `boys`

, let’s take a look at it:

```
library(mice)
data(boys)
brotools::describe(boys) %>%
select(variable, type, n_missing, everything())
```

```
## # A tibble: 9 x 12
## variable type n_missing mean sd mode min max q25 median
##
```
## 1 age Numeric 0 9.16 6.89 0.0350 21.2 1.58 10.5
## 2 bmi Numeric 21 18.1 3.05 11.8 31.7 15.9 17.4
## 3 hc Numeric 46 51.5 5.91 33.7 65.0 48.1 53.0
## 4 hgt Numeric 20 132 46.5 50.0 198 84.9 147
## 5 tv Numeric 522 11.9 7.99 1.00 25.0 4.00 12.0
## 6 wgt Numeric 4 37.2 26.0 3.14 117 11.7 34.7
## 7 gen Factor 503 NA NA NA NA NA NA
## 8 phb Factor 503 NA NA NA NA NA NA
## 9 reg Factor 3 NA NA south NA NA NA NA
## # ... with 2 more variables: q75 , n_unique

In the code above I use the `describe()`

function from my personal package to get some summary statistics of the `boys`

dataset (you can read more about this function here). I am especially interested in the number of missing values, which is why I re-order the columns. If I did not re-order the columns, it would not appear in the output on my blog.

We see that some columns have a lot of missing values. Using the `mice`

function, it is very easy to impute them:

```
start <- Sys.time()
imp_boys <- mice(boys, m = 10, maxit = 100, printFlag = FALSE)
end <- Sys.time() - start
print(end)
```

`## Time difference of 3.290611 mins`

Imputation on a single core took around 3 minutes on my computer. This might seem ok, but if you have a larger data set with more variables, 3 minutes can become 3 hours. And if you increase `maxit`

, which helps convergence, or the number of imputations, 3 hours can become 30 hours. With a 6-core CPU this could potentially be brought down to 5 hours (in theory). Let’s see if we can go faster, but first let’s take a look at the imputed data.

The `mice()`

function returns a `mids`

object. If you want to look at the data, you have to use the `complete()`

function (careful, there is also a `complete()`

function in the `{tidyr}`

package, so to avoid problems, I suggest you explicitely call `mice::complete()`

):

```
imp_boys <- mice::complete(imp_boys, "long")
brotools::describe(imp_boys) %>%
select(variable, type, n_missing, everything())
```

```
## # A tibble: 11 x 12
## variable type n_missing mean sd mode min max q25 median
##
```
## 1 age Numer… 0 9.16 6.89 0.0350 21.2 1.58 10.5
## 2 bmi Numer… 0 18.0 3.03 11.8 31.7 15.9 17.4
## 3 hc Numer… 0 51.6 5.89 33.7 65.0 48.3 53.2
## 4 hgt Numer… 0 131 46.5 50.0 198 83.0 146
## 5 tv Numer… 0 8.39 8.09 1.00 25.0 2.00 3.00
## 6 wgt Numer… 0 37.1 26.0 3.14 117 11.7 34.6
## 7 .id Factor 0 NA NA 3 NA NA NA NA
## 8 .imp Factor 0 NA NA 1 NA NA NA NA
## 9 gen Factor 0 NA NA G1 NA NA NA NA
## 10 phb Factor 0 NA NA P1 NA NA NA NA
## 11 reg Factor 0 NA NA south NA NA NA NA
## # ... with 2 more variables: q75 , n_unique

As expected, no more missing values. The “long” argument inside `mice::complete()`

is needed if you want the `complete()`

function to return a long dataset. Doing the above “manually” using `{purrr}`

is possible with the following code:

```
start <- Sys.time()
imp_boys_purrr <- map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start
print(end)
```

`## Time difference of 3.393966 mins`

What this does is map the function `~mice(data = boys, m = ., maxit = 100, printFlag = FALSE)`

to a list of `1`

s, and creates 10 imputed data sets. `m = .`

means that `m`

will be equal to whatever is inside the list we are mapping our function over, so `1`

, then `1`

then another `1`

etc…. It took around the same amount of time as using `mice()`

directly.

`imp_boys_purrr`

is now a list of 10 `mids`

objects. We thus need to map `mice::complete()`

to `imp_boys_purrr`

to get the data:

`imp_boys_purrr_complete <- map(imp_boys_purrr, mice::complete)`

Now, `imp_boys_purrr_complete`

is a list of 10 datasets. Let’s map `brotools::describe()`

to it:

`map(imp_boys_purrr_complete, brotools::describe)`

```
## [[1]]
## # A tibble: 9 x 12
## variable type mean sd mode min max q25 median q75
##
```
## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3
## 2 bmi Numeric 18.0 3.03 11.8 31.7 15.9 17.4 19.5
## 3 hc Numeric 51.7 5.90 33.7 65.0 48.3 53.1 56.0
## 4 hgt Numeric 131 46.5 50.0 198 84.0 146 175
## 5 tv Numeric 8.35 8.00 1.00 25.0 2.00 3.00 15.0
## 6 wgt Numeric 37.2 26.0 3.14 117 11.7 34.7 59.6
## 7 gen Factor NA NA G1 NA NA NA NA NA
## 8 phb Factor NA NA P1 NA NA NA NA NA
## 9 reg Factor NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[2]]
## # A tibble: 9 x 12
## variable type mean sd mode min max q25 median q75
##
## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3
## 2 bmi Numeric 18.0 3.03 11.8 31.7 15.9 17.5 19.5
## 3 hc Numeric 51.6 5.88 33.7 65.0 48.3 53.2 56.0
## 4 hgt Numeric 131 46.6 50.0 198 83.5 145 175
## 5 tv Numeric 8.37 8.02 1.00 25.0 2.00 3.00 15.0
## 6 wgt Numeric 37.1 26.0 3.14 117 11.9 34.6 59.6
## 7 gen Factor NA NA G1 NA NA NA NA NA
## 8 phb Factor NA NA P2 NA NA NA NA NA
## 9 reg Factor NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[3]]
## # A tibble: 9 x 12
## variable type mean sd mode min max q25 median q75
##
## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3
## 2 bmi Numeric 18.1 3.04 11.8 31.7 15.9 17.5 19.5
## 3 hc Numeric 51.6 5.87 33.7 65.0 48.5 53.3 56.0
## 4 hgt Numeric 131 46.6 50.0 198 83.0 145 175
## 5 tv Numeric 8.46 8.14 1.00 25.0 2.00 3.00 15.0
## 6 wgt Numeric 37.2 26.1 3.14 117 11.7 34.6 59.6
## 7 gen Factor NA NA G1 NA NA NA NA NA
## 8 phb Factor NA NA P1 NA NA NA NA NA
## 9 reg Factor NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[4]]
## # A tibble: 9 x 12
## variable type mean sd mode min max q25 median q75
##
## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3
## 2 bmi Numeric 18.1 3.02 11.8 31.7 15.9 17.5 19.4
## 3 hc Numeric 51.7 5.93 33.7 65.0 48.5 53.4 56.0
## 4 hgt Numeric 131 46.5 50.0 198 82.9 145 175
## 5 tv Numeric 8.45 8.11 1.00 25.0 2.00 3.00 15.0
## 6 wgt Numeric 37.2 26.0 3.14 117 11.7 34.7 59.6
## 7 gen Factor NA NA G1 NA NA NA NA NA
## 8 phb Factor NA NA P1 NA NA NA NA NA
## 9 reg Factor NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[5]]
## # A tibble: 9 x 12
## variable type mean sd mode min max q25 median q75
##
## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3
## 2 bmi Numeric 18.0 3.03 11.8 31.7 15.9 17.5 19.5
## 3 hc Numeric 51.6 5.91 33.7 65.0 48.3 53.2 56.0
## 4 hgt Numeric 131 46.6 50.0 198 83.0 146 175
## 5 tv Numeric 8.21 8.02 1.00 25.0 2.00 3.00 15.0
## 6 wgt Numeric 37.1 26.0 3.14 117 11.7 34.6 59.6
## 7 gen Factor NA NA G1 NA NA NA NA NA
## 8 phb Factor NA NA P1 NA NA NA NA NA
## 9 reg Factor NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[6]]
## # A tibble: 9 x 12
## variable type mean sd mode min max q25 median q75
##
## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3
## 2 bmi Numeric 18.0 3.05 11.8 31.7 15.9 17.4 19.5
## 3 hc Numeric 51.7 5.89 33.7 65.0 48.3 53.2 56.0
## 4 hgt Numeric 131 46.5 50.0 198 83.0 146 175
## 5 tv Numeric 8.44 8.24 1.00 25.0 2.00 3.00 15.0
## 6 wgt Numeric 37.1 26.0 3.14 117 11.7 34.6 59.6
## 7 gen Factor NA NA G1 NA NA NA NA NA
## 8 phb Factor NA NA P1 NA NA NA NA NA
## 9 reg Factor NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[7]]
## # A tibble: 9 x 12
## variable type mean sd mode min max q25 median q75
##
## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3
## 2 bmi Numeric 18.1 3.04 11.8 31.7 15.9 17.4 19.5
## 3 hc Numeric 51.6 5.88 33.7 65.0 48.2 53.2 56.0
## 4 hgt Numeric 131 46.6 50.0 198 83.5 146 175
## 5 tv Numeric 8.47 8.15 1.00 25.0 2.00 3.00 15.0
## 6 wgt Numeric 37.2 26.1 3.14 117 11.7 34.6 59.6
## 7 gen Factor NA NA G1 NA NA NA NA NA
## 8 phb Factor NA NA P1 NA NA NA NA NA
## 9 reg Factor NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[8]]
## # A tibble: 9 x 12
## variable type mean sd mode min max q25 median q75
##
## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3
## 2 bmi Numeric 18.0 3.04 11.8 31.7 15.9 17.4 19.4
## 3 hc Numeric 51.6 5.85 33.7 65.0 48.2 53.3 56.0
## 4 hgt Numeric 131 46.5 50.0 198 83.0 146 175
## 5 tv Numeric 8.36 8.06 1.00 25.0 2.00 3.00 15.0
## 6 wgt Numeric 37.2 26.1 3.14 117 11.7 34.6 59.6
## 7 gen Factor NA NA G1 NA NA NA NA NA
## 8 phb Factor NA NA P1 NA NA NA NA NA
## 9 reg Factor NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[9]]
## # A tibble: 9 x 12
## variable type mean sd mode min max q25 median q75
##
## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3
## 2 bmi Numeric 18.0 3.05 11.8 31.7 15.9 17.4 19.5
## 3 hc Numeric 51.6 5.90 33.7 65.0 48.3 53.2 56.0
## 4 hgt Numeric 131 46.6 50.0 198 83.9 146 175
## 5 tv Numeric 8.57 8.25 1.00 25.0 2.00 3.00 15.0
## 6 wgt Numeric 37.1 26.1 3.14 117 11.7 34.6 59.6
## 7 gen Factor NA NA G1 NA NA NA NA NA
## 8 phb Factor NA NA P1 NA NA NA NA NA
## 9 reg Factor NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing , n_unique
##
## [[10]]
## # A tibble: 9 x 12
## variable type mean sd mode min max q25 median q75
##
## 1 age Numeric 9.16 6.89 0.0350 21.2 1.58 10.5 15.3
## 2 bmi Numeric 18.0 3.04 11.8 31.7 15.9 17.4 19.5
## 3 hc Numeric 51.6 5.89 33.7 65.0 48.3 53.1 56.0
## 4 hgt Numeric 131 46.6 50.0 198 83.0 146 175
## 5 tv Numeric 8.49 8.18 1.00 25.0 2.00 3.00 15.0
## 6 wgt Numeric 37.1 26.1 3.14 117 11.7 34.6 59.6
## 7 gen Factor NA NA G1 NA NA NA NA NA
## 8 phb Factor NA NA P1 NA NA NA NA NA
## 9 reg Factor NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing , n_unique

Before merging this 10 datasets together into one, it would be nice to have a column with the id of the datasets. This can easily be done with a variant of `purrr::map()`

, called `map2()`

:

`imp_boys_purrr <- map2(.x = seq(1,10), .y = imp_boys_purrr_complete, ~mutate(.y, imp_id = as.character(.x)))`

`map2()`

applies a function, say `f()`

, to 2 lists sequentially: `f(x_1, y_1)`

, then `f(x_2, y_2)`

, etc… So here I map `mutate()`

to create a new column, `imp_id`

in each dataset. Now let’s bind the rows and take a look at the data:

```
imp_boys_purrr <- bind_rows(imp_boys_purrr)
imp_boys_purrr %>%
brotools::describe() %>%
select(variable, type, n_missing, everything())
```

```
## # A tibble: 10 x 12
## variable type n_missing mean sd mode min max q25 median
##
```
## 1 age Numer… 0 9.16 6.89 0.0350 21.2 1.58 10.5
## 2 bmi Numer… 0 18.0 3.04 11.8 31.7 15.9 17.4
## 3 hc Numer… 0 51.6 5.89 33.7 65.0 48.3 53.2
## 4 hgt Numer… 0 131 46.5 50.0 198 83.0 146
## 5 tv Numer… 0 8.42 8.11 1.00 25.0 2.00 3.00
## 6 wgt Numer… 0 37.1 26.0 3.14 117 11.7 34.6
## 7 imp_id Chara… 0 NA NA 1 NA NA NA NA
## 8 gen Factor 0 NA NA G1 NA NA NA NA
## 9 phb Factor 0 NA NA P1 NA NA NA NA
## 10 reg Factor 0 NA NA south NA NA NA NA
## # ... with 2 more variables: q75 , n_unique

You may ask yourself why I am bothering with all this. This will become apparent now. We can now use the code we wrote to get our 10 imputed datasets using `purrr::map()`

and simply use `furrr::future_map()`

to parallelize the imputation process:

`library(furrr)`

`## Loading required package: future`

```
plan(multiprocess)
start <- Sys.time()
imp_boys_future <- future_map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start
print(end)
```

`## Time difference of 33.73772 secs`

Boooom! Much faster! And simply by loading `{furrr}`

, then using `plan(multiprocess)`

to run the code in parallel (if you forget that, the code will run on a single core) and using `future_map()`

instead of `map()`

.

Let’s take a look at the data:

```
imp_boys_future_complete <- map(imp_boys_future, mice::complete)
imp_boys_future <- map2(.x = seq(1,10), .y = imp_boys_future_complete, ~mutate(.y, imp_id = as.character(.x)))
imp_boys_future <- bind_rows(imp_boys_future)
imp_boys_future %>%
brotools::describe() %>%
select(variable, type, n_missing, everything())
```

```
## # A tibble: 10 x 12
## variable type n_missing mean sd mode min max q25 median
##
```
## 1 age Numer… 0 9.16 6.89 0.0350 21.2 1.58 10.5
## 2 bmi Numer… 0 18.0 3.04 11.8 31.7 15.9 17.4
## 3 hc Numer… 0 51.6 5.89 33.7 65.0 48.4 53.2
## 4 hgt Numer… 0 131 46.5 50.0 198 83.0 146
## 5 tv Numer… 0 8.35 8.09 1.00 25.0 2.00 3.00
## 6 wgt Numer… 0 37.1 26.0 3.14 117 11.7 34.6
## 7 imp_id Chara… 0 NA NA 1 NA NA NA NA
## 8 gen Factor 0 NA NA G1 NA NA NA NA
## 9 phb Factor 0 NA NA P1 NA NA NA NA
## 10 reg Factor 0 NA NA south NA NA NA NA
## # ... with 2 more variables: q75 , n_unique

So imputation went from 3.4 minutes (around 200 seconds) to 30 seconds. How cool is that? If you want to play around with `{furrr}`

you must install it from Github, as it is not yet available on CRAN:

`devtools::install_github("DavisVaughan/furrr")`

If you are not comfortable with `map()`

(and thus `future_map()`

) but still want to impute in parallel, there is this very nice script here to do just that. I created a package around this script, called parlMICE (the same name as the script), to make installation and usage easier. You can install it with like so:

`devtools::install_github("b-rodrigues/parlMICE")`

If you found this blog post useful, you might want to follow me on twitter for blog post updates.

**leave a comment**for the author, please follow the link and comment on their blog:

**Econometrics and Free Software**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...