R: Reduce() Part 2 – some pitfalls using Reduce


By Matthias Templ (ZHAW), Thoralf Mildenberger (ZHAW)

By way of example, the functionality of Reduce() is shown in https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/ . It is great to learn how to use this function on interesting problems. If you are ready (that is, if you have read the first blog post on Reduce), we want to push you further towards writing efficient code.

Reduce() and other base R functionality can be extremely slow compared to the data manipulation functions of the packages dplyr or data.table. In addition, compared to other software and languages, solving recursive problems with R's Reduce() is very slow (see e.g. the recursive problems at Rosetta Code), and this is sometimes used to argue that R itself is slow, which is simply not true.

This blog post discusses some drawbacks of Reduce and base R related to user-friendliness of code and computational speed. We revisit two examples from https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/ .

Example: Merging multiple data sets

Joining several data sets can be done with many packages, but to the best of our knowledge it is done best with package data.table: the syntax is clear and the computational speed is just amazing.

First, the merge() and Reduce() solutions with base R:

df1 <- data.frame(Customer_ID = c(1, 2, 3), Name = c("John", "Sue", "Ann"))
df2 <- data.frame(Customer_ID = c(1, 3), Year_First_Order = c(2011, 2017))
df3 <- data.frame(Customer_ID = c(1, 2, 3), 
                  Date_Last_Order = c("2017-03-03", "2014-03-01", "2017-05-30"),
                  No_Items_Last_Order = c(3, 1, 1),
                  Total_Amount_Last_Order = c(49, 25,25))
df4 <- data.frame(Customer_ID = c(2, 3), Interested_In_Promo = c(TRUE, FALSE))

We could first join the first two data frames, then join the result with the third, and then join the result of that with the fourth table. In the following, a LEFT OUTER join is done using base R:

df_merge <- merge(df1, df2, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df3, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df4, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge
##   Customer_ID Name Year_First_Order Date_Last_Order No_Items_Last_Order
## 1           1 John             2011      2017-03-03                   3
## 2           2  Sue               NA      2014-03-01                   1
## 3           3  Ann             2017      2017-05-30                   1
##   Total_Amount_Last_Order Interested_In_Promo
## 1                      49                  NA
## 2                      25                TRUE
## 3                      25               FALSE

When all the data frames are stored in one list, Reduce() can be applied in the following manner:

df_list <- list(df1, df2, df3, df4)
Reduce(function(d1, d2) merge(d1, d2, by = "Customer_ID", all.x = TRUE, all.y = FALSE), df_list)
##   Customer_ID Name Year_First_Order Date_Last_Order No_Items_Last_Order
## 1           1 John             2011      2017-03-03                   3
## 2           2  Sue               NA      2014-03-01                   1
## 3           3  Ann             2017      2017-05-30                   1
##   Total_Amount_Last_Order Interested_In_Promo
## 1                      49                  NA
## 2                      25                TRUE
## 3                      25               FALSE

But is this computationally efficient?

We need a larger data set to get some insights.

n <- 300000
df1 <- data.frame(Customer_ID = 1:n, Name = rep(c("John", "Sue", "Ann"), n/3))
df2 <- data.frame(Customer_ID = sample(1:n, size = n*2/3), Year_First_Order = sample(2000:2017, size = n*2/3, replace = TRUE))
df3 <- data.frame(Customer_ID = 1:n, 
                  Date_Last_Order = rep(c("2017-03-03", "2014-03-01", "2017-05-30"),n/3),
                  No_Items_Last_Order = sample(1:10, size = n, replace = TRUE),
                  Total_Amount_Last_Order = sample(20:50, size = n, replace = TRUE))
df4 <- data.frame(Customer_ID = sample(1:n, size = n*2/3), Interested_In_Promo = sample(c(TRUE, FALSE), size = n*2/3, replace = TRUE))

We use the microbenchmark package to assess the computational speed. First we put our code in functions as input for microbenchmark.

m1 <- function(df1, df2, df3, df4){
  df_merge <- merge(df1, df2, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
  df_merge <- merge(df_merge, df3, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
  df_merge <- merge(df_merge, df4, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
  df_merge
}
m2 <- function(df1, df2, df3, df4){
  df_list <- list(df1, df2, df3, df4)
  Reduce(function(d1, d2) merge(d1, d2, by = "Customer_ID", all.x = TRUE, 
                                all.y = FALSE), df_list)
}
library("microbenchmark")
mbm <- microbenchmark(
  mergeBase = m1(df1, df2, df3, df4),
  Reduce = m2(df1, df2, df3, df4),
  times = 10
)
mbm
## Unit: seconds
##       expr      min       lq     mean   median       uq      max neval cld
##  mergeBase 1.028722 1.120588 1.126353 1.131829 1.152066 1.170740    10   a
##     Reduce 1.105373 1.144046 1.168305 1.158710 1.179812 1.306395    10   a

About 1.15 seconds for merging the four data frames. Not bad, but if we increased the size of our data set only slightly, the computation times would increase dramatically.
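
If you want to check this claim yourself, one option (our addition, not part of the original post) is to wrap the data generation from above in a small helper and rerun the benchmark for larger n; make_data is our own name, and n should be divisible by 3.

make_data <- function(n){
  list(df1 = data.frame(Customer_ID = 1:n,
                        Name = rep(c("John", "Sue", "Ann"), n/3)),
       df2 = data.frame(Customer_ID = sample(1:n, size = n*2/3),
                        Year_First_Order = sample(2000:2017, size = n*2/3, replace = TRUE)),
       df3 = data.frame(Customer_ID = 1:n,
                        Date_Last_Order = rep(c("2017-03-03", "2014-03-01", "2017-05-30"), n/3),
                        No_Items_Last_Order = sample(1:10, size = n, replace = TRUE),
                        Total_Amount_Last_Order = sample(20:50, size = n, replace = TRUE)),
       df4 = data.frame(Customer_ID = sample(1:n, size = n*2/3),
                        Interested_In_Promo = sample(c(TRUE, FALSE), size = n*2/3, replace = TRUE)))
}
# e.g. dfs <- make_data(3 * 10^6)
# microbenchmark(mergeBase = m1(dfs$df1, dfs$df2, dfs$df3, dfs$df4), times = 5)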

data.table’s solution to the above problem

It is natural to ask whether more efficient code can be written that provides the same results, and whether the syntax can be made more readable than above.

Let’s replace our merge and Reduce solutions with data.table code.

We first load the package and store the data frames as data tables. In addition, we set a key on each table so that the efficient binary search of the data.table package can be used.

library("data.table")
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt3 <- data.table(df3)
dt4 <- data.table(df4)
setkey(dt1, Customer_ID)
setkey(dt2, Customer_ID) 
setkey(dt3, Customer_ID) 
setkey(dt4, Customer_ID) 

The following data.table code gives us exactly the same solution as with our previous base R code.

Note that the syntax is much shorter and more user-friendly. This is all you need to write for the LEFT OUTER join of the given problem:

DT_cmb <- dt4[dt3][dt2][dt1]
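
As a quick sanity check (our addition, not in the original post), we can verify that the chained join reproduces the base R merge result up to row and column order:

res_base <- m1(df1, df2, df3, df4)                    # base R merge chain
res_dt   <- as.data.frame(DT_cmb)[, names(res_base)]  # align column order
all.equal(res_base[order(res_base$Customer_ID), ],
          res_dt[order(res_dt$Customer_ID), ],
          check.attributes = FALSE)                   # should return TRUE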

Let's compare the speed of all three approaches (pure merge, Reduce and data.table).

m3 <- function(dt1, dt2, dt3, dt4){
  setkey(dt1, Customer_ID)
  setkey(dt2, Customer_ID) 
  setkey(dt3, Customer_ID) 
  setkey(dt4, Customer_ID) 
  DT_cmb <- dt4[dt3][dt2][dt1]
  DT_cmb
}
library("microbenchmark")
mbm <- microbenchmark(
  mergeBase = m1(df1, df2, df3, df4),
  Reduce = m2(df1, df2, df3, df4),
  dt = m3(dt1, dt2, dt3, dt4),
  times = 10
)
mbm
## Unit: milliseconds
##       expr        min         lq       mean     median         uq      max
##  mergeBase 1010.03602 1030.72226 1099.73585 1047.30145 1208.32011 1286.489
##     Reduce  968.37101 1039.04347 1097.75323 1048.05562 1201.69135 1284.099
##         dt   43.24658   50.42165   66.52315   53.34681   70.50837  140.710
##  neval cld
##     10   b
##     10   b
##     10  a

The solution with data.table is more than 15 times faster, and the code is much shorter and more elegant. Note that the difference becomes even (much) larger as the data size increases.

Example: Sums of matrix powers

The lapply/Reduce code:

We can also combine lapply() and Reduce(), as is done in the following example.

library("expm")
n <- 2
P <- matrix(runif(n*n, 0.01, 0.02), ncol = n, nrow = n)
p1 <- function(vec, P){
  P_powers <- lapply(vec, function(k) P %^% k)
  Reduce("+", P_powers)
}
p1(0:2, P = P)
##            [,1]       [,2]
## [1,] 1.01718756 0.01206923
## [2,] 0.01409565 1.01037824

What is the aim of this code? It is not straightforward to see: first different powers of the matrix P are computed, then the sum of the resulting matrices is calculated. We can ask whether we can do this with more intuitive and user-friendly code (just a simple for loop), gain computational speed, and avoid storing the whole list of matrix powers as above.

General note: since R version 3.0.0, for loops are approximately as fast as the apply family (at least for typical problems). Nested use of the apply family results in hard-to-read code, especially when several apply statements appear in one line, and debugging becomes more difficult.

This is also to some extent the case with the solution above, where lapply() and Reduce() are combined.

The simple for() loop to the problem:

The following loop simply updates the matrix by adding the next matrix power to the previous state.

p2 <- function(vec, P){
  P2 <- P %^% vec[1]
  for(i in vec[-1]){
    P2 <- P2 + P %^% i
  } 
  P2
}
p2(0:2, P = P)
##            [,1]       [,2]
## [1,] 1.01718756 0.01206923
## [2,] 0.01409565 1.01037824

This indeed gives the same result. Let's increase the dimension of P and the number of powers applied.

n <- 10
P <- matrix(runif(n*n, 0.01, 0.02), ncol = n, nrow = n)
mbm <- microbenchmark(
  Reduce = p1(seq_len(10000), P = P),
  Simple = p2(seq_len(10000), P = P),
  times = 10
)
mbm
## Unit: milliseconds
##    expr      min       lq     mean   median       uq      max neval cld
##  Reduce 122.9000 124.2832 131.1854 133.8611 135.4301 136.7512    10   b
##  Simple 109.7845 111.0879 113.7115 112.0759 113.4696 124.5798    10  a

The speed is comparable, with the simple for loop even slightly faster; naturally the code is (much) more readable, and debugging would be easier when using a for loop.

Note that an additional gain in computational speed can be obtained by using C++ within R, e.g. via package Rcpp. However, in this case it is not straightforward to apply: either the functionality of package expm must be made available within an Rcpp function, or matrix powering must first be implemented in C++.
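
As a rough illustration of the second option, here is a minimal sketch (our addition, not benchmarked here) that implements the matrix power sum directly in C++ via RcppArmadillo; the function name sum_mat_powers is our own choice.

library("Rcpp")
cppFunction(depends = "RcppArmadillo", code = '
arma::mat sum_mat_powers(const arma::mat& P, const int kmax) {
  // accumulate P^0 + P^1 + ... + P^kmax by repeated multiplication
  arma::mat acc = arma::eye<arma::mat>(P.n_rows, P.n_cols);  // P^0
  arma::mat cur = acc;
  for (int k = 1; k <= kmax; ++k) {
    cur = cur * P;   // cur is now P^k
    acc += cur;
  }
  return acc;
}')
sum_mat_powers(P, 2)  # should match p1(0:2, P = P) and p2(0:2, P = P)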

Further reading

Package data.table is incredibly efficient in terms of computational speed. Use it whenever you deal with large (rectangular) data sets. Read its vignettes, tutorials and package manual. Note that package dplyr, which was not used here, is also fast and very user-friendly.
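
For illustration, the same left joins can be written with dplyr, again combined via Reduce() (a sketch we add here, assuming dplyr is installed; it is not benchmarked in this post):

library("dplyr")
Reduce(function(d1, d2) left_join(d1, d2, by = "Customer_ID"),
       list(df1, df2, df3, df4))

Benchmarking this against the data.table chain is left as an exercise for the interested reader.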
