Efficiency in Joining Two Data Frames

[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In R, there are multiple ways to merge 2 data frames. However, there could be a huge disparity in terms of efficiency. Therefore, it is worthwhile to test the performance among different methods and choose the correct approach in the real-world work.

For smaller data frames with 1,000 rows, all six methods shown below seem to work pretty well except that the approach with sql.select() is significantly slower than the rest. The generic merge() function in the base package is a very natural choice without much overhead of loading additional libraries and converting data frame. sqldf() is also attractive in that it might be the most user-friendly function with a very intuitive syntax.

> n <- 1000
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
> 
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
   user  system elapsed 
  0.032   0.012   0.064 
> 
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
   user  system elapsed 
  0.028   0.000   0.044 
> 
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
   user  system elapsed 
  0.044   0.004   0.096 
> 
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)", 
+                                     "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
   user  system elapsed 
  0.168   0.008   0.332 
> 
> # METHOD 5: PLYR
> system.time(join5 <- plyr::join(ldf, rdf, by = c("id1", "id2"), type = "inner"))
   user  system elapsed 
  0.088   0.020   0.152 
> 
> # METHOD 6: SQL.SELECT
> source("http://sqlselect.googlecode.com/svn/trunk/sql.select.R")
> system.time(join6 <- sql.select("select * from ldf inner join rdf on (`ldf$id1 == rdf$id1 & ldf$id2 == rdf$id2`)"))
   user  system elapsed 
 53.775  19.725  73.813 

However, when it comes to mid-size data frames with 1,000,000 rows, the story has changed. First of all, out of six methods shown above, the last two fails directly due to the insufficient memory size in my 32-bit ubuntu virtual machine. In this case, data.table package shows a significant advantage after converting 2 data.frames to data.tables. In additional, it is interesting that although ff and sqldf packages are slower than merge() function for the small-size data with 1,000 rows, both of them seem slightly faster for the data with 1,000,000 rows.

> n <- 1000 ^ 2
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
> 
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
   user  system elapsed 
 55.223  12.437  68.054 
> 
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
   user  system elapsed 
  0.484   0.008   0.492 
> 
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
   user  system elapsed 
 49.811  13.821  64.004 
> 
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)", 
+                                     "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
   user  system elapsed 
 40.418   1.268  42.076 

To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)