Efficiency in Joining Two Data Frames

January 28, 2013
By

(This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers)

In R, there are multiple ways to merge 2 data frames. However, there could be a huge disparity in terms of efficiency. Therefore, it is worthwhile to test the performance among different methods and choose the correct approach in the real-world work.

For smaller data frames with 1,000 rows, all six methods shown below seem to work pretty well except that the approach with sql.select() is significantly slower than the rest. The generic merge() function in the base package is a very natural choice without much overhead of loading additional libraries and converting data frame. sqldf() is also attractive in that it might be the most user-friendly function with a very intuitive syntax.

> n <- 1000
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
> 
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
   user  system elapsed 
  0.032   0.012   0.064 
> 
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
   user  system elapsed 
  0.028   0.000   0.044 
> 
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
   user  system elapsed 
  0.044   0.004   0.096 
> 
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)", 
+                                     "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
   user  system elapsed 
  0.168   0.008   0.332 
> 
> # METHOD 5: PLYR
> system.time(join5 <- plyr::join(ldf, rdf, by = c("id1", "id2"), type = "inner"))
   user  system elapsed 
  0.088   0.020   0.152 
> 
> # METHOD 6: SQL.SELECT
> source("http://sqlselect.googlecode.com/svn/trunk/sql.select.R")
> system.time(join6 <- sql.select("select * from ldf inner join rdf on (`ldf$id1 == rdf$id1 & ldf$id2 == rdf$id2`)"))
   user  system elapsed 
 53.775  19.725  73.813 

However, when it comes to mid-size data frames with 1,000,000 rows, the story has changed. First of all, out of six methods shown above, the last two fails directly due to the insufficient memory size in my 32-bit ubuntu virtual machine. In this case, data.table package shows a significant advantage after converting 2 data.frames to data.tables. In additional, it is interesting that although ff and sqldf packages are slower than merge() function for the small-size data with 1,000 rows, both of them seem slightly faster for the data with 1,000,000 rows.

> n <- 1000 ^ 2
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
> 
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
   user  system elapsed 
 55.223  12.437  68.054 
> 
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
   user  system elapsed 
  0.484   0.008   0.492 
> 
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
   user  system elapsed 
 49.811  13.821  64.004 
> 
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)", 
+                                     "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
   user  system elapsed 
 40.418   1.268  42.076 

To leave a comment for the author, please follow the link and comment on his blog: Yet Another Blog in Statistical Computing » S+/R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.