Site icon R-bloggers

Improve The Efficiency in Joining Data with Index

[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When managing big data with R, many people like to use sqldf() package due to its friendly interface or choose data.table() package for its lightening speed. However, very few would pay special attentions to small details that might significantly boost the efficiency of these packages by adding index to the data.frame or data.table.

In my post on 01/29/2013 (http://statcompute.wordpress.com/2013/01/29/another-benchmark-for-joining-two-data-frames), I’ve shown how to effectively join two data.frames / data.tables. However, the example is not intuitive for people to fully understand the benefit of adding index. In the demonstration below, I will compare 2 scenarios, one with the index and the other without, to show the extra efficiency gained by a simple index.

It is also important to note that creating the index in “ldf” would have the effect of adding the data.frame “ldf” from the R workspace to SQLite database. Therefore, in the 2nd “select…” statement, we need to add “main.” in front of “ldf” in order to use the indexed table “ldf” in SQLite instead of the unindexed table “ldf” in the R environment.

As shown in the benchmark table, simply adding an index can significantly reduce the user time with sqldf package and improves somewhat with data.table package.

libs <- c('sqldf', 'data.table', 'rbenchmark')
lapply(libs, require, character.only = T)

n <- 1000000
set.seed(1)
ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))

benchmark(replications = 5, order = "user.self", 
  noindex.sqldf = (sqldf('select * from ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2')),
  indexed.sqldf = (sqldf(c('create index ldx on ldf(id1, id2)', 
                           'select * from main.ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2')))
)

benchmark(replications = 5, order = "user.self", 
  noindex.table = {
    ldt <- data.table(ldf)
    rdt <- data.table(rdf)
    merge(ldt, rdt, by = c('id1', 'id2'))
  },
  indexed.table = {
    ldt <- data.table(ldf, key = 'id1,id2')
    rdt <- data.table(rdf, key = 'id1,id2')
    merge(ldt, rdt, by = c('id1', 'id2'))
  }
)

SQLDF OUTCOMES

           test replications elapsed relative user.self sys.self user.child
2 indexed.sqldf            5  34.774    1.000    34.511    0.244          0
1 noindex.sqldf            5  61.873    1.779    44.918   16.941          0

DATA.TABLE OUTCOMES

           test replications elapsed relative user.self sys.self user.child
2 indexed.table            5   6.719    1.000     6.609    0.104          0
1 noindex.table            5   6.777    1.009     6.696    0.076          0

To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.