Improve The Efficiency in Joining Data with Index

June 9, 2013
By

(This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers)

When managing big data with R, many people like to use sqldf() package due to its friendly interface or choose data.table() package for its lightening speed. However, very few would pay special attentions to small details that might significantly boost the efficiency of these packages by adding index to the data.frame or data.table.

In my post on 01/29/2013 (http://statcompute.wordpress.com/2013/01/29/another-benchmark-for-joining-two-data-frames), I’ve shown how to effectively join two data.frames / data.tables. However, the example is not intuitive for people to fully understand the benefit of adding index. In the demonstration below, I will compare 2 scenarios, one with the index and the other without, to show the extra efficiency gained by a simple index.

It is also important to note that creating the index in “ldf” would have the effect of adding the data.frame “ldf” from the R workspace to SQLite database. Therefore, in the 2nd “select…” statement, we need to add “main.” in front of “ldf” in order to use the indexed table “ldf” in SQLite instead of the unindexed table “ldf” in the R environment.

As shown in the benchmark table, simply adding an index can significantly reduce the user time with sqldf package and improves somewhat with data.table package.

libs <- c('sqldf', 'data.table', 'rbenchmark')
lapply(libs, require, character.only = T)

n <- 1000000
set.seed(1)
ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))

benchmark(replications = 5, order = "user.self", 
  noindex.sqldf = (sqldf('select * from ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2')),
  indexed.sqldf = (sqldf(c('create index ldx on ldf(id1, id2)', 
                           'select * from main.ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2')))
)

benchmark(replications = 5, order = "user.self", 
  noindex.table = {
    ldt <- data.table(ldf)
    rdt <- data.table(rdf)
    merge(ldt, rdt, by = c('id1', 'id2'))
  },
  indexed.table = {
    ldt <- data.table(ldf, key = 'id1,id2')
    rdt <- data.table(rdf, key = 'id1,id2')
    merge(ldt, rdt, by = c('id1', 'id2'))
  }
)

SQLDF OUTCOMES

           test replications elapsed relative user.self sys.self user.child
2 indexed.sqldf            5  34.774    1.000    34.511    0.244          0
1 noindex.sqldf            5  61.873    1.779    44.918   16.941          0

DATA.TABLE OUTCOMES

           test replications elapsed relative user.self sys.self user.child
2 indexed.table            5   6.719    1.000     6.609    0.104          0
1 noindex.table            5   6.777    1.009     6.696    0.076          0

To leave a comment for the author, please follow the link and comment on his blog: Yet Another Blog in Statistical Computing » S+/R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.