Mimicking SQLDF with MonetDBLite

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Like many useRs, I am also a big fan of the sqldf package developed by Grothendieck, which uses SQL statement for data frame manipulations with SQLite embedded database as the default back-end.

In examples below, I drafted a couple R utility functions with the MonetDBLite back-end by mimicking the sqldf function. There are several interesting observations shown in the benchmark comparison.
– The data import for csv data files is more efficient with MonetDBLite than with the generic read.csv function or read.csv.sql function in the sqldf package.
– The data manipulation for a single data frame, such as selection, aggregation, and subquery, is also significantly faster with MonetDBLite than with the sqldf function.
– However, the sqldf function is extremely efficient in joining 2 data frames, e.g. inner join in the example.


# IMPORT
monet.read.csv <- function(file) {
  monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:")
  suppressMessages(MonetDBLite::monetdb.read.csv(monet.con, file, "file", sep = ","))
  result <- DBI::dbReadTable(monet.con, "file")
  DBI::dbDisconnect(monet.con, shutdown = T)
  return(result)  
}

microbenchmark::microbenchmark(monet = {df <- monet.read.csv("Downloads/nycflights.csv")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
# monet 528.5378 532.5463 539.2877 539.0902 542.4301 559.1191    10

microbenchmark::microbenchmark(read.csv = {df <- read.csv("Downloads/nycflights.csv")}, times = 10)
#Unit: seconds
#     expr      min       lq     mean   median       uq      max neval
# read.csv 2.310238 2.338134 2.360688 2.343313 2.373913 2.444814    10

# SELECTION AND AGGREGATION
monet.sql <- function(df, sql) {
  df_str <- deparse(substitute(df))
  monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:")  
  suppressMessages(DBI::dbWriteTable(monet.con, df_str, df, overwrite = T))
  result <- DBI::dbGetQuery(monet.con, sql)
  DBI::dbDisconnect(monet.con, shutdown = T)
  return(result)
}

microbenchmark::microbenchmark(monet = {monet.sql(df, "select * from df sample 3")}, times = 10)
#Unit: milliseconds
#  expr     min      lq     mean   median       uq     max neval
# monet 422.761 429.428 439.0438 438.3503 447.3286 453.104    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from df order by RANDOM() limit 3")}, times = 10)
#Unit: milliseconds
#  expr      min      lq     mean   median       uq      max neval
# sqldf 903.9982 908.256 925.4255 920.2692 930.0934 963.6983    10

microbenchmark::microbenchmark(monet = {monet.sql(df, "select origin, median(distance) as med_dist from df group by origin")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
# monet 450.7862 456.9589 458.6389 458.9634 460.4402 465.2253    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select origin, median(distance) as med_dist from df group by origin")}, times = 10)
#Unit: milliseconds
#  expr      min       lq    mean   median       uq      max neval
# sqldf 833.1494 836.6816 841.952 843.5569 846.8117 851.0771    10

microbenchmark::microbenchmark(monet = {monet.sql(df, "with df1 as (select dest, avg(distance) as dist from df group by dest), df2 as (select dest, count(*) as cnts from df group by dest) select * from df1 inner join df2 on (df1.dest = df2.dest)")}, times = 10)
#Unit: milliseconds
#  expr      min       lq    mean   median       uq     max neval
# monet 426.0248 431.2086 437.634 438.4718 442.8799 451.275    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from (select dest, avg(distance) as dist from df group by dest) df1 inner join (select dest, count(*) as cnts from df group by dest) df2 on (df1.dest = df2.dest)")}, times = 10)
#Unit: seconds
#  expr      min       lq     mean   median       uq      max neval
# sqldf 1.013116 1.017248 1.024117 1.021555 1.025668 1.048133    10

# MERGE 
monet.sql2 <- function(df1, df2, sql) {
  df1_str <- deparse(substitute(df1))
  df2_str <- deparse(substitute(df2))
  monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:")  
  suppressMessages(DBI::dbWriteTable(monet.con, df1_str, df1, overwrite = T))
  suppressMessages(DBI::dbWriteTable(monet.con, df2_str, df2, overwrite = T))
  result <- DBI::dbGetQuery(monet.con, sql)
  DBI::dbDisconnect(monet.con, shutdown = T)
  return(result)
}

tbl1 <- monet.sql(df, "select dest, avg(distance) as dist from df group by dest")
tbl2 <- monet.sql(df, "select dest, count(*) as cnts from df group by dest")

microbenchmark::microbenchmark(monet = {monet.sql2(tbl1, tbl2, "select * from tbl1 inner join tbl2 on (tbl1.dest = tbl2.dest)")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean  median       uq      max neval
# monet 93.94973 174.2211 170.7771 178.487 182.4724 187.3155    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from tbl1 inner join tbl2 on (tbl1.dest = tbl2.dest)")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean median       uq      max neval
# sqldf 19.49334 19.60981 20.29535 20.001 20.93383 21.51837    10

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)