Using MonetDB[Lite] with real-world CSV files

November 11, 2015
By

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

MonetDBLite (for R) was announced/released today and, while the examples they provide are compelling there’s a “gotcha” for potential new folks using SQL in general and SQL + MonetDB + R together. The toy example on the site shows dumping mtcars with dbWriteTable and then doing things. Real-world CSV files have headers and commas (MonetDB by default expects no headers and | as a separator). Also, you need to make a MonetDB table (with a schema) before copying your giant CSV file full of data into it. That’s a pain to do by hand.

Here’s another toy example that shows how to:

  • use a specific directory for the embedded MonetDB files
  • auto-generate the CREATE TABLE syntax from a sample of the real-world CSV file
  • load the data from the real-world CSV file (i.e. skipping the header and using a , as a delimiter
  • wire it up to R & dplyr

It’s very similar to the MonetDBLite toy example but may help folks get up and running in the real world with less frustration.

library(MonetDBLite)
library(MonetDB.R)
library(dplyr)
 
# use built-in mtcars to make a CS File
# we're more likely to find a file in this format vs what dbWriteTable produces
# i.e. it has a header and commas for separator
write.csv(add_rownames(mtcars, "auto"), "mtcars.csv", row.names=FALSE)
 
# make a connection and get rid of the old table if it exists since
# we are just playing around. in real life you prbly want to keep
# the giant table there vs recreate it every time
mdb <- dbConnect(MonetDBLite(), "/full/path/to/your/preferred/monetdb/data/dir")
try(invisible(dbSendQuery(mdb, "DROP TABLE mtcars")), silent=TRUE)
 
# now we guess the column types by reading in a small fraction of the rows
guess <- read.csv("mtcars.csv", stringsAsFactors=FALSE, nrows=1000)
create <- sprintf("CREATE TABLE mtcars ( %s )", 
                  paste0(sprintf('"%s" %s', colnames(guess), 
                                 sapply(guess, dbDataType, dbObj=mdb)), collapse=","))
 
# we build the table creation dynamically from what we've learned from guessing
invisible(dbSendQuery(mdb, create))
 
# and then we load the data into the database, skipping the header and specifying a comma
invisible(dbSendQuery(mdb, "COPY OFFSET 2 
                                 INTO mtcars 
                                 FROM '/full/path/to/where/you/wrote/the/csv/to/mtcars.csv' USING  DELIMITERS ','"))
 
# now wire it up to dplyr
mdb_src <- src_monetdb(embedded="/full/path/to/your/preferred/monetdb/data/dir")
mdb_mtcars <- tbl(mdb_src, "mtcars")
 
# and have some fun
count(mdb_mtcars, cyl)
 
## Source: MonetDB  ()
## From:  [?? x 2]
## 
##      cyl     n
##    (int) (dbl)
## 1      6     7
## 2      4    11
## 3      8    14
## ..   ...   ...

To leave a comment for the author, please follow the link and comment on their blog: rud.is » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)