Data Import Efficiency – A Case in R

statcompute

9 years ago

[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers.]

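Below is an R snippet comparing the data import efficiency of CSV, SQLite, and HDF5. Similar to the case in Python posted yesterday, HDF5 shows the lowest user CPU time (0.164 s over 10 reads, versus 0.492 s for SQLite and 1.148 s for CSV). Its system time is much higher, however, so SQLite has the shortest elapsed time in this run.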

> library(RSQLite)
Loading required package: DBI
> library(rhdf5) 
> df <- read.csv('credit_count.csv')
> do.call(cat, list(nrow(df), ncol(df), '\n'))
13444 14 
> 
> # WRITE DF INTO SQLITE
> if(file.exists('data.db')) file.remove('data.db')
[1] TRUE
> con <- dbConnect(SQLite(), dbname = "data.db")
> dbWriteTable(con, "tbl", df)
[1] TRUE
> 
> # WRITE DF INTO HDF5
> if(file.exists('data.h5')) file.remove('data.h5')
[1] TRUE
> h5createFile("data.h5")
[1] TRUE
> h5write(df, 'data.h5', 'tbl')
> 
> # CALCULATE CPU TIMES
> system.time(for(i in 1:10) read.csv('credit_count.csv'))
   user  system elapsed 
  1.148   0.056   1.576 
> system.time(for(i in 1:10) dbReadTable(con, 'tbl'))
   user  system elapsed 
  0.492   0.024   0.649 
> system.time(for(i in 1:10) h5read('data.h5','tbl'))
   user  system elapsed 
  0.164   1.184   1.946
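
Because user, system, and elapsed times can rank the three formats differently, it may help to collect them side by side. The following is a minimal sketch, not the code used for the timings above: `time_readers` is a hypothetical helper, and `demo_df` is synthetic data standing in for credit_count.csv. The SQLite and HDF5 readers are added only if the corresponding packages are installed.

```r
# Hypothetical helper (not from the original post): run each reader `reps`
# times and return user, system, and elapsed seconds, one row per reader.
time_readers <- function(readers, reps = 10) {
  timings <- lapply(readers, function(read_fn) {
    t <- system.time(for (i in seq_len(reps)) read_fn())
    # Keep this process's own CPU times plus wall-clock time.
    unname(unclass(t)[c("user.self", "sys.self", "elapsed")])
  })
  out <- do.call(rbind, timings)
  colnames(out) <- c("user", "system", "elapsed")
  out
}

# Synthetic stand-in for credit_count.csv (assumption: the real file is
# not available, so only the column count of 14 is mirrored here).
set.seed(1)
demo_df <- as.data.frame(matrix(rnorm(1000 * 14), ncol = 14))
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_df, csv_path, row.names = FALSE)

readers <- list(csv = function() read.csv(csv_path))

# Add the SQLite reader only when DBI and RSQLite are installed.
con <- NULL
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = tempfile(fileext = ".db"))
  DBI::dbWriteTable(con, "tbl", demo_df)
  readers$sqlite <- function() DBI::dbReadTable(con, "tbl")
}

# Add the HDF5 reader only when rhdf5 is installed.
if (requireNamespace("rhdf5", quietly = TRUE)) {
  h5_path <- tempfile(fileext = ".h5")
  rhdf5::h5createFile(h5_path)
  rhdf5::h5write(demo_df, h5_path, "tbl")
  readers$hdf5 <- function() rhdf5::h5read(h5_path, "tbl")
}

timing_table <- time_readers(readers, reps = 10)
print(timing_table)

# Release the SQLite connection if one was opened.
if (!is.null(con)) DBI::dbDisconnect(con)
```

Each row of `timing_table` corresponds to one reader, so the three time components can be compared directly instead of reading them off separate `system.time` printouts. Timings on synthetic data of this size will not match the figures reported above.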
