How to quickly read a large dataset in R?

June 9, 2013

(This article was first published on Learning Data Science , and kindly contributed to R-bloggers)

I have read about many techniques, in various places, for importing a large dataset into R.
The functions read.table and read.csv often do not work for this because, as discussed elsewhere, R loads the whole dataset into memory. When we try to load a big dataset, we sometimes get this message:
Warning messages: 
1: Reached total allocation of 8056Mb: see help(memory.size)
2: Reached total allocation of 8056Mb: see help(memory.size) 
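One way to anticipate this warning is to estimate the in-memory size before loading everything. The following is only a rough sketch: the temporary file, its two columns, and the names `n_total`, `n_sample` and `est_mb` are hypothetical, and the linear extrapolation is an approximation, not an exact figure.

```r
# Hypothetical small file standing in for a much larger CSV.
path <- tempfile(fileext = ".csv")
set.seed(1)
n_total <- 5000
write.table(
  data.frame(V1 = seq_len(n_total),
             V2 = sample(letters, n_total, replace = TRUE)),
  path, sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE
)

# Read only a sample of rows and measure how much memory it occupies.
n_sample <- 1000
smp <- read.table(path, sep = ",", nrows = n_sample)
bytes_per_row <- as.numeric(object.size(smp)) / n_sample

# Extrapolate linearly to the full row count (known here; for a real file
# it would have to be counted or estimated separately).
est_mb <- bytes_per_row * n_total / 1024^2
print(est_mb)
```

If the estimate is close to or above the available memory, reading the file in one piece with read.table is likely to fail.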
Many techniques can be used to load a large dataset; I found some in various places. But there are two techniques that I had never thought of before.
Suppose that we have a large dataset with 10 million rows.
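To reproduce the comparison, a test file is needed. The sketch below generates one under stated assumptions: the layout (one integer column and three categorical columns, no header) is modeled on the column classes reported later in this article, while the actual values, the name `csv_path` and the reduced row count are hypothetical.

```r
set.seed(42)
n <- 1e5  # scaled down from 10 million so the sketch runs quickly

# Hypothetical content: one integer column and three categorical columns.
bigdf <- data.frame(
  V1 = sample.int(1000L, n, replace = TRUE),
  V2 = sample(c("a", "b", "c"), n, replace = TRUE),
  V3 = sample(c("x", "y"), n, replace = TRUE),
  V4 = sample(c("north", "south", "east", "west"), n, replace = TRUE)
)

# Write a comma-separated file without header or row names.
csv_path <- file.path(tempdir(), "bigdf.csv")
write.table(bigdf, csv_path, sep = ",", dec = ".",
            row.names = FALSE, col.names = FALSE, quote = FALSE)
```

Setting `n <- 1e7` would give a file of the size discussed here, at the cost of a much longer run.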
Let us compare the methods for loading it in R.
– Using read.table
read.csv(), which is a wrapper around read.table(), performs a lot of analysis on the data it reads in order to determine the data types. We can help R by reading only the first rows, determining the data type of each column from them, and then reading the whole file while providing those types. We can also skip columns that are not needed for the analysis.
Example
First we try to read a big data file (10 million rows):
> system.time(df <- read.table(file="bigdf.csv", sep=",", dec="."))
Timing stopped at: 160.85 0.75 161.97
I let this run for a long time but got no result.
With the new method, we load the first rows, determine the data types, and then run read.table again with the column types specified.
> system.time(ds <- read.table("bigdf.csv", nrows=100, dec=".", sep=","))
   user  system elapsed 
      0       0       0 
> classes <- sapply(ds, class)
> classes
       V1        V2        V3        V4 
"integer"  "factor"  "factor"  "factor" 
> system.time(ds <- read.table("bigdf.csv", dec=".", sep=",", colClasses=classes))
user  system elapsed 
234     432    128
As we can see, this technique is not very worthwhile here. It also takes longer.
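The option of skipping unneeded columns, mentioned above, was not part of the timing. It can be sketched as follows, assuming a tiny hypothetical file in place of bigdf.csv and assuming that columns V3 and V4 are the ones not needed. Setting an entry of colClasses to "NULL" tells read.table to drop that column.

```r
# Hypothetical small file standing in for bigdf.csv.
path <- tempfile(fileext = ".csv")
writeLines(c("1,a,x,north", "2,b,y,south", "3,a,x,east", "4,c,y,west"), path)

# Step 1: read a few rows and let R guess the column types.
head_rows <- read.table(path, sep = ",", dec = ".", nrows = 2,
                        stringsAsFactors = TRUE)
classes <- sapply(head_rows, class)

# Step 2: mark the columns that are not needed as "NULL" so they are skipped.
classes[3:4] <- "NULL"

# Step 3: read the whole file with the types supplied up front.
ds <- read.table(path, sep = ",", dec = ".", colClasses = classes)
str(ds)
```

Whether this brings a real speed-up on a 10 million row file would have to be measured; the sketch only shows the mechanics.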
– We can use the package sqldf.
> require(sqldf)
> f <- file("bigdf.csv")
> system.time(SQLf <- sqldf("select * from f", dbname = tempfile(),
+                           file.format = list(header = T, row.names = F)))
Loading required package: tcltk
   user  system elapsed 
  53.64    4.17   58.20 
It takes less than 1 minute to import 10 million rows into an object of size:
> print(object.size(SQLf), units="Mb")
267 Mb
– We can also use the package data.table.
> require(data.table)
Loading required package: data.table
data.table 1.8.8  For help type: help("data.table")
> system.time(DT <- fread("bigdf.csv"))
   user  system elapsed 
 133.11    0.56  133.93 
But DT is in data.table format, so a bit of conversion may be required before using it as a plain data frame, for example with ddply from the plyr package.
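One possible conversion is as.data.frame(), sketched here on a tiny hypothetical file. The file contents and the name `df` are assumptions, and the block only runs if data.table is installed. Note that a data.table also inherits from data.frame, so many functions accept it directly.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  # Hypothetical small file standing in for bigdf.csv.
  path <- tempfile(fileext = ".csv")
  writeLines(c("1,a", "2,b", "3,c"), path)

  DT <- data.table::fread(path, header = FALSE)

  # A data.table is also a data.frame as far as inherits() is concerned.
  print(inherits(DT, "data.frame"))

  # as.data.frame() returns a plain data.frame copy; DT itself is unchanged.
  df <- as.data.frame(DT)
  print(class(df))
}
```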
So the point is this: the sqldf package is very useful for quickly reading a large dataset into R. It imported 10 million rows in less than a minute.
