Loading Big files in R

December 5, 2012

(This article was first published on Daniel Marcelino » R, and kindly contributed to R-bloggers)

As far as I remember, today was the first time in my life that I succeeded in loading a text file bigger than 1.5 GB into R (about 5 million lines and 18 columns).
My computer is no small machine: I’m using a MacBook Pro with roughly 8 GB of RAM. But the reason I couldn’t load such a huge file, as I would discover later, was not the capacity of the machine itself but the way I had set up R to read the file. I had been trying to read this file since last night, but my computer kept freezing after an hour or two of trying. After letting it run all night without success, and having found no magical solution for parallelizing the reading task across all the cores I had available, I was seriously considering “breaking” the file into two or three pieces, so that I could read each piece within a minute. This is one more case for reading the R manuals before begging for God’s help. I reread the read.table manual page to see whether anything new had been implemented since my first contact with R, and I found one particular “trick” that I thought could help me.
The read.table family of functions (read.table, read.delim, read.delim2, etc.) lets us declare all columns as “character” at the time the file is read into memory, through the colClasses argument. This spares R the work of figuring out the class of each variable, such as factor, character, numeric with decimal places, and so on. It worked for me, and I’m very glad! Indeed, I got the file into R’s memory pretty fast, I think (less than 4 minutes). Once the data are in memory, I can straightforwardly assign whatever class I want to the variables in my data set. I know the issue of reading big files in R comes up regularly on the R mailing lists, so I am posting the steps here to help others when the same problem arises. Here is how I did it:
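
The original listing is not reproduced in this copy, so what follows is only a minimal sketch of the approach described above, not the author's exact code. The file path, column names, and tab delimiter are assumptions made for illustration; a small sample file is written first so the example runs on its own.

```r
# Hypothetical stand-in for the large text file: a small tab-delimited
# sample is written to a temporary path so the sketch is self-contained.
path <- tempfile(fileext = ".txt")
sample_rows <- data.frame(
  id    = 1:3,
  date  = c("2012-12-01", "2012-12-02", "2012-12-03"),
  value = c("1.5", "2.25", "3.0"),
  group = c("a", "b", "a")
)
write.table(sample_rows, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read every column as character, so R does not have to work out the
# class of each column while reading. A single value of colClasses is
# recycled across all columns.
big <- read.delim(path, colClasses = "character")

# Inspect the result: all columns should be of class character.
str(big)
```

For a real file, `path` would point at the file on disk, and `read.table` or `read.delim2` could be used instead, depending on the delimiter and the decimal mark.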

After setting each variable’s class to Date, numeric, factor, and so on, I saved this huge data set as .RData. To my surprise, it now takes up only 133 MB of my hard drive. This definitely raises the question: why are other packages so space-inefficient compared to R? If I had to save it in Stata or SPSS format, for example, it would take more than 2 GB.
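
The conversion and saving step might look like the sketch below. The data frame, its column names, and the classes chosen for each column are hypothetical; the real data set has 18 columns whose names are not given in the post.

```r
# Small stand-in for a data frame that was read with all columns as character.
big <- data.frame(
  id    = c("1", "2", "3"),
  date  = c("2012-12-01", "2012-12-02", "2012-12-03"),
  value = c("1.5", "2.25", "3.0"),
  group = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

# Assign the intended class to each column after loading.
big$id    <- as.integer(big$id)
big$date  <- as.Date(big$date, format = "%Y-%m-%d")
big$value <- as.numeric(big$value)
big$group <- factor(big$group)

# Save the object in R's binary format.
out <- tempfile(fileext = ".RData")
save(big, file = out)

# Size of the saved file in bytes; load(out) would restore `big` later.
file.info(out)$size
```

Note that `save()` compresses its output by default, which may account for part of the difference in file size reported above.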

