A follow-up to my previous post, Excellent Free CSV Splitter. I asked a question on LinkedIn about how to handle large CSV files in R / Matlab. Specifically,
suppose I have a large CSV file with over 30 million rows, and both Matlab and R run out of memory when importing the data. Could you share your way of handling this issue? What I am thinking is:
a) split the file into several pieces (free, straightforward but hard to maintain);
b) use MS SQL/MySQL (have to learn it, MS SQL isn’t free, not straightforward).
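For option a), here is a rough sketch in R of splitting a CSV into smaller pieces without ever loading the whole file. The helper name split_csv, the part_001.csv naming scheme and the tiny demo file are all made up for illustration, and the approach assumes that no quoted field contains a line break.

```r
# Hypothetical helper: split `path` into pieces of at most `rows_per_file`
# data rows, repeating the header line at the top of every piece.
# Assumes one record per line (no quoted fields with embedded newlines).
split_csv <- function(path, rows_per_file, out_dir = tempdir()) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  pieces <- character(0)
  repeat {
    # Only `rows_per_file` lines are held in memory at any time.
    chunk <- readLines(con, n = rows_per_file)
    if (length(chunk) == 0) break
    out <- file.path(out_dir, sprintf("part_%03d.csv", length(pieces) + 1))
    writeLines(c(header, chunk), out)
    pieces <- c(pieces, out)
  }
  pieces
}

# Tiny demo file standing in for the real 30-million-row CSV.
demo <- tempfile(fileext = ".csv")
writeLines(c("id,value", paste(1:5, (1:5) * 10, sep = ",")), demo)

parts <- split_csv(demo, rows_per_file = 2)
print(basename(parts))
```

Each piece can then be read with read.csv() on its own, which is exactly why this route is straightforward but hard to maintain: every later analysis has to loop over the pieces.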
A useful summary of the suggested solutions:
1, 1) import the large file via scan() in R;
2) convert the result to a data.frame, to keep the data formats;
3) use cast() to group the data into as "square" a format as possible; this step involves the reshape package, a very good one.
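The first two steps above can be sketched in base R as follows. The column names and types (id, group, value) are assumptions made up for the demo file, not taken from the original data, and the cast() step is left out because it depends on the reshape package and on how the real data should be grouped.

```r
# Tiny demo CSV with hypothetical columns, so the sketch runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,value", "1,a,0.5", "2,b,1.5", "3,a,2.5"), csv_path)

# Step 1: scan() with an explicit `what` template. Declaring the column
# types up front avoids the type guessing that read.table() has to do.
cols <- scan(csv_path,
             what = list(id = integer(), group = character(), value = numeric()),
             sep = ",", skip = 1, quiet = TRUE)

# Step 2: the named list converts directly to a data.frame,
# keeping the column types declared above.
df <- as.data.frame(cols, stringsAsFactors = FALSE)
str(df)
```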
2, use the bigmemory package to load the data; in my case, that means using read.big.matrix() instead of read.table(). The package has several other interesting functions, such as mwhich(), which replaces which() to save memory. It is also often paired with foreach() (from the separate foreach package) in place of for(). How large a file can this package handle? I don't know, but the authors report successfully loading a CSV as large as 11GB.
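A minimal sketch of the bigmemory route, assuming the package is installed (the block skips itself if it is not). The demo file and its columns are made up, and note that a big.matrix holds a single type, so this only fits an all-numeric CSV.

```r
# Assumes the bigmemory package is installed; otherwise nothing is run.
if (requireNamespace("bigmemory", quietly = TRUE)) {
  # Tiny all-numeric demo file with hypothetical columns.
  demo <- tempfile(fileext = ".csv")
  writeLines(c("id,value", paste(1:5, (1:5) * 10, sep = ",")), demo)

  # read.big.matrix() in place of read.table(); header = TRUE keeps
  # the column names so they can be used in mwhich() below.
  x <- bigmemory::read.big.matrix(demo, sep = ",", header = TRUE,
                                  type = "double")

  # mwhich() in place of which(): row indices where value > 20,
  # found without building a full-length logical vector first.
  hits <- bigmemory::mwhich(x, cols = "value", vals = 20, comps = "gt")
  print(x[hits, ])
}
```

For a file that really does not fit in RAM, read.big.matrix() also accepts backingfile and descriptorfile arguments to keep the matrix on disk; I have left them out here to keep the sketch short.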
3, switch to a 64-bit version of R with enough memory, preferably on Linux. I can't test this solution at my office due to administrative constraints, although it is doable, as mentioned in the R help documentation:
64-bit versions of Windows run 32-bit executables under the WOW (Windows on Windows) subsystem: they run in almost exactly the same way as on a 32-bit version of Windows, except that the address limit for the R process is 4GB (rather than 2GB or perhaps 3GB)….The disadvantages are that all the pointers are 8 rather than 4 bytes and so small objects are larger and more data has to be moved around, and that far less external software is available for 64-bit versions of the OS.
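To see which case applies on a given machine, something like the following should do. It reports the pointer size of the running R build and makes a back-of-the-envelope estimate of how much memory a purely numeric table would need. The column count of 10 is a made-up assumption; only the 30 million rows come from my file.

```r
# 8-byte pointers mean a 64-bit build of R, 4-byte pointers a 32-bit one.
pointer_bytes <- .Machine$sizeof.pointer
cat("Architecture:", R.version$arch, "\n")
cat("Pointer size:", pointer_bytes, "bytes\n")

# Rough size of the data once loaded: each double takes 8 bytes.
n_rows <- 30e6
n_cols <- 10  # hypothetical; replace with the real column count
est_bytes <- n_rows * n_cols * 8
cat("Estimated size:", round(est_bytes / 2^30, 2), "GiB\n")
```

With these assumed numbers the raw data alone already exceeds the 2GB address limit quoted above for 32-bit Windows, before counting any copies R makes while importing.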
Search & trial.
Read the full post at Handling Large CSV Files in R.