Handling Large Datasets in R

(This article was first published on Quantitative Finance Collector, and kindly contributed to R-bloggers)

Handling large datasets in R, especially CSV data, was briefly discussed before in Excellent free CSV splitter and Handling Large CSV Files in R. My file at that time was around 2GB, with 30 million rows and 8 columns. Recently I started to collect and analyze US corporate bond tick data from 2002 to 2010, and the CSV file I got is 6.18GB with 40 million rows, even after removing biased data as described in Biases in TRACE Corporate Bond Data.

How to proceed efficiently? An excellent presentation on handling large datasets in R by Ryan Rosario is available at http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/. A short summary of the presentation:
1. R has a few packages for big data support. The presentation covers bigmemory and ff, and also some uses of parallelism to accomplish the same goal using Hadoop and MapReduce.
2. The data used in the presentation is an 11GB comma-separated values file with 120 million rows and 29 columns.
3. For datasets of around 10GB, bigmemory and ff handle themselves well.
4. For larger datasets, use Hadoop.



BTW, determining the number of rows of a very big file is tricky. You don't have to load the whole dataset first and call dim(), which can easily run out of memory. One way of doing it is to read the file in chunks with readLines(), for example:
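# gzfile() reads gzip-compressed (or uncompressed) files, not .zip archives;
# for a single file inside a .zip archive, use unz() instead
data <- gzfile("yourdata.csv.gz", open = "r")
MaxRows <- 50000    # number of lines to read per chunk
TotalRows <- 0      # counts every line, including a header line if present
while ((LeftRow <- length(readLines(data, MaxRows))) > 0) {
  TotalRows <- TotalRows + LeftRow
}
close(data)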
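
The same chunked loop can do more than count rows. Below is a minimal sketch in base R, assuming a plain CSV with a header row and numeric columns. The temporary file, the column names price and volume, and the tiny chunk_size are all hypothetical stand-ins chosen for illustration. It computes a column mean while holding only one chunk in memory at a time:

```r
# Hypothetical small CSV standing in for a multi-gigabyte file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(price  = c(101.5, 99.5, 100.0, 102.0, 98.0),
                     volume = c(10, 25, 5, 40, 20)),
          csv_path, row.names = FALSE)

chunk_size <- 2   # rows per chunk; something like 50000 for real data
con <- file(csv_path, open = "r")

# Read the header line once and reuse the column names for every chunk
header_line <- readLines(con, n = 1)
col_names <- scan(text = header_line, what = "", sep = ",", quiet = TRUE)

total_rows <- 0
price_sum <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file reached
  # Parse only the current chunk into a data frame
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  total_rows <- total_rows + nrow(chunk)
  price_sum <- price_sum + sum(chunk$price)
}
close(con)

mean_price <- price_sum / total_rows
```

Running sums and counts combine easily across chunks like this; statistics such as medians need a different approach, or a package like bigmemory or ff.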
