(This article was first published on **Quantitative Finance Collector**, and kindly contributed to R-bloggers)

**Handling large datasets in R**, especially CSV data, was briefly discussed before in Excellent free CSV splitter and Handling Large CSV Files in R. My file at that time was around 2GB, with 30 million rows and 8 columns. Recently I started to collect and analyze US corporate bond tick data from 2002 to 2010, and the CSV file I got is 6.18GB with 40 million rows, even after removing the biased data as described in Biases in TRACE Corporate Bond Data.

How to proceed efficiently? Below is an excellent presentation on **handling large datasets in R** by Ryan Rosario at http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/. A short summary of the presentation:

**1**, R has a few packages for big data support. The presentation covers **bigmemory** and **ff**, as well as some uses of parallelism to accomplish the same goal using **Hadoop** and **MapReduce**;

**2**, the data used in the presentation is an 11GB comma-separated values file with 120 million rows and 29 columns;

**3**, for datasets of around 10GB, bigmemory and ff cope well;

**4**, for larger datasets, use Hadoop.
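As a rough illustration of the file-backed approach in point 3, the sketch below uses bigmemory's `read.big.matrix()` to load a numeric CSV into a matrix stored on disk rather than in RAM. It is not taken from the presentation: the helper name `read_file_backed`, the toy file and the backing file names are made up for illustration. It assumes the bigmemory package is installed and that every column is numeric, since a big.matrix holds a single type.

```r
# Hypothetical helper: read an all-numeric CSV into a file-backed big.matrix.
# The backing file names are illustrative only.
read_file_backed <- function(csv_path, backing_dir = tempdir()) {
  bigmemory::read.big.matrix(
    csv_path,
    header = TRUE, sep = ",", type = "double",
    backingpath = backing_dir,        # directory holding the on-disk data
    backingfile = "bonds.bin",        # binary file that stores the matrix
    descriptorfile = "bonds.desc"     # small file describing the matrix
  )
}

# Toy demo, run only when bigmemory is available.
if (requireNamespace("bigmemory", quietly = TRUE)) {
  toy <- tempfile(fileext = ".csv")
  write.csv(data.frame(price = c(100, 101, 102), volume = c(10, 20, 30)),
            toy, row.names = FALSE, quote = FALSE)
  x <- read_file_backed(toy)
  print(dim(x))               # rows and columns, without a data.frame in RAM
  print(mean(x[, "price"]))   # columns are pulled into memory only on request
}
```

On a multi-gigabyte file the initial import takes a while, but later sessions can reattach the backing file through its descriptor instead of re-reading the CSV.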

**Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Datasets, LA R Users' Group 8/17/10**


BTW, determining the number of rows of a very big file is tricky: you don't want to load the data first and call dim(), which can easily run out of memory. One way around this is to read the file in blocks with readLines(), for example:

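    # gzfile() reads gzip-compressed (and plain) files; for a .zip archive use unz()
    data <- gzfile("yourdata.csv.gz", open = "r")
    MaxRows <- 50000   # number of lines to read per pass
    TotalRows <- 0
    # keep reading blocks until readLines() returns nothing
    while ((LeftRow <- length(readLines(data, MaxRows))) > 0) {
      TotalRows <- TotalRows + LeftRow
    }
    close(data)
    # TotalRows counts every line, including the header row if there is one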
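The same block-by-block pattern extends from counting rows to processing them. The sketch below computes a column mean while holding only one block in memory. It is an illustration, not part of the presentation: the helper name `chunked_column_mean`, the column name `price` and the toy file are assumptions, and it expects a simple header line with no quoted or embedded commas.

```r
# Hypothetical helper: mean of one numeric column, read block by block.
chunked_column_mean <- function(path, column, chunk_size = 50000) {
  con <- gzfile(path, open = "r")   # gzfile() also reads uncompressed files
  on.exit(close(con))
  # Naive header parsing: assumes plain comma-separated, unquoted names.
  header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break   # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

# Toy demo with a deliberately small chunk size to force two passes.
toy_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(price = c(100, 101, 102, 103), volume = 1:4),
          toy_csv, row.names = FALSE, quote = FALSE)
result <- chunked_column_mean(toy_csv, "price", chunk_size = 3)
print(result)
```

Keeping running totals rather than the blocks themselves is what keeps memory use flat; statistics that need all values at once, such as a median, do not fit this pattern directly.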
