Do we need to deal with ‘big data’ in R?

November 22, 2011

(This article was first published on Quantum Forest » rblogs, and kindly contributed to R-bloggers)

David Smith at the Revolutions blog posted a nice presentation on “big data” (oh, how I dislike that term). It is a nice piece of work and the Revolution guys manage to process a large amount of records, starting with a download of 70GB and ending up with a series of linear regressions.

I’ve spent the last two weeks traveling and finishing marking for the semester, which has somewhat affected my perception on dealing with large amounts of data. The thing is that dealing with hotel internet caps (100MB) or even with my lowly home connection monthly cap (5GB) does get one thinking… Would I spend several months of internet connection just downloading data so I could graph and plot some regression lines for 110 data points? Or does it make sense to run a linear regression with two predictors using 100 million records?

My basic question is why would I want to deal with all those 100 million records directly in R? Wouldn’t it make much more sense to reduce the data to a meaningful size using the original database, up there in the cloud, and download the reduced version to continue an in-depth analysis? There are packages to query external databases (ROracle, RMySQL, RODBC, …, pick your poison), we can sample to explore the dataset, etc.

We can deal with a rather large dataset in our laptop but is it the best that we can do to deal with the underlying modeling problem? Just wondering.

Gratuitous picture of generic Pinus radiata breeding trial in Northern Tasmania.

To leave a comment for the author, please follow the link and comment on their blog: Quantum Forest » rblogs. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , ,

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training


CRC R books series

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)