Big data problems

April 22, 2011

(This article was first published on Enterprise Software Doesn't Have to Suck, and kindly contributed to R-bloggers)

I have big data problems.

I need to analyze hundreds of millions of rows of data, and I spent two weeks trying hard to see whether I could use R for this. My assessment so far, based on those experiments:

1) R is best for data that fits in a computer’s RAM (so get more RAM if you can).

2) R can handle datasets that don’t fit into RAM by using the bigmemory and ff packages. However, this technique works well only for datasets smaller than 15 GB. This is in line with the excellent analysis done by Ryan. There is also another good tutorial for bigmemory.

3) If we need to analyze datasets larger than 15 GB, then SAS, MapReduce and an RDBMS 🙁 seem like the only options, as they store data on the file system and access it as needed.
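To make the "store data on the file system and access it as needed" idea concrete, here is a minimal sketch of a one-pass, out-of-core aggregation. It is written in Python purely for illustration and is not how bigmemory, ff, SAS or any RDBMS works internally. The function name `streaming_group_mean` and the columns `region` and `sales` are hypothetical.

```python
import csv
import io
from collections import defaultdict


def streaming_group_mean(lines, key_col, value_col):
    """Compute a per-group mean in one pass over the rows.

    Only the running totals and counts (one entry per group) are kept in
    memory, so the input can be far larger than RAM.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):  # reads one row at a time
        key = row[key_col]
        totals[key] += float(row[value_col])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Tiny in-memory stand-in for a large CSV file on disk (hypothetical data).
sample = io.StringIO("region,sales\neast,10\nwest,20\neast,30\n")
result = streaming_group_mean(sample, "region", "sales")
print(result)
```

With a real file, the same function could be called on an open file handle, for example `open(path, newline="")`, instead of the in-memory sample. The trade-off is that this only suits computations that can be expressed as running summaries. Anything that needs random access to all rows at once still requires the data to fit in RAM or a tool that pages data from disk.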

Since MapReduce implementations are still clumsy and not yet business-friendly, I wonder if it’s time to explore commercial analytics tools like SAS for big data analytics.

Can Stata, Matlab or RevolutionR analyze datasets in the range of 50–100 GB effectively?

