Big data problems

April 22, 2011


I have big data problems.

I need to analyze hundreds of millions of rows of data, and I spent two weeks trying hard to see whether I could use R for this. My assessment so far from the experiments:

1) R is best for data that fits in a computer's RAM (so get more RAM if you can).
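To get a rough feel for whether a dataset will fit, a quick back-of-the-envelope estimate in R is enough (a minimal sketch; the row and column counts below are made up for illustration):

# A numeric column costs 8 bytes per row in R.
# Hypothetical example: 300 million rows, 10 numeric columns.
rows <- 300e6
cols <- 10
est_gb <- rows * cols * 8 / 1024^3
est_gb  # roughly 22 GB, before counting the copies R makes during analysis

# For data that is already loaded, check the actual footprint:
# print(object.size(my_data), units = "Gb")

In practice the data needs to be well under the physical RAM, because many R operations make copies of their inputs.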

2) R can be used for datasets that don't fit into RAM via the bigmemory and ff packages, which keep the data on disk in file-backed structures. However, this approach only works well for datasets under about 15 GB, which is in line with the excellent analysis done by Ryan (see the reference below). There is also a good tutorial on bigmemory.
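For reference, here is a minimal sketch of the file-backed approach with bigmemory (the file names, and the assumption that the CSV is all numeric, are mine, purely for illustration):

library(bigmemory)
library(biganalytics)

# Read a large, all-numeric CSV into a file-backed big.matrix.
# The data lives on disk; only the chunks being worked on are pulled into RAM.
x <- read.big.matrix("data.csv", header = TRUE, type = "double",
                     backingfile = "data.bin", descriptorfile = "data.desc")

# Simple summaries run against the file-backed matrix without loading it all.
colmean(x, na.rm = TRUE)

# A later R session can re-attach to the same backing file instantly:
# x <- attach.big.matrix("data.desc")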

3) If we need to analyze datasets larger than 15 GB, then SAS, MapReduce, and an RDBMS :( seem like the only options, as they store data on the file system and access it as needed.
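One way to make the RDBMS option less painful is to push the aggregation into the database and pull only the summarised result into R. A minimal sketch using DBI with SQLite (the database file, table, and column names are hypothetical):

library(DBI)
library(RSQLite)

# Connect to a database that holds the full dataset on disk.
con <- dbConnect(SQLite(), dbname = "bigdata.sqlite")

# Do the heavy lifting (filtering, grouping) in SQL and bring back
# only the small summary table into R's memory.
summary_df <- dbGetQuery(con, "
  SELECT region, COUNT(*) AS n, AVG(revenue) AS avg_revenue
  FROM transactions
  GROUP BY region
")

head(summary_df)
dbDisconnect(con)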

Since MapReduce implementations are still clumsy and not business-friendly yet, I wonder if it's time to explore commercial analytics tools like SAS for big data analytics.

Can Stata, MATLAB, or Revolution R analyse datasets in the 50-100 GB range effectively?


References
Ryan Rosario, "Taking R to the Limit, Part I: Parallelization in R": http://www.bytemining.com/2010/07/taking-r-to-the-limit-part-i-parallelization-in-r/
http://www.austinacl.blogspot.com (image credit)
