I need to analyze hundreds of millions of rows of data, and I spent two weeks trying to see whether I could use R for this. My assessment so far, from those experiments:
1) R is best for data that fits a computer’s RAM (so get more RAM if you can).
2) R can be used for datasets that don’t fit into RAM by using the bigmemory and ff packages. However, in my experiments this approach worked well only for datasets smaller than 15 GB. This is in line with the excellent analysis done by Ryan. There is also another good tutorial for bigmemory.
3) If we need to analyze datasets larger than 15 GB, then SAS, MapReduce and an RDBMS 🙁 seem to be the only options, since they store data on the file system and access it as needed.
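To make the "store data on the file system and access it as needed" idea in point 3 concrete, here is a minimal sketch of a streaming aggregation. It is not how SAS or any RDBMS is actually implemented; it only illustrates the general out-of-core pattern, written in Python because the idea is language-agnostic. The function name `streaming_group_mean` and the column names used in the usage note are hypothetical, chosen purely for illustration.

```python
import csv
from collections import defaultdict


def streaming_group_mean(path, key_col, value_col):
    """Compute the mean of value_col per distinct key_col value.

    Hypothetical helper for illustration: the file is read one row at a
    time, so memory use grows with the number of distinct groups, not
    with the number of rows in the file.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    with open(path, newline="") as f:
        # DictReader yields rows lazily; the whole file is never loaded.
        for row in csv.DictReader(f):
            key = row[key_col]
            sums[key] += float(row[value_col])
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Calling `streaming_group_mean("sales.csv", "region", "amount")` on a hypothetical CSV with `region` and `amount` columns would return a dictionary of per-region means. The trade-off is that only aggregations that can be updated incrementally (sums, counts, means) fit this pattern directly; anything that needs all rows at once, such as a median or a model fit, needs a different strategy.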
Since MapReduce implementations are clumsy and not business-friendly yet, I wonder if it’s time to explore commercial analytics tools like SAS for big data analytics.
Can Stata, MATLAB or Revolution R analyze datasets in the range of 50–100 GB effectively?