Thanks to the R Bloggers aggregator, I came across Yihui Xie’s post on a piece currently making the rounds about statistical analysis platforms. In “The Next Big Thing,” AnnMaria De Mars argues that R, as a statistical computing platform, is not well suited for what she views as the next big things in data analytics: dealing with very large data sets and creative visualization. She goes so far as to say that in this respect R is an epic fail (emphasis below mine):
Contrary to what some people seem to think, R is definitely not the next big thing, either. I am always surprised when people ask me why I think that, because to my mind it is obvious…I know that R is free and I am actually a Unix fan and think Open Source software is a great idea. However, for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail. It does NOT fit with the way the vast majority of people in the world use computers. The vast majority of people are NOT programmers. They are used to looking at things and clicking on things.
I have edited out a bit of R code that De Mars uses to illustrate her points: first, because the code itself has nothing to do with either big data or creative visualization; and second, because it contains errors and does not run. This, however, is rather beside the point. De Mars’s main point is that commercial statistical platforms, such as SAS, Stata and SPSS, are better suited to handle large data and visualization. Where to begin…
First, yes, R is a difficult language to learn. Even for those who have an extensive programming background, its syntactic peculiarities and functional foundation can be difficult hurdles to clear. That said, R is a Turing-complete language, which means that once you learn the language, data analytics are bounded only by your imagination (and NP-completeness). To De Mars’s point then, it would seem a contradiction to claim that endeavors into the “next big thing” would be best nurtured in the inherently limited environment of point-and-click commercial analysis platforms. Byron Ellis summarized this sentiment quite nicely on Twitter:
[R] is for making new things. Point and click is for redoing old things. I often need to make new things to analyze my data.
Well put. R allows users to build their own methods for analysis and feast on an ever-expanding catalog of libraries for any number of analytical needs, whereas commercial products provide users only with the functionality their vendors deem fit. Next, with respect to the specific big things De Mars is concerned with, big data and visualization, R appears to be the hands-down winner.
Brendan O’Connor provided an excellent, though now somewhat dated, side-by-side comparison of several open-source and commercial data analytic platforms. One of his main complaints about all of these platforms is that they cannot handle data sets that do not fit on a single hard drive, i.e., really big ones. Since he wrote, however, R has gained support for computing over a cluster with MPI or SNOW, and for streaming to map/reduce frameworks such as Hadoop. In addition, the sqldf library lets R users manipulate large data sets with SQL, backed by a relational database.
Regarding R specifically, O’Connor notes that one of its main weaknesses is that “visualizations are great for exploratory analysis, but you want something else for very high-quality graphs.” Now, however, there are several libraries that empower users to create extremely high-quality, publication-ready visualizations. Both lattice and ggplot2 provide unique visualization power and are fully extensible to the needs of their users, again unlike the commercial platforms. In addition, R can be extended to work with Processing to generate extremely high-end interactive visualizations. There are also several libraries that allow R to generate web-ready visualizations with Protovis. Some commercial platforms are able to perform high-performance computing but, to my knowledge, none matches R in the flexibility and quality of its visualization.
I am wholly perplexed by De Mars’s argument. While most software users are more comfortable with GUI platforms, it seems entirely unlikely that the next big data analysis “thing” would come from a world catering to the lowest common denominator. While clearly I am biased, that bias comes from experience on all of these platforms and from dealing with problems of big data and visualization. For those looking for the next big thing, I highly recommend following the adoption of R and its relatives, and spending very little time concerned with the commercial platforms.
UPDATE: Others have weighed in as well:
Tal Galili – “The next big thing”, R, and Statistics in the cloud
Joe Dunn – R Is (Not) the Next Big Thing
Photo: Social Media Law Student