Update 10/11/2011: There’s a good discussion on Reddit.
Update 10/12/2011: Note the manipulate package and highlight the data.table package.
The R statistical computing platform is a rising star that’s been gaining popularity and attention, but it gets no respect in the hood. It’s telling that a popular guide to R is called The R Inferno, and that advocacy pieces are titled “Why R Doesn’t Suck.” Even the creator of R had this to say about the language in a damning article suggesting starting over with R:
I [Ross Ihaka] have been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too.
So why do people still use R? Would we lose anything if we just migrated to (say) Python, which many consider to be a major contender/alternative to R?
In this post I’m going to highlight a few things that are nice about R—not just in the platform itself, but in the whole ecosystem. These are things that you won’t necessarily find in alternate universes like Python’s.
Obligatorily, R packages are the dominant reason people are anchored to R: ggplot2, caret, data.table (one of R’s best-kept secrets), zoo, the mountains of modeling packages, all the data munging/plumbing infrastructure. Packages for all manner of problems have been written, frequently exclusively for the R universe. The standard library is also solid.
R Studio is just a whole bucket of awesome. IPython is OK for interactive exploration…as long as you’re just banging out new lines and editing/re-running those lines. As soon as you have to tweak a function somewhere or introduce a block of code, things get hairier. After editing your source file, you can choose to either re-run your whole script (first commenting/uncommenting various source lines), or if you want to preserve your data you can reload your module and all dependent imports, or you can sometimes use %edit but it’s limited and glitchy, or you can fudge around with joblib and the like. Because of Python’s modules and scoping, you can’t get away with something as simple as editing and re-evaluating some lines in Emacs.
But R Studio makes interactive coding easy-peasy. You simply tweak and (re-)run exactly the line(s) you want. This style of IDE is one of the things I missed the most from MATLAB. Even for tasks perfectly suitable for REPLs, it’s nice to be editing a concrete file, since this interactive style is how many of my scripts get fully written out.
It’s tricky to design an IDE for dynamic languages; they often end up pretty limited when it comes to things like completion suggestions. But I like the way R Studio handles dynamism: you get things like completions based on the current execution environment, and in general you get to inspect/play around with the environment, not unlike in a debugger. This is perfect for that interactive style of development.
Then there’s the fact that R Studio is available as a web app, which I love. Besides being able to resume my session from anywhere with a browser, I can work with graphics without hassle. I don’t want to be tunneling X over ssh or dumping PNGs or whathaveyou to see data visualizations. Add on top of all this the manipulate package that comes with R Studio, which gives you basic interactive plotting that builds on top of any plotting system including ggplot2. (I was excited to see that recent work on IPython has introduced a web interface too, but ggplot2 is light-years ahead of matplotlib.)
Next up: the language itself. Wait! What’s this doing here? Isn’t the language R’s biggest liability?
There’s no special syntax for defining named functions. There’s no required return keyword. There’s no distinguishing among statements, blocks, or expressions; everything’s an expression. Operators, including = and , are functions.
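To make those points concrete, here is a minimal sketch in base R. The function and variable names are made up for illustration, and the particular operators called by name are just my picks, not a list taken from anywhere.

```r
# Functions are ordinary values bound with the usual assignment operator;
# the last evaluated expression is the return value.
square <- function(x) x^2

# `if` is an expression, so its value can be returned or assigned directly.
parity <- function(n) if (n %% 2 == 0) "even" else "odd"

# A braced block is an expression too: it evaluates to its last expression.
total <- {
  a <- 2
  b <- 3
  a + b
}

# Operators are functions and can be called by name using backticks.
sum_via_call <- `+`(1, 2)
second <- `[`(c(10, 20, 30), 2)

# Even assignment can be invoked as a function call.
`=`(assigned, 42)
```

After running this, `total` holds 5 and `assigned` holds 42, which is the "everything's an expression, operators are functions" idea in miniature.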
Like in MATLAB, values are all immutable, pass-by-value, copy-on-write, etc. R also has open-world polymorphism that doesn’t introduce new syntax; it’s at the same time more flexible and more TOOWTDI (“there’s only one way to do it”) than Python and other similar OOP languages. The function argument semantics are also more powerful/useful than those found in many other languages.
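A rough sketch of what this looks like in practice, assuming that "open-world polymorphism" refers to S3-style generic functions (R has other object systems too). All names here are hypothetical and only base R is used.

```r
# Copy-on-modify: changing an argument inside a function
# leaves the caller's value untouched.
bump_first <- function(v) {
  v[1] <- 99
  v
}
original <- c(1, 2, 3)
changed <- bump_first(original)   # `original` is still c(1, 2, 3)

# An S3 generic is a plain function that dispatches on the class attribute.
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2
area.square <- function(shape) shape$side^2

square_area <- area(structure(list(side = 2), class = "square"))

# New methods can be added later, by anyone, without touching `area` itself.
area.rect <- function(shape) shape$w * shape$h
rect_area <- area(structure(list(w = 2, h = 5), class = "rect"))

# Arguments: defaults may refer to other arguments, are evaluated lazily,
# and callers may match by name or by an unambiguous partial name.
scale_by <- function(x, factor = length(x), offset = 0) x * factor + offset
scaled <- scale_by(c(1, 2), off = 1)   # `off` partially matches `offset`
```

Note that the methods are ordinary function definitions following a naming convention, so no new syntax is involved in extending `area` to a new class.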
The Data Structures
Here’s another one that’s equally asset and liability. Since we’re just focusing on the ups: the data frame and the factor stand out in particular. R’s data types are by themselves straightforward to implement (though tedious to implement well and optimize and deal with missing values and joins and pivoting and whatnot), but the fact is that they served as the foundation for a lot of R code. Such an established foundation simply does not exist yet in other environments like Python. Projects like Wes McKinney’s pandas add these crucial data structures, are making steady progress (including the recent addition of factors), and will probably be “standardized on” in the not-too-distant future, but until then there’s still a lot of work to do, multiple such projects in competition, and relatively little built on top of them.
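For readers who haven’t met these two types, here is a small base R sketch. The data is invented purely for illustration.

```r
# A data frame holds columns of different types; a factor is a
# categorical column stored as integer codes plus a set of levels.
people <- data.frame(
  name  = c("Ann", "Bob", "Cid", "Dee"),
  group = factor(c("a", "b", "a", "b")),
  score = c(3.5, NA, 4.5, 2.0)
)

group_levels <- levels(people$group)       # the distinct categories
group_codes  <- as.integer(people$group)   # the underlying integer codes

# Missing values are first-class: NA is skipped only when asked.
mean_by_group <- tapply(people$score, people$group, mean, na.rm = TRUE)

# Row filtering with a logical condition, guarding against the NA.
high <- people[!is.na(people$score) & people$score > 3, ]
```

The point is less any one of these operations than the fact that so much R code assumes inputs and outputs of exactly this shape.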
Nearly everything in R is serializable. This goes beyond Python pickle, which has plenty of limitations. Code, even closures, can be treated as first-class values: you can serialize it and send it around, something that is rarely seen outside of the Lisp family. Your execution environment, the session, is something you’ll regularly save and restore, which is a huge boon for interactive development and exploration. Sure, a restored file/socket won’t be of any use, but everything else just works.
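As a minimal sketch of the closure case, using base R’s `serialize` and `unserialize` (the counter itself is a made-up example):

```r
# A closure that carries private state in its enclosing environment.
make_counter <- function(start = 0) {
  count <- start
  function() {
    count <<- count + 1
    count
  }
}

counter <- make_counter(10)
counter()   # count is now 11

# Serialize the closure, captured state included, to a raw vector.
# The bytes could just as well be written to disk or sent over a socket.
bytes <- serialize(counter, connection = NULL)
restored <- unserialize(bytes)

# The restored copy continues from the captured state,
# independently of the original closure.
next_value <- restored()
```

Saving the whole session works on the same principle; in base R that is `save.image()` paired with `load()`.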
Then there’s CRAN, R’s package repository. Yes, other languages have analogues, like PyPI and npm. But R is the only ecosystem where I’ve never once had to go outside the package system in all my time using R software.
I run into this problem all the time in other ecosystems. Just earlier I had to repackage Google Protocol Buffers for Python to actually have a working setup.py. But I’ve run into this problem in a whole ton of projects, a sample of which includes Pyevolve, PyStemmer, unicodecsv, progressbar, re2. And it’s not just Python. I’ve run into problems with gems, CPAN, Cabal, all sorts of other places. And it’s not just broken package installers. Sometimes it’s problems/limitations with the package manager (just earlier I had to separately install numpy before installing scipy). Sometimes it’s the occasional packages that don’t even publish to these repositories at all, forcing you to step outside the system (particularly poor coverage in the Java/Maven world, but Maven arrived late).
I don’t know if there’s something about the way R package authoring/publishing works that makes distribution particularly robust and straightforward, or what. But shit just works.
Embedded R and Rserve
Not much to say here other than that there’s some good interop in the form of RPy2 for Python and REngine for Java. Using complementary tools lets you work around R’s shortcomings and opens up many more opportunities to use R. Although you can certainly choose to write everything in R, there’s a healthy, widespread awareness of R’s weaknesses (and strengths) that so far seems to be doing a good job of drawing boundaries. This attitude and this degree of accessibility are just another thing I like about R.
So, there’s quite a bit to learn from R. All that said, it’s important to understand the opening quote. There are many fundamental problems with R, stemming not just from the platform’s intrinsic properties but simply from the fact that it exists at all. Some of my happier dreams involve burning R to the ground.
But that’s a future post.