by Joseph Rickert
Every month I look forward to getting my copy of AMSTATNEWS, the monthly magazine of the American Statistical Association, in the mail. This July, I was both pleased and bemused by ASA President Marie Davidian’s article “Aren’t We Data Science?” I was pleased to see a follow-up to last month’s article “The ASA and Big Data,” but mystified by the overall tone of the article. It portrays mainstream academic statisticians as being left behind by the rise of Big Data: some trapped in moribund departments that are unwilling to change, overlooked by university administrators who see them as “small data” scholars without the tools for “Big Data” and “Big Questions,” and surprised to find out that they, indeed, are not data science. Midway through the article, President Davidian asks: “What skills does a statistician need to engage in data science activities, and how should we be preparing statistics students?” A good bit of what follows is a call to arms, exhorting statisticians to immerse themselves in “real-world” problems, participate in data science meetup groups, and reach out to local business and research organizations to accumulate case studies for their students. The article closes with the dramatic and sobering suggestion that statisticians ask themselves: “How would you feel if there were no departments of statistics 50 years from now?”
What really surprised me in President Davidian’s article was the mere passing reference to R, which was allocated less text than Python. I have nothing at all against Python, but R is a fundamental tool of modern computational statistics that provides the very bridge to data science that President Davidian is seeking. In recent years, survey after survey has singled out R as one of the top tools for data scientists, and R is the go-to tool for Kaggle data mining competitions.
Not only do statisticians already have a way into data science, but R developers are relentless in their drive to extend the reach of R into Big Data. This July’s useR! conference, for example, included presentations on ffbase, a package for dealing with data sets too big for memory; bigvis, a package for visualizing large data sets based on the sound statistical principles of aggregation and smoothing; a tutorial on the Rcpp package, which makes it relatively easy for a statistician with intermediate-level R skills to harness the speed of C++ for computationally challenging data science problems; and a case study of a Big Data analytics firm using the GLM implementation in the RevoScaleR package to apply survival analysis techniques to massive internet marketing datasets. (Note that Revolution Analytics makes the production-level RevoScaleR package available to academics for free. Students are just a download away from getting experience with the very tools used by industry to do statistics on Big Data.)
The fact that President Davidian overlooked the value of R to Big Data and data science is unfortunate, but it probably reflects the relatively low status accorded to software development in statistics departments. Academic statisticians are among the most tireless and prolific contributors to the open source R project. However, it is my impression that they receive little or no formal recognition from their departments for these contributions. Although my responsibilities with Revolution Analytics require that I have considerable contact with academic statisticians, I am not myself an academic and am perhaps not in a position to see what really makes for a successful statistics department. Nevertheless, from the perspective of a person working on “real world” problems, it is difficult to see why a paper cited a couple of hundred times over the course of several years should be reckoned to have more impact than an R package that sees daily use by thousands of statisticians and data scientists. Certainly it would be helpful for the ASA to sponsor a “conference on statistics and data science featuring top data scientists and statisticians as speakers,” as President Davidian suggests. However, if departments of statistics want to improve their chances of being around in 50 years, they could make a bigger investment in their future by recognizing and encouraging the contributions of R developers and other tool makers.