(first published on the HP blog)
A recent workshop at HP Labs addressed “Distributed Computing in R”

Contributed by Indrajit Roy, Principal Researcher, HP Labs

Over the last two decades, R has established itself as the most-used open source tool in data analysis. R’s greatest strength is its user community, which has collectively contributed thousands of packages that extend R’s use in everything from cancer research to graph analysis. But as users in these and many other areas embrace distributed computing, we need to ensure that it remains easy for people to write, share, and contribute R code.

When it comes to distributed computing, though, R has many packages that provide parallelism constructs but no standardized API. Each package has its own syntax, its own parallelism techniques, and its own set of supported operating systems. Unfortunately, this makes it difficult for users to write distributed programs for themselves, or to make contributions that extend easily to other scenarios.

Figuring it was time to brainstorm and standardize an API, Michael Lawrence (Genentech, R-core member) and I recently organized a workshop on “Distributed Computing in R” at HP Labs. It was attended by members of R-core and some of R’s most important academic (e.g., Univ. of Iowa, Yale, Purdue), research lab (e.g., AT&T Research, ORNL), and industry (e.g., TIBCO, Revolution Analytics, Microsoft) contributors. These attendees have authored many popular R packages such as snow, Rcpp, RHIPE, foreach, and Bioconductor.

The one-and-a-half-day workshop featured a number of interesting talks and many collaborative discussions. In his presentation, Robert Gentleman, R’s co-creator, emphasized the need to streamline language constructs in order to move R forward in the era of Big Data. Other talks were given by authors of prominent R packages, who presented overviews of their packages and commented on the strengths and weaknesses of their parallelism constructs.

The talks were grouped into three sessions. The first focused on interfaces around MPI, such as R’s snow package. The second looked at R’s integration with external analytics systems like Hadoop MapReduce, and the third included talks on external memory algorithms and on packages for accessing data that does not fit in main memory. All these talks are available at the workshop home page.

Overall, a few common themes emerged:
- For those who prefer high-performance computing and are willing to program against a low-level interface, MPI and R wrappers around MPI are a very good option.
- For in-memory processing, adding some form of distributed objects in R can potentially improve performance.
- Using simple parallelism constructs, such as lapply, that operate on distributed data structures may make it easier to program in R.
- Any high-level API should support multiple backends, each of which can be optimized for a specific platform, much as R’s snow and foreach packages run on any available backend.
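
To make the last two themes concrete, here is a minimal sketch of an lapply-style call over partitioned data that can run on more than one backend. It is not an API proposed at the workshop: `make_partitions` and `dlapply` are hypothetical names invented for this illustration, a plain list of chunks stands in for a distributed data structure, and the only real functions used come from base R and the `parallel` package that ships with it.

```r
library(parallel)  # ships with R; provides makeCluster() and parLapply()

# Hypothetical helper: split a vector into n contiguous partitions.
# A plain list of chunks stands in for a distributed data structure.
make_partitions <- function(x, n) {
  split(x, ceiling(seq_along(x) * n / length(x)))
}

# Hypothetical backend-agnostic apply: the caller writes one lapply-style
# call and chooses where it runs.
dlapply <- function(parts, fun, backend = c("sequential", "cluster")) {
  backend <- match.arg(backend)
  switch(backend,
    sequential = lapply(parts, fun),
    cluster = {
      cl <- makeCluster(2)      # assumption: two local worker processes
      on.exit(stopCluster(cl))  # shut the workers down when dlapply returns
      parLapply(cl, parts, fun)
    }
  )
}

parts    <- make_partitions(1:100, 4)
seq_sums <- dlapply(parts, sum, backend = "sequential")
par_sums <- dlapply(parts, sum, backend = "cluster")

# Combine the per-partition results into a single answer.
total <- Reduce(`+`, par_sums)
```

The calling code is identical for both backends; only the `backend` argument changes. A real design would also have to decide where the partitions live and how they move between workers, which this sketch sidesteps by keeping everything in one R session.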