by Joseph Rickert
Distcomp, a new R package available on GitHub from a group of Stanford researchers has the potential to significantly advance the practice of collaborative computing with large data sets distributed over separate sites that may be unwilling to explicitly share data. The fundamental idea is to be able to rapidly set up a web service based on Shiny and opencpu technology that manages and performs a series of master / slave computations which require sharing only intermediate results. The particular target application for distcomp is any group of medical researchers who would like to fit a statistical model using the data from several data sets, but face daunting difficulties with data aggregation or are constrained by privacy concerns. Distcomp and its methodology, however, ought to be of interest to any organization with data spread across multiple heterogeneous database environments.
Setting up the distcomp environment requires some preliminary work and out-of-band communication among the collaborators. In the first step, the lead investigator uses a distcomp function to invoke a browser-based Shiny application to describe the location of her data set, the variables to be used in the computation, the model formula and other metadata necessary to describe the computation.
Next, the investigator invokes another distcomp function to move the metadata and a copy of the local data set to computation server with a unique identifier. Once the master server is in place, collaborating investigators at remote locations perform a similar process to set up slave computation servers at their sites. When the lead investigator receives the URLs pointing to the slave servers she is ready to kick off the computation.
All of the details of this setup process are described in this paper by Narasimham et al. The paper also describes two non-trivial computations: a distributed rank-k singular value decomposition and distributed, stratified Cox model that are of interest in their own right. The algorithm and code for the stratified Cox model ought to be useful to data scientists in a number of fields working on time to event models. A really nice feature of the algorithm is that it only requires each site to independently optimize the partial likelihood function using its local data. The master process uses the partial likelihood information from all of the sites to compute a final estimate of the coefficients and their variances.
There are several nice aspects to this work:
- It builds on the cumulative work of the R community to provide a big league, big data application around open source R.
- It provides a flexible paradigm for implementing distributed / parallel applications that leverages existing R algorithms (e.g. the Cox model makes use of code in the survival package)
- It illustrates the ease with which R projects can be deployed in web services applications with Shiny and other R centric software such as DeployR
- It provides an alternative to building out infrastructure and aggregating data before realizing the benefits of a big data computation. (Prototyping calculations with distcomp might also serve to justify the expense and effort of developing centralized infrastructure.)
- It recognizes that privacy and other social concerns are important in big data applications and provides a model for respecting some of the social requirements for dealing with sensitive data.
Distcomp is new work and the developers acknowledge several limitations. (So far, they have only built out two algorithms and they don’t have a way to easily deal with factor data across the distributed data sets.) Nevertheless, the project appears to show great promise.