Written by Hai Qian & Woo J. Jung of Pivotal Data Labs
When discussing data science tools, people often debate passionately about algorithm breadth, scalability, and performance among the many available options. Yet usability, one of the most important aspects to consider when choosing a data science tool, is often ignored in these discussions.
We believe that usability is perhaps the most important aspect to consider when selecting data science tools. In day-to-day settings, a data scientist should be focusing on what she wants to do with the data, rather than having to determine the technical aspects of how she is going to get there. Unfortunately, we’re at a stage where the how takes up a large chunk of a typical data scientist’s workflow.
There are a number of user-friendly data science tools available: R, Python, SAS, Stata, and more. In particular, the S language (R being an open-source implementation of S) was designed specifically for data analysis. While these tools offer excellent, interactive interfaces for performing data science, they face scalability and performance challenges when end users transition from small to big data.
At Pivotal, we asked ourselves this question: Wouldn’t it be great if there were a way to harness the familiarity and usability of a tool like R, and at the same time take advantage of the performance and scalability benefits of in-database/in-Hadoop computation?
Within our team, the clear answer was yes. We realized that we could achieve this by building a tool, specifically an R package, that translates R code into SQL and sends it to the database for execution.
We’re excited to announce that this tool, PivotalR, is available on GitHub to download and use today. PivotalR is an R library with a familiar user interface that enables data scientists to perform in-database/in-Hadoop computations. While data scientists interact with a familiar R environment, the heavy computations all occur “under the hood” in the database or Hadoop cluster.
PivotalR builds on R’s tradition of providing an interface with backend routines that run when needed in other languages or environments, operating at various levels of abstraction (e.g., Fortran, C, or Stan). This framework allows the data scientist to express ideas and interact with R’s unified, user-friendly interface, while allowing her to piggy-back on faster subroutines or pre-existing tools when it makes sense to do so.
We believe that PivotalR takes this paradigm a step further. Traditionally, these backend subroutines are executed on the same hardware as the R client itself (e.g., your laptop or a dedicated R server). This is fine when working with reasonably sized datasets, but can quickly become problematic when working with big data. For practitioners who have powerful, dedicated hardware for their database or Hadoop cluster, it’s unfortunate to leave all that computing power unused and spend time moving the data from the data store to a laptop or server for modeling. Furthermore, there is no guarantee that this laptop or server would have enough memory to run these models.
PivotalR’s backend SQL queries can run either on a local database on one’s laptop or directly in the dedicated database or Hadoop cluster. This framework allows for the best of both worlds: a familiar, user-friendly R interface provided by the client machine, with highly scalable, parallelized computing capabilities available through the database or Hadoop cluster.
As a short side note, if you haven’t been exposed to the availability of canned SQL functions that execute sophisticated machine learning routines like Elastic Net and Latent Dirichlet Allocation directly in database/Hadoop, we invite you to check out the open-source library MADlib. It’s worth mentioning that PivotalR piggy-backs substantially on MADlib.
The diagram below is our attempt to illustrate the mechanics of PivotalR’s design. At its core, an R function in PivotalR:
- Translates R code into corresponding SQL statements in the R client
- Executes these statements in the database or Hadoop cluster
- Returns summarized output to R
This design has several consequences:
- MADlib’s in-database machine learning functions can be called directly from R
- The syntax is analogous to that of native R functions
- Data doesn’t need to leave the database
- All heavy lifting, including model estimation and computation, is done in the database
This framework allows practitioners to benefit from the scalability and performance of in-database/in-Hadoop analytics without leaving the R command line. We leverage RPostgreSQL as the communication bridge between the database or Hadoop cluster and the client machine. All of the heavy lifting, including model estimation and computation, is done in-database/in-Hadoop.
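To make the translate, execute, and summarize steps concrete, here is a minimal sketch in Python that uses the standard-library sqlite3 module as a stand-in database. It is not PivotalR’s implementation or API: the LazyTable class, its methods, and the houses table are hypothetical names chosen purely for illustration. The point is only that the client builds a SQL string, the database does the work, and a single summary value comes back.

```python
import sqlite3


class LazyTable:
    """Hypothetical client-side handle to a table that stays in the database."""

    def __init__(self, conn, name):
        self.conn = conn
        self.name = name

    def mean_sql(self, column):
        # Step 1: translate the client-side request into a SQL statement.
        return f'SELECT AVG("{column}") FROM "{self.name}"'

    def mean(self, column):
        # Step 2: execute the statement inside the database.
        cursor = self.conn.execute(self.mean_sql(column))
        # Step 3: return only the summarized result to the client.
        (value,) = cursor.fetchone()
        return value


# An in-memory SQLite database stands in for the real database or cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (price REAL, size REAL)")
conn.executemany(
    "INSERT INTO houses VALUES (?, ?)",
    [(100.0, 50.0), (200.0, 80.0), (300.0, 110.0)],
)

houses = LazyTable(conn, "houses")
generated_sql = houses.mean_sql("price")
avg_price = houses.mean("price")
print(generated_sql)
print(avg_price)
```

The rows never reach the client in this sketch; only the generated SQL goes out and one number comes back, which is the property that matters when the table is too large to move.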
The principal design philosophy behind PivotalR is to not compromise the “R-ness” of the user experience. This is a common approach among R contributors who leverage subroutines in the backend, and their efforts are well-appreciated by end users in the R community. For example, the PivotalR function for linear regression, madlib.lm(), is pretty much identical in look-and-feel to R’s native lm() function. For those of you who have had the unfortunate experience of manually creating indicators for a categorical variable with many distinct values in SQL, we are happy to say that PivotalR supports automated dummy variable coding à la as.factor().
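Automated dummy coding can be pictured as generating one SQL CASE expression per category level. The sketch below, in Python with the standard-library sqlite3 module as a stand-in database, illustrates that general technique under our own assumptions; it is not PivotalR’s actual code, and dummy_columns_sql and the homes table are hypothetical. It drops the first level as the reference category, mirroring R’s default treatment contrasts.

```python
import sqlite3


def dummy_columns_sql(conn, table, column):
    """Build one 0/1 indicator expression per level, skipping the first
    (reference) level. Hypothetical helper for illustration only."""
    # Ask the database for the distinct levels; only these few values
    # travel to the client, not the rows themselves.
    query = f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY 1'
    levels = [row[0] for row in conn.execute(query)]
    parts = []
    for level in levels[1:]:
        # Escape single quotes so the level is a valid SQL string literal.
        literal = "'" + str(level).replace("'", "''") + "'"
        parts.append(
            f'CASE WHEN "{column}" = {literal} THEN 1 ELSE 0 END '
            f'AS "{column}_{level}"'
        )
    return ", ".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE homes (city TEXT)")
conn.executemany(
    "INSERT INTO homes VALUES (?)",
    [("boston",), ("austin",), ("chicago",), ("austin",)],
)

indicators = dummy_columns_sql(conn, "homes", "city")
rows = conn.execute(
    f"SELECT city, {indicators} FROM homes ORDER BY rowid"
).fetchall()
print(indicators)
print(rows)
```

Writing these expressions by hand for a variable with hundreds of levels is exactly the chore that automated coding removes.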
If you’re anything like us and have become accustomed to R’s convenience operators, you may find geek-comfort in knowing that code like the following is also fully supported:
We greatly prioritized the look-and-feel of R while developing PivotalR. We look forward to showing the fruits of this labor in a series of future blog posts, each of which will walk through one or two use cases using working R code examples. In the meantime, we invite you to get started with PivotalR by visiting the PivotalR GitHub page and watching this video demo.
Hai Qian is a Senior Software Engineer on the Pivotal Predictive Analytics team.
Woo J. Jung is a Senior Data Scientist at Pivotal Data Labs.