Syberia: A development framework for R code in production

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

Putting R code into production generally involves orchestrating the execution of a series of R scripts. Even if much of the application logic is encoded into R packages, a run-time environment typically involves scripts to ingest and prepare data, run the application logic, validate the results, and operationalize the output. Managing those scripts, especially in the face of working with multiple R versions, can be a pain — and worse, very complex scripts are difficult to understand and reuse for future applications.

That's where Syberia comes in: an open-source framework created by Robert Krzyzanowski and other engineers at the consumer lending company Avant. There, Syberia has been used by more than 30 developers to build a production data modeling system. In fact, building production R systems was the motivating tenet of Syberia:

Developing classifiers using the Syberia modeling engine follows one primary tenet: development equals production. When a web developer experiments with a new website layout or feature, they are targeting a production system. After they are satisfied with their work, they push a button and deploy the code base so it is live and others can interact with it.

Feature engineering and statistical modeling … should belong to the same class of work. When an architect designs a skyscraper their work has to be translated to reality through a construction team by replaying the design using a physical medium. This is the current industry standard for machine learning: prototype the model in one language, typically a Python or R “notebook,” and then write it in a “solid production-ready” language so it can survive the harsh winds of the real world.

This is wrong.

In much the same way that ggplot2 is a completely different way of thinking about R graphics, Syberia is a completely different way of thinking about R scripts. It's also similarly difficult to get your head around at first, but once you do, it reveals itself as an elegant and customizable way of managing complex and interconnected R code — and a much better solution than simply source-ing an 800-line R script.

At its core, Syberia defines a set of conventions for defining the steps in a data-analysis workflow, and specifying them with a collection (in real-world projects, a large collection) of small R scripts in a standardized folder structure. Collectively, these define the complete data analysis process, which you can execute with a simple R command: run. To make modifying and maintaining this codebase (which you'd typically manage in a source-code control system) easier, Syberia is designed to isolate dependencies between files. For example, rather than specifying a file name and format (say, "data.csv") in a script that reads data, you'd instead define “adapters” to read and write data in a script under the adapters folder, such as adapters/csv.R:

# ./adapters/csv.R
read <- function(key) {
  read.csv(file.path("/some/path", paste0(key, ".csv")))
}

write <- function(value, key) {
  write.csv(value, file.path("/some/path", paste0(key, ".csv")))
}

Syberia will then use those "read" and "write" adapters to connect with your data. That way, when you later decide to source the data from a database, you can just write new adapters rather than trying to find the lines dealing with data I/O in a massive script. (This also helps avoid conflicts when working in a large code-base with multiple developers.) Syberia defines similar "patterns" for data preparation (including feature generation), statistical modeling, and testing; the "run" function conceptually synthesizes the entire codebase into a single R script to run.
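To make the adapter idea concrete, here is a minimal sketch in plain R. It is not Syberia's actual API: make_csv_adapter, make_memory_adapter, and summarize_rows are hypothetical names, and the temporary directory and in-memory store are stand-ins chosen only for illustration. It shows analysis code that depends only on a read/write pair, so the storage backend can change without the analysis changing.

```r
# Hypothetical illustration, not Syberia's API: two adapters exposing the
# same read/write interface, so calling code never names a file or format.

# Adapter backed by CSV files in a directory.
make_csv_adapter <- function(dir) {
  path <- function(key) file.path(dir, paste0(key, ".csv"))
  list(
    read  = function(key) read.csv(path(key), stringsAsFactors = FALSE),
    write = function(value, key) write.csv(value, path(key), row.names = FALSE)
  )
}

# Adapter backed by an in-memory environment (a stand-in for a database).
make_memory_adapter <- function() {
  store <- new.env()
  list(
    read  = function(key) get(key, envir = store, inherits = FALSE),
    write = function(value, key) assign(key, value, envir = store)
  )
}

# Analysis code sees only the adapter, never the storage details.
summarize_rows <- function(adapter, key) {
  nrow(adapter$read(key))
}

csv_adapter <- make_csv_adapter(tempdir())
mem_adapter <- make_memory_adapter()

# Store the same five rows through each backend.
csv_adapter$write(head(mtcars, 5), "cars")
mem_adapter$write(head(mtcars, 5), "cars")

# The same analysis function works unchanged against either adapter.
n_csv <- summarize_rows(csv_adapter, "cars")
n_mem <- summarize_rows(mem_adapter, "cars")
```

Moving from files to a database would then mean writing one more adapter constructor, leaving summarize_rows as it is.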

Syberia also encourages you to break up your process into a series of distinct steps, each of which can be run (and tested) independently. It also has a make-like feature, in that results from intermediate steps are cached, and do not need to be re-run each time unless their dependencies have been modified.
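The caching behavior can be sketched in a few lines of base R. This is only an illustration of the general make-like idea, not Syberia's implementation: run_step, the cache environment, and the clean step are hypothetical, and a step's "dependency" is simplified here to the single input object passed to it.

```r
# Hypothetical sketch of make-like caching, not Syberia's implementation.
# A step is re-run only when its input differs from the last input seen.

cache <- new.env()

run_step <- function(name, fn, input) {
  entry <- cache[[name]]
  if (!is.null(entry) && identical(entry$input, input)) {
    return(entry$output)  # input unchanged: reuse the cached result
  }
  output <- fn(input)
  cache[[name]] <- list(input = input, output = output)
  output
}

# Count how often the (pretend-expensive) step body actually executes.
calls <- 0
clean <- function(df) {
  calls <<- calls + 1
  df[complete.cases(df), , drop = FALSE]
}

raw <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

first  <- run_step("clean", clean, raw)  # runs the step
second <- run_step("clean", clean, raw)  # served from the cache
calls_after_repeat <- calls

raw$x[2] <- 2                            # the dependency changes
third <- run_step("clean", clean, raw)   # so the step runs again
calls_after_change <- calls
```

Because clean is an ordinary function of its input, it can also be called and tested on its own, outside run_step.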

Syberia can also be used to associate specific R versions with scripts, or even other R engines like Microsoft R. I was extremely impressed when, during a 30-minute break at the R/Finance conference last month, Robert was able to sketch out a Syberia implementation of a modeling process using the RevoScaleR library. In fact, Robert's talk from the conference, embedded below, provides a nice introduction to Syberia.

Syberia is available now under the open-source MIT license. (Note: git is required to install Syberia.) To get started with Syberia, check out the documentation, which is available at the Syberia website linked below.

Syberia: The development framework for R
