rstudio::conf was held in San Diego at the end of January, bringing together a wide range of speakers, topics, and attendees. Covering all of it would require several people and a lot of space, but I’d like to highlight two broad topics that received a lot of coverage: new tools for
shiny and enhanced modeling capabilities for R.
Several speakers introduced a collection of new tools for enhancing the capabilities of Shiny developers: asynchronous processing, simplified functional testing, and load testing are all coming to the
shiny world. These are common, standardized tools in the non-R web development world, so it’s great to see the
shiny ecosystem maturing and allowing those of us that prefer R to have access to the same tooling.
Have you ever tried to have more than one user on a shiny app, where each user triggers a time-consuming operation, like training a model or hitting a database with a slow query? It doesn’t work very well. One user logs on, starts training a model, and now no user can even connect to the server until that operation is finished. This is because R and
shiny are single-threaded by default, so while that model trains it is tying up the only R process, preventing it from doing anything else. One way to handle this today is to use Shiny Server Pro and let it spin up enough processes that there is one for each connected user. But what about those of us that can’t afford the Pro license, or have so many users we can’t afford a process per user? Enter the
promises package, which is built on top of the excellent
future package and introduces a new operator,
%...>%, to build promise objects. Promises will execute and run as background processes, so they won’t block other users from logging in and doing their own thing on the same app instance. I’ll let Joe Cheng’s slides provide more detail, but I’m pretty excited about the possibilities here. Joe made sure to point out that
promises aren’t specific to shiny either; we may see plenty of other cool asynchronous work being done in R in the near future.
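To make the idea concrete, here is a minimal sketch outside of Shiny, assuming the promises and future packages are installed. The `slow_fit()` helper is a hypothetical stand-in for a slow operation; it is not from Joe's talk.

```r
library(promises)
library(future)

# Run futures in separate background R sessions instead of the main process.
plan(multisession)

# Hypothetical stand-in for a slow operation such as training a model.
slow_fit <- function() {
  Sys.sleep(2)
  lm(mpg ~ wt, data = mtcars)
}

# future() starts the work in the background and returns immediately;
# %...>% registers what to do with the result once it is ready.
coef_promise <- future(slow_fit()) %...>% coef()

# In a Shiny app the chain would be returned from a reactive or render
# function; here we only confirm that we got a promise back.
is.promise(coef_promise)
```

The main R process is free as soon as `future()` returns, which is what lets other users keep interacting with the same app instance.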
Testing a shiny app is currently a bit painful. You can unit-test any underlying functions you might have (at Methods we try to put all our non-reactive stuff into files of pure functions), but if you want to test how the app itself behaves, like “if I click this button, does the right thing happen?” (functional testing), you’re probably stuck using Selenium (via
RSelenium) to emulate a browser and the clicking for you. This works, but it’s notoriously buggy and brittle. Winston Chang is spearheading work on a new
shinytest package to address this.
shinytest avoids some of the brittleness of Selenium by using a headless browser to execute tests, and perhaps more importantly, provides an easy and intuitive way to create tests. Rather than writing R code directly, as you might for unit tests,
shinytest lets you record yourself performing actions on your actual app, and the package will generate code to represent those actions in an efficient way. You can then replay those actions, and ensure the output is the same. Output is stored as both
png images and
json, which gives you the option to visually inspect the end state of a test, while also providing a
git-diffable output for more deterministic and automated comparisons. There’s a whole page on the website linked above on integrating
shinytest with CI systems like Travis or CircleCI; this is what I’m most excited about from this project. Build an app, record some tests, then the next time you change the app you can let your CI system figure out if you broke the old behavior or not. Why do something yourself when you can tell a computer to do it for you?
When trying to make a particularly important shiny app, you might start by unit-testing functions that do the core work, then use
shinytest to make sure the app built around those functions behaves correctly. This will help ensure your code keeps working far into the future, but what if you want to make sure your app is scalable, so it can handle hundreds or thousands of simultaneous users? You need to load test your application.
Load testing is more complex than unit or functional testing because it’s not enough to just test the code; you also need to test the environment the code runs in. An app hosted on a single server running Shiny Server Open Source will scale very differently from one hosted on a cluster of servers each running RStudio Connect.
The tool introduced here,
shinyloadtest by Sean Lopp, is in very early stages of development, so it’s not quite ready for prime-time usage. That being said, Sean ran a very impressive demo where he used a cluster of servers to generate 10,000 (!!!) connections to an app hosted on a cluster of RStudio Connect servers. I was impressed not only by the tool he was demoing, but by the complexity of the environment he was managing; lots of servers and services (EC2, ALB, Grafana, etc.) all orchestrated to work together. I won’t do it justice here, so check out his slides for a better idea of what he pulled off in this (live!) demo.
Even without that testing-side firepower, Sean said he wants this package to be able to simulate at least 1000 connections from a modest laptop, so efficiency is a big concern for him here. I suspect that will also hit a sweet spot for a lot of
shiny users; we never need to handle 1000 simultaneous connections, but I’d love to be sure some of our apps can handle dozens or hundreds of connections, so I’ll be monitoring the progress of this package.
Modeling, particularly predictive modeling using machine learning, can be a touchy subject in the R world. The python ecosystem is pretty dominant here:
scikit-learn is insanely easy to use (and fast), and the deep learning community has very clearly coalesced around python as evidenced by the popularity of frameworks like TensorFlow, Keras, and PyTorch. The RStudio folks seem set to bring R up to par in this realm, and are working hard to make better tools for ML workflows in R.
The future of Caret
If you do much machine learning in R, you’re likely familiar with the
caret package, a common interface for hundreds of other modeling packages in R.
caret is primarily the work of Max Kuhn, whose name you might also recognize from the book “Applied Predictive Modeling” (I refer to it often) he wrote with Kjell Johnson. As wonderful a tool as
caret is, it lags behind
sklearn in many ways. Luckily RStudio hired Max last fall to support his efforts to build a more
tidyverse-friendly set of modeling tools in R.
In my most-anticipated and favorite talk of the conference, Max made it clear the future of tidy-modeling is in many smaller packages, rather than the monolith approach
caret took. Some of the packages being worked on include:
rsample for setting up bootstrap, cross-validation, and other data resampling techniques
recipes for preprocessing (scaling, centering, etc.)
parsnip for the core modeling
yardstick for computing model metrics
tidyposterior for post-hoc model comparison
I’m already using and liking
yardstick. They’re all quite straightforward and play nice with
caret; check out their Github and
pkgdown sites linked above.
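As a rough illustration of how two of these pieces fit together, here is a minimal sketch that uses rsample to build cross-validation folds and yardstick to score each one. The dataset and formula are arbitrary choices for illustration, and the calls assume current releases of both packages.

```r
library(rsample)
library(yardstick)

set.seed(1)

# Split the data into 5 cross-validation folds.
folds <- vfold_cv(mtcars, v = 5)

# Fit on the analysis portion of each fold and score on the held-out portion.
rmse_per_fold <- vapply(folds$splits, function(split) {
  fit <- lm(mpg ~ wt + hp, data = analysis(split))
  held_out <- assessment(split)
  rmse_vec(truth = held_out$mpg, estimate = predict(fit, held_out))
}, numeric(1))

# One RMSE value per fold, plus the average across folds.
rmse_per_fold
mean(rmse_per_fold)
```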
parsnip is very ambitious; Max wants this to be an even higher-level abstraction than
caret tries to be. He wants us to be able to give a model type (“random forest”, “SVM”, etc.) and a computation target (“R”, “STAN”, “Spark”, “TensorFlow”) and let the package figure out the rest (
sparklyr, etc.). This includes making sure we don’t need to worry about which argument name controls tree count in a given package;
parsnip will settle on one name and translate as needed for the underlying package being used. This is perhaps the least mature of the packages listed above; Max warned he’s still working out the ideal syntax here, but once he does, he expects development to follow rapidly.
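For a sense of what this looks like, here is a small sketch using the interface from later parsnip releases. Since the syntax was still unsettled at the time of the talk, treat this as an assumption about where it ended up, not as what Max presented.

```r
library(parsnip)

# One specification, described in parsnip's own vocabulary ("trees").
rf_spec <- rand_forest(mode = "regression", trees = 500)

# Choose a computation engine; the underlying package must be installed
# before the model can actually be fit.
rf_ranger <- set_engine(rf_spec, "ranger")
rf_randomforest <- set_engine(rf_spec, "randomForest")

# translate() shows how the generic argument maps onto each engine.
translate(rf_ranger)
translate(rf_randomforest)
```

The same `trees = 500` should show up under each engine's own argument name in the translated call, which is exactly the bookkeeping we no longer have to do by hand.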
tidyposterior represents additional functionality in a way the others don’t; you can already do most of what the other packages offer in R or in python.
tidyposterior is all about fancy Bayesian-based approaches for choosing between multiple trained models. I don’t understand the possibilities here enough to say much (check out the link above), but selecting models in an unbiased and not-over-fit way can be super tricky, so I welcome more tools here, especially ones based on strong underlying statistical theory as this one seems to be.
These tools complement Dave Robinson’s
broom package (for tidying up model outputs), but replace the
modelr package. If you’re using
modelr to set up your cross-validation samples, as I was, check out
rsample first; it already does everything I used modelr for.
Max’s slides have a lot more info; check them out here.
TensorFlow, Keras, and beyond
In the second keynote at
rstudio::conf, JJ Allaire (Founder and CEO of RStudio) made it clear he wants the R environment to be a smart choice for working with deep learning models. JJ and his team have provided three distinct ways to specify TF models, each targeting a distinct abstraction level:
tensorflow for building compute graphs directly in TF
tfestimators for using pre-built models in TF, like TF- and GPU-powered random forests (yeah, I love random forests)
Keras, a high-level API for building deep learning models which can run in TensorFlow as well as other computation environments
To help spur Keras adoption among R users, JJ worked with Keras author Francois Chollet to port the wonderful, Keras-oriented “Deep Learning with Python” book to R: Deep Learning with R. I haven’t ordered this one yet, but from reading the python one I feel comfortable recommending it: Francois is a really great author and instructor, and the only difference here should be the code examples. He’s also one of my favorite people to follow on twitter!
In addition to the above packages, there are several others that either JJ or another speaker covered:
tfdatasets for streaming data from disk for training TF models
tfruns for tracking different models you experiment with
cloudml for interacting with Google’s managed TF-model service to simplify training, tuning, and deployment
tfdeploy for deploying trained TF models to many places
Another cool tool JJ highlighted is
Greta, a STAN-like MCMC library in R built on TensorFlow. This isn’t from RStudio, but is a testament to the quality of the tools JJ and RStudio have worked hard to build.
I also found it interesting that all this TensorFlow stuff is the real reason RStudio created the
reticulate package, a new way to use python from within R, specifically so they could expose the power of all these python-specific TensorFlow-related tools to the R community. Even for non-TF uses, I can say from experience that
reticulate is vastly superior to previous options.
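Here is a minimal sketch of what calling python from R looks like, assuming reticulate can find a Python installation on your machine. It sticks to Python's standard library, so no extra Python packages are needed.

```r
library(reticulate)

# Import a module from Python's standard library.
py_math <- import("math")

# R numerics are converted to Python floats on the way in,
# and the result is converted back to an R numeric.
py_math$sqrt(16)

# Run arbitrary Python code, then read the resulting variable from R via `py`.
py_run_string("squares = [i ** 2 for i in range(5)]")
py$squares
```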
You can find JJ’s keynote slides here.
That covers some of my favorite talks and tools from
rstudio::conf 2018, but that only scratches the surface; there were many other excellent talks, and even more happening during the ones I attended. RStudio’s official repository of talks can be found here, but I’d recommend starting with this list from Peter Simecek instead.