Update Your Machine Learning Pipeline With vetiver and Quarto

[This article was first published on RStudio | Open source & professional software for data science teams, and kindly contributed to R-bloggers.]

Machine learning operations (MLOps) is a set of best practices for running machine learning models successfully in production. Data scientists and system administrators have a growing number of options for setting up their pipelines. However, while many tools exist for preparing data and training models, streamlined tooling is still lacking for tasks like putting a model into production, maintaining it, and monitoring its performance.

Enter vetiver, an open-source framework for the entire model lifecycle. Vetiver provides R and Python programmers with a fluid, unified way of working with machine learning models.

Our Solutions Engineering team developed a Shiny app for Washington D.C.’s Capital Bikeshare program a few years ago. This app provides real-time predictions of the number of bikes available at stations across the city. The end-to-end machine learning pipeline feeding the app uses R to import and modify data, save it in a pin, develop a model, then move the model to a deployable location. Alex Gold delivered a presentation on this workflow in 2020.

Sam Edwardes updated the project to apply Quarto and the new vetiver framework. Previously, we used R Markdown and a combination of one-off functions and scripts for each MLOps task. Using the latest from RStudio:

  • Quarto provides a refreshed look and language-agnostic tool for computational documents. Like R Markdown documents, the Quarto documents are available on RStudio Connect.
  • The pipeline now uses vetiver to train, pin, monitor, and deploy the model.
    • This streamlines the code and makes the MLOps pipeline easier to maintain.
    • By using vetiver across the organization, we have a consistent way to perform MLOps tasks.
    • Deploying the model as an API endpoint using vetiver allows us to reuse the machine learning model for other apps or use cases.

We will walk through the updated pipeline below. To see the entire project, check out the Bike Predict page on solutions.rstudio.com.

Building A Predictive Web App With Shiny

The Shiny app predicts the number of bikes at a station in the near future based on real-time streaming data from an API. The steps involved are:

  • Write the latest station status data from the Capital Bikeshare API to a database
  • Join the station status data with the station information dataset and tidy the data
  • Train the model using this tidied dataset
  • Save and version the model to Connect as a pin using vetiver
  • Use the vetiver model card template to document essential facts and considerations of the deployed model
  • Use functions provided by vetiver to document and monitor model performance
  • Use the API endpoint to serve predictions to a Shiny app interactively
  • Make the Shiny app available to anybody interested in the predictions

The project shows an exciting set of capabilities, combining open source with RStudio’s professional products.

  • RStudio Workbench is a centralized, server-based environment for working with code.
  • RStudio Connect publishes and schedules data science assets like pins, APIs, and Quarto reports.
  • RStudio Package Manager (RSPM) controls and distributes packages throughout an organization.

Creating An End-to-End Machine Learning Pipeline

1. Create a custom package for pulling data

Capital Bikeshare has an API that publishes real-time system data. We created a set of helper functions for pulling the data. To increase efficiency, we wanted to reuse and share these functions.

For that, we created the bikehelpR package to house, document, and test these functions. We distribute the package with RSPM, which makes it available to everybody on our team via install.packages().
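As a rough sketch of what such a helper might look like, the function below pulls live station status from Capital Bikeshare's public GBFS feed. The function name and return shape are illustrative, not the actual bikehelpR API:

```r
library(httr)
library(jsonlite)
library(dplyr)

# Hypothetical bikehelpR-style helper (name assumed for illustration).
# Fetches the live station status feed and returns it as a tibble.
feed_station_status <- function() {
  url <- "https://gbfs.capitalbikeshare.com/gbfs/en/station_status.json"
  resp <- httr::GET(url)
  httr::stop_for_status(resp)
  parsed <- jsonlite::fromJSON(
    httr::content(resp, as = "text", encoding = "UTF-8")
  )
  dplyr::as_tibble(parsed$data$stations)
}
```

Packaging functions like this one means the ETL code, the model-training code, and the Shiny app all call the same tested implementation.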

2. Extract, transform, load process in R

The first step of the pipeline pulls the latest data from the Capital Bikeshare API using the bikehelpR package. We write the raw data to the Content Database’s bike_raw_data and bike_station_info tables.

The station info is also written to a pin. This pin will be accessed by the Shiny app so that it can extract the bike station info without connecting to the database. Read more about “production-izing” Shiny with pins.
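A minimal sketch of this step, assuming the raw feeds have already been read into data frames and that the database connection details are configured elsewhere (the DSN and pin name here are placeholders):

```r
library(DBI)
library(pins)

# Connection details are assumptions; use your organization's credentials
con <- DBI::dbConnect(odbc::odbc(), dsn = "ContentDatabase")

# Append the latest station status; refresh the station info table
DBI::dbWriteTable(con, "bike_raw_data", raw_status, append = TRUE)
DBI::dbWriteTable(con, "bike_station_info", station_info, overwrite = TRUE)

# Also pin the station info to RStudio Connect so the Shiny app
# can read it without a database connection
board <- pins::board_connect()  # uses CONNECT_SERVER / CONNECT_API_KEY
pins::pin_write(board, station_info, name = "bike_station_info")
```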

[Screenshot: the ETL Step 1 – Raw Data Refresh Quarto document, showing the description and code for pulling raw data.]

3. Tidy and join datasets

We tidy the bike_raw_data table using tidyverse packages. Then, we join it with the bike_station_info table and write the output into the Content Database’s bike_model_data table.
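A sketch of the tidy-and-join step with dplyr. The column names (time, num_bikes_available, station_id) and derived features are assumptions for illustration; the real tables may differ:

```r
library(dplyr)
library(lubridate)

# Assumes bike_raw_data and bike_station_info have been read into
# data frames, e.g. via DBI; column names are illustrative
bike_model_data <- bike_raw_data %>%
  mutate(
    hour = hour(time),
    date = as_date(time),
    dow  = wday(time, label = TRUE)
  ) %>%
  select(station_id, date, hour, dow, n_bikes = num_bikes_available) %>%
  inner_join(bike_station_info, by = "station_id")

# Write the modeling table back to the Content Database
DBI::dbWriteTable(con, "bike_model_data", bike_model_data,
                  overwrite = TRUE)
```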

[Screenshot: the ETL Step 2 – Tidy Data Quarto document, showing the description and code for getting data from a database.]

4. Train and deploy the model

We use the bike_model_data table to train and evaluate a random forest model. Using vetiver, we save the model to RStudio Connect as a pin and then deploy it there as an API endpoint. Pinning and deploying with vetiver ensures a consistent approach across the organization for how we version and deploy machine learning models.
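The core of this step with tidymodels and vetiver might look like the following sketch. The model formula, pin name, and Connect user name are placeholders, not the project's exact code:

```r
library(parsnip)
library(vetiver)
library(pins)

# Train a random forest (formula and predictors are illustrative)
model <- rand_forest(mode = "regression") %>%
  set_engine("ranger") %>%
  fit(n_bikes ~ hour + dow + lat + lon, data = train_data)

# Wrap the fitted model with metadata, then version it as a pin
v <- vetiver_model(model, "bike_predict_model")
board <- board_connect()
vetiver_pin_write(board, v)

# Deploy the pinned model as a Plumber API on RStudio Connect
vetiver_deploy_rsconnect(board, "user_name/bike_predict_model")
```

Because the pin is versioned, retraining on fresh data is just a matter of rerunning this document on a schedule; the API picks up the new model version.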

[Screenshot: the Model Step 1 – Train and Deploy Model Quarto document, showing the description and code for modeling.]

5. Create a model card

Next, we evaluate the model on the training and evaluation data using various methods. Vetiver's model card template helps document essential facts and considerations of the deployed model.
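Getting started with the template is a one-liner (assuming a recent version of the vetiver R package, which ships a model card template):

```r
library(vetiver)

# Copies vetiver's model card template into the current project as an
# editable document; fill in the sections on intended use, performance
# metrics, and ethical considerations for the deployed model
use_model_card()
```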

[Screenshot: the Model Step 2 – Model Card Quarto document, describing the model card.]

6. Monitor model metrics

We can document model performance using vetiver and write the metrics to a pin on RStudio Connect. With these functions, we can monitor for model performance degradation. Using vetiver to monitor model performance again ensures a consistent approach to model governance across teams.
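A hedged sketch of the monitoring step. The data frame new_data, its date column, and the truth/estimate column names are assumptions for illustration:

```r
library(vetiver)
library(pins)

# new_data: recent observations with a date column, the observed
# n_bikes, and the model's .pred column (names assumed)
metrics <- vetiver_compute_metrics(
  new_data,
  date_var = date,
  period   = "week",
  truth    = n_bikes,
  estimate = .pred
)

# Version the metrics as a pin on RStudio Connect, then plot them
# over time to watch for performance degradation
board <- board_connect()
vetiver_pin_metrics(board, metrics, "bike_predict_model_metrics")
vetiver_plot_metrics(metrics)
```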

[Screenshot: the Model Step 3 – Model Metrics Quarto document, describing the background and displaying metrics tables.]

7. Deploy a Shiny app that displays real-time predictions

We use the API endpoint to serve predictions to a Shiny app interactively. Clicking on a station shows a line graph of the predicted number of bikes over time.
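From the Shiny server's point of view, calling the deployed model is just an HTTP request, which vetiver wraps for us. The URL and the new-data columns below are placeholders:

```r
library(vetiver)
library(tibble)

# Placeholder URL for the API deployed on RStudio Connect
endpoint <- vetiver_endpoint(
  "https://connect.example.com/bike-predict/predict"
)

# Inside the Shiny server: build new data for the clicked station
# (column names assumed) and request predictions from the endpoint
new_data <- tibble(
  hour = 0:23,
  dow  = "Mon",
  lat  = 38.9,
  lon  = -77.0
)
preds <- predict(endpoint, new_data)
```

Because the model lives behind an API rather than inside the app, other apps or services can reuse the same predictions without bundling the model.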

[Animation: the Shiny app, with a map of Capital Bikeshare predictions on top and a line graph of the predicted number of bikes at the selected station below; clicking stations on the map updates the graph.]

Link to Shiny App

8. Create project dashboard

This project is composed of many different tasks. We wanted a single place to share the full context and content with others. We created a dashboard made with connectwidgets to link to the entire project. This makes it easy for anybody new to the Bike Share app to understand its purpose and steps involved.
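A minimal connectwidgets sketch for such a dashboard, assuming the project's content is tagged on Connect (the server URL and tag name are placeholders):

```r
library(connectwidgets)
library(dplyr)

# Authenticate against RStudio Connect (credentials assumed to be
# set in the environment)
client <- connect(
  server  = "https://connect.example.com",
  api_key = Sys.getenv("CONNECT_API_KEY")
)

# Filter the server's content to this project's tag and render it
# as linked cards in the dashboard document
client %>%
  content() %>%
  by_tag("Bike Predict") %>%
  rsc_card()
```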

[Screenshot: the Bike Share project page on RStudio Connect, describing the project and including an image of the machine learning pipeline.]

Link to Dashboard

See the entire updated pipeline here:

[Diagram: the pipeline for the Bike Share prediction app. The ETL feeds pinned datasets, the pinned model, and a live database connection; the API serves predictions to the dev and prod client apps, with an internal package supporting the whole pipeline.]

Learn More

We hope that you enjoyed this example of using vetiver, pins, and RStudio Connect to create an end-to-end machine learning pipeline. Folks in machine-learning-heavy contexts can use vetiver to streamline their work and easily “production-ize” content.

Join Julia Silge and Isabel Zimmerman to learn more about MLOps with vetiver in Python and R at the RStudio Enterprise Meetup on September 20th!

Add the event to your calendar