Update Your Machine Learning Pipeline With vetiver and Quarto

Posted on September 12, 2022 by RStudio | Open source & professional software for data science teams on RStudio in R bloggers

[This article was first published on RStudio | Open source & professional software for data science teams on RStudio, and kindly contributed to R-bloggers].

Machine learning operations (MLOps) are a set of best practices for running machine learning models successfully in production environments. Data scientists and system administrators have expanding options for setting up their pipelines. However, while many tools exist for preparing data and training models, streamlined tooling is lacking for tasks like putting a model in production, maintaining the model, or monitoring its performance.

Enter vetiver, an open-source framework for the entire model lifecycle. Vetiver provides R and Python programmers with a fluid, unified way of working with machine learning models.

Our Solutions Engineering team developed a Shiny app for Washington D.C.’s Capital Bikeshare program a few years ago. This app provides real-time predictions of the number of bikes available at stations across the city. The end-to-end machine learning pipeline feeding the app uses R to import and modify data, save it in a pin, develop a model, then move the model to a deployable location. Alex Gold delivered a presentation on this workflow in 2020.

Sam Edwardes updated the project to apply Quarto and the new vetiver framework. Previously, we used R Markdown and a combination of one-off functions and scripts for each MLOps task. Using the latest from RStudio:

- Quarto provides a refreshed look and a language-agnostic tool for computational documents. Like R Markdown documents, the Quarto documents are available on RStudio Connect.
- The pipeline now uses vetiver to train, pin, monitor, and deploy the model.
  - This streamlines the code and makes the MLOps pipeline easier to maintain.
  - By using vetiver across the organization, we have a consistent way to perform MLOps tasks.
  - Deploying the model as an API endpoint using vetiver allows us to reuse the machine learning model for other apps or use cases.

We will walk through the updated pipeline below. To see the entire project, check out the Bike Predict page on solutions.rstudio.com.

Building A Predictive Web App With Shiny

The Shiny app predicts the number of bikes at a station in the near future based on real-time streaming data from an API. The steps involved are:

- Write the latest station status data from the Capital Bikeshare API to a database
- Join the station status data with the station information dataset and tidy the data
- Train the model using this tidied dataset
- Save and version the model to Connect as a pin using vetiver
- Use the vetiver model card template to document essential facts and considerations of the deployed model
- Use functions provided by vetiver to document and monitor model performance
- Use the API endpoint to serve predictions to a Shiny app interactively
- Make the Shiny app available to anybody interested in the predictions

The project shows an exciting set of capabilities, combining open-source tools with RStudio’s professional products:

- RStudio Workbench is a centralized, server-based environment for working with code.
- RStudio Connect publishes and schedules data science assets like pins, APIs, and Quarto reports.
- RStudio Package Manager (RSPM) controls and distributes packages throughout an organization.

Creating An End-to-End Machine Learning Pipeline

1. Create a custom package for pulling data

Capital Bikeshare has an API that publishes real-time system data. We created a set of helper functions for pulling the data. To increase efficiency, we wanted to reuse and share these functions.

For that, we created the bikehelpR package to house, document, and test the functions we used. To deploy the package, we used RSPM. RSPM makes it easy to host an internal package and make it available via install.packages() to everybody on our team.

2. Extract, transform, load process in R

The first step of the pipeline pulls the latest data from the Capital Bikeshare API using the bikehelpR package. We write the raw data to the Content Database’s bike_raw_data and bike_station_info tables.

The station info is also written to a pin. This pin will be accessed by the Shiny app so that it can extract the bike station info without connecting to the database. Read more about “production-izing” Shiny with pins.

ETL Step 1 – Raw Data Refresh Quarto Document

3. Tidy and join datasets

We tidy the bike_raw_data table using tidyverse packages. Then, we join it with the bike_station_info table and write the output into the Content Database’s bike_model_data table.
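The project does this step in R with tidyverse packages. As a language-neutral illustration of the idea, here is a minimal sketch in plain Python. The field names (station_id, num_bikes_available, last_reported, name, lat, lon), the derived hour and weekday features, and the sample rows are assumptions made for the example, not the project's actual schema.

```python
from datetime import datetime, timezone


def tidy_and_join(raw_status, station_info):
    """Inner-join raw status rows to station metadata and derive time features."""
    info_by_id = {row["station_id"]: row for row in station_info}
    tidy = []
    for row in raw_status:
        info = info_by_id.get(row["station_id"])
        if info is None:
            continue  # drop status rows with no matching station (inner join)
        reported = datetime.fromtimestamp(row["last_reported"], tz=timezone.utc)
        tidy.append({
            "station_id": row["station_id"],
            "name": info["name"],
            "lat": info["lat"],
            "lon": info["lon"],
            "hour": reported.hour,           # hour of day in UTC
            "weekday": reported.weekday(),   # Monday is 0
            "n_bikes": int(row["num_bikes_available"]),  # cast text to integer
        })
    return tidy


# Hypothetical sample rows standing in for bike_raw_data and bike_station_info.
raw_status = [
    {"station_id": "31000", "num_bikes_available": "7", "last_reported": 1662984000},
    {"station_id": "99999", "num_bikes_available": "3", "last_reported": 1662984000},
]
station_info = [
    {"station_id": "31000", "name": "Example Station", "lat": 38.86, "lon": -77.05},
]

bike_model_data = tidy_and_join(raw_status, station_info)
```

The second status row has no matching station, so it is dropped. Whether the real pipeline uses an inner join or keeps unmatched rows is not stated in this post.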

ETL Step 2 – Tidy Data Quarto Document

4. Train and deploy the model

We use the bike_model_data table to train and evaluate a random forest model. Using vetiver, we save the model to RStudio Connect as a pin, convert it into an API, and deploy that API to RStudio Connect. Using vetiver for each of these steps ensures a consistent approach across the organization for how we pin, version, and deploy machine learning models.
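To make the idea of a versioned pin concrete, here is a toy sketch in plain Python. It is not the pins or vetiver API: the InMemoryBoard class, its method names, and the stand-in "model" dictionaries are all hypothetical, and a real board stores pins on RStudio Connect rather than in memory.

```python
import hashlib
import pickle


class InMemoryBoard:
    """Toy stand-in for a pin board: keeps every version written under a name."""

    def __init__(self):
        self._pins = {}

    def pin_write(self, name, obj):
        """Serialize obj, store it as a new version, and return its version id."""
        payload = pickle.dumps(obj)
        version = hashlib.sha256(payload).hexdigest()[:12]
        self._pins.setdefault(name, []).append({"hash": version, "payload": payload})
        return version

    def pin_read(self, name, version=None):
        """Return the latest version by default, or a specific one by id."""
        versions = self._pins[name]
        if version is None:
            return pickle.loads(versions[-1]["payload"])
        for stored in versions:
            if stored["hash"] == version:
                return pickle.loads(stored["payload"])
        raise KeyError(f"no version {version!r} for pin {name!r}")

    def pin_versions(self, name):
        """List version ids from oldest to newest."""
        return [stored["hash"] for stored in self._pins[name]]


board = InMemoryBoard()
# Hypothetical stand-ins for two trained models.
v1 = board.pin_write("bike_predict_model", {"trees": 100, "trained_on": "2022-09-11"})
v2 = board.pin_write("bike_predict_model", {"trees": 250, "trained_on": "2022-09-12"})

latest = board.pin_read("bike_predict_model")                # newest version
previous = board.pin_read("bike_predict_model", version=v1)  # roll back by id
```

Keeping every version is what makes it possible to retrain on a schedule and still roll back to, or compare against, an earlier model.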

Model Step 1 – Train and Deploy Model

5. Create a model card

Next, we examine the training and evaluation data using various methods. Vetiver’s model card template helps document essential facts and considerations of the deployed model.

Model Step 2 – Model Card

6. Monitor model metrics

Using vetiver's monitoring functions, we compute model performance metrics and write them to a pin on RStudio Connect. Tracking these metrics over time lets us watch for model performance degradation. Using vetiver to monitor model performance again ensures a consistent approach to model governance across teams.
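The sketch below shows the kind of calculation this monitoring involves, written in plain Python rather than with vetiver. The daily grouping, the choice of RMSE and MAE, the 3.0 alert threshold, and the sample records are assumptions for illustration only.

```python
import math
from collections import defaultdict


def compute_metrics_by_day(records):
    """Aggregate prediction errors into one RMSE/MAE row per calendar day."""
    errors_by_day = defaultdict(list)
    for record in records:
        errors_by_day[record["date"]].append(record["truth"] - record["estimate"])

    metrics = []
    for day in sorted(errors_by_day):
        errors = errors_by_day[day]
        n = len(errors)
        metrics.append({
            "date": day,
            "n": n,
            "rmse": math.sqrt(sum(e * e for e in errors) / n),
            "mae": sum(abs(e) for e in errors) / n,
        })
    return metrics


# Hypothetical observed vs. predicted bike counts.
records = [
    {"date": "2022-09-10", "truth": 10, "estimate": 8},
    {"date": "2022-09-10", "truth": 5, "estimate": 5},
    {"date": "2022-09-11", "truth": 4, "estimate": 8},
    {"date": "2022-09-11", "truth": 6, "estimate": 2},
]

metrics = compute_metrics_by_day(records)

# Flag days whose RMSE exceeds an (assumed) alert threshold.
RMSE_THRESHOLD = 3.0
degraded_days = [m["date"] for m in metrics if m["rmse"] > RMSE_THRESHOLD]
```

Appending each new batch of rows to a stored table of metrics gives a running history that can be plotted or checked against a threshold like the one above.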

Model Step 3 – Model Metrics

7. Deploy a Shiny app that displays real-time predictions

We use the API endpoint to serve predictions to a Shiny app interactively. Clicking on a station shows us a line graph of the predicted number of bikes over time.
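The app itself is written in Shiny, but any client that can send JSON over HTTP can use the same endpoint. Here is a minimal Python sketch of building such a request. The URL, the input fields, the API key header, and the response shape (a JSON array of objects with a ".pred" field) are assumptions for illustration; check the deployed API's own documentation for the real contract.

```python
import json
from urllib.request import Request

# Hypothetical endpoint URL; the real one depends on your Connect server.
PREDICT_URL = "https://connect.example.com/bike-predict/predict"


def build_predict_request(url, rows, api_key=None):
    """Build a JSON POST request carrying one record per prediction."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumed header format for key-protected content.
        headers["Authorization"] = f"Key {api_key}"
    body = json.dumps(rows).encode("utf-8")
    return Request(url, data=body, headers=headers, method="POST")


def parse_predictions(body):
    """Extract predicted values, assuming a JSON array of {'.pred': value}."""
    return [row[".pred"] for row in json.loads(body)]


# Hypothetical input features for one station.
rows = [{"station_id": "31000", "hour": 12, "weekday": 0}]
request = build_predict_request(PREDICT_URL, rows)

# urllib.request.urlopen(request) would send it; that call is left out here
# because the URL is a placeholder. The body below is a made-up response.
sample_body = '[{".pred": 6.4}]'
predictions = parse_predictions(sample_body)
```

Because the model sits behind an HTTP endpoint, the same request works from a dashboard, a scheduled report, or another service without loading the model into each one.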

Link to Shiny App

8. Create project dashboard

This project is composed of many different tasks. We wanted a single place to share the full context and content with others, so we built a dashboard with connectwidgets that links to every piece of the project. This makes it easy for anybody new to the Bike Share app to understand its purpose and the steps involved.

Link to Dashboard

Learn More

We hope that you enjoyed this example of using vetiver, pins, and RStudio Connect to create an end-to-end machine learning pipeline. Folks in machine-learning-heavy contexts can use vetiver to streamline their work and easily “production-ize” content.

- Review the Bike Share pipeline code on GitHub.
- Check out this project and other RStudio product workflows on solutions.rstudio.com.

Join Julia Silge and Isabel Zimmerman to learn more about MLOps with vetiver in Python and R at the RStudio Enterprise Meetup on September 20th!
