Sharing Data With the pins Package

[This article was first published on RStudio | Open source & professional software for data science teams on RStudio, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Photo by Universal Eye on Unsplash

Teams often need access to key data to do their work, but have you ever opened your coworker’s script to see:

dat <-  
 read_csv("C://Users/someone_else/data/dataset.csv")

more_dat <- 
 read_csv("S://Path_to_mapped_drive_that_you_dont_have/dataset.csv")

Yikes! How will you get these files? Let’s hope you can reach your coworker before they’ve logged off for the day.

How can your code be reproducible if you have to manually change the file paths? Shudder.

What if you need to make edits to the data, will you have to keep copying CSVs and emailing files forever? Double shudder.

What if your coworker accidentally forwards your email to someone who is not supposed to have access? Oh no.

We can struggle to share data assets easily and safely, relying on emailed files to keep our analyses up to date. This makes it difficult to keep current or know what version of the data we’re using. If you’ve ever experienced any of the scenarios above, consider pins as a solution that can help you share your data assets.

What is a pin, anyway?

Pins, from the R package of the same name, are a versatile way to publish R objects on a virtual corkboard so you can share them across projects and people.

Good pins are data or assets that are a few hundred megabytes or smaller. You can pin just about any object: data, models, JSON files, feather files from the Arrow package, and more. One of the most frequent use cases is pinning small data sets — often ephemeral data or reference tables that don’t quite merit being in a database, but seemingly don’t have a good home elsewhere (until now).

Pins get published to a board, which can be an RStudio Connect server, an AWS S3 bucket or Azure Blob Storage, a shared drive like Dropbox or Sharepoint, or a variety of other options. Try it out for yourself — read in this data set we’ve pinned for you on RStudio Connect!

# Install the latest pins from CRAN
install.packages("pins")

library(pins)

# Identify the board
board <-
  board_url(c("penguins" = "https://colorado.rstudio.com/rsc/example_pin/"))

# Read the shared data
board %>%
  pin_read("penguins")

In short, if you’ve ever wondered where to put an R object that you or your colleague will need to use again, you might just want to pin it.

Pins for Sharing Across Projects and Teams

One of the greatest strengths of pins is how your pin becomes accessible directly from your R scripts and the R scripts of anyone else to whom you’ve given access. Different projects can include code that reads the same pin without creating more copies of the data:

Three projects using the same pin to download data

It’s easier (and safer) to share a pin across multiple projects or people than to email files around. Pins respect the access controls of the board. Say you’ve pinned to RStudio Connect: you can control who gets to use the pin, just like any other piece of content.

Pins for Updating and Versioning

You may be wondering why use pins if you already have a shared drive with your teammates. But what happens if you need to replace the dataset with a new one? Do you email everybody to let them know? Is it dataFINALv2.csv? Or dataFINALfinal.csv?

The pins package retrieves the newest version of the pin by default. That means pin users never have to worry about getting a stale version of the pin. If you need to update your pin regularly, a scheduled R Markdown on RStudio Connect can handle this task for you, so your pin stays fresh.

But you’re not locked into losing old versions of a pin. You can version pins so that writing to an existing pin adds a new copy rather than replacing the existing data.

Here’s what versioning looks like using a temporary board:

library(pins)


board2 <- board_temp(versioned = TRUE)

board2 %>% pin_write(1:5, name = "x", type = "rds")
#> Creating new version '20210304T050607Z-ab444'
#> Writing to pin 'x'

board2 %>% pin_write(2:6, name = "x", type = "rds")
#> Creating new version '20210304T050607Z-a077a'
#> Writing to pin 'x'

board2 %>% pin_write(3:7, name = "x", type = "rds")
#> Creating new version '20210304T050607Z-0a284'
#> Writing to pin 'x'

# see all versions
board2 %>% pin_versions("x")
#> # A tibble: 3 × 3
#>   version                created             hash 
#>   <chr>                  <dttm>              <chr>
#> 1 20210304T050607Z-0a284 2021-03-04 05:06:00 0a284
#> 2 20210304T050607Z-a077a 2021-03-04 05:06:00 a077a
#> 3 20210304T050607Z-ab444 2021-03-04 05:06:00 ab444

Learn More

With pins, you and your teammates can know where your important data assets are, how to access them, and whether they are the correct version. You can work with confidence knowing you’re using the right asset, your work is reproducible, and you’re following good practices for data management.

There’s more to explore with pins. We’re excited to share how you can adopt them into your workflow.

Learn more about how and when to use pins:

See pins in action:

To leave a comment for the author, please follow the link and comment on their blog: RStudio | Open source & professional software for data science teams on RStudio.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)