pins 0.3: Azure, GCloud and S3

[This article was first published on pins, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A new version of pins is available on CRAN! pins 0.3 comes with many improvements and the following major features:

  • Support for new cloud boards to pin resources in Azure, GCloud and S3 storage.
  • Retrieve pin information with pin_info() including properties particular to each board.

You can install this new version from CRAN as follows:

install.packages("pins")

In addition, there is a new Use Cases section in our docs, various improvements (see NEWS) and two community extensions being developed to support databases and Nextcloud as boards.

Cloud Boards

pins 0.3 adds support to find, retrieve and store resources in various cloud providers like: Microsoft Azure, Google Cloud and Amazon Web Services.

To illustrate how they work, lets first try to find the World Bank indicators dataset in Kaggle:

library(pins)
pin_find("indicators", board = "kaggle")
# A tibble: 6 x 4
  name                                            description                             type  board 
  <chr>                                           <chr>                                   <chr> <chr> 
1 worldbank/world-development-indicators          World Development Indicators            files kaggle
2 theworldbank/world-development-indicators       World Development Indicators            files kaggle
3 cdc/chronic-disease                             Chronic Disease Indicators              files kaggle
4 bigquery/worldbank-wdi                          World Development Indicators (WDI) Data files kaggle
5 rajanand/key-indicators-of-annual-health-survey Health Analytics                        files kaggle
6 loveall/human-happiness-indicators              Human Happiness Indicators              files kaggle

Which we can then easily download with pin_get(), beware this is a 2GB download:

pin_get("worldbank/world-development-indicators")
[1] "/.../worldbank/world-development-indicators/Country.csv"     
[2] "/.../worldbank/world-development-indicators/CountryNotes.csv"
[3] "/.../worldbank/world-development-indicators/database.sqlite" 
[4] "/.../worldbank/world-development-indicators/Footnotes.csv"   
[5] "/.../worldbank/world-development-indicators/hashes.txt"      
[6] "/.../worldbank/world-development-indicators/Indicators.csv"  
[7] "/.../worldbank/world-development-indicators/Series.csv"      
[8] "/.../worldbank/world-development-indicators/SeriesNotes.csv" 

The Indicators.csv file contains all the indicators, so let’s load it with readr:

indicators <- pin_get("worldbank/world-development-indicators")[6] %>%
  readr::read_csv()

Analysing this dataset would be quite interesting; however, this post focuses on how to share this in S3, Google Cloud or Azure storage. More specifically, we will learn how to publish to an S3 board. To publish to other cloud providers, take a look at the Google Cloud and Azure boards articles.

As you would expect, the first step is to register the S3 board. When using RStudio, you can use the New Connection action to guide you through this process, or you can specify your key and secret as follows. Please refer to the S3 board article to understand how to store your credentials securely.

board_register_s3(name = "rpins",
                  bucket  = "rpins",
                  key = "VerySecretKey",
                  secret = "EvenMoreImportantSecret")

With the S3 board registered, we can now pin the indicators dataset with pin():

pin(indicators, name = "worldbank/indicators", board = "rpins")

That’s about it! We can now find and retrieve this dataset from S3 using pin_find(), pin_get() or view the uploaded resources in the S3 management console:

To make this even easier for others to consume, we can make this S3 bucket public; which means you can now connect to this board without even having to configure S3, making it possible to retrieve this dataset with just one line of R code!

pins::pin_get("worldbank/indicators", "https://rpins.s3.amazonaws.com")
# A tibble: 5,656,458 x 6
   CountryName CountryCode IndicatorName                          IndicatorCode    Year      Value
   <chr>       <chr>       <chr>                                  <chr>           <dbl>      <dbl>
 1 Arab World  ARB         Adolescent fertility rate (births per… SP.ADO.TFRT      1960    1.34e+2
 2 Arab World  ARB         Age dependency ratio (% of working-ag… SP.POP.DPND      1960    8.78e+1
 3 Arab World  ARB         Age dependency ratio, old (% of worki… SP.POP.DPND.OL   1960    6.63e+0
 4 Arab World  ARB         Age dependency ratio, young (% of wor… SP.POP.DPND.YG   1960    8.10e+1
 5 Arab World  ARB         Arms exports (SIPRI trend indicator v… MS.MIL.XPRT.KD   1960    3.00e+6
 6 Arab World  ARB         Arms imports (SIPRI trend indicator v… MS.MIL.MPRT.KD   1960    5.38e+8
 7 Arab World  ARB         Birth rate, crude (per 1,000 people)   SP.DYN.CBRT.IN   1960    4.77e+1
 8 Arab World  ARB         CO2 emissions (kt)                     EN.ATM.CO2E.KT   1960    5.96e+4
 9 Arab World  ARB         CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC   1960    6.44e-1
10 Arab World  ARB         CO2 emissions from gaseous fuel consu… EN.ATM.CO2E.GF…  1960    5.04e+0
# … with 5,656,448 more rows

This works since pins 0.3 automatically register URLs as a website board to save you from having to explicitly call board_register_datatxt().

It’s also worth mentioning that pins stores the dataset using an R native format, which requires only 72MB and loads much faster than the original 2GB dataset.

Pin Information

Boards like Kaggle and RStudio Connect, store additional information for each pin which you can now easily retrieve with pin_info().

For instance, we can retrieve additional properties from the indicators pin from Kaggle as follows,

pin_info("worldbank/world-development-indicators", board = "kaggle")
# Source: kaggle<worldbank/world-development-indicators> [files]
# Description: World Development Indicators
# Properties:
#   - id: 23
#   - subtitle: Explore country development indicators from around the world
#   - tags: (ref) business, economics, international relations, business finance...
#   - creatorName: Megan Risdal
#   - creatorUrl: mrisdal
#   - totalBytes: 387054886
#   - url: https://www.kaggle.com/worldbank/world-development-indicators
#   - lastUpdated: 2017-05-01T17:50:44.863Z
#   - downloadCount: 42961
#   - isPrivate: FALSE
#   - isReviewed: TRUE
#   - isFeatured: FALSE
#   - licenseName: World Bank Dataset Terms of Use
#   - ownerName: World Bank
#   - ownerRef: worldbank
#   - kernelCount: 422
#   - topicCount: 7
#   - viewCount: 254379
#   - voteCount: 1121
#   - currentVersionNumber: 2
#   - usabilityRating: 0.7647
#   - extension: zip

And from RStudio Connect boards as well,

pin_info("worldnews", board = "rsconnect")
# Source: rsconnect<jluraschi/worldnews> [table]
# Properties:
#   - id: 6446
#   - guid: 1b9f04c5-ddd4-43ca-8352-98f6f01a7034
#   - access_type: all
#   - url: https://beta.rstudioconnect.com/content/6446/
#   - vanity_url: FALSE
#   - bundle_id: 16216
#   - app_mode: 4
#   - content_category: pin
#   - has_parameters: FALSE
#   - created_time: 2019-09-30T18:20:21.911777Z
#   - last_deployed_time: 2019-11-18T16:00:16.919478Z
#   - build_status: 2
#   - run_as_current_user: FALSE
#   - owner_first_name: Javier
#   - owner_last_name: Luraschi
#   - owner_username: jluraschi
#   - owner_guid: ac498f34-174c-408f-8089-a9f10c630a37
#   - owner_locked: FALSE
#   - is_scheduled: FALSE
#   - rows: 44
#   - cols: 1

To retrieve all the extended information when discovering pins, pass extended = TRUE to pin_find().

Thank you for reading this post!

Please refer to rstudio.github.io/pins for detailed documentation, GitHub to file issues or feature requests and Gitter to chat with us about anything else.

To leave a comment for the author, please follow the link and comment on their blog: pins.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)