Acquiring data for language research (2/3): package interfaces


Package interfaces

A convenient alternative method for acquiring data in R is through package interfaces to web services. These interfaces are built with R code that connects to resources on the web through Application Programming Interfaces (APIs). Websites such as Project Gutenberg, Twitter, Facebook, and many others provide APIs to allow access to their data under certain conditions, some more limiting for data collection than others. Programmers (like you!) in the R community take up the task of wrapping calls to an API in R code, making that data accessible from R. For example, gutenbergr provides access to Project Gutenberg, rtweet to Twitter, and Rfacebook to Facebook.

Using R package interfaces, however, often requires more knowledge about R objects and functions. Let’s take a look at how to access data from Project Gutenberg through the gutenbergr package. Along the way we will touch on various functions and concepts that are key to working with two core R data structures, vectors and data frames, including filtering rows and writing tabular data to disk in plain-text format.

The following code is available on GitHub (recipes-acquiring_data) and builds on the recipes-project_template I have discussed in detail here and made accessible here. I encourage you to follow along by downloading the recipes-project_template with git from the Terminal, or by creating a new RStudio R Project and selecting the “Version Control” option.

To get started, let’s install and load the package. The simplest method for downloading an R package in RStudio is to select the ‘Packages’ tab in the Files pane and click the ‘Install’ icon. To ensure that our code is reproducible, however, it is better to approach the installation of packages programmatically: if a package is not part of the base R library, we should not assume that the user already has it on their system. The code to install and load the gutenbergr package is:

install.packages("gutenbergr") # install `gutenbergr` package
library(gutenbergr) # load the `gutenbergr` package

This approach works just fine, but as luck would have it there is an R package for installing and loading packages! The pacman package includes a set of functions for managing packages. A very useful one is p_load(), which looks for a package on the system, loads it if found, and installs and then loads it if not found. This potentially avoids spending unnecessary bandwidth installing packages that already exist on a user’s system. To use pacman, however, we still need to install and load pacman itself with the functions install.packages() and library(). I’ve included some code below that mimics the behavior of p_load() for installing pacman. As you can see it is not elegant, but luckily it is only needed once, as we add it to the SETUP section of our master file, _pipeline.R.

# Load `pacman`. If not installed, install then load.
if (!require("pacman", character.only = TRUE)) {
  install.packages("pacman")
  library("pacman", character.only = TRUE)
}

Now that we have pacman installed and loaded into our R session, let’s use the p_load() function to install and load the two packages we will need for the upcoming tasks. If you are following along with the recipes-project_template, add this code to the SETUP section of the acquire_data.R file.

# Script-specific options or packages
pacman::p_load(tidyverse, gutenbergr)

Note that the arguments tidyverse and gutenbergr are comma-separated but not quoted.
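If you prefer quoted names (for instance, when the list of packages is stored in a character vector), p_load() also accepts them through its char argument. A minimal sketch of the equivalent call:

pacman::p_load(char = c("tidyverse", "gutenbergr")) # same packages, passed as a character vector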

Project Gutenberg provides access to thousands of texts in the public domain. The gutenbergr package contains a set of tables, or data frames in R speak, that index the meta-data for these texts, broken down by text (gutenberg_metadata), author (gutenberg_authors), and subject (gutenberg_subjects). I’ll use the glimpse() function loaded with the tidyverse package1 to summarize the structure of these data frames.

glimpse(gutenberg_metadata) # summarize text meta-data
## Observations: 51,997
## Variables: 8
## $ gutenberg_id        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ title               <chr> NA, "The Declaration of Independence of th...
## $ author              <chr> NA, "Jefferson, Thomas", "United States", ...
## $ gutenberg_author_id <int> NA, 1638, 1, 1666, 3, 1, 4, NA, 3, 3, NA, ...
## $ language            <chr> "en", "en", "en", "en", "en", "en", "en", ...
## $ gutenberg_bookshelf <chr> NA, "United States Law/American Revolution...
## $ rights              <chr> "Public domain in the USA.", "Public domai...
## $ has_text            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
glimpse(gutenberg_authors) # summarize authors meta-data
## Observations: 16,236
## Variables: 7
## $ gutenberg_author_id <int> 1, 3, 4, 5, 7, 8, 9, 10, 12, 14, 16, 17, 1...
## $ author              <chr> "United States", "Lincoln, Abraham", "Henr...
## $ alias               <chr> NA, NA, NA, NA, "Dodgson, Charles Lutwidge...
## $ birthdate           <int> NA, 1809, 1736, NA, 1832, NA, 1819, 1860, ...
## $ deathdate           <int> NA, 1865, 1799, NA, 1898, NA, 1891, 1937, ...
## $ wikipedia           <chr> NA, "http://en.wikipedia.org/wiki/Abraham_...
## $ aliases             <chr> NA, "United States President (1861-1865)/L...
glimpse(gutenberg_subjects) # summarize subjects meta-data
## Observations: 140,173
## Variables: 3
## $ gutenberg_id <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5...
## $ subject_type <chr> "lcc", "lcsh", "lcsh", "lcc", "lcc", "lcsh", "lcs...
## $ subject      <chr> "E201", "United States. Declaration of Independen...

The gutenberg_metadata, gutenberg_authors, and gutenberg_subjects data frames are periodically updated. To check when each data frame was last updated, run:

attr(gutenberg_metadata, "date_updated")
## [1] "2016-05-05"

To download the text itself we use the gutenberg_download() function, which takes one required argument, gutenberg_id. The gutenberg_download() function is what is known as ‘vectorized’; that is, it can take a single value or multiple values for the argument gutenberg_id. Vectorization refers to the process of applying a function to each of the elements stored in a vector, a primary object type in R. A vector is a grouping of values of one of various types, including character (chr), integer (int), and logical (lgl), and a data frame is a grouping of vectors. The gutenberg_download() function takes an integer vector, which can be entered manually or selected from the gutenberg_metadata or gutenberg_subjects data frames using the $ operator (e.g. gutenberg_metadata$gutenberg_id).
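To make these types concrete, here is a small illustrative example (the values are invented, not Project Gutenberg data) that builds one vector of each type, groups them into a data frame, and extracts a column with the $ operator:

work_ids <- c(11L, 12L, 13L)           # integer (int) vector
titles <- c("Alice", "Glass", "Snark") # character (chr) vector
in_print <- c(TRUE, FALSE, TRUE)       # logical (lgl) vector
works <- data.frame(work_ids, titles, in_print) # a data frame groups vectors
works$work_ids # `$` extracts a single column, here an integer vector
## [1] 11 12 13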

Let’s first add them manually here as a toy example by generating a vector of integers from 1 to 5 assigned to the variable name ids.

ids <- 1:5 # integer vector of values 1 to 5
ids
## [1] 1 2 3 4 5

To download the works from Project Gutenberg corresponding to the gutenberg_ids 1 to 5, we pass the ids object to the gutenberg_download() function.

works_sample <- gutenberg_download(gutenberg_id = ids) # download works with `gutenberg_id` 1-5
glimpse(works_sample) # summarize `works` dataset
## Observations: 2,939
## Variables: 2
## $ gutenberg_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ text         <chr> "December, 1971  [Etext #1]", "", "", "The Projec...

Two columns are returned: gutenberg_id and text. The text column contains one value per line of text (as delimited by line breaks) for each of the 5 works we downloaded. Many more attributes are available from the Project Gutenberg API and can be accessed by passing a character vector of attribute names to the argument meta_fields. The column names of the gutenberg_metadata data frame contain the available attributes.

names(gutenberg_metadata) # print the column names of the `gutenberg_metadata` data frame
## [1] "gutenberg_id"        "title"               "author"             
## [4] "gutenberg_author_id" "language"            "gutenberg_bookshelf"
## [7] "rights"              "has_text"

Let’s augment our previous download with the title and author of each of the works. To create a character vector we use the c() function, quoting each individual element and separating the elements with commas.

# download works with `gutenberg_id` 1-5 including `title` and `author` as attributes
works_sample <- gutenberg_download(gutenberg_id = ids, 
                            meta_fields = c("title",
                                            "author"))
glimpse(works_sample) # summarize dataset
## Observations: 2,939
## Variables: 4
## $ gutenberg_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ text         <chr> "December, 1971  [Etext #1]", "", "", "The Projec...
## $ title        <chr> "The Declaration of Independence of the United St...
## $ author       <chr> "Jefferson, Thomas", "Jefferson, Thomas", "Jeffer...

Now, in a more practical scenario, we would like to select the values of gutenberg_id by some principled query, such as works from a specific author, language, or subject. To do this we first query either the gutenberg_metadata data frame or the gutenberg_subjects data frame. Let’s say we want to download a random sample of 10 works from English Literature (Library of Congress Classification, “PR”). Using the filter() function (part of the tidyverse package set) we first extract all the Gutenberg ids from gutenberg_subjects where subject_type == "lcc" and subject == "PR", assigning the result to ids.2

ids <- 
  filter(gutenberg_subjects, subject_type == "lcc", subject == "PR")
glimpse(ids)
## Observations: 7,100
## Variables: 3
## $ gutenberg_id <int> 11, 12, 13, 16, 20, 26, 27, 35, 36, 42, 43, 46, 5...
## $ subject_type <chr> "lcc", "lcc", "lcc", "lcc", "lcc", "lcc", "lcc", ...
## $ subject      <chr> "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "...

The operators = and == are not equivalent: == is used for logical evaluation, while = is an alternate notation for variable assignment (<-).
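A quick illustration of the difference:

x = 5  # assignment: the same effect as x <- 5
x == 5 # logical evaluation: is `x` equal to 5?
## [1] TRUE
x == 6
## [1] FALSE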

The gutenberg_subjects data frame does not contain information as to whether a gutenberg_id is associated with a plain-text version. To limit our query to only those English Literature works that have text, we filter the gutenberg_metadata data frame by the ids we selected in ids and by its has_text attribute.

ids_has_text <- 
  filter(gutenberg_metadata, gutenberg_id %in% ids$gutenberg_id, has_text == TRUE)
glimpse(ids_has_text)
## Observations: 6,724
## Variables: 8
## $ gutenberg_id        <int> 11, 12, 13, 16, 20, 26, 27, 35, 36, 42, 43...
## $ title               <chr> "Alice's Adventures in Wonderland", "Throu...
## $ author              <chr> "Carroll, Lewis", "Carroll, Lewis", "Carro...
## $ gutenberg_author_id <int> 7, 7, 7, 10, 17, 17, 23, 30, 30, 35, 35, 3...
## $ language            <chr> "en", "en", "en", "en", "en", "en", "en", ...
## $ gutenberg_bookshelf <chr> "Children's Literature", "Children's Liter...
## $ rights              <chr> "Public domain in the USA.", "Public domai...
## $ has_text            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...

A couple of R programming notes on the code phrase gutenberg_id %in% ids$gutenberg_id. First, the $ symbol in ids$gutenberg_id is the programmatic way to target a particular column in an R data frame. In this example we select the ids data frame and the column gutenberg_id, which is an integer vector. The gutenberg_id variable that precedes the %in% operator does not need an explicit reference to a data frame because the primary argument of the filter() function is that data frame (gutenberg_metadata). Second, the %in% operator logically evaluates whether each element of gutenberg_metadata$gutenberg_id is also found in the vector ids$gutenberg_id, returning TRUE or FALSE accordingly. This effectively filters out those ids which are not found in both vectors.
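The behavior of %in% is easy to see with a small standalone example:

c(11, 12, 99) %in% c(11, 12, 13, 16) # is each left-hand element found on the right?
## [1]  TRUE  TRUE FALSE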

As we can see, the number of works with text (6,724) is fewer than the number of works listed (7,100). Now we can safely do our random selection of 10 works with the function sample_n(), confident that the ids we select will contain text when we take the next step of downloading the data.

set.seed(123) # make the sampling reproducible
ids_sample <- sample_n(ids_has_text, 10) # sample 10 works
glimpse(ids_sample) # summarize the dataset
## Observations: 10
## Variables: 8
## $ gutenberg_id        <int> 7688, 33533, 12160, 37761, 40406, 1050, 18...
## $ title               <chr> "Lucretia — Volume 04", "The Convict's Far...
## $ author              <chr> "Lytton, Edward Bulwer Lytton, Baron", "Pa...
## $ gutenberg_author_id <int> 761, 35765, 1865, 1256, 25821, 467, 1062, ...
## $ language            <chr> "en", "en", "en", "en", "en", "en", "en", ...
## $ gutenberg_bookshelf <chr> NA, NA, NA, NA, NA, "One Act Plays", NA, N...
## $ rights              <chr> "Public domain in the USA.", "Public domai...
## $ has_text            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...

As before, we can now pass our ids (ids_sample$gutenberg_id) as the argument of gutenberg_download().

works_pr <- gutenberg_download(gutenberg_id = ids_sample$gutenberg_id, meta_fields = c("author", "title"))
glimpse(works_pr) # summarize the dataset
## Observations: 79,200
## Variables: 4
## $ gutenberg_id <int> 1050, 1050, 1050, 1050, 1050, 1050, 1050, 1050, 1...
## $ text         <chr> "THE DARK LADY OF THE SONNETS", "", "By Bernard S...
## $ author       <chr> "Shaw, Bernard", "Shaw, Bernard", "Shaw, Bernard"...
## $ title        <chr> "The Dark Lady of the Sonnets", "The Dark Lady of...
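As a quick sanity check that all 10 works came through, we can tally the number of text lines per work with dplyr’s count() function (loaded with the tidyverse); it returns a data frame with one row per title and a column n holding the line counts.

count(works_pr, title) # number of text lines per downloaded work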

At this point we have data and could move on to processing this data in preparation for analysis. However, we are aiming for a reproducible workflow and this code does not conform to our principle of modularity: each subsequent step in our analysis will depend on running this code first. Furthermore, running this code as it is creates issues with bandwidth, as in our previous examples from direct downloads. To address modularity we will write the data to disk in plain-text format. In this way each subsequent step in our analysis can access the data locally. To address bandwidth concerns, we will devise a method for checking to see if the data is already downloaded and skip the download, if possible, to avoid accessing the Project Gutenberg server unnecessarily.

To write our data frame to disk we will export it into a standard plain-text format for two-dimensional data: a CSV file (comma-separated values). The CSV structure for this data will look like this:

## gutenberg_id,text,author,title
## 1050,THE DARK LADY OF THE SONNETS,"Shaw, Bernard",The Dark Lady of the Sonnets
## 1050,,"Shaw, Bernard",The Dark Lady of the Sonnets
## 1050,By Bernard Shaw,"Shaw, Bernard",The Dark Lady of the Sonnets
## 1050,,"Shaw, Bernard",The Dark Lady of the Sonnets
## 1050,,"Shaw, Bernard",The Dark Lady of the Sonnets
## 1050,,"Shaw, Bernard",The Dark Lady of the Sonnets

The first line contains the names of the columns and subsequent lines contain the observations. Data points that contain commas themselves (e.g. “Shaw, Bernard”) are quoted to avoid misinterpreting these commas as delimiters in our data. To write this data to disk we will use the write_csv() function.

write_csv(works_pr, path = "data/original/gutenberg_works_pr.csv")
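Any later script can now read this dataset back from disk with read_csv() (from readr, loaded with the tidyverse) rather than contacting the Project Gutenberg server again:

works_pr <- read_csv(file = "data/original/gutenberg_works_pr.csv") # read the dataset back from disk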

To avoid downloading data that already resides on disk, let’s implement a similar strategy to the one used in the previous post for direct downloads. I’ve wrapped the code for sampling and downloading data for a particular subject from Project Gutenberg, together with a control statement that checks whether the data file already exists, into a function named get_gutenberg_subject(). Take a look at this function below.

get_gutenberg_subject <- function(subject, target_file, sample_size = 10) {
  # Function: to download texts from Project Gutenberg with 
  # a specific LCC subject and write the data to disk.
  
  # Check to see if the data already exists
  if (!file.exists(target_file)) { # if data does not exist, download and write
    target_dir <- dirname(target_file) # generate target directory for the .csv file
    dir.create(path = target_dir, recursive = TRUE, showWarnings = FALSE) # create target data directory
    cat("Downloading data... \n") # print status message
    lcc_subject <- subject # copy the argument so it is not masked by the `subject` column below
    # Select all records with a particular LCC subject
    ids <- 
      filter(gutenberg_subjects, 
             subject_type == "lcc", subject == lcc_subject) # select subject
    # Select only those records with plain text available
    set.seed(123) # make the sampling reproducible
    ids_sample <- 
      filter(gutenberg_metadata, 
             gutenberg_id %in% ids$gutenberg_id, # select ids in both data frames 
             has_text == TRUE) %>% # select those ids that have text
      sample_n(sample_size) # sample N works (default N = 10)
    # Download sample with associated `author` and `title` metadata
    works_sample <- 
      gutenberg_download(gutenberg_id = ids_sample$gutenberg_id, 
                         meta_fields = c("author", "title"))
    # Write the dataset to disk in .csv format
    write_csv(works_sample, path = target_file)
    cat("Data downloaded! \n") # print status message
  } else { # if data exists, don't download it again
    cat("Data already exists \n") # print status message
  }
}

Adding this function to our function script functions/acquire_functions.R, we can now use it in our code/acquire_data.R script to download multiple subjects and store each one on disk in its own file.

Let’s now download a second subject, LCC code “PQ” (French, Italian, Spanish, and Portuguese literature).

# Download Project Gutenberg text for subject 'PQ' (French, Italian, Spanish,
# and Portuguese literature) and then write this dataset to disk in .csv format
get_gutenberg_subject(subject = "PQ", target_file = "data/original/gutenberg/works_pq.csv")

Applying this function to both subjects (“PR” and “PQ”), our data directory structure now looks like this:

data
├── derived
└── original
    ├── gutenberg
    │   ├── works_pq.csv
    │   └── works_pr.csv
    ├── sbc
    │   ├── meta-data
    │   └── transcriptions
    └── scs
        ├── README
        ├── discourse
        ├── disfluency
        ├── tagged
        ├── timed-transcript
        └── transcript

7 directories, 8 files

As in the previous post, it is a good idea to log the results of our work.

# Log the directory structure of the Project Gutenberg data
system(command = "tree data/original/gutenberg >> log/data_original_gutenberg.log")

Round up

In this post I provided an overview of acquiring data from web service APIs through R package interfaces. We took a closer look at the gutenbergr package, which provides programmatic access to works available on Project Gutenberg. Working with package interfaces requires more knowledge of R, including installing and loading packages, working with vectors and data frames, and exporting data from an R session. We touched on these programming concepts and also outlined a method for creating a reproducible workflow.

In our last step in this mini-series on acquiring data for language research with R, we will explore methods for acquiring language data from the browsable web. I will discuss using the rvest package for downloading and isolating text elements from HTML pages and show how to organize and write the data to disk.

References

Robinson, David. 2018. Gutenbergr: Download and Process Public Domain Works from Project Gutenberg. https://CRAN.R-project.org/package=gutenbergr.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.


  1. tidyverse is not a typical package. It is a set of packages: ggplot2, dplyr, tidyr, readr, purrr, and tibble. These packages are all installed and loaded with tidyverse and form the backbone of the work you will typically do in most analyses.

  2. See Library of Congress Classification documentation for a complete list of subject codes.
