Acquiring data for language research (1/3): direct downloads


There are three main ways to acquire corpus data using R that I will introduce you to: direct downloads, package interfaces, and web scraping. In this post we will start by directly downloading a corpus, as it is the most straightforward process for the novice R programmer and requires the fewest steps. Along the way I will introduce some key R coding concepts, including control statements and custom functions.

The following code is available on GitHub in the recipes-acquiring_data repository and builds on the recipes-project_template I have discussed in detail here and made accessible here. I encourage you to follow along by downloading the recipes-project_template with git from the Terminal, or by creating a new RStudio R Project and selecting the “Version Control” option.

Direct downloads

Published corpus data found in repositories or on individual sites is usually the easiest to start working with, as it is generally a matter of identifying a resource and then downloading it with R. OK, there’s a little more involved, but that’s the basic idea.

Let’s take a look at how this works, starting with a sample from the Switchboard Corpus, a corpus of 2,400 telephone conversations between 543 speakers. First we navigate to the site with a browser and download the file we are looking for. In this case I found the Switchboard Corpus sample on the NLTK data repository site. More often than not this file will be some type of compressed archive with an extension such as .zip or .tar.gz, which is the case here. Archive files make downloading multiple files easy by grouping files and directories into one file. In R we can use the download.file() function from the base R library1. A function may have a number of required and optional arguments. download.file() minimally requires two: url and destfile. That is, the address of the file to download and the location where it is to be saved on disk.

# Download .zip file and write to disk
download.file(url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/switchboard.zip", destfile = "data/original/switchboard.zip")

Once an archive file is downloaded, however, the file needs to be ‘decompressed’ to reveal the file structure. The file we downloaded is located on our disk at data/original/switchboard.zip. To decompress this file we use the unzip() function with the arguments zipfile pointing to the .zip file and exdir specifying the directory where we want the files to be extracted to.

I encourage you to use the TAB key to expand the list of arguments of a function; this saves you from having to remember them and helps avoid typos. After typing the name of the function and the opening (, hit TAB to view and select the argument(s) you want. The TAB key can also help you expand paths to files and directories. Note that path expansion defaults to the current working directory.

# Decompress .zip file and extract to our target directory
unzip(zipfile = "data/original/switchboard.zip", exdir = "data/original/")

The directory structure of data/ now should look like this:

data/
├── derived
└── original
    ├── switchboard
    │   ├── README
    │   ├── discourse
    │   ├── disfluency
    │   ├── tagged
    │   ├── timed-transcript
    │   └── transcript
    └── switchboard.zip

3 directories, 7 files

At this point we have acquired the data programmatically, and with this code as part of our workflow anyone could run it and reproduce the same results. The code as it stands, however, is not ideal. First, the switchboard.zip file is not strictly needed after we decompress it, and it occupies disk space if we keep it. Second, each time we run this code the file will be downloaded from the remote server, leading to unnecessary data transfer and server traffic. Let’s tackle each of these issues in turn.

To avoid writing the switchboard.zip file to disk (long-term), we can use the tempfile() function to open a temporary holding space for the file. We can store the download there, unzip it into our target directory, and let the temporary file be destroyed afterwards. We assign the temporary space to an R object named temp with the tempfile() function. This object can now be used as the value of the destfile argument in the download.file() function. Let’s also assign the web address to another object, url, which we will use as the value of the url argument.

# Create a temporary file space for our .zip file
temp <- tempfile()
# Assign our web address to `url`
url <- "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/switchboard.zip"
# Download .zip file and write to disk
download.file(url, temp)

In the previous code I’ve used the values stored in the objects url and temp in the download.file() function without specifying the argument names, only providing the names of the objects. R assumes that the values passed to a function map to the order of its arguments. If your values do not follow that order, you must specify each argument name along with its value. To view the order of the arguments, hit TAB after entering the function name, or consult the function documentation by prefixing the function name with ? and hitting ENTER.
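For example, the two calls below are equivalent; the first relies on the order of the arguments and the second names them explicitly, so the order no longer matters. This is shown for illustration only, there is no need to run the download again.

# Positional matching: values map to `url` and `destfile` in order
download.file(url, temp)
# Named matching: the order of the arguments no longer matters
download.file(destfile = temp, url = url)
# Consult the documentation for the full list of arguments
?download.file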

At this point our downloaded file is stored temporarily on disk and can be accessed and decompressed to our target directory using temp as the value for the argument zipfile from the unzip() function. I’ve assigned our target directory path to target_dir and used it as the value for the argument exdir to prepare us for the next tweak on our approach.

# Assign our target directory to `target_dir`
target_dir <- "data/original/"
# Decompress .zip file and extract to our target directory
unzip(zipfile = temp, exdir = target_dir)

Our directory structure now looks like this:

data/
├── derived
└── original
    └── switchboard
        ├── README
        ├── discourse
        ├── disfluency
        ├── tagged
        ├── timed-transcript
        └── transcript

3 directories, 6 files

The second issue I raised concerns the fact that running this code as part of our project will repeat the download each time. Since we would like to be good citizens and avoid unnecessary traffic on the web it would be nice if our code checked to see if we already have the data on disk and if it exists, then skip the download, if not then download it. To achieve this we need to introduce two new functions if() and dir.exists(). dir.exists() takes a path to a directory as an argument and returns the logical value, TRUE, if that directory exists, and FALSE if it does not. if() evaluates logical statements and processes subsequent code based on the logical value it is passed as an argument. Let’s look at a toy example.

num <- 1
if(num == 1) {
  cat(num, "is 1")
} else {
  cat(num, "is not 1")
}
## 1 is 1

I assigned num the value 1 and created a logical evaluation, num == 1, whose result is passed as the argument to if(). If the statement returns TRUE, then the code within the first set of curly braces {...} is run. If num == 1 is false, as in the code below, the code within the braces following the else is run.

num <- 2
if(num == 1) {
  cat(num, "is 1")
} else {
  cat(num, "is not 1")
}
## 2 is not 1

The function if() is one of several functions known as control statements. These functions provide a lot of power to make dynamic choices as code is run.
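The other function we need, dir.exists(), simply returns a logical value that if() can test. Here is a quick check, assuming the Switchboard sample has already been extracted to data/original/ as above:

# Returns TRUE because the corpus was extracted here earlier
dir.exists("data/original/switchboard")
## [1] TRUE
# Returns FALSE for a directory that does not exist
dir.exists("data/original/not-here")
## [1] FALSE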

Before we get back to our key objective, avoiding downloading resources we already have on disk, let me introduce another strategy for making code more powerful, more efficient, and more legible: the custom function. Custom functions are functions the user writes to bundle a set of procedures that can be run in similar contexts. I’ve created a custom function named eval_num() below.

eval_num <- function(num) {
  if(num == 1) {
    cat(num, "is 1")
  } else {
    cat(num, "is not 1")
  }
}

Let’s take a closer look at what’s going on here. The function function() creates a function in which the user decides what arguments are necessary for the code to perform its task. In this case the only necessary argument is the object that holds the numeric value to be evaluated. I’ve called it num because it reflects the name of the object in our toy example, but there is nothing special about this name; it is only important that the object name be used consistently. I’ve included our previous code (except for the hard-coded assignment of num) inside the curly braces and assigned the entire code chunk to eval_num.

We can now use the function eval_num() to perform the task of evaluating whether a value of num is or is not equal to 1.

eval_num(num = 1)
## 1 is 1
eval_num(num = 2)
## 2 is not 1
eval_num(num = 3)
## 3 is not 1

I’ve put these coding strategies together with our previous code in a function I named get_zip_data(). There is a lot going on here. Take a look first and see if you can follow the logic involved given what you now know.

get_zip_data <- function(url, target_dir) {
  # Function: to download and decompress a .zip file to a target directory
  
  # Check to see if the data already exists
  if(!dir.exists(target_dir)) { # if data does not exist, download/ decompress
    cat("Creating target data directory \n") # print status message
    dir.create(path = target_dir, recursive = TRUE, showWarnings = FALSE) # create target data directory
    cat("Downloading data... \n") # print status message
    temp <- tempfile() # create a temporary space for the file to be written to
    download.file(url = url, destfile = temp) # download the data to the temp file
    unzip(zipfile = temp, exdir = target_dir, junkpaths = TRUE) # decompress the temp file in the target directory
    cat("Data downloaded! \n") # print status message
  } else { # if data exists, don't download it again
    cat("Data already exists \n") # print status message
  }
}

OK. You should have recognized the general steps in this function: the arguments url and target_dir specify where to get the data and where to write the decompressed files; the if() statement evaluates whether the data already exists; if not (!dir.exists(target_dir)), the data is downloaded and decompressed; if it does exist (else), it is not downloaded again.

The ! prefixed to the logical expression dir.exists(target_dir) returns the opposite logical value. This is needed here so that when the target directory already exists the expression returns FALSE, not TRUE, and the download does not proceed.
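A quick illustration of how ! flips a logical value:

!TRUE
## [1] FALSE
!FALSE
## [1] TRUE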

There are a couple of key tweaks I’ve added that provide some additional functionality. For one, I’ve included the function dir.create() to create the target directory where the data will be written. I’ve also added an additional argument to the unzip() function, junkpaths = TRUE. Together these additions allow the user to create an arbitrary directory path into which the files, and only the files, will be extracted. This discards the containing directory of the .zip file, which is helpful when we want to add multiple .zip files to the same target directory.

A practical scenario where this applies is when we want to download data from a corpus that is distributed in multiple .zip files but still maintain these files in a single primary data directory. Take, for example, the Santa Barbara Corpus. This corpus resource includes a series of interviews distributed as two .zip files: SBCorpus.zip, which contains the transcribed interviews, and metadata.zip, which contains the meta-data associated with each speaker. Applying our initial strategy to download and decompress the data leads to the following directory structure:

data
├── derived
└── original
    ├── SBCorpus
    │   ├── TRN
    │   └── __MACOSX
    │       └── TRN
    └── metadata
        └── __MACOSX

8 directories

By applying our new custom function get_zip_data() to the transcriptions and then the meta-data we can better organize the data.

# Download corpus transcriptions
get_zip_data(url = "http://www.linguistics.ucsb.edu/sites/secure.lsit.ucsb.edu.ling.d7/files/sitefiles/research/SBC/SBCorpus.zip", target_dir = "data/original/sbc/transcriptions/")

# Download corpus meta-data
get_zip_data(url = "http://www.linguistics.ucsb.edu/sites/secure.lsit.ucsb.edu.ling.d7/files/sitefiles/research/SBC/metadata.zip", target_dir = "data/original/sbc/meta-data/")

Now our data/ directory is better organized; both the transcriptions and the meta-data are housed under data/original/sbc/.

data
├── derived
└── original
    └── sbc
        ├── meta-data
        └── transcriptions

5 directories

If we add data from other sources we can keep them logically separate and allow our data collection to scale without creating unnecessary complexity. Let’s add the Switchboard Corpus sample using our get_zip_data() function to see this in action.

# Download corpus
get_zip_data(url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/switchboard.zip", target_dir = "data/original/scs/")
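If we run this same call a second time, the if() check finds the scs/ directory already on disk and the download is skipped:

# Running the same call again does not repeat the download
get_zip_data(url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/switchboard.zip", target_dir = "data/original/scs/")
## Data already exists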

Our corpora are now housed in their own directories, and the files are clearly associated with their source.

data
├── derived
└── original
    ├── sbc
    │   ├── meta-data
    │   └── transcriptions
    └── scs

6 directories

At this point we have what we need to continue to the next step in our data analysis project. But before we go, we should do some housekeeping to document and organize this process to make our work reproducible. We will take advantage of the project-template directory structure, seen below.

.
├── README.md
├── _pipeline.R
├── code
│   ├── acquire_data.R
│   ├── analyze_data.R
│   ├── curate_data.R
│   ├── generate_reports.R
│   └── transform_data.R
├── data
│   ├── derived
│   └── original
├── figures
├── functions
├── log
├── recipes-acquire-data.Rproj
└── report
    ├── article.Rmd
    ├── bibliography.bib
    ├── slides.Rmd
    └── web.Rmd

8 directories, 13 files

First, it is good practice to separate custom functions from our processing scripts. We can create a file in our functions/ directory named acquire_functions.R and add our custom function get_zip_data() there. We then use the source() function to read that file into our current script, making the function available as needed. Source your functions in the SETUP section of your script.

# Load custom functions for this project
source(file = "functions/acquire_functions.R")

Second, it is advisable to log the structure of the data in plain-text files. You can create a directory tree (like those seen in this post) with the tree utility on the command line, assuming it is installed on your system. R provides the system() function, which passes commands to the command line. Adding the following code to the LOG section of your acquire_data.R script will write the directory structure of each of the corpora downloaded in this post to the files data_original_sbc.log and data_original_scs.log.

# Log the directory structure of the Santa Barbara Corpus
system(command = "tree data/original/sbc >> log/data_original_sbc.log")
# Log the directory structure of the Switchboard Corpus sample
system(command = "tree data/original/scs >> log/data_original_scs.log")

Our project directory structure now looks like this:

.
├── README.md
├── _pipeline.R
├── code
│   ├── acquire_data.R
│   ├── analyze_data.R
│   ├── curate_data.R
│   ├── generate_reports.R
│   └── transform_data.R
├── data
│   ├── derived
│   └── original
├── figures
├── functions
│   └── acquire_functions.R
├── log
│   ├── data_original_sbc.log
│   └── data_original_scs.log
├── recipes-acquire-data.Rproj
└── report
    ├── article.Rmd
    ├── bibliography.bib
    ├── slides.Rmd
    └── web.Rmd

8 directories, 15 files

Round up

In this post we’ve covered how to access, download, and organize data contained in .zip files, the most common format for language data found on repositories and individual sites. This included an introduction to a few key R programming concepts and strategies, including using functions, writing custom functions, and controlling program flow with control statements. Our approach was to gather data while keeping the reproducibility of the code in mind. To this end I introduced programming strategies for avoiding unnecessary web traffic (downloads), creating directories in a scalable way, and documenting the data.

In the next post in this three-part mini-series I will cover acquiring data from web services such as Project Gutenberg, Twitter, and Facebook through R packages. Using package interfaces will require additional knowledge of R objects, so I will discuss vector types and data frames and show how to manipulate these objects in practical situations, such as filtering data and writing data to disk as plain-text files.


  1. Remember, base R packages are installed with R and are loaded and accessible by default in each R session.
