Learn how to schedule robust data extraction with Cron and Docker containers.
This article was also published on https://www.r-bloggers.com/.
This is part two of a multi-article series on building and deploying an API robustly with R and Docker that lets you extract live data from Google Big Query. For part one, see Google Big Query with R. For a short introduction to Docker, see part two of the article series about building and deploying a dashboard, Deploying a Shiny Flexdashboard with Docker.
Scheduled data extractions via database queries or API calls are an important part of building data infrastructure. They let you maintain a copy of a data source that holds the most recent data, with some lag. Another use case is pre-computing aggregated or filtered data from a constantly updated source to improve the performance of a service. I will show you how to use a Unix cron job to schedule an extraction. More advanced ways to minimize stale data are setting up a listener or a webhook.
We will extract data from the public Big Query Real-time Air Quality data set from openAQ. This open-source project provides real-time data (if you stretch the definition of “real time”) from 5490 air quality measurement stations worldwide, but we will only extract measurements from Indian stations. The global air quality data set is updated regularly, but older entries are dropped, probably to save storage costs. To retain the older measurements, we will set up the data extraction as a cron job inside a Docker container. For a short introduction to Docker and why we use it, see this article. For an introduction to Google Big Query, how to get access to this public data set and how to query it with dplyr verbs, see part one of this series, Google Big Query with R.
R extraction script for scheduling
The extraction script pulls new data only when the data set has been updated. You can find it at cron/src/get_data_big_query.R in the project GitHub repo.
Scheduling the extraction with Cron
Cron is a scheduler that ships with most modern Unix-based distributions. Cron jobs are managed via crontab. You can list the current user's cron jobs with crontab -l or edit them with crontab -e. The following syntax defines the execution interval through five time parameters:
* * * * * command to be executed
- - - - -
| | | | |
| | | | ----- Day of week (0 - 7) (Sunday=0 or 7)
| | | ------- Month (1 - 12)
| | --------- Day of month (1 - 31)
| ----------- Hour (0 - 23)
------------- Minute (0 - 59)
Alternatively, the schedule can be defined with one of these special strings:
@reboot Run once, at startup.
@yearly Run once a year, "0 0 1 1 *".
@annually (same as @yearly)
@monthly Run once a month, "0 0 1 * *".
@weekly Run once a week, "0 0 * * 0".
@daily Run once a day, "0 0 * * *".
@midnight (same as @daily)
@hourly Run once an hour, "0 * * * *".
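To make the relation between the two notations concrete, here is a minimal sketch that maps each special string listed above to its five-field equivalent. The function name expand_cron_alias is hypothetical and only serves the illustration; cron itself understands the special strings directly, so you never need such a helper in a real crontab.

```shell
# Map a cron special string to its five-field equivalent.
# The mappings are taken from the list above. @reboot has no
# time-based equivalent, so it is passed through unchanged,
# as is any regular five-field expression.
expand_cron_alias() {
  case "$1" in
    @yearly|@annually) printf '%s\n' '0 0 1 1 *' ;;
    @monthly)          printf '%s\n' '0 0 1 * *' ;;
    @weekly)           printf '%s\n' '0 0 * * 0' ;;
    @daily|@midnight)  printf '%s\n' '0 0 * * *' ;;
    @hourly)           printf '%s\n' '0 * * * *' ;;
    *)                 printf '%s\n' "$1" ;;
  esac
}

expand_cron_alias @daily    # prints: 0 0 * * *
expand_cron_alias @weekly   # prints: 0 0 * * 0
```

The quoting matters here: an unquoted * would be expanded by the shell to the file names in the current directory.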
We will set up our cron job for data extraction to run the R script every 12 hours at the 11th minute. Picking an odd minute like this avoids contention with the many processes that run on the full hour or at five-minute intervals. It is easy to get the file paths wrong the first time, as cron jobs are executed in the home directory. Check that you have the right file paths to R, to the R script and, inside the R script, to its dependencies. In the cron job, >> /var/log/cron.log 2>&1 appends the script output to a log file and redirects standard error to standard output, so all the printed R output as well as the warnings and errors are logged.
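As a rough sketch of the schedule and the logging described above: the comment at the top shows what such a crontab entry could look like, and the rest demonstrates what the redirection captures. The paths to Rscript and to the script inside the container are assumptions, and fake_job is a hypothetical stand-in for the R script so that the example runs without R.

```shell
# Hypothetical crontab entry (both paths are assumptions, adjust them):
#   11 */12 * * * /usr/local/bin/Rscript /src/get_data_big_query.R >> /var/log/cron.log 2>&1
# Minute 11 of every 12th hour, so the job runs at 00:11 and 12:11.

# Stand-in for the R script: one line on each output stream.
fake_job() {
  echo "message on standard output"
  echo "message on standard error" >&2
}

# Use a scratch log file for the demonstration.
LOG_FILE="${TMPDIR:-/tmp}/cron_demo.log"
: > "$LOG_FILE"                  # start from an empty log

fake_job >> "$LOG_FILE" 2>&1     # first run: both streams land in the log
fake_job >> "$LOG_FILE" 2>&1     # second run: >> appends, nothing is overwritten

cat "$LOG_FILE"
```

The order of the redirections matters: 2>&1 has to come after >> so that standard error follows standard output into the file.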
Building the Docker image
This assumes basic knowledge of Docker. If you are new to it, see Deploying a Shiny Flexdashboard with Docker. To run our scheduled extraction in a container, we build an image from the recipe in a Dockerfile. We will use the rocker/tidyverse image from Docker Hub as the base image and add layers on top of it with the needed R libraries and system dependencies. We then copy the directory with the R script and the cron job into the image. Finally, the CMD starts cron and tails the log file, so the output is shown in the Docker container logs.
Then, in the directory of the Dockerfile, run docker build -t openaq_extraction . to build the image from the Dockerfile and tag it as openaq_extraction.
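The steps above could be sketched roughly as follows. This is not the Dockerfile from the project repo, only an illustration under assumptions: that bigrquery is the R package that needs to be added, that the R script and a crontab file named cronjob live in a src directory next to the Dockerfile, and that the log file is /var/log/cron.log. The surrounding shell lines merely write the sketch into a scratch directory so it can be inspected.

```shell
# Write a Dockerfile sketch into a scratch directory.
BUILD_DIR="$(mktemp -d)"

cat > "$BUILD_DIR/Dockerfile" <<'EOF'
# Base image with R and the tidyverse preinstalled
FROM rocker/tidyverse

# System dependency: the cron daemon
RUN apt-get update \
 && apt-get install -y --no-install-recommends cron \
 && rm -rf /var/lib/apt/lists/*

# R dependency (assumption: bigrquery is the package needed)
RUN R -e "install.packages('bigrquery')"

# Copy the R script and the crontab file (assumed layout: ./src)
COPY src/ /src/

# Register the cron job and create the log file that CMD tails
RUN crontab /src/cronjob \
 && touch /var/log/cron.log

# Start cron, then follow the log so it shows up in `docker logs`
CMD cron && tail -f /var/log/cron.log
EOF

echo "Dockerfile written to $BUILD_DIR/Dockerfile"
```

Tailing the log file keeps a foreground process alive, which is what prevents the container from exiting right after cron has started in the background.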
Start a container from the image:
$ docker run -d \
  --name openaq_extraction_container \
  --mount type=bind,source=/filepath_to/openaq_extraction/shared-data,target=/src/shared-data \
  openaq_extraction
This runs the container in detached mode and bind-mounts the directory where the extracted data is saved to an existing source directory on the host, which you need in order to retain the extracted data when the container is stopped. To have Docker restart the container automatically, add --restart=always. To remove the container's filesystem at exit instead, add --rm. Docker rejects the two flags in combination, so choose one.
Note: Queries against open data sets are billed to your Google Cloud billing account. The BigQuery free tier covers the first 1 TB of query data processed each month. Still, remember to stop this Docker container with docker stop openaq_extraction_container when you no longer need the data extraction.
Now we have a robust, shareable and reproducible scheduled data extraction up and running. In the last part of the project, we will build a REST API with R in a Docker container network to enable easy access to the now permanent records of Indian air quality that are being extracted. For this, see part three of this article series.