Live data extraction with Cron and R


Learn how to schedule robust data extraction with Cron and Docker containers.

The schema of the public live data set on Google Big Query we are going to extract

This article was also published on https://www.r-bloggers.com/.

This is part two of a multi-article series on building and deploying an API robustly with R and Docker that lets you extract live data from Google Big Query. For part one, see Google Big Query with R. For a short introduction to Docker, see part two of the article series about building and deploying a dashboard, Deploying a Shiny Flexdashboard with Docker.

Scheduled data extractions via database queries or API calls are an important part of building data infrastructure. They let you copy a data source easily and keep a reasonably recent snapshot of it, with some lag. Another use case is pre-computing aggregated or filtered data from a constantly updated source to enhance the performance of a service. I will show you how to use a Unix cron job to schedule an extraction. More advanced methods to minimize stale data would be setting up a listener or a webhook.

We will extract data from the public Big Query Real-time Air Quality data set from openAQ. This is an open source project which provides real-time data (if you stretch the definition of "real time") from 5490 air quality measurement stations worldwide, but we will only extract measurements from Indian stations. The global air quality data set gets updated regularly; however, older entries are dropped, probably to save storage costs. To retain the older measurements, we will set up the data extraction via a cron job inside a Docker container. For a short introduction to Docker and why we use it, see this article. For an introduction to Google Big Query, how to get access to this public data set and how to query it with dplyr verbs, see part one of this series, Google Big Query with R.

R extraction script for scheduling

The following script will be used to extract the data whenever the data set has been updated. You can find the script at cron/src/get_data_big_query.R in the project GitHub repo.
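Since the script itself is embedded from the repo, here is only a condensed sketch of the idea, assuming the bigrquery and dplyr packages. The billing project, authentication path, output path and column names below are illustrative placeholders, not the exact code from the repo:

# Sketch of the scheduled extraction (see cron/src/get_data_big_query.R for the real script)
library(bigrquery)
library(dplyr)

# Authenticate with a service account key baked into the image (assumed path)
bq_auth(path = "/src/auth/service-account.json")

con <- DBI::dbConnect(
  bigrquery::bigquery(),
  project = "bigquery-public-data",
  dataset = "openaq",
  billing = "your-gcp-project-id"  # replace with your own billing project
)

# Pull the current Indian measurements from the public table
india_now <- tbl(con, "global_air_quality") %>%
  filter(country == "IN") %>%
  collect()

out_file <- "/src/shared-data/openaq_india.rds"

# Append only measurements that are not stored yet
if (file.exists(out_file)) {
  stored    <- readRDS(out_file)
  india_now <- anti_join(india_now, stored,
                         by = c("location", "pollutant", "timestamp"))
  india_now <- bind_rows(stored, india_now)
}

saveRDS(india_now, out_file)
message(Sys.time(), " - extraction finished, ", nrow(india_now), " rows stored")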

Scheduling the extraction with Cron

Cron is a scheduling program already included in most modern Unix-based distributions. So-called cron jobs are managed via crontab. You can list the cron jobs for the current user with crontab -l or edit them with crontab -e. The following syntax is used to define the execution interval through five time parameters:

* * * * * command to be executed
- - - - -
| | | | |
| | | | ----- Day of week (0 - 7) (Sunday=0 or 7)
| | | ------- Month (1 - 12)
| | --------- Day of month (1 - 31)
| ----------- Hour (0 - 23)
------------- Minute (0 - 59)

Alternatively, the interval can also be defined by one of these special strings:

string      meaning
------      -------
@reboot     Run once, at startup.
@yearly     Run once a year, "0 0 1 1 *".
@annually   (same as @yearly)
@monthly    Run once a month, "0 0 1 * *".
@weekly     Run once a week, "0 0 * * 0".
@daily      Run once a day, "0 0 * * *".
@midnight   (same as @daily)
@hourly     Run once an hour, "0 * * * *".

You can check how to set up specific intervals at https://crontab.guru/. Note that there are various cron monitoring tools worth a look, such as https://deadmanssnitch.com/ or https://cronitor.io/.

We will set up our cron job to run the R extraction script every 12 hours at the 11th minute. Running at an odd minute helps avoid conflicts with other processes scheduled at full hours or at five-minute intervals. It is easy to get the file paths wrong the first time, as cron jobs are executed in the home directory. Check that you have the right file paths to R, to the R script and, inside the R script, to its dependencies. In the cron job, >> /var/log/cron.log 2>&1 appends the script output to a log file and redirects standard error to standard output, so we have all the printed R output as well as the warnings and errors logged.
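The corresponding crontab entry could look roughly like the following; the paths to Rscript, to the script and to the log file are assumptions that depend on your base image and directory layout:

11 */12 * * * /usr/local/bin/Rscript /src/get_data_big_query.R >> /var/log/cron.log 2>&1

With this schedule the extraction fires at 00:11 and 12:11 every day.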

Building the Docker image

This assumes basic knowledge of Docker; if you need a primer, see Deploying a Shiny Flexdashboard with Docker. To run our scheduled extraction containerized, we build an image from the recipe in a Dockerfile. We will use the rocker/tidyverse image from Docker Hub as base image and add layers on top with the needed R libraries and system dependencies, copy the directory with the R script and cron job into the image, and finally have the CMD start cron and tail the log file, so the output gets shown in the Docker container logs:
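The Dockerfile itself is embedded from the repo; a minimal sketch along these lines, where the exact package list, file names and log path are assumptions, could look like this:

FROM rocker/tidyverse:latest

# Install cron as a system dependency
RUN apt-get update && apt-get install -y cron \
    && rm -rf /var/lib/apt/lists/*

# R packages needed by the extraction script
RUN R -e "install.packages('bigrquery', repos = 'https://cloud.r-project.org')"

# Copy the extraction script and the cron job definition into the image
COPY src/ /src/
COPY cronjob /cronjob

# Register the cron job and create the log file it writes to
RUN crontab /cronjob \
    && touch /var/log/cron.log

# Start cron and tail the log so the output appears in the container logs
CMD cron && tail -f /var/log/cron.log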

Then, in the directory of the Dockerfile, run docker build -t openaq_extraction . to build the image from the Dockerfile and tag it as openaq_extraction.

You can either export the image and deploy the container on a server or cloud service such as AWS, Google Cloud or DigitalOcean, or deploy it locally. Start the container via:

$ docker run -d \
  --restart=always \
  --name openaq_extraction_container \
  --mount type=bind,source=/filepath_to/openaq_extraction/shared-data,target=/src/shared-data \
  openaq_extraction

This runs the container in detached mode and restarts it automatically whenever it stops (note that the --rm flag cannot be combined with a restart policy, so the container's filesystem is only cleaned up when you remove the container yourself). Additionally, this bind-mounts the directory where the extracted data is saved to an existing directory on the host, so the extracted data is retained even if the container gets stopped.
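To check that the scheduled job is actually firing, you can follow the container logs, which show the tailed cron log file:

$ docker logs -f openaq_extraction_container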

Notice: Querying public data sets gets billed to your Google Cloud billing account; however, the first 1 TB of query processing per month is free. Still, remember to stop this Docker container if you do not need the data extraction anymore.

Now we have a robust, shareable and reproducible scheduled data extraction up and running. In the last part of the project we will build a REST API with R in a Docker container network to enable easy access to the now permanent records of Indian air quality being extracted. For this, see part three of this article series.


