Easy API building for Data Scientists with R

[This article was first published on Stories by Tim M. Schendzielorz on Medium, and kindly contributed to R-bloggers.]

Learn how to build a robust REST API with Docker and R Plumber.

The schema of the public live data set on Google Big Query for which we will build an API

This article was also published on https://www.r-bloggers.com/.

This is part three of a multi-article series on building and deploying a robust API with R and Docker that lets you extract live data from Google Big Query. For part one, see Google Big Query with R, and for part two, see Live data extraction with Cron and R. For a short introduction to Docker, see part two of the article series about building and deploying a Dashboard, Deploying a Shiny Flexdashboard with Docker.

APIs are a convenient way to access data across devices, independently of a programming language. REpresentational State Transfer (REST) APIs are the most common type of web API. REST is a software architecture paradigm which defines a set of uniform and stateless operations. The uniform operations make it simple to define an interface, and the statelessness makes it reliable, fast and easy to modify and scale. The commonly used exchange protocol is HTTP, with its operations GET, HEAD, POST, PUT, PATCH, DELETE, CONNECT, OPTIONS and TRACE sent to an IP address or its associated URL.

We will build a REST API in R with the Plumber package to enable easy access to data from the public Big Query Real-time Air Quality data set from openAQ. Because older entries of the data set are dropped when new data is added, we keep a permanent record of the air quality data through a scheduled extraction via Cron running in a Docker container. The R Plumber API will run in its own Docker container, and this container will run in a container network together with the data extraction container. This architecture has several advantages:

  • Portability of the whole service to other machines or cloud services
  • Clearly defined dependencies avoid breakage of functionality
  • Extracting and pre-aggregating the data enables fast API response times without the need to query the database for each request
  • Enhanced data security, as the API operations and the database access are in separate containers
  • Modularity enables easier debugging of the service and integration of additional parts

R Plumber script

Plumber allows you to decorate R code with special comments that define endpoints and their input parameters. You can then expose the decorated R functions at a defined IP address. Install Plumber from CRAN or GitHub and open a new Plumber script to see some examples. If you use RStudio, you can click the “Run API” button to test your endpoints locally with Swagger. By default, Plumber output is sent as JSON; however, you can use other serializers or create new ones to instruct Plumber to render the output in a different format. For more information, see the Plumber documentation.
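As a minimal illustration of this decoration syntax (similar to the echo example in Plumber's default template, and not part of the project code):

#* Echo back the input parameter
#* @param msg The message to echo back
#* @get /echo
function(msg = "") {
  list(message = paste("The message is:", msg))
}

Calling plumber::plumb("plumber.R")$run(port = 8000) on a file containing this function exposes it locally, e.g. at http://localhost:8000/echo?msg=hello.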

The following script instructs Plumber to expose several R functions that return data from the extracted air quality data, which is saved in the shared volume of the two Docker containers as /shared-data/airquality-india.RDS. The last function, at endpoint /plot, returns a test histogram as a PNG image, as specified by the serializer annotation #* @png. Note that instead of the else if statements you could parameterize the function to get more concise code. You can clone the whole project GitHub repo here.
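The full script is embedded in the original post; below is a condensed sketch of what such an API.R file could look like. The RDS path is the one given above, but the /measurements endpoint, its city parameter and the column names are assumptions for illustration:

# API.R -- condensed sketch, not the full project script

#* Return the extracted air quality measurements
#* @param city Optional city name to filter on (assumed column name)
#* @get /measurements
function(city = NULL) {
  data <- readRDS("/shared-data/airquality-india.RDS")
  if (!is.null(city)) {
    data <- data[data$city == city, ]
  }
  data  # serialized as JSON by default
}

#* Return a test histogram as a PNG image
#* @png
#* @get /plot
function() {
  data <- readRDS("/shared-data/airquality-india.RDS")
  hist(data$value, main = "Test histogram", xlab = "Measured value")
}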

Dockerfile for the API

We will create a Docker image from a recipe in a Dockerfile that adds layers of images on top of each other. We will use the rocker/tidyverse image from Dockerhub as the base image. This assumes basic knowledge of Docker; if needed, see Deploying a Shiny Flexdashboard with Docker. The final command layer CMD ["R", "-e", "pr <- plumber::plumb('/src/API.R'); pr$run(host='0.0.0.0', port=3838)"] will expose the functions in our script as endpoints on the container's localhost at container port 3838, which we make reachable via the layer EXPOSE 3838.
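A minimal Dockerfile along these lines could look as follows. Apart from the EXPOSE and CMD layers quoted above, the package installation step and the COPY path are assumptions and may differ from the project repo:

# base image with R and the tidyverse preinstalled
FROM rocker/tidyverse

# install the plumber package (assumed installation step)
RUN R -e "install.packages('plumber')"

# copy the API script into the image at the path used in the CMD layer
COPY API.R /src/API.R

# document the container port at which the API will listen
EXPOSE 3838

# start the Plumber API when the container launches
CMD ["R", "-e", "pr <- plumber::plumb('/src/API.R'); pr$run(host='0.0.0.0', port=3838)"]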

In the directory of the Dockerfile, run docker build -t openaq_api . to build the image from the Dockerfile and tag it as openaq_api. To test the dockerized API, run the Docker container with the following command, which binds host port 3838 to the exposed container port at which the API runs:

$ docker run -d \
  -p 3838:3838 \
  --rm \
  --name openaq_api_container \
   openaq_api

Then check if the test histogram gets returned from the API via curl in your console or via httr with R:

library(httr)
# send the HTTP GET request to the API
api_output <- GET("http://localhost:3838/plot", verbose()
   # no query parameter list is needed here, as the /plot endpoint takes none:
   # , query = list(param1 = ..., param2 = ...)
   )
# show the body of the response
content(api_output)

This should show you a test histogram in PNG format.
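If you prefer curl, an equivalent quick check from the console could look like this (the output filename is arbitrary); the returned image is written to a file you can open locally:

$ curl -o test-histogram.png http://localhost:3838/plot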

Creating the multi-container service

We define a service consisting of the API container and the data extraction container, with a shared volume between them, via docker-compose. Docker-compose is a tool you can install in addition to the Docker engine; it makes it easy to set up a multi-container service programmatically through definitions in a YAML file. We define the shared volume via the volumes: key and a shared network, which lets the containers reach each other's ports, via the networks: key (the network is not strictly necessary in this service and is shown for completeness). The containers are defined under services:, where the build: key specifies that the container images are rebuilt from the Dockerfiles in context:. The shared volume is mounted to a directory inside the containers in volumes:. The exposed port 3838 of the API container is bound to port 3838 of the host via ports:.
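Put together, a docker-compose.yml along these lines could look like the following sketch. The service name REST-API matches the scale command used further below, but the volume and network names and the build contexts are assumptions for illustration; the actual file is in the project repo:

version: "3"

services:
  REST-API:                        # the Plumber API container
    build:
      context: ./api               # assumed path to the API Dockerfile
    ports:
      - "3838:3838"                # bind host port 3838 to the exposed container port
    volumes:
      - shared-data:/shared-data   # mount the shared volume inside the container
    networks:
      - openaq-net

  data-extraction:                 # the scheduled Cron extraction container
    build:
      context: ./extraction        # assumed path to the extraction Dockerfile
    volumes:
      - shared-data:/shared-data
    networks:
      - openaq-net

volumes:
  shared-data:

networks:
  openaq-net: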

If you cloned the project GitHub repo, you can see the file structure with the docker-compose.yml file in the top directory. In the top directory, build and start the containers with the command

$ docker-compose up

To run in detached mode, add -d. To force the recreation of existing containers and/or force the images to rebuild, add --force-recreate --build. To stop all the started networks and containers specified in the YAML file, just run docker-compose down.

The extraction process should now be up and running, as seen in the Docker logs, because we tailed the logs of the scheduled Cron job. When the first extraction run has finished, you can use the Plumber API to receive the data in R:
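The original post embeds the corresponding code; a sketch with httr could look like the following. The /measurements endpoint and its city parameter are the illustrative names from the API sketch above, not necessarily those used in the project repo:

library(httr)
# request the extracted air quality data from the running API container
resp <- GET("http://localhost:3838/measurements",
            query = list(city = "Delhi"))  # hypothetical query parameter
# parse the JSON body returned by Plumber
air_quality <- content(resp, as = "parsed")
str(air_quality)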

Where to go from here: Concluding remarks and additional notes

That's it: in this three-part article series we built a robust service for extracting data from Google Big Query and made the data easily accessible through a REST API with Docker and R.

Originally, I mounted the Docker UNIX socket of the host Docker daemon as a volume in the docker-compose.yml for the API container (/var/run/docker.sock:/var/run/docker.sock) to be able to get the Docker logs from the host via an API call. However, I removed this part, as this practice is a huge security issue, especially if the containers are used in production. See https://raesene.github.io/blog/2016/03/06/The-Dangers-Of-Docker.sock/ for more information.

From here on, you could deploy this multi container service into production, for example to cloud services such as AWS, Google Cloud and DigitalOcean. It is useful to have a container orchestration tool deployed such as Docker Swarm or Kubernetes to manage your Docker containers and their shared resources.

In a production setting you might want to use a reverse proxy server such as Nginx to forward API requests from a URL to the exposed port of your API Docker container and to encrypt the traffic via HTTPS. Additionally, you might want to write unit tests for your API with the R package testthat, and load test your API under many concurrent requests with, e.g., the R package loadtest.
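A minimal sketch of such a unit test, assuming the API container is running locally on port 3838 and using the /plot endpoint from above, could look like this:

library(testthat)
library(httr)

test_that("the /plot endpoint returns a PNG image", {
  resp <- GET("http://localhost:3838/plot")
  # the request should succeed and return an image
  expect_equal(status_code(resp), 200)
  expect_equal(headers(resp)[["content-type"]], "image/png")
})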

Plumber handles API requests sequentially. If you experience a lot of API calls, one option would be to deploy several API containers and load balance the incoming traffic to them via Nginx (note that the fixed host port binding in ports: would then have to be removed or adjusted, since multiple replicas cannot bind the same host port). If you want to run four of the API containers, run docker-compose up with the scale parameter:

$ docker-compose up --scale REST-API=4

Another option is to build your API not with Plumber but with the R package RestRserve. It handles requests in parallel and might overall be the better solution if you need an industry-grade API; however, it is more complicated to define the endpoints. For additional tips on speeding up your API, see https://medium.com/@JB_Pleynet/how-to-do-an-efficient-r-api-81e168562731.

