RStudio + Selenium + AWS: Deep in Docker Hell


As with most data scientists, there comes a time in your life when you have enough false confidence to try building production systems using Docker on AWS. This past weekend, I fell into this trap. Having recently attended an R-Ladies event in Cape Town on Docker for Data Science, I gained a +5 Docker confidence bonus – and you know what they say about having a new hammer… everything starts to look like a nail!

So, with this new confidence in my DevOps ability, I finally found the perfect use case for a dockerised production system built entirely from containers: a reliable scraper on an AWS cloud server. The setup was perfect:

  • Already written and unit-tested code
  • Quick setup needed, with data collection the priority
  • Built around the tidyverse
  • Uses Selenium, which also comes in a container (this, unbeknownst to me, was the gremlin before I even started)

At this stage, I was still excited. If I were to describe the system design I had in mind, it would be: an RStudio container and a Selenium container running side by side on the same AWS instance, talking to each other.

I have regularly used Docker to run small scripts on AWS using the rocker/tidyverse image. This has worked well when I just needed to experiment quickly, or needed to run a small script that takes a long time (i.e. I don't want to keep my computer on for the next 4 days while my analysis runs). Another use case for AWS scripts has been whenever I have had to run something using RSelenium. The Chrome driver is now also available as a Docker image – and thank goodness for this: the result is much more stable execution of scripts using RSelenium.
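As a minimal sketch of that first use case (the mounted path and script name here are just placeholders), a long-running script can be launched in a detached rocker/tidyverse container:

# Mount the project directory and run the script detached, so the
# analysis keeps going after you log out of the server
docker run -d --name analysis \
  -v /home/hanjo/project:/home/rstudio/project \
  rocker/tidyverse Rscript /home/rstudio/project/analysis.R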

By now, you can imagine what happened next… why not combine RStudio and RSelenium, both in Docker containers? A word of warning if you ever try this: don't try it at home without a SysAdmin nearby. The premise is simple – run two docker run commands and boom, Data Science. Well, that wasn't the case. For the life of me, getting the two containers to speak to each other was not so simple.

First off, when you run the docker command, be aware that the instance listens on 0.0.0.0, so anyone can connect!! Secondly, the same applies when you start up your headless Selenium driver. So, as a start, remember to bind the containers to localhost – otherwise anyone will be able to use your cloud server 😉
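As an illustration of the binding itself (using the RStudio port from later in this post), prefixing the host port with 127.0.0.1 keeps the service off the public interface, so it is only reachable from the server itself, e.g. through an SSH tunnel:

# Publish the port on the loopback interface only: 127.0.0.1:<host port>:<container port>
docker run -d -e PASSWORD=testtest -p 127.0.0.1:7800:8787 rocker/tidyverse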

Next, we are going to kick off the RStudio container. Don't use the default user settings!! Big security risk: the user is rstudio and, you guessed it, the password is rstudio.

docker run --name rstudio -d -p 7800:8787 -e USER=hanjo -e PASSWORD=testtest -v /home/hanjo/rstudio:/home/hanjo/ rocker/tidyverse

Breaking this down:

  • --name: I always name my containers, because it is easier to work with them by name
  • -d -p 7800:8787: I change the exposed external port to help prevent people probing the defaults
  • -e USER=* -e PASSWORD=*: Change the default login
  • -v: Mount a volume so the container can see the host filesystem. I do this because my scraper will be downloading information
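With that running, a quick sanity check (container name and port as in the command above) before pointing a browser at http://your-server:7800:

# Confirm the container is up and inspect its startup logs
docker ps --filter name=rstudio
docker logs rstudio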

Next, I start my Selenium container:

docker run --name chrome -d -p 127.0.0.1:4445:4444 selenium/standalone-chrome

So now the big kicker: making these two talk! I started by linking the containers the old-school way:

docker run --name chrome -d -p 127.0.0.1:4445:4444 selenium/standalone-chrome

docker run -it --link chrome:chrome1 --name rstudio -d -p 8787:8787 -e USER=hanjo -e PASSWORD=testtest -v /home/hanjo/rstudio:/home/hanjo/ rocker/tidyverse

# Let's look inside the RStudio container
docker exec -it rstudio bash
cat /etc/hosts

>199.11.0.2      chrome1 782f3286c03d chrome
>199.11.0.3      0bc6806d8b8e
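At this point it is worth checking, from inside the RStudio container, that the Selenium hub is actually reachable over the link (assuming curl is available in the image; the /wd/hub/status endpoint is served by the Selenium server):

# The link alias 'chrome1' resolves via /etc/hosts, and Selenium
# answers on its internal port 4444
curl http://chrome1:4444/wd/hub/status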

OK, so this took a while to figure out. As you can see, this workflow is very tedious, as well as frustrating when you realise the link between the containers did not work.

Once you have all of this working, you also realise that the port for connecting to Selenium from the linked container is not the external port of the container, but the internal one – i.e. 4444, not 4445.

library(RSelenium)

eCaps <- list(
  chromeOptions =
    list(prefs = list(
      "profile.default_content_settings.popups" = 0L,
      "download.prompt_for_download" = FALSE,
      # Set the download file path here... in this case this is where the
      # container downloads the files - map this volume to the host machine through docker
      "download.default_directory" = "/home/seluser",
      "marionette" = FALSE
    )
    )
)

# docker run --name chrome -d -p 127.0.0.1:4445:4444 selenium/standalone-chrome

remDr <- remoteDriver(remoteServerAddr = "chrome1",  # link alias from /etc/hosts above
                      port = 4444L,                  # internal port, not the exposed 4445
                      extraCapabilities = eCaps,
                      browserName = "chrome")
remDr$open()
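Once remDr$open() succeeds, a quick smoke test (the URL is just an example) confirms the browser session is alive before pointing the scraper at the real target:

# Navigate to a page, read back the title, then clean up the session
remDr$navigate("https://www.r-project.org")
remDr$getTitle()
remDr$close()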

I got it to work this way, and it is good to know how the underlying environments operate if you are curious, but what I should have done from the start is use Docker Compose, which is driven by a YAML file:

version: '2'
services:
  ropensci: 
    image: rocker/ropensci
    ports:
      - "8788:8787"
    links:
      - selenium:selenium
  selenium:
    image: selenium/standalone-firefox:2.53.0
    ports:
      - "4445:4444"

If you go this route, it is more opaque as to what is really happening, but at least Compose handles all the heavy lifting and headaches for you.

I found this beautiful Compose configuration in the issues section of the RSelenium repository on GitHub.
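With that Compose file saved as docker-compose.yml, you bring both services up with docker-compose up -d, and from inside the ropensci container the Selenium service is reachable by its service name on the internal port – a sketch of the connection (note the Firefox image in the Compose file):

library(RSelenium)

remDr <- remoteDriver(remoteServerAddr = "selenium",  # Compose service name
                      port = 4444L,                   # internal port again
                      browserName = "firefox")        # the Compose file uses standalone-firefox
remDr$open()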

Conclusion

I have learned a lot about designing dockerised environments for data science. Lesson 1: practice is very different from knowing a small bit of code and wanting to apply it to a complex problem. Lesson 2: Docker has its place, but I think I would have been better off installing RStudio directly on the host and only dockerising Selenium for stability.

If you are keen to learn more about the tidyverse, online data collection, as well as deployment of scrapers on cloud servers – we are hosting a workshop in Cape Town where we go in-depth into online data collection practices, ethics and cloud deployment. Looking forward to seeing you there! Link: https://bit.ly/2IeJffs
