As with most data scientists, there comes a time in your life when you have enough false confidence to try building production systems using Docker on AWS. This past weekend, I fell into this trap. Having recently attended an R-Ladies event in Cape Town that dealt with Docker for Data Science, I achieved a +5 docker confidence bonus, and you know what they say about having a new hammer… you look for anything that looks like a nail!
— Hanjo Odendaal (@UbuntR314) April 17, 2018
So, with new confidence in my DevOps ability, I finally found the perfect use case for a dockerised production system using only containers: building a reliable scraper on an AWS cloud server. The setup was perfect:
- Code already written and unit tested
- Quick setup needed, with data collection the priority
- Built around tidyverse
- Uses Selenium, which also comes in a container (this was the gremlin, though I did not know it when I started)
At this stage, I was still excited. If I was to describe the system design I had in mind, it would be:
I have regularly used docker to run small scripts on AWS using the rocker/tidyverse image. This worked well when I just needed to experiment quickly or run a small script that takes time (i.e. I don't want to keep my computer on for the next 4 days while my analysis runs). Another use case for AWS scripts has been whenever I have had to run something using RSelenium. They recently dockerized the Chrome drivers, and thank goodness for this. The result is much more stable execution of scripts.
By now, you can imagine what happened next… why not combine RStudio + RSelenium, both in containers? A warning: if you ever try this, don't try it at home without a SysAdmin nearby. The premise is simple: run 2 docker run commands and boom, Data Science. Well, that wasn't the case. For the life of me, getting the two containers to speak to each other was not so simple.
First off, when you publish a port with docker run, be aware that by default it listens on 0.0.0.0, so anyone can connect!! Secondly, the same applies when you start up your headless Selenium driver. So remember to bind the published port to localhost as a start, otherwise anyone will be able to use your cloud server 😉
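To see which interface a published port actually ended up on, the output of `docker port` can be scanned for wildcard addresses. This is a minimal sketch, assuming a POSIX shell; `flag_public_ports` is a made-up helper name, not part of Docker, and the container name in the usage comment is an assumption.

```shell
#!/bin/sh
# flag_public_ports: hypothetical helper, not part of Docker.
# Reads lines in the format printed by `docker port <container>`, e.g.
#   4444/tcp -> 127.0.0.1:4445
# and marks every mapping that is bound to all interfaces.
flag_public_ports() {
  while IFS= read -r line; do
    case "$line" in
      *"-> 0.0.0.0:"* | *"-> :::"* | *"-> [::]:"*)
        printf 'PUBLIC %s\n' "$line" ;;
      *"-> "*)
        printf 'local  %s\n' "$line" ;;
    esac
  done
}

# On a host with Docker running (container name is an assumption):
#   docker port chrome | flag_public_ports
```

Anything reported as PUBLIC is reachable from outside the server unless a firewall or security group blocks it.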
Next we kick off the RStudio container. Don't use the default user settings!! Big security risk: the user is rstudio and, you guessed it, the password is rstudio.
docker run --name rstudio -d -p 7800:8787 -e USER=hanjo -e PASSWORD=testtest -v /home/hanjo/rstudio:/home/hanjo/ rocker/tidyverse
Breaking this down:
--name: I always name my containers, because it's easier to work with them
-d: Run the container detached, in the background
-p 7800:8787: I change the exposed external port to help prevent people trying to hack defaults
-e USER=* -e PASSWORD=*: Change the default login
-v: Mount a volume so the container can see a directory on my host. I do this because my scraper will be downloading information
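Rather than typing a password such as testtest into the command, one option is to generate it. This is a minimal sketch, assuming a Linux host with `/dev/urandom`; `gen_password` is a hypothetical helper, and the `docker run` line in the comment simply reuses the flags from the command above.

```shell
#!/bin/sh
# gen_password: hypothetical helper. Reads 16 random bytes from
# /dev/urandom and prints them as 32 lowercase hex characters.
gen_password() {
  od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
  echo
}

# Usage on the Docker host (user name and paths are the ones from the post):
#   RSTUDIO_PASSWORD=$(gen_password)
#   docker run --name rstudio -d -p 7800:8787 \
#     -e USER=hanjo -e PASSWORD="$RSTUDIO_PASSWORD" \
#     -v /home/hanjo/rstudio:/home/hanjo/ rocker/tidyverse
#   echo "$RSTUDIO_PASSWORD"   # note it down; this sketch does not save it
```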
Next, I start my Selenium container:
docker run --name chrome -d -p 127.0.0.1:4445:4444 selenium/standalone-chrome
So now the big kicker: making these two talk! I started by linking the containers the old-school way:
docker run --name chrome1 -d -p 127.0.0.1:4445:4444 selenium/standalone-chrome
docker run -it --link chrome:chrome1 --name rstudio -d -p 8787:8787 -e USER=hanjo -e PASSWORD=testtest -v /home/hanjo/rstudio:/home/hanjo/ rocker/tidyverse
# Lets look inside container
docker exec -it rstudio bash
cat /etc/hosts
>220.127.116.11 chrome1 782f3286c03d chrome
>18.104.22.168 0bc6806d8b8e
OK, so this took a while to figure out. As you can see, this workflow is tedious, and frustrating when you realise the link between the containers did not work.
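A quicker way to tell whether the link registered is to search the container's hosts file for the alias instead of reading it by eye. This is a minimal sketch; `has_host` is a hypothetical helper, and the container and alias names in the usage comment are the ones from the commands above.

```shell
#!/bin/sh
# has_host: hypothetical helper. Reads hosts-file content on stdin and
# succeeds (exit 0) if NAME appears as a host name or alias on any line
# that does not start with a comment marker.
has_host() {
  awk -v name="$1" '
    /^[[:space:]]*#/ { next }    # skip comment lines
    { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
    END { exit found ? 0 : 1 }
  '
}

# On the Docker host (names as used in the post):
#   docker exec rstudio cat /etc/hosts | has_host chrome1 \
#     && echo "link alias present" || echo "link alias missing"
```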
Once you have all of this working, you also realise that the port for connecting to Selenium from inside the RStudio container is not the external port published on the host but the container's internal one, i.e. 4444, not 4445.
library(RSelenium)

eCaps <- list(
  chromeOptions = list(
    prefs = list(
      "profile.default_content_settings.popups" = 0L,
      "download.prompt_for_download" = FALSE,
      # Set the download file path here... in this case this is where the VM downloads the files.
      # Need to map this volume to the host machine through docker
      "download.default_directory" = "/home/seluser",
      "marionette" = FALSE
    )
  )
)

# docker run --name chrome -d -p 127.0.0.1:4445:4444 selenium/standalone-chrome
remDr <- remoteDriver(
  remoteServerAddr = "22.214.171.124",
  port = 4444L,              # the container's internal port, not the published 4445
  extraCapabilities = eCaps,
  browserName = "chrome"
)
remDr$open()
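The two vantage points can be made explicit when testing the connection from a shell. This is a minimal sketch with several assumptions: the port mapping `127.0.0.1:4445:4444` from above, a Selenium container that resolves under the name `chrome` from inside the RStudio container, and an image that serves the `/wd/hub/status` endpoint. `selenium_status_url` is a hypothetical helper.

```shell
#!/bin/sh
# selenium_status_url: hypothetical helper that prints the status URL to
# probe, depending on where the client runs.
#   host       -> published port on the Docker host (4445)
#   container  -> internal port of the Selenium container (4444)
selenium_status_url() {
  case "$1" in
    host)
      printf 'http://127.0.0.1:4445/wd/hub/status\n' ;;
    container)
      # Second argument: name the Selenium container resolves under
      # (assumed to be "chrome" if omitted).
      printf 'http://%s:4444/wd/hub/status\n' "${2:-chrome}" ;;
    *)
      echo "usage: selenium_status_url host|container [name]" >&2
      return 2 ;;
  esac
}

# curl -s "$(selenium_status_url host)"        # run on the Docker host
# curl -s "$(selenium_status_url container)"   # run inside the rstudio container
```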
I got it to work this way, and it is good to know if you are eager to understand how the underlying environments operate, but what I should have done from the start is use Docker Compose, which describes both containers in a single YAML file:
version: '2'
services:
  ropensci:
    image: rocker/ropensci
    ports:
      - "8788:8787"
    links:
      - selenium:selenium
  selenium:
    image: selenium/standalone-firefox:2.53.0
    ports:
      - "4445:4444"
If you go this route, it is less obvious what is really happening, but at least compose handles all the heavy lifting and headaches for you.
I found this beautiful code in the Issues section of RSelenium on GitHub.
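One thing compose does not do is wait for Selenium to finish starting before a script in the RStudio container tries to connect. This is a minimal sketch of a retry loop, assuming a POSIX shell; `retry` is a hypothetical helper, and the URL in the usage comment assumes the service name `selenium` from the file above and the `/wd/hub/status` endpoint.

```shell
#!/bin/sh
# retry: hypothetical helper. Runs a command up to TRIES times, sleeping
# DELAY seconds between failed attempts. Returns 0 on the first success
# and 1 if every attempt failed.
#   retry TRIES DELAY command [args...]
retry() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Inside the RStudio container, before starting the scraper:
#   retry 30 1 curl -sf http://selenium:4444/wd/hub/status >/dev/null \
#     || echo "Selenium did not come up" >&2
```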
I have learned a lot about designing dockerized environments for data science. Lesson 1: knowing a small bit of code is very different from applying it in practice to a complex problem. Lesson 2: docker has its place, but I think I would have been better off installing RStudio on the localhost and using RSelenium for stability.
If you are keen to learn more about the tidyverse, web collection and the deployment of scrapers on cloud servers, we are hosting a workshop in Cape Town where we go in-depth into online data collection practices, ethics and cloud deployment. Looking forward to seeing you there!