I had a bit of a play last night trying to hook a Jupyter notebook container up to an Apache Drill container using
docker-compose. The idea was to have a shared data volume between the two of them, but I couldn’t for the life of me get that to work using the docker-compose version 2 or 3 (services/volumes) syntax – for some reason, none of the Apache Drill containers I tried would fire up properly.
So I eventually (at 3am… :-( ) went for a simpler approach, syncing data through a local directory on the host.
The result is something that looks like this:
The Apache Drill container, and an Apache Zookeeper container to keep it in check, I found via Dockerhub. I also reused an official RStudio container. The Jupyter container is one I rolled for TM351.
The Jupyter and RStudio containers can both talk to the Apache Drill container, and both analysis apps have access to their own data folder, mounted from an application folder in the current directory on the host. The data folders mount into separate directories in the Apache Drill container, and both applications can query data files contained in either data directory via Apache Drill.
This is far from ideal, but it works. (The structure is as suggested so that RStudio and Jupyter scripts can both be used to download data into a data directory viewable from the Apache Drill container. Another approach would be to mount a separate ./data directory and provide some means of populating it with data files. Alternatively, if the files already exist on the host, mounting the host data directory onto a /data volume in the Apache Drill container would work too.)
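For that last option, a sketch of what the drill service mounts might look like (the ./data and /data paths and the read-only flag are my own suggestion, not part of the setup described here):

```yaml
drill:
  image: dialonce/drill
  volumes:
    # Mount an existing host data directory (read-only) into the
    # Drill container; the path names here are illustrative.
    - ./data:/data:ro
```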
Here’s the docker-compose.yaml file I’ve ended up with:
```yaml
drill:
  image: dialonce/drill
  ports:
    - 8047:8047
  links:
    - zookeeper
  volumes:
    - ./notebooks/data:/nbdata
    - ./R/data:/rdata

zookeeper:
  image: jplock/zookeeper

notebook:
  container_name: notebook-apache-drill-test
  image: psychemedia/ou-tm351-jupyter-custom-pystack-test
  ports:
    - 35200:8888
  volumes:
    - ./notebooks:/notebooks/
  links:
    - drill:drill

rstudio:
  container_name: rstudio-apache-drill-test
  image: rocker/tidyverse
  environment:
    - PASSWORD=letmein  # default user is: rstudio
  volumes:
    - ./R:/home/rstudio
  ports:
    - 8787:8787
  links:
    - drill:drill
```
If you have docker installed and running, running docker-compose up -d in the folder containing the docker-compose.yaml file will launch the linked containers: a Jupyter notebook server on localhost port 35200, RStudio on port 8787, and Apache Drill on port 8047 (along with the supporting Zookeeper container). If the ./notebooks/data and ./R/data subfolders don’t exist, they will be created.
We can use the clients to variously download data files and run Apache Drill queries against them. In Jupyter notebooks, I used the pydrill package to connect. Note that the hostname used is the linked container name (in this case, drill).
If we download data to the ./notebooks/data folder, which is mounted inside the Apache Drill container as /nbdata, we can query against it.
(Note – it probably would make sense to use a modified Apache Drill container configured to use CSV headers, as per Querying Large CSV Files With Apache Drill.)
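Until then, with Drill’s default CSV settings each row comes back as a single columns array, so fields have to be picked out by position. A small helper (my own convenience function, not part of pydrill) shows the shape of such a query:

```python
def headerless_csv_query(path, n_cols, limit=10):
    """Build a Drill SQL query for a CSV file read without headers.

    With Drill's default CSV settings each row arrives as a single
    `columns` array, so individual fields are selected by position.
    """
    fields = ", ".join(
        "columns[{}] AS col{}".format(i, i) for i in range(n_cols)
    )
    return "SELECT {} FROM dfs.`{}` LIMIT {}".format(fields, path, limit)

# Example (placeholder filename):
print(headerless_csv_query("/nbdata/example.csv", 2))
```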
We can also query against that same data file from the RStudio container. In this case I used the DrillR package. (I had hoped to use the sergeant package (“drill sergeant”, I assume?! Sigh.. ;-) but it uses the RJDBC package, which expects to find java installed, and java isn’t installed in the rocker/tidyverse container I used.) UPDATE: sergeant now works without the Java dependency… Thanks, Bob :-)
I’m not sure if DrillR is being actively developed, but it would be handy if it could return the data from the query as a dataframe.
So, getting up and running with Apache Drill and a data analysis environment is not that hard at all, if you have