Dockerizing Data Science: Introduction
PreReqs: Docker, images, and containers
Dockerizing data science projects has become increasingly relevant, mainly because you can isolate each project without breaking anything else. It also makes most of your projects portable and shareable, without worrying about installing the right dependencies (you Python fans know all about that). One of the greatest challenges threatening data science projects is deployment. Docker makes it easy to deploy them as APIs (using plumber or Flask), as applications (Shiny), or as scheduled runs (cron or taskscheduleR). To put this on steroids, you can orchestrate these containers through Kubernetes (k8s) or Docker Swarm, which can do a lot more, such as maintaining the state of your containers, managing multiple containers, and load balancing. Most data science platforms (DSPs) are built on this architecture and have been leveraging it for a very long time. When these tools are out of reach for you or your enterprise (due to their exorbitant price), you can always leverage open source tools with a few extra steps and achieve the same result. So, here are some of the best Docker images out there for you to start exploring data in an isolated environment in minutes.
So here are the top 8 Docker images out there.
- Jupyter Notebook
Jupyter is one of the favorite tools of many data scientists and analysts today, mainly because of its notebook-style approach to data analysis. Jupyter also supports various kernels such as Python, R, Julia and many more, and hence it has gained a huge fan base. Most DSPs come with this notebook by default, which has added to its popularity. There are Docker images on Docker Hub for both the single-user notebook and JupyterHub. This makes it easy to pull an image, do all your analysis, commit it, and share it with anyone you want. Hence, it comes at the top of the list.
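As a minimal sketch of the pull-and-run workflow, here is how you might start the community `jupyter/datascience-notebook` image (one of the jupyter/docker-stacks images, which bundles Python, R, and Julia kernels); the port and mount path are the image's documented defaults:

```shell
# Pull the single-user notebook image from Docker Hub
docker pull jupyter/datascience-notebook

# Start the notebook server, publish its port, and mount the current
# directory into the default work folder so your analysis persists
docker run --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/datascience-notebook
```

The server prints a tokenized URL in the container logs; open it in a browser and you are analyzing data in an isolated environment within minutes.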
- Jupyter Lab
JupyterLab is an extension of Jupyter. It provides a more extensive interface and more options for your notebooks, such as placing notebooks side by side, viewing all kernels on the same page, browsing through folders, and more. If you have already experienced Jupyter, this is something you have to try. It is my preferred notebook, but the only reason it's not in first place is its lack of add-ons.
Note: there are a couple of dozen variants of the Jupyter and JupyterLab images with add-ons such as Spark, Hadoop, etc. I won't be covering them here.
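The same jupyter/docker-stacks images can serve JupyterLab rather than the classic notebook. A sketch, with the caveat that newer image tags serve Lab by default while older tags need the environment variable shown:

```shell
# Run the lightweight base image with the Lab interface enabled
# (JUPYTER_ENABLE_LAB is only needed on older jupyter/docker-stacks tags)
docker run --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes jupyter/base-notebook
```

Then open the printed URL and replace `/tree` with `/lab` in the address if the classic interface loads instead.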
- RStudio
R has been my go-to language for a very long time because of its extensive statistics libraries, an area where Python lags, and it is very easy to use. No frills at all. The standard R IDE, though, is really bare-bones and can sometimes frustrate you. That's where RStudio makes an entrance. It's probably the best R IDE out there, and thanks to Rocker, it's available on Docker as well. The only thing it lacks is Jupyter-style notebook writing. If RStudio can make that happen, I would never cheat on RStudio.
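A quick sketch using Rocker's `rocker/rstudio` image, which runs RStudio Server on port 8787 and requires a password to be set via an environment variable:

```shell
# Start RStudio Server in the background; the image requires PASSWORD
# (the value here is a placeholder -- choose your own)
docker run --rm -d -p 8787:8787 -e PASSWORD=changeme --name rstudio rocker/rstudio
```

Then browse to http://localhost:8787 and log in as user `rstudio` with the password you set.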
- Python Base
Once you have built all your models and it's time for deployment, you might not need an interactive IDE anymore, mainly because the IDE images consume a lot of space. In those instances, you can use the base Python images, install the required libraries, and deploy. In the end, you can have a containerized project under 500 MB running in an isolated environment.
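To make that concrete, here is a hypothetical sketch of packaging a project on a slim Python base; `app.py` and `requirements.txt` are placeholder names standing in for your own project files, not anything prescribed by Docker:

```shell
# Write a minimal Dockerfile for a Python project on a slim base image
cat > Dockerfile <<'EOF'
FROM python:3.11-slim
WORKDIR /app
# Install only the libraries the deployed project actually needs
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]
EOF

# Build and run the containerized project
docker build -t my-model-api .
docker run --rm -p 5000:5000 my-model-api
```

Using `python:3.11-slim` instead of a full IDE image is what keeps the final image small; `--no-cache-dir` shaves off a bit more by skipping pip's download cache.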
- R-base
Technically, Python base and R-base go hand in hand; they both deserve the same place. Similar to Python, an RStudio image consumes close to 1 GB. So, if you want to deploy containers, say for a plumber app, use R-base, which is much leaner.
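A sketch of the plumber case on the official `r-base` image; `plumber.R` is a placeholder for your own API definition file:

```shell
# Dockerfile for a plumber API on the lean official r-base image
cat > Dockerfile <<'EOF'
FROM r-base
RUN R -e "install.packages('plumber')"
COPY plumber.R /app/plumber.R
# Serve the API on all interfaces so the published port is reachable
CMD ["Rscript", "-e", "plumber::plumb('/app/plumber.R')$run(host='0.0.0.0', port=8000)"]
EOF

docker build -t my-plumber-api .
docker run --rm -p 8000:8000 my-plumber-api
```

Rocker also publishes versioned R images (e.g. `rocker/r-ver`) if you need reproducible R versions rather than whatever `r-base` currently ships.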
- Dataiku
If you are not a coder, Dataiku (a DSP) offers its platform as a Docker image. You can pull the image and get it up and running in 5 minutes. Dataiku is one of the best enterprise-ready DSPs out there. It supports both a coding and a clicking style of interface (Dataiku calls it "code or click"). Having gone through close to 30 or 40 DSPs, this is probably the only one to support both. They also have a free version with limited features that you can use for quick analysis. And, worth mentioning, they have AutoML built into the tool.
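A sketch of getting the free edition running; the port mapping below follows the `dataiku/dss` Docker Hub instructions, but check that page for the current port and tag before relying on it:

```shell
# Start Dataiku DSS (free edition image) in the background
docker run -d -p 10000:10000 --name dss dataiku/dss
```

The first start takes a few minutes while DSS initializes its data directory; after that, open the published port in a browser.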
- KNIME
KNIME is another open source DSP, available for both Windows and Linux. If you are not good at programming, this is probably the best tool out there for data science. Thanks to KNIME's contributors, a Docker image is available on Docker Hub as well.
- H2O Flow
H2O Flow is another open source tool, by h2o.ai. It is an interactive tool where you can load, clean, and process data, build models, and analyze results right in the browser. They also have their own version of AutoML, where you can build multiple models without needing to code explicitly. I usually use it to benchmark my ML models, and it has saved me a lot of time. H2O is also available in both R and Python if you prefer coding.
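A sketch of running H2O in a container; the repository name below is an assumption on my part (h2o.ai publishes several images, so check Docker Hub for the current one), while 54321 is H2O's documented default port for the Flow UI:

```shell
# Run H2O and publish its default web port; the Flow UI is served there
docker run --rm -p 54321:54321 h2oai/h2o-open-source-k8s
```

Once the container logs show the cluster is up, open http://localhost:54321 to reach Flow.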
So, those are my thoughts on the top 8 Docker images out there for data science projects. Let me know what you think. If you think a particular image deserves to be on this list, comment below.