Affordable automatic deployment of Spark and HDFS with Kubernetes and Gitlab CI/CD

[This article was first published on Angel Sevilla Camins' Blog, and kindly contributed to R-bloggers].

Summary

Running an application on Spark with external dependencies, such as R and Python packages, requires installing those dependencies on every worker. To automate this tedious process, a continuous deployment workflow was built with GitLab CI/CD. The workflow consists of (i) building the HDFS and Spark Docker images, including the required Python and R dependencies for the master and the workers, and (ii) deploying those images on a Kubernetes cluster. For this, we use an affordable cluster made of mini PCs. More importantly, we demonstrate that this cluster is fully operational: the Spark cluster is accessible through the Spark UI, Zeppelin and RStudio, and HDFS is fully integrated with Kubernetes. The source code for the custom Docker images and for the Kubernetes object definitions can be found here and here, respectively.
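
As a rough illustration of what "fully operational" means in practice (this sketch is not taken from the original article), the following R snippet connects to such a cluster from RStudio with sparklyr and reads a file from HDFS. The service names spark-master and hdfs-namenode, the ports, and the file path are placeholder assumptions standing in for whatever Kubernetes services expose the Spark master and the HDFS namenode; a local Spark installation matching the cluster version is also assumed via SPARK_HOME.

# Minimal connectivity check from RStudio using sparklyr.
# Assumes SPARK_HOME points to a local Spark install matching the cluster version.
library(sparklyr)
library(dplyr)

sc <- spark_connect(
  master = "spark://spark-master:7077",   # Spark master service (placeholder name)
  app_name = "cluster-smoke-test"
)

# Read a CSV stored on HDFS and run a simple aggregation on the workers
flights <- spark_read_csv(
  sc,
  name = "flights",
  path = "hdfs://hdfs-namenode:9000/data/flights.csv"   # namenode service (placeholder)
)

flights %>%
  group_by(origin) %>%
  summarise(n = n()) %>%
  collect()

spark_disconnect(sc)

If the Docker images were built with the right R and Python packages and the Kubernetes services are reachable, a query like this runs on the workers without any manual installation step.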

Go here to read the entire blog.

To leave a comment for the author, please follow the link and comment on their blog: Angel Sevilla Camins' Blog.

