Using parallelization, multiple git repositories and setting permissions when automating R applications with Jenkins


Introduction

In the previous post, we focused on setting up declarative Jenkins pipelines with emphasis on parametrizing builds and using environment variables across pipeline stages.

In this post, we look at various tips that can be useful when automating R application testing and continuous integration: orchestrating parallelization, combining sources from multiple git repositories, and ensuring proper access rights on the Jenkins agent.

Running stages in parallel

Parallel computation using R

There are numerous ways to achieve parallel computation in the context of an R application. Options native to R include, for example, the base parallel package and extension packages such as foreach and future. A minimal sketch using the parallel package follows below.
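
As a minimal sketch, parallelizing a function call over a local cluster of worker processes with the base parallel package can look as follows; the long_running_task function and the worker count are placeholders for illustration:

library(parallel)

# Placeholder for an expensive computation
long_running_task <- function(x) {
  Sys.sleep(1)  # simulate work
  x^2
}

# Spin up two local worker processes (portable across platforms)
cl <- makeCluster(2L)
results <- parLapply(cl, 1:10, long_running_task)
stopCluster(cl)

str(results)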

Governing parallelism directly within R code requires tackling many aspects, starting with logging and ending with handling conditions and exceptions. We might therefore also be interested in leaving the orchestration of parallelism to a layer above the R application code itself. This approach has both benefits and limitations, so careful consideration should be given before the implementation starts.

Orchestrating parallelization of R jobs with Jenkins

Declarative Jenkins pipelines are one way to orchestrate parallelism, with many options available. A very simple example of a parallelized process can look as follows:

pipeline {
    agent any
    stages {
        stage('Preparation') {
            steps {
                // Cleanup, environment setup, etc.
                echo 'Preparing'
            }
        }
        stage('Tests') {
            parallel {
                stage('Unit Tests') {
                    steps {
                        // Invoke unit tests
                        echo 'Running unit tests'
                    }
                }
                stage('Integration Tests') {
                    steps {
                        // Invoke integration tests
                        echo 'Running integration tests'
                    }
                }
                stage('Regression Tests') {
                    steps {
                        // Invoke regression tests
                        echo 'Running regression tests'
                    }
                }
                stage('Technical checks') {
                    steps {
                        // Invoke technical checks
                        echo 'Running technical checks'
                    }
                }
            }
        }
    }
}

Note the parallel directive, which ensures that the (sub)stages within it

  • Unit Tests
  • Integration Tests
  • Regression Tests
  • Technical checks

are executed in parallel.

The parallel stages will only start after the first stage, “Preparation”, has finished. This is useful in case we need a stage that is shared among the parallel stages to be executed first.

Failing early

If we want to fail the parallel stages early (as soon as any of them fails), we can add failFast true into the parallel stage:

stage('Tests') {
    failFast true
    parallel {
        // ...
    }
}

An example parallel Jenkins pipeline shown by Blue Ocean. Image credit https://bit.ly/31e8cAy

Cloning multiple git repositories

In certain situations, we may need to clone not just the main repository that is subject to our multibranch pipeline, but also secondary repositories.

An example of such a setup is when we store modeling parameters for our runs in a separate repository, or when configurations governing the runs are kept in a repository of their own.

The git step allows us to clone another repository. Note that if you need to use credentials for the process, these are configured in Jenkins’ credentials configuration and referenced by their ID.

stage('Clone another repository') {
    steps {
        git branch: 'master',
            credentialsId: 'my-credential-id',
            url: 'git@example.com:user/repo.git'
    }
}

Cloning into a separate subdirectory

Note, however, that this will clone the repository into the current working directory, where the main repository subject to the pipeline is likely already checked out. This may have unintended consequences, so a safer approach is to check out the secondary repository into a separate directory. We can achieve this using the dir step:

stage('Clone another repository to subdir') {
    steps {
        sh 'rm -rf subdir; mkdir subdir'
        dir('subdir') {
            git branch: 'master',
                credentialsId: 'my-credential-id',
                url: 'git@example.com:user/repo.git'
        }
    }
}
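
Once the secondary repository is checked out under subdir, the R code invoked later in the pipeline can read from it directly. A hypothetical illustration, assuming the repository contains a config.yml with the field shown:

# Hypothetical: read run configuration from the secondary repository in subdir/.
# The file name "config.yml" and the model_version field are assumptions.
config <- yaml::read_yaml("subdir/config.yml")
message("Running with model version: ", config$model_version)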

Cleaning up

After the pipeline is done, it may be useful to perform cleanup steps, for example removing directories that are no longer needed. Since we likely want to clean these up no matter how the pipeline ends, we can take advantage of the post section's always block, which is executed regardless of the outcome of the pipeline stages.

One example use is to remove the hidden .git directories both from the working directory, where the main repository is checked out, and from "subdir", where we checked out the secondary repository:

post {
    always {
        sh 'rm -rf .git'
        sh 'rm -rf subdir/.git'
    }
}

Changing permissions to allow the Jenkins user to read

One aspect of using Jenkins to execute our R code is to ensure that the Jenkins user executing the code on the worker node has access to all the necessary files. The following is a list of useful Linux commands that can help with the setup. These should, of course, be used with care.

# Add user `jenkins` to group `somegroup`
usermod -a -G somegroup jenkins
# Change group of somedir/ to somegroup, recursively
chgrp -R somegroup somedir/
# Allow group to read `somedir`, recursively
chmod -R g+r somedir/
# Find all directories in a path and allow group to traverse
find /dir/moredir/somedir -type d -exec chmod g+x {} \;
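
After adjusting permissions, it can be worth verifying from R, running as the jenkins user, that the relevant files are actually readable. A small sketch using base R, with somedir/ standing in for the path in question:

# Check that all files under somedir/ are readable by the current user.
# Run as the `jenkins` user; `somedir` stands in for the actual path.
files <- list.files("somedir", recursive = TRUE, full.names = TRUE)
readable <- file.access(files, mode = 4) == 0  # mode 4 tests read permission
if (!all(readable)) {
  warning("Unreadable files: ", paste(files[!readable], collapse = ", "))
}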
