Admit it or not, we human beings become anxious and impatient when we have to wait, especially when we are blindfolded, that is, when we have no idea how long the wait will last. As pointed out by Brad Allan Myers, arguably the designer of the progress bar in the 1980s, being able to track progress while waiting can significantly improve the user experience (Myers, 1985).
As an R programmer in bioinformatics research, my code is often not designed for the general public, but it is still important to keep my users, namely my colleagues and fellow researchers, as happy as possible. However, tracking progress in R can be tricky. In this article, I present some approaches and my own solution (pbmcapply).
The easiest way to handle progress tracking in R is to periodically print the percentage of completion to the output, which is the screen by default, or write it to your log file located somewhere on the disk. Needless to say, this is probably the least elegant way to solve the problem, but many people are still following this path nowadays.
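As a rough illustration of the print-based approach, here is a minimal sketch in base R. The function name sqrt_with_progress and the report_every parameter are made up for this example; they do not come from any package.

```r
# Hypothetical helper: process items one by one and print the percentage done.
sqrt_with_progress <- function(items, report_every = 10) {
  results <- numeric(length(items))
  for (i in seq_along(items)) {
    results[i] <- sqrt(items[i])
    # Report every `report_every` items, and once more at the very end
    if (i %% report_every == 0 || i == length(items)) {
      cat(sprintf("%3.0f%% done\n", 100 * i / length(items)))
    }
  }
  results
}

roots <- sqrt_with_progress(1:50)
```

To log to a file instead of the screen, cat() accepts file and append arguments.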
A better (and still easy) solution is to adopt a package named pbapply. According to its development page, the package is very popular, with 90k downloads. It is also easy to use: wherever you would call a function from the apply family, call the pbapply version of it instead. For example:
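# Load the package (install it first with install.packages("pbapply"))
library(pbapply)
# Some numbers we are going to work with
nums <- 1:10
# Call sapply to get the square root of these numbers
# (the result is named roots so that it does not mask the sqrt function)
roots <- sapply(nums, sqrt)
# Now track the progress using the pbapply package
roots <- pbsapply(nums, sqrt)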
While the numbers are processed, a progress bar will be printed to the output and refreshed repeatedly.
Although pbapply is a great tool and I use it frequently, until recently it could not track the progress of the parallel version of apply, mclapply from the parallel package. In September, the author of pbapply updated his package with support for snow-type clusters and multicore-type forking. However, his approach splits the elements into chunks and applies mclapply to the chunks sequentially. One of the biggest caveats of this approach is that if the number of elements is much higher than the number of cores, a lot of mclapply calls will be executed. Calls to mclapply, which is built upon the fork() function in Unix/Linux, are very expensive: forking into lots of child processes is time consuming and creates memory overhead.
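To make the cost of the chunked approach concrete, here is a minimal sketch of the general idea. It is not pbapply's actual source code; the function chunked_mclapply and its chunk_size parameter are hypothetical. Note that every chunk triggers a fresh round of forking.

```r
library(parallel)

# Hypothetical illustration: one mclapply call (and thus one round of forks)
# per chunk, with the progress bar advancing after each chunk finishes.
chunked_mclapply <- function(X, FUN, chunk_size = 10L,
                             mc.cores = if (.Platform$OS.type == "windows") 1L else 2L) {
  # Group consecutive elements into chunks of at most chunk_size
  chunks <- split(X, ceiling(seq_along(X) / chunk_size))
  pb <- txtProgressBar(min = 0, max = length(chunks), style = 3)
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    results[[i]] <- mclapply(chunks[[i]], FUN, mc.cores = mc.cores)
    setTxtProgressBar(pb, i)
  }
  close(pb)
  # Flatten the per-chunk lists back into one list, in the original order
  unlist(results, recursive = FALSE, use.names = FALSE)
}

res <- chunked_mclapply(1:25, function(x) x^2, chunk_size = 10L)
```

With 25 elements and a chunk size of 10, this sketch calls mclapply three times. With a million elements it would call it 100,000 times, which is where the forking overhead starts to dominate.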
Pbmcapply is my own solution to address this problem. Available as a CRAN package, it can be easily incorporated into your code:
# Install pbmcapply from CRAN
install.packages("pbmcapply")
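Once the package is installed, a call might look like the sketch below. It uses pbmclapply, the package's counterpart to mclapply. The choice of two cores is only an assumption for the example, and the progress bar may not be drawn in non-interactive sessions.

```r
library(pbmcapply)

# Forking is not available on Windows, so fall back to a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

nums <- 1:100
# Same interface as mclapply, but with a progress bar
roots <- pbmclapply(nums, sqrt, mc.cores = cores)
roots <- unlist(roots)
```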
As you might have guessed from its name, I was inspired by the pbapply package. Unlike pbapply, my solution does not rely on executing multiple mclapply calls. Instead, pbmcapply takes advantage of a package named future.
In computer science, a future is an object that will hold a value at some later point. It allows the program to start some code as a future and, without waiting for it to return, proceed to the next step. In pbmcapply, the mclapply call is wrapped in a future. The future periodically reports its progress to the main program, and the main program maintains a progress bar to display the updates.
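The idea can be sketched with the future package as shown below. This is only an illustration of the concept and not how pbmcapply is implemented internally: reporting progress through a temporary file, the multisession plan, and the variable names are all assumptions made for this example.

```r
library(future)
plan(multisession, workers = 2)

n <- 20
progress_file <- tempfile()
file.create(progress_file)

# Start the work as a future; the main program continues immediately
f <- future({
  lapply(seq_len(n), function(i) {
    Sys.sleep(0.05)                                  # pretend to do real work
    cat(".", file = progress_file, append = TRUE)    # one byte per finished item
    i^2
  })
})

# Meanwhile, the main program polls the file and redraws the progress bar
pb <- txtProgressBar(min = 0, max = n, style = 3)
while (!resolved(f)) {
  setTxtProgressBar(pb, min(n, file.size(progress_file)))
  Sys.sleep(0.1)
}
setTxtProgressBar(pb, n)
close(pb)

result <- value(f)   # collect the values held by the future
unlink(progress_file)
plan(sequential)
```

The key point is that only one future is created for the whole job, so the cost of setting it up is paid once, regardless of how many elements are processed.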
Because the overhead in pbmcapply is small and does not grow linearly with the number of elements, a dramatic performance gain is seen when the number of elements to iterate over is much larger than the number of CPU cores. In the benchmark, the single-threaded and multi-threaded apply functions from base R serve as references. Even with pbmcapply, performance is still affected somewhat, because of the time required to set up the monitoring process.
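To get a feel for such overhead on your own machine, you can time the reference functions with system.time. The sketch below is not the benchmark discussed above; the workload slow_square and the input size are arbitrary choices for illustration, and the timings will vary from machine to machine.

```r
library(parallel)

# Hypothetical workload: each call takes roughly 10 ms
slow_square <- function(x) {
  Sys.sleep(0.01)
  x^2
}

nums <- 1:40
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Time the single-process and the forked versions of the same job
t_serial   <- system.time(res_serial   <- lapply(nums, slow_square))
t_parallel <- system.time(res_parallel <- mclapply(nums, slow_square, mc.cores = cores))

elapsed <- c(serial = t_serial[["elapsed"]], parallel = t_parallel[["elapsed"]])
print(elapsed)
```

Wrapping the same job in a progress-tracking function and timing it the same way gives a direct estimate of what the tracking costs.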
Everything comes at a price. When enjoying the convenience of interactive progress tracking, please keep in mind that it slightly slows down the program.
As always, one size doesn't fit all. If performance is your top priority (e.g. when running a program on a cluster), plain print statements might be the better way to track progress. On the other hand, if letting the program run for an extra second sounds reasonable, you are more than welcome to try either my solution (pbmcapply) or pbapply for a more user-friendly way to track progress.
Myers, B. A. (1985). The importance of percent-done progress indicators for computer-human interfaces. In ACM SIGCHI Bulletin (Vol. 16, No. 4, pp. 11–17). ACM.