I’ve been optimizing various functions in ALDEx2 on Bioconductor to make it more efficient. One bottleneck has been the aldex.effect() function, which calculates an effect size for the difference between distributions. I will write a separate post on how this calculation works, but I’m happy to say that the new effect size calculation is about 3X faster than the old one.
I was aided in this by the fabulous profvis package that gives a graphical overview of where the time and space bottlenecks are in R code. I discovered several bottlenecks.
First, when the original version was written for R v2.8, we had substantial memory issues that were solved by adding gc() calls to free up memory. In the current version of R (v4.0), garbage collection happens automatically, as seen in the profvis() output. Removing these legacy calls resulted in a large speedup.
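To give a feel for why explicit gc() calls can be costly, here is a minimal sketch in base R. It is not the ALDEx2 code: summarise_chunks() is a hypothetical helper invented for this illustration, and the matrix sizes are arbitrary. It runs the same loop with and without a forced collection on every iteration and checks that the results are identical.

```r
# Hypothetical helper (not part of ALDEx2): summarise n_iter random matrices,
# optionally forcing a garbage collection after each one.
summarise_chunks <- function(n_iter, force_gc = FALSE) {
  set.seed(42)                      # same random numbers for both runs
  out <- numeric(n_iter)
  for (i in seq_len(n_iter)) {
    m <- matrix(rnorm(1e5), nrow = 1000)
    out[i] <- median(m)
    if (force_gc) gc()              # explicit full collection on every pass
  }
  out
}

t_gc   <- system.time(res_gc   <- summarise_chunks(50, force_gc = TRUE))
t_auto <- system.time(res_auto <- summarise_chunks(50, force_gc = FALSE))

# The numbers are the same either way; only the elapsed time differs.
stopifnot(identical(res_gc, res_auto))
print(c(with_gc = t_gc[["elapsed"]], automatic = t_auto[["elapsed"]]))
```

The exact timings depend on the machine and on how much memory the session already holds, so treat the printed numbers as indicative only.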
Second, the calculations done behind the scenes are mostly non-parametric, such as finding maximum or minimum values, or medians. These now use the pmax() function from base R, and the apply() calls have been replaced with the corresponding rowMedians() and colMedians() functions from the Rfast package on CRAN.
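The kind of substitution described above can be sketched as follows. This is an illustration on made-up matrices, not the actual aldex.effect() internals: the names a, b, slow_max and so on are assumptions for the example, and the Rfast comparison only runs if that package is installed.

```r
set.seed(1)
# Two arbitrary matrices standing in for per-feature values (illustrative only).
a <- matrix(rnorm(1600 * 16), nrow = 1600)
b <- matrix(rnorm(1600 * 16), nrow = 1600)

# Element-wise maximum: one max() call per element vs. vectorized pmax().
slow_max <- matrix(mapply(max, a, b), nrow = nrow(a))
fast_max <- pmax(a, b)
stopifnot(isTRUE(all.equal(slow_max, fast_max)))

# Row medians: apply() with median() vs. Rfast::rowMedians().
slow_med <- apply(a, 1, median)
if (requireNamespace("Rfast", quietly = TRUE)) {
  fast_med <- Rfast::rowMedians(a)
  print(all.equal(slow_med, fast_med))
  print(system.time(for (i in 1:20) apply(a, 1, median)))
  print(system.time(for (i in 1:20) Rfast::rowMedians(a)))
}
```

Both versions compute the same quantities; the vectorized forms avoid calling an R function once per row or per element, which is where apply() and mapply() spend most of their time.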
These simple changes result in quite remarkable speed increases. The function is now able to process 128 Dirichlet MC replicates from 14 samples and 1600 rows in about 1.6 seconds, down from about 4.7 seconds, which is about a 3X speedup. As a side effect, the memory footprint is roughly cut in half, from about 1500 MB to about 700 MB, making the entire function more efficient in both time and space.
You can try it yourself by installing the development version of ALDEx2 from Bioconductor and running the following code.
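library(ALDEx2)
data(selex)  # example dataset shipped with ALDEx2
conds <- c(rep("N", 7), rep("S", 7))  # two groups of 7 samples
x <- aldex.clr(selex, conds)  # default is 128 Dirichlet MC replicates
system.time(x.e <- aldex.effect(x))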