Comparing the execution time between foverlaps and findOverlaps
Both of these functions find overlaps between genomic intervals. The findOverlaps function is from the Bioconductor package GenomicRanges (or IRanges if you don’t need to compare intervals with an associated chromosome and strand). foverlaps is from the data.table package and is inspired by findOverlaps.
In genomics, we often have one large data set X with small interval ranges (usually sequenced reads) and another smaller data set Y with larger interval spans (usually exons, introns etc.). Generally, we are tasked with finding which intervals in X overlap with which intervals in Y.
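To make the task concrete, here is a minimal sketch using findOverlaps from IRanges (so no chromosome or strand). The toy coordinates and the variable names x and y are made up for illustration and are not the data used in this post.

```r
library(IRanges)

# Hypothetical toy data: x holds short "reads", y holds longer "features".
x <- IRanges(start = c(1, 15, 40, 90), width = 5)   # 1-5, 15-19, 40-44, 90-94
y <- IRanges(start = c(10, 35), end = c(30, 60))    # 10-30, 35-60

# query = x, subject = y; the result is a Hits object of index pairs.
hits <- findOverlaps(query = x, subject = y)

q_idx <- queryHits(hits)    # rows of x that overlap something in y
s_idx <- subjectHits(hits)  # the matching rows of y
```

Here only the second and third reads fall inside a feature, so q_idx holds 2 and 3 and s_idx holds 1 and 2.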
For the foverlaps function, Y has to be indexed with the setkey function (X does not need a key). The key is intended to speed up finding overlaps.
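A minimal sketch of that keying step, assuming two small data.tables with columns named start and end (the column names and the toy values are illustrative, not taken from the benchmark data):

```r
library(data.table)

# Hypothetical toy data: x = short intervals, y = longer intervals.
x <- data.table(start = c(1L, 15L, 40L, 90L), end = c(5L, 19L, 44L, 94L))
y <- data.table(start = c(10L, 35L), end = c(30L, 60L))

# Only y (the second argument of foverlaps) needs a key;
# the last two key columns are treated as the interval start and end.
setkey(y, start, end)

# which = TRUE returns row-index pairs (xid, yid) instead of joined columns;
# nomatch = 0L drops rows of x that overlap nothing in y.
ov <- foverlaps(x, y, by.x = c("start", "end"),
                type = "any", which = TRUE, nomatch = 0L)
```

With which = TRUE the result is comparable to the Hits object returned by findOverlaps: one row per overlapping pair.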
Which one is faster?
To check this, we used the benchmark function from the rbenchmark package. It’s a simple wrapper around the system.time function.
The code below plots the execution time of both functions for increasing numbers of rows of data set X.
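The original benchmarking script is not reproduced in this text. What follows is only a rough sketch of how such a comparison could be set up, not the author's code: the helper make_intervals, the interval widths, the sizes of X and the number of replications are all assumptions, and IRanges is used in place of GenomicRanges to keep the example short.

```r
library(data.table)
library(IRanges)
library(rbenchmark)

set.seed(1)

# Hypothetical helper: n random intervals of a fixed width.
make_intervals <- function(n, width) {
  start <- sample.int(1e6, n, replace = TRUE)
  data.table(start = start, end = start + width - 1L)
}

# Small data set Y with long intervals, keyed once for foverlaps.
y_dt <- make_intervals(1000, 5000L)
setkey(y_dt, start, end)
y_ir <- IRanges(y_dt$start, y_dt$end)

# Time both functions for one size of the large data set X.
time_one <- function(n) {
  x_dt <- make_intervals(n, 50L)
  x_ir <- IRanges(x_dt$start, x_dt$end)
  res <- benchmark(
    foverlaps    = foverlaps(x_dt, y_dt, by.x = c("start", "end"),
                             which = TRUE, nomatch = 0L),
    findOverlaps = findOverlaps(x_ir, y_ir),
    replications = 5,
    columns = c("test", "elapsed")
  )
  cbind(n = n, res)
}

timings <- do.call(rbind, lapply(c(1e3, 1e4, 1e5), time_one))

# Plot elapsed time against the number of rows in X.
fo <- timings[timings$test == "foverlaps", ]
fi <- timings[timings$test == "findOverlaps", ]
plot(fo$n, fo$elapsed, type = "b", log = "x",
     xlab = "rows in X", ylab = "elapsed (s)",
     ylim = range(timings$elapsed))
lines(fi$n, fi$elapsed, type = "b", lty = 2)
legend("topleft", legend = c("foverlaps", "findOverlaps"), lty = 1:2)
```

Timings from a sketch like this depend heavily on the machine, the package versions and the interval distribution, so the crossover point it shows need not match the one reported here.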
Interestingly, foverlaps is the faster of the two at finding overlaps, but only when the large data set has fewer than about 200k rows.
We also plotted the situation in which X and Y swap places in the arguments of both functions. In this case, foverlaps is much slower than findOverlaps almost from the beginning.
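One practical detail of the swapped call, sketched here on made-up toy data (not the benchmark data): foverlaps requires the key on its second argument, so after the swap it is the large table X that has to be keyed.

```r
library(data.table)

# Hypothetical toy data: x = many short intervals, y = few long intervals.
x <- data.table(start = c(1L, 15L, 40L, 90L), end = c(5L, 19L, 44L, 94L))
y <- data.table(start = c(10L, 35L), end = c(30L, 60L))

# Roles swapped: x is now the second argument, so x carries the key.
setkey(x, start, end)

# xid now indexes rows of y (first argument), yid rows of x (second argument).
swapped <- foverlaps(y, x, by.x = c("start", "end"),
                     which = TRUE, nomatch = 0L)
```

The same overlapping pairs are found as before, only with the index columns exchanged.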
Information about my R session: