Comparing the execution time between foverlaps and findOverlaps [data.table vs GenomicRanges]

Posted on April 13, 2015 by Katarzyna Wręczycka in R bloggers | 0 Comments

[This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers].

Both of these functions find overlaps between genomic intervals. The findOverlaps function is from the Bioconductor package GenomicRanges (or IRanges if you don’t need to compare intervals with an associated chromosome and strand). foverlaps is from the data.table package and is inspired by findOverlaps.

In genomics, we often have one large data set X with small interval ranges (usually sequenced reads) and another smaller data set Y with larger interval spans (usually exons, introns etc.). Generally, we are tasked with finding which intervals in X overlap with which intervals in Y.
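As a concrete illustration of this task, here is a minimal sketch using findOverlaps. The toy coordinates below are made up for illustration and are not the data used in the benchmark; X stands in for the short reads and Y for the longer features.

```r
library(GenomicRanges)

# Toy data (hypothetical values): X holds short "reads", Y holds longer "exons".
X <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
             ranges   = IRanges(start = c(5, 50, 10), end = c(14, 59, 19)))
Y <- GRanges(seqnames = c("chr1", "chr2"),
             ranges   = IRanges(start = c(1, 1), end = c(20, 100)))

# query = X, subject = Y; intervals must be on the same chromosome to overlap
hits <- findOverlaps(X, Y)

queryHits(hits)    # row indices into X that overlap something in Y
subjectHits(hits)  # the matching row indices into Y
```

Here the second read in X lies on chr1 but outside the only chr1 interval in Y, so it does not appear among the hits.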

In the foverlaps function, Y has to be indexed using the setkey function (X does not need a key). The key is intended to speed up finding overlaps.
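A minimal sketch of this keying step, assuming toy tables with hypothetical column names chr, start and end (foverlaps treats the last two key columns as the interval start and end):

```r
library(data.table)

# Toy data (hypothetical values and column names)
X <- data.table(chr   = c("chr1", "chr1", "chr2"),
                start = c(5L, 50L, 10L),
                end   = c(14L, 59L, 19L))
Y <- data.table(chr   = c("chr1", "chr2"),
                start = c(1L, 1L),
                end   = c(20L, 100L))

# Only Y is keyed; the last two key columns define the interval
setkey(Y, chr, start, end)

# which = TRUE returns row indices (xid, yid) instead of the joined table;
# nomatch = 0L drops rows of X that overlap nothing in Y
ov <- foverlaps(X, Y, by.x = c("chr", "start", "end"),
                type = "any", which = TRUE, nomatch = 0L)
ov
```

Note that yid refers to rows of Y in key order, because setkey sorts Y by reference.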

Which one is faster?

To check this, we used the benchmark function from the rbenchmark package. It is a simple wrapper around the system.time function.

We plotted the execution time of both functions for increasing numbers of rows in data set X.
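The original benchmarking script is not reproduced here. The following is only a rough sketch of how such a comparison could be set up, under stated assumptions: the helpers random_intervals and to_granges, the interval widths, the sizes of X and the number of replications are all hypothetical choices, not the settings behind the results reported in this post.

```r
library(data.table)
library(GenomicRanges)
library(rbenchmark)

set.seed(1)

# Hypothetical helper: n random intervals of a fixed width on one chromosome
random_intervals <- function(n, width, chr_len = 1e6L) {
  start <- sample.int(chr_len, n, replace = TRUE)
  data.table(chr = "chr1", start = start,
             end = start + as.integer(width) - 1L)
}

# Hypothetical helper: convert such a table to a GRanges object
to_granges <- function(dt) {
  GRanges(seqnames = dt$chr,
          ranges   = IRanges(start = dt$start, end = dt$end))
}

# Smaller data set Y with wider intervals; keyed once for foverlaps
Y_dt <- random_intervals(1000L, width = 1000L)
setkey(Y_dt, chr, start, end)
Y_gr <- to_granges(Y_dt)

# Time both functions for an increasing number of rows in X
sizes <- c(1e3, 1e4, 1e5)
timings <- rbindlist(lapply(sizes, function(n) {
  X_dt <- random_intervals(n, width = 50L)
  X_gr <- to_granges(X_dt)
  res <- benchmark(
    foverlaps    = foverlaps(X_dt, Y_dt, by.x = c("chr", "start", "end"),
                             which = TRUE, nomatch = 0L),
    findOverlaps = findOverlaps(X_gr, Y_gr),
    replications = 5,
    columns      = c("test", "replications", "elapsed")
  )
  data.table(n = n, res)
}))

timings
```

The resulting table has one row per function and size, so the elapsed column can be plotted against n for each function. The numbers will depend on the machine and package versions, so they should not be expected to match the results described here.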

Interestingly, foverlaps is the faster of the two functions for finding overlaps, but only when the large data set X has fewer than 200k rows.

We also plotted the situation in which X and Y swap places in the arguments of both functions. In this case, foverlaps is much slower than findOverlaps almost from the beginning.
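For foverlaps, swapping the arguments also moves the key requirement: the table passed second must be keyed, so X now needs the key instead of Y. A minimal sketch with the same kind of toy data (values and column names are hypothetical):

```r
library(data.table)

# Toy data (hypothetical values and column names)
X <- data.table(chr   = c("chr1", "chr1", "chr2"),
                start = c(5L, 50L, 10L),
                end   = c(14L, 59L, 19L))
Y <- data.table(chr   = c("chr1", "chr2"),
                start = c(1L, 1L),
                end   = c(20L, 100L))

# X is now the second argument, so X is the table that needs the key
setkey(X, chr, start, end)

# xid now indexes rows of Y, and yid indexes rows of the keyed X
ov_swapped <- foverlaps(Y, X, by.x = c("chr", "start", "end"),
                        which = TRUE, nomatch = 0L)
ov_swapped
```

With the arguments in this order, it is the large table that gets sorted and keyed.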

Information about my R session:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.4 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] data.table_1.9.4     rbenchmark_1.0.0     GenomicRanges_1.18.4
[4] GenomeInfoDb_1.2.4   IRanges_2.0.1        S4Vectors_0.4.0
[7] BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] chron_2.3-45   plyr_1.8.1     Rcpp_0.11.5    reshape2_1.4.1 stringr_0.6.2
[6] tools_3.1.3    XVector_0.6.0
