Comparing the execution time between foverlaps and findOverlaps [data.table vs GenomicRanges]

April 13, 2015
(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

Both of these functions find overlaps between genomic intervals. The findOverlaps function is from the Bioconductor package GenomicRanges (or IRanges, if you don’t need to compare intervals with an associated chromosome and strand). foverlaps is from the data.table package and is inspired by findOverlaps.

In genomics, we often have one large data set X with small interval ranges (usually sequenced reads) and another smaller data set Y with larger interval spans (usually exons, introns etc.). Generally, we are tasked with finding which intervals in X overlap with which intervals in Y.
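As a minimal sketch of this task with findOverlaps, the toy example below uses plain IRanges objects (no chromosome or strand). The coordinates are made up for illustration: x plays the role of the short reads and y the role of the longer features.

```r
library(IRanges)

# Hypothetical toy data: four short "reads" (x) and two longer "features" (y).
x <- IRanges(start = c(1, 15, 22, 40), end = c(5, 20, 25, 45))
y <- IRanges(start = c(3, 30), end = c(18, 50))

# findOverlaps(query, subject) returns a Hits object that pairs
# indices of x with indices of y for every overlapping pair.
hits <- findOverlaps(x, y)

queryHits(hits)    # indices into x that overlap something in y
subjectHits(hits)  # matching indices into y
```

The third read, [22, 25], falls in the gap between the two features, so it does not appear among the query hits.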

In the foverlaps function, Y has to be indexed using the setkey function (we don’t have to do this for X). The key is intended to speed up the search for overlaps.
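The keying step could look roughly like the sketch below. The column names start and end and the toy coordinates are assumptions chosen for illustration, not taken from the original analysis.

```r
library(data.table)

# Hypothetical toy data; integer columns named "start" and "end" are an assumption.
x <- data.table(start = c(1L, 15L, 22L, 40L), end = c(5L, 20L, 25L, 45L))
y <- data.table(start = c(3L, 30L), end = c(18L, 50L))

# Only y needs a key; the last two key columns are treated as the interval.
setkey(y, start, end)

# which = TRUE returns index pairs (xid, yid) instead of the joined rows;
# nomatch = 0L drops rows of x that overlap nothing in y.
hits <- foverlaps(x, y, type = "any", which = TRUE, nomatch = 0L)
print(hits)
```

Because x has no key here, foverlaps falls back to the key columns of y to locate the interval columns in x, so both tables need columns with the same names.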

Which one is faster?

To check this, we used the benchmark function from the rbenchmark package. It’s a simple wrapper around the system.time function.
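A comparison of this kind might be set up as in the sketch below. It is not the code used for the measurements reported here: the data are simulated, and the sizes n_x and n_y, the interval widths, and the number of replications are all hypothetical choices.

```r
library(data.table)
library(IRanges)
library(rbenchmark)

set.seed(1)
n_x <- 1e4  # hypothetical number of short intervals (X)
n_y <- 1e3  # hypothetical number of long intervals (Y)

# Simulated integer coordinates: X intervals are short, Y intervals are longer.
x_start <- sample.int(1e6, n_x); x_end <- x_start + 50L
y_start <- sample.int(1e6, n_y); y_end <- y_start + 1000L

# data.table representation; only Y is keyed.
x_dt <- data.table(start = x_start, end = x_end)
y_dt <- data.table(start = y_start, end = y_end)
setkey(y_dt, start, end)

# IRanges representation of the same intervals.
x_ir <- IRanges(start = x_start, end = x_end)
y_ir <- IRanges(start = y_start, end = y_end)

# benchmark() times each named expression over the given number of replications.
res <- benchmark(
  foverlaps    = foverlaps(x_dt, y_dt, which = TRUE, nomatch = 0L),
  findOverlaps = findOverlaps(x_ir, y_ir),
  replications = 5,
  columns = c("test", "replications", "elapsed", "relative")
)
print(res)
```

Repeating this for a range of n_x values and collecting the elapsed column would give the kind of timing curve discussed in this post. The absolute numbers depend on the machine and package versions.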

The code below plots the execution time of both functions for increasing numbers of rows of data set X.

Interestingly, foverlaps is the faster of the two at finding overlaps, but only when the large data set has fewer than 200k rows.

We also plotted the situation in which we swapped X and Y in the arguments of both functions. In this case, you can see that foverlaps is much slower than findOverlaps almost from the beginning.

Information about my R session:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.4 (Mavericks)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     
other attached packages:
[1] data.table_1.9.4     rbenchmark_1.0.0     GenomicRanges_1.18.4
[4] GenomeInfoDb_1.2.4   IRanges_2.0.1        S4Vectors_0.4.0     
[7] BiocGenerics_0.12.1 
loaded via a namespace (and not attached):
[1] chron_2.3-45   plyr_1.8.1     Rcpp_0.11.5    reshape2_1.4.1 stringr_0.6.2 
[6] tools_3.1.3    XVector_0.6.0
