Fast(ish) extraction of exon locations from a BED12 file using data.table

March 20, 2011

(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

Here is a fast R function to extract exon locations from a BED12 file. Note that fast is a relative term, the function below is fast enough for me, may not be fast enough for others šŸ™‚ Anyway, a BED12 file typically has locations of genomic features (those features are usually genes or transcripts if the format is BED12 ). 11th and 12th columns of the BED12 file contains information about exon sizes and exon starts, see here for more info on BED format. To be able to use that information one has to employ string functions since exon sizes and starts are concatenated to each other with commas. A straightforward way to do is to use strplit() and apply() to extract numbers located in strings, but it takes too much time if you have a large number of rows.

By using data.table I could get locations of exons for all human Ensembl transcripts in 5-6 seconds, while using apply() and strsplit() on a regular data.frame takes about 30 seconds ( see the EDIT below, this is probably because I used apply in an inefficient way, it may not be needed at all ). The function returns a data.table object which then can be transformed into IRanges or GenomicRanges objects easily. Doing the string operations and iteration in C or C++ may decrease this time even further.

EDIT: Iā€™m not sure why this is particularly faster than a data.frame, I guess I used apply in an idiotic way on the BED data.frame. I applied strsplit row-by-row on 11th column of each row, rather than using strsplit directly on 11th column of the data.frame. Anyway, it was a good exercise but a bit useless šŸ™‚<-function(bed.file,skip=0)
ref =read.table(bed.file,header=F,skip=skip,colClasses = colClasses) # give colClasses for fast read-in
# apply strsplit on columns 11 and 12 to extract exon start positions and exon sizes
b.start.size=ref.dt[, cbind(as.integer(unlist(strsplit(as.character(V12),"," ))),as.integer(unlist(strsplit(as.character(V11),"," ))))]
rep.ref.dt=ref.dt[rep(1:nrow(ref.dt),ref[,10]),] # replicate rows occurs as many as its exon number
rep.ref.dt$V3=rep.ref.dt$V2+b.start.size[,1]+b.start.size[,2] # calculate exon start and ends
# a line from a bed file, just to show how it looks like
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 chr1 223340009 223340009 ENST00000328556 0 + 223340009 223351568 0 2 105,126,
1 0,11433,

To leave a comment for the author, please follow the link and comment on their blog: Recipes, scripts and genomics. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)