Fast(ish) extraction of exon locations from a BED12 file using data.table

March 20, 2011
By

(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

Here is a fast R function to extract exon locations from a BED12 file. Note that fast is a relative term, the function below is fast enough for me, may not be fast enough for others :) Anyway, a BED12 file typically has locations of genomic features (those features are usually genes or transcripts if the format is BED12 ). 11th and 12th columns of the BED12 file contains information about exon sizes and exon starts, see here for more info on BED format. To be able to use that information one has to employ string functions since exon sizes and starts are concatenated to each other with commas. A straightforward way to do is to use strplit() and apply() to extract numbers located in strings, but it takes too much time if you have a large number of rows.

By using data.table I could get locations of exons for all human Ensembl transcripts in 5-6 seconds, while using apply() and strsplit() on a regular data.frame takes about 30 seconds ( see the EDIT below, this is probably because I used apply in an inefficient way, it may not be needed at all ). The function returns a data.table object which then can be transformed into IRanges or GenomicRanges objects easily. Doing the string operations and iteration in C or C++ may decrease this time even further.

EDIT: I'm not sure why this is particularly faster than a data.frame, I guess I used apply in an idiotic way on the BED data.frame. I applied strsplit row-by-row on 11th column of each row, rather than using strsplit directly on 11th column of the data.frame. Anyway, it was a good exercise but a bit useless :)

bed12.to.exons<-function(bed.file,skip=0)
{
require(data.table)
colClasses=c("factor","integer","integer","factor","integer","factor","integer","integer","integer","integer","factor","factor")
ref =read.table(bed.file,header=F,skip=skip,colClasses = colClasses) # give colClasses for fast read-in
ref.dt=data.table(ref)
 
# apply strsplit on columns 11 and 12 to extract exon start positions and exon sizes
b.start.size=ref.dt[, cbind(as.integer(unlist(strsplit(as.character(V12),"," ))),as.integer(unlist(strsplit(as.character(V11),"," ))))]
rep.ref.dt=ref.dt[rep(1:nrow(ref.dt),ref[,10]),] # replicate rows occurs as many as its exon number
 
rep.ref.dt$V3=rep.ref.dt$V2+b.start.size[,1]+b.start.size[,2] # calculate exon start and ends
rep.ref.dt$V2=rep.ref.dt$V2+b.start.size[,1]+1
 
return(rep.ref.dt[,1:6,with=F])
}
 
res=bed12.to.exons(bed.file) 
system.time(bed12.to.exons(bed.file))
 
 
# a line from a bed file, just to show how it looks like
>read.table(bed.file,header=F)[1,]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 chr1 223340009 223340009 ENST00000328556 0 + 223340009 223351568 0 2 105,126,
V12
1 0,11433,
Created by Pretty R at inside-R.org

To leave a comment for the author, please follow the link and comment on his blog: Recipes, scripts and genomics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.