Quality trimming in R using ShortRead and Biostrings

March 3, 2010
By

(This article was first published on Jermdemo Raised to the Law, and kindly contributed to R-bloggers)

I wrote an R function to do soft-trimming, right clipping FastQ reads based on quality.

This function has the option of leaving out sequences trimmed to extinction and will do left-side fixed trimming as well.
#softTrim
#trim first position lower than minQuality and all subsequent positions
#omit sequences that after trimming are shorter than minLength
#left trim to firstBase, (1 implies no left trim)
#input: ShortReadQ reads
# integer minQuality
# integer firstBase
# integer minLength
#output: ShortReadQ trimmed reads
library("ShortRead")
softTrim<-function(reads,minQuality,firstBase=1,minLength=5){
qualMat<-as(FastqQuality(quality(quality(reads))),'matrix')
qualList<-split(qualMat,row(qualMat))
ends<-as.integer(lapply(qualList,
function(x){which(x < minQuality)[1]-1}))
#length=end-start+1, so set start to no more than length+1 to avoid negative-length
starts<-as.integer(lapply(ends,function(x){min(x+1,firstBase)}))
#use whatever QualityScore subclass is sent
newQ<<-ShortReadQ(sread=subseq(sread(reads),start=starts,end=ends),
quality=new(Class=class(quality(reads)),
quality=subseq(quality(quality(reads)),
start=starts,end=ends)),
id=id(reads))

#apply minLength using srFilter
lengthCutoff <- srFilter(function(x) {
width(x)>=minLength
},name="length cutoff")
newQ[lengthCutoff(newQ)]
}


To use:
library("ShortRead")
source("softTrimFunction.R") #or whatever you want to name this
reads<-readFastq("myreads.fq") trimmedReads<-softTrim(reads=reads,minQuality=5,firstBase=4,minLength=3) writeFastq(trimmedReads,file="trimmed.fq")
I strongly recommend reading the excellent UC Riverside HT-Sequencing Wiki cookbook and tutorial if you wish to venture into using R for NGS handling. Among other things, it will explain how to perform casting if you have Solexa scaled (base 64) fastq files. The function should respect that. http://manuals.bioinformatics.ucr.edu/home/ht-seq

To leave a comment for the author, please follow the link and comment on his blog: Jermdemo Raised to the Law.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.