Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When you are working with large FASTA datasets is probable to find out that the sequences are in sort of a mixed quality (obviously, depending on your scientific question),

I mean for example, imagine that you retrieve the whole collection of exons of a given organism and suppose that the FASTA file is 50mb and there are included ~50,000 DNA sequences but, if you look at them, you may find that there are sequences much larger than others and others will be probably 50 bases long.

Well that’s what I mean with mixed quality and therefore, a sort of filter might be very helpful because you may find more informative to put a threshold and say “hey,  from the pool of ~50K sequences I only want 5K sequences randomly chosen (from the entire FASTA dataset) and every sequence must have a fixed length, say 1000 bases long”.

That’s why I wrote some code in R language, it’s a function titled “shuffleAndExtract” and you can download the example set and the source here.

Here is the description of the function:

shuffleAndExtract: This function in R is designed to open a fasta file dataset, shuffle the sequences and extract the desired sequences wanted by the user to generate a new dataset of fixed size (number of required sequences) and with the same length for each sequence.

And, after you download the example set and the R source file, you can try to run a very simple example:

NOTE: my implementation depends on the function “seqinr“, if is not installed, you may do this before all the magic begin with a simple install.packages(“seqinr”).

# run example:
shuffleAndExtract(“example.fasta”,1000,200)

The arguments of the function are:

inputFastaFile: name of the input fasta file
numberOfoutputSeqs: number of desired sequences in the output file
lengthOutputSeqs: fixed length of every sequence in the output file
initialPos: Position where the new window sizing will begin, default = 1
outputFileName: name of the output file, by default will be (e.g):  “inputFastaFile.fasta.output.fasta”

CODE:

################################################################################

shuffleAndExtract <- function(inputFastaFile,
numberOfoutputSeqs,
lengthOutputSeqs,
initialPos=1,
outputFileName = paste(inputFastaFile,"output",
"fasta",sep=".")
){

#
#
#    This function in R is designed to open a fasta file dataset, shuffle the
#    sequences and extract the desired sequences wanted by the user to generate
#    a new dataset of fixed size (number of required sequences) and with the
#    same length for each sequence.
#
#    Author: Benjamin Tovar
#    Date: 22/JUNE/2012
#
################################################################################

#    run example:
#    shuffleAndExtract("example.fasta",1000,200)

# inputFastaFile: name of the input fasta file
# numberOfoutputSeqs: number of desired sequences in the output file
# lengthOutputSeqs: fixed length of every sequence in the output file
# initialPos: Position where the new window sizing will begin, default = 1
# outputFileName: name of the output file, by default will be (e.g):
#                 "inputFastaFile.fasta.output.fasta"

cat("*** Starting computations |",date(),"****\n")

require(seqinr)

# Load the large seq and not shuffled dataset

cat("\tProcessing for",length(inputSeqs),"sequences |",date(),"\n")

# Extract the length of every sequence and store the results in a vector.
inputSeqsSize <- rep(NA,length(inputSeqs))
for(i in 1:length(inputSeqs)){
inputSeqsSize[i] <- summary(inputSeqs[[i]])\$length
}

# Extract the index of the sequences which are longer than threshold
inputSeqsLongSeqIndex <- which(inputSeqsSize > (lengthOutputSeqs+initialPos))

# randomly pick numberOfoutputSeqs indexes that will be used to create
# the output dataset
inputSeqsIndex <- sample(inputSeqsLongSeqIndex,numberOfoutputSeqs,rep=F)

# Store the Fasta header of each selected sequence in a vector
inputSeqsIndexNames <- rep(NA,numberOfoutputSeqs)

# create output object
outputSeqs <- list()
for(i in 1:numberOfoutputSeqs){
inputSeqsIndexNames[i] <- attr(inputSeqs[[inputSeqsIndex[i]]],"name")
# Extract the sequence
outputSeqs[[i]] <- inputSeqs[[inputSeqsIndex[i]]][initialPos:((initialPos+lengthOutputSeqs)-1)]
}

# Export the sequences in a new fasta file
write.fasta(outputSeqs,inputSeqsIndexNames,outputFileName)

cat("\n*** DONE |",date(),"****\n")
}

# 22 Jun 2012 | Benjamin Tovar


Benjamin