Handling large FASTA sequence datasets in R: Shuffle and retrieve "n" number of sequences of fixed length from the whole FASTA file and export them in a new FASTA file.

June 23, 2012
By

(This article was first published on Computational Biology Blog in fasta format, and kindly contributed to R-bloggers)

When you are working with large FASTA datasets is probable to find out that the sequences are in sort of a mixed quality (obviously, depending on your scientific question),

I mean for example, imagine that you retrieve the whole collection of exons of a given organism and suppose that the FASTA file is 50mb and there are included ~50,000 DNA sequences but, if you look at them, you may find that there are sequences much larger than others and others will be probably 50 bases long.

Well that’s what I mean with mixed quality and therefore, a sort of filter might be very helpful because you may find more informative to put a threshold and say “hey,  from the pool of ~50K sequences I only want 5K sequences randomly chosen (from the entire FASTA dataset) and every sequence must have a fixed length, say 1000 bases long”.

That’s why I wrote some code in R language, it’s a function titled “shuffleAndExtract” and you can download the example set and the source here.

Here is the description of the function:

handleFastaDatasets.R 

shuffleAndExtract: This function in R is designed to open a fasta file dataset, shuffle the sequences and extract the desired sequences wanted by the user to generate a new dataset of fixed size (number of required sequences) and with the same length for each sequence.


And, after you download the example set and the R source file, you can try to run a very simple example:

NOTE: my implementation depends on the function “seqinr“, if is not installed, you may do this before all the magic begin with a simple install.packages(“seqinr”).

        # run example:
        source(“handleFastaDatasets.R”)
        shuffleAndExtract(“example.fasta”,1000,200)

The arguments of the function are:

     inputFastaFile: name of the input fasta file
     numberOfoutputSeqs: number of desired sequences in the output file
     lengthOutputSeqs: fixed length of every sequence in the output file
     initialPos: Position where the new window sizing will begin, default = 1
     outputFileName: name of the output file, by default will be (e.g):  “inputFastaFile.fasta.output.fasta”

CODE:

################################################################################

shuffleAndExtract <- function(inputFastaFile,
numberOfoutputSeqs,
lengthOutputSeqs,
initialPos=1,
outputFileName = paste(inputFastaFile,"output",
"fasta",sep=".")
){

#
# handleFastaDatasets.R
#
# This function in R is designed to open a fasta file dataset, shuffle the
# sequences and extract the desired sequences wanted by the user to generate
# a new dataset of fixed size (number of required sequences) and with the
# same length for each sequence.
#
# Author: Benjamin Tovar
# Date: 22/JUNE/2012
#
################################################################################

# run example:
# source("handleFastaDatasets.R")
# shuffleAndExtract("example.fasta",1000,200)

# inputFastaFile: name of the input fasta file
# numberOfoutputSeqs: number of desired sequences in the output file
# lengthOutputSeqs: fixed length of every sequence in the output file
# initialPos: Position where the new window sizing will begin, default = 1
# outputFileName: name of the output file, by default will be (e.g):
# "inputFastaFile.fasta.output.fasta"

cat("*** Starting computations |",date(),"****\n")

# Load the seqinr package
require(seqinr)

# Load the large seq and not shuffled dataset
inputSeqs <- read.fasta(inputFastaFile)

cat("\tProcessing for",length(inputSeqs),"sequences |",date(),"\n")

# Extract the length of every sequence and store the results in a vector.
inputSeqsSize <- rep(NA,length(inputSeqs))
for(i in 1:length(inputSeqs)){
inputSeqsSize[i] <- summary(inputSeqs[[i]])$length
}

# Extract the index of the sequences which are longer than threshold
inputSeqsLongSeqIndex <- which(inputSeqsSize > (lengthOutputSeqs+initialPos))

# randomly pick numberOfoutputSeqs indexes that will be used to create
# the output dataset
inputSeqsIndex <- sample(inputSeqsLongSeqIndex,numberOfoutputSeqs,rep=F)

# Store the Fasta header of each selected sequence in a vector
inputSeqsIndexNames <- rep(NA,numberOfoutputSeqs)

# create output object
outputSeqs <- list()
for(i in 1:numberOfoutputSeqs){
# Extract the fasta headers
inputSeqsIndexNames[i] <- attr(inputSeqs[[inputSeqsIndex[i]]],"name")
# Extract the sequence
outputSeqs[[i]] <- inputSeqs[[inputSeqsIndex[i]]][initialPos:((initialPos+lengthOutputSeqs)-1)]
}

# Export the sequences in a new fasta file
write.fasta(outputSeqs,inputSeqsIndexNames,outputFileName)

cat("\n*** DONE |",date(),"****\n")
}


# 22 Jun 2012 | Benjamin Tovar

Benjamin

To leave a comment for the author, please follow the link and comment on their blog: Computational Biology Blog in fasta format.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)