Generate artificial DNA or protein sequences in R in a single line of code.

April 11, 2012

(This article was first published on Computational Biology Blog in fasta format, and kindly contributed to R-bloggers)

To generate an artificial DNA sequence of  “n” bases long with a fixed composition bias in just one line of code, just open your R prompt and type:

seqX <- sample(c("A","C","G","T"),10000,rep=TRUE,prob=c(0.4,0.1,0.1,0.4))

As you see, the alphabet is the four letter alphabet of the DNA (so, if desired, you can replace that alphabet with any other alphabet that you might need, like the amino acids 20 letter code), the next parameter is the length of the desired output sequence. rep=TRUE means that each letter of the alphabet could be repeated and finally, I think that the most useful parameter of the function sample is the prob parameter because it allows you to select the multinomial model (the proportion or “bias” of each base). For example our simulated sequence is 80% AT rich and 20% GC rich given that for the “A” base we got a probability of 0.4, for the “C” base 0.2 and so on.

Now, to check it out that our artificial sequence follows that bias, a simple plot will tell us more than thousand words:

Lets extract the proportion of each base along the sequence using the table and the length function:

freqX <- table(seqX)/length(seqX)

Now, lets plot it doing:

barplot(freqX,col=1:4,main="Compositional bias of seqX",xlab="Base",ylab="Base proportion")

And finally we got this awesome plot.

So, the barplot shows that effectively each base is well represented by the multinomial model and our artificial sequence is loyal to it. 

To leave a comment for the author, please follow the link and comment on their blog: Computational Biology Blog in fasta format. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)