seqX <- sample(c("A","C","G","T"),10000,rep=TRUE,prob=c(0.4,0.1,0.1,0.4))
As you see, the alphabet is the four letter alphabet of the DNA (so, if desired, you can replace that alphabet with any other alphabet that you might need, like the amino acids 20 letter code), the next parameter is the length of the desired output sequence. rep=TRUE means that each letter of the alphabet could be repeated and finally, I think that the most useful parameter of the function sample is the prob parameter because it allows you to select the multinomial model (the proportion or “bias” of each base). For example our simulated sequence is 80% AT rich and 20% GC rich given that for the “A” base we got a probability of 0.4, for the “C” base 0.2 and so on.
Now, to check it out that our artificial sequence follows that bias, a simple plot will tell us more than thousand words:
Lets extract the proportion of each base along the sequence using the table and the length function:
freqX <- table(seqX)/length(seqX)
Now, lets plot it doing:
barplot(freqX,col=1:4,main="Compositional bias of seqX",xlab="Base",ylab="Base proportion")
And finally we got this awesome plot.