Two R functions for working with DNA alignments

March 28, 2011

(This article was first published on The Praise of Insects, and kindly contributed to R-bloggers)

Recently I wrote a couple of small functions as a result of work done by myself and others in my lab group. The first is a function that determines what sites in a sequence alignment are ambiguous (i.e. not A, G, C or T).


is.ambig <- function(x){
   x <- as.matrix(x)
   bases <- c(136, 72, 40, 24)
   ambig <- apply(x, 2, FUN=function(x) sum(as.numeric(!as.numeric(x) %in% bases)))
   ambig > 0


This function utilises the bit-level coding scheme that Emmanuel Paradis developed for encoding sequences in R. The unambiguous bases A, G, C and T have the numeric values 136, 72, 40 and 24 respectively. This function figures out which sites don’t have these values and returns a vector of TRUEs and FALSEs, TRUEs being ambiguous bases.

The second function is an implementation of Tajima’s K, published as equation A3 in Tajima 1983

tajima.K <- function(x, prop = TRUE){
   res <- mean(dist.dna(x, model="N"))
   if(prop) res <- res/dim(x)[2]


This function calculates the mean number of sites that are different between any two sequences. As a default, it returns the result as a proportion of the length of the alignment. Setting prop = FALSE will return the result as the actual number of sites.

Tajmia F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437-460.

To leave a comment for the author, please follow the link and comment on their blog: The Praise of Insects. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training



CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)