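This function relies on the bit-level coding scheme that Emmanuel Paradis developed for storing DNA sequences in R. Under this scheme the unambiguous bases A, G, C and T have the numeric values 136, 72, 40 and 24 respectively. The function checks each site (column) of the alignment for values other than these four and returns a logical vector with one element per site: TRUE where at least one sequence has something other than A, G, C or T at that site, FALSE otherwise. Because the test is simply "not one of the four values", gaps and missing data are flagged along with the ambiguity codes.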
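As a rough illustration of why those four values are special, the sketch below uses only base R. It assumes the layout of the coding scheme in which the bit with value 8 is set only when the base is known exactly; the codes for R (192), N (240) and the gap (4) are taken from that same scheme and are not used elsewhere in this post.

```r
# Codes from the bit-level scheme: four known bases, two ambiguity codes, a gap
codes <- c(A = 136L, G = 72L, C = 40L, T = 24L, R = 192L, N = 240L, gap = 4L)

# The bit with value 8 is set only for the unambiguous bases
known <- setNames(bitwAnd(codes, 8L) == 8L, names(codes))

known
```

Only A, G, C and T come back TRUE, which is the same split that comparing against 136, 72, 40 and 24 produces.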

library(ape)

data(woodmouse)

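is.ambig <- function(x){
  # Work on the alignment as a matrix: rows are sequences, columns are sites
  x <- as.matrix(x)
  # Bit-level codes of the unambiguous bases A, G, C and T
  bases <- c(136, 72, 40, 24)
  # For each site, count the sequences whose code is not one of the four
  ambig <- apply(x, 2, FUN = function(site) sum(!as.numeric(site) %in% bases))
  # A site is flagged if at least one sequence is ambiguous there
  ambig > 0
}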

is.ambig(woodmouse)
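To see what the per-site count looks like before it is reduced to TRUE or FALSE, here is a small sketch on a made-up alignment. The object `toy` and its two sequences are hypothetical and exist only for illustration; the comparison against the four codes mirrors the one in is.ambig.

```r
library(ape)

# Hypothetical two-sequence alignment; "n" and "r" are ambiguity codes
toy <- as.DNAbin(rbind(s1 = c("a", "c", "g", "t"),
                       s2 = c("a", "n", "g", "r")))

# Flag every cell whose code is not A, G, C or T, then restore the matrix shape
not.base <- !(as.integer(unclass(toy)) %in% c(136L, 72L, 40L, 24L))
dim(not.base) <- dim(toy)

# Number of sequences with a non-A/G/C/T code at each site
n.ambig <- colSums(not.base)

# Positions of the flagged sites
ambig.sites <- which(n.ambig > 0)
```

Wrapping the result of is.ambig in which() gives the same kind of position index for a real alignment.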

The second function is an implementation of Tajima's K, published as equation A3 in Tajima (1983).

This function calculates the mean number of sites that differ between any two sequences in the alignment. By default, it returns the result as a proportion of the length of the alignment. Setting prop = FALSE returns the result as the actual number of sites.

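tajima.K <- function(x, prop = TRUE){
  # Make sure the alignment is a matrix so that ncol() gives its length
  x <- as.matrix(x)
  # Mean number of differing sites over all pairs of sequences
  res <- mean(dist.dna(x, model = "N"))
  # Optionally express the result as a proportion of the alignment length
  if(prop) res <- res/ncol(x)
  res
}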

tajima.K(woodmouse)
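The calculation can be checked by hand on a small example. The sketch below uses base R only and a made-up alignment (`aln`, three sequences of five sites with no gaps or ambiguity codes), so it illustrates the definition rather than reproducing how dist.dna handles real data.

```r
# Hypothetical alignment: three sequences, five sites
aln <- rbind(s1 = c("a", "c", "g", "t", "a"),
             s2 = c("a", "c", "g", "t", "t"),
             s3 = c("a", "t", "g", "c", "t"))

# All pairs of sequences, one pair per column
pairs <- combn(nrow(aln), 2)

# Number of differing sites for each pair
diffs <- apply(pairs, 2, function(p) sum(aln[p[1], ] != aln[p[2], ]))

# Mean number of pairwise differences, as a count and as a proportion
K <- mean(diffs)
K.prop <- K / ncol(aln)
```

The three pairs differ at 1, 3 and 2 sites, so K is 2 and the proportion is 0.4.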

References:

Tajima F. 1983. Evolutionary relationship of DNA sequences in finite populations. *Genetics* 105: 437-460.

