Recently I wrote a couple of small functions as a result of work done by myself and others in my lab group. The first is a function that determines what sites in a sequence alignment are ambiguous (i.e. not A, G, C or T).

require(ape)

data(woodmouse)
is.ambig <- function(x){

x <- as.matrix(x)

bases <- c(136, 72, 40, 24)

ambig <- apply(x, 2, FUN=function(x) sum(as.numeric(!as.numeric(x) %in% bases)))

ambig > 0

}

is.ambig(woodmouse)

This function utilises the bit-level coding scheme that Emmanuel Paradis developed for encoding sequences in R. The unambiguous bases A, G, C and T have the numeric values 136, 72, 40 and 24 respectively. This function figures out which sites don’t have these values and returns a vector of TRUEs and FALSEs, TRUEs being ambiguous bases.

The second function is an implementation of Tajima’s K, published as equation A3 in Tajima 1983

tajima.K <- function(x, prop = TRUE){

res <- mean(dist.dna(x, model="N"))

if(prop) res <- res/dim(x)[2]

res

}
tajima.K(woodmouse)

This function calculates the mean number of sites that are different between any two sequences. As a default, it returns the result as a proportion of the length of the alignment. Setting prop = FALSE will return the result as the actual number of sites.

References:

Tajmia F. 1983. Evolutionary relationship of DNA sequences in finite populations. *Genetics* 105: 437-460.

*Related*

To

**leave a comment** for the author, please follow the link and comment on their blog:

** The Praise of Insects**.

R-bloggers.com offers

**daily e-mail updates** about

R news and

tutorials on topics such as:

Data science,

Big Data, R jobs, visualization (

ggplot2,

Boxplots,

maps,

animation), programming (

RStudio,

Sweave,

LaTeX,

SQL,

Eclipse,

git,

hadoop,

Web Scraping) statistics (

regression,

PCA,

time series,

trading) and more...

If you got this far, why not

__subscribe for updates__ from the site? Choose your flavor:

e-mail,

twitter,

RSS, or

facebook...