Count different positions between two strings of equal length

November 26, 2011
By

(This article was first published on Fabio Marroni's Blog » R, and kindly contributed to R-bloggers)

This is another pretty simple function, written to help me solve the simplest representation of a trivial but tedious task. Most biologist are probably familiar with this task. How many nucleotide differences exist between two given sequences? I only faced the easiest part of the problem, i.e. I do not perform alignment, I just assume that the sequence have a perfect 1 to 1 correspndence, i.e. position 1 of sequence 1 is position 1 of sequence 2, and so on.
The problem is basically to compare two strings of equal length and count the number of positions in which they differ. To facilitate analysis of DNA sequences, I gave the opportunity to ignore case (so that an “a” is equivalent to “A”) and to ignore positions carrying special characters, such as “N” or “?”.

Input:
a: first DNA sequence (or string)
b: second DNA sequence (or string)
exclude: character (or vector of characters) to be excluded from comparison
ignore.case: logical. If TRUE consider “a” equal to “A”. If FALSE consider “a” different from “A”.

Output:
differences: Number of differences between the two DNA sequences (or strings)

As usual, feedback welcome!

Thanks to Larry for suggesting a nice improvement! It is now incorporated in the function.

string.diff.ex<-function(a="ATTCGAN",b="attTGTT",exclude=c("n","N","?"),ignore.case=TRUE)
{
if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.")
if(ignore.case==TRUE)
{
a<-toupper(a)
b<-toupper(b)
}
diff.a<-unlist(strsplit(a,split=""))
diff.b<-unlist(strsplit(b,split=""))
diff.d<-rbind(diff.a,diff.b)
for(ex.loop in 1:length(exclude))
{
diff.d<-diff.d[,!(diff.d[1,]==exclude[ex.loop]|diff.d[2,]==exclude[ex.loop])]
}
differences<-sum(diff.d[1,]!=diff.d[2,])
return(differences)
}


To leave a comment for the author, please follow the link and comment on his blog: Fabio Marroni's Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...