Estimate Gene Diversity

May 3, 2011
By

(This article was first published on Fabio Marroni's Blog » R, and kindly contributed to R-bloggers)

I provide here an R function to estimate gene diversity of diallelic sites (e.g. SNPs), given allele frequencies at each segregating site.
The function takes three input parameters:

maf: a numeric value (or vector) representing minor allele frequency at each site. Default is 0.5
nreads: size of each resampling experiment. Default is 10000.
nreplicates: Number of resampling experiments to conduct. Default is 200.

The function outputs the median value of estimated gene diversity, together with upper and lower 95% confidence limits.

I tested the function against results provided by the software package Arlequin and found excellent correlation (r>0.95). However, additional testing is welcome!

Once you obtain gene diversity you can easily calculate nucleotide diversity by summing gene diversity over the segregating sites and dividing by the length of your sequence.

References:
Nei, M. and Li, W.H., (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci U S A, 76(10), 5269-5273.
Tajima, F., (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Adv Genet, 123(3), 585-595.

If you use this function in a published work, please refer to:
Marroni F, Pinosio S, Di Centa E, Jurman I, Boerjan W, Felice N, Cattonaro F, Morgante M., (2011) Large scale detection of rare variants via pooled multiplexed next generation sequencing: towards next generation Ecotilling. Plant J. doi: 10.1111/j.1365-313X.2011.04627.x.

egd<-function(maf=0.5,nreads=10000,nreplicates=200)
{
total<-rep(NA,nreplicates)
  for(aaa in 1:nreplicates)
  {
  pino1<-sample(c("a","b"),replace=T,size=nreads,prob=c(maf,1-maf))
  pino2<-sample(c("a","b"),replace=T,size=nreads,prob=c(maf,1-maf))
  total[aaa]<-sum(pino1!=pino2)/nreads
  }
  gene.div.median<-median(total)
  gene.div.ucl<-quantile(total,probs=0.975)
  gene.div.lcl<-quantile(total,probs=0.025)
  gd<-c(gene.div.median,gene.div.ucl,gene.div.lcl)
  names(gd)<-c("gene.div.median","gene.div.ucl","gene.div.lcl")
  gd
}


To leave a comment for the author, please follow the link and comment on his blog: Fabio Marroni's Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.