Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Joseph Rickert

One of the key ideas in topological data analysis is to consider a data set to be a sample from a manifold in some high dimensional topological space and then to use the tools of algebraic topology to reconstruct the manifold. It turns out that the converse problem of taking a random sample from a given topological manifold also has some very useful applications in statistics. In their 2012 paper, Sampling From A Manifold, the mathematical statisticians Persi Diaconis, Susan Holmes and Mehrdad Shahshahani develop a general approach to this converse problem using fundamental ideas from geometric measure theory. They show how this topological / geometrical approach can be used not only to construct algorithms for generating test data for topological statistics, but also for testing the goodness of fit of sufficient statistics for exponential family distributions and for other statistical problems that may be conceptualized as sampling from a manifold.

The first example in the paper shows how to correctly sample from a torus:

M = { [(R+rcos(q))cos(y), (R+rcos(q))sin(y),sin(q)]}
where q and y are both in [0,2p)  and R > r > 0

The authors point out that naively drawing q and y from uniformly distributed random variables will lead to sampled points that are denser than they should be in regions of higher curvature such as the inside of the torus. The correct way to sample is to use the theory they develop to derive the joint density, g(q, y), of q and y on the manifold. The theorem that enables this is an extension of the change of variables formula from calculus. Readers who have used Jacobians to work through tricky problems involving the sums and products of random variables in a probability course will recognize what is going on.

As it turns out, g(q, y) factors into into:

g1(q) = (1/2p)(1 + (r/R)cos(q))
where q e [0,2p) and g2(y) is uniform on [0,2p).

To actually sample from g1(q) the authors provide the R code for a rejection sampling algorithm.

reject=function(n=100,r=0.5,R=1){
#Rejection sampler
xvec=runif(n,0,2*pi)
yvec=runif(n,0,1/pi)
fx=(1+(r/R)*cos(xvec))/(2*pi)
return(xvec[yvec<fx]) }

The following plot shows the histogram and density curve for 10,000 draws from g1(q).