Site icon R-bloggers

unbalanced sampling

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


A question from X validated on sampling from an unknown density f when given both a sample from the density f restricted to a (known) interval A , say, and a sample from f restricted to the complement of A, say. Or at least on producing an estimate of the mass of A under f, p(A)

The problem sounds impossible to solve without an ability to compute the density value at a given value, since  any convex combination αf¹+(1-α)f² would return the same two samples. Assuming continuity of the density f at the boundary point a between A and its complement, a desperate solution for p(A)/1-p(A) is to take the ratio of the density estimates at the value a, which turns out not so poor an approximation if seemingly biased. This was surprising to me as kernel density estimates are notoriously bad at boundary points.

If f(x) can be computed [up to a constant] at an arbitrary x, it is obviously feasible to simulate from f and approximate p(A). But the problem is then moot as a resolution would not even need the initial samples. If exploiting those to construct a single kernel density estimate, this estimate can be used as a proposal in an MCMC algorithm. Surprisingly (?), using instead the empirical cdf as proposal does not work.

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.