preserving frequencies without resampling

March 8, 2016

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

An interesting question came up on X validated a few days ago: given a probability vector p=(p¹,…,p⁷), is there a way to pick 5 values in {1,…,7} without replacement and still preserve the probability distribution in the resulting sample? In other words, is there a sampling without replacement strategy that leads to

\mathbb{E}[\mathbb{I}_i(X^1)+\cdots+\mathbb{I}_i(X^5)]=5p^i

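for i=1,…,7? Unless those probabilities p¹,…,p⁷ are close enough to 1/7, this is simply impossible, as 5 values out of 7 have to be sampled, which imposes some minimal frequency on some of the values. Equivalently, since each value can appear at most once in the sample, none of the expected counts 5pⁱ can exceed 1.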
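
Whether a given vector passes this constraint is easy to test numerically. Here is a minimal sketch in R, checking only the necessary condition that no expected count exceeds 1; is_feasible is a hypothetical helper name and both probability vectors are made-up illustrations:

```r
# Necessary condition for preserving frequencies when drawing n values
# without replacement: each expected count n * p[i] must be at most 1.
# `is_feasible` is a hypothetical helper, not part of any package.
is_feasible <- function(p, n) {
  stopifnot(all(p >= 0), isTRUE(all.equal(sum(p), 1)))
  all(n * p <= 1)
}

p_flat <- rep(1 / 7, 7)      # uniform target: 5/7 <= 1 for every value
p_skew <- (1:7) / sum(1:7)   # skewed target: largest entry is 7/28 = 0.25

is_feasible(p_flat, 5)       # TRUE
is_feasible(p_skew, 5)       # FALSE, since 5 * 0.25 > 1
```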

Hence a generic question:

given a vector p of k probabilities (summing up to 1), what is the constraint on this vector and on the number n of elements of the population one can draw without replacement in order to achieve an expected frequency of np on the resulting vector? That is,

\mathbb{E}[\mathbb{I}_i(X_1)+\ldots+\mathbb{I}_i(X_n)]=np_i

In the cases n=2,3, I managed to find and solve the system of equations satisfied by the sampling probability vector q, but I wondered whether there exists a less pedestrian resolution. I then showed the problem to Robin Ryder while at CIRM for the Bayesian week and he quickly pointed me to the answer to this question in Brewer and Hanif's book Sampling with Unequal Probabilities, which does not draw from a fixed probability vector but instead modifies the remaining probabilities after each draw, as in the following R code:

 
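N=10; n=3 #example sizes (assumed), chosen so that n*max(kuh)<1
kuh=(1:N)/sum(1:N) #example of target
#first draw, weights kuh*(1-kuh)/(1-n*kuh)
smpl=sample(1:N,1,prob=kuh*(1-kuh)/(1-n*kuh))
for (i in 2:n) #later draws: n-i+1 replaces n, drawn values are excluded
  smpl=c(smpl,sample((1:N)[-smpl],1,
    prob=(kuh*(1-kuh)/(1-(n-i+1)*kuh))[-smpl]))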
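
For n=2 the scheme above reduces to a first draw with weights p(1-p)/(1-2p) followed by a second draw with weights p over the remaining values, and its inclusion probabilities can be computed exactly by enumeration rather than by simulation. The sketch below does this for an illustrative population of size N=5 (an assumed value), writing p for the target called kuh in the code above; it says nothing about larger n.

```r
# Exact inclusion probabilities of the two-draw scheme (n = 2), by enumeration.
N <- 5                             # population size (illustrative value)
p <- (1:N) / sum(1:N)              # target probabilities, same shape as in the post
stopifnot(max(p) < 1 / 2)          # the first-draw weights require 2 * p < 1

w1 <- p * (1 - p) / (1 - 2 * p)    # unnormalised first-draw weights
first <- w1 / sum(w1)              # P(first draw = j)

incl <- numeric(N)                 # incl[i] will hold P(i is in the sample)
for (j in 1:N) {
  second <- p
  second[j] <- 0                   # j can no longer be drawn
  second <- second / sum(second)   # P(second draw = i | first draw = j)
  incl[j] <- incl[j] + first[j]    # j enters through the first draw
  incl <- incl + first[j] * second # other values enter through the second
}
print(rbind(target = 2 * p, exact = incl))
```

The two printed rows can be compared directly; changing N or p only requires that 2*max(p) stays below 1.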

Hence the question is not completely solved, since I am still uncertain whether or not there exists a sampling without replacement scheme with a fixed probability vector that achieves the target probability! But at least this shows that a solution can exist only when all probabilities are less than 1/n, n being the number of draws…
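
To make the fixed-vector version concrete for n=2: assuming the two draws are made successively from a fixed vector q, with the second draw renormalised over the remaining values (which is how R's sample(...,prob=q) applies its weights), value i is included either at the first draw, with probability q_i, or at the second draw after some other j came first. Under that assumption, and not necessarily in the form used above, the frequency constraint becomes the system

```latex
q_i\left(1+\sum_{j\ne i}\frac{q_j}{1-q_j}\right)=2p_i
```

for i=1,…,k, to be solved in q.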

