
Hi everyone, I’m at JSM 2018 right now, so feel free to drop by my session or say hi in the halls! Just give me a tweet!

Back to the meat-and-potatoes of this post. A while ago I was running good old `sample` and comparing its performance to my `lpm2_kdtree` function in the *BalancedSampling* package (Grafström and Lisic, 2016).

In this comparison I noticed that `sample` was in some cases slower than my balanced sampling method when sampling weights were used.

```r
library(BalancedSampling)
library(ggplot2)

# sampling proportion
f <- 0.1

# population size
N_values <- seq(from = 5000, to = 100000, length = 10)

set.seed(99)
pi <- runif(100000)

result <- c()
for (N in N_values) {
  # get sample size
  n <- round(f * N)

  # to make things even we normalize pi so it sums to n
  pi_tmp <- pi[1:N] * n / sum(pi[1:N])

  ## run lpm2-kdtree
  start_time <- proc.time()[3]
  # we probably don't want to balance by pi_tmp, but
  # for checking speed this is fine
  lpm2_kdtree(pi_tmp, pi_tmp)
  end_time <- proc.time()[3] - start_time
  result <- rbind(result, c(f, N, 1, end_time))

  ## run sample
  index <- 1:N
  start_time <- proc.time()[3]
  sample(index, size = n, prob = pi_tmp)
  end_time <- proc.time()[3] - start_time
  result <- rbind(result, c(f, N, 2, end_time))
}

colnames(result) <- c('SamplingFraction', 'PopulationSize', 'Method', 'RunTime')
result <- as.data.frame(result)
result[c('SamplingFraction', 'Method')] <-
  lapply(result[c('SamplingFraction', 'Method')], as.factor)
levels(result$Method) <- c("lpm2_kdtree", "sample")

ggplot(aes(x = PopulationSize, y = RunTime, class = Method), data = result) +
  geom_line(aes(color = Method)) +
  ylab("Run Time (s)") +
  xlab("Population Size")
```

**How could this be?**

Let’s consider a few possible reasons.

**Are you running an old version of R?**

No, this was done with R 3.5.0, but the results hold on 3.5.1.

**Is this because you are a freedom-hating Mac user?**

Nope, I get similar results on Linux.

**Is this due to the `revsort` in `ProbSampleNoReplace` found in *random.c*?**

How amazingly prescient of you! If you track down the code path, the method used for sampling has a `revsort` call right at the beginning (Line 323). Some basic testing seems to identify this as the likely culprit. Therefore, it might be possible to speed up sampling by checking whether the probabilities are already sorted.
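To see this cost at work, here is a rough sketch (my own quick check, not a careful benchmark): weighted `sample` without replacement does work proportional to the population size on every call, the sort plus the per-draw linear scans, so even a tiny sample slows down as the population grows:

```r
# Rough sketch: time a tiny weighted draw as the population grows.
# This does not isolate the sort from the linear scans, but it shows
# the per-call cost growing with N even though the sample size is fixed.
set.seed(99)
for (N in c(1e4, 1e5, 1e6)) {
  p <- runif(N)
  elapsed <- system.time(sample.int(N, size = 10, prob = p))["elapsed"]
  cat(sprintf("N = %7d  elapsed = %.3f s\n", as.integer(N), elapsed))
}
```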

Keep in mind that there are likely good reasons for R to use this method over others, including the alternative method below. One is computational stability: split sampling through pivotal methods tends to work well when sample sizes aren’t extremely small. However, I don’t have any good comparisons of these methods; if anyone knows of one, please let me know, as doing a literature search on this is pretty far down on my todo list.

**That’s great, but what can I do to improve my weighted or PPS sampling?**

Well, I wrote a quick pivotal method implementation (think LPM2 without balancing) that seems to work. It’s available as `split_sample` on my GitHub version of *BalancedSampling*.
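For intuition, here is a minimal sketch of the random pivotal method. This is not my `split_sample` implementation, and the name `pivotal_sketch` is made up for illustration; it just shows the core update rule, in which pairs of units "fight" over probability mass until every inclusion probability is resolved to 0 or 1.

```r
# Minimal sketch of the random pivotal method (illustration only).
pivotal_sketch <- function(pi, eps = 1e-12) {
  alive <- which(pi > eps & pi < 1 - eps)
  while (length(alive) > 1) {
    ij <- sample(alive, 2)          # pick two unresolved units
    i <- ij[1]; j <- ij[2]
    s <- pi[i] + pi[j]
    if (s < 1) {
      # one unit takes all the probability mass, the other drops out
      if (runif(1) < pi[j] / s) { pi[i] <- 0; pi[j] <- s }
      else                      { pi[i] <- s; pi[j] <- 0 }
    } else {
      # one unit is selected, the other keeps the leftover mass
      if (runif(1) < (1 - pi[j]) / (2 - s)) { pi[i] <- 1; pi[j] <- s - 1 }
      else                                  { pi[j] <- 1; pi[i] <- s - 1 }
    }
    alive <- which(pi > eps & pi < 1 - eps)
  }
  which(pi >= 1 - eps)              # indices of selected units
}

# equal inclusion probabilities summing to 10 yield a sample of size 10
set.seed(42)
s <- pivotal_sketch(rep(10 / 100, 100))
length(s)   # 10
```

Each update preserves the expected inclusion probabilities, which is what makes the method valid. Note that this sketch recomputes the alive set every round, so it is quadratic; real implementations track unresolved units so the whole method runs in roughly linear time.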

There is also `UPrandompivotal` in the *sampling* package (Tillé and Matei, 2016).

However, my implementation seems to be considerably faster. We will take a look at this in another blog post.

Finally, keep in mind that `split_sample` probably isn’t as well tested as `UPrandompivotal`, so I’m happy to take any feedback.

```r
library(devtools)
install_github("jlisic/BalancedSampling")

library(BalancedSampling)
library(ggplot2)
library(sampling)

# sampling proportion
f <- 0.1

# population size
N_values <- seq(from = 5000, to = 100000, length = 10)

set.seed(99)
pi <- runif(100000)

result <- c()
for (N in N_values) {
  # get sample size
  n <- round(f * N)

  # to make things even we normalize pi so it sums to n
  pi_tmp <- pi[1:N] * n / sum(pi[1:N])

  ## run lpm2-kdtree
  start_time <- proc.time()[3]
  # we probably don't want to balance by pi_tmp, but
  # for checking speed this is fine
  lpm2_kdtree(pi_tmp, pi_tmp)
  end_time <- proc.time()[3] - start_time
  result <- rbind(result, c(f, N, 1, end_time))

  ## run pivotal method (BalancedSampling)
  start_time <- proc.time()[3]
  split_sample(pi_tmp)
  end_time <- proc.time()[3] - start_time
  result <- rbind(result, c(f, N, 2, end_time))

  ## run pivotal sampling (sampling)
  start_time <- proc.time()[3]
  UPrandompivotal(pi_tmp)
  end_time <- proc.time()[3] - start_time
  result <- rbind(result, c(f, N, 3, end_time))
}

colnames(result) <- c('SamplingFraction', 'PopulationSize', 'Method', 'RunTime')
result <- as.data.frame(result)
result[c('SamplingFraction', 'Method')] <-
  lapply(result[c('SamplingFraction', 'Method')], as.factor)
levels(result$Method) <- c("lpm2_kdtree",
                           "BalancedSampling::split_sample",
                           "sampling::UPrandompivotal")

ggplot(aes(x = PopulationSize, y = RunTime, class = Method), data = result) +
  geom_line(aes(color = Method)) +
  ylab("Run Time (s)") +
  xlab("Population Size")
```

Have a great JSM!

## References

Grafström, A., Lisic, J.: BalancedSampling: balanced and spatially balanced sampling. R package version 1.5.2. https://CRAN.R-project.org/package=BalancedSampling. Accessed August 28, 2018. (2016)

Tillé, Y., Matei, A.: sampling: Survey Sampling. R package version 2.8. https://CRAN.R-project.org/package=sampling. Accessed August 28, 2018. (2016)
