After my last post on the Bayesian bootstrap I was asked why the sample from the Dirichlet distribution is used as weights for calculating the mean, and not as probabilities for resampling from the original data set. This mistake is subtle and occurs even in textbooks; see, for example, Chernick (2008), page 122. In this post I want to clarify the issue.

In the example I give correct versions of the standard bootstrap and the Bayesian bootstrap, along with a wrong version of each. The wrong Bayesian bootstrap follows the description in Chernick (2008), page 122 (which is equivalent to the procedure suggested in the comment on my last post).

Here is the code that I used:

    library(gtools)  # provides rdirichlet()

    # Correct Bayesian bootstrap: each Dirichlet(1, ..., 1) draw is used
    # directly as the weights of a weighted mean.
    ok.mean.bb <- function(x, n) {
      apply(rdirichlet(n, rep(1, length(x))), 1, weighted.mean, x = x)
    }

    # Correct standard (frequentist) bootstrap: resample with replacement.
    ok.mean.fb <- function(x, n) {
      replicate(n, mean(sample(x, length(x), TRUE)))
    }

    # Wrong Bayesian bootstrap: the weights are used as sampling probabilities.
    # The gaps between sorted uniforms form one Dirichlet(1, ..., 1) draw.
    wrong.mean.bb <- function(x, n) {
      replicate(n, mean(sample(x, length(x), TRUE,
                               diff(c(0, sort(runif(length(x) - 1)), 1)))))
    }

    # Wrong standard bootstrap: resample from a resample.
    wrong.mean.fb <- function(x, n) {
      replicate(n, mean(sample(sample(x, length(x), TRUE),
                               length(x), TRUE)))
    }

    set.seed(1)
    reps <- 10000
    x <- cars$dist

    par(mar = c(5, 4, 1, 2))
    plot(density(ok.mean.fb(x, reps)), main = "", xlab = "Bootstrap mean")
    lines(density(ok.mean.bb(x, reps)), col = "red")
    lines(density(wrong.mean.fb(x, reps)), col = "blue")
    lines(density(wrong.mean.bb(x, reps)), col = "green")

The figure it produces is:

The black curve is the standard bootstrap density and the red curve is the Bayesian bootstrap density. The blue and green curves are generated by the wrong bootstrapping procedures (frequentist and Bayesian, respectively).
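The visual comparison can also be put into numbers. The following is a minimal sketch, not part of the original analysis, that compares the standard deviations of the four sets of bootstrap means. It uses base R only: the helper `rdir1` is a hypothetical name introduced here, and it draws Dirichlet(1, ..., 1) weights as normalised exponential variates instead of calling `gtools::rdirichlet`. The variable names (`sd.check` and so on) are likewise illustrative.

```r
# Numerical check of the dispersion of the four procedures (base R only).
set.seed(1)
x <- cars$dist
n <- length(x)
reps <- 10000

# One Dirichlet(1, ..., 1) draw via normalised Exp(1) variates
# (assumption: swapped in for gtools::rdirichlet to avoid the dependency).
rdir1 <- function(k) {
  g <- rexp(k)
  g / sum(g)
}

ok.bb.means    <- replicate(reps, sum(rdir1(n) * x))
ok.fb.means    <- replicate(reps, mean(sample(x, n, TRUE)))
wrong.bb.means <- replicate(reps, mean(sample(x, n, TRUE, rdir1(n))))
wrong.fb.means <- replicate(reps, mean(sample(sample(x, n, TRUE), n, TRUE)))

sd.check <- c(ok.fb    = sd(ok.fb.means),
              ok.bb    = sd(ok.bb.means),
              wrong.fb = sd(wrong.fb.means),
              wrong.bb = sd(wrong.bb.means))
print(round(sd.check, 2))

# Ratio of wrong to correct standard deviation for each flavour
print(sd.check[["wrong.bb"]] / sd.check[["ok.bb"]])
print(sd.check[["wrong.fb"]] / sd.check[["ok.fb"]])
```

If the wrong procedures roughly double the variance, both ratios should come out close to the square root of 2, while the two correct procedures should have nearly identical standard deviations.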

We can see that the wrong Bayesian bootstrap has a counterpart in the standard bootstrap approach: repeating the sampling twice (sampling from a sample). Both clearly increase the dispersion of the results.
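A short variance calculation suggests where the extra dispersion comes from. The notation below is introduced here for illustration and is not from the original post: $x_1,\dots,x_n$ is the data, $\bar{x}$ its mean, $\hat\sigma^2$ its variance with divisor $n$, $w$ a Dirichlet(1, ..., 1) weight vector, and $J_1,\dots,J_n$ the indices drawn when $w$ is used as sampling probabilities. The wrong procedure stacks ordinary sampling noise on top of the randomness of the weights, and the law of total variance adds the two.

```latex
\begin{align*}
\theta^{*}_{\mathrm{ok}} &= \sum_{i=1}^{n} w_i x_i,
  \qquad w \sim \mathrm{Dirichlet}(1,\dots,1), \\
\operatorname{Var}\bigl(\theta^{*}_{\mathrm{ok}}\bigr)
  &= \frac{\hat\sigma^{2}}{n+1},
  \qquad \hat\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}, \\[4pt]
\theta^{*}_{\mathrm{wrong}} &= \frac{1}{n}\sum_{k=1}^{n} x_{J_k},
  \qquad P(J_k = i \mid w) = w_i, \\
\operatorname{Var}\bigl(\theta^{*}_{\mathrm{wrong}}\bigr)
  &= \operatorname{E}\bigl[\operatorname{Var}(\theta^{*}_{\mathrm{wrong}} \mid w)\bigr]
   + \operatorname{Var}\bigl(\operatorname{E}[\theta^{*}_{\mathrm{wrong}} \mid w]\bigr) \\
  &= \frac{\hat\sigma^{2}}{n+1} + \frac{\hat\sigma^{2}}{n+1}
   = \frac{2\hat\sigma^{2}}{n+1}.
\end{align*}
```

The same argument applied to sampling from a sample gives a variance of $\hat\sigma^{2}(2n-1)/n^{2}$ against $\hat\sigma^{2}/n$ for the standard bootstrap, so in both cases the wrong procedure roughly doubles the variance of the bootstrap mean.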
