Visualizing Sampling Distributions

Posted on September 25, 2011 by bayesianbiologist in R bloggers | 0 Comments

[This article was first published on bayesianbiologist » Rstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Teacher: “How variable is your estimate of the mean?”

Student: “Uhhh, it’s not. I took a sample and calculated the sample mean. I only have one number.”

Teacher: “Yes, but what is the standard deviation of sample means?”

Student: “What do you mean means, I only have the one friggin number.”

Statisticians have a habit of talking about single events as though they’ve happened (or could happen) over and over again. This is the basis of the Frequentist paradigm, and I’ve found that it really irks early students of statistics. Questions of the type: “How variable is that estimate?” asked by a statistician translates to “How variable would our collection of estimates be if we were to draw samples of the same size from the population, over and over again?”

As a way to help students get into this way of thinking, I have found simulations to be quite useful. Here is an R script to demonstrate the sampling distribution of means and how we can reproduce the theoretical standard error of the mean.


## This script plots a histogram of sample means from a known population and compares this
## distribution against the theoretical Standard Error of the Means distribution.

## You can play around with sample size (n) to see how the standard error distribution changes.

rm(list=ls())

var_ <- new.env()
n<-20            ## Sample n individuals at a time
p_mean<-0        ## Population mean
p_sd<-1            ## Population standard deviation
N<-500            ## Number of times the experiment (sampling) is replicated

pdf('SE.pdf')

for(i in 1:N)                                ## do the experiment N times
{
smp<-rnorm(n,p_mean,p_sd)                 ## sample n data points from the population

var_$x_bar<-c(var_$x_bar,mean(smp))         ## keep track of the mean (x_bar) from each sample

hist(var_$x_bar,probability=TRUE,col="red",xlim=c(-4,4),xlab="x / x_bar",main="",ylim=c(0,2.2))  # Plot a histogram of x_bar values
points(mean(smp),0,pch=19,cex=1.5,col='black')
curve(dnorm(x,p_mean,p_sd/sqrt(n)),lwd=3,add=TRUE)

text(2.5,1.75,labels=paste('sd/sqrt(n) = ',round(p_sd/sqrt(n),2),sep=''))
text(2.5,1.5,labels=paste('standard deviation of\nsample means = ',round(sd(var_$x_bar),2),sep='') )

curve(dnorm(x,p_mean,p_sd),main="",ylab="",xlim=c(-4,4),xlab="X",col="blue",lwd=3,add=TRUE) ## Plot the sample

text(2.5,0.5,labels=paste('# of means drawn = ',i,sep=''))
text(2.5,0.35,labels=paste('Sample size (n) = ',n,sep=''))
points(smp,rep(0,n),pch=19,cex=1.5,col='purple')
abline(v= mean(smp),col='purple',lwd=4)

legend("topleft",legend=c('Sample points','Population Distribution','Sample mean','Theoretical SE','Empirical SE'),
lty=c(0,1,1,1,1,1,1),lwd=c(0,3,3,3,3,3,3),pch=c(16,NA,NA,NA,NA,NA,NA),col=c('purple','blue','purple','black','red'))

print(paste(i," of ",N))
}
dev.off()

############################################################################################
############################################################################################

The output of the script is a multi-page pdf which can be flipped through to show the building of a histogram of sample means converging on the theoretical sampling distribution.