Visualizing Sampling Distributions

September 25, 2011
By

(This article was first published on bayesianbiologist » Rstats, and kindly contributed to R-bloggers)

Teacher: “How variable is your estimate of the mean?”

Student: “Uhhh, it’s not. I took a sample and calculated the sample mean. I only have one number.”

Teacher: “Yes, but what is the standard deviation of sample means?”

Student: “What do you mean means, I only have the one friggin number.”

Statisticians have a habit of talking about single events as though they’ve happened (or could happen) over and over again.  This is the basis of the Frequentist paradigm, and I’ve found that it really irks early students of statistics. Questions of the type: “How variable is that estimate?” asked by a statistician translates to “How variable would our collection of estimates be if we were to draw samples of the same size from the population, over and over again?”

As a way to help students get into this way of thinking, I have found simulations to be quite useful.  Here is an R script to demonstrate the sampling distribution of means and how we can reproduce the theoretical standard error of the mean.


## This script plots a histogram of sample means from a known population and compares this
## distribution against the theoretical Standard Error of the Means distribution.

## You can play around with sample size (n) to see how the standard error distribution changes.

rm(list=ls())

var_ <- new.env()
n<-20            ## Sample n individuals at a time
p_mean<-0        ## Population mean
p_sd<-1            ## Population standard deviation
N<-500            ## Number of times the experiment (sampling) is replicated

pdf('SE.pdf')

for(i in 1:N)                                ## do the experiment N times
{
smp<-rnorm(n,p_mean,p_sd)                 ## sample n data points from the population

var_$x_bar<-c(var_$x_bar,mean(smp))         ## keep track of the mean (x_bar) from each sample

hist(var_$x_bar,probability=TRUE,col="red",xlim=c(-4,4),xlab="x / x_bar",main="",ylim=c(0,2.2))  # Plot a histogram of x_bar values
points(mean(smp),0,pch=19,cex=1.5,col='black')
curve(dnorm(x,p_mean,p_sd/sqrt(n)),lwd=3,add=TRUE)

text(2.5,1.75,labels=paste('sd/sqrt(n) = ',round(p_sd/sqrt(n),2),sep=''))
text(2.5,1.5,labels=paste('standard deviation of\nsample means = ',round(sd(var_$x_bar),2),sep='') )

curve(dnorm(x,p_mean,p_sd),main="",ylab="",xlim=c(-4,4),xlab="X",col="blue",lwd=3,add=TRUE) ## Plot the sample

text(2.5,0.5,labels=paste('# of means drawn = ',i,sep=''))
text(2.5,0.35,labels=paste('Sample size (n) = ',n,sep=''))
points(smp,rep(0,n),pch=19,cex=1.5,col='purple')
abline(v= mean(smp),col='purple',lwd=4)

legend("topleft",legend=c('Sample points','Population Distribution','Sample mean','Theoretical SE','Empirical SE'),
lty=c(0,1,1,1,1,1,1),lwd=c(0,3,3,3,3,3,3),pch=c(16,NA,NA,NA,NA,NA,NA),col=c('purple','blue','purple','black','red'))

print(paste(i," of ",N))
}
dev.off()

############################################################################################
############################################################################################

The output of the script is a multi-page pdf which can be flipped through to show the building of a histogram of sample means converging on the theoretical sampling distribution.



To leave a comment for the author, please follow the link and comment on his blog: bayesianbiologist » Rstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , ,

Comments are closed.