(This article was first published on **SAS and R**, and kindly contributed to R-bloggers)

Tim Hesterberg has effectively argued for a larger role for resampling-based inference in introductory statistics courses (and in statistical practice more generally). While the Central Limit Theorem is a glorious result, and the Student t-test remarkably robust, there are subtleties that Hesterberg, Jones, and others have pointed out that are not always well understood. In this entry, we explore the robustness of the one-sample Student t-test in terms of coverage probabilities, checking whether the alpha level is split evenly between the two tails.

**R**

We begin by defining a function to carry out the simulation study.

```r
runsim = function(FUNCTION=rnorm, n=20, numsim=100000,
                  label="normal") {
  missupper = numeric(numsim)
  misslower = numeric(numsim)
  for (i in 1:numsim) {
    x = FUNCTION(n)
    testval = t.test(x)
    missupper[i] = testval$conf.int[2] < 0
    misslower[i] = testval$conf.int[1] > 0
  }
  cat("n=", n, ", number of simulations=", numsim, sep="")
  cat(", (true data are ", label, ")\n", sep="")
  cat(round(100*sum(missupper)/numsim, 1),
      "% of simulations had CI below the true value.\n", sep="")
  cat(round(100*sum(misslower)/numsim, 1),
      "% of simulations had CI above the true value.\n", sep="")
}
```
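As an aside, the same simulation can be sketched more compactly with `replicate()`; this variant (our addition, not part of the original post) returns the two tail-miss proportions instead of printing them, which is convenient for further processing:

```r
# Compact variant of runsim(): replicate() collects the confidence
# limits into a 2 x numsim matrix, and the tail-miss proportions are
# just column-wise comparisons against the true mean of 0.
runsim2 <- function(FUNCTION = rnorm, n = 20, numsim = 10000) {
  cis <- replicate(numsim, t.test(FUNCTION(n))$conf.int)
  c(below = mean(cis[2, ] < 0),   # upper limit below 0: CI entirely below truth
    above = mean(cis[1, ] > 0))   # lower limit above 0: CI entirely above truth
}
```

For normal data both proportions should land near the nominal 0.025.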

We can define three data-generating functions to explore: one where the data are actually normally distributed (so the Central Limit Theorem doesn't need to be invoked), a symmetric non-normal distribution (where the Central Limit Theorem kicks in at smaller sample sizes), and a skewed distribution (where the Central Limit Theorem requires very large sample sizes).

```r
reallynormal = function(n) { return(rnorm(n)) }
symmetric = function(n) { return(runif(n, -1, 1)) }
skewed = function(n) { return(rexp(n, 1) - 1) }
```
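All three generators have population mean 0 by construction (the uniform by symmetry, the exponential by subtracting its mean of 1), so a correct 95% CI should miss 0 only 5% of the time. A quick sanity check, using a large sample size of our own choosing:

```r
# Definitions repeated here so the check is self-contained.
reallynormal <- function(n) rnorm(n)
symmetric    <- function(n) runif(n, -1, 1)
skewed       <- function(n) rexp(n, 1) - 1

set.seed(123)
# Each large-sample mean should be close to the population mean of 0.
means <- sapply(list(normal = reallynormal, symmetric = symmetric,
                     skewed = skewed), function(f) mean(f(1e6)))
round(means, 3)
```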

The results (at least to us) are somewhat startling. To get the tail probabilities correct when the underlying distribution is actually skewed, we need a huge sample size:

```r
> runsim(reallynormal, n=20, label="normal")
n=20, number of simulations=1e+05, (true data are normal)
2.5% of simulations had CI below the true value.
2.5% of simulations had CI above the true value.

> runsim(symmetric, n=20, label="symmetric")
n=20, number of simulations=1e+05, (true data are symmetric)
2.6% of simulations had CI below the true value.
2.5% of simulations had CI above the true value.

> runsim(skewed, n=20, label="skewed")
n=20, number of simulations=1e+05, (true data are skewed)
7.7% of simulations had CI below the true value.
0.6% of simulations had CI above the true value.

> runsim(skewed, n=100, label="skewed")
n=100, number of simulations=1e+05, (true data are skewed)
4.5% of simulations had CI below the true value.
1.2% of simulations had CI above the true value.

> runsim(skewed, n=500, label="skewed")
n=500, number of simulations=1e+05, (true data are skewed)
3.4% of simulations had CI below the true value.
1.7% of simulations had CI above the true value.
```
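One way to see why the misses are so lopsided: for skewed data, the sample mean and sample standard deviation are positively correlated. A sample that happens to undersample the long right tail has both a low mean and a small SD, so its interval is narrow and sits entirely below the true mean; the reverse case is cushioned by a wide interval. A short sketch of this correlation (our addition) for exponential samples:

```r
set.seed(1)
# For exponential data, samples missing the long right tail have both a
# small mean and a small SD, so mean and SD are strongly correlated,
# unlike the normal case, where they are independent.
stats <- replicate(10000, {
  x <- rexp(20, 1) - 1
  c(mean(x), sd(x))
})
r <- cor(stats[1, ], stats[2, ])
r  # clearly positive for skewed data
```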

**SAS**

In the SAS version, we'll write a macro that first generates all of the data, then uses `by` processing to perform the t tests, and finally tallies the results and prints them to the output.

```sas
%macro simt(n=20, numsim=100000, myrand=normal(0), label=normal);
data test;
  do sim = 1 to &numsim;
    do i = 1 to &n;
      x = &myrand;
      output;
    end;
  end;
run;

ods select none;
ods output conflimits=ttestcl;
proc ttest data=test;
  by sim;
  var x;
run;
ods select all;

data _null_;
  set ttestcl end=last;
  retain missupper 0 misslower 0;
  missupper = missupper + (upperclmean lt 0);
  misslower = misslower + (lowerclmean gt 0);
  if last then do;
    missupper_pct = compress(round(100 * missupper/&numsim, .1));
    misslower_pct = compress(round(100 * misslower/&numsim, .1));
    file print;
    put "n=&n, number of simulations = &numsim, true data are &label";
    put missupper_pct +(-1) "% of simulations had CI below the true value";
    put misslower_pct +(-1) "% of simulations had CI above the true value";
  end;
run;
%mend simt;
```

A few features of this code bear comment. First, the `retain` statement and the `end=` option allow us to count the number of misses without an additional `data` step and/or procedure. Next, the `file print` statement redirects the results from a `put` statement to the output, rather than the log. While this is not strictly needed, the output is where we expect the results to appear. Finally, the odd-looking `+(-1)` in the final two `put` statements moves the column pointer back one space, which allows the percent sign to immediately follow the number.

The various distributions and sample sizes used in the R code can be called as follows:

```sas
%simt(numsim=100000, n = 20, myrand = normal(0));
%simt(numsim=100000, n = 20, myrand = (2 * uniform(0)) - 1, label=symmetric);
%simt(numsim=100000, n = 20, myrand = (ranexp(0) - 1), label=skewed);
%simt(numsim=100000, n = 100, myrand = (ranexp(0) - 1), label=skewed);
%simt(numsim=100000, n = 500, myrand = (ranexp(0) - 1), label=skewed);
```

where the code to generate the desired data is passed to the macro as a character string. The results of

`%simt(numsim=100000, n = 100, myrand = (ranexp(0) - 1), label=skewed);`

are

```
n=100, number of simulations = 100000, true data are skewed
4.5% of simulations had CI below the true value
1.2% of simulations had CI above the true value
```

which agrees nicely with the R results above.

We look forward to Tim's talk at the JSM.
