# Example 8.34: lack of robustness of t test with small n

[This article was first published on

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Tim Hesterberg has effectively argued for a larger role for resampling based inference in introductory statistics courses (and statistical practice more generally). While the Central Limit Theorem is a glorious result, and the Student t-test remarkably robust, there are subtleties that Hesterberg, Jones and others have pointed out that are not always well understood.**SAS and R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this entry, we explore the robustness of the Student one sample t-test in terms of coverage probabilities where we look whether the alpha level is split evenly between the two tails.

**R**

We begin by defining a function to carry out the simulation study.

runsim = function(FUNCTION=rnorm, n=20, numsim=100000, label="normal") { missupper = numeric(numsim) misslower = numeric(numsim) for (i in 1:numsim) { x = FUNCTION(n) testval = t.test(x) missupper[i] = testval$conf.int[2] < 0 misslower[i] = testval$conf.int[1] > 0 } cat("n=", n,", number of simulations=", numsim, sep="") cat(", (true data are ", label, ")\n", sep="") cat(round(100*sum(missupper)/numsim, 1), "% of simulations had CI below the true value.\n", sep="") cat(round(100*sum(misslower)/numsim, 1), "% of simulations had CI above the true value.\n", sep="") }

We can define three functions to explore: one where the data are actually normally distributed (so the central limit theorem doesn’t need to be invoked), a symmetric distribution (where the central limit theorem can be used with smaller sample sizes) and a skewed distribution (where the central limit theorem requires very large sample sizes).

reallynormal = function(n) { return(rnorm(n)) } symmetric = function(n) { return(runif(n, -1, 1)) } skewed = function(n) { return(rexp(n, 1) - 1) }

The results (at least to us) are somewhat startling. To get the tail probabilities correct when the underlying distribution is actually skewed, we need a huge sample size:

> runsim(reallynormal, n=20, label="normal") n=20, number of simulations=1e+05, (true data are normal) 2.5% of simulations had CI below the true value. 2.5% of simulations had CI above the true value. > runsim(symmetric, n=20, label="symmetric") n=20, number of simulations=1e+05, (true data are symmetric) 2.6% of simulations had CI below the true value. 2.5% of simulations had CI above the true value. > runsim(skewed, n=20, label="skewed") n=20, number of simulations=1e+05, (true data are skewed) 7.7% of simulations had CI below the true value. 0.6% of simulations had CI above the true value. > runsim(skewed, n=100, label="skewed") n=100, number of simulations=1e+05, (true data are skewed) 4.5% of simulations had CI below the true value. 1.2% of simulations had CI above the true value. > runsim(skewed, n=500, label="skewed") n=500, number of simulations=1e+05, (true data are skewed) 3.4% of simulations had CI below the true value. 1.7% of simulations had CI above the true value.

**SAS**

In the SAS version, we’ll write a macro that first generates all of the data, then uses

`by`processing to perform the t tests, and finally tallies the results and prints them to the output.

%macro simt(n=20, numsim=100000, myrand=normal(0), label=normal); data test; do sim = 1 to &numsim; do i = 1 to &n; x = &myrand; output; end; end; run; ods select none; ods output conflimits=ttestcl; proc ttest data=test; by sim; var x; run; ods select all; data _null_; set ttestcl end=last; retain missupper 0 misslower 0; missupper = missupper + (upperclmean lt 0); misslower = misslower + (lowerclmean gt 0); if last then do; missupper_pct = compress(round(100 * missupper/&numsim,.1)); misslower_pct = compress(round(100 * misslower/&numsim,.1)); file print; put "n=&n, number of simulations = &numsim, true data are &label"; put missupper_pct +(-1) "% of simulations had CI below the true value"; put misslower_pct +(-1) "% of simulations had CI above the true value"; end; run; %mend simt;

Two or three features of this code bear commenting. First, the

`retain`statement and

`end`option allow us to count the number of misses without an additional

`data`step and/or procedure. Next, the

`file print`statement redirects the results from a

`put`statement to the output, rather than the log. While this is not really needed, the output is where we expect the results to appear. Finally, the odd

`+(-1)`in the final two

`put`statements moves the column pointer back one space. This allows the percent sign to immediately follow the number.

The various distributions and sample sizes used in the R code can be called as follows:

%simt(numsim=100000, n = 20, myrand = normal(0)); %simt(numsim=100000, n = 20, myrand = (2 * uniform(0)) - 1, label=symmetric); %simt(numsim=100000, n = 20, myrand = (ranexp(0) - 1), label=skewed); %simt(numsim=100000, n = 100, myrand = (ranexp(0) - 1), label=skewed); %simt(numsim=100000, n = 500, myrand = (ranexp(0) - 1), label=skewed);

where the code to generate the desired data is passed to the macro as a character string. The results of

`%simt(numsim=100000, n = 100, myrand = (ranexp(0) - 1), label=skewed);`

are

n=100, number of simulations = 100000, true data are skewed 4.5% of simulations had CI below the true value 1.2% of simulations had CI above the true value

which agrees nicely with the R results above.

We look forward to Tim’s talk at the JSM.

To

**leave a comment**for the author, please follow the link and comment on their blog:**SAS and R**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.