Example 9.12: simpler ways to carry out permutation tests

October 31, 2011
By

(This article was first published on SAS and R, and kindly contributed to R-bloggers)



In a previous entry, as well as section 2.4.3 of the book, we describe how to carry out a 2 group permutation test in SAS as well as with the coin package in R. We demonstrate with comparing the ages of the female and male subjects in the HELP study.

In this entry, we revisit the permutation test using other functions.

R

We describe a simpler interface to carry out and visualize permutation tests using the functions from the mosaic package.

We begin by replicating our previous example (section 2.6.4, p. 87).

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(coin)
numsim = 1000
oneway_test(age ~ as.factor(female),
distribution=approximate(B=numsim-1), data=ds)

which yields the following output:

Approximative 2-Sample Permutation Test

data: age by as.factor(female) (0, 1)
Z = -0.9194, p-value = 0.3894
alternative hypothesis: true mu is not equal to 0

We conclude that there is minimal evidence to contradict the null hypothesis that the two groups have the same ages back in their respective populations.

Now we demonstrate another way to run this test in a more general form, using the mosaic package's do() function combined with the * operator to repeatedly carry out fitting a linear model with a parameter for female which will calculate our test statistic (difference in means between females and males) repeatedly after shuffling the group indicators. The shuffle() function permutes the group labels, and then the summary statistic is calculated.

> library(mosaic)
> obsdiff = with(ds, mean(age[female==1]) - mean(age[female==0]))
> obsdiff
mean
0.7841284
> summary(age ~ female, data=ds, fun=mean)
age N=453

+-------+---+---+--------+
| | |N |mean |
+-------+---+---+--------+
|female |No |346|35.46821|
| |Yes|107|36.25234|
+-------+---+---+--------+
|Overall| |453|35.65342|
+-------+---+---+--------+

Now we can run the permutation test, then display the results on a souped-up histogram with different shading for values larger in magnitude than the observed statistic (see above).

res = do(numsim) * lm(age ~ shuffle(female), data=ds)
pvalue = sum(abs(res$female) > abs(obsdiff)) / numsim
xhistogram(~ female, groups = abs(female) > abs(obsdiff),
n=50, density=TRUE, data=res, xlab="difference between groups",
main=paste("Permutation test result: p=", round(pvalue, 3)))

The results are similar to those from the previous test: there is little evidence to contradict the null hypothesis.

SAS

In SAS, we'll take another approach, delving into the capabilities of proc iml to make a manual permutation test. We begin by reading the data and replicating the example in the book.

libname k 'c:\book';
proc npar1way data = k.help;
class female;
var age;
exact scores=data / mc n= 9999 alpha = .05;
run;

Data Scores One-Way Analysis

Chi-Square 0.8453
DF 1
Pr > Chi-Square 0.3579

Permuting data is a very awkward thing to do in data steps. But it turns out to be easy in proc iml (the built-in SAS matrix language). Here we read in the key variables from the data set (use and read). Then we generate the permutations (ranperm). However, this generates row for each permuted data set, while we need a column for each, so we transpose the matrix (t) before saving it. Then we save the resulting data in a SAS data set with the female variable. Note that we permuted the ages only, as opposed to the R example-- it doesn't matter which is permuted, of course. Much of the proc iml code used here can be found in section 1.9 of the book-- however, note that curly braces are required in the read statement, as shown below.

proc iml;
use k.help;
read all var{female age} into x;
p = t(ranperm(x[,2],1000));
paf = x[,1]||p;
create newds from paf;
append from paf;
quit;

With the permuted data in hand, we use proc ttest (section 2.4.1) with the ODS system to generate and save the differences. Note that the default variable names from proc iml are fairly nondescript. With the 1000 permuted statistics in hand, we can generate a histogram of the statistics and a p-value with proc univariate.

ods output conflimits=diff;
proc ttest data=newds plots=none;
class col1;
var col2 - col1001;
run;

proc univariate data=diff;
where method = "Pooled";
var mean;
histogram mean / normal;
run;

data diff2;
set diff;
absdiff = abs(mean);
run;

proc univariate data=diff2
loccount mu0 = 0.7841284;
where method = "Pooled";
var absdiff;
run;

Location Counts: Mu0=0.78

Count Value

Num Obs > Mu0 357
Num Obs ^= Mu0 1000
Num Obs < Mu0 643

To leave a comment for the author, please follow the link and comment on his blog: SAS and R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , , ,

Comments are closed.