R for Ecologists: Permutation Analysis – t-tests


You’ve carefully designed your experiment, you’ve meticulously collected your data, and you have a hypothesis to test. Unfortunately, your data are typical of ecological data: small samples, messy, and non-normal. Your ideal test, the t-test, won’t work because of the non-normality, and the sample size is too small to invoke the central limit theorem. All hope is not lost!

Permutational analyses are a fantastic way to analyze data from designed experiments where experimental units have been randomly assigned to treatments (see Anderson 2001, Canadian Journal of Fisheries and Aquatic Sciences, for a thorough discussion of permutational analyses). In fact, permutational analysis is great for these carefully designed experiments for two main reasons: 1) it frees us from the stringent assumptions of normality and equal variances (update: not necessarily true – see comments); there are few distributional assumptions, and 2) we can analyze any derived metric we like. Note that the assumption of independent observations still applies. There is just no getting around pseudoreplication.

First, let’s simulate some data. We’ll start with normal data so we can see the equivalence of permutational analyses and parametric tests when assumptions are met.

Let’s imagine you’re interested in phenotypic plasticity of Gambusia, and you raise Gambusia under two different predation regimes: predator-rich and predator-poor. You suspect the Gambusia will have faster growth rates in predator-rich environments (they want to mature faster and have more, smaller offspring when predators are present). We’ll simulate growth data on 15 mosquitofish from each of the two populations: Predator-Present (PP) and Predator-Absent (PA).

# SET RANDOM NUMBER GENERATOR SEED
set.seed(2001)

# SIMULATE NORMAL DATA
PA <- rnorm(15, 4.5, 2)
PP <- rnorm(15, 7.5, 2)

# PLOT HISTOGRAMS
mf <- par(mfrow=c(1,2))
hist(PA, breaks=5)
hist(PP, breaks=5)
par(mf)

Even though the data from one group or the other may look non-normal, we know the underlying distribution is normal because we simulated from a normal distribution. We can use a basic t-test to determine if there are differences between the two populations.

t.test(PA, PP)

You get a very significant p-value and a t-statistic of -5.95.
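
If you want the numbers programmatically rather than reading them off the printout, t.test() returns an object of class “htest” whose components you can extract directly:

# STORE THE TEST AND PULL OUT THE PIECES
tt <- t.test(PA, PP)
tt$statistic # t-statistic
tt$p.value # p-value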

Now we’ll analyze it with a permutation test. Inherent in our experimental design was the random assignment of individuals to the PP or PA treatments, so we can view our observed data as just one of many possible arrangements of the data. We can shuffle the observations around and ask, “If observations are randomly assigned to treatments, what is the probability of observing our particular arrangement of the data?” (Note that this is still a frequentist hypothesis test: what is the probability of our data, given that the null hypothesis of random assignment is true?)
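
To make this concrete, here is what a single shuffle looks like (the names PAstar and PPstar are mine, purely for illustration):

# ONE EXAMPLE SHUFFLE OF THE POOLED DATA
shuffled <- sample(c(PA, PP)) # randomly reorder all 30 observations
PAstar <- shuffled[1:15] # call the first 15 observations 'PA'
PPstar <- shuffled[16:30] # call the rest 'PP'
mean(PPstar) - mean(PAstar) # one mean difference under the null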

We first need to know how many unique permutations are possible so we don’t wind up repeating combinations. We have 30 observations, split evenly into two groups, so there are 30-choose-15 = 155,117,520 ways to permute the data uniquely (in R, this is choose(30, 15)). I think we’re safe.

We’ll follow this pattern: 1) pool all the data, 2) randomly assign 15 observations to PA, 3) assign the rest to PP, 4) calculate the difference in means between the groups, 5) repeat for 9999 iterations, and 6) compare our observed mean difference to the distribution of mean differences generated by the randomization.

# POOL DATA
pooledData <- c(PA, PP)
# SET THE NUMBER OF ITERATIONS
nIter <- 9999
# SET UP A CONTAINER FOR PERMUTED DIFFERENCES. ADD IN A SLOT FOR THE OBSERVED VALUE
meanDiff <- numeric(nIter+1)
# CALCULATE THE OBSERVED MEAN DIFFERENCE
meanDiff[1] <- mean(PP) - mean(PA)
# RUN THE ITERATION IN A FOR() LOOP
for(i in 2:length(meanDiff)){ # start from 2 to avoid overwriting the observed difference
 index <- sample(1:30, size=15, replace=FALSE) # Randomly draw 15 of the 30 indices without replacement
 PAperm <- pooledData[index] # Assign the sampled values to PA
 PPperm <- pooledData[-index] # Assign everything else to PP
 meanDiff[i] <- mean(PPperm) - mean(PAperm) # Calculate and store the difference in means
}
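
As an aside, the same loop can be written more compactly with replicate(); this is purely a stylistic alternative that produces the same kind of null distribution:

# COMPACT ALTERNATIVE USING replicate()
permDiff <- replicate(nIter, {
 index <- sample(1:30, size=15, replace=FALSE)
 mean(pooledData[-index]) - mean(pooledData[index])
})
meanDiff2 <- c(mean(PP) - mean(PA), permDiff) # observed value first, as before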

# PLOT HISTOGRAM OF DIFFERENCES IN MEANS
hist(meanDiff, xlab='Difference in PP and PA means', prob=T, main='')
# ADD IN A LINE FOR OUR OBSERVED VALUE
abline(v=meanDiff[1], lty=2, col='red')

# CALCULATE THE P-VALUE. USE THE ABSOLUTE VALUE FOR A TWO-TAILED TEST
mean(abs(meanDiff) >= abs(meanDiff[1]))

The last line is a little trick: R treats each difference at least as extreme (in absolute value) as our observed difference as a 1 (TRUE) and everything else as a 0 (FALSE). Adding up the 1s and dividing by the number of values gives the proportion of differences at least as extreme as our observed statistic, which is the same as taking the mean of a vector of 1s and 0s. For example, if I have two 1s and two 0s, then (1+1+0+0)/4 = 0.5, which is the same as mean(c(1, 1, 0, 0)).
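
You can see the coercion at work on a toy vector:

# LOGICALS ARE COERCED TO 1s AND 0s, SO mean() GIVES A PROPORTION
mean(c(1, 1, 0, 0)) # 0.5
mean(c(5, 2, 0, 1) >= 2) # TRUE TRUE FALSE FALSE, so also 0.5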

The p-value is incredibly low, similar to what we saw with the t-test.
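
Because the observed difference sits in the first slot of meanDiff, this calculation is exactly the conventional permutation p-value (b + 1)/(m + 1), where b is the number of permuted differences at least as extreme as the observed one and m is the number of permutations. Written out explicitly:

# SAME P-VALUE, WRITTEN IN THE (b + 1) / (m + 1) FORM
(sum(abs(meanDiff[-1]) >= abs(meanDiff[1])) + 1) / (nIter + 1)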

You can try the same thing with log-normal data.

# SIMULATE LOG-NORMAL DATA
PAlnorm <- rlnorm(n=15, meanlog=log(4.5), sdlog=log(2))
PPlnorm <- rlnorm(n=15, meanlog=log(7.5), sdlog=log(2))

mf <- par(mfrow=c(1,2))
hist(PAlnorm)
hist(PPlnorm)
par(mf)

# t-Test
t.test(PAlnorm, PPlnorm)

# POOL DATA
pooledData <- c(PAlnorm, PPlnorm)
# SET THE NUMBER OF ITERATIONS
nIter <- 9999
# SET UP A CONTAINER FOR PERMUTED DIFFERENCES. ADD IN A SLOT FOR THE OBSERVED VALUE
meanDiff <- numeric(nIter+1)
# CALCULATE THE OBSERVED MEAN DIFFERENCE
meanDiff[1] <- mean(PPlnorm) - mean(PAlnorm)
# RUN THE ITERATION IN A FOR() LOOP
for(i in 2:length(meanDiff)){ # start from 2 to avoid overwriting the observed difference
 index <- sample(1:30, size=15, replace=FALSE) # Randomly draw 15 of the 30 indices without replacement
 PAperm <- pooledData[index] # Assign the sampled values to PA
 PPperm <- pooledData[-index] # Assign everything else to PP
 meanDiff[i] <- mean(PPperm) - mean(PAperm) # Calculate and store the difference in means
}

# PLOT HISTOGRAM OF DIFFERENCES IN MEANS
hist(meanDiff, xlab='Difference in PP and PA means', prob=T, main='')
# ADD IN A LINE FOR OUR OBSERVED VALUE
abline(v=meanDiff[1], lty=2, col='red')

mean(abs(meanDiff) >= abs(meanDiff[1]))

The results of the t-test and the permutational analysis are again equivalent, although this will not always be the case. (My personal experience is that it usually is, so parametric tests are probably more robust to departures from their assumptions than we’ve been led to believe.)

This code implements a permutational version of the t-test. Doing an ANOVA requires a few extra lines of code but is conceptually similar.
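
As a minimal sketch of the ANOVA version (the three groups and their data below are made up purely for illustration): shuffle the group labels and compare the observed F-statistic to its permutation distribution.

# PERMUTATION ANOVA SKETCH: SHUFFLE GROUP LABELS, COMPARE F-STATISTICS
groups <- factor(rep(c('A', 'B', 'C'), each=10)) # hypothetical design
y <- c(rnorm(10, 4), rnorm(10, 5), rnorm(10, 6)) # hypothetical data
obsF <- anova(lm(y ~ groups))$'F value'[1] # observed F-statistic
permF <- replicate(9999, anova(lm(y ~ sample(groups)))$'F value'[1]) # null F's
mean(c(obsF, permF) >= obsF) # permutation p-value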

