# What is a Permutation Test?

**Econometrics Beat: Dave Giles' Blog**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Let’s begin with some background discussion to set the scene. This might seem a bit redundant, but it will help us to see how permutation tests differ from the sort of tests that we usually use in econometrics.

**Background Motivation**

When you took your first course in economic statistics, or econometrics, no doubt you encountered some of the basic concepts associated with testing hypotheses. I’m sure that the first exposure that you had to this was actually in terms of “classical”, Neyman-Pearson, testing.

It probably wasn’t described to you in so many words. It would have just been “statistical hypothesis testing”. The whole procedure would have been presented, more or less, along the following lines:

- We want to test the validity of statement (“null hypothesis”) about a parameter associated with a well-defined underlying population.
- We also state clearly what situation will prevail if the hypothesis to be tested is
*not true*. We call this the “alternative hypothesis”. - Then we take a
*carefully constructed*sample from the population of interest. For example, we might use simple random sampling, so that all sample values are mutually independent of each other. - We combine the sample values into a single statistic. Usually, this would be a statistic that had already been found to be a “good” estimator of the parameter under test. Typically, this estimator would have to be transformed (
*e.g.*, “standardized”) to make it “pivotal” – that is, having a sampling distribution that does not depend on any other unknown parameters. - Then, we ask – “if in fact the null hypothesis
*were actually true*, is the value of our test statistic “surprising”, or not? - If it
*is surprising*, we would*reject*the null hypothesis.*If not*, we would*not reject*the null hypothesis.

**– by checking if the test statistic’s value was “more extreme” than a critical value (or values) obtained by choosing an acceptable significance level in advance; and using the sampling distribution of your test statistic,**

*Either**based on the assumptions about the underlying population and the sampling method that was used*.

**– by asking, “if the null hypothesis really were true, how likely is it that I’d see a value for my test statistic that’s as extreme (or more extreme) than what I’ve actually observed here (again,**

*Or**taking into account*

*the assumptions about the underlying population and the sampling method that was used*). The answer to this question is, of course, the familiar p-value.

Now, you probably didn’t need to be reminded of all of this. However, there are some key aspects to this hypothesis testing story that I’ve highlighted above, and which are worth remembering when we consider (below) a different possible way of proceeding.

- First, notice that there’s this idea that we know (or can reasonably assume) quite a lot about the underlying population. For instance, we might assume that the population values follow a Normal distribution, with a known variance, but a mean that is unknown and whose value we’re testing. That’s quite a stretch, in many cases. (Yes, we can think about our test’s “robustness” to these assumptions, but that’s a different issue.)
- Second, the type of sampling that we use is crucial. Usually, simple random sampling is assumed. In reality, much (most) of the data that we use have been obtained using “complex surveys”, involving multi-level cluster sampling and stratified sampling. This changes everything!

If we mess up with either of these two things, then the sampling distribution of our test statistic will differ from what we think it is, and the probabilities that are associated with the implementation of Step 6 will be wrong.

- Finally, we have to be able to precisely specify the null and alternative hypotheses (
*in advance of seeing the sample data*), and these hypotheses have to relate to some parameter(s) in the population.

Let’s consider a simple, specific, example.

Suppose that the underlying population is N[μ, σ

^{2}], where σ^{2}is known, and we obtain a sample of n observations using simple random sampling. Then, you’ll recall that the sample average, x* is the most efficient unbiased estimator of μ. Using x* as our test statistic sounds like a good idea, but its sampling distribution is N[μ, σ^{2}/n], so it depends on the unknown value of μ. This test statistic isn’t “pivotal”.

However, if we transform x* into z = (x* – μ) / (σ / √(n)), this new statistic has a sampling distribution that is N[0 , 1].

It is now pivotal. (That’s what matters – not that we have a standardized distribution!)

If all of the assumptions that we’ve made are correct, then we can easily test the null hypothesis H_{0}: μ = 1 (say) against the alternative hypothesis, H_{A}: μ > 1 (say). If we assume that H_{0}is true, or μ = 1, then we can compute a value for z, because now it’s as if everything is known.

Now we can proceed as outlined above, because we have a

valuefor our test statistic, and we know its sampling distribution if H_{0}were true.

Let’s think about a totally different way of testing some hypothesis that interests us. There are three “bulleted” points just before the example above. Let’s toss them all out of the window and start afresh!**Permutation Tests**

Let’s suppose that we want to test some hypothesis, and we have a sample of size n that we plan to use. This sample could be very small, and just how we obtained it is not that important. In particular, * we’re not going to assume that it was obtained randomly*. Moreover, the (null) hypothesis that we’re wanting to test

*. It can just be some general statement. In fact, we’re not going to that much interested in “populations” as such, so*

**doesn’t have to be expressed in terms of a statement about some parameter associated with a population****.**

*we certainly don’t have to make a whole bunch of heroic assumptions involving properties such as “normality”, or “constant variance”*This sounds different, already!

We also have in mind some alternative hypothesis, and we’ll see shortly why this is important. Then, using our sample of data, {x_{1}, x_{2}, x_{3}, …., x_{n}}, we calculate a statistic that seems to be relevant when it comes to examining the possible truth of our null hypothesis. For instance we may want it to be one that increases monotonically as we move further from H_{0} in the direction of H_{A}. * This statistic does not have to be pivotal.* W’ell call the value of our statistic from this original sample the “observed statistic”, and label it S

_{obs}.

Now comes the interesting part. We “shuffle” (*i.e*., permute) the n sample observations. Perhaps they’re now ordered {x_{3}, x_{2}, x_{n},….., x_{1}}. We re-calculate the test statistic. Then we repeat the process for every possible permutation of the sample. This will give us n! test statistics (including S_{obs}). Let’s call their values S_{π}, where π refers to a particular permutation of the sample.

Finally, we ask the question, “Is S_{obs} very different from the other S_{π} values?” Or, more particularly, what fraction of the n! values of S_{π} are as “extreme”, or “more extreme”, than our original S_{obs}?”

This is where the alternative hypothesis comes in. The meaning of “extreme” is determined by H_{A} being one-sided or two-sided.

*The fraction of the S _{π} values that are as extreme, or more extreme, than S_{obs} is the p-value for our test.*

What we have here is a testing method that is “non-parametric”, or “distribution free”. However, it’s more than that. We can show that the the test will be “exact”, in the sense that * it will have exactly the significance level that we decide we want, no matter how small the sample is*!

Now, let’s be clear. All of this comes at a cost. Actually, there are three types of “cost” involved.

First, we’re going to be making an assumption about our data. This is actually quite a mild assumption, especially when we compare it with the various strong assumptions that underlie classical hypothesis testing. And often it will clear from the context of our testing problem that this assumption is obviously satisfied.

This key assumption is usually described as “*exchangeability*” of the data. That is, shuffling the observed data points keeps the data-set just as likely as the original one.

Second, this procedure that I’ve outlined isn’t going to be applicable to all testing problems, including some obvious and simple ones. For example, if we have a single sample (as above) and our test statistic is chosen to be the sample average, x* (or the sample median, say), then every S_{π} will have the same value of S_{obs}, and we’ll get nowhere.

Third, you can see immediately that even with very modest sample sizes, the number of possible permutations becomes extremely large, very quickly. For example, if n = 10 there are 10! = 3,628,800 permutations; and if n = 100 there are 9.33×10^{157}.

Yikes!

We can’t do much about the second of these “costs”. We’ll have to focus on testing problems that are suited to being solved using this particular tool. However, the *computational cost* can be handled quite easily, as we’ll see below.

Now, why does a permutation test actually work at all?

Basically, if the data are exchangeable then every permuted sample will be as likely to occur as any other one. So every S_{π}will be as likely to arise as any other one. Any observed differences will be small and random if H_{0}is true. If S_{obs}is “unusual”, then something is amiss and we would be inclined to reject our null hypothesis.

Permutation tests date back to the 1930’s, and were first proposed by Fisher (1935), Pitman (1937a, 1937b, 1938) and others. This is certainly not a new idea! However, as you can imagine, in the 1930’s these tests could be used only with very small samples and this limited their appeal to some degree.

Even with today’s computing power, the large number of permutations associated with even modest sample sizes is still a challenge. For this reason, in practice we typically just randomly sample a large number of permutations from the total number that are available. This is easy to do, and although the resulting test is no longer (literally) “exact”, we can make it as accurate as wish by choosing a suitable number of random permutations.

*intensive*. But they’re very manageable. There are several R packages for various types of randomization tests. These include the “coin” package; the “jmuOutlier” package; and the “RATest” package.

As I’ve noted already, I think that students of econometrics need to be more aware of permutation tests, even though they aren’t well-suited for some testing problems. A brief historical perspective is given by David (2008).

A lot has happened in the associated literature since 1995, of course, especially when it comes to applying these tests in the context of multiple regression analysis. I’ll say a bit more about the latter in the follow-up post that I’m currently preparing.

**Some Examples**Let’s take a look at a couple of simple examples that will illustrate the use of permutation/randomization tests. The first involves a “treatment effect” – a classic situation where permutation tests are used.

Example 1: Suppose that two econometrics students, Jane and John, sit a pre-semester test and a final exam. Here are their % scores:

__Jane__

__John__

__(Sample) Average__

*etc*. So, the conditions that are needed for the t-test may not hold, and our inferences could be badly distorted. We certainly can’t appeal to standard asymptotic results, given the very small sample size that we have here.

*no improvement*in knowledge, and the alternative hypothesis that there is a

*positive*improvement. We don’t need a fancy test statistic. For instance, we can simply use the difference between the “after” and “before” sample average scores. Then, the

*observed*value of our test statistic is S

_{obs}= (74.0 – 72.5) = 1.5. Yes, this is positive, but is it different from zero in any important sense? (I’m trying to avoid the term “statistically significant” here!)

Notice that there are 4 scores in the table above, and the 2 scores associated with the pre-semester test can be assigned in [4! / (2! 2!)] = 6 ways. Here are those 6 possible permutations:

__70 72 75 76 __ S_{π} = (X^{*}_{F} – X^{*}_{P})**π**

1 P P F F 4.5

2 P F P F 1.5 (S_{obs})

3 P F F P 0.5

4 F P P F -0.5

5 F P F P -1.5

6 F F P P -4.5

^{*}

_{P}and X

^{*}

_{F }are the sample average (permuted) scores in the test and in the final exam. S

_{π}denotes our test statistic for permutation π (= 1 to 6).

_{π}values is greater than or equal to S

_{obs}?” The answer is (2/6) = 0.333.

*associated with our permutation test. On the basis of its value I, personally,*

**This is the p-value***wouldn’t reject*the null hypothesis. I’d conclude that the course led to

*no improvement*in students’ knowledge.

*no change*in performance, then the p-value would be 0.667. Yes, we’re doubling the one-sided p-value in this case, because the distribution of the S

_{π}values happens to be symmetric. More importantly, we get this answer by noting that four out of the six | S

_{π}| values are greater than or equal to | S

_{obs }|.

_{π}statistics into (pseudo-) “t-statistics”, using the usual standardization, and then proceeded as above, we’get exactly the same p-value for the permutation test. S should be monotone with respect to changes in the data that make the null less likely. Choice may (often will) affect power of the test.

**Example 2:**

*no correlation*between the supply (Q) of a particular good and its market price, P. We’d probably want to have a one-sided alternative hypothesis, saying that there is a positive such correlation.

Suppose that we have a *sample* of 20 prices and the corresponding 20 quantities supplied:

**P:**5.0, 4.8, 4.7, 4.0, 5.3, 4.1, 5.5, 4.7, 3.3, 4.0, 4.0, 4.6, 5.3, 3.0, 3.5, 3.9, 4.7, 5.0, 5.2, 4.6

**Q:** 60, 59, 58, 47, 65, 48, 67, 70, 55, 63, 62, 65, 71, 56, 59, 60, 74, 77, 78, 62

We’re not going to assume that the population(s) from which we’ve sampled are normal, or anything else for that matter. We’re going to use the sample (Pearson) correlation coefficient as our test statistic. We’ll keep the ordering of the Q values just as they are in the observed sample, and we’ll permute the 20 observations on P. Even with this small sample there are 2.43×10^{18} possible permutations of the observations on P. So, we’ll draw a random selection of 50,000 of these permutations in constructing our randomization test.

The sample correlation for the 20 original values of P and Q is r_{obs} = 0.6183. We’ll denote the sample correlation associated with Q and a particular set of permuted P observations by r_{π} (π = 1, …, 50000). The (one-sided) p-value is the fraction of these 50,000 values of r_{π} that are greater than or equal to r_{obs}. You’ll find both EViews and R code for this example on the code page of this blog.

The p-values that I got using EViews and R were 0.0017 and 0.0016 respectively. The distribution of the 50,000 correlation coefficients generated in R looks like this:

Again, if we wanted to compute a two-sided p-value this would be the fraction of the 50,000 values of | r_{π} | that are greater than or equal to | r_{obs }|. Modifying the R code, we get a value of 0.00348 for this p-value.

In any case, on the basis of the very small p-value I’d (personally) *reject the hypothesis* that there is no correlation between price and quantity supplied.

Finally, although it was quite natural to use the sample correlation coefficient, r, as the test statistic for our hypothesis, again this is not actually necessary. Recalling the formula for r, we can use the test statistic, S_{π} = Σ(P_{i}Q_{i}). (The range of summation is over the sample size, n.) It differs from r_{π}, only by a location shift and a scaling. Constructing a permutation test of H_{0} using S_{π} instead of r_{π}, we get *exactly the same p-value*(s). The EViews program code that I’ve supplied illustrates this. (If you’re not an EViews user, you can open the program file with any text editor, and there are enough comments in the code for you to see what’s going on.)

The important things to note about these two examples are:

- The permutation/randomization tests were constructed and applied without the need for any particular knowledge or assumptions about some underlying population.
- There was no need for the sampling to be strictly random.
- It was legitimate to “shuffle” (permute) the observations, treating them as if they were “exchangeable”. (This will not be the case for some testing problems.)
- The tests are very simple, conceptually, to construct.
- Although they are modestly computer-intensive, this “cost” is on a par with bootstrapping, for example.

*exact*when it comes to their significance level.

**References**

Fisher, R. A., 1935. *The Design of Experiments*. Oliver & Boyd, Edinburgh.

*Journal of Business and Economic Statistics*, 13, 85-94.

*Journal of the Royal Statistical Society, B*, 4, 119-130.

*Journal of the Royal Statistical Society, B*, 4, 225-232.

*Biometrika*, 29, 322-335.

**leave a comment**for the author, please follow the link and comment on their blog:

**Econometrics Beat: Dave Giles' Blog**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.