In a 2x2x2 design, there are three main effects, three two-way interactions, and one three-way interaction to test. That’s 7 statistical tests. With an alpha of 0.05 for each test, the probability of making at least one Type 1 error in a single ANOVA is 1-(0.95)^7 = 30%.
There are earlier blog posts on this, but my eyes were not opened until I read this paper by Angelique Cramer and colleagues (put it on your reading list, if you haven’t read it yet). Because I prefer to provide solutions to problems, I want to show how to control Type 1 error rates in ANOVAs in R, and repeat why it’s necessary if you don’t want to fool yourself. Please be aware that if you continue reading, you will lose the bliss of ignorance if you hadn’t thought about this issue before now, and it will reduce the number of p < 0.05 results you’ll find in exploratory ANOVAs.
Simulating Type 1 errors in 3-way ANOVAs
Let’s simulate 250,000 2x2x2 ANOVAs where all factors are manipulated between individuals, with 50 participants in each condition, and without any true effect (all group means are equal). The R code is at the bottom of this page. We store the p-values of the 7 tests. The total p-value distribution has the by now familiar uniform shape we see if the null hypothesis is true.
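The logic of this simulation can be sketched without fitting any ANOVAs. The sketch below is in Python and is not the R code referred to above. It rests on a simplifying assumption: under the null, each of the 7 p-values is treated as an independent uniform draw, which is only approximately true for tests that share an error term. The number of simulations is reduced to keep it quick, and all names are made up for this illustration.

```python
import random

random.seed(1)

N_SIMS = 100_000  # fewer than the 250,000 in the text, to keep this quick
N_TESTS = 7       # 3 main effects + 3 two-way + 1 three-way interaction
ALPHA = 0.05

significant_per_test = [0] * N_TESTS
anovas_with_error = 0

for _ in range(N_SIMS):
    # Assumption: null p-values are independent uniform draws on [0, 1)
    p_values = [random.random() for _ in range(N_TESTS)]
    hits = [p < ALPHA for p in p_values]
    for i, hit in enumerate(hits):
        significant_per_test[i] += hit
    anovas_with_error += any(hits)

total_significant = sum(significant_per_test)
per_test_rate = [count / N_SIMS for count in significant_per_test]
familywise_rate = anovas_with_error / N_SIMS

print("significant p-values in total:", total_significant)
print("error rate per test:", [round(r, 3) for r in per_test_rate])
print("ANOVAs with at least one error:", round(familywise_rate, 3))
```

Each test should come out close to 5%, while the proportion of ANOVAs with at least one significant result should come out close to 30%.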
If we count the number of significant findings (even though there is no real effect), we see that from 250,000 2x2x2 ANOVAs, approximately 87,500 p-values were smaller than 0.05 (the leftmost bar in the Figure). This equals 250,000 ANOVAs x 7 tests x a 0.05 Type 1 error rate. If we split up the p-values for each of the 7 tests, we see in the table below that, as expected, each test has its own 5% error rate, which together combine to a 30% error rate due to multiple testing. With a 2x2x2x2 ANOVA, which has 15 tests, the Type 1 error rate is a massive 54%, making you about as accurate as a scientist as a coin-flipping toddler.
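The 30% and 54% figures follow from the same formula, 1-(1-alpha)^k for k tests. Here is a small Python sketch of that arithmetic, assuming the tests are independent. The function names are invented for this example.

```python
def n_anova_tests(n_factors):
    """Number of main effects plus interactions in a full factorial ANOVA."""
    return 2 ** n_factors - 1


def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one Type 1 error across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for factors in (1, 2, 3, 4):
    tests = n_anova_tests(factors)
    rate = familywise_error_rate(tests)
    print(f"{factors} factors: {tests} tests, error rate {rate:.3f}")
```

Three factors give 7 tests and four factors give 15, which is why the error rate climbs so quickly as a design grows.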
Let’s fix this. We need to adjust the error rate. The Bonferroni correction (divide your alpha level by the number of tests, so for 7 tests and alpha = 0.05 use 0.05/7 = 0.007 for each test) communicates the basic idea very well, but the Holm-Bonferroni correction is slightly better. In fields outside of psychology (e.g., economics, gene discovery) work on optimal Type 1 error control procedures continues. I’ve used the mutoss package in R in my simulations to check a wide range of corrections, and came to the conclusion that unless the number of tests is huge, we don’t need anything more fancy than the Holm-Bonferroni (or sequential Bonferroni) correction (please correct me if I’m wrong in the comments!). It orders p-values from lowest to highest, and tests them sequentially against an increasingly lenient alpha level, stopping at the first p-value that is not significant. If you prefer a spreadsheet, go here.
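To make the difference between the two corrections concrete, here is a minimal Python sketch of both decision rules. It is an illustration, not the mutoss code used in the simulations. The function names and the seven p-values are made up.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject every test whose p-value is at most alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni: test sorted p-values against alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        threshold = alpha / (m - rank)  # grows more lenient at every step
        if p_values[idx] > threshold:
            break  # stop at the first non-significant p-value
        reject[idx] = True
    return reject


# Hypothetical p-values for the 7 tests of a 2x2x2 ANOVA
p_example = [0.001, 0.008, 0.012, 0.2, 0.4, 0.6, 0.9]
print("Bonferroni:", bonferroni_reject(p_example))
print("Holm:      ", holm_reject(p_example))
```

With these made-up values, Bonferroni only rejects the first test (0.008 is larger than 0.05/7), while Holm also rejects the second (0.008 is smaller than 0.05/6).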
In a 2x2x2 ANOVA, we can test three main effects, three 2-way interactions, and one 3-way interaction. The table below shows the error rate for each of these 7 tests is 5% (for a total of 1-0.95^7 = 30%) but after the Holm-Bonferroni correction, the Type 1 error rate is nicely controlled.
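The same check can be sketched in Python. As a simplification, the 7 null p-values are again drawn as independent uniform numbers instead of coming from real ANOVAs, so this illustrates the logic and does not reproduce the table. The helper `holm_reject` is written for this example.

```python
import random


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni decisions, returned in the original order of the tests."""
    m = len(p_values)
    reject = [False] * m
    for rank, idx in enumerate(sorted(range(m), key=lambda i: p_values[i])):
        if p_values[idx] > alpha / (m - rank):
            break
        reject[idx] = True
    return reject


random.seed(2)
N_SIMS = 100_000
uncorrected_errors = 0
corrected_errors = 0

for _ in range(N_SIMS):
    # Assumption: no true effects, so all 7 p-values are uniform
    p_values = [random.random() for _ in range(7)]
    uncorrected_errors += any(p < 0.05 for p in p_values)
    corrected_errors += any(holm_reject(p_values))

uncorrected_rate = uncorrected_errors / N_SIMS
corrected_rate = corrected_errors / N_SIMS
print("at least one error, uncorrected:", round(uncorrected_rate, 3))
print("at least one error, Holm:       ", round(corrected_rate, 3))
```

The uncorrected rate should land near 30%, and the Holm-corrected rate at or just below 5%.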
However, another challenge is to not let Type 1 error control increase the Type 2 errors too much. To examine this, I’ve simulated 2x2x2 ANOVAs where there is a true effect. One of the eight cells has a small positive difference, and one has a small negative difference. As a consequence, with sufficient power, we should find 4 significant effects (a main effect, two 2-way interactions, and the 3-way interaction).
Let’s first look at the p-value distribution. I’ve added a horizontal and vertical line. The horizontal line indicates the null-distribution caused by the three null-effects. The vertical line indicates the significance level of 0.05. The two lines create four quadrants. Top left are the true positives, bottom left are the false positives, top right are the false negatives (not significant due to a lack of power) and the bottom right are the true negatives.
Now let’s plot the adjusted p-values using Holm’s correction (instead of changing the alpha level for each test, we can also keep the alpha fixed, but adjust the p-value).
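Adjusting the p-values instead of the alpha level can be sketched as follows. This Python function is written for illustration and uses made-up p-values. It follows the same rule that R’s p.adjust(p, method = "holm") applies: multiply the smallest p-value by the number of tests, the next one by one less, and so on, never letting an adjusted value drop below an earlier one or exceed 1.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order of the tests."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for the 7 tests of a 2x2x2 ANOVA
p_example = [0.001, 0.008, 0.012, 0.2, 0.4, 0.6, 0.9]
adjusted_example = holm_adjust(p_example)
print([round(p, 3) for p in adjusted_example])
```

Comparing the adjusted p-values to a fixed alpha of 0.05 leads to the same decisions as comparing the raw p-values to the step-wise alpha levels.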
We see a substantial drop in the left-most column, and this drop is larger than the part of the bar that is due to false positives. We also see a peculiarly high bar on the right, caused by the Holm correction adjusting a large number of p-values to 1. We can see this drop in power in the Table below as well. It’s substantial: From 87% power to 68% power.
If you perform a 2x2x2 ANOVA, we might expect you are not really interested in the main effects (if you were, a simple t-test would have sufficed). The power cost is already much lower if the exploratory analysis focusses on only four tests, the three 2-way interactions, and the 3-way interaction (see the third row in the Table below). Even exploratory 2x2x2 ANOVAs are typically not 100% exploratory. If so, preregistering the subset of all tests you are interested in, and controlling the error rate for this subset of tests, provides an important boost in power.
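A rough feel for why fewer tests cost less power can be had from a back-of-the-envelope calculation. The Python sketch below rests on assumptions that differ from the simulations: it uses a two-sided z-test as a normal approximation instead of an F-test, the simpler Bonferroni alpha instead of Holm, and it ignores the negligible far tail. It will not reproduce the Table exactly, and the function name is made up.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_at_alpha(power_at_base, new_alpha, base_alpha=0.05):
    """Approximate power at new_alpha for a test with known power at base_alpha."""
    # Effect in standard-error units implied by the power at the base alpha
    delta = (STD_NORMAL.inv_cdf(1 - base_alpha / 2)
             + STD_NORMAL.inv_cdf(power_at_base))
    critical_value = STD_NORMAL.inv_cdf(1 - new_alpha / 2)
    return STD_NORMAL.cdf(delta - critical_value)


power_7_tests = power_at_alpha(0.87, 0.05 / 7)  # correcting for all 7 tests
power_4_tests = power_at_alpha(0.87, 0.05 / 4)  # only the 4 interactions
print("power when correcting for 7 tests:", round(power_7_tests, 2))
print("power when correcting for 4 tests:", round(power_4_tests, 2))
```

Under these assumptions, a test with 87% power at an alpha of 0.05 loses noticeably less power when the correction covers 4 tests than when it covers 7.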
Oh come on you silly methodological fetishist!
If you think Type 1 error control should not endanger the discovery of true effects, here’s what you should not do. You should not wave your hands at controlling Type 1 error rates, saying it is ‘methodological fetishism’ (Ellemers, 2013). It ain’t gonna work. If you choose to report p-values (by all means, don’t), and want to do quantitative science (by all means, don’t) then the formal logic you are following (even if you don’t realize this) is the Neyman-Pearson approach. It allows you to say: ‘In the long run, I’m not saying there’s something, when there is nothing, more than X% of the time’. If you don’t control error rates, your epistemic foundation of making statements reduces to ‘In the long run, I’m not saying there’s something, when there is nothing, more than … uhm … nice weather for the time of the year, isn’t it?’.
Now just because you need to control error rates, doesn’t mean you need to use a Type 1 error rate of 5%. If you plan to replicate any effect you find in an exploratory study, and you set the alpha to 0.2, the probability of making a Type 1 error twice in a row is 0.2*0.2 = 0.04. If you want to explore four different interactions in a 2x2x2 ANOVA you intend to replicate in any case, setting your overall Type 1 error across two studies to 0.2, and then using an alpha of 0.05 for each of the 4 tests might be a good idea. If some effects would be costlier to miss, but others less costly, you can use an alpha of 0.08 for two effects, and an alpha of 0.02 for the other two. This is just one example. It’s your party. You can easily pre-register the choices you make to the OSF or AsPredicted to transparently communicate them.
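The bookkeeping behind such choices is simple enough to check in a few lines. This Python sketch uses invented names. It assumes the replication is an independent study, and it uses the sum of the per-test alphas as a Bonferroni-style upper bound on the overall error rate. The unequal split shown is one example that stays within an overall 0.2.

```python
import math


def error_rate_after_replication(alpha_first, alpha_second):
    """Probability of a Type 1 error in two independent studies in a row."""
    return alpha_first * alpha_second


def overall_alpha_bound(per_test_alphas):
    """Bonferroni-style upper bound on the overall Type 1 error rate."""
    return sum(per_test_alphas)


twice_in_a_row = error_rate_after_replication(0.2, 0.2)
equal_split = [0.05, 0.05, 0.05, 0.05]    # four interactions, same alpha
unequal_split = [0.08, 0.08, 0.02, 0.02]  # lenient where a miss is costlier

print("error twice in a row:", round(twice_in_a_row, 2))
print("equal split bound:   ", round(overall_alpha_bound(equal_split), 2))
print("unequal split bound: ", round(overall_alpha_bound(unequal_split), 2))
```

Both splits spend the same overall budget of 0.2. They only differ in which tests get the more lenient alpha.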
You can also throw error control out of the window. There are approximately 1,950,000 hits in Google Scholar when I search for ‘An Exploratory Analysis Of’. Put these words in the title, list all your DVs in the main text (e.g., in a table), add Bayesian statistics and effect sizes with their confidence intervals, and don’t draw strong conclusions (Bender & Lange, 2001).
Obviously, the tricky thing is always what to do if your prediction was not confirmed. I think you enter a Lakatosian degenerative research line (as opposed to the progressive research line you’d be in if your predictions were confirmed). With some luck, there’s an easy fix. The same study, but using a larger sample (or, if you designed a study using sequential analyses, simply continuing the data collection after the first look at the data; Lakens, 2014), might get you back in a progressive research line after an update in the predicted effect size. Try again, with a better manipulation or dependent variable. Giving up on a research idea after a single failed confirmation is not how science works, in general. Statistical inferences tell you how to interpret the data without fooling yourself. Type 1 error control matters, and in most psychology experiments, is relatively easy to do. But it’s only one aspect of the things you take into account when you decide which research you want to do.
My main point here is that there are many possible solutions, and all you have to do is choose one that best fits your goals. Since your goal is very unlikely to be a 30% Type 1 error rate in a single study which you interpret as a 5% Type 1 error rate, you have to do something. There’s a lot of room between 100% exploratory and 100% confirmatory research, and there are many reasonable ideas about what the ‘family’ of errors is you want to control (for a good discussion on this, see Bender & Lange, 2001). I fully support their conclusion (p. 344): “Whatever the decision is, it should clearly be stated why and how the chosen analyses are performed, and which error rate is controlled for”. Clear words, no hand waving.
Thanks to @RogierK for correcting an error in an earlier version of this blog post.
Bender, R., & Lange, S. (2001). Adjusting for multiple testing—when and how? Journal of Clinical Epidemiology, 54(4), 343–349.
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., … Wagenmakers, E.-J. (2014). Hidden multiplicity in multiway ANOVA: Prevalence, consequences, and remedies. arXiv Preprint arXiv:1412.3416. Retrieved from http://arxiv.org/abs/1412.3416
Ellemers, N. (2013). Connecting the dots: Mobilizing theory to reveal the big picture in social psychology (and why we should do this): The big picture in social psychology. European Journal of Social Psychology, 43(1), 1–8. http://doi.org/10.1002/ejsp.1932
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023