Beyond normality: the bootstrap method for hypothesis testing

[This article was first published on R on Alejandro Morales' Blog, and kindly contributed to R-bloggers].

tl;dr: Parametric bootstrap methods can be used to test hypotheses and calculate p values under whatever population distribution we choose to assume. Non-parametric bootstrap methods can be used to test hypotheses and calculate p values without assuming any particular population distribution, as long as the sample can be assumed to be representative of the population and the data can be transformed adequately to reflect the null hypothesis. The p values from bootstrap methods may differ from those from classical methods, especially when the assumptions of the classical methods do not hold. Different methods of calculation can move a p value across the 0.05 threshold, which means that statements of statistical significance are sensitive to all the assumptions used in the test.


In this article I show how to use parametric and non-parametric bootstrapping to test null hypotheses, with special emphasis on situations where the assumption of normality may not hold. To make it more relevant, I will use real data (from my own research) to illustrate the application of these methods. If you get lost somewhere in this article, you may want to take a look at my previous post, where I introduced the basic concepts behind hypothesis testing and sampling distributions. As in the previous post, the analysis will be done in R, so before we get into the details, it is important to set up our R session properly:

for(name in c("ggplot2", "plotly","furrr", "distr6"))
  library(name, character.only = TRUE)
plan(multisession) # Turns on parallel computation (multiprocess is deprecated in recent versions of future)
set.seed(2019) # Reproducible Monte Carlo simulation

The data I will use consist of measurements of individual plant biomass (i.e., the weight of a plant after all the water has been removed) for plants exposed to a control treatment (C), drought (D), high temperature (HT), and high temperature combined with drought (HTD). First, let’s take a look at the data:

Biomass = data.frame(
  Treatment = rep(c("C", "D", "HT", "HTD"), each = 18),
  Biomass = c(
    # C: control
    2.03, 4.49, 3.84, 2.66, 7.4, 3.04, 2.63, 7, 5.84, 6.99, 4.15, 5.74, 10.49, 23.3, 14.21, 16.97, 11.56, 17.94,
    # D: drought
    6.01, 6.94, 6.05, 5.23, 2.47, 6.24, 3.96, 4.47, 2.35, 4.37, 3.33, 6.04, 7.98, 11.44, 10.02, 9.64, 11.19, 12.71,
    # HT: high temperature
    5.22, 4.61, 7.58, 4.7, 6.68, 4.88, 4.11, 4.28, 5.77, 1.54, 2.79, 7.64, 8.68, 7.68, 12, 7.06, 9.9, 17.94,
    # HTD: high temperature and drought
    3.8, 3.8, 5.14, 6.06, 2.78, 2.63, 3.91, 4.65, 5.62, 4.5, 4.45, 5.44, 8.53, 5.59, 6.14, 4.92, 6.54, 7.01))

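# Jittered points plus violin outlines per treatment
# (geom_point() is dropped because geom_jitter() already draws every observation)
p = ggplot(data = Biomass, aes(x = Treatment, y = Biomass, colour = Treatment)) +
  geom_jitter(width = 0.25) +
  geom_violin(fill = NA) +
  theme(legend.position = "none")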
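To make the non-parametric idea from the tl;dr more concrete, here is a minimal sketch of a bootstrap test for a difference in means between the control (C) and drought (D) treatments. It is an illustration under stated assumptions, not necessarily the procedure used in the rest of the analysis: the helper name `boot_mean_diff_test`, the choice of the difference in means as the test statistic, and the way the null hypothesis is imposed (shifting both samples to the pooled mean before resampling) are all my own choices for this example. The two vectors are copied from the data above so that the block runs on its own.

```r
# Hypothetical helper (not from the original analysis): non-parametric
# bootstrap test for a difference in means between two samples.
boot_mean_diff_test = function(x, y, n_boot = 10000) {
  observed = mean(y) - mean(x)
  # Impose the null hypothesis (equal means) by shifting both samples
  # to the pooled mean, keeping the shape of each sample intact
  pooled_mean = mean(c(x, y))
  x0 = x - mean(x) + pooled_mean
  y0 = y - mean(y) + pooled_mean
  # Resample each shifted sample with replacement and recompute the statistic
  boot_diffs = replicate(n_boot, {
    mean(sample(y0, replace = TRUE)) - mean(sample(x0, replace = TRUE))
  })
  # Two-sided p value: fraction of bootstrap differences at least as
  # extreme as the observed one
  p_value = mean(abs(boot_diffs) >= abs(observed))
  list(observed = observed, p_value = p_value, boot_diffs = boot_diffs)
}

set.seed(2019) # Reproducible resampling

# Control (C) and drought (D) measurements, copied from the data above
control = c(2.03, 4.49, 3.84, 2.66, 7.4, 3.04, 2.63, 7, 5.84, 6.99, 4.15,
            5.74, 10.49, 23.3, 14.21, 16.97, 11.56, 17.94)
drought = c(6.01, 6.94, 6.05, 5.23, 2.47, 6.24, 3.96, 4.47, 2.35, 4.37,
            3.33, 6.04, 7.98, 11.44, 10.02, 9.64, 11.19, 12.71)

result = boot_mean_diff_test(control, drought)
result$observed # Observed difference in means (D minus C)
result$p_value  # Bootstrap p value under the null of equal means
```

Shifting each sample separately, rather than pooling all observations, keeps the spread and skew of each treatment as observed, so the test only assumes that the means are equal under the null hypothesis. Other test statistics or other ways of encoding the null hypothesis may give different p values.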
