Beyond Univariate, Single-Sample Data with MCHT

November 5, 2018
By

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Introduction

I’ve spent the past few weeks writing about MCHT, my new package for Monte Carlo and bootstrap hypothesis testing. After discussing how to use MCHT safely, I discussed how to use it for maximized Monte Carlo (MMC) testing, then bootstrap testing. One may think I’ve said all I want to say about the package, but in truth, I’ve only barely passed the halfway point!

Today I’m demonstrating how general MCHT is, allowing one to use it for multiple samples and on non-univariate data. I’ll be doing so with two examples: a permutation test and the F test for significance of a regression model.

Permutation Test

The idea of the permutation test dates back to Fisher (see [1]) and it forms the basis of computational testing for difference in mean. Let’s suppose that we have two samples with respective means \mu_X and \mu_Y, respectively. Suppose we wish to test

H_0: \mu_X = \mu_Y

against

using samples X_1, \ldots, X_m and Y_1, \ldots, Y_n, respectively.

If the null hypothesis is true and we also make the stronger assumption that the two samples were drawn from distributions that could differ only in their means, then the labelling of the two samples is artificial, and if it were removed the two samples would be indistinguishable. Relabelling the data and artificially calling one sample the X sample and the other the Y sample would produce highly similar statistics to the one we actually observed. This observation suggests the following procedure:

  1. Generate N new datasets by randomly assigning labels to the combined sample of X_1, \ldots, X_m and Y_1, \ldots, Y_n.
  2. Compute copies of the test statistic on each of the new samples; suppose that the test statistic used is the difference in means, \bar{X} -  \bar{Y}.
  3. Compute the test statistic on the actual sample and compare to the simulated statistics. If the actual statistic is relatively large compared to the simulated statistics, then reject the null hypothesis in favor of the alternative; otherwise, don’t reject.

In practice step 3 is done by computing a p-value representing the proportion of simulated statistics larger than the one actually computed.

Permutation Tests Using MCHT

The permutation test is effectively a bootstrap test, so it is supported by MCHT, though one may wonder how that’s the case when the parameters test_stat, stat_gen, and rand_gen only accept one parameter, x, representing the dataset (as opposed to, say, t.test(), which has an x and an optional y parameter). But MCHTest() makes very few assumptions about what object x actually is; if your object is either a vector or tabular, then the MCHTest object should not have a problem with it (it’s even possible a loosely structured list would be fine, but I have not tested this; tabular formats should cover most use cases).

In this case, putting our data in long-form format makes doing a permutation test fairly simple. One column will contain the group an observation belongs to while the other contains observation values. The test_stat function will split the data according to group, compute group-wise means, and finally compute the test statistic. rand_gen generates new dataset by permuting the labels in the data frame. stat_gen merely serves as the glue between the two.

The result is the following test.

library(MCHT)
library(doParallel)

registerDoParallel(detectCores())

ts <- function(x) {
  grp_means <- aggregate(value ~ group, data = x, FUN = mean)

  grp_means$value[1] - grp_means$value[2]
}

rg <- function(x) {
  x$group <- sample(x$group)
  x
}

sg <- function(x) {
  test_stat(x)
}

permute.test <- MCHTest(ts, sg, rg, seed = 123, N = 1000,
                        localize_functions = TRUE)

df <- data.frame("value" = c(rnorm(5, 2, 1), rnorm(10, 0, 1)),
                 "group" = rep(c("x", "y"), times = c(5, 10)))

permute.test(df)
## 
## 	Monte Carlo Test
## 
## data:  df
## S = 1.3985, p-value = 0.036

Linear Regression F Test

Suppose for each observation in our dataset there is an outcome of interest, y_i, and there are variables x_{1,i}, \ldots, x_{p,i} that could together help predict the value of y_i if they are known. Consider then the following linear regression model (with E[\epsilon_i] = 0):

y_i = \beta_0 + \beta_1 x_{1,i} + \ldots + \beta_p x_{p,i} + \epsilon_i

The first question someone should asked when considering a regression model is whether it’s worth anything at all. An alternative approach to predicting y_i is simply to predict its mean value. That is, the model

y_i = \beta_0 + \epsilon_i

is much simpler and should be preferred to the more complicated model listed above if it’s just as good at explaining the behavior of y_i for all i. Notice the second model is simply the first model with all the coefficients \beta_1, \ldots, \beta_p identically equal to zero.

The F-test (described in more detail here) can help us decide between these two competing models. Under the null hypothesis, the second model is the true model.

H_0: \beta_1 = \ldots = \beta_p = 0

The alternative says that at least one of the regressors is helpful in predicting y_i.

H_A: H_0 \text{ is false}

We can use the F statistic to decide between the two models:

F = \frac{\frac{\text{RSS}_2 - \text{RSS}_1}{p - 1}}{\frac{\text{RSS}_1}{n - p}}

\text{RSS}_1 and \text{RSS}_2 are the residual sum of squares of models 1 and
2, respectively.

This test is called the F-test because usually the F-distribution is used to compute p-values (as this is the distributiont the F statistic should follow when certain conditions hold, at least asymptotically if not exactly). What then would a bootstrap-based procedure look like?

If the null hypothesis is true then the best model for the data is this:

y_i = \bar{y} + \hat{\epsilon}_i

\bar{y} is the sample mean of y and \hat{epsilon}_i = y_i - \bar{y} is the residual. This suggests the following procedure:

  1. Shuffle y_i over all rows of the input dataset, with replacement, to generate new datasets.
  2. Compute F statistics for each of the generated datasets.
  3. Compare the F statistic of the actual dataset to the generated datasets’ statistics.

F Test Using MCHT

Let’s perform the F test on a subset of the iris dataset. We will see if there is a relationship between the sepal length and sepal width among iris setosa flowers. Below is an initial split and visualization:

library(dplyr)

setosa <- iris %>% filter(Species == "setosa") %>%
            select(Sepal.Length, Sepal.Width)
plot(Sepal.Width ~ Sepal.Length, data = setosa)

plot of chunk unnamed-chunk-2

There is an obvious relationship between the variables. Thus we should expect the test to reject the null hypothesis. That is what we would conclude if we were to run the conventional F test:

res <- lm(Sepal.Width ~ Sepal.Length, data = setosa)
summary(res)
## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = setosa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72394 -0.18273 -0.00306  0.15738  0.51709 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.5694     0.5217  -1.091    0.281    
## Sepal.Length   0.7985     0.1040   7.681 6.71e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2565 on 48 degrees of freedom
## Multiple R-squared:  0.5514,	Adjusted R-squared:  0.542 
## F-statistic: 58.99 on 1 and 48 DF,  p-value: 6.71e-10

Let’s now implement the procedure I described with MCHTest().

ts <- function(x) {
  res <- lm(Sepal.Width ~ Sepal.Length, data = x)
  summary(res)$fstatistic[[1]]  # Only way I know to automatically compute the
                                # statistic
}

# rand_gen's function can use both x and n, and n will be the number of rows of
# the dataset
rg <- function(x, n) {
  x$Sepal.Width <- sample(x$Sepal.Width, replace = TRUE, size = n)
  x
}

b.f.test.1 <- MCHTest(ts, ts, rg, seed = 123, N = 1000)

b.f.test.1(setosa)
## 
## 	Monte Carlo Test
## 
## data:  setosa
## S = 58.994, p-value < 2.2e-16

Excellent! It reached the correct conclusion.

Conclusion

One may naturally ask whether we can write functions a bit more general than what I’ve shown here at least in the regression context. For example, one may want parameters specifying a formula so that the regression model isn’t hard-coded into the test. In short, the answer is yes; MCHTest objects try to pass as many parameters to the input functions as they can.

Here is the revised example that works for basically any formula:

ts <- function(x, formula) {
  res <- lm(formula = formula, data = x)
  summary(res)$fstatistic[[1]]
}

rg <- function(x, n, formula) {
  dep_var <- all.vars(formula)[1]  # Get the name of the dependent variable
  x[[dep_var]] <- sample(x[[dep_var]], replace = TRUE, size = n)
  x
}

b.f.test.2 <- MCHTest(ts, ts, rg, seed = 123, N = 1000)

b.f.test.2(setosa, formula = Sepal.Width ~ Sepal.Length)
## 
## 	Monte Carlo Test
## 
## data:  setosa
## S = 58.994, p-value < 2.2e-16

This shows that you can have a lot of control over how MCHTest objects handle their inputs, giving you considerable flexibility.

Next post: time series and MCHT

References

  1. R. A. Fisher, The design of experiments (1935)

Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)