# Beyond Univariate, Single-Sample Data with MCHT

**R – Curtis Miller's Personal Website**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Introduction

I’ve spent the past few weeks writing about **MCHT**, my new package for Monte Carlo and bootstrap hypothesis testing. After discussing how to use **MCHT** safely, I discussed how to use it for maximized Monte Carlo (MMC) testing, then bootstrap testing. One may think I’ve said all I want to say about the package, but in truth, I’ve only barely passed the halfway point!

Today I’m demonstrating how general **MCHT** is, allowing one to use it for multiple samples and on non-univariate data. I’ll be doing so with two examples: a permutation test and the test for significance of a regression model.

## Permutation Test

The idea of the permutation test dates back to Fisher (see [1]) and it forms the basis of computational testing for difference in mean. Let’s suppose that we have two samples with respective means and , respectively. Suppose we wish to test

against

If the null hypothesis is true and we also make the stronger assumption that the two samples were drawn from distributions that could differ only in their means, then the labelling of the two samples is artificial, and if it were removed the two samples would be indistinguishable. Relabelling the data and artificially calling one sample the

- Generate
new datasets by randomly assigning labels to the combined sample of and . - Compute copies of the test statistic on each of the new samples; suppose that the test statistic used is the difference in means,
. - Compute the test statistic on the actual sample and compare to the simulated statistics. If the actual statistic is relatively large compared to the simulated statistics, then reject the null hypothesis in favor of the alternative; otherwise, don’t reject.

In practice step 3 is done by computing a

## Permutation Tests Using MCHT

The permutation test is effectively a bootstrap test, so it is supported by **MCHT**, though one may wonder how that’s the case when the parameters `test_stat`

, `stat_gen`

, and `rand_gen`

only accept one parameter, `x`

, representing the dataset (as opposed to, say, `t.test()`

, which has an `x`

and an optional `y`

parameter). But `MCHTest()`

makes very few assumptions about what object `x`

actually is; if your object is either a vector or tabular, then the `MCHTest`

object should not have a problem with it (it’s even possible a loosely structured `list`

would be fine, but I have not tested this; tabular formats should cover most use cases).

In this case, putting our data in long-form format makes doing a permutation test fairly simple. One column will contain the group an observation belongs to while the other contains observation values. The `test_stat`

function will split the data according to group, compute group-wise means, and finally compute the test statistic. `rand_gen`

generates new dataset by permuting the labels in the data frame. `stat_gen`

merely serves as the glue between the two.

The result is the following test.

library(MCHT) library(doParallel) registerDoParallel(detectCores()) ts <- function(x) { grp_means <- aggregate(value ~ group, data = x, FUN = mean) grp_means$value[1] - grp_means$value[2] } rg <- function(x) { x$group <- sample(x$group) x } sg <- function(x) { test_stat(x) } permute.test <- MCHTest(ts, sg, rg, seed = 123, N = 1000, localize_functions = TRUE) df <- data.frame("value" = c(rnorm(5, 2, 1), rnorm(10, 0, 1)), "group" = rep(c("x", "y"), times = c(5, 10))) permute.test(df) ## ## Monte Carlo Test ## ## data: df ## S = 1.3985, p-value = 0.036

## Linear Regression F Test

Suppose for each observation in our dataset there is an outcome of interest,

The first question someone should asked when considering a regression model is whether it’s worth anything at all. An alternative approach to predicting

is much simpler and should be preferred to the more complicated model listed above if it’s just as good at explaining the behavior of

The

The alternative says that at least one of the regressors is helpful in predicting

We can use the

2, respectively.

This test is called the

If the null hypothesis is true then the best model for the data is this:

- Shuffle
over all rows of the input dataset, with replacement, to generate new datasets. - Compute
statistics for each of the generated datasets. - Compare the
statistic of the actual dataset to the generated datasets’ statistics.

## F Test Using MCHT

Let’s perform the `iris`

dataset. We will see if there is a relationship between the sepal length and sepal width among *iris setosa* flowers. Below is an initial split and visualization:

library(dplyr) setosa <- iris %>% filter(Species == "setosa") %>% select(Sepal.Length, Sepal.Width) plot(Sepal.Width ~ Sepal.Length, data = setosa)

There is an obvious relationship between the variables. Thus we should expect the test to reject the null hypothesis. That is what we would conclude if we were to run the conventional

res <- lm(Sepal.Width ~ Sepal.Length, data = setosa) summary(res) ## ## Call: ## lm(formula = Sepal.Width ~ Sepal.Length, data = setosa) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.72394 -0.18273 -0.00306 0.15738 0.51709 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.5694 0.5217 -1.091 0.281 ## Sepal.Length 0.7985 0.1040 7.681 6.71e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2565 on 48 degrees of freedom ## Multiple R-squared: 0.5514, Adjusted R-squared: 0.542 ## F-statistic: 58.99 on 1 and 48 DF, p-value: 6.71e-10

Let’s now implement the procedure I described with `MCHTest()`

.

ts <- function(x) { res <- lm(Sepal.Width ~ Sepal.Length, data = x) summary(res)$fstatistic[[1]] # Only way I know to automatically compute the # statistic } # rand_gen's function can use both x and n, and n will be the number of rows of # the dataset rg <- function(x, n) { x$Sepal.Width <- sample(x$Sepal.Width, replace = TRUE, size = n) x } b.f.test.1 <- MCHTest(ts, ts, rg, seed = 123, N = 1000) b.f.test.1(setosa) ## ## Monte Carlo Test ## ## data: setosa ## S = 58.994, p-value < 2.2e-16

Excellent! It reached the correct conclusion.

## Conclusion

One may naturally ask whether we can write functions a bit more general than what I’ve shown here at least in the regression context. For example, one may want parameters specifying a formula so that the regression model isn’t hard-coded into the test. In short, the answer is yes; `MCHTest`

objects try to pass as many parameters to the input functions as they can.

Here is the revised example that works for basically any formula:

ts <- function(x, formula) { res <- lm(formula = formula, data = x) summary(res)$fstatistic[[1]] } rg <- function(x, n, formula) { dep_var <- all.vars(formula)[1] # Get the name of the dependent variable x[[dep_var]] <- sample(x[[dep_var]], replace = TRUE, size = n) x } b.f.test.2 <- MCHTest(ts, ts, rg, seed = 123, N = 1000) b.f.test.2(setosa, formula = Sepal.Width ~ Sepal.Length) ## ## Monte Carlo Test ## ## data: setosa ## S = 58.994, p-value < 2.2e-16

This shows that you can have a lot of control over how `MCHTest`

objects handle their inputs, giving you considerable flexibility.

Next post: time series and **MCHT**

## References

- R. A. Fisher,
*The design of experiments*(1935)

Packt Publishing published a book for me entitled *Hands-On Data Analysis with NumPy and Pandas*, a book based on my video course *Unpacking NumPy and Pandas*. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!

**leave a comment**for the author, please follow the link and comment on their blog:

**R – Curtis Miller's Personal Website**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.