Universal inference in R

When we perform a statistical test we’d like to have confidence that the
Type I error rate (the rate of false positive findings) is what we say
it is. Often scientists choose a rate (AKA ‘alpha’) of 0.05 for
rejecting the null hypothesis.

We’d like some guarantee that our statistical test will actually have a
Type I error rate of 5% over many, many repeated trials. For many
statistical tests, theory does in fact provide that guarantee.

The paper Universal inference by Wasserman et al., published in PNAS,
proposes a new test that guarantees a given Type I error rate for some
types of statistical models that previously had no valid test.

Here I’ll attempt to code their method in R.

I say ‘attempt’ because I’m not at all sure I interpreted their paper
correctly. So please don’t take this blog as a ‘how to’ for the method.
Instead, treat it as an exploration of type I and type II errors and how
we can check them.

Feel free to email me ([email protected]) if you have
suggestions on this blog.

The method is based on likelihood ratio tests, so at the very least
we’ll get to learn about LRTs in this blog.

If you are seriously interested in using universal inference, I suggest
you consult a statistician.

Testing the null that a mean equals zero

Let’s first test whether some data drawn from a normal distribution are
consistent with a null hypothesis that the mean = 0.

We’ll define the data:

mu <- 0.4 #true mean
n <- 100 #sample size
set.seed(42)
y <- rnorm(n, mean = mu)

Now, do a likelihood ratio test the usual way:

(test_stat <- 2*(sum(dnorm(y, mean(y), sd(y), log = TRUE)) - sum(dnorm(y, 0, sd(y), log = TRUE))))

## [1] 17.25054

1 - pchisq(test_stat, 1)

## [1] 3.276042e-05

We just compared the likelihood of the data given a mean of zero with
the likelihood given the maximum likelihood estimate of the mean (which
is just the sample mean). Twice the difference in log-likelihoods is,
under the null, approximately chi-squared distributed with one degree
of freedom, so we look up the test statistic on that distribution to
get our (very small) p-value = 3.28e-05.
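
If the chi-squared approximation is doing its job, this test should
reject a true null about 5% of the time at alpha = 0.05. As a quick
sanity check (a minimal sketch of my own, not from the paper), we can
simulate data under the null many times and count how often we reject:

sim_lrt_p <- function(seed, n) {
  set.seed(seed)
  y <- rnorm(n, mean = 0) #simulate under the null: true mean is 0
  test_stat <- 2*(sum(dnorm(y, mean(y), sd(y), log = TRUE)) -
                    sum(dnorm(y, 0, sd(y), log = TRUE)))
  1 - pchisq(test_stat, 1) #p-value from the chi-squared approximation
}
pvals <- sapply(1:1000, sim_lrt_p, n = 100)
mean(pvals <= 0.05) #should come out close to 0.05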

If we were using linear models, we could do the same thing like this:

m1 <- lm(y~1) #intercept only, ie a mean
m0 <- lm(y~0) #no intercept, so mean = 0
anova(m0, m1, test = "Chisq")

## Analysis of Variance Table
##
## Model 1: y ~ 0
## Model 2: y ~ 1
##   Res.Df    RSS Df Sum of Sq  Pr(>Chi)    
## 1    100 126.06                           
## 2     99 107.36  1    18.707 3.276e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Split LRT

Wasserman et al. propose a ‘split LRT’ where we split the data in half.
This has some pleasing similarities to out-of-sample validation. As I
understand it, this is how we’d do the test:

First split the data at random:

i <- sample(1:n, n/2, replace = FALSE)
y0 <- y[i]
y1 <- y[-i]

Now calculate the MLEs for both splits:

mu0 <- mean(y0)
sd0 <- sd(y0)

mu1 <- mean(y1)
sd1 <- sd(y1)

Then our split test statistic is (note we are taking the likelihood of
the y0 split using the y1 split’s MLEs):

split_test_stat0 <- sum(dnorm(y0, mu1, sd1, log = TRUE)) -
                         sum(dnorm(y0, 0, sd1, log = TRUE))

And we can ask if it’s significant like this:

exp(split_test_stat0) >= 1/0.05

## [1] TRUE

TRUE, so we reject the null, which is the same result as our chi-squared
test above. The threshold of 1/alpha comes from Markov’s inequality:
under the null the likelihood ratio has expected value at most 1, so it
exceeds 1/0.05 = 20 with probability at most 0.05. This is what gives
the split LRT its finite-sample Type I error guarantee.

They also propose a cross-fit test, which is just the average of the
two split test statistics:

split_test_stat1 <- sum(dnorm(y1, mu0, sd0, log = TRUE)) -
                         sum(dnorm(y1, 0, sd0, log = TRUE))
split_test_stat <- (split_test_stat1 + split_test_stat0)/2
exp(split_test_stat) >= 1/0.05

## [1] TRUE

Power

What about the test’s power? A shortcoming is that the split LRT can
have lower power (a higher Type II error rate, i.e. a greater chance of
missing real differences) than other tests. So for a simple test like
the one above we are better off doing the test the regular way.

Let’s check its power for our simple test versus the chi-squared test.
I’ll write a function to do this, then iterate it.

splitLRT <- function(seed, n, mu){
  set.seed(seed)
  y <- rnorm(n, mean = mu)
  i <- sample(1:n, n/2, replace = FALSE)

  y0 <- y[i]
  y1 <- y[-i]

  mu0 <- mean(y0)
  sd0 <- sd(y0)

  mu1 <- mean(y1)
  sd1 <- sd(y1)

  #split test statistics: each half's likelihood evaluated
  #with the other half's MLEs
  split_test_stat0 <- sum(dnorm(y0, mu1, sd1, log = TRUE)) -
                         sum(dnorm(y0, 0, sd1, log = TRUE))
  split_test_stat1 <- sum(dnorm(y1, mu0, sd0, log = TRUE)) -
                         sum(dnorm(y1, 0, sd0, log = TRUE))
  split_test_stat <- (split_test_stat1 + split_test_stat0)/2


  #regular Chisq LRT
  test_stat <- 2*(sum(dnorm(y, mean(y), sd(y), log = TRUE)) - sum(dnorm(y, 0, sd(y), log = TRUE)))
  chisqtest <- 1 - pchisq(test_stat, 1)

  #output results as a dataframe
  data.frame(splitLRT = exp(split_test_stat), chisq = chisqtest)
}

Now let’s use our function:

xout <- lapply(1:1000, splitLRT, n = 50, mu = 0.5)
dfout <- do.call("rbind", xout)
sum(dfout$splitLRT >= (1/0.05))/1000

## [1] 0.415

sum(dfout$chisq <= 0.05)/1000

## [1] 0.932

So the split test only rejects the null 41.5% of the time, whereas the
chi-squared test rejects it 93.2% of the time. In other words, the
split test’s guarantee comes at the cost of lower power, as is
explained in the paper.
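
We can reuse the splitLRT function to check the Type I error rates too,
by simulating under the null (mu = 0). If the universal guarantee
holds, the split test’s rejection rate should be at most 5% (it tends
to be conservative in practice), while the chi-squared test should sit
close to 5%:

xnull <- lapply(1:1000, splitLRT, n = 50, mu = 0)
dfnull <- do.call("rbind", xnull)
sum(dfnull$splitLRT >= (1/0.05))/1000 #expect at most about 0.05
sum(dfnull$chisq <= 0.05)/1000 #expect close to 0.05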

It would be worth trying the suggestion in the paper of using k-fold
cross-validation to do the splits too; maybe that would improve the
power.
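
Here is a minimal sketch of one way that might look: average the
cross-fit statistic (on the likelihood-ratio scale) over K random
splits. Note this is my interpretation of the idea, not a verified
implementation of the paper’s procedure:

subsampledLRT <- function(y, K = 10) {
  n <- length(y)
  stats <- sapply(1:K, function(k) {
    i <- sample(1:n, n/2, replace = FALSE)
    y0 <- y[i]
    y1 <- y[-i]
    #cross-fit statistic for this split, on the ratio scale
    s0 <- exp(sum(dnorm(y0, mean(y1), sd(y1), log = TRUE)) -
                sum(dnorm(y0, 0, sd(y1), log = TRUE)))
    s1 <- exp(sum(dnorm(y1, mean(y0), sd(y0), log = TRUE)) -
                sum(dnorm(y1, 0, sd(y0), log = TRUE)))
    (s0 + s1)/2
  })
  mean(stats) #average over the K random splits
}
subsampledLRT(y) >= 1/0.05 #reject at alpha = 0.05 if TRUE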

For larger sample sizes, the split test method does better:

xout <- lapply(1:1000, splitLRT, n = 150, mu = 0.5)
dfout <- do.call("rbind", xout)
sum(dfout$splitLRT >= (1/0.05))/1000

## [1] 0.974

sum(dfout$chisq <= 0.05)/1000

## [1] 1

What next?

The example I give above is illustrative only; there are theoretical
reasons not to use the regular chi-squared test for such a simple model
(a t-test would be the standard exact choice here).

It would be interesting to try the split LRT for more sophisticated
models where valid tests don’t currently exist. Wasserman et al.
suggest its use for a number of models that ecologists use regularly,
including mixture models (where data are generated from a mixture of
distributions), testing latent variables (such as in the multispecies
hierarchical models, like BORAL, that we love here) and structural
equation models.
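
As a taste of the mixture case, here is a hypothetical sketch of a
split LRT comparing a one-component null against a two-component
alternative. It assumes the mixtools package for fitting the mixture,
and it is my reading of the recipe rather than a tested implementation:

library(mixtools)

set.seed(1)
y <- c(rnorm(100, 0), rnorm(100, 3)) #data from two components
i <- sample(seq_along(y), length(y)/2, replace = FALSE)
y0 <- y[i]
y1 <- y[-i]

#alternative: two-component mixture fitted on y1, evaluated on y0
fit1 <- normalmixEM(y1, k = 2)
loglik_alt <- sum(log(fit1$lambda[1] * dnorm(y0, fit1$mu[1], fit1$sigma[1]) +
                        fit1$lambda[2] * dnorm(y0, fit1$mu[2], fit1$sigma[2])))

#null: a single normal, with its MLEs taken from y0 itself
loglik_null <- sum(dnorm(y0, mean(y0), sd(y0), log = TRUE))

exp(loglik_alt - loglik_null) >= 1/0.05 #reject at alpha = 0.05 if TRUE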
