Jeffreys’ Substitution Posterior for the Median: A Nice Trick to Non-parametrically Estimate the Median

Posted on May 3, 2014 by Rasmus Bååth in R bloggers | 0 Comments

[This article was first published on Publishable Stuff, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

While reading up on quantile regression I found a really nice hack described in Bayesian Quantile Regression Methods (Lancaster & Jae Jun, 2010). It is called Jeffreys’ substitution posterior for the median, first described by Harold Jeffreys in his Theory of Probability, and is a non-parametric method for approximating the posterior of the median. What makes it cool is that it is really easy to understand and pretty simple to compute, while making no assumptions about the underlying distribution of the data. The method does not strictly produce a posterior distribution, but has be shown to produce a conservative approximation to a valid posterior (Lavine, 1995). In this post I will try to explain Jeffreys’ substitution posterior, give R-code that implements it and finally compare it with a classical non-parametric test, the Wilcoxon signed-rank test. But first a picture of Sir Harold Jeffreys:

Say we have a vector $x$ and we are interested in the median of the underlying distribution. The property of the median is that it sits smack in the middle of the distribution. We would expect that, on average, half of the datapoints of a vector generated from a distribution would fall to the left of the median (and so the other half would fall to the right). Now, given the vector $x$ of length $n$ and a candidate for the median $m$, let $n_l$ be the number of values that fall to the left of $m$ and $n_r$ be the number of values that fall to the right. If $m$ was the actual median the probability of getting the $n_l$ and $n_r$ that we actually got could be calculated as

$$ \text{The number of ways of selecting } n_l \text{ out of } n \text{ values in } x\over \text{The number of ways of placing the } n \text{ values in x into two categories } (n_l\text{ and } n_r) $$

Or using notation:

$${{n}\choose{n_l}} \big / 2^n$$

Here is an example:

Jeffreys proposed to use the probability ${{n}\choose{n_l}} \big / 2^n$ as a substitution for the likelihood, $p(x \mid m)$, and then calculate $p(m \mid x) \propto p(x \mid m) \cdot p(m)$.

By doing this we get something posterior-like we can do the usual stuff with, calculate credible intervals, calculate the probability that the median is smaller than this and that, etc. Before doing this we will just rewrite the substitution likelihood $p(x \mid m)$ to make it easier to work with. We can rewrite the $n_l$-combinations statement as

$${{n}\choose{n_l}} = {n! \over n_l!n_r!}$$

resulting in

$${{{n}\choose{n_l}} \big / 2^n} = {{{n! \over n_l!n_r!2^n}} }$$

If we change the candidate for the median $m$ the only numbers that are going to change are $n_l!$ and $n_r!$, $n!$ and $2^n$ stays the same. So, if we are interested in an expression that is proportional to the likelihood we could just use

$$ p(x \mid m) \propto {1 \over n_l!n_r!}$$

If we assume a uniform prior over $m$ we get that

$$p(m \mid x) \propto p(x \mid m) \propto {1 \over n_l!n_r!}$$

Great! Now we have a method to go from a vector of data, $x$, to a posterior probability for the median, $p(m \mid x)$.

Doing it in R

Using Jeffreys’ substitution posterior in R is pretty simple, but we have to be cautious not to run into numerical trouble. Directly calculating $1 / (n_l!n_r!)$ will only work for small datasets, already factorial(180) results in values so large that you’ll get an error in R. To overcome this problem we’ll do all the calculations on the log-scale as far as possible, using lfactorial instead. We’ll also use the fact that all candidate $m$ between two datapoints have the same likelihood, and so it suffices to calculate the likelihood once for each interval between two datapoints. First, here is the code for plotting the posterior:

# Generating some test data 
x <- rnorm(20, mean=0, sd=1)

x <- sort(x)
n <- length(x)
# Now calculating the "likelihood" for the median being
# in each of the intervals between the xs. Easily done with the following:
# 1 / (factorial(1:(n - 1)) * factorial((n - 1):1))
# but on the log-scale to avoid numerical problems
loglike <- 1 - ( lfactorial(1:(n - 1)) + lfactorial((n - 1):1) )
# Going back to the likelihood scale, subtracting max(loglike) to avoid 
# floating-point underflow.
like <- exp(loglike - max(loglike))
# Plotting with type="s" which plots a step function.
plot(x, c(like, 0), type="s", xlim=c(-2, 2), ylab="Likelihood (sort of)", xlab="median")

Nice, we got a plot of the posterior which looks sort of like a skyscraper, but this is alright, the posterior is supposed to be a step function. Looking at the plot we can see that, most likely, the median is between -0.5 and 1.0 (which seems alright as we actually know that the true median is 0.0). However, the representation of the posterior that we currently have in R is not very useful if we would want to calculate credible intervals or the probability of the median being higher than, for example 0.0. A more useful format would be to have a vector of random samples drawn from the posterior. This is also pretty easy to do in R, but first we need a helper function:

# A more numerically stable way of calculating log( sum( exp( x ))) Source:
# http://r.789695.n4.nabble.com/logsumexp-function-in-R-td3310119.html
logsumexp <- function(x) {
    xmax <- which.max(x)
    log1p(sum(exp(x[-xmax] - x[xmax]))) + x[xmax]
}