Jeffreys’ Substitution Posterior for the Median: A Nice Trick to Non-parametrically Estimate the Median

[This article was first published on Publishable Stuff, and kindly contributed to R-bloggers.]


While reading up on quantile regression I found a really nice hack described in Bayesian Quantile Regression Methods (Lancaster & Jae Jun, 2010). It is called Jeffreys’ substitution posterior for the median, first described by Harold Jeffreys in his Theory of Probability, and it is a non-parametric method for approximating the posterior of the median. What makes it cool is that it is really easy to understand and pretty simple to compute, while making no assumptions about the underlying distribution of the data. The method does not strictly produce a posterior distribution, but it has been shown to produce a conservative approximation to a valid posterior (Lavine, 1995). In this post I will try to explain Jeffreys’ substitution posterior, give R code that implements it, and finally compare it with a classical non-parametric test, the Wilcoxon signed-rank test. But first, a picture of Sir Harold Jeffreys:

[Photo of Sir Harold Jeffreys]

Say we have a vector $x$ and we are interested in the median of the underlying distribution. The defining property of the median is that it sits smack in the middle of the distribution: on average, half of the datapoints generated from the distribution will fall to the left of the median (and the other half to the right). Now, given the vector $x$ of length $n$ and a candidate for the median $m$, let $n_l$ be the number of values that fall to the left of $m$ and $n_r$ the number of values that fall to the right. If $m$ were the actual median, the probability of getting the $n_l$ and $n_r$ we actually got could be calculated as

$$ \text{The number of ways of selecting } n_l \text{ out of } n \text{ values in } x\over \text{The number of ways of placing the } n \text{ values in x into two categories } (n_l\text{ and } n_r) $$

Or using notation:

$${{n}\choose{n_l}} \big / 2^n$$

Here is an example:
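As a toy example (my own numbers, not from the original post): each datapoint independently falls left or right of the true median with probability 1/2, so with $n = 6$ datapoints, the probability of seeing $n_l = 2$ of them to the left of a candidate $m$ is:

```r
# Number of points left of the true median is Binomial(n, 1/2).
# With n = 6 points and n_l = 2 to the left of the candidate m:
n <- 6
n_l <- 2
choose(n, n_l) / 2^n
## [1] 0.234375
```

This is just `dbinom(n_l, n, 0.5)`, which gives the same answer.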

Jeffreys proposed to use the probability ${{n}\choose{n_l}} \big / 2^n$ as a substitution for the likelihood, $p(x \mid m)$, and then calculate $p(m \mid x) \propto p(x \mid m) \cdot p(m)$.

By doing this we get something posterior-like that we can do the usual stuff with: calculate credible intervals, calculate the probability that the median is smaller than some value, etc. Before doing that, we will rewrite the substitution likelihood $p(x \mid m)$ to make it easier to work with. We can rewrite the binomial coefficient as

$${{n}\choose{n_l}} = {n! \over n_l!n_r!}$$

resulting in

$${{{n}\choose{n_l}} \big / 2^n} = {{{n! \over n_l!n_r!2^n}} }$$

If we change the candidate for the median $m$, the only numbers that change are $n_l!$ and $n_r!$; $n!$ and $2^n$ stay the same. So, if we are interested in an expression that is merely proportional to the likelihood we can just use

$$ p(x \mid m) \propto {1 \over n_l!n_r!}$$

If we assume a uniform prior over $m$ we get that

$$p(m \mid x) \propto p(x \mid m) \propto {1 \over n_l!n_r!}$$

Great! Now we have a method to go from a vector of data, $x$, to a posterior probability for the median, $p(m \mid x)$.
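As a quick numerical sanity check (my addition, not in the original post), we can verify that $1/(n_l!\,n_r!)$ really is proportional to the full probability ${n \choose n_l} / 2^n$, with constant of proportionality $n!/2^n$:

```r
n <- 10
n_l <- 0:n
n_r <- n - n_l
p_full <- choose(n, n_l) / 2^n                    # full substitution likelihood
p_prop <- 1 / (factorial(n_l) * factorial(n_r))   # simplified version
# The ratio is the same constant, n! / 2^n, for every value of n_l:
unique(round(p_full / p_prop - factorial(n) / 2^n, 10))
## [1] 0
```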

Doing it in R

Using Jeffreys’ substitution posterior in R is pretty simple, but we have to be careful not to run into numerical trouble. Directly calculating $1 / (n_l!n_r!)$ will only work for small datasets; already factorial(180) overflows to infinity in R. To overcome this problem we’ll do all the calculations on the log scale as far as possible, using lfactorial instead of factorial. We’ll also use the fact that every candidate $m$ between two adjacent datapoints has the same likelihood, so it suffices to calculate the likelihood once for each interval between two datapoints. First, here is the code for plotting the posterior:

```r
# Generating some test data
x <- rnorm(20, mean = 0, sd = 1)

x <- sort(x)
n <- length(x)
# Now calculating the "likelihood" for the median being in each of the
# intervals between the xs. On the natural scale this would be
# 1 / (factorial(1:(n - 1)) * factorial((n - 1):1))
# but we work on the log scale to avoid numerical problems.
loglike <- -( lfactorial(1:(n - 1)) + lfactorial((n - 1):1) )
# Going back to the likelihood scale, subtracting max(loglike) to avoid
# floating-point underflow.
like <- exp(loglike - max(loglike))
# Plotting with type = "s", which plots a step function.
plot(x, c(like, 0), type = "s", xlim = c(-2, 2),
     ylab = "Likelihood (sort of)", xlab = "median")
```

[Plot: the substitution likelihood for the median, drawn as a step function over the data.]

Nice, we got a plot of the posterior, which looks sort of like a skyscraper skyline, but that is alright: the posterior is supposed to be a step function. Looking at the plot we can see that, most likely, the median is between -0.5 and 1.0 (which seems reasonable, as we actually know that the true median is 0.0). However, the representation of the posterior that we currently have in R is not very useful if we want to calculate credible intervals or the probability of the median being larger than, for example, 0.0. A more useful format would be a vector of random samples drawn from the posterior. This is also pretty easy to do in R, but first we need a helper function:

```r
# A more numerically stable way of calculating log(sum(exp(x))). Source:
# http://r.789695.n4.nabble.com/logsumexp-function-in-R-td3310119.html
logsumexp <- function(x) {
    xmax <- which.max(x)
    log1p(sum(exp(x[-xmax] - x[xmax]))) + x[xmax]
}
```
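A quick check of the helper (repeating its definition so the snippet runs on its own): it agrees with the naive log(sum(exp(x))) on small inputs, and stays finite where the naive version overflows:

```r
logsumexp <- function(x) {
    xmax <- which.max(x)
    log1p(sum(exp(x[-xmax] - x[xmax]))) + x[xmax]
}
x <- c(-1, 0, 2)
logsumexp(x) - log(sum(exp(x)))  # essentially zero
# The naive version overflows here (exp(1000) is Inf), logsumexp does not:
logsumexp(c(1000, 1001))
## [1] 1001.313
```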

Then we calculate the proportional likelihood of the median being in each interval, transform this into a probability, and finally use these probabilities to sample from the posterior. I’m going to be slightly sloppy and disregard the posterior mass that lies outside the range of the data; with more than a dozen datapoints this sloppiness is negligible.

```r
# like * diff(x), but on the log scale. Calculates the proportional
# likelihood of the median being in each interval, like:
# c( like[1] * (x[2] - x[1]), like[2] * (x[3] - x[2]), ... )
interval_loglike <- loglike + log(diff(x))
# Normalizing the likelihood to get to the probability scale
interval_prob <- exp(interval_loglike - logsumexp(interval_loglike))
# Sampling intervals proportional to their probability
n_samples <- 10000
samp_inter <- sample.int(n - 1, n_samples, replace = TRUE, prob = interval_prob)
# Then, within each sampled interval, draw a uniform sample.
s <- runif(n_samples, x[samp_inter], x[samp_inter + 1])
```

Now that we have the samples in s we can do all the usual stuff, such as calculating a 95% credible interval:

```r
quantile(s, c(0.025, 0.975))
##    2.5%   97.5%
## -0.4378  0.6978
```

or plotting the posterior as a histogram with a superimposed 95% CI:

```r
hist(s, 30, freq = FALSE, main = "Posterior", xlab = "median")
lines(quantile(s, c(0.025, 0.975)), c(0, 0), col = "red", lwd = 6)
legend("topright", legend = "95% CI", col = "red", lwd = 6)
```

[Plot: histogram of the posterior samples with the 95% CI marked in red.]

A Handy Function to Copy-n-Paste

Here is a handy function you should be able to copy-n-paste directly into R. It packages all the calculations done above into one function that plots the posterior, calculates a 95% CI and returns a vector of samples that can be further inspected. I haven’t tested this function extensively so use it at your own risk and please tell me if you find any bugs.

```r
# A more numerically stable way of calculating log(sum(exp(x))). Source:
# http://r.789695.n4.nabble.com/logsumexp-function-in-R-td3310119.html
logsumexp <- function(x) {
    xmax <- which.max(x)
    log1p(sum(exp(x[-xmax] - x[xmax]))) + x[xmax]
}

# Silently returns samples from Jeffreys’ substitution posterior given a
# vector of data (x). Also produces a histogram of the posterior and prints
# out a 95% quantile credible interval. Assumes a non-informative uniform
# [-Inf, Inf] prior over the median.
jeffreys_median <- function(x, n_samples = 10000, draw_plot = TRUE) {
    x <- sort(x)
    n <- length(x)
    loglike <- -lfactorial(1:(n - 1)) - lfactorial((n - 1):1)
    interval_loglike <- loglike + log(diff(x))
    interval_prob <- exp(interval_loglike - logsumexp(interval_loglike))
    samp_inter <- sample.int(n - 1, n_samples, replace = TRUE, prob = interval_prob)
    s <- runif(n_samples, x[samp_inter], x[samp_inter + 1])
    cat("\n  Jeffreys’ Substitution Posterior for the median\n\n")
    cat("median\n  ", median(x), "\n")
    cat("95% CI\n  ", quantile(s, c(0.025, 0.975)), "\n")
    if (draw_plot) {
        hist(s, 30, freq = FALSE, main = "Posterior of the median", xlab = "median")
        lines(quantile(s, c(0.025, 0.975)), c(0, 0), col = "red", lwd = 5)
        legend("topright", legend = "95% CI", col = "red", lwd = 5)
    }
    invisible(s)
}
```

jeffreys_median vs wilcox.test

Finally, I thought I would compare Jeffreys’ substitution posterior with one of the most commonly used classical non-parametric tests, the Wilcoxon signed-rank test. From the documentation for wilcox.test:

```r
## Hollander & Wolfe (1973), 29f.  Hamilton depression scale factor
## measurements in 9 patients with mixed anxiety and depression, taken at the
## first (x) and second (y) visit after initiation of a therapy
## (administration of a tranquilizer).
x <- c(1.83, 0.5, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.3)
y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)
wilcox.test(y - x, alternative = "less")
## 
##  Wilcoxon signed rank test
## 
## data:  y - x
## V = 5, p-value = 0.01953
## alternative hypothesis: true location is less than 0
```

So wilcox.test gives us a pretty small p-value, which could be taken as evidence that the distribution of y - x is not symmetric about 0.0, but not much more. Let’s rerun the analysis using the R function we defined earlier.

```r
s <- jeffreys_median(y - x)
## 
##   Jeffreys’ Substitution Posterior for the median
## 
## median
##    -0.49 
## 95% CI
##    -0.9133 0.04514
```

[Plot: histogram of the posterior of the median difference, with the 95% CI marked in red.]

Looking at the posterior we see that the most likely median difference is around -0.5, but it could plausibly be as low as -1 or as high as 0.1. Using the samples in s we can calculate the probability that the median difference is less than 0:

```r
mean(s < 0)
## [1] 0.9523
```

So, given the data and Jeffreys’ substitution posterior, there is more than 95% probability that the median difference is less than 0. Compared with the Wilcoxon signed-rank test, I think Jeffreys’ substitution posterior is a good alternative when you are interested in the median but otherwise want to make as few assumptions as possible. However, I don’t want to recommend Jeffreys’ substitution posterior in general: precisely because it makes so few assumptions, almost any model based on reasonable assumptions will be more powerful and will make better use of the data. Still, I think Jeffreys’ substitution posterior is a cool method with a clean, intuitive definition, and I just wonder: why isn’t it better known?
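One further observation of mine, not from the original post: since the substitution likelihood only uses how many datapoints fall on each side of $m$, it is essentially a Bayesian cousin of the classical sign test. For the same data, the sign test gives:

```r
x <- c(1.83, 0.5, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.3)
y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)
d <- y - x
# The sign test: 7 of the 9 differences are negative; how surprising is
# that if the median difference were really 0?
binom.test(sum(d < 0), length(d), p = 0.5, alternative = "greater")$p.value
## [1] 0.08984375
```

Like Jeffreys’ posterior, the sign test ignores the magnitudes of the differences, which is why it is less decisive here than the signed-rank test.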

References

Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford: Clarendon Press.

Lancaster, T., & Jae Jun, S. (2010). Bayesian quantile regression methods. Journal of Applied Econometrics, 25(2), 287-307.

Lavine, M. (1995). On an approximate likelihood for quantiles. Biometrika, 82(1), 220-222.
