Bayesian Estimation of Correlation – Now Robust!

[This article was first published on Publishable Stuff, and kindly contributed to R-bloggers.]


In the last post I showed how to run the Bayesian counterpart of Pearson’s correlation test by estimating the parameters of a bivariate normal distribution. A problem with assuming normality is that the normal distribution isn’t robust against outliers. Let’s see what happens if we take the data from the last post (the finishing times and weights of the runners in the men’s 100 m semi-finals at the 2013 World Championships in Athletics) and introduce an outlier. This is how the original data looks:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">d
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">##                    runner  time weight</span>
<span style="color: #408080; font-style: italic">## 1              Usain Bolt  9.92     94</span>
<span style="color: #408080; font-style: italic">## 2           Justin Gatlin  9.94     79</span>
<span style="color: #408080; font-style: italic">## 3            Nesta Carter  9.97     78</span>
<span style="color: #408080; font-style: italic">## 4       Kemar Bailey-Cole  9.93     83</span>
<span style="color: #408080; font-style: italic">## 5         Nickel Ashmeade  9.90     77</span>
<span style="color: #408080; font-style: italic">## 6            Mike Rodgers  9.93     76</span>
<span style="color: #408080; font-style: italic">## 7     Christophe Lemaitre 10.00     74</span>
<span style="color: #408080; font-style: italic">## 8           James Dasaolu  9.97     87</span>
<span style="color: #408080; font-style: italic">## 9           Zhang Peimeng 10.00     86</span>
<span style="color: #408080; font-style: italic">## 10           Jimmy Vicaut 10.01     83</span>
<span style="color: #408080; font-style: italic">## 11         Keston Bledman 10.08     75</span>
<span style="color: #408080; font-style: italic">## 12       Churandy Martina 10.09     74</span>
<span style="color: #408080; font-style: italic">## 13         Dwain Chambers 10.15     92</span>
<span style="color: #408080; font-style: italic">## 14           Jason Rogers 10.15     69</span>
<span style="color: #408080; font-style: italic">## 15          Antoine Adams 10.17     79</span>
<span style="color: #408080; font-style: italic">## 16        Anaso Jobodwana 10.17     71</span>
<span style="color: #408080; font-style: italic">## 17       Richard Thompson 10.19     80</span>
<span style="color: #408080; font-style: italic">## 18          Gavin Smellie 10.30     80</span>
<span style="color: #408080; font-style: italic">## 19          Ramon Gittens 10.31     77</span>
<span style="color: #408080; font-style: italic">## 20 Harry Aikines-Aryeetey 10.34     87</span>
</pre></div>

What if Usain Bolt had eaten mutant cheeseburgers, gaining both weight and the ability to run even faster?

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">d[d<span style="color: #666666">$</span>runner <span style="color: #666666">==</span> <span style="color: #BA2121">"Usain Bolt"</span>, c(<span style="color: #BA2121">"time"</span>, <span style="color: #BA2121">"weight"</span>)] <span style="color: #666666"><-</span> c(<span style="color: #666666">9.5</span>, <span style="color: #666666">115</span>)
plot(d<span style="color: #666666">$</span>time, d<span style="color: #666666">$</span>weight)
</pre></div>

(Plot: finishing time vs. weight, with the modified Usain Bolt data point standing out as a clear outlier)

Usain Bolt now looks like a real outlier, and he is one: he’s not representative of the majority of runners, who do not eat mutant cheeseburgers and who are the ones we want to make inferences about. How does this single outlier influence the estimated correlation between finishing time and weight? Let’s rerun the correlation analysis from the last post and compare:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">data_list <span style="color: #666666">=</span> list(x <span style="color: #666666">=</span> d[, c(<span style="color: #BA2121">"time"</span>, <span style="color: #BA2121">"weight"</span>)], n <span style="color: #666666">=</span> nrow(d))
<span style="color: #408080; font-style: italic"># Use classical estimates of the parameters as initial values</span>
inits_list <span style="color: #666666">=</span> list(mu <span style="color: #666666">=</span> c(mean(d<span style="color: #666666">$</span>time), mean(d<span style="color: #666666">$</span>weight)), rho <span style="color: #666666">=</span> cor(d<span style="color: #666666">$</span>time, d<span style="color: #666666">$</span>weight),
    sigma <span style="color: #666666">=</span> c(sd(d<span style="color: #666666">$</span>time), sd(d<span style="color: #666666">$</span>weight)))
jags_model <span style="color: #666666"><-</span> jags.model(textConnection(model_string), data <span style="color: #666666">=</span> data_list, inits <span style="color: #666666">=</span> inits_list,
    n.adapt <span style="color: #666666">=</span> <span style="color: #666666">500</span>, n.chains <span style="color: #666666">=</span> <span style="color: #666666">3</span>, quiet <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>)
update(jags_model, <span style="color: #666666">500</span>)
mcmc_samples <span style="color: #666666"><-</span> coda.samples(jags_model, c(<span style="color: #BA2121">"rho"</span>, <span style="color: #BA2121">"x_rand[1]"</span>, <span style="color: #BA2121">"x_rand[2]"</span>),
    n.iter <span style="color: #666666">=</span> <span style="color: #666666">5000</span>)
samples_mat <span style="color: #666666"><-</span> as.matrix(mcmc_samples)
plot(d<span style="color: #666666">$</span>time, d<span style="color: #666666">$</span>weight)
dataEllipse(samples_mat[, c(<span style="color: #BA2121">"x_rand[1]"</span>, <span style="color: #BA2121">"x_rand[2]"</span>)], levels <span style="color: #666666">=</span> c(<span style="color: #666666">0.5</span>, <span style="color: #666666">0.95</span>),
    plot.points <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">FALSE</span>)
</pre></div>

(Plot: the data with the posterior predictive distribution overlaid as 50% and 95% ellipses)

The estimated distribution of the data indicates that there now is a strong negative correlation between finishing time and weight! Looking at the distribution of the rho parameter…

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">quantile(samples_mat[, <span style="color: #BA2121">"rho"</span>], c(<span style="color: #666666">0.025</span>, <span style="color: #666666">0.5</span>, <span style="color: #666666">0.975</span>))
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">##    2.5%     50%   97.5% </span>
<span style="color: #408080; font-style: italic">## -0.7773 -0.5208 -0.1127</span>
</pre></div>

… shows that the correlation is estimated at -0.52 (95% CI: [-0.78, -0.11]), which is very different from the estimate for the outlier-free data, -0.10 (95% CI: [-0.51, 0.35]). It’s easy to argue that this much influence from one single outlier is not reasonable.
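For comparison, the classical point estimate is just as sensitive. A quick check (my own aside, not from the original analysis) contrasts Pearson’s correlation on the outlier-modified data with Spearman’s rank correlation, which is far less affected by a single extreme point:

```r
# Classical Pearson correlation on the outlier-modified data
cor(d$time, d$weight)

# Spearman's rank correlation only uses ranks, so the single
# outlier moves it much less
cor(d$time, d$weight, method = "spearman")
```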

What you often actually want to assume is that the bulk of the data is normally distributed while still allowing for the occasional outlier. One of the great benefits of Bayesian statistics is that it is easy to assume any parametric distribution for the data, and a good distribution when we want robustness is the t-distribution. This distribution is similar to the normal distribution except that it has an extra parameter $\nu$ (also called the degrees of freedom) that governs how heavy the tails of the distribution are. The smaller $\nu$ is, the heavier the tails are and the less the estimated mean and standard deviation of the distribution are influenced by outliers. When $\nu$ is large (say $\nu$ > 30) the t-distribution is very similar to the normal distribution. So in order to make our Bayesian correlation analysis more robust we’ll replace the multivariate normal distribution (dmnorm) from the last post with a multivariate t-distribution (dmt). To this we’ll add a prior on the $\nu$ parameter nu that allows both for completely normally distributed data and for normally distributed data with the occasional outlier (I’m using the same prior as in Kruschke’s Bayesian estimation supersedes the t test).
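To get a feel for how $\nu$ controls the tails, here is a small sketch (not part of the original analysis) overlaying a standard normal density on t-densities with a few values of $\nu$:

```r
# Overlay a standard normal density on t densities with increasing nu.
# The smaller nu is, the heavier the tails; by nu = 30 the two curves
# are nearly indistinguishable.
curve(dnorm(x), from = -6, to = 6, lwd = 2, ylab = "Density")
for (nu in c(1, 3, 30)) {
  curve(dt(x, df = nu), from = -6, to = 6, add = TRUE, lty = 2)
}
```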

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">robust_model_string <span style="color: #666666"><-</span> <span style="color: #BA2121">"</span>
<span style="color: #BA2121">  model {</span>
<span style="color: #BA2121">    for(i in 1:n) {</span>
<span style="color: #BA2121">      # We've replaced dmnorm with dmt ...</span>
<span style="color: #BA2121">      x[i,1:2] ~ dmt(mu[], prec[ , ], nu) </span>
<span style="color: #BA2121">    }</span>

<span style="color: #BA2121">    prec[1:2,1:2] <- inverse(cov[,])</span>
<span style="color: #BA2121">    </span>
<span style="color: #BA2121">    cov[1,1] <- sigma[1] * sigma[1]</span>
<span style="color: #BA2121">    cov[1,2] <- sigma[1] * sigma[2] * rho</span>
<span style="color: #BA2121">    cov[2,1] <- sigma[1] * sigma[2] * rho</span>
<span style="color: #BA2121">    cov[2,2] <- sigma[2] * sigma[2]</span>
<span style="color: #BA2121">    </span>
<span style="color: #BA2121">    sigma[1] ~ dunif(0, 1000) </span>
<span style="color: #BA2121">    sigma[2] ~ dunif(0, 1000)</span>
<span style="color: #BA2121">    rho ~ dunif(-1, 1)</span>
<span style="color: #BA2121">    mu[1] ~ dnorm(0, 0.0001)</span>
<span style="color: #BA2121">    mu[2] ~ dnorm(0, 0.0001)</span>
<span style="color: #BA2121">    </span>
<span style="color: #BA2121">    # ... and added a prior on the degree of freedom parameter nu.</span>
<span style="color: #BA2121">    nu <- nuMinusOne + 1</span>
<span style="color: #BA2121">    nuMinusOne ~ dexp(1/29)</span>

<span style="color: #BA2121">    x_rand ~ dmt(mu[], prec[ , ], nu)</span>
<span style="color: #BA2121">  }</span>
<span style="color: #BA2121">"</span>
</pre></div>
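A note on that prior (my own sanity check, not code from the model): since nu is 1 plus an exponential with rate 1/29, its prior mean is 30, splitting the prior mass between heavy-tailed and effectively normal models:

```r
# nu = 1 + nuMinusOne with nuMinusOne ~ Exponential(rate = 1/29),
# so the prior mean of nu is 30. The prior probability that nu
# exceeds 30 (where the t-distribution is close to normal) is:
1 - pexp(30 - 1, rate = 1/29)  # = exp(-1), about 0.37
```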

Now let’s run the model using JAGS…

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">data_list <span style="color: #666666">=</span> list(x <span style="color: #666666">=</span> d[, c(<span style="color: #BA2121">"time"</span>, <span style="color: #BA2121">"weight"</span>)], n <span style="color: #666666">=</span> nrow(d))
<span style="color: #408080; font-style: italic"># Use robust estimates of the parameters as initial values</span>
inits_list <span style="color: #666666">=</span> list(mu <span style="color: #666666">=</span> c(median(d<span style="color: #666666">$</span>time), median(d<span style="color: #666666">$</span>weight)), rho <span style="color: #666666">=</span> cor(d<span style="color: #666666">$</span>time,
    d<span style="color: #666666">$</span>weight, method <span style="color: #666666">=</span> <span style="color: #BA2121">"spearman"</span>), sigma <span style="color: #666666">=</span> c(mad(d<span style="color: #666666">$</span>time), mad(d<span style="color: #666666">$</span>weight)))
jags_model <span style="color: #666666"><-</span> jags.model(textConnection(robust_model_string), data <span style="color: #666666">=</span> data_list,
    inits <span style="color: #666666">=</span> inits_list, n.adapt <span style="color: #666666">=</span> <span style="color: #666666">500</span>, n.chains <span style="color: #666666">=</span> <span style="color: #666666">3</span>, quiet <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>)
update(jags_model, <span style="color: #666666">500</span>)
mcmc_samples <span style="color: #666666"><-</span> coda.samples(jags_model, c(<span style="color: #BA2121">"mu"</span>, <span style="color: #BA2121">"rho"</span>, <span style="color: #BA2121">"sigma"</span>, <span style="color: #BA2121">"nu"</span>, <span style="color: #BA2121">"x_rand"</span>),
    n.iter <span style="color: #666666">=</span> <span style="color: #666666">5000</span>)
</pre></div>

Here are the resulting trace plots and density plots…

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">par(mar <span style="color: #666666">=</span> rep(<span style="color: #666666">2.2</span>, <span style="color: #666666">4</span>))
plot(mcmc_samples)
</pre></div>

(Plots: trace plots and density plots for the sampled parameters)

… and here is a summary of the parameter estimates:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">summary(mcmc_samples)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## Iterations = 1001:6000</span>
<span style="color: #408080; font-style: italic">## Thinning interval = 1 </span>
<span style="color: #408080; font-style: italic">## Number of chains = 3 </span>
<span style="color: #408080; font-style: italic">## Sample size per chain = 5000 </span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## 1. Empirical mean and standard deviation for each variable,</span>
<span style="color: #408080; font-style: italic">##    plus standard error of the mean:</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">##             Mean      SD Naive SE Time-series SE</span>
<span style="color: #408080; font-style: italic">## mu[1]     10.065  0.0403 0.000329       0.000510</span>
<span style="color: #408080; font-style: italic">## mu[2]     79.641  1.9707 0.016091       0.030294</span>
<span style="color: #408080; font-style: italic">## nu        13.703 19.9326 0.162749       1.426841</span>
<span style="color: #408080; font-style: italic">## rho       -0.270  0.2441 0.001993       0.003985</span>
<span style="color: #408080; font-style: italic">## sigma[1]   0.163  0.0359 0.000293       0.000665</span>
<span style="color: #408080; font-style: italic">## sigma[2]   7.786  1.9440 0.015872       0.040905</span>
<span style="color: #408080; font-style: italic">## x_rand[1] 10.065  0.2105 0.001719       0.001732</span>
<span style="color: #408080; font-style: italic">## x_rand[2] 79.523 10.0099 0.081731       0.085990</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">## 2. Quantiles for each variable:</span>
<span style="color: #408080; font-style: italic">## </span>
<span style="color: #408080; font-style: italic">##             2.5%    25%    50%    75%  97.5%</span>
<span style="color: #408080; font-style: italic">## mu[1]      9.984 10.039 10.065 10.091 10.144</span>
<span style="color: #408080; font-style: italic">## mu[2]     75.976 78.350 79.572 80.857 83.778</span>
<span style="color: #408080; font-style: italic">## nu         2.381  4.547  7.515 14.350 61.519</span>
<span style="color: #408080; font-style: italic">## rho       -0.691 -0.448 -0.285 -0.108  0.241</span>
<span style="color: #408080; font-style: italic">## sigma[1]   0.105  0.138  0.160  0.184  0.245</span>
<span style="color: #408080; font-style: italic">## sigma[2]   4.718  6.374  7.546  8.952 12.198</span>
<span style="color: #408080; font-style: italic">## x_rand[1]  9.654  9.948 10.066 10.183 10.471</span>
<span style="color: #408080; font-style: italic">## x_rand[2] 59.926 73.963 79.466 85.092 99.637</span>
</pre></div>

Using the robust model we get a correlation of -0.28 (95% CI: [-0.69, 0.24]), which is much closer to the original estimate of -0.10 (95% CI: [-0.51, 0.35]). Looking at the estimate of the nu parameter, we see evidence for non-normality: the median nu is quite small at 7.5 (95% CI: [2.4, 61.5]).
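The quoted intervals can be read off the MCMC samples directly, in the same way as before:

```r
# Convert the coda samples to a matrix and pull out the
# posterior quantiles for rho and nu
samples_mat <- as.matrix(mcmc_samples)
quantile(samples_mat[, "rho"], c(0.025, 0.5, 0.975))
quantile(samples_mat[, "nu"], c(0.025, 0.5, 0.975))
```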

What about data with no outliers?

So now we have two Bayesian versions of Pearson’s correlation test: one normal and one robust. Do we always have to choose between the two models? No! I’d go for the robust version any day. Because it also estimates the heaviness of the tails of the bivariate t-distribution, the estimated distribution will be very close to a normal distribution whenever there is sufficient evidence for normality in the data. We can have our cake and eat it too!

To show that this works, I’ll now apply the robust model to the same data I used in the last post: 30 random draws from a bivariate normal distribution.
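The data isn’t regenerated in this post, but drawing such a sample could look like the following sketch using MASS::mvrnorm (the mean vector, covariance matrix, and seed here are placeholders, not the values used in the last post):

```r
# Hypothetical recreation: the exact mean vector, covariance matrix,
# and seed from the last post are not shown here.
library(MASS)
set.seed(42)  # placeholder seed
Sigma <- matrix(c(1, -0.7, -0.7, 1), nrow = 2)  # assumed correlation
x <- mvrnorm(n = 30, mu = c(0, 0), Sigma = Sigma)
```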

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">plot(x, xlim <span style="color: #666666">=</span> c(<span style="color: #666666">-125</span>, <span style="color: #666666">125</span>), ylim <span style="color: #666666">=</span> c(<span style="color: #666666">-100</span>, <span style="color: #666666">150</span>))
</pre></div>

(Plot: the 30 simulated data points)

Running both the standard normal model and the robust t-distribution model on this data results in very similar estimates of the correlation:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">quantile(rho_samples, c(<span style="color: #666666">0.025</span>, <span style="color: #666666">0.5</span>, <span style="color: #666666">0.975</span>))
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">##    2.5%     50%   97.5% </span>
<span style="color: #408080; font-style: italic">## -0.8780 -0.7637 -0.5578</span>
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">quantile(robust_rho_samples, c(<span style="color: #666666">0.025</span>, <span style="color: #666666">0.5</span>, <span style="color: #666666">0.975</span>))
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">##    2.5%     50%   97.5% </span>
<span style="color: #408080; font-style: italic">## -0.8842 -0.7644 -0.5518</span>
</pre></div>

Looking at the estimate of nu, we see that it is quite high, which is what we would expect since the data is normal.

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">quantile(nu_samples, c(<span style="color: #666666">0.025</span>, <span style="color: #666666">0.5</span>, <span style="color: #666666">0.975</span>))
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">##    2.5%     50%   97.5% </span>
<span style="color: #408080; font-style: italic">##   5.903  30.216 120.186</span>
</pre></div>
