So in the last post I showed how to run the Bayesian counterpart of Pearson’s correlation test by estimating the parameters of a bivariate normal distribution. A problem with assuming normality is that the normal distribution isn’t robust against outliers. Let’s see what happens if we take the data from the last post with the finishing times and weights of the runners in the men’s 100 m semi-finals in the 2013 World Championships in Athletics and introduce an outlier. This is how the original data looks:
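To introduce the outlier, something like the following sketch would do. Note that I'm assuming a data frame d with the columns runner, time and weight, as in the last post, and the extreme weight below is purely illustrative:

d_outlier <- d
# Put Usain Bolt on a hypothetical diet of mutant cheeseburgers:
d_outlier$weight[d_outlier$runner == "Usain Bolt"] <- 1000
plot(d_outlier$time, d_outlier$weight,
  xlab = "Finishing time (s)", ylab = "Weight (kg)")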
Usain Bolt now looks like a real outlier, and he is one: he isn't representative of the majority of runners, who do not eat mutant cheeseburgers and who are the ones we would like to make inferences about. How does this single outlier influence the estimated correlation between finishing time and weight? Let's rerun the correlation analysis from the last post and compare:
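Here is a sketch of that rerun, assuming that normal_model_string holds the dmnorm model from the last post and that d_outlier is the data with the introduced outlier:

library(rjags)
data_list <- list(x = cbind(d_outlier$time, d_outlier$weight), n = nrow(d_outlier))
normal_model <- jags.model(textConnection(normal_model_string), data = data_list, n.chains = 3)
update(normal_model, 1000) # burn-in
mcmc_samples <- coda.samples(normal_model, c("mu", "rho", "sigma", "x_rand"), n.iter = 5000)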
The estimated distribution of the data indicates that there is now a strong negative correlation between finishing time and weight! Looking at the distribution of the rho parameter…
… shows that the correlation is estimated to be -0.52 (95% CI: [-0.78, -0.11]), which is very different from the estimated correlation for the outlier-free data, which was -0.10 (95% CI: [-0.51, 0.35]). I think it's easy to argue that giving this one single outlier so much influence is not reasonable.
What you often actually want to assume is that the bulk of the data is normally distributed, while still allowing for the occasional outlier. One of the great benefits of Bayesian statistics is that it is easy to assume any parametric distribution for the data, and a good distribution when we want robustness is the t-distribution. This distribution is similar to the normal distribution except that it has an extra parameter $\nu$ (also called the degrees of freedom) that governs how heavy the tails of the distribution are. The smaller $\nu$ is, the heavier the tails and the less the estimated mean and standard deviation of the distribution are influenced by outliers. When $\nu$ is large (say $\nu > 30$) the t-distribution is very similar to the normal distribution. So, in order to make our Bayesian correlation analysis more robust, we'll replace the multivariate normal distribution (dmnorm) from the last post with a multivariate t-distribution (dmt). To this we'll add a prior on the $\nu$ parameter nu that allows both for completely normally distributed data and for normally distributed data with the occasional outlier (I'm using the same prior as in Kruschke's Bayesian estimation supersedes the t test).
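As a quick illustration of the heavy-tails claim, compare the probability of a draw falling more than three scale units from the center for a few values of $\nu$ with the same probability under the normal distribution:

round(sapply(c(2, 5, 30), function(nu) 2 * pt(-3, df = nu)), 4)
## [1] 0.0955 0.0301 0.0054
round(2 * pnorm(-3), 4)
## [1] 0.0027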
robust_model_string <- "
model {
  for(i in 1:n) {
    # We've replaced dmnorm with dmt ...
    x[i, 1:2] ~ dmt(mu[], prec[ , ], nu)
  }
  prec[1:2, 1:2] <- inverse(cov[ , ])
  cov[1, 1] <- sigma[1] * sigma[1]
  cov[1, 2] <- sigma[1] * sigma[2] * rho
  cov[2, 1] <- sigma[1] * sigma[2] * rho
  cov[2, 2] <- sigma[2] * sigma[2]
  sigma[1] ~ dunif(0, 1000)
  sigma[2] ~ dunif(0, 1000)
  rho ~ dunif(-1, 1)
  mu[1] ~ dnorm(0, 0.0001)
  mu[2] ~ dnorm(0, 0.0001)
  # ... and added a prior on the degrees of freedom parameter nu:
  # nu - 1 is exponentially distributed with mean 29, so nu is at least 1.
  nu <- nuMinusOne + 1
  nuMinusOne ~ dexp(1/29)
  # A posterior predictive draw from the fitted distribution.
  x_rand ~ dmt(mu[], prec[ , ], nu)
}
"
Here are the resulting trace plots and density plots…
par(mar = rep(2.2, 4)) # shrink the margins so all the panels fit
plot(mcmc_samples)
… and here is a summary of the parameter estimates:
summary(mcmc_samples)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu[1] 10.065 0.0403 0.000329 0.000510
## mu[2] 79.641 1.9707 0.016091 0.030294
## nu 13.703 19.9326 0.162749 1.426841
## rho -0.270 0.2441 0.001993 0.003985
## sigma[1] 0.163 0.0359 0.000293 0.000665
## sigma[2] 7.786 1.9440 0.015872 0.040905
## x_rand[1] 10.065 0.2105 0.001719 0.001732
## x_rand[2] 79.523 10.0099 0.081731 0.085990
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu[1] 9.984 10.039 10.065 10.091 10.144
## mu[2] 75.976 78.350 79.572 80.857 83.778
## nu 2.381 4.547 7.515 14.350 61.519
## rho -0.691 -0.448 -0.285 -0.108 0.241
## sigma[1] 0.105 0.138 0.160 0.184 0.245
## sigma[2] 4.718 6.374 7.546 8.952 12.198
## x_rand[1] 9.654 9.948 10.066 10.183 10.471
## x_rand[2] 59.926 73.963 79.466 85.092 99.637
Using our robust model we get a correlation of -0.28 (95% CI: [-0.69, 0.24]), which is much closer to our initial -0.10 (95% CI: [-0.51, 0.35]). Looking at the estimate for the nu parameter, we see that there is evidence for non-normality, as the median nu is quite small: 7.5 (95% CI: [2.4, 61.5]).
What about data with no outliers?
So now we have two Bayesian versions of Pearson's correlation test, one normal and one robust. Do we always have to choose which of these two models to use? No! I'd go for the robust version any day. You see, it also estimates the heaviness of the tails of the bivariate t-distribution, and if there is sufficient evidence in the data for normality, the estimated t-distribution will be very close to a normal distribution. We can have our cake and eat it too!
To show that this works, I will now apply the robust model to the same data that I used in the last post: 30 random draws from a bivariate normal distribution.
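A sketch of generating that kind of data; the mean vector and covariance matrix below are illustrative stand-ins, not the exact values from the last post:

library(mvtnorm)
set.seed(42)
mu <- c(10, 30)
# SDs of 50 for both variables and a correlation of 0.4:
sigma_matrix <- matrix(c(2500, 1000, 1000, 2500), nrow = 2)
x <- rmvnorm(30, mean = mu, sigma = sigma_matrix)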
plot(x, xlim = c(-125, 125), ylim = c(-100, 150))
Running both the standard normal model and the robust t-distribution model on this data results in very similar estimates of the correlation:
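A sketch of that comparison, fitting each model to x and extracting the posterior quantiles of rho (normal_model_string is again assumed to be the dmnorm model from the last post):

data_list <- list(x = x, n = nrow(x))
rho_quantiles <- function(model_string) {
  m <- jags.model(textConnection(model_string), data = data_list, n.chains = 3)
  update(m, 1000) # burn-in
  summary(coda.samples(m, "rho", n.iter = 5000))$quantiles
}
rho_quantiles(normal_model_string)
rho_quantiles(robust_model_string)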