Bayesian Estimation of Correlation – Now Robust!

August 28, 2013
By

(This article was first published on Publishable Stuff, and kindly contributed to R-bloggers)

So in the last post I showed how to run the Bayesian counterpart of Pearson’s correlation test by estimating the parameters of a bivariate normal distribution. A problem with assuming normality is that the normal distribution isn’t robust against outliers. Let’s see what happens if we take the data from the last post with the finishing times and weights of the runners in the men’s 100 m semi-finals in the 2013 World Championships in Athletics and introduce an outlier. This is how the original data looks:

d
##                    runner  time weight
## 1              Usain Bolt  9.92     94
## 2           Justin Gatlin  9.94     79
## 3            Nesta Carter  9.97     78
## 4       Kemar Bailey-Cole  9.93     83
## 5         Nickel Ashmeade  9.90     77
## 6            Mike Rodgers  9.93     76
## 7     Christophe Lemaitre 10.00     74
## 8           James Dasaolu  9.97     87
## 9           Zhang Peimeng 10.00     86
## 10           Jimmy Vicaut 10.01     83
## 11         Keston Bledman 10.08     75
## 12       Churandy Martina 10.09     74
## 13         Dwain Chambers 10.15     92
## 14           Jason Rogers 10.15     69
## 15          Antoine Adams 10.17     79
## 16        Anaso Jobodwana 10.17     71
## 17       Richard Thompson 10.19     80
## 18          Gavin Smellie 10.30     80
## 19          Ramon Gittens 10.31     77
## 20 Harry Aikines-Aryeetey 10.34     87

What if Usain Bolt had eaten mutant cheese burgers both gaining in weight and ability to run faster?

d[d$runner == "Usain Bolt", c("time", "weight")] <- c(9.5, 115)
plot(d$time, d$weight)

plot of chunk unnamed-chunk-3

Usain Bolt now looks like a real outlier, and he is, he’s not representative of the majority of runners that are not eating mutant cheeseburgers and that we would like to make inferences about. How does this single outlier influence the estimated correlation between finishing time and weight? Let’s rerun the correlation analysis from the last post and compare:

data_list = list(x = d[, c("time", "weight")], n = nrow(d))
# Use classical estimates of the parameters as initial values
inits_list = list(mu = c(mean(d$time), mean(d$weight)), rho = cor(d$time, d$weight),
    sigma = c(sd(d$time), sd(d$weight)))
jags_model <- jags.model(textConnection(model_string), data = data_list, inits = inits_list,
    n.adapt = 500, n.chains = 3, quiet = TRUE)
update(jags_model, 500)
mcmc_samples <- coda.samples(jags_model, c("rho", "x_rand[1]", "x_rand[2]"),
    n.iter = 5000)
samples_mat <- as.matrix(mcmc_samples)
plot(d$time, d$weight)
dataEllipse(samples_mat[, c("x_rand[1]", "x_rand[2]")], levels = c(0.5, 0.95),
    plot.points = FALSE)

plot of chunk unnamed-chunk-4

The estimated distribution of the data indicates that there now is a strong negative correlation between finishing time and weight! Looking at the distribution of the rho parameter…

quantile(samples_mat[, "rho"], c(0.025, 0.5, 0.975))
##    2.5%     50%   97.5% 
## -0.7773 -0.5208 -0.1127

… shows that the correlation is estimated to be -0.52 (95% CI: [-0.78,-0.11]) which is very different from the estimated correlation for the sans outlier data which was -0.10 (95% CI: [-0.51, 0.35]). I think it’s easy to argue that the influence of this one, single outlier is not reasonable.

What you actually often want to assume is that the bulk of the data is normally distributed while still allowing for the occasional outlier. One of the great benefits of Bayesian statistics is that it is easy to assume any parametric distribution for the data and a good distribution when we want robustness is the t-distribution. This distribution is similar to the normal distribution except that it has an extra parameter $\nu$ (also called the degrees of freedom) that governs how heavy the tails of the distribution are. The smaller $\nu$ is, the heavier the tails are and the less the estimated mean and standard deviation of the distribution are influenced by outliers. When $\nu$ is large (say $\nu$ > 30) the t-distribution is very similar to the normal distribution. So in order to make our Bayesian correlation analysis more robust we’ll replace the multivariate normal distribution (dmnorm) from the last post with a multivariate t-distribution (dmt). To that we’ll add a prior on the $\nu$ parameter nu that both allows for completely normally distributed data or normally distributed data with an occasional outlier (I’m using the same prior as in Kruschke’s Bayesian estimation supersedes the t test).

robust_model_string <- "
  model {
    for(i in 1:n) {
      # We've replaced dmnorm with and dmt ...
      x[i,1:2] ~ dmt(mu[], prec[ , ], nu) 
    }

    prec[1:2,1:2] <- inverse(cov[,])
    
    cov[1,1] <- sigma[1] * sigma[1]
    cov[1,2] <- sigma[1] * sigma[2] * rho
    cov[2,1] <- sigma[1] * sigma[2] * rho
    cov[2,2] <- sigma[2] * sigma[2]
    
    sigma[1] ~ dunif(0, 1000) 
    sigma[2] ~ dunif(0, 1000)
    rho ~ dunif(-1, 1)
    mu[1] ~ dnorm(0, 0.0001)
    mu[2] ~ dnorm(0, 0.0001)
    
    # ... and added a prior on the degree of freedom parameter nu.
    nu <- nuMinusOne+1
    nuMinusOne ~ dexp(1/29)

    x_rand ~ dmt(mu[], prec[ , ], nu)
  }
"

Now let’s run the model using JAGS…

data_list = list(x = d[, c("time", "weight")], n = nrow(d))
# Use robust estimates of the parameters as initial values
inits_list = list(mu = c(median(d$time), median(d$weight)), rho = cor(d$time,
    d$weight, method = "spearman"), sigma = c(mad(d$time), mad(d$weight)))
jags_model <- jags.model(textConnection(robust_model_string), data = data_list,
    inits = inits_list, n.adapt = 500, n.chains = 3, quiet = TRUE)
update(jags_model, 500)
mcmc_samples <- coda.samples(jags_model, c("mu", "rho", "sigma", "nu", "x_rand"),
    n.iter = 5000)

Here are the resulting trace plots and density plots…

par(mar = rep(2.2, 4))
plot(mcmc_samples)

plot of chunk unnamed-chunk-8plot of chunk unnamed-chunk-8

.. and here is a summary of the parameter estimates:

summary(mcmc_samples)
## 
## Iterations = 1001:6000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##             Mean      SD Naive SE Time-series SE
## mu[1]     10.065  0.0403 0.000329       0.000510
## mu[2]     79.641  1.9707 0.016091       0.030294
## nu        13.703 19.9326 0.162749       1.426841
## rho       -0.270  0.2441 0.001993       0.003985
## sigma[1]   0.163  0.0359 0.000293       0.000665
## sigma[2]   7.786  1.9440 0.015872       0.040905
## x_rand[1] 10.065  0.2105 0.001719       0.001732
## x_rand[2] 79.523 10.0099 0.081731       0.085990
## 
## 2. Quantiles for each variable:
## 
##             2.5%    25%    50%    75%  97.5%
## mu[1]      9.984 10.039 10.065 10.091 10.144
## mu[2]     75.976 78.350 79.572 80.857 83.778
## nu         2.381  4.547  7.515 14.350 61.519
## rho       -0.691 -0.448 -0.285 -0.108  0.241
## sigma[1]   0.105  0.138  0.160  0.184  0.245
## sigma[2]   4.718  6.374  7.546  8.952 12.198
## x_rand[1]  9.654  9.948 10.066 10.183 10.471
## x_rand[2] 59.926 73.963 79.466 85.092 99.637

Using our robust model we get a correlation of -0.28 (95% CI:[-0.69, 0.24]) which is much closer to our initial -0.10 (95% CI: [-0.51, 0.35]). Looking at the estimate for the nu parameter we see that there is evidence for non-normality as the median nu is quite small: 7.5 (95% CI: [2.4, 61.5]).

What about data with no outliers?

So now we have two Bayesian versions of Pearson’s correlation test, one normal and one robust. Do we always have to make a choice which of these two models to use? No! I’ll go for the robust version any day, you see, it also estimates the heaviness of the tails of the bivariate t-distribution and if there is sufficient evidence in the data for normality the estimated t-distribution will be very close to a normal distribution. We can have the cake and eat it!

To show that this works I will now apply the robust model to the same data that I used in the last post which are 30 random draws from a bivariate normal distribution.

plot(x, xlim = c(-125, 125), ylim = c(-100, 150))

plot of chunk unnamed-chunk-10

Running both the standard normal model and the robust t-distribution model on this data results in very similar estimates of the correlation:

quantile(rho_samples, c(0.025, 0.5, 0.975))
##    2.5%     50%   97.5% 
## -0.8780 -0.7637 -0.5578
quantile(robust_rho_samples, c(0.025, 0.5, 0.975))
##    2.5%     50%   97.5% 
## -0.8842 -0.7644 -0.5518

Looking at the estimate of nu we see that it is quite high which is what we would expect since the data is normal.

quantile(nu_samples, c(0.025, 0.5, 0.975))
##    2.5%     50%   97.5% 
##   5.903  30.216 120.186

To leave a comment for the author, please follow the link and comment on his blog: Publishable Stuff.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.