Bayesian Estimation of Correlation – Now Robust!

Posted on August 28, 2013 by Rasmus Bååth in R bloggers | 0 Comments

[This article was first published on Publishable Stuff, and kindly contributed to R-bloggers].

So in the last post I showed how to run the Bayesian counterpart of Pearson’s correlation test by estimating the parameters of a bivariate normal distribution. A problem with assuming normality is that the normal distribution isn’t robust against outliers. Let’s see what happens if we take the data from the last post with the finishing times and weights of the runners in the men’s 100 m semi-finals in the 2013 World Championships in Athletics and introduce an outlier. This is how the original data looks:

##                    runner  time weight
## 1              Usain Bolt  9.92     94
## 2           Justin Gatlin  9.94     79
## 3            Nesta Carter  9.97     78
## 4       Kemar Bailey-Cole  9.93     83
## 5         Nickel Ashmeade  9.90     77
## 6            Mike Rodgers  9.93     76
## 7     Christophe Lemaitre 10.00     74
## 8           James Dasaolu  9.97     87
## 9           Zhang Peimeng 10.00     86
## 10           Jimmy Vicaut 10.01     83
## 11         Keston Bledman 10.08     75
## 12       Churandy Martina 10.09     74
## 13         Dwain Chambers 10.15     92
## 14           Jason Rogers 10.15     69
## 15          Antoine Adams 10.17     79
## 16        Anaso Jobodwana 10.17     71
## 17       Richard Thompson 10.19     80
## 18          Gavin Smellie 10.30     80
## 19          Ramon Gittens 10.31     77
## 20 Harry Aikines-Aryeetey 10.34     87
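To get a feel for how much a single extreme observation can move the classical estimate, here is a minimal sketch that rebuilds the table above and compares Pearson's and Spearman's correlation with and without one extra row. The extra row is a made-up value of mine, not necessarily the outlier used in the rest of this post, and the names `d_table`, `d_table_out` and `cor_summary` are hypothetical.

```r
# Times (s) and weights (kg) copied from the table above, in the same row order
d_table <- data.frame(
  time = c(9.92, 9.94, 9.97, 9.93, 9.90, 9.93, 10.00, 9.97, 10.00, 10.01,
           10.08, 10.09, 10.15, 10.15, 10.17, 10.17, 10.19, 10.30, 10.31, 10.34),
  weight = c(94, 79, 78, 83, 77, 76, 74, 87, 86, 83,
             75, 74, 92, 69, 79, 71, 80, 80, 77, 87)
)

# ASSUMPTION: a made-up slow and heavy "runner", used only for illustration
d_table_out <- rbind(d_table, data.frame(time = 12.5, weight = 140))

# Correlation before and after adding the single extra row
cor_summary <- data.frame(
  method = c("pearson", "spearman"),
  original = c(cor(d_table$time, d_table$weight),
               cor(d_table$time, d_table$weight, method = "spearman")),
  with_outlier = c(cor(d_table_out$time, d_table_out$weight),
                   cor(d_table_out$time, d_table_out$weight, method = "spearman"))
)
cor_summary$shift <- cor_summary$with_outlier - cor_summary$original
print(cor_summary)
```

With this made-up row the Pearson estimate should move much further than the rank-based Spearman estimate, which is exactly the lack of robustness that the normal model inherits.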

library(rjags)  # jags.model() and coda.samples()
library(car)    # dataEllipse()

# Assumes d and model_string (the bivariate normal model) are already defined
data_list = list(x = d[, c("time", "weight")], n = nrow(d))
# Use classical estimates of the parameters as initial values
inits_list = list(mu = c(mean(d$time), mean(d$weight)),
                  rho = cor(d$time, d$weight),
                  sigma = c(sd(d$time), sd(d$weight)))
jags_model <- jags.model(textConnection(model_string), data = data_list,
                         inits = inits_list, n.adapt = 500, n.chains = 3, quiet = TRUE)
update(jags_model, 500)
mcmc_samples <- coda.samples(jags_model, c("rho", "x_rand[1]", "x_rand[2]"), n.iter = 5000)
samples_mat <- as.matrix(mcmc_samples)

# Plot the data with 50% and 95% ellipses of the posterior predictive draws
plot(d$time, d$weight)
dataEllipse(samples_mat[, c("x_rand[1]", "x_rand[2]")], levels = c(0.5, 0.95), plot.points = FALSE)

robust_model_string <- "
  model {
    for(i in 1:n) {
      # We've replaced dmnorm with dmt ...
      x[i,1:2] ~ dmt(mu[], prec[ , ], nu)
    }

    prec[1:2,1:2] <- inverse(cov[,])

    cov[1,1] <- sigma[1] * sigma[1]
    cov[1,2] <- sigma[1] * sigma[2] * rho
    cov[2,1] <- sigma[1] * sigma[2] * rho
    cov[2,2] <- sigma[2] * sigma[2]

    sigma[1] ~ dunif(0, 1000)
    sigma[2] ~ dunif(0, 1000)
    rho ~ dunif(-1, 1)
    mu[1] ~ dnorm(0, 0.0001)
    mu[2] ~ dnorm(0, 0.0001)

    # ... and added a prior on the degrees of freedom parameter nu.
    nu <- nuMinusOne+1
    nuMinusOne ~ dexp(1/29)

    x_rand ~ dmt(mu[], prec[ , ], nu)
  }
"
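The prior on nu in the model above is a shifted exponential: nuMinusOne has rate 1/29 and therefore mean 29, so the prior mean of nu is 30, and nu can never drop below 1. A quick way to see what such a prior implies is to simulate from it in plain R. This is a sketch for intuition only; `nu_prior`, the seed and the number of draws are my own choices, and the cut-off of 10 for "clearly heavy tails" is an arbitrary illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Draw from the prior: nu = nuMinusOne + 1, with nuMinusOne ~ Exponential(rate = 1/29)
nu_prior <- rexp(1e5, rate = 1/29) + 1

mean(nu_prior)                             # simulated prior mean, theoretical value is 30
quantile(nu_prior, c(0.025, 0.5, 0.975))   # spread of the prior

# Share of prior mass on nu < 10, simulated and exact
mean(nu_prior < 10)
pexp(9, rate = 1/29)
```

The prior puts a fair amount of mass on small values of nu (heavy tails) while still allowing large values (near-normal tails), so the data get to decide.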

data_list = list(x = d[, c("time", "weight")], n = nrow(d))
# Use robust estimates of the parameters as initial values
inits_list = list(mu = c(median(d$time), median(d$weight)),
                  rho = cor(d$time, d$weight, method = "spearman"),
                  sigma = c(mad(d$time), mad(d$weight)))
jags_model <- jags.model(textConnection(robust_model_string), data = data_list,
                         inits = inits_list, n.adapt = 500, n.chains = 3, quiet = TRUE)
update(jags_model, 500)
mcmc_samples <- coda.samples(jags_model, c("mu", "rho", "sigma", "nu", "x_rand"), n.iter = 5000)

What about data with no outliers?

So now we have two Bayesian versions of Pearson’s correlation test, one normal and one robust. Do we always have to choose which of the two models to use? No! I’ll go for the robust version any day. You see, it also estimates the heaviness of the tails of the bivariate t-distribution through the degrees of freedom parameter nu, and if there is sufficient evidence in the data for normality, the estimated t-distribution will be very close to a normal distribution. We can have our cake and eat it!
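That the t-distribution closes in on the normal as the degrees of freedom grow can be checked directly in base R. The sketch below uses the univariate t-distribution as a stand-in for the bivariate one in the model (a deliberate simplification) and a handful of arbitrarily chosen values of nu; all variable names are my own.

```r
nu_values <- c(1, 2, 5, 30, 100, 1000)

# 97.5% quantile of the t-distribution for each nu, and of the standard normal
t_q <- qt(0.975, df = nu_values)
z_q <- qnorm(0.975)

# Largest absolute difference between the t density and the normal density on a grid
grid <- seq(-5, 5, by = 0.01)
max_dens_diff <- sapply(nu_values, function(nu) {
  max(abs(dt(grid, df = nu) - dnorm(grid)))
})

print(data.frame(nu = nu_values,
                 t_quantile = t_q,
                 excess_over_normal = t_q - z_q,
                 max_dens_diff = max_dens_diff))
```

Both the excess of the quantile over the normal one and the density difference shrink as nu increases, so a posterior that concentrates on large values of nu is in practice describing a normal distribution.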

To show that this works I will now apply the robust model to the same data that I used in the last post: 30 random draws from a bivariate normal distribution.

plot(x, xlim = c(-125, 125), ylim = c(-100, 150))
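The data `x` itself comes from the last post and is not regenerated here. For readers who want comparable data to experiment with, the following is a minimal base R sketch that draws 30 points from a bivariate normal distribution using the same sigma and rho parameterisation of the covariance matrix as the JAGS model. The values of `mu`, `sigma` and `rho`, the seed, and the name `x_sim` are assumptions of mine, picked only so that the points roughly fit inside the plot limits above.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# ASSUMED parameter values, not taken from the post
mu <- c(10, 30)
sigma <- c(20, 40)
rho <- -0.7

# Same covariance parameterisation as in the JAGS model
cov_mat <- rbind(c(sigma[1]^2,                sigma[1] * sigma[2] * rho),
                 c(sigma[1] * sigma[2] * rho, sigma[2]^2))

# 30 draws: independent standard normals times the Cholesky factor, shifted by mu
z <- matrix(rnorm(30 * 2), ncol = 2)
x_sim <- sweep(z %*% chol(cov_mat), 2, mu, "+")

cor(x_sim[, 1], x_sim[, 2])  # sample correlation, varies with the seed
```

If the MASS package is available, `MASS::mvrnorm(30, mu, cov_mat)` would do the same job in one call.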

Running both the standard normal model and the robust t-distribution model on this data results in very similar estimates of the correlation:

quantile(rho_samples, c(0.025, 0.5, 0.975))

Looking at the estimate of nu, we see that it is quite high, which is what we would expect since the data are normally distributed.

quantile(nu_samples, c(0.025, 0.5, 0.975))
