The Bayesian Counterpart of Pearson’s Correlation Test

Posted on August 19, 2013 by Rasmus Bååth in R bloggers | 0 Comments

[This article was first published on Publishable Stuff, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Except for maybe the t test, a contender for the title “most used and abused statistical test” is Pearson’s correlation test. Whenever someone wants to check if two variables relate somehow it is a safe bet (at least in psychology) that the first thing to be tested is the strength of a Pearson’s correlation. Only if that doesn’t work a more sophisticated analysis is attempted (“That p-value is still to big, maybe a mixed models logistic regression will make it smaller…”). One reason for this is that the Pearson’s correlation test is conceptually quite simple and have assumption that makes it applicable in many situations (but it is definitely also used in many situations where underlying assumption are violated).

Since I’ve converted to “Bayesianism” I’ve been trying to figure out what Bayesian analyses correspond to the classical analyses. For t tests, chi-square tests and Anovas I’ve found Bayesian versions that, at least conceptually, are testing the same thing. Here are links to Bayesian versions of the t test, a chi-square test and an Anova, if you’re interested. But for some reason I’ve never encountered a discussion of what a Pearson’s correlation test would correspond to in a Bayesian context. Maybe this is because regression modeling often can fill the same roll as correlation testing (quantifying relations between continuous variables) or perhaps I’ve been looking in the wrong places.

The aim of this post is to explain how one can run the Bayesian counterpart of Pearson’s correlation test using R and JAGS. The model that a classical Pearson’s correlation test assumes is that the data follows a bivariate normal distribution. That is, if we have a list $x$ of pairs of data points $[[x_{1,1},x_{1,2}],[x_{2,1},x_{2,2}],[x_{3,1},x_{3,2}],…]$ then the $x_{i,1} \text{s}$ and the $x_{i,2} \text{s}$ are each assumed to be normally distributed with a possible linear dependency between them. This dependency is quantified by the correlation parameter $\rho$ which is what we want to estimate in a correlation analysis. A good visualization of a bivariate normal distribution with $\rho = 0.3$ can be found on the the wikipedia page on the multivariate normal distribution :

We will assume the same model and our Bayesian correlation analysis then reduces to estimating the parameters of a bivariate normal distribution given some data. One problem is that the bivariate normal distribution and, more general, the multivariate normal distribution isn’t parameterized using $\rho$, that is, we cannot estimate $\rho$ directly. The bivariate normal is parameterized using $\mu_1$ and $\mu_2$ the means of the two marginal distributions (the red and blue normal distribution in the graph above) and a covariance matrix $\Sigma$ which defines $ \sigma_1^2 $ and $\sigma_2^2$, the variances of the two marginal distributions, and the covariance, how much the marginal distributions vary together. So the covariance is another measure of how much two variables vary together and the covariance corresponding to a correlation of $\rho$ can be calculated as $\rho \cdot \sigma_1 \cdot \sigma_2$. So here is the model we want to estimate:

$$[x_{i,1}, x_{i,2}] \sim MulitivariateNormal([\mu_1,\mu_2], \Sigma)$$

$$%

Add some flat priors on this (which could, of course, be made more informative) and we’re ready to roll:

$$\mu_1,\mu_2 \sim Normal(0, 1000)$$

$$\sigma_1,\sigma_2 \sim Uniform(0, 1000)$$

$$\rho \sim Uniform(-1, 1)$$

Implementation

So, how to implement this model? I’m going to do it with R and the JAGS sampler interfaced with R using the rjags package. First I’m going to simulate some data with a correlation of 0.7 to test the model with.

library(rjags)
library(mvtnorm) # to generate correlated data with rmvnorm.
library(car) # To plot the estimated bivariate normal distribution.
set.seed(31415)

mu <- c(10, 30)
sigma <- c(20, 40)
rho <- -0.7
cov_mat <- rbind(c(     sigma[1]^2       , sigma[1]*sigma[2]*rho ),
                 c( sigma[1]*sigma[2]*rho,      sigma[2]^2       ))
x <- rmvnorm(30, mu, cov_mat)
plot(x, xlim=c(-125, 125), ylim=c(-100, 150))

The simulated data looks quite correlated and a classical Pearson’s correlation test confirms this:

cor.test(x[, 1], x[, 2])

model_string <- " model { for(i in 1:n) { x[i,1:2] ~ dmnorm(mu[], prec[ , ]) } # Constructing the covariance matrix and the corresponding precision matrix. prec[1:2,1:2] <- inverse(cov[,]) cov[1,1] <- sigma[1] * sigma[1] cov[1,2] <- sigma[1] * sigma[2] * rho cov[2,1] <- sigma[1] * sigma[2] * rho cov[2,2] <- sigma[2] * sigma[2] # Flat priors on all parameters which could, of course, be made more informative. sigma[1] ~ dunif(0, 1000) sigma[2] ~ dunif(0, 1000) rho ~ dunif(-1, 1) mu[1] ~ dnorm(0, 0.001) mu[2] ~ dnorm(0, 0.001) # Generate random draws from the estimated bivariate normal distribution x_rand ~ dmnorm(mu[], prec[ , ]) } "

data_list = list(x = x, n = nrow(x)) # Use classical estimates of the parameters as initial values inits_list = list(mu = c(mean(x[, 1]), mean(x[, 2])), rho = cor(x[, 1], x[, 2]), sigma = c(sd(x[, 1]), sd(x[, 1]))) jags_model <- jags.model(textConnection(model_string), data = data_list, inits = inits_list, n.adapt = 500, n.chains = 3, quiet = T) update(jags_model, 500) mcmc_samples <- coda.samples(jags_model, c("mu", "rho", "sigma", "x_rand"), n.iter = 5000)

Analysis of Some “Real” Data

So let’s use this model on some real data. The data.frame below contains the names, weights in kg and finishing times for all participants of the men’s 100 m semi-finals in the 2013 World Championships in Athletics. Well, those I could find the weights of anyway…

d <- data.frame(runner = c("Usain Bolt", "Justin Gatlin", "Nesta Carter", "Kemar Bailey-Cole", "Nickel Ashmeade", "Mike Rodgers", "Christophe Lemaitre", "James Dasaolu", "Zhang Peimeng", "Jimmy Vicaut", "Keston Bledman", "Churandy Martina", "Dwain Chambers", "Jason Rogers", "Antoine Adams", "Anaso Jobodwana", "Richard Thompson", "Gavin Smellie", "Ramon Gittens", "Harry Aikines-Aryeetey"), time = c(9.92, 9.94, 9.97, 9.93, 9.9, 9.93, 10, 9.97, 10, 10.01, 10.08, 10.09, 10.15, 10.15, 10.17, 10.17, 10.19, 10.3, 10.31, 10.34), weight = c(94, 79, 78, 83, 77, 76, 74, 87, 86, 83, 75, 74, 92, 69, 79, 71, 80, 80, 77, 87)) d

So, I know nothing about running (and I’m not sure this is a very representative data set…) but my hypothesis is that there should be a positive correlation between weight and finishing time. That is, the more you weigh the slower you run. Sounds like logic, right? Let’s look at the data…

plot(d$time, d$weight)

At a first glance it seems like my hypothesis is not supported by the data. I wonder what our model has to say about that?

data_list = list(x = d[, c("time", "weight")], n = nrow(d)) # Use classical estimates of the parameters as initial values inits_list = list(mu = c(mean(d$time), mean(d$weight)), rho = cor(d$time, d$weight), sigma = c(sd(d$time), sd(d$weight))) jags_model <- jags.model(textConnection(model_string), data = data_list, inits = inits_list, n.adapt = 500, n.chains = 3, quiet = T) update(jags_model, 500) mcmc_samples <- coda.samples(jags_model, c("rho"), n.iter = 5000)

summary(mcmc_samples)

Seems like there is no support for my hypothesis, the posterior distribution of rho is centered around zero and, if anything, there might be a tiny negative correlation. So your weight doesn’t seem to influence how fast you run (if you are a runner in the 100 m semi finals, at least).

To leave a comment for the author, please follow the link and comment on their blog: Publishable Stuff.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

The Bayesian Counterpart of Pearson’s Correlation Test

Implementation

Analysis of Some “Real” Data

Related

Implementation

Analysis of Some “Real” Data

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)