# The Bayesian Counterpart of Pearson’s Correlation Test

**Publishable Stuff**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Except for maybe the t test, a contender for the title “most used and abused statistical test” is Pearson’s correlation test. Whenever someone wants to check if two variables relate somehow it is a safe bet (at least in psychology) that the first thing to be tested is the strength of a Pearson’s correlation. Only if that doesn’t work a more sophisticated analysis is attempted (“That p-value is still to big, maybe a mixed models logistic regression will make it smaller…”). One reason for this is that the Pearson’s correlation test is conceptually quite simple and have assumption that makes it applicable in many situations (but it is definitely also used in many situations where underlying assumption are violated).

Since I’ve converted to “Bayesianism” I’ve been trying to figure out what Bayesian analyses correspond to the classical analyses. For t tests, chi-square tests and Anovas I’ve found Bayesian versions that, at least conceptually, are testing the same thing. Here are links to Bayesian versions of the t test, a chi-square test and an Anova, if you’re interested. But for some reason I’ve never encountered a discussion of what a Pearson’s correlation test would correspond to in a Bayesian context. Maybe this is because regression modeling often can fill the same roll as correlation testing (quantifying relations between continuous variables) or perhaps I’ve been looking in the wrong places.

The aim of this post is to explain how one can run the Bayesian counterpart of Pearson’s correlation test using R and JAGS. The model that a classical Pearson’s correlation test assumes is that the data follows a bivariate normal distribution. That is, if we have a list $x$ of pairs of data points $[[x_{1,1},x_{1,2}],[x_{2,1},x_{2,2}],[x_{3,1},x_{3,2}],…]$ then the $x_{i,1} \text{s}$ and the $x_{i,2} \text{s}$ are each assumed to be normally distributed with a possible linear dependency between them. This dependency is quantified by the correlation parameter $\rho$ which is what we want to estimate in a correlation analysis. A good visualization of a bivariate normal distribution with $\rho = 0.3$ can be found on the the wikipedia page on the multivariate normal distribution :

We will assume the same model and our Bayesian correlation analysis then reduces to estimating the parameters of a bivariate normal distribution given some data. One problem is that the bivariate normal distribution and, more general, the multivariate normal distribution isn’t parameterized using $\rho$, that is, we cannot estimate $\rho$ directly. The bivariate normal is parameterized using $\mu_1$ and $\mu_2$ the means of the two marginal distributions (the red and blue normal distribution in the graph above) and a *covariance matrix* $\Sigma$ which defines $ \sigma_1^2 $ and $\sigma_2^2$, the variances of the two marginal distributions, and the covariance, how much the marginal distributions vary together. So the covariance is another measure of how much two variables vary together and the covariance corresponding to a correlation of $\rho$ can be calculated as $\rho \cdot \sigma_1 \cdot \sigma_2$. So here is the model we want to estimate:

$$[x_{i,1}, x_{i,2}] \sim MulitivariateNormal([\mu_1,\mu_2], \Sigma)$$

$$%

Add some flat priors on this (which could, of course, be made more informative) and we’re ready to roll:

$$\mu_1,\mu_2 \sim Normal(0, 1000)$$

$$\sigma_1,\sigma_2 \sim Uniform(0, 1000)$$

$$\rho \sim Uniform(-1, 1)$$

## Implementation

So, how to implement this model? I’m going to do it with R and the JAGS sampler interfaced with R using the `rjags`

package. First I’m going to simulate some data with a correlation of 0.7 to test the model with.

library(rjags) library(mvtnorm) # to generate correlated data with rmvnorm. library(car) # To plot the estimated bivariate normal distribution. set.seed(31415) mu <- c(10, 30) sigma <- c(20, 40) rho <- -0.7 cov_mat <- rbind(c( sigma[1]^2 , sigma[1]*sigma[2]*rho ), c( sigma[1]*sigma[2]*rho, sigma[2]^2 )) x <- rmvnorm(30, mu, cov_mat) plot(x, xlim=c(-125, 125), ylim=c(-100, 150))

The simulated data looks quite correlated and a classical Pearson’s correlation test confirms this:

cor.test(x[, 1], x[, 2])