
Use of the Pearson correlation coefficient is common in genomics and bioinformatics, and it is fine as far as it goes (I have used it extensively myself), but it has a major drawback: Pearson can produce large coefficients in the presence of a few very large measurements (outliers).

This is best shown via example in R:

```
# let's correlate some random data
g1 <- rnorm(50)
g2 <- rnorm(50)

cor(g1, g2)
#  -0.1486646
```

So we get a small, negative correlation from correlating two sets of 50 random values. If we ran this 1000 times we would get a distribution centred on zero, as expected.
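That claim is easy to check. A minimal sketch (the seed and the 1000-repetition loop are my additions, not from the original post):

```r
# Repeat the experiment 1000 times: correlate two fresh sets of
# 50 random values each time and collect the Pearson coefficients.
set.seed(42)  # seed is arbitrary, added here for reproducibility
cors <- replicate(1000, cor(rnorm(50), rnorm(50)))

summary(cors)          # distribution centred near zero
mean(abs(cors) > 0.3)  # only a small fraction exceed |r| = 0.3 by chance
```

With n = 50 the sampling standard deviation of r under the null is roughly 1/sqrt(n - 1) ≈ 0.14, so the -0.15 above is entirely unremarkable.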

Let’s add in a single, large value:

```
# let's correlate some random data with the addition of a single, large value
g1 <- c(g1, 10)
g2 <- c(g2, 11)

cor(g1, g2)
#  0.6040776
```

Holy smokes, all of a sudden my random datasets are positively correlated with r>=0.6!

It’s also significant.

```
> cor.test(g1, g2, method="pearson")

	Pearson's product-moment correlation

data:  g1 and g2
t = 5.3061, df = 49, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3941015 0.7541199
sample estimates:
      cor
0.6040776
```

So if you have used Pearson in large datasets, you will almost certainly have some of these spurious correlations in your data.
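To see how easily this scales up, here is a quick sketch of my own (not from the original post): simulate 100 random "genes" across 50 samples, then append one extreme sample shared by every gene, as a batch effect or a degraded sample might produce.

```r
# 100 hypothetical "genes" measured across 50 random samples
set.seed(1)
n_genes <- 100
mat <- matrix(rnorm(n_genes * 50), nrow = n_genes)

# add one extreme sample (values around 10) shared by all genes
mat <- cbind(mat, rnorm(n_genes, mean = 10))

# gene-by-gene Pearson correlation matrix (genes in rows, so transpose)
r <- cor(t(mat))

# fraction of gene pairs that now look strongly "correlated"
mean(r[upper.tri(r)] > 0.5)
```

One shared outlier is enough to push the bulk of these null gene pairs above r = 0.5, which is exactly the spurious-correlation problem described above.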

How can you solve this? By using Spearman, of course:

```
> cor(g1, g2, method="spearman")
 -0.0961086
> cor.test(g1, g2, method="spearman")

	Spearman's rank correlation rho

data:  g1 and g2
S = 24224, p-value = 0.5012
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho
-0.0961086
```
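Why does Spearman shrug the outlier off? For untied data, Spearman's rho is simply Pearson's r computed on the ranks, so the (10, 11) pair collapses to the top rank in each vector and loses its leverage. A minimal sketch (my own reconstruction of the example, with a seed added):

```r
# Rebuild the example: 50 random values plus one shared large value
set.seed(123)
g1 <- c(rnorm(50), 10)
g2 <- c(rnorm(50), 11)

# Spearman's rho is Pearson's r on the ranks
rho_direct  <- cor(g1, g2, method = "spearman")
rho_byrank  <- cor(rank(g1), rank(g2))

# the outlier still inflates Pearson, but not the rank-based measure
cor(g1, g2)  # large, spurious
rho_direct   # modest, near zero
```

Ranking bounds the influence of any single observation, which is why Spearman is the safer default when heavy-tailed measurements are a possibility.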