Dependence and Correlation

June 13, 2011

(This article was first published on mickeymousemodels, and kindly contributed to R-bloggers)

In everyday life I hear the word "correlation" thrown around far more often than "dependence." What's the difference? Correlation, in its most common form, is a measure of linear dependence; the catch is that not all dependencies are linear. The set of correlated random variables lies entirely within the larger set of dependent random variables: correlation implies dependence, but not the other way around. Here are some silly (but hopefully interesting) examples to illustrate that point:

n <- 5000
df <- data.frame(x=rnorm(n), y=rnorm(n, mean=5, sd=2))
plot(df, xlim=c(-6, 6), ylim=c(-2, 12), main="A Beehive")
mtext("X and Y are independent (and therefore uncorrelated)")
savePlot("beehive.png")


n <- 2500
df <- data.frame(x=rexp(n), y=rexp(n, rate=2))
plot(df, xlim=c(-0.05, 10), ylim=c(-0.05, 5), main="A B-2 Bomber")
mtext("X and Y are independent (and therefore uncorrelated)")
savePlot("bomber.png")


n <- 5000
df <- data.frame(x=runif(n))
df$y <- runif(n, -abs(0.5 - df$x), abs(0.5 - df$x))
plot(df, xlim=c(-0.05, 1.05), ylim=c(-0.55, 0.55), main="A Bowtie / Butterfly")
mtext("X and Y are dependent but uncorrelated")
savePlot("bowtie.png")


n <- 20000
df <- data.frame(x=runif(n, -1, 1), y=runif(n, -1, 1))
df <- subset(df, (x^2 + y^2 <= 1 & x^2 + y^2 >= 0.5) | x^2 + y^2 <= 0.25)
plot(df, main="Saturn")
mtext("X and Y are dependent but uncorrelated")
savePlot("saturn.png")


n <- 5000
df <- data.frame(x=rnorm(n))
df$y <- with(df, x * (2 * as.integer(abs(x) > 1.54) - 1))
plot(df, xlim=c(-4, 4), ylim=c(-4, 4), main="A Swing Bridge")
mtext("X and Y are dependent but uncorrelated")
savePlot("bridge.png")


n <- 1000
df <- data.frame(x=rnorm(n), z=sample(c(-1, 1), size=n, replace=TRUE))
df$y <- with(df, z * x)
df <- df[ , c("x", "y")]
plot(df, xlim=c(-4, 4), ylim=c(-4, 4), main="A Treasure Map")
mtext("X and Y are dependent but uncorrelated")
savePlot("treasure.png")
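Incidentally, the "dependent but uncorrelated" claim is easy to check numerically. Here's a quick sketch, not one of the plots above but the simplest example of the species: Y = X^2 with X standard normal. (The seed is arbitrary, just there for reproducibility.)

```r
set.seed(1)  # arbitrary seed, for reproducibility
x <- rnorm(100000)
y <- x^2  # y is a deterministic function of x, hence strongly dependent on it
cor(x, y)  # near 0: Cov(X, X^2) = E[X^3] = 0 for a standard normal
```

The sample correlation comes out within sampling noise of zero, even though knowing X tells you Y exactly.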


The last two are classic examples: X and Y are each normally distributed, but (X, Y) is not bivariate normal.
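One way to see the failure of joint normality in the treasure-map example: if (X, Y) were bivariate normal, then X + Y would be normal too. But here X + Y = X(1 + Z), which is exactly zero whenever Z = -1 — a point mass at zero, which no normal distribution has. A quick sketch (reusing the construction from the code above):

```r
set.seed(1)  # arbitrary seed, for reproducibility
n <- 1000
x <- rnorm(n)
z <- sample(c(-1, 1), size = n, replace = TRUE)
y <- z * x  # same construction as the treasure-map example
s <- x + y  # equals 2x when z == 1, and exactly 0 when z == -1
mean(s == 0)  # roughly 0.5 -- impossible for a normally distributed sum
```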

I'll admit that the two exponentials are a bit counterintuitive to me, at least visually. (They're in the second plot from the top, which looks vaguely like a B-2.) The variables are independent; if you regressed Y on X you'd end up with a flat line. Yet, somehow, if I were to look at that plot without knowing how the variables were generated, I'd want to draw a diagonal line pointing up and to the right. If anything, it goes to show that I should probably not run regressions "by inspection."
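The "flat line" claim is also cheap to verify: regressing Y on X for the two independent exponentials gives a slope that is statistically indistinguishable from zero, whatever the eye wants to see. A minimal sketch, using the same distributions as the B-2 example:

```r
set.seed(1)  # arbitrary seed, for reproducibility
n <- 2500
x <- rexp(n)
y <- rexp(n, rate = 2)  # independent of x, as in the B-2 example
fit <- lm(y ~ x)
coef(fit)["x"]  # estimated slope, close to 0
```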
