[This article was first published on Nor Talk Too Wise » R, and kindly contributed to R-bloggers.]

I’ve been thinking a lot about what it means for two variables to be correlated.  Scientists throw around the term like it’s uniformly understood, but I fear that an understanding of the concept eludes substantive researchers who aren’t interested in empirical methods, except as a means of demonstrating that their hypotheses somehow reflect reality.

In particular, I’m thinking of my classmates from my graduate Empirical Methods course.  None of them were actually interested in the mathematics of statistical analysis, and it didn’t help that our professor really wasn’t either.  To me, this represents a huge problem in the training of students of political science.  For example, we were taught that if a linear regression did not yield a beta coefficient which was clearly distinct from zero, then the two variables exhibited no correlation.

The problem with this approach is that it restricts said students to searching for fundamentally linear relationships.  Nassim Nicholas Taleb makes a damning argument against the linear model in The Black Swan.  The gist of it was that there is no such thing as a real linear correlation.  Consider the case of a person who is lost in the desert and thirsty.  Such a person will value a small quantity of water very highly.  However, as this person encounters more and more water, this person will value the water less and less.  Eventually, the wanderer stumbles into an ocean and drowns.  The wanderer’s utility valuation of the water went from zero (how much she values water she doesn’t have) to some high amount, back down to zero (where the stuff is common and effectively free), and then below zero when it begins to harm her.  This is a fundamentally non-linear relationship.  Taleb then prompts the reader to try to think of a relationship which exhibits true, perfect linearity.  Of course, there is none.

I’m not a huge fan of Taleb’s.  His esoteric writing style rubs me the wrong way, and while I’ll be the first to call him a very smart guy with some excellent points, this one is off the deep end in the opposite direction.  Linear correlation exists in cases for which linear relationships exist.  However, linear modeling is perfectly appropriate for analyzing a wide array of non-linear relationships, given appropriate mathematical handling.

Take, for example, this graph I made a few weeks ago concerning the relationship between State Failure (Failed States Index) and Democracy (Polity Score).

There are two things I can tell you from a casual look at this plot: first, there exists some relationship between the polity score and the failed state score.  Second, that relationship is not linear.  For those in my program, a statistically significant linear fit would constitute a publishable result.  In fact, these data happen to yield just such a fit:

Call:
lm(formula = Total ~ polity2)

Residuals:
    Min      1Q  Median      3Q     Max
-47.190 -13.393   3.519  16.036  35.625

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  78.6750     1.9792  39.750  < 2e-16 ***
polity2      -1.6577     0.2693  -6.155 6.23e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21 on 154 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.1974,	Adjusted R-squared: 0.1922
F-statistic: 37.88 on 1 and 154 DF,  p-value: 6.228e-09

Because our p-value is below the .05 publication threshold, this is a result worth discussing.  But note the small R-squared value: This tells us (the non-methodologist researchers) that we need to get back to the drawing board on variables to incorporate.

This is the point at which I must call “Bullshit!”  The small R-squared tells us that a linear model doesn’t describe this relationship very well, but if we incorporate any other variable with a similarly curved relationship to state failure, the R-squared should go through the roof.  This approach tells us very little about what is actually going on, and reinforces the belief that poor showings of R-squared are a product of ill-selected variables.

Instead of going down this road, I tried to find some deeper relationship by looking at the component democracy and autocracy scores, which also yielded nonlinearities.  This was a better approach, but it overlooked a fundamental point about the data: there is a perfectly adequate mathematical description of this curve.  We can still apply linear regression by entering this variable as two terms: linearly (i.e., y = x) and quadratically (y = x^2).

Call:
lm(formula = Total ~ polity2 + I(polity2^2))

Residuals:
    Min      1Q  Median      3Q     Max
-62.759 -11.054   0.989  11.902  35.855

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  98.82394    2.39812  41.209   <2e-16 ***
polity2      -0.30748    0.23939  -1.284    0.201
I(polity2^2) -0.47004    0.04368 -10.760   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.89 on 153 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.5431,	Adjusted R-squared: 0.5372
F-statistic: 90.95 on 2 and 153 DF,  p-value: < 2.2e-16

This is a considerably better fit, don’t you think?  Plotting this function the same way as before, we get this plot:

Better, no?  Given a state’s Polity2 score, we can determine with reasonable confidence a range into which it likely falls in the Failed States Index, though we cannot do so with any degree of predictive accuracy using the simpler linear model.  If we simply must use a simple linear model, there is no reason we cannot still accomplish our goals by first applying the function to our polity2 variable, like so:

z <- (-.30748 * polity2) + (-.47004 * polity2^2) + 98.82394

We now have a variable, z, which is the polity2 score transformed using the second linear model I described.  We could graph this:

And determine that there is, indeed, a linear relationship to be seen.  Now, this is strictly an exercise in correlation demonstration: any correlation can be described as a linear model of the dependent variable over the independent variable using whatever mathematical transformation is appropriate.  However, if you already know that transformation, why bother displaying it linearly?
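One way to see why the linear display adds nothing new is a minimal sketch on simulated data.  The variables x and y below are made-up stand-ins, not the Failed States Index or Polity data, and the curve and noise level are arbitrary assumptions.  Regressing the outcome on the fitted values of the quadratic model should return an intercept of essentially zero, a slope of essentially one, and the same R-squared as the quadratic model itself:

```r
# Simulated stand-ins (assumptions for illustration only)
set.seed(42)
x <- sample(-10:10, 150, replace = TRUE)               # plays the role of polity2
y <- 100 - 0.3 * x - 0.5 * x^2 + rnorm(150, sd = 15)   # plays the role of Total

# Quadratic model, fitted by ordinary linear regression
curved <- lm(y ~ x + I(x^2))

# "Transformed" predictor: the quadratic model's own fitted values
z <- fitted(curved)

# Simple linear model of y on the transformed predictor
straight <- lm(y ~ z)

# Intercept and slope of the straight-line model
print(round(coef(straight), 6))

# R-squared of both models, side by side
print(c(curved   = summary(curved)$r.squared,
        straight = summary(straight)$r.squared))
```

The straight line through z carries exactly the information the quadratic fit already had; it is the same model, redrawn on a different axis.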

Depending upon linear regression is not a crime.  However, believing that the absence of a linear relationship is the same thing as the absence of a relationship is preposterous.  As researchers, we must strive to remember that correlation is a simple concept with some very complex caveats, and that an understanding of correlation comes from spending a lot of time pondering correlation.  Normal people don’t invest that kind of time and effort into mathematical thinking.  It is thus incumbent upon us to recognize the linear relationship for its utility, and then promptly shelve it to begin examining the real world.  Relationships between directly causal variables need not be linear, and that must be clear.
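To make that point concrete, here is a small sketch using simulated data (made up for illustration, with an arbitrary inverted-U shape and noise level) in which the outcome depends heavily on the predictor and yet a straight-line fit finds almost nothing:

```r
# Simulated data (an assumption for illustration, not the real indices):
# y depends strongly on x, but through a symmetric inverted U.
set.seed(1)
x <- seq(-10, 10, length.out = 201)
y <- 100 - 0.5 * x^2 + rnorm(length(x), sd = 5)

# Straight-line fit versus a fit with a squared term
linear    <- lm(y ~ x)
quadratic <- lm(y ~ x + I(x^2))

r2_linear    <- summary(linear)$r.squared
r2_quadratic <- summary(quadratic)$r.squared

cat("Pearson correlation:     ", round(cor(x, y), 3), "\n")
cat("R-squared, linear fit:   ", round(r2_linear, 3), "\n")
cat("R-squared, quadratic fit:", round(r2_quadratic, 3), "\n")
```

Because the curve is symmetric around zero, the upward and downward halves cancel in the linear slope, so a near-zero coefficient here says nothing about whether x matters.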

PS) Here’s my code:

# Load the merged Failed States Index / Polity data
FailedStateIndexPolity.Merge <- read.csv("http://nortalktoowise.com/wp-content/uploads/2011/07/FailedStateIndexPolity.Merge_1.csv")
attach(FailedStateIndexPolity.Merge)
plot(polity2, Total, main="Failed States Index over Polity2 Score")

# Model 1: simple linear fit
model1 <- lm(Total ~ polity2)
summary(model1)
abline(model1)

# Model 2: linear plus quadratic term
model2 <- lm(Total ~ polity2 + I(polity2^2))
summary(model2)
x <- seq(-10, 10)
# Evaluate intercept + b1*x + b2*x^2 over the polity2 range
y <- as.vector(coef(model2) %*% rbind(1, x, x^2))
plot(polity2, Total, main="Failed States Index over Polity2 Score")
lines(x, y, lwd=2)

# Model 3: linear fit on polity2 transformed with model 2's (rounded) coefficients
z <- (-.30748 * polity2) + (-.47004 * polity2^2) + 98.82394
model3 <- lm(Total ~ z)
plot(z, Total, main="Failed States Index over Polity2 Transformed")
abline(model3)
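
A possible variation, sketched here on a simulated stand-in data frame rather than the real file: predict() can build both the fitted curve and the transformed variable straight from the model object, which avoids retyping rounded coefficients and keeps rows with missing values aligned.  The data frame d, its invented numbers, and the names fit and grid are assumptions for illustration; only the column names mirror the ones above.

```r
# Hypothetical stand-in for the merged data, with two missing polity scores
set.seed(7)
polity2 <- sample(-10:10, 158, replace = TRUE)
polity2[c(5, 60)] <- NA
d <- data.frame(polity2 = polity2)
d$Total <- 100 - 0.3 * d$polity2 - 0.5 * d$polity2^2 + rnorm(158, sd = 15)

# Quadratic model; lm() drops the incomplete rows by default
fit <- lm(Total ~ polity2 + I(polity2^2), data = d)

# Fitted curve on an evenly spaced grid; predict() applies the formula,
# squared term included, to the new polity2 values
grid <- data.frame(polity2 = seq(-10, 10, by = 0.5))
grid$Total_hat <- predict(fit, newdata = grid)

plot(Total ~ polity2, data = d, main = "Simulated index over simulated polity score")
lines(grid$polity2, grid$Total_hat, lwd = 2)

# Transformed predictor for every row of d; rows with a missing polity2
# come back as NA, so z stays the same length as d$Total
d$z <- predict(fit, newdata = d)
```

Passing the data as a data frame through the data argument also sidesteps attach(), which can mask variables of the same name elsewhere in the session.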