So the other day’s experiment with the Failed States Index and the Polity data didn’t yield the linear trend I had originally expected. After all, the two measure fundamentally distinct things. But perhaps there’s another dataset which will match linearly. The same people who made Polity also put out a dataset called the State Fragility Index. Perhaps their definition of Fragility will be similar to the Fund for Peace’s definition of Failure. I merged the 2009 data to take a peek.
SFImergeFSI <- read.csv("http://nortalktoowise.com/wp-content/uploads/2011/07/SFImergeFSI.csv")
attach(SFImergeFSI)
OK! Let’s start this off with a good old-fashioned Linear Model. After all, if these two are measuring the same thing (and they’re both doing it well), they should be very linearly correlated.
model1 <- lm(sfi ~ Total)
summary(model1)

Call:
lm(formula = sfi ~ Total)

Residuals:
    Min      1Q  Median      3Q     Max
-6.6865 -2.4735 -0.2238  2.3175  6.7389

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.31988    0.79689  -10.44   <2e-16 ***
Total        0.23536    0.01039   22.65   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.118 on 159 degrees of freedom
Multiple R-squared:  0.7634, Adjusted R-squared:  0.7619
F-statistic: 513.1 on 1 and 159 DF,  p-value: < 2.2e-16
Alright, that’s not so bad. The R-squared is smaller than expected, but we’ll talk about that in a minute. Also, there’s a small beta of .23, but that’s largely a byproduct of the differing magnitudes of the scales. We could rescale that to just about anything by multiplying the State Fragility Index by a constant. If we cared enough, we could get it to roughly one (which would be sort of ideal) by making the magnitudes of the scales match. The State Fragility Index runs from zero to 25, and the Failed States Index runs from zero to 120, though its scores have not yet reached either extreme (and probably never will). So, if I cared about normalization, I might multiply the State Fragility score by (120/25). Better yet, I’d subtract out the min of the Failed States Index from each value of the FSI, and then multiply the State Fragility score by the ratio of max(Failed States Index)/max(State Fragility Index). But I don’t care that much, because doing stuff like that doesn’t change the topology of the data. If it does, then you’ve done something horribly wrong.
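For anyone who does care, here is a minimal sketch of the point about rescaling. It runs on simulated stand-in data (the numbers are made up to look vaguely like the two indices, they are not the real scores), and the names fit_raw, fit_scaled and sfi_scaled are just mine for illustration. Multiplying the response by 120/25 should multiply the slope by the same factor and leave the R-squared alone.

```r
# Simulated stand-in data, NOT the real index scores.
set.seed(1)
Total <- runif(161, 20, 115)                   # pretend Failed States Index totals
sfi   <- -8 + 0.23 * Total + rnorm(161, sd = 3) # pretend State Fragility scores

# Fit on the original scale
fit_raw <- lm(sfi ~ Total)

# Stretch the 0-25 scale to a 0-120 scale and refit
sfi_scaled <- sfi * (120 / 25)
fit_scaled <- lm(sfi_scaled ~ Total)

# The slope is multiplied by 120/25 ...
print(coef(fit_raw)["Total"])
print(coef(fit_scaled)["Total"])

# ... but the R-squared is identical, because the shape of the cloud is the same
print(summary(fit_raw)$r.squared)
print(summary(fit_scaled)$r.squared)
```

Subtracting a constant from either variable works the same way: the intercept moves, the fit doesn’t.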
Getting back to the data, let’s have a look at the scatterplot.
plot(Total, sfi, main="State Fragility Index over Failed States Index")
abline(model1)
Huh. The model’s OK in numbers, but the scatterplot tells us a great deal more. It doesn’t actually look linear, frankly. I can think of two possible explanations: one, there’s something weird about the way the State Fragility Index assigns values less than 3; or two, it’s just not linear. Let’s look at the nonlinearity possibility.
To do this, we’ll make another linear model, only this one will have two predictors. Let’s explain the State Fragility Index using the Failed States Index and the square of the Failed States Index. This is cool because the model is still linear in its coefficients, so the computer can fit it like any ordinary linear model, and then we can use the coefficients to graph the quadratic function.
model2 <- lm(sfi ~ Total + I(Total^2))
summary(model2)

Call:
lm(formula = sfi ~ Total + I(Total^2))

Residuals:
    Min      1Q  Median      3Q     Max
-7.1186 -1.6911 -0.1213  1.4091  8.2868

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.9030666  1.4915584   0.605   0.5457
Total       -0.0938886  0.0479218  -1.959   0.0518 .
I(Total^2)   0.0025158  0.0003595   6.998 6.97e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.733 on 158 degrees of freedom
Multiple R-squared:  0.8194, Adjusted R-squared:  0.8171
F-statistic: 358.4 on 2 and 158 DF,  p-value: < 2.2e-16
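Another way to ask whether the squared term earns its keep is a nested-model F-test with anova(). The sketch below uses simulated stand-in data (made-up numbers with a bit of curvature baked in, not the real indices), and the names dat, linear_fit and quadratic_fit are mine. With a single added term, the F statistic should simply be the square of that term’s t value from the summary.

```r
# Simulated stand-in data with some curvature, NOT the real index scores.
set.seed(7)
dat <- data.frame(Total = runif(161, 20, 115))
dat$sfi <- 0.9 - 0.09 * dat$Total + 0.0025 * dat$Total^2 + rnorm(161, sd = 2.7)

# The straight-line model and the quadratic model, fit to the same data
linear_fit    <- lm(sfi ~ Total, data = dat)
quadratic_fit <- lm(sfi ~ Total + I(Total^2), data = dat)

# F-test: does adding the squared term reduce the residual sum of squares
# by more than chance would?
comparison <- anova(linear_fit, quadratic_fit)
print(comparison)

# With one extra term, the F statistic equals the squared t value of that term
f_stat <- comparison$F[2]
t_stat <- summary(quadratic_fit)$coefficients["I(Total^2)", "t value"]
print(c(F = f_stat, t_squared = t_stat^2))
```

On the real data the equivalent call would presumably be anova(model1, model2).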
Our adjusted R-squared has improved from about .76 to about .82, and the squared term is highly significant, so this looks like a better fit. The coefficients may look a little wacky, but they make sense when you graph them. By the way, we can’t graph this with abline(), since that only draws straight lines, so we have a little hack for that: picking points uniformly across the x axis and binding them with their appropriate y values.
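plot(Total, sfi, main="State Fragility Index over Failed States Index")
# x values in steps of 1 across the observed range of the Failed States Index
x <- seq(min(Total), max(Total))
# Fitted values: intercept + b1*x + b2*x^2
y <- model2$coef %*% rbind(1, x, x^2)
lines(x, y, lwd=2)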
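The same curve can also be drawn with predict(), which saves doing the matrix arithmetic by hand. This is only a sketch on simulated stand-in data (not the real indices); it assumes the model was fit with a data argument rather than on attached columns, and dat, fit and grid are names I made up.

```r
# Simulated stand-in data with some curvature, NOT the real index scores.
set.seed(42)
dat <- data.frame(Total = runif(161, 20, 115))
dat$sfi <- 0.9 - 0.09 * dat$Total + 0.0025 * dat$Total^2 + rnorm(161, sd = 2.7)

# Fitting with data = ... lets predict() look up Total in new data later
fit <- lm(sfi ~ Total + I(Total^2), data = dat)

# A grid of x values across the observed range, with fitted values from the model
grid <- data.frame(Total = seq(min(dat$Total), max(dat$Total), length.out = 200))
grid$fitted <- predict(fit, newdata = grid)

# Scatterplot with the fitted quadratic on top
plot(dat$Total, dat$sfi, main = "Simulated data with quadratic fit")
lines(grid$Total, grid$fitted, lwd = 2)
```

The fitted values are the same ones the coefficient hack produces; predict() just keeps working if the formula changes.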
Well, that certainly looks a lot better. It also lacks a strong theoretical basis. Why in the world should the Failed States Index define its distribution linearly and the State Fragility Index define it quadratically? Makes no sense to me. That’s why I’m inclined to like the other explanation better: perhaps there’s a methodological thing (technical term) which causes the State Fragility Index to assign values under three differently than all the other values. To test this, let’s break up the dataset.
sub1 <- subset(SFImergeFSI, sfi>3)
And now we should be able to build our linear model!
model3 <- lm(sub1$sfi ~ sub1$Total)
summary(model3)

Call:
lm(formula = sub1$sfi ~ sub1$Total)

Residuals:
    Min      1Q  Median      3Q     Max
-7.3079 -2.4196  0.0314  2.4322  7.9236

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.88426    1.90157  -7.301 3.86e-11 ***
sub1$Total    0.30152    0.02216  13.607  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.108 on 116 degrees of freedom
Multiple R-squared:  0.6148, Adjusted R-squared:  0.6115
F-statistic: 185.1 on 1 and 116 DF,  p-value: < 2.2e-16
This R-squared is .6ish. But the R-squared of our first model was .76ish. We select the subset of the data that looked linear, run our regression, and the model fits less well? What a rip-off! Guess that means I should stick to the non-linear interpretation, but god only knows what that means in practical terms. I’ll let that remain a mystery for today.
plot(sub1$Total, sub1$sfi)
abline(model3)
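One thing that might be muddying the comparison (I haven’t checked this against the real data, so treat it as a hunch): throwing away the low values narrows the spread of the response, and R-squared tends to fall when that happens even if the relationship is perfectly linear. It’s worth noticing that the residual standard error above barely moved between the first model and the subset model (3.118 versus 3.108). The sketch below uses simulated data with a true straight-line relationship, not the real indices, and all the names in it are made up.

```r
# Simulated data with a genuinely linear relationship, NOT the real index scores.
set.seed(2009)
n <- 500
x <- runif(n, 20, 115)
y <- -8 + 0.23 * x + rnorm(n, sd = 3)
full <- data.frame(x = x, y = y)

# Keep only the rows whose response is above 3, like the subset in the post
sub <- subset(full, y > 3)

fit_full <- lm(y ~ x, data = full)
fit_sub  <- lm(y ~ x, data = sub)

# R-squared on the full data versus the truncated data
r2_full <- summary(fit_full)$r.squared
r2_sub  <- summary(fit_sub)$r.squared
print(c(full = r2_full, subset = r2_sub))

# Residual standard errors, for comparison with the R-squared values
print(c(full = summary(fit_full)$sigma, subset = summary(fit_sub)$sigma))
```

So a lower R-squared on the subset doesn’t by itself rule out the “something odd below three” story. It just isn’t the evidence for or against it that I’d hoped for.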
Well, that was frustrating and not especially informative. We now know that the Failed States Index is correlated with the State Fragility Index somehow. That “somehow” reinforces my belief that the Failed States Index Data is totally worth its salt for quantitative analysis, and just waiting for someone much smarter than me to come by and do some analysis. I’m looking at you, R-bloggers.