
In the area of political science where I am active (European politics), it is very common to simply estimate a model and start drawing inferences immediately. This is a shame, because drawing inferences without first assessing how well the model fits the data can lead to conclusions based on models that fit the data horribly. The simplest and most straightforward way to assess model fit is to simulate data sets from the model, calculate some metric on each, and compare the distribution of that metric to its value in the observed data. I have been playing around with this idea for some time, and I decided to test it on the data from Fearon and Laitin on insurgency, ethnicity and civil war. It is not really my field, but a colleague of mine told me that the models fit the data horribly, so I decided to see if this was indeed the case. The data can be downloaded from here.

Since the dependent variable is the binary variable of civil war onset, a logit model is fitted. One obvious metric on which to compare data generated from the model with the observed data is the proportion of positive cases. If the observed proportion falls within the central 95% interval of the simulated distribution, then the model provides a reasonable fit, at least on this metric.
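The check itself is short. The snippet below is only a sketch with made-up numbers: `obs_prop`, `n_obs` and the stand-in vector `sim_props` are hypothetical and are not taken from the Fearon and Laitin data. It shows how the 2.5% and 97.5% quantiles of the simulated proportions can be compared with the observed proportion.

```r
# Hypothetical inputs for illustration only (not the Fearon and Laitin data)
set.seed(1)
n_obs    <- 1000   # assumed number of observations
obs_prop <- 0.02   # assumed observed proportion of positive cases

# Stand-in for the proportions computed from 2000 model-simulated data sets
sim_props <- rbinom(2000, size = n_obs, prob = 0.02) / n_obs

# Central 95% interval of the simulated distribution
bounds <- unname(quantile(sim_props, probs = c(0.025, 0.975)))

# TRUE if the observed proportion lies inside the interval
inside <- obs_prop >= bounds[1] && obs_prop <= bounds[2]
print(bounds)
print(inside)
```

In a real application `sim_props` would come from data sets simulated from the fitted model rather than from a single call to `rbinom()`.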

The basic procedure is to first estimate the model and then build a loop. Each pass through the loop draws a set of coefficients from a multivariate normal distribution, with the betas from the fitted model as the vector of means and the estimated variance-covariance matrix from the fitted model as the covariance matrix. Multiply the observed data matrix by the vector of simulated coefficients, and plug the resulting vector of linear predictors into the inverse logit function. This gives a vector of probabilities. This vector is then plugged into a binomial distribution (one trial per observation) to generate a new data set of zeros and ones, from which the proportion of ones is recorded.
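To make these steps concrete, here is a minimal sketch on toy data. It is not the script used for the analysis in this post: the data frame `dat`, the variables `x1`, `x2` and `y`, and the number of simulations are all assumptions chosen for illustration. It uses `mvrnorm()` from the MASS package for the multivariate normal draws and `plogis()` as the inverse logit.

```r
library(MASS)  # provides mvrnorm()

set.seed(42)

# Toy data standing in for the real data set (hypothetical variables)
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(-2 + 0.8 * dat$x1 - 0.5 * dat$x2))

# Step 1: estimate the logit model
fit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))
X   <- model.matrix(fit)  # observed data, including the intercept column

# Step 2: simulate data sets and record the proportion of ones in each
n_sims    <- 1000
sim_props <- numeric(n_sims)
for (s in seq_len(n_sims)) {
  # draw one coefficient vector from N(beta_hat, vcov)
  beta_sim <- mvrnorm(1, mu = coef(fit), Sigma = vcov(fit))
  # linear predictor -> probabilities via the inverse logit
  p <- plogis(drop(X %*% beta_sim))
  # one Bernoulli draw per observation gives a new vector of zeros and ones
  y_sim <- rbinom(n, size = 1, prob = p)
  sim_props[s] <- mean(y_sim)
}

# Step 3: compare with the observed proportion
obs_prop <- mean(dat$y)
interval <- quantile(sim_props, probs = c(0.025, 0.975))
print(obs_prop)
print(interval)
```

A quick visual check is `plot(density(sim_props))` followed by `abline(v = obs_prop)`, which marks the observed proportion on the simulated density. Because the toy data are generated from a logit model of the same form, the observed proportion should land near the middle of the simulated distribution here.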

Below is the result. The graph shows the density of the simulated distribution, and the arrow shows where the proportion from the observed data falls. As you can see, the model fits the data rather well on this metric. Here is the R script to reproduce the graph: