# Open data and ecological fallacy

April 28, 2012

(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

A couple of days ago, on Twitter, @alung mentioned an old post I published on this blog about open data, explaining how difficult it was to get access to data in France (the post, published almost 18 months ago, can be found here, in French). And @alung was wondering if it was still that hard to access nice datasets. My first answer was that people are actually more receptive now, and more of them are willing to share their data. And amazing datasets can now be found very easily on the internet. For instance in France, detailed information about qualifications, housing and jobs, by small geographical area, can be found on http://www.recensement.insee.fr (thanks @coulmont for the link). And that is great for researchers (and for anyone willing to check things for themselves).

But one should be aware that those aggregate data might not be sufficient to build econometric models and to infer individual behavior. Thinking that relationships observed for groups necessarily hold for individuals is a common fallacy (the so-called "ecological fallacy").

In a popular paper, Robinson (1950) discussed "ecological inference", stressing the difference between ecological correlations (on groups) and individual correlations (see also Thorndike (1937)). He considered two aggregated quantities, per American state: the percentage of the population that was foreign-born, and the percentage that was literate. One dataset used in the paper was the following

> library(eco)
> data(forgnlit30)
> tail(forgnlit30)
             Y          X         W1          W2 ICPSR
43 0.076931986 0.03097168 0.06834300 0.077206504    66
44 0.006617641 0.11479052 0.03568792 0.002847920    67
45 0.006991899 0.11459207 0.04151310 0.002524065    68
46 0.012793782 0.18491515 0.05690731 0.002785916    71
47 0.007322475 0.13196654 0.03589512 0.002978594    72
48 0.007917342 0.18816461 0.02949187 0.002916866    73
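
To make the columns easier to read, here is a small check on the six rows printed above. It assumes (this is my reading of the dataset, to be confirmed against the `eco` documentation) that `X` is the foreign-born share, `Y` the overall illiteracy rate, and `W1` and `W2` the illiteracy rates among the foreign-born and the native-born. Under that reading, the columns should satisfy the accounting identity Y = W1 X + W2 (1 - X).

```r
# The six rows shown by tail(forgnlit30), typed in by hand so that the
# snippet runs without the eco package.
tail_rows <- data.frame(
  Y  = c(0.076931986, 0.006617641, 0.006991899, 0.012793782, 0.007322475, 0.007917342),
  X  = c(0.03097168, 0.11479052, 0.11459207, 0.18491515, 0.13196654, 0.18816461),
  W1 = c(0.06834300, 0.03568792, 0.04151310, 0.05690731, 0.03589512, 0.02949187),
  W2 = c(0.077206504, 0.002847920, 0.002524065, 0.002785916, 0.002978594, 0.002916866)
)

# Assumed identity: overall rate = mix of the two within-group rates
Y_rebuilt <- with(tail_rows, W1 * X + W2 * (1 - X))
max_gap <- max(abs(Y_rebuilt - tail_rows$Y))

# In how many of these states is illiteracy higher among the foreign-born?
n_foreign_higher <- sum(tail_rows$W1 > tail_rows$W2)

print(max_gap)
print(n_foreign_higher)
```

The identity holds up to rounding on these rows, and `W1` exceeds `W2` in most of them, which is the within-state pattern that the state-level correlation hides.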

The correlation between the proportion of foreign-born and the literacy rate (here 1-Y, since Y is the illiteracy rate) was

> cor(forgnlit30$X,1-forgnlit30$Y)
[1] 0.2069447

So it seems that there is a positive correlation, and a quick interpretation could be that in the 30's, Americans were illiterate but, fortunately, literate immigrants got the idea to come to the US. But here, as in Simpson's paradox, the sign is misleading: it should be negative, as obtained in individual-level studies. In the state-level data, the correlation was positive mainly because foreign-born people tended to live in states where the native-born were relatively literate...
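
The mechanism can be reproduced on a toy example. The sketch below is not Robinson's data: the 48 states, the literacy rates and the foreign-born shares are all made-up numbers, chosen so that the foreign-born are less literate than the native-born within every state, but settle mostly in states where the native-born are more literate.

```r
set.seed(1)

S <- 48      # number of (fictional) states
n <- 2000    # individuals per state

# Assumed parameters, for illustration only
p_native      <- seq(0.80, 0.98, length.out = S)  # literacy rate of native-born
share_foreign <- p_native - 0.78                  # more foreign-born where natives are literate
p_foreign     <- p_native - 0.15                  # foreign-born less literate in every state

state    <- rep(seq_len(S), each = n)
foreign  <- rbinom(S * n, 1, share_foreign[state])
literate <- rbinom(S * n, 1, ifelse(foreign == 1, p_foreign[state], p_native[state]))

# Individual-level correlation (one observation per person)
cor_individual <- cor(foreign, literate)

# Ecological correlation (one observation per state)
cor_ecological <- cor(tapply(foreign, state, mean), tapply(literate, state, mean))

print(c(individual = cor_individual, ecological = cor_ecological))
```

With these assumptions the individual correlation is negative while the state-level one is strongly positive, so the two signs disagree exactly as described above.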

Hence, the problem is clearly how individuals were grouped. Consider the following set of individual observations,

> library(mnormt)  # provides rmnorm()
> n=1000
> r=-.5
> Z=rmnorm(n,c(0,0),matrix(c(1,r,r,1),2,2))
> X=Z[,1]
> E=Z[,2]
> Y=3+2*X+E
> cor(X,Y)
[1] 0.8636764
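
This empirical value is close to what the model implies. As a quick check, using only the simulation's own assumptions (unit variances for X and E, and correlation r between them):

$$
\operatorname{cov}(X,Y) = 2 + r, \qquad
\operatorname{Var}(Y) = 4 + 4r + 1 = 5 + 4r, \qquad
\operatorname{cor}(X,Y) = \frac{2 + r}{\sqrt{5 + 4r}} .
$$

With r = -0.5, this gives 1.5/√3 ≈ 0.866.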

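Consider now some regrouping, e.g. into 20 classes of equal probability for the noise E (stored in Z[,2]), using the quantiles of the standard normal distribution, and taking the average of X and Y within each class,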

> I=cut(Z[,2],qnorm(seq(0,1,by=.05)))
> Yg=tapply(Y,I,mean)
> Xg=tapply(X,I,mean)

Then the correlation is rather different,

> cor(Xg,Yg)
[1] 0.1476422
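
A short calculation, again under the simulation's own assumptions, suggests why the correlation collapses. The classes are built from E, so the group means roughly follow the conditional expectations given E:

$$
\mathbb{E}[X \mid E] = rE, \qquad
\mathbb{E}[Y \mid E] = 3 + 2\,\mathbb{E}[X \mid E] + E = 3 + (1 + 2r)\,E .
$$

With r = -0.5, the factor 1 + 2r is zero: the group means of Y all stay close to 3, whatever the group mean of X. What remains in the grouped correlation is mostly sampling noise within the classes, so its value changes from one simulated sample to the next.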

Here we have a strong positive individual correlation, and a small positive correlation on grouped data, but almost anything is possible, depending on how individuals are grouped.
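
To see how much the grouping variable matters, here is a minimal sketch that regroups the same kind of sample in three different ways. It departs from the code above on two points, both assumptions of mine: the correlated pair (X, E) is built with base R instead of `rmnorm`, and the classes use empirical quantiles. The helper `grouped_cor` is a hypothetical name introduced for the illustration.

```r
set.seed(42)

n <- 1000
r <- -0.5

# Base-R construction of two standard normal variables with correlation r
X <- rnorm(n)
E <- r * X + sqrt(1 - r^2) * rnorm(n)
Y <- 3 + 2 * X + E

# Correlation between group means of X and Y, when individuals are
# split into k classes along the empirical quantiles of `by`
grouped_cor <- function(by, k = 20) {
  breaks <- quantile(by, probs = seq(0, 1, length.out = k + 1))
  g <- cut(by, breaks, include.lowest = TRUE)
  cor(tapply(X, g, mean), tapply(Y, g, mean))
}

cors <- c(
  individual = cor(X, Y),
  by_X = grouped_cor(X),   # grouping on the covariate
  by_E = grouped_cor(E),   # grouping on the noise, as in the text
  by_Y = grouped_cor(Y)    # grouping on the response
)
print(round(cors, 3))
```

Grouping on X (or on Y) pushes the correlation of the group means close to 1, above the individual value, while grouping on E pulls it far below: same individuals, very different ecological correlations.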

Models with random coefficients have been used to make ecological inferences. But that is a long story, and I will probably come back with a more detailed post on that topic, since I am still working on this with @coulmont (following some comments by @frbonnet on his post on recent French elections on http://coulmont.com/blog/).
