Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last week, in our Inequality course, we’ve been looking at data. We started with some simulated data, only a few of them

```> library("ineq")
> (income=sort(income))
  19233  23707  53297  61667 218662```

How could we say that there is inequality in this sample? If we look at the wealth owned by the poorest, the poorest person (1 out of 5) owns 5% of the wealth; the bottom two (2 out of 5) own 11%, etc

```> income/sum(income)
 0.05107471
> sum(income[1:2])/sum(income)
 0.1140305
> sum(income[1:3])/sum(income)
 0.2555648
> sum(income[1:4])/sum(income)
 0.4193262```

If we plot those values, we get Lorenz curve

```> plot(Lc(income))
> points(c(0:5)/5,c(0,cumsum(income)/sum(income)),pch=19,col="blue")
``` Now, what if we got 500 observations (and not only 5). In that case, a natural tool to visualize those data (or to be  more specific, their distribution) is the histogram

```> load(url("http://freakonometrics.free.fr/income_500.RData"))
> summary(income)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
2191   23830   42750   77010   87430 2003000
> hist(log(income),probability=TRUE,col="light blue",border="white")
> lines(density(log(income)),col="red")
> u=seq(6,15,length=251)
lines(u,dnorm(u,mean(log(income)),sd(log(income))),col="blue")
``` Here we use an histogram to visualize our sample. But not on the income, on the logarithm of the income (because of some outliers, we cannot visualize anything on the histogram). Now, it is possible to compute Gini index to get some information about inequalities

```> gini=function(x){
+ n=length(x)
+ mu=mean(x)
+ g=2/(n*(n-1)*mu)*sum((1:n)*sort(x))-(n+1)/(n-1)
+ return(g)}
```

The problem is that, in practice, having an index without any confidence interval can be meaningless. To compute a confidence interval, we can use a bootstrap procedure,

```> boot=function(x,f,b=500){
+ n=length(x)
+ F=rep(NA,n)
+ for(s in 1:b){
+ idx=sample(1:n,size=n,replace=TRUE)
+ F[s]=f(x[idx])}
+ return(F)}
> G=boot(income,gini,1000)
> hist(G,col="light blue",border="white",probability=TRUE)
``` The red segments is the 90% confidence interval,

```> quantile(G,c(.05,.95))
5%       95%
0.4954235 0.5743917
```

We did include also a blue line with a Gaussian distribution,

```> segments(quantile(G,.05),1,quantile(G,.95),1,
+ col="red",lwd=2)
> u=seq(.4,.65,length=251)
> lines(u,dnorm(u,mean(G),sd(G)),
+ col="blue")
```

Another popular tool is the Pareto plot, where we plot the logarithm of the cumulative survival function against the logarithm of the income,

```> n=length(income)
> x=log(sort(income))
> y=log((n:1)/n)
> plot(x,y)
``` If points were on a straight line, it would mean that incomes can be modeled with a Pareto distribution. Obviously, it is not the case here.

We’ve seen previously how to get Lorenz curve. Actually, it is possible to get also Lorenz curve for some parametric distribution, e.g. some log-normal ones,

```> plot(Lc(income))
> lines(Lc.lognorm,param=1.5,col="red")
> lines(Lc.lognorm,param=1.2,col="red")
> lines(Lc.lognorm,param=.8,col="red")
``` Here, it sounds reasonnable to claim that a log-normal distribution would be a good fit. But maybe not a Pareto distribution

```> plot(Lc(income))
> lines(Lc.pareto,param=2,col="red")
> lines(Lc.pareto,param=1.5,col="red")
> lines(Lc.pareto,param=1.2,col="red")
``` Actually, it is possible to fit some parametric distributions. Observe that sometimes, we have to change the scale, and use ‘000s of dollars, instead of dollars,

```> library(MASS)
> fitdistr(income/1e3,"gamma")
shape           rate
1.0812757769   0.0140404379
(0.0604530180) (0.0009868055)
```

Now, consider two distributions, a Gamma one, and a log-normal one (fitted with maximum likelihood techniques)

```> (fit_g <- fitdistr(income/1e2,"gamma"))
shape           rate
1.0812757769   0.0014040438
(0.0473722529) (0.0000544185)
> (fit_ln <- fitdistr(income/1e2,"lognormal"))
meanlog       sdlog
6.11747519   1.01091329
(0.04520942) (0.03196789)
```

We can visualize the densities

```> u=seq(0,2e5,length=251)
> hist(income,breaks=seq(0,2005000,by=5000),
+ col=rgb(0,0,1,.5),border="white",
+ xlim=c(0,2e5),probability=TRUE)
> v_g <- dgamma(u/1e2, fit_g\$estimate,
+ fit_g\$estimate)/1e2
> v_ln <- dlnorm(u/1e2, fit_ln\$estimate,
+ fit_ln\$estimate)/1e2
> lines(u,v_g,col="red",lwd=2)
> lines(u,v_ln,col=rgb(1,0,0,.4),lwd=2)
``` Here, it looks like the log-normal is a good candidate. We can also plot the cumulative distribution functions

```> x <- sort(income)
> y <- (1:500)/500
> plot(x,y,type="s",col="black",xlim=c(0,250000))
> v_g <- pgamma(u/1e2, fit_g\$estimate,
+ fit_g\$estimate)
> v_ln <- plnorm(u/1e2, fit_ln\$estimate,
+ fit_ln\$estimate)
> lines(u,v_g,col="red",lwd=2)
> lines(u,v_ln,col=rgb(1,0,0,.4),lwd=2)
``` Now, consider some more realistic situation, where we do not have samples from surveys, but binned data,

```> load(url("http://freakonometrics.free.fr/income_binned.RData"))
low  high number  mean std_err
1     0  4999     95  3606     964
2  5000  9999    267  7686    1439
3 10000 14999    373 12505    1471
4 15000 19999    350 17408    1368
5 20000 24999    329 22558    1428
6 25000 29999    337 27584    1520
> tail(income_binned)
low   high number   mean std_err
46 225000 229999     10 228374    1197
47 230000 234999     13 232920    1370
48 235000 239999     11 236341    1157
49 240000 244999     14 242359    1474
50 245000 249999     11 247782    1487
51 250000    Inf    228 395459  189032
```

There is a new package to model that kind of data,

```> library(binequality)
> n <- nrow(income_binned)
> fit_LN <- fitFunc(ID=rep("Fake Data",n),
+ hb=income_binned[,"number"],
+ bin_min=income_binned[,"low"],
+ bin_max=income_binned[,"high"],
+ obs_mean=income_binned[,"mean"],
+ ID_name="Country", distribution=LNO,
+ distName="LNO")
Time difference of 2.101471 secs
for LNO fit across 1 distributions
```

Here, we can fit a log-normal distribution (see Methods for estimating inequality from binned incomes for more details about the methodology)

```> N=income_binned\$number
> y2=N/sum(N)/diff(income_binned\$low)
> u=seq(min(income_binned\$low),
+  max(income_binned\$low),length=101)
> v=dlnorm(u,fit_LN\$parameters,
+ fit_LN\$parameters)
> plot(u,v,col="blue",type="l",lwd=2)
> for(i in 1:(n-1)) rect(income_binned\$low[i],0,
+ income_binned\$high[i],y2[i],col=rgb(1,0,0,.2),
+ border="white")
``` Here, on the histogram (since we have binned data, it is natural to plot an histogram), we can see that the fitted log-normal distribution is great. Actually, to be honest, data were simulated from a log-normal distribution, so it makes sense….

```> N <- income_binned\$number
> y1 <- cumsum(N)/sum(N)
> u <- seq(min(income_binned\$low),
+ max(income_binned\$low),length=101)
> v <- plnorm(u,fit_LN\$parameters,
+ fit_LN\$parameters)
> plot(u,v,col="blue",type="l",lwd=2)
> for(i in 1:(n-1)) rect(income_binned\$low[i],0,
+ income_binned\$high[i], y1[i],col=rgb(1,0,0,.2))
> for(i in 1:(n-1)) rect(income_binned\$low[i],
+ y1[i],income_binned\$high[i],c(0,y1)[i],
+ col=rgb(1,0,0,.4))
``` For the cumulated distribution function, I consider either the worst case scenario (everyone in a bin get the lower possible income) and the best case (everyone has the highest possible income).

It is also possible to fit all standard GB2 distributions

```> fits=run_GB_family(ID=rep("Fake Data",n),
+ hb=income_binned[,"number"],
+ bin_min=income_binned[,"low"],
+ bin_max=income_binned[,"high"],
+ obs_mean=income_binned[,"mean"],
+ ID_name="Country")
```

```> fits\$fit.filter[,c("gini","aic","bic")]
```

Now, that was fine, but those were simulated data. What about real data now? Here we go

```> data = read.table(
+ "http://freakonometrics.free.fr/us_income.txt",
low  high number_1000s  mean std_err
1     0  4999         4245  1249      50
2  5000  9999         5128  7923      30
3 10000 14999         7149 12389      28
4 15000 19999         7370 17278      26
5 20000 24999         7037 22162      27
6 25000 29999         6825 27185      28
> n <- nrow(data)
> fit_LN <- fitFunc(ID=rep("US",n),
+ hb=data[,"number_1000s"],
+ bin_min=data[,"low"],
+ bin_max=data[,"high"],
+ obs_mean=data[,"mean"],
+ ID_name="Country",
+ distribution=LNO, distName="LNO")
Time difference of 0.1855791 secs
for LNO fit across 1 distributions```

Again, I try to fit a log-normal distribution

```> N=data\$number
> y2=N/sum(N)/diff(data\$low)
> u=seq(min(income_binned\$low),
+ max(income_binned\$low),length=101)
> v=dlnorm(u,fit_LN\$parameters,
+ fit_LN\$parameters)
> plot(u,v,col="blue",type="l",lwd=2)
> for(i in 1:(n-1)) rect(data\$low[i],
+ 0,data\$high[i],y2[i],col=rgb(1,0,0,.2))``` But here, the fit is rather poor. Again, we can estimate all standard GB2 distribution

```>
> fits=run_GB_family(ID=rep("US",n),
+ hb=data[,"number_1000s"],
+ bin_min=data[,"low"],
+ bin_max=data[,"high"],
+ obs_mean=data[,"mean"],
+ ID_name="Country")```

We can get Gini index, AIC and BIC

```> fits\$fit.filter[,c("gini","aic","bic")]
gini      aic      bic
1  4.413431 825368.5 825407.3
2  4.395080 825598.8 825627.9
3  4.451881 825495.7 825524.8
4  4.480850 825881.7 825910.8
5  4.417276 825323.6 825352.7
6  4.922122 832408.2 832427.6
7  4.341036 827065.2 827084.6
8  4.318667 826112.8 826132.2
9        NA 831054.2 831073.6
10       NA       NA       NA```

to see that the best distribution seems to be a Generalized Gamma.