Gini index to compute inequality or impurity in the data

May 18, 2015
By

(This article was first published on R – My thoughts & learnings, and kindly contributed to R-bloggers)

“Gini index measures the extent to which the distribution of income or consumption expenditure among individuals or households within an economy deviates from a perfectly equal distribution” [1]. A Lorenz curve plots the cumulative percentages of total income received against the cumulative number of recipients, starting with the poorest individual or household. The Gini index measures the area between the Lorenz curve and a hypothetical line of absolute equality, expressed as a percentage of the maximum area under the line. Thus a Gini index of 0 represents perfect equality, while an index of 100 implies perfect inequality.

Simply put Gini index measures the impurity of data D. Some refer to it as inequality. Gini_formula

Now let me show the practical example for the same in R. The package that we use here is ineq (To see the help file use ??help (ineq)) otherwise you can create your own Gini function as shown here like

gini if (!is.numeric(x)) {
warning("'x' is not numeric; returning NA")
return(as.numeric(NA))
}
if (any(na.ind if (!na.rm)
stop("'x' contain NAs")
else
x }
n mu N ox dd dd / (mu * N)
}

Either way, whether you use the ineq package or the aforementioned function it will give you the same result. Now, let us measure the inequality of the river Nile data. In R it can be done as follows;

>data (Nile)

>install.packages(“ineq”)

>library(ineq)

>data(Nile)

>ineq(Nile, type=”Gini”)

[1] 0.1031993

So there is 10% inequality in the Nile data. And if you want to plot this inequality you can do so by using the Lorenzo curve as given

>plot(Lc(Nile),col="darkred",lwd=2).

#This will generate the given plot where the dark red line represents the inequality in the Nile data. It is worthwhile to note that Gini index computation works well for univariate categorical data but it is biased towards multi-valued attributes and has difficulty when the number of classes is large.Gini index plot

Reference

[1] The World Bank. 2014. GINI index (World Bank estimate). [ONLINE] Available at: http://data.worldbank.org/indicator/SI.POV.GINI. [Accessed 18 May 15].

Filed under: pre-processing, R Tagged: descriptive statistics, pre-processing, R

To leave a comment for the author, please follow the link and comment on their blog: R – My thoughts & learnings.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)