Sampling distribution of Gini coefficient

[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers.]

Inequality measures

Part of my motivation for importing the New Zealand Income Survey (NZIS) simulated unit record file provided by Statistics New Zealand was to explore the characteristics of various measures of inequality. In particular, I’m interested in what happens, as sample size changes, to the sampling distributions of summary statistics such as the P90/P10 and P80/P20 ratios and the Gini coefficient.

A Lorenz curve is an attempt to show the full shape of the income distribution at once. It maps the cumulative proportion of the population against the cumulative proportion of income (or whatever variable is of interest) that they own. In the example below, you can read that the poorest 50% of New Zealand individuals earn around 13% of total income in a given week.

This plot was produced with the following R code, which requires the data to have been imported to a database called PlayPen in accordance with the previous post.

library(RODBC)
library(ineq)     # for Lc and Gini
library(dplyr)
library(ggplot2)
library(scales)
library(grid)     # for grid.text and gpar
library(showtext) # for fonts
library(stringr)  # for str_wrap

font_add_google("Poppins", "my")
showtext_auto()

PlayPen <- odbcConnect("PlayPen_Prod")

# keep just the income column, as a numeric vector rather than a data frame
inc <- sqlQuery(PlayPen, "select income from vw_mainheader")$income

lorenz <- Lc(inc)
lorenz_df <- data.frame(prop_pop = lorenz$p, income = lorenz$L) %>%
   mutate(prop_equality = prop_pop)

p1 <- ggplot(lorenz_df, aes(x = prop_pop, y = income)) +
   geom_ribbon(aes(ymax = prop_equality, ymin = income), fill = "yellow") +
   geom_line() +
   geom_abline(slope = 1, intercept = 0) +
   scale_x_continuous("\nCumulative proportion of population", labels = percent) +
   scale_y_continuous("Cumulative proportion of income\n", labels = percent) +
   theme_minimal(base_family = "my") +
   coord_equal() +
   annotate("text", 0.53, 0.32, label = "Inequality\ngap", family = "my") +
   annotate("text", 0.5, 0.6, label = "Complete equality line", angle = 45, family = "my") + 
   ggtitle (
      str_wrap("Cumulative distribution of New Zealand individual weekly income from all sources", 46))

print(p1)

grid.text("Source: Statistics New Zealand\nNational Income Survey 2011 SURF", 0.8, 0.23, 
       gp = gpar(fontfamily = "my", fontsize = 8))
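
For readers without access to the PlayPen database, here is a minimal base-R sketch of the calculation behind a Lorenz curve. The toy income vector and the helper name lorenz_points are made up for illustration; this is not the internals of ineq::Lc, just the same idea of sorting incomes and accumulating shares.

```r
# Hypothetical helper: cumulative population share vs cumulative income share
lorenz_points <- function(x) {
  x <- sort(x)                                # poorest first
  n <- length(x)
  data.frame(
    prop_pop    = (0:n) / n,                  # cumulative share of people
    prop_income = c(0, cumsum(x)) / sum(x)    # cumulative share of income
  )
}

toy_income <- c(0, 100, 200, 300, 400, 1000)  # made-up weekly incomes
lz <- lorenz_points(toy_income)
print(lz)

# Share of total income earned by the poorest half of this toy population
bottom_half <- lz$prop_income[lz$prop_pop == 0.5]
print(bottom_half)
```

Plotting prop_income against prop_pop gives the curve; the 45 degree line is what you would get if every income were identical.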

Choice of measures

Income inequality is notoriously difficult to measure, and the P90/P10 and P80/P20 measures are attempts to deal with the patchy information available, particularly at the top level (the super rich don’t slum it filling in surveys, and are often expected to be particularly disinclined to reveal voluntarily how rich they are; and there are strong incentives to minimise income in a tax context). However, the P90/P10 and similar measures have been criticised (for example, by Thomas Piketty in his excellent book Capital in the Twenty-First Century) on several grounds, the most important of which seem to me to be:

These arguments seem to me cogent, although I see reasons for keeping those measures too. Piketty prefers to talk about the income or wealth of the bottom 50%, the next 40%, then the top 10% and 1%. He dislikes the Gini coefficient because it is difficult to explain (on which I agree with him) and for not really showing the true inequality (on which I disagree). I disagree on the latter point because I think that the Gini coefficient is actually the logical extension of his approach of looking at the cumulative income or wealth of the bottom 50%, next 40% etc; all the Gini coefficient does is take it to the extreme, and it gives a genuinely good measure of total inequality (assuming the source data are ok), including the impact of the top 10%, 1%, 0.1%, etc.
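
The link between group shares and the Gini coefficient can be made concrete with a small sketch. The incomes below are invented, and income_share and gini_from_lorenz are hypothetical helper names. The point is only that the group shares read the cumulative curve at two or three cut-offs, while the Gini coefficient uses every point on it.

```r
toy_income <- c(0, 50, 100, 150, 200, 250, 300, 400, 550, 2000)  # made-up data

# Share of total income held by people between two population quantiles
income_share <- function(x, from, to) {
  x <- sort(x)
  n <- length(x)
  idx <- seq_len(n) / n            # cumulative population share of each person
  sum(x[idx > from & idx <= to]) / sum(x)
}

shares <- c(
  bottom_50 = income_share(toy_income, 0.0, 0.5),
  next_40   = income_share(toy_income, 0.5, 0.9),
  top_10    = income_share(toy_income, 0.9, 1.0)
)
print(round(shares, 3))

# Gini as one minus twice the area under the Lorenz curve (trapezoid rule)
gini_from_lorenz <- function(x) {
  x <- sort(x)
  n <- length(x)
  L <- c(0, cumsum(x)) / sum(x)                # Lorenz curve ordinates
  area <- sum((L[-1] + L[-(n + 1)]) / 2) / n   # trapezoids of width 1/n
  1 - 2 * area
}
print(gini_from_lorenz(toy_income))
```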

Gini coefficient

The curve above yields a Gini coefficient of 0.51, which is high compared to the Gini coefficients that are commonly reported. For example, Statistics New Zealand, via the OECD, reports a Gini coefficient of 0.33 for household income. There are (at least) three reasons for the discrepancy, which make the NZIS a poor choice for inequality measures that are comparable to the most commonly used ones (although still ok for my purposes): the income I have here is weekly rather than annual, individual rather than household, and gross rather than after taxes and transfers.

Let’s be clear – I’m not saying my Gini coefficient is better than Statistics New Zealand’s. If you claim I’m criticising the official measure, you’re misrepresenting me, so please don’t! In this post, I’m exploring the technical characteristics of estimates of Gini coefficients, and I happen to have weekly, individual gross income rather than annual, household, after taxes-and-transfers income to do it with.

Weekly versus annual measures?

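Let’s explore that weekly issue a bit more. Consider two extremes. If everyone’s income were exactly the same every week, the annual Gini coefficient would be the same as the weekly one, 0.51. If instead each person’s income in each week were a random pull from the pool of observed weekly incomes, the annual Gini coefficient would be about 0.09.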

For those not familiar with the Gini coefficient, 1 means complete inequality (ie one person receives all the income) and 0 means complete equality (everyone gets exactly the same income). 0.09 is far lower than any observed economic inequalities, and reflects the absurdity of everyone’s weekly income being random – cleaners don’t spend the occasional week as CEO of a large firm. However, one certainly expects some degree of smoothing.

Taking all those factors into account, the weekly individual income Gini coefficient of 0.51 does not contradict the official household annual figure of 0.33 (which is a relief, of course). Here’s how I worked out that 0.09 number, with a little simulation based on resampling the observed data:

Gini(inc)                   # 0.51

# if completely constant each week
Gini(inc * 52)              # 0.51

# create incomes that simulate each week being a random pull from the pool
random_incomes <- data.frame(
   income = sample(inc, 52 * length(inc), replace = TRUE),
   person = rep(seq_along(inc), 52)) %>%  
   group_by(person) %>%
   summarise(income = sum(income))

Gini(random_incomes$income) # 0.09
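
The same comparison can be tried without the NZIS data. The sketch below is only an illustration under made-up assumptions: the weekly incomes are synthetic (20% zeros, the rest log-normal with arbitrary parameters), and gini_simple is a hypothetical base-R helper using the usual sorted-data formula, standing in for ineq::Gini.

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical helper: Gini coefficient from the sorted-data formula
gini_simple <- function(x) {
  x <- sort(x)
  n <- length(x)
  2 * sum(x * seq_len(n)) / (n * sum(x)) - (n + 1) / n
}

# Synthetic stand-in for weekly incomes (an assumption, not the NZIS)
n_people <- 5000
weekly <- ifelse(runif(n_people) < 0.2, 0,
                 rlnorm(n_people, meanlog = 6.5, sdlog = 0.8))

# Extreme 1: the same income every week, so annual income is 52 * weekly
gini_constant <- gini_simple(weekly * 52)

# Extreme 2: every week is an independent draw from the pool of weekly incomes
draws <- matrix(sample(weekly, 52 * n_people, replace = TRUE), nrow = n_people)
gini_random <- gini_simple(rowSums(draws))

print(round(c(weekly = gini_simple(weekly),
              constant = gini_constant,
              random = gini_random), 3))
```

Multiplying every income by 52 leaves the Gini coefficient unchanged because it is scale invariant, while summing 52 independent draws averages out most of the week-to-week differences between people.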

“Must be positive non zero”?

The literature on Gini coefficients says they can only be fitted to positive, non-zero data, but I see no practical problem with fitting them to a dataset with zeroes and negative values, even one with a large number of zeroes like the NZIS. Including negative numbers means the Lorenz curve briefly dips below zero, and it becomes possible to have a Gini coefficient greater than 1 (for example, if one person has a positive income and everyone else has negative), but this doesn’t strike me as a reason for not using it. The result is intuitively ok and there are no computational difficulties so long as at least one number is positive and the sum of the positive numbers is enough to outweigh the sum of the negative numbers. If anyone can think of a reason for only applying Gini coefficients and Lorenz curves to strictly positive data let me know.
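
A few tiny made-up vectors show the behaviour described above. gini_simple is a hypothetical base-R helper using the usual sorted-data formula without any small-sample correction, so with only four people the "one person has everything" case comes out at 0.75 rather than 1.

```r
# Hypothetical helper: Gini coefficient from the sorted-data formula
gini_simple <- function(x) {
  x <- sort(x)
  n <- length(x)
  2 * sum(x * seq_len(n)) / (n * sum(x)) - (n + 1) / n
}

g_equal    <- gini_simple(c(5, 5, 5, 5))      # everyone the same: 0
g_one_rich <- gini_simple(c(0, 0, 0, 10))     # one person has it all: 0.75
g_negative <- gini_simple(c(-1, -1, -1, 10))  # negatives push it above 1

print(c(equal = g_equal, one_rich = g_one_rich, negative = g_negative))
```

The formula only misbehaves when the total is zero or negative, which matches the condition that the positive values must outweigh the negative ones.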

Sampling distribution of a Gini coefficient

So the question I’m seeking to answer is, how do inequality statistics like the Gini coefficient and the others stand up with small samples of real data? Putting aside non-sampling error (like people misleading the interviewers), what happens with a smaller survey? I know from various sources online that “The small sample variance properties of G are not known, and large sample approximations to the variance of G are poor”.

I’m interested in both the shape of the distribution of estimates of these figures, and their standard errors ie how far out from the “true” value we’d expect the estimates to be. The NZIS has a sample of around 29,000; let’s shrink that down to 30 and see how much the estimated Gini coefficient changes with different randomly selected bunches of 30 people, compared to 1,000, and the original sample of around 30,000. Note that the horizontal axes on the plots below are on differing scales – I did this to keep a visual sense of the shape of the distribution.

This estimator behaves better than I thought it would with real data that has all the weird and wonderful extremes we get with individual economic variables. Certainly by the time the sample size gets up to 1,000 or so there are no outrageous values and the range of estimated values is reasonably narrow, although wide enough to beware of comparisons of small differences at that sample size. This is in contrast to the sample size of 30, where clearly in one instance we got a sample with one high earner and many zeros or negatives, resulting in a Gini coefficient of more than 1! Make a note not to estimate a population’s inequality from such a small sample. For the actual sample size of the New Zealand Income Survey of nearly 30,000, sampling error is negligible. Now, if only we could say the same for the non-sampling error – so much harder to quantify, so much harder to control for. But that is a subject for reflection at a later date.

The above analysis was done by creating a function to conduct the simulation and draw the plot for a given sample size n:

sim_gini <- function(n, reps = 1000){
   results <- data.frame(
      trial = 1:reps,
      estimate = numeric(reps))
   
   set.seed(123) # for reproducibility
   for(i in 1:reps){
      results[i, "estimate"] <- 
         Gini(sample(inc, n, replace = TRUE))
   }
   
   print(results %>%
      ggplot(aes(x = estimate)) +
      geom_density() +
      geom_rug() +
      theme_minimal(base_family = "my") +
      ggtitle(paste("Distribution of estimated Gini coefficient, n =", n)))
   
   grid.text(paste0("Standard error: ", round(sd(results$estimate), 3)), 0.8, 0.6, 
             gp = gpar(fontfamily = "my", fontsize = 9))
   
   grid.text(paste0("95% Confidence interval:\n", 
                    paste(round(quantile(results$estimate, c(0.025, 0.975)), 3), collapse = ", ")), 0.8, 0.5, 
             gp = gpar(fontfamily = "my", fontsize = 9))
}
 
sim_gini(30)

sim_gini(1000)

sim_gini(30000)
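
For anyone who wants to try the idea without the NZIS data or the plotting packages, here is a stripped-down sketch of the same resampling exercise. Everything in it is an assumption for illustration: the "population" is synthetic (20% zeros, the rest log-normal), and gini_simple and gini_se are hypothetical helper names rather than anything from the ineq package.

```r
set.seed(123)  # for reproducibility of this sketch

# Hypothetical helper: Gini coefficient from the sorted-data formula
gini_simple <- function(x) {
  x <- sort(x)
  n <- length(x)
  2 * sum(x * seq_len(n)) / (n * sum(x)) - (n + 1) / n
}

# Synthetic "population" standing in for the NZIS incomes (an assumption)
population <- ifelse(runif(29000) < 0.2, 0,
                     rlnorm(29000, meanlog = 6.5, sdlog = 0.8))

# Standard deviation of the estimate across repeated resamples of size n
gini_se <- function(n, reps = 500) {
  estimates <- replicate(reps,
                         gini_simple(sample(population, n, replace = TRUE)))
  sd(estimates)
}

se_by_n <- sapply(c(30, 1000), gini_se)
names(se_by_n) <- c("n = 30", "n = 1000")
print(round(se_by_n, 3))
```

The numbers will differ from those for the real data, but the pattern of a much smaller standard error at n = 1,000 than at n = 30 should carry over.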

Further exploration of the other inequality measures and their distributions will wait for another post, as this one is long enough already.
