International Household Income Inequality data

[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m at the New Zealand Association of Economists annual conference in Auckland. The opening keynote speech was from James K. Galbraith on a global view of inequality. He showed a variety of results from the University of Texas Inequality Project’s Estimated Household Income Inequality dataset, which I hadn’t realised existed before. It’s the result of a patient and painstaking effort to make the most internationally comparable estimate possible of household inequality, and involves modelling when needed to create predicted inequality based on the best indicators available.

The data are also online, with a whole bunch of supporting material. Well done Professor Galbraith and University of Texas! Here’s a taster view.

First, download the data, bring it into R, and tidy it up from its wide format into a more analysis-friendly tidy or normalised form:

devtools::install_github("hadley/ggplot2") # dev version needed for subtitle and caption

              destfile = "ehii.xlsx", mode = "wb")

ehii <- read.xlsx("ehii.xlsx")[ , -1] # don't need the first column

ehii_tidy <- ehii %>%
   gather(Year, Gini, -Country, -Code) %>%
   mutate(Year = as.numeric(Year))

Let’s take a first look

ggplot(ehii_tidy, aes(x = Year, y = Gini, colour = Country)) +
   geom_line() +
   theme(legend.position = "none")


OK, lots of lovely data, not a terribly attractive plot. Not informative either, having chopped off the legend. We should be able to do better than that.

One thing of interest might be which countries have seen the biggest changes over time. Restricting ourselves to just countries with data in 1963 (to make comparison valid), let’s have a go:


Here’s the code that constructed that plot:

full_countries <- ehii_tidy %>%
   filter(Year = min(Year) & !

final_result <- ehii_tidy %>%
   filter(Country %in% full_countries$Country & ! %>%
   group_by(Country) %>%
   mutate(Gini_index = Gini / Gini[1] * 100) %>%
   filter(Year == max(Year)) %>%
   mutate(Year = Year + 1,
          label = paste(Code, round(Gini)))

ehii_tidy %>%
   filter(Country %in% full_countries$Country) %>%
   ggplot(aes(x = Year, y = Gini, colour = Country)) +
   stat_index(index.ref = 1, alpha = 0.3) +
   theme(legend.position = "none") +
   geom_text(data = final_result, aes(label = label, y = Gini_index)) +
   ggtitle("Relative changes in inequality since 1963",
           subtitle = "(for countries with data from 1963)") +
   labs(y = "Index of Gini coefficient, set to be 100 in 1963",
        x = "", caption = "UTIP Estimated Household Income Inequality dataset") +
   xlim(1960, 2012) +
   annotate("text", x = 1977, y = 150, 
            label = "Country codes appear at their final data point; 
numbers are the latest available Gini coefficient")

This is obviously just the beginning. The countries have their ISO 3 character codes, which will make it easy to join them with other data for analysis. Maps are an obvious presentation step too, and Galbraith’s team look to make extensive use of the data for this purpose. Looking forward to a closer look when I’ve got more time.

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)