International Household Income Inequality data

June 29, 2016

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

I’m at the New Zealand Association of Economists annual conference in Auckland. The opening keynote speech was from James K. Galbraith on a global view of inequality. He showed a variety of results from the University of Texas Inequality Project’s Estimated Household Income Inequality dataset, which I hadn’t realised existed before. It’s the result of a patient and painstaking effort to make the most internationally comparable estimate possible of household inequality, and involves modelling when needed to create predicted inequality based on the best indicators available.

The data are also online, with a whole bunch of supporting material. Well done Professor Galbraith and University of Texas! Here’s a taster view.

First, download the data, bring it into R, and tidy it up from its wide format into a more analysis-friendly tidy or normalised form:

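devtools::install_github("hadley/ggplot2") # dev version needed for subtitle and caption

library(dplyr)    # %>%, filter, mutate, group_by
library(tidyr)    # gather
library(ggplot2)
library(openxlsx) # read.xlsx
library(ggseas)   # stat_index

# ehii_url must hold the link to the EHII spreadsheet on the UTIP site;
# the URL itself is missing from this copy of the post
download.file(ehii_url,
              destfile = "ehii.xlsx", mode = "wb")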

ehii <- read.xlsx("ehii.xlsx")[ , -1] # don't need the first column

ehii_tidy <- ehii %>%
   gather(Year, Gini, -Country, -Code) %>%
   mutate(Year = as.numeric(Year))
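
To make the reshaping step concrete, here is a minimal sketch on a toy data frame laid out the same way as the spreadsheet (one row per country, one column per year). The country names, codes and Gini values are invented for illustration and are not taken from the EHII data:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the wide layout; all names and values are hypothetical
toy_wide <- data.frame(
   Country = c("Country A", "Country B"),
   Code    = c("AAA", "BBB"),
   `1963`  = c(30.1, NA),
   `1964`  = c(31.4, 42.0),
   check.names = FALSE   # keep "1963" and "1964" as column names
)

# Same pattern as above: one row per country-year, with Year as a number
toy_tidy <- toy_wide %>%
   gather(Year, Gini, -Country, -Code) %>%
   mutate(Year = as.numeric(Year))

toy_tidy
```

Missing values survive the reshape as NA rows, which is why they need to be filtered out later before comparing countries.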

Let’s take a first look:

ggplot(ehii_tidy, aes(x = Year, y = Gini, colour = Country)) +
   geom_line() +
   theme(legend.position = "none")


OK, lots of lovely data, not a terribly attractive plot. Not informative either, having chopped off the legend. We should be able to do better than that.

One thing of interest might be which countries have seen the biggest changes over time. Restricting ourselves to just countries with data in 1963 (to make comparison valid), let’s have a go:


Here’s the code that constructed that plot:

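# Countries with a non-missing Gini in the first year of the data (1963)
full_countries <- ehii_tidy %>%
   filter(Year == min(Year) & !is.na(Gini))

# Latest available observation for each of those countries, used for the labels
final_result <- ehii_tidy %>%
   filter(Country %in% full_countries$Country & !is.na(Gini)) %>%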
   group_by(Country) %>%
   mutate(Gini_index = Gini / Gini[1] * 100) %>%
   filter(Year == max(Year)) %>%
   mutate(Year = Year + 1,
          label = paste(Code, round(Gini)))

ehii_tidy %>%
   filter(Country %in% full_countries$Country) %>%
   ggplot(aes(x = Year, y = Gini, colour = Country)) +
   stat_index(index.ref = 1, alpha = 0.3) +
   theme(legend.position = "none") +
   geom_text(data = final_result, aes(label = label, y = Gini_index)) +
   ggtitle("Relative changes in inequality since 1963",
           subtitle = "(for countries with data from 1963)") +
   labs(y = "Index of Gini coefficient, set to be 100 in 1963",
        x = "", caption = "UTIP Estimated Household Income Inequality dataset") +
   xlim(1960, 2012) +
   annotate("text", x = 1977, y = 150, 
            label = "Country codes appear at their final data point; 
numbers are the latest available Gini coefficient")
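
The indexing in that plot sets each country's first observation to 100 and expresses later values relative to it. A minimal sketch of the same arithmetic done by hand with dplyr, on invented numbers rather than the EHII data, might look like this:

```r
library(dplyr)

# Hypothetical Gini values for two made-up countries
toy <- data.frame(
   Country = rep(c("Country A", "Country B"), each = 3),
   Year    = rep(c(1963, 1964, 1965), times = 2),
   Gini    = c(40, 42, 44, 30, 27, 33)
)

# Within each country, divide by the first year's value and scale to 100
toy_indexed <- toy %>%
   group_by(Country) %>%
   arrange(Year, .by_group = TRUE) %>%
   mutate(Gini_index = Gini / first(Gini) * 100) %>%
   ungroup()

toy_indexed
```

Both toy countries end with an index of about 110 even though their Gini levels differ, which is the point of indexing: it compares relative change, not level.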

This is obviously just the beginning. The countries are identified by their three-character ISO codes, which will make it easy to join them with other data for analysis. Maps are an obvious presentation step too, and Galbraith’s team appear to make extensive use of the data for this purpose. I’m looking forward to a closer look when I’ve got more time.
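
As a rough sketch of what joining on the country code could look like, the example below merges a few invented inequality rows with a second, entirely hypothetical table (`other_data` and its `Population_m` column are made up for illustration):

```r
library(dplyr)

# Hypothetical slice of tidy inequality data
ehii_sample <- data.frame(
   Code = c("AAA", "AAA", "BBB"),
   Year = c(1963, 1964, 1963),
   Gini = c(40, 42, 30)
)

# Hypothetical second dataset keyed by the same three-character code
other_data <- data.frame(
   Code         = c("AAA", "CCC"),
   Population_m = c(5.1, 12.3)
)

# left_join keeps every inequality row; codes with no match get NA
joined <- left_join(ehii_sample, other_data, by = "Code")

joined
```

A left join is used here so that no inequality observations are dropped when the other table lacks a country; an inner join would be the choice if only fully matched rows are wanted.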
