Violin plots and regional income distribution

[This article was first published on StaTEAstics., and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

While preparing my slides for statistical graphics, a plot really caught my eye when I was playing around with the data.

I started off by plotting the time seriesof GNI per capita by country, and as expected it got quite messy and incomprehensible.

## Download and manipulate the data
library(FAOSTAT)
raw.lst = getWDItoSYB(indicator = c("NY.GNP.PCAP.CD", "SP.POP.TOTL"))
raw.df = raw.lst[["entity"]]
traw.df = translateCountryCode(raw.df, from = "ISO2_WB_CODE", to = "UN_CODE")
mraw.df = merge(traw.df, FAOregionProfile[, c("UN_CODE", "UNSD_MACRO_REG")])
final.df = mraw.df[!is.na(mraw.df$UNSD_MACRO_REG), ]

## Simple ugly time series plot
ggplot(data = final.df, aes(x = Year, y = NY.GNP.PCAP.CD)) +
    geom_line(aes(col = Country)) +
    labs(x = NULL, y = "GNI per capita")

plot of chunk unnamed-chunk-1

So I decided to compute the weighted average by region to examine the regional trends.

## Compute regional aggregates based on UN M49 definition
reg.df = aggRegion(aggVar = "NY.GNP.PCAP.CD", weightVar = "SP.POP.TOTL",
    data = traw.df, keepUnspecified = FALSE, aggMethod = "weighted.mean",
    relationDF = data.frame(UN_CODE = FAOregionProfile[, "UN_CODE"],
                            REG_NAME = FAOregionProfile[, "UNSD_MACRO_REG"]))

## Plot regional aggregates
ggplot(data = reg.df[!is.na(reg.df$NY.GNP.PCAP.CD), ],
    aes(x = Year, y = NY.GNP.PCAP.CD)) +
    geom_line(aes(col = REG_NAME)) +
    labs(x = NULL, y = "GNI per capita", col = "")

plot of chunk unnamed-chunk-2

I can now see the trend clearly, but there are two problems with this approach. First, the variability within region is vast and thus the weighted average or any summary statistic such as quantile can be misleading and it does not tell me what is going on within the regions. Secondly, since a minimum of 65% of the country must be present in order to compute the aggregation, no statistics was available prior to 1985.

While I was carrying out regional comparisons with box-plot and violin plot I thought why not plot them accross time as well! So here is the final graph:

## Time series violin plot
ggplot(data = final.df,
       aes(x = as.character(Year), y = NY.GNP.PCAP.CD)) +
    geom_violin() + scale_y_log10() +
        facet_wrap(~UNSD_MACRO_REG, ncol = 1, scales = "free_y") +
    scale_x_discrete(breaks = as.character((seq(1960, 2010, by = 10))),
                     labels = as.character((seq(1960, 2010, by = 10)))) +
    labs(x = NULL, y = "GNI per capita")

plot of chunk unnamed-chunk-3

Now I can compare the regions, but at the same time I can see the within region income distribution. It amazes me how the income distribution diverges in Europe and Oceania while America and Asia moves towards a bell shaped distribution. Growth in Africa appears to be slow, but there are several countries which are growing at a faster rate and pushing the tail of the distribution. Although some of the variability in the density may have resulted from independence of countries, nonetheless it is still infromative.

To leave a comment for the author, please follow the link and comment on their blog: StaTEAstics..

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)