Violin plots and regional income distribution

March 20, 2013

(This article was first published on StaTEAstics., and kindly contributed to R-bloggers)

While preparing my slides on statistical graphics and playing around with the data, I came across a plot that really caught my eye.

I started off by plotting the time series of GNI per capita by country, and as expected it got quite messy and incomprehensible.

## Load the packages used below
library(FAOSTAT)
library(ggplot2)

## Download and manipulate the data
raw.lst = getWDItoSYB(indicator = c("NY.GNP.PCAP.CD", "SP.POP.TOTL"))
raw.df = raw.lst[["entity"]]
traw.df = translateCountryCode(raw.df, from = "ISO2_WB_CODE", to = "UN_CODE")
mraw.df = merge(traw.df, FAOregionProfile[, c("UN_CODE", "UNSD_MACRO_REG")])
## Drop countries that have no macro region assigned
final.df = mraw.df[!is.na(mraw.df$UNSD_MACRO_REG), ]

## Simple ugly time series plot
ggplot(data = final.df, aes(x = Year, y = NY.GNP.PCAP.CD)) +
    geom_line(aes(col = Country)) +
    labs(x = NULL, y = "GNI per capita")

So I decided to compute the weighted average by region to examine the regional trends.

## Compute regional aggregates based on UN M49 definition
reg.df = aggRegion(aggVar = "NY.GNP.PCAP.CD", weightVar = "SP.POP.TOTL",
    data = traw.df, keepUnspecified = FALSE, aggMethod = "weighted.mean",
    relationDF = data.frame(UN_CODE = FAOregionProfile[, "UN_CODE"],
        REG_NAME = FAOregionProfile[, "UNSD_MACRO_REG"]))

## Plot regional aggregates (rows with a missing aggregate are dropped)
ggplot(data = reg.df[!is.na(reg.df$NY.GNP.PCAP.CD), ],
       aes(x = Year, y = NY.GNP.PCAP.CD)) +
    geom_line(aes(col = REG_NAME)) +
    labs(x = NULL, y = "GNI per capita", col = "")

I can now see the trend clearly, but there are two problems with this approach. First, the variability within each region is vast, so the weighted average, or any other summary statistic such as a quantile, can be misleading and does not tell me what is going on within the regions. Second, since data for at least 65% of the countries in a region must be present in order to compute the aggregate, no statistics were available prior to 1985.
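To make the second problem concrete, here is a minimal base R sketch of a population-weighted mean that is only reported when enough countries have data. The toy data frame, the function name weightedRegionMean and the choice to measure coverage as a share of countries are all assumptions made for illustration; this is not how aggRegion is implemented.

```r
## Hypothetical toy data for a single year: two regions, three countries each.
## All values are made up for illustration.
toy.df = data.frame(
    region = c("A", "A", "A", "B", "B", "B"),
    gni    = c(1000, 3000, NA, 500, NA, NA),
    pop    = c(10, 30, 5, 20, 40, 40)
)

## Population-weighted mean that returns NA when coverage is too low.
## `threshold` is the minimum share of countries that must have data
## (an assumed rule, mirroring the 65% requirement described above).
weightedRegionMean = function(x, w, threshold = 0.65) {
    ok = !is.na(x) & !is.na(w)
    if (mean(ok) < threshold) return(NA_real_)
    weighted.mean(x[ok], w[ok])
}

## One aggregate per region
reg.toy = sapply(split(toy.df, toy.df$region),
                 function(d) weightedRegionMean(d$gni, d$pop))
```

Region A has data for two of its three countries, so it gets an aggregate; region B has data for only one of three, so its aggregate is missing. This is the kind of gap that appears in the early years of the series.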

While I was carrying out regional comparisons with box plots and violin plots, I thought: why not plot them across time as well? So here is the final graph:

## Time series violin plot
ggplot(data = final.df,
       aes(x = as.character(Year), y = NY.GNP.PCAP.CD)) +
    geom_violin() + scale_y_log10() +
    facet_wrap(~UNSD_MACRO_REG, ncol = 1, scales = "free_y") +
    scale_x_discrete(breaks = as.character(seq(1960, 2010, by = 10)),
                     labels = as.character(seq(1960, 2010, by = 10))) +
    labs(x = NULL, y = "GNI per capita")

Now I can compare the regions and, at the same time, see the income distribution within each region. It amazes me how the income distribution diverges in Europe and Oceania, while America and Asia move towards a bell-shaped distribution. Growth in Africa appears to be slow, but several countries are growing at a faster rate and pushing out the tail of the distribution. Although some of the variability in the density may result from countries gaining independence, the plot is still informative.
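For readers unfamiliar with violin plots, each violin is a kernel density estimate mirrored about its axis, and with a log scale on the y axis the density is estimated on the log-transformed values. The base R sketch below shows that idea on simulated incomes; the data frame sim.df and all of its numbers are made up for illustration and are not WDI data.

```r
## Simulated incomes for one hypothetical region in two years
## (log-normal draws with assumed parameters, not real data)
set.seed(1)
sim.df = data.frame(
    Year = rep(c("1970", "2010"), each = 50),
    gni  = c(rlnorm(50, meanlog = log(1000), sdlog = 0.3),
             rlnorm(50, meanlog = log(5000), sdlog = 1.0))
)

## Work on the log10 scale, as geom_violin() does under scale_y_log10()
log.lst = split(log10(sim.df$gni), sim.df$Year)

## One kernel density estimate per year: the outline of each violin
dens.lst = lapply(log.lst, density)

## Standard deviation on the log scale: a wider value means a longer violin
spread = sapply(log.lst, sd)
```

A distribution that diverges over time, as described above for Europe and Oceania, shows up as a growing spread, while a bell-shaped one has a single bulge in its density.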
