A package was recently released to generate plots in the style of xkcd using R. Being a big fan of the cartoon, I could not resist trying it out. So I set out to produce something like one of Hans Rosling’s bubble plots.
First I needed some data. Spoilt for choice. I scraped some population data broken down by country and retained only the country and population fields.
population.url = "http://en.wikipedia.org/wiki/List_of_countries_by_population"
download.file(population.url, "data/wiki-population.html")
library(XML)
population = readHTMLTable("data/wiki-population.html", which = 2, trim = TRUE)
After a bit of tidying up, this was ready to use.
> head(population)
         region population
1         China 1354040000
2         India 1210569573
3 United States  315901000
4     Indonesia  237641326
5        Brazil  193946886
6      Pakistan  183122000
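The tidying itself is not shown. A minimal sketch of what it might involve, assuming the scraped table stores population as text with thousands separators and that some country names carry footnote markers such as "[a]". The data frame `raw` and its column names are made up for illustration.

```r
# Hypothetical raw table standing in for the scraped Wikipedia data.
raw <- data.frame(
  Country = c("China", "India[a]", "United States"),
  Population = c("1,354,040,000", "1,210,569,573", "315,901,000"),
  stringsAsFactors = FALSE
)

# Keep only the two fields of interest and give them short names.
population <- data.frame(
  # Drop bracketed footnote markers and stray whitespace from the names.
  region = trimws(gsub("\\[.*?\\]", "", raw$Country, perl = TRUE)),
  # Strip thousands separators before converting to numeric.
  population = as.numeric(gsub(",", "", raw$Population)),
  stringsAsFactors = FALSE
)

print(population)
```

Converting to numeric only after the separators are removed matters: `as.numeric("1,354,040,000")` would give `NA` with a warning.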
library(xlsx)
GDP = read.xlsx("data/NY.GDP.MKTP.CD_Indicator_MetaData_en_EXCEL.xls", 1, stringsAsFactors = FALSE)
I simply retained the entries for 2011, which had few missing values.
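Keeping a single year amounts to selecting one column from the wide table. A sketch on a toy data frame, assuming the year columns come through with names like `X2010` and `X2011` (what `make.names()` would produce from a numeric header). The column names and values here are illustrative, not the real World Bank file.

```r
# Toy stand-in for the wide World Bank table (all values are made up).
wide <- data.frame(
  Country.Name = c("Arab World", "Euro area", "Nowhere"),
  Country.Code = c("ARB", "EMU", "NWH"),
  X2010 = c(2.1e12, 1.2e13, 5e9),
  X2011 = c(2.4e12, 1.3e13, NA),
  stringsAsFactors = FALSE
)

# Keep name, code and the 2011 column, then rename to match the post.
GDP <- wide[, c("Country.Name", "Country.Code", "X2011")]
names(GDP) <- c("region", "code", "GDP")

# Count how many entries are missing for the chosen year.
print(sum(is.na(GDP$GDP)))
print(GDP)
```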
Education spending data are also available from the World Bank. These data are a little more patchy, so I kept the most recent value for each country. This required a little fancy footwork.
XPD = read.xlsx("data/SE.XPD.TOTL.GD.ZS_Indicator_MetaData_en_EXCEL.xls", 1, stringsAsFactors = FALSE)
# Returns the last element in x which is not an NA
# last.not.na

After the requisite tidying, these two sets of data were also ready.

> head(GDP)
                                     region code          GDP
1                                Arab World  ARB 2.410300e+12
2                    Caribbean small states  CSS 6.178652e+10
3   East Asia & Pacific (all income levels)  EAS 1.880026e+13
4     East Asia & Pacific (developing only)  EAP 9.313033e+12
5                                 Euro area  EMU 1.307986e+13
6 Europe & Central Asia (all income levels)  ECS 2.215649e+13
> head(XPD)
                                     region education
1                                Arab World  4.337300
2                    Caribbean small states  6.354870
3   East Asia & Pacific (all income levels)  3.766995
4     East Asia & Pacific (developing only)  4.442010
5                                 Euro area  5.910550
6 Europe & Central Asia (all income levels)  5.478525
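The body of `last.not.na` is not reproduced above. One way such a helper could look, together with a row-wise application over hypothetical year columns (`X2009` to `X2011` are illustrative names and the numbers are made up):

```r
# Returns the last element in x which is not an NA
# (or NA if every element is missing).
last.not.na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else x[length(x)]
}

# Toy wide table: one row per region, one column per year.
wide <- data.frame(
  region = c("A", "B", "C"),
  X2009 = c(4.1, NA, NA),
  X2010 = c(4.3, 5.0, NA),
  X2011 = c(NA, 5.2, NA),
  stringsAsFactors = FALSE
)

# Pick out the year columns and take the most recent non-missing value per row.
year.cols <- grep("^X[0-9]{4}$", names(wide))
XPD <- data.frame(
  region = wide$region,
  education = apply(wide[, year.cols], 1, last.not.na),
  stringsAsFactors = FALSE
)
print(XPD)
```

This relies on the year columns being ordered from oldest to newest, so that "last" means "most recent". A region with no data at all keeps an `NA` and can be dropped later.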
Finally I aggregated the three sets of data and removed any rows which were missing either GDP or education statistics.
data data # data # data

Since there was a range of many orders of magnitude in both the population and GDP data, I took logarithms of these columns.

> data[,4]
> data[,3]
> head(data)
               region code population       GDP education
1         Afghanistan  AFG   7.406542 10.282776   1.72998
2             Albania  ALB   6.450553 10.112590   3.26756
3             Algeria  DZA   7.578639 11.275728   4.33730
6              Angola  AGO   7.314063 11.018416   3.47644
7 Antigua and Barbuda  ATG   4.935986  9.048565   2.53790
8           Argentina  ARG   7.603329 11.649378   5.78195
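The merge and log steps above are only partly legible. A sketch of one way to do them in base R, assuming the three tables share a `region` column and that base-10 logarithms were used (which is consistent with the values shown, for example about 7.4 for a population in the tens of millions). The toy numbers are made up.

```r
# Toy versions of the three tidied tables.
population <- data.frame(region = c("A", "B", "C"),
                         population = c(1e6, 2e7, 3e8),
                         stringsAsFactors = FALSE)
GDP <- data.frame(region = c("A", "B", "C"),
                  code = c("AAA", "BBB", "CCC"),
                  GDP = c(1e9, NA, 1e12),
                  stringsAsFactors = FALSE)
XPD <- data.frame(region = c("A", "B", "C"),
                  education = c(4.5, 3.2, 5.1),
                  stringsAsFactors = FALSE)

# Join on the shared region column. all.x = TRUE keeps unmatched countries,
# so missing values show up as NA and can be filtered explicitly.
data <- merge(population, GDP, by = "region", all.x = TRUE)
data <- merge(data, XPD, by = "region", all.x = TRUE)

# Drop rows missing either GDP or education.
data <- data[!is.na(data$GDP) & !is.na(data$education), ]

# Put the columns in the same order as the output in the post.
data <- data[, c("region", "code", "population", "GDP", "education")]

# Compress the wide range of values with base-10 logarithms.
data$population <- log10(data$population)
data$GDP <- log10(data$GDP)
print(data)
```

Referring to columns by name rather than by position avoids taking the logarithm of the wrong column if the merge order changes.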
Then came the fun bit: putting the plot together. There is a great document “An introduction to the xkcd package” by Emilio Torres Manzanera which got me up to speed.
library(xkcd)
library(ggplot2)
xrange
yrange
p
geom_point(aes(education, GDP, size = population), alpha = 0.35, colour = I("red"), data = data) +
scale_size_continuous(name = "log(population)", range = c(5, 20)) +
geom_text(aes(education, GDP, label=code), size=5, family="xkcd", data = data) +
xkcdaxis(xrange,yrange) +
xlab("education (% of GDP)") +
ylab("log(GDP) in $")
print(p)
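The assignments to `xrange`, `yrange` and `p` are cut off above. A plausible reading, given that `xkcdaxis()` is handed the two ranges, is sketched below on toy data. This is an assumption rather than the original code, and the xkcd-specific pieces (`xkcdaxis()` and `family = "xkcd"`) are left out so that the sketch runs without the xkcd package or font installed.

```r
# Toy data in the same shape as the merged table (values are made up).
data <- data.frame(code = c("AAA", "BBB", "CCC"),
                   population = c(6.0, 7.3, 8.5),
                   GDP = c(9.0, 10.5, 12.0),
                   education = c(4.5, 3.2, 5.1),
                   stringsAsFactors = FALSE)

# Axis limits taken from the plotted columns (assumed definitions).
xrange <- range(data$education)
yrange <- range(data$GDP)

# Only build the plot if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot() +
    geom_point(aes(education, GDP, size = population), alpha = 0.35,
               colour = "red", data = data) +
    geom_text(aes(education, GDP, label = code), size = 5, data = data) +
    xlim(xrange) + ylim(yrange) +
    xlab("education (% of GDP)") + ylab("log(GDP) in $")
  print(p)
}
```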
And here is the result. It is interesting that small countries like our neighbour, Lesotho, are spending a large fraction of their GDP on education. I must also confess to having been completely unaware of the existence of Tuvalu (TUV), which is the fourth smallest country in the world (and the smallest country in my data).