xkcd Style Bubble Plot
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A package was recently released to generate plots in the style of xkcd using R. Being a big fan of the cartoon, I could not resist trying it out. So I set out to produce something like one of Hans Rosling’s bubble plots.
First I needed some data. Spoilt for choice. I scraped some population data broken down by country and retained only the country and population fields.
population.url = "http://en.wikipedia.org/wiki/List_of_countries_by_population"
download.file(population.url, "data/wiki-population.html")
library(XML)
population = readHTMLTable("data/wiki-population.html", which = 2, trim = TRUE)
After a bit of tidying up, this was ready to use.
> head(population)
region population
1 China 1354040000
2 India 1210569573
3 United States 315901000
4 Indonesia 237641326
5 Brazil 193946886
6 Pakistan 183122000
Next I got my hands on some Gross Domestic Product (GDP) data from the World Bank. These data came as a spreadsheet which could be sucked into R with little effort.
library(xlsx)
GDP = read.xlsx("data/NY.GDP.MKTP.CD_Indicator_MetaData_en_EXCEL.xls", 1, stringsAsFactors = FALSE)
I simply retained the entries for 2011, which had few missing values.
Education spending data are also available from the World Bank. These data are a little more patchy, so I kept the most recent value for each country. This required a little fancy footwork.
XPD = read.xlsx("data/SE.XPD.TOTL.GD.ZS_Indicator_MetaData_en_EXCEL.xls", 1,
stringsAsFactors = FALSE)
# Returns the last element in x which is not an NA
#
last.not.na After the requisite tidying, these two sets of data were also ready.
1 > head(GDP)
region code GDP
1 Arab World ARB 2.410300e+12
2 Caribbean small states CSS 6.178652e+10
3 East Asia & Pacific (all income levels) EAS 1.880026e+13
4 East Asia & Pacific (developing only) EAP 9.313033e+12
5 Euro area EMU 1.307986e+13
6 Europe & Central Asia (all income levels) ECS 2.215649e+13
> head(XPD)
region education
1 Arab World 4.337300
2 Caribbean small states 6.354870
3 East Asia & Pacific (all income levels) 3.766995
4 East Asia & Pacific (developing only) 4.442010
5 Euro area 5.910550
6 Europe & Central Asia (all income levels) 5.478525
Finally I aggregated the three sets of data and removed any rows which were missing either GDP or education statistics.
data data #
data #
data Since there was a range of many orders of magnitude in both the population and GDP data, I took logarithms of these columns.
1 > data[,4] data[,3] head(data)
region code population GDP education
1 Afghanistan AFG 7.406542 10.282776 1.72998
2 Albania ALB 6.450553 10.112590 3.26756
3 Algeria DZA 7.578639 11.275728 4.33730
6 Angola AGO 7.314063 11.018416 3.47644
7 Antigua and Barbuda ATG 4.935986 9.048565 2.53790
8 Argentina ARG 7.603329 11.649378 5.78195
Then came the fun bit: putting the plot together. There is a great document “An introduction to the xkcd package” by Emilio Torres Manzanera which got me up to speed.
library(xkcd)
library(ggplot2)
xrange yrange p geom_point(aes(education, GDP, size = population), alpha = 0.35, colour = I("red"), data = data) +
scale_size_continuous(name = "log(population)", range = c(5, 20)) +
geom_text(aes(education, GDP, label=code), size=5, family="xkcd", data = data) +
xkcdaxis(xrange,yrange) +
xlab("education (% of GDP)") + ylab("log(GDP) in $")
print(p)
And here is the result. Click on the image below to see it at higher resolution. Interesting that small countries like our neighbour, Lesotho, are spending a large fraction of their GDP on education. Also I must confess to having been previously completely unaware of the existence of Tuvalu (TUV), which is the fourth smallest country in the world (and the smallest country in my data).fff
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
