xkcd Style Bubble Plot

[This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers.]

A package was recently released to generate plots in the style of xkcd using R. Being a big fan of the cartoon, I could not resist trying it out. So I set out to produce something like one of Hans Rosling’s bubble plots.

First I needed some data. Spoilt for choice. I scraped some population data broken down by country and retained only the country and population fields.

population.url = "http://en.wikipedia.org/wiki/List_of_countries_by_population"
download.file(population.url, "data/wiki-population.html")

library(XML)

population = readHTMLTable("data/wiki-population.html", which = 2, trim = TRUE)

After a bit of tidying up, this was ready to use.

> head(population)
         region population
1         China 1354040000
2         India 1210569573
3 United States  315901000
4     Indonesia  237641326
5        Brazil  193946886
6      Pakistan  183122000
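
The tidying step itself is not shown. As a rough sketch only, assuming the scraped table arrives with character columns (hypothetically named `country` and `population` here) that carry footnote markers and thousands separators, it might look something like this. The toy data frame and the `tidy.population` helper are illustrative and are not the original code.

```r
# Toy stand-in for the scraped table; column names and formatting are assumed.
raw <- data.frame(
  country    = c("China", "India[1]", " United States"),
  population = c("1,354,040,000", "1,210,569,573", "315,901,000"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: clean the names and convert the counts to numbers.
tidy.population <- function(raw) {
  # Drop footnote markers such as "[1]" and any surrounding whitespace
  region <- trimws(gsub("\\[[^]]*\\]", "", raw$country))
  # Remove thousands separators before converting to numeric
  population <- as.numeric(gsub(",", "", raw$population))
  data.frame(region = region, population = population, stringsAsFactors = FALSE)
}

population <- tidy.population(raw)
head(population)
```

Keeping the result to just `region` and `population` matches the two fields retained above.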

Next I got my hands on some Gross Domestic Product (GDP) data from the World Bank. These data came as a spreadsheet which could be sucked into R with little effort.

library(xlsx)

GDP = read.xlsx("data/NY.GDP.MKTP.CD_Indicator_MetaData_en_EXCEL.xls", 1, stringsAsFactors = FALSE)

I simply retained the entries for 2011, which had few missing values.
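
Picking out a single year might be sketched as below. The column names (`Country.Name`, `Country.Code`, `X2010`, `X2011`) are assumptions about how the sheet is imported, and the rows are toy values rather than the real World Bank figures.

```r
# Toy stand-in for the imported GDP sheet: one row per region, one column per year.
# Column names and values are illustrative assumptions.
GDP.raw <- data.frame(
  Country.Name = c("Region A", "Region B", "Region C"),
  Country.Code = c("AAA", "BBB", "CCC"),
  X2010 = c(2.0e12, 1.2e13, NA),
  X2011 = c(2.4e12, 1.3e13, NA),
  stringsAsFactors = FALSE
)

# Keep the identifying columns plus the 2011 values, then rename to short names
GDP <- GDP.raw[, c("Country.Name", "Country.Code", "X2011")]
names(GDP) <- c("region", "code", "GDP")
head(GDP)
```

Rows with no 2011 value stay in at this point as `NA`; they can be dropped later when the data sets are combined.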

Education spending data are also available from the World Bank. These data are a little more patchy, so I kept the most recent value for each country. This required a little fancy footwork.

XPD = read.xlsx("data/SE.XPD.TOTL.GD.ZS_Indicator_MetaData_en_EXCEL.xls", 1,
                stringsAsFactors = FALSE)

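# Returns the last element in x which is not an NA
# (body restored from the description above; returns NA if every element is NA)
#
last.not.na <- function(x) {
    x <- x[!is.na(x)]
    if (length(x) == 0) NA else x[length(x)]
}

After the requisite tidying, these two sets of data were also ready.

> head(GDP)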
                                     region code          GDP
1                                Arab World  ARB 2.410300e+12
2                    Caribbean small states  CSS 6.178652e+10
3   East Asia & Pacific (all income levels)  EAS 1.880026e+13
4     East Asia & Pacific (developing only)  EAP 9.313033e+12
5                                 Euro area  EMU 1.307986e+13
6 Europe & Central Asia (all income levels)  ECS 2.215649e+13
> head(XPD)
                                     region education
1                                Arab World  4.337300
2                    Caribbean small states  6.354870
3   East Asia & Pacific (all income levels)  3.766995
4     East Asia & Pacific (developing only)  4.442010
5                                 Euro area  5.910550
6 Europe & Central Asia (all income levels)  5.478525
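
Keeping the most recent available value for each country, as described for the education data, could be sketched row-wise as follows. The input layout (one column per year, named `X2009` and so on) and the values are assumptions for illustration.

```r
# Toy stand-in for the education sheet: one row per country, one column per year.
# Column names and values are illustrative assumptions.
XPD.raw <- data.frame(
  Country.Name = c("A-land", "B-land", "C-land"),
  X2009 = c(4.1, 5.2, NA_real_),
  X2010 = c(4.3, NA_real_, NA_real_),
  X2011 = c(NA_real_, NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

years <- XPD.raw[, -1]

# For each row, take the last non-missing value. Prepending NA means a row
# with no data at all yields NA rather than a zero-length result.
latest <- apply(years, 1, function(x) tail(c(NA, x[!is.na(x)]), 1))

XPD <- data.frame(region = XPD.raw$Country.Name,
                  education = as.numeric(latest),
                  stringsAsFactors = FALSE)
XPD
```

This relies on the year columns being ordered from oldest to newest, so that "last" means "most recent".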

Finally I aggregated the three sets of data and removed any rows which were missing either GDP or education statistics.
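
One way to do this kind of aggregation is with `merge()` followed by `complete.cases()`. This is a minimal sketch on toy data, assuming the three data frames share a `region` column; it is not necessarily how the original merge was written.

```r
# Toy inputs in the same shape as the three data sets; values are illustrative.
population <- data.frame(region = c("Albania", "Algeria", "Atlantis"),
                         population = c(2.8e6, 3.79e7, 1e3),
                         stringsAsFactors = FALSE)
GDP <- data.frame(region = c("Albania", "Algeria", "Atlantis"),
                  code = c("ALB", "DZA", "ATL"),
                  GDP = c(1.3e10, 1.9e11, NA),
                  stringsAsFactors = FALSE)
XPD <- data.frame(region = c("Albania", "Algeria"),
                  education = c(3.27, 4.34),
                  stringsAsFactors = FALSE)

# Join on the shared region name; all.x = TRUE keeps regions with no education value
data <- merge(population, GDP, by = "region")
data <- merge(data, XPD, by = "region", all.x = TRUE)

# Drop rows missing either GDP or education
data <- data[complete.cases(data[, c("GDP", "education")]), ]

# Put the columns in a convenient order
data <- data[, c("region", "code", "population", "GDP", "education")]
data
```

Joining on country names only works where the spellings agree across sources, so some manual harmonising of names may be needed first.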

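Since there was a range of many orders of magnitude in both the population and GDP data, I took logarithms of these columns.

# Base-10 logarithms of GDP (column 4) and population (column 3);
# the base is inferred from the values shown below
data[,4] <- log10(data[,4])
data[,3] <- log10(data[,3])

> head(data)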
               region code population       GDP education
1         Afghanistan  AFG   7.406542 10.282776   1.72998
2             Albania  ALB   6.450553 10.112590   3.26756
3             Algeria  DZA   7.578639 11.275728   4.33730
6              Angola  AGO   7.314063 11.018416   3.47644
7 Antigua and Barbuda  ATG   4.935986  9.048565   2.53790
8           Argentina  ARG   7.603329 11.649378   5.78195

Then came the fun bit: putting the plot together. There is a great document, “An introduction to the xkcd package” by Emilio Torres Manzanera, which got me up to speed.

library(xkcd)
library(ggplot2)

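# Axis limits for xkcdaxis(); assumed here to span the plotted data
xrange <- range(data$education)
yrange <- range(data$GDP)

p <- ggplot() +
    geom_point(aes(education, GDP, size = population), alpha = 0.35, colour = I("red"), data = data) +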
    scale_size_continuous(name = "log(population)", range = c(5, 20)) +
    geom_text(aes(education, GDP, label=code), size=5, family="xkcd", data = data) +
    xkcdaxis(xrange,yrange) +
    xlab("education (% of GDP)") + ylab("log(GDP) in $")
print(p)
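
The `family = "xkcd"` text depends on the xkcd font being available to R. For readers without the package or the font, roughly the same bubble plot can be sketched with ggplot2 alone. This is a plain-styled approximation, not the xkcd rendering, and the three rows are copied from the `head(data)` output purely as a toy input.

```r
library(ggplot2)

# Toy rows in the same shape as `data`; values taken from the head(data) output.
data <- data.frame(
  code       = c("AFG", "ALB", "DZA"),
  population = c(7.406542, 6.450553, 7.578639),
  GDP        = c(10.282776, 10.112590, 11.275728),
  education  = c(1.72998, 3.26756, 4.33730)
)

# Bubble size encodes log(population); labels use the country codes
p <- ggplot(data, aes(education, GDP)) +
    geom_point(aes(size = population), alpha = 0.35, colour = "red") +
    scale_size_continuous(name = "log(population)", range = c(5, 20)) +
    geom_text(aes(label = code), size = 5) +
    xlab("education (% of GDP)") + ylab("log(GDP) in $")
```

Calling `print(p)` draws the plot on the active graphics device.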

And here is the result. It is interesting that small countries like our neighbour, Lesotho, are spending a large fraction of their GDP on education. I must also confess to having been completely unaware of the existence of Tuvalu (TUV), which is the fourth smallest country in the world (and the smallest country in my data).

[Figure: GDP-education-population]
