Charting/Mapping the Scottish Vote with R (an rvest/dplyr/tidyr/TopoJSON/ggplot tutorial)

[This article was first published on rud.is » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The BBC did a pretty good job live tracking the Scotland secession vote, but I really didn’t like the color scheme they chose and decided to use the final tally site as the basis for another tutorial using the tools from the Hadleyverse and taking advantage of the fact that newer gdal libraries can read in TopoJSON/GeoJSON files, meaning we can use most of the maps the D3-ers create/use right in R.

We’ll need a few R packages to help us get, clean, format and chart the data:

library(rvest)
library(dplyr)
library(httr) # >0.5
library(tidyr)
library(gpclib)
library(rgeos)
library(sp)
library(maptools)
library(rgdal) # needs gdal > 1.11.0
library(ggplot2)
library(reshape2)
library(gridExtra)

The new rvest package makes it super-fun (and easy) to get data out of web pages (as I’ve mentioned on the sister blog), but said data is still web page data, usually geared towards making things render well in a browser, and we end up having to clean up the extracted fields to get useful data. Since we usually want a data frame from the extraction, an rvest idiom I’ve been playing with involves bundling the element extraction & cleanup code into one function and then using that to build the columns:

# extract data from rvest-ed <div>'s and clean it up a bit
# pass in the rvested HTML object and the CSS selector to extract, also 
# indicating whether we want a number or character vector returned
 
extractAndCleanup <- function(data, selector, make_numeric=FALSE) {
  x <- data %>% html_nodes(selector) %>% html_text()
  x <- gsub("^[[:punct:][:space:]]*|[[:punct:][:space:]]*$", "", x)
  if (make_numeric) x <- as.numeric(gsub("[,[:space:]]*", "", x))
  x
}
 
bbc_vote <- html("http://www.bbc.com/news/events/scotland-decides/results")
 
secede <- data.frame(
  council=bbc_vote %>% extractAndCleanup(".body-row__cell--council"),
  electorate=bbc_vote %>% extractAndCleanup(".body-row__cell--electorate", TRUE),
  yes=bbc_vote %>% extractAndCleanup(".body-row__cell--yes", TRUE),
  no=bbc_vote %>% extractAndCleanup(".body-row__cell--no", TRUE),
  stringsAsFactors=FALSE)

We can then compute whether the vote tally was to secede or not and assign a color in the event we choose to use base graphics for plotting (we won’t for this tutorial). I chose a muted version of the Union Jack red and the official Scottish blue for this exercise.

secede <- secede %>% mutate(gone=yes>no,
                            color=ifelse(gone, "#0065BD", "#CF142B77"))

Getting the map from the BBC site is just as simple. An inspection of the site in any decent browser with a “Developer” mode lets us see the elements being downloaded. For the BBC map, it reads the data from: http://static.bbci.co.uk/news/1.46.4-1166/js/data/maps/l/scotland-elections.js which is a TopoJSON object wrapped in two lines of extra javascript code. We’ll grab that file, clean it up and read the map into R using httr‘s new-ish ability to save to a data file:

GET("http://static.bbci.co.uk/news/1.46.4-1166/js/data/maps/l/scotland-elections.js",
    write_disk("data/scotland.json"), progress())
tmp <- readLines("data/scotland.json")
writeLines(tmp[2], "data/scotland.json")
 
map = readOGR("data/scotland.json", "scotland-elections")

We’ll want to work with the map using Council names, so we need to ensure the names from the extracted div elements match what’s in the TopoJSON file:

secede$council %in% map@data$name
 
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

It looks like we’ll need to clean the names up a bit, but thankfully the names aren’t too far off:

secede$council <- gsub("&", "and", secede$council)
secede[secede$council=="Edinburgh",]$council = "City of Edinburgh"
secede[secede$council=="Glasgow",]$council = "Glasgow City"
secede[secede$council=="Comhairle nan Eilean Siar",]$council = "Na h-Eileanan an Iar"

If we were using base graphics for plotting, we’d also have to ensure the data was in the right order:

secede$council <- factor(secede$council, map@data$name, ordered=TRUE)
secede <- secede %>% arrange(council)

We’re going to use ggplot for the mapping portion, but the normal fortify process didn’t work on this TopoJSON file (some polygon errors emerged), so we’ll take another route and do the data Council name↔id mapping after the fortify call and merge the rest of our data into the map data frame:

map_df <- fortify(map)
 
# manually associate the map id's with the Council names and vote data
councils <- data.frame(id=0:(length(map@data$name)-1),
                       council=as.character(map@data$name))
map_df <- merge(map_df, councils, by="id")
map_df <- merge(map_df, secede, by="council")

Now we can generate the choropleth:

gg <- ggplot()
gg <- gg + geom_map(data=map_df, map=map_df,
                    aes(map_id=id, x=long, y=lat, group=group, fill=color),
                    color="white", size=0.25)
gg <- gg + scale_fill_manual(values=rev(unique(secede$color)),
                             labels=c("Yes", "No"), name="Secede?")
gg <- gg + xlim(extendrange(r=range(coordinates(map)[,1]), f=0.15))
gg <- gg + ylim(extendrange(r=range(coordinates(map)[,2]), f=0.07))
gg <- gg + coord_map()
gg <- gg + labs(x="", y="")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())

A choropleth is all well-and-good, but—since we have the data–let’s add the bar chart to complete the presentation. We’ll combine some dplyr and tidyr calls to melt and subset our data frame:

secede_m <- secede %>%
  gather(variable, value, -council) %>%
  filter(variable %in% c("yes", "no")) %>%
  mutate(value=as.numeric(value))

For this exercise, we’ll plot the 100% stacked bars in order of the “No” votes, and we’ll pre-process this ordering to make the ggplot code easier on the eyes. We start by merging some data back into our melted data frame so we can build the sorted factor by the “No” value column and then make sure the Councils will be in that order:

secede_m <- merge(secede_m, secede, by="council")
secede_m$variable <- factor(secede_m$variable,
                            levels=c("yes", "no"), ordered=TRUE)
secede_m <- secede_m %>% arrange(no, variable)
secede_m$council <- factor(secede_m$council,
                           unique(secede_m$council), ordered=TRUE)

Finally, we can create the 100% stacked bar plot and combine it with the choropleth to build the final product:

gg1 <- ggplot(secede_m, aes(x=council, y=value, fill=factor(variable)))
gg1 <- gg1 + geom_bar(stat="identity", position="fill")
gg1 <- gg1 + scale_fill_manual(values=rev(unique(secede$color)),
                             labels=c("Yes", "No"), name="Secede?")
gg1 <- gg1 + geom_hline(yintercept=0.50, color="gray80")
gg1 <- gg1 + geom_text(aes(label=percent(yes/100)), y=0.08, color="white", size=3)
gg1 <- gg1 + geom_text(aes(label=percent(no/100)), y=0.92, color="white", size=3)
gg1 <- gg1 + coord_flip()
gg1 <- gg1 + labs(x="", y="")
gg1 <- gg1 + theme_bw()
gg1 <- gg1 + theme(panel.grid=element_blank())
gg1 <- gg1 + theme(legend.position="top")
gg1 <- gg1 + theme(panel.border=element_blank())
gg1 <- gg1 + theme(axis.ticks=element_blank())
gg1 <- gg1 + theme(axis.text.x=element_blank())
 
vote <- arrangeGrob(gg1, gg, ncol=2,
                     main=textGrob("Scotland Votes", gp=gpar(fontsize=20)))


(Click for larger version)

I’ve bundled this code up into it’s own github repo. The full project example has a few extra features as

  • it shows how to save the resultant data frame to an R data file (in case the BBC nukes the site)
  • also saves the cleaned-up JSON (getting minimal Scotland shapefiles is tricky so this one’s a keeper even with the polygon errors)
  • wraps all that in if statements so future analysis/vis can work with or without the live data being available.

Hadley really has to stop making R so fun to work with 🙂

To leave a comment for the author, please follow the link and comment on their blog: rud.is » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)