Plotting Scottish census data with some tidyverse magic

November 28, 2018
(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

I’ve been working with the Scottish census recently, to investigate employment in land-based (agriculture, forestry and fishing) industry. A friend of mine has recently moved to Dumfries and Galloway – a rural, farming area of Scotland. He’s commented on the ageing population in the area, so I pulled out the age profile from the census for his civil parish. This post shows how to plot up an age profile from the Scottish census table KS102SC, which is available online.

First up, let’s load our packages and read in the table. Note I’ve skipped the first four header lines and coded "-" as NA. In reality the "-" entries are zeros, so I’ve used `mutate_all` to replace them.

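library(tidyverse)

# Skip the four header lines and read "-" as NA.
# Note: X1 (used below) is the name older readr versions give the unnamed
# first column; readr 2.0 and later call it ...1 instead.
df = read_csv("~/Downloads/temp/KS102SC.csv", skip=4, na="-") %>%
   # "-" means zero in this table, so replace every NA with 0
   mutate_all(~ replace(., is.na(.), 0))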
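As an aside, newer tidyverse releases offer `across()` and `replace_na()` for the same job. The sketch below assumes dplyr 1.0 or later and uses a made-up toy table rather than the census file; restricting the replacement to numeric columns is a design choice that leaves the area names untouched.

```r
library(dplyr)
library(tidyr)

# Toy table standing in for the census file (values are made up).
toy <- tibble(
  area = c("A", "B"),
  `0 to 4` = c(3, NA),
  `5 to 7` = c(NA, 7)
)

# Replace NA with 0 in numeric columns only, leaving the area names alone.
toy_filled <- toy %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))

print(toy_filled)
```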

Next we can filter to the parish of interest, select the columns we’re interested in, convert these to long format, and force the ordering of the age bands (e.g. a band starting at age 8 should come before one starting at age 10, which alphabetical sorting gets wrong). I’ve piped the output of this munging into ggplot and added some styling and an all-important licence statement.

[Figure: Dalton parish population distribution]

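df %>%
   # keep the parish of interest
   filter(X1=="Dalton") %>%
   # drop everything except the age-band columns
   select(-X1, -`All people`, -`Mean age`, -`Median age`, -X21) %>%
   # wide to long: one row per age band, in columns key and value
   gather() %>%
   # fix the factor order to the column order so the bands plot in age order
   mutate(key = reorder(key, seq_along(key))) %>%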
   ggplot(aes(key, value)) +
   geom_col() +
   labs(title="Dalton parish population distribution",
        subtitle="Contains: Scotland's Census data and Scottish Government data\nlicensed under the Open Government Licence v3.0",
        x="",
        y="People") +
   coord_flip() +
   theme_bw() +
   theme(text=element_text(size=20),
         plot.subtitle=element_text(size=10))
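To see why the `reorder(key, seq_along(key))` step matters, here is a minimal base R sketch. The band labels are illustrative stand-ins and may not match the table's column names exactly.

```r
# Age bands in the order they would appear as columns (illustrative labels).
key <- c("0 to 4", "5 to 7", "8 to 9", "10 to 14")

# A plain factor sorts its levels alphabetically, so "10 to 14" jumps ahead.
alphabetical <- levels(factor(key))

# reorder() ranks each level by its position, preserving the original order.
by_position <- levels(reorder(key, seq_along(key)))

print(alphabetical)
print(by_position)
```

`factor(key, levels = unique(key))` would be another way to get the same ordering.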

It’s also of interest to compare one parish against another, so I compared Dalton against Edinburgh. The code is basically as before, with an extra point layer added to the plot. The counts are now converted to proportions of each parish’s total population so the two are comparable.

[Figure: Dalton parish (bars) and Edinburgh (dots) population distribution]

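x = df %>%
   filter(X1=="Dalton" | X1=="Edinburgh") %>%
   select(-`Mean age`, -`Median age`, -X21) %>%
   # add a <column>_prop version of every count: its share of `All people`
   mutate_at(vars(-X1), list(prop = ~ . / `All people`)) %>%
   select(-`All people_prop`) %>%
   # keep the parish name and the proportion columns only
   select(X1, ends_with("prop")) %>%
   gather(key, value, -X1) %>%
   # strip the "_prop" suffix so key holds the plain age band
   separate(key, c("key", "drop"), "_") %>%
   mutate(key = reorder(key, seq_along(key)))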

x %>%
   filter(X1=="Dalton") %>%
   ggplot(aes(key, value)) +
   geom_col() +
   geom_point(data=filter(x, X1=="Edinburgh"), aes(key, value)) +
   scale_y_continuous(labels=scales::percent) +
   labs(title="Dalton parish (bars) and Edinburgh (dots) population distribution",
        subtitle="Contains: Scotland's Census data and Scottish Government data\nlicensed under the Open Government Licence v3.0",
        x="",
        y="People") +
   coord_flip() +
   theme_bw() +
   theme(text=element_text(size=20),
         plot.subtitle=element_text(size=10))
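The count-to-proportion step can be checked on a small made-up table. This is only a sketch, assuming dplyr 1.0 or later and tidyr 1.0 or later; it uses `across()` and `pivot_longer()` in place of `mutate_at()` and `gather()`, and the `area` column and all values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy counts for two made-up areas (not census values).
counts <- tibble(
  area = c("A", "B"),
  `All people` = c(10, 200),
  `0 to 4` = c(1, 50),
  `5 to 7` = c(4, 50),
  `8 to 9` = c(5, 100)
)

props <- counts %>%
  # divide each age-band count by the area's total population
  mutate(across(-c(area, `All people`), ~ .x / `All people`)) %>%
  select(-`All people`) %>%
  # wide to long: one row per area and age band
  pivot_longer(-area, names_to = "key", values_to = "value") %>%
  # keep the bands in column order rather than alphabetical order
  mutate(key = factor(key, levels = unique(key)))

# Each area's shares should sum to 1.
totals <- props %>%
  group_by(area) %>%
  summarise(total = sum(value))
print(totals)
```

Because the proportions overwrite the counts in place, there is no suffix to strip afterwards, so the `separate()` step is not needed in this variant.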

Finally, we can compare distributions for the whole of Scotland against Edinburgh and Dalton using boxplots. I can imagine a beautiful plot with density polygons showing the national data, but I don’t have time to figure it out now!

[Figure: Dalton parish and Edinburgh over Scotland's population distribution (boxplots)]

x = df %>%
   select(-`Mean age`, -`Median age`, -X21) %>%
   # same proportion calculation as before, this time for every row of the table
   mutate_at(vars(-X1), list(prop = ~ . / `All people`)) %>%
   select(-`All people_prop`) %>%
   select(X1, ends_with("prop")) %>%
   gather(key, value, -X1) %>%
   separate(key, c("key", "drop"), "_") %>%
   mutate(key = reorder(key, seq_along(key)))

x %>%
   # drop the national total so only the individual areas feed the boxplots
   filter(X1!="Scotland") %>%
   ggplot(aes(key, value)) +
   geom_boxplot(colour="grey50") +
   geom_point(data=filter(x, X1=="Dalton"), aes(key, value), colour="purple4", shape=4, stroke=2, show.legend=TRUE) +
   geom_point(data=filter(x, X1=="Edinburgh"), aes(key, value), colour="darkorange2", shape=2, stroke=1.5, show.legend=TRUE) +
   scale_y_continuous(labels=scales::percent) +
   labs(title="Dalton parish (purple crosses) and Edinburgh (orange triangles)\nover Scotland's population distribution",
        subtitle="Contains: Scotland's Census data and Scottish Government data\nlicensed under the Open Government Licence v3.0",
        x="",
        y="People") +
   coord_flip() +
   theme_bw() +
   theme(text=element_text(size=20),
         plot.subtitle=element_text(size=10))
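The plot above explains its markers in the title. If a legend is preferred, one option is to map colour and shape to the area name inside `aes()` and set the values with manual scales. The sketch below uses randomly generated shares for hypothetical areas (`area1` to `area30`), not census data.

```r
library(ggplot2)

set.seed(1)
# Made-up shares for 30 hypothetical areas across three age bands.
toy <- data.frame(
  area  = rep(paste0("area", 1:30), times = 3),
  key   = factor(rep(c("0 to 4", "5 to 7", "8 to 9"), each = 30),
                 levels = c("0 to 4", "5 to 7", "8 to 9")),
  value = runif(90, 0.02, 0.10)
)

# The two areas to pick out on top of the boxplots.
highlight <- subset(toy, area %in% c("area1", "area2"))

p <- ggplot(toy, aes(key, value)) +
  geom_boxplot(colour = "grey50") +
  # mapping colour and shape to area is what produces the legend
  geom_point(data = highlight, aes(colour = area, shape = area),
             size = 3, stroke = 1.5) +
  scale_colour_manual(values = c(area1 = "purple4", area2 = "darkorange2")) +
  scale_shape_manual(values = c(area1 = 4, area2 = 2)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "", y = "Share of people", colour = NULL, shape = NULL) +
  coord_flip() +
  theme_bw()

print(p)
```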
