Tutorial – make population pyramids with Queensland data
In this tutorial I will use Queensland population data already downloaded from the Australian Bureau of Statistics (ABS), and rearrange the data to produce population pyramids using ggplot. The population pyramid is one of the most popular methods to visualise population age structure. Each age group is represented by a bar, and the bars are stacked one above the other, with the youngest at the bottom, the oldest at the top, and the sexes on separate sides, which gives a simple shape.
Over recent years, the structure of populations has markedly changed, and the population pyramids have taken on more of a dome-like structure. This is nicely represented in the Queensland data. You can read more in the article from The Economist.
The data for this tutorial are the same as were used in the previous tutorial in which a spreadsheet was downloaded from the ABS website.
The clean data are included here for convenience.
qld_pop <- read.csv(file = 'http://marquess.me/data/qld_pop.csv')
The objectives of this tutorial are to:
- Recode variables in the data into intervals, or bins,
- Perform some reasonably involved data manipulation,
- Make a plot using ggplot.
The tools you will use are:
- dplyr to manipulate the data,
- cut_interval to group individual ages into age bins,
- mutate to make new variables,
- gather and spread to make useful variables in the data set,
- ggplot to make the pyramid.
The tidyverse is the main library for this tutorial; kableExtra and ggthemes are also loaded, for table formatting and for the plot theme respectively.
library(tidyverse)
library(kableExtra)
library(ggthemes)
Recode the individual ages into age bins
The rows of individual age data need to be grouped into 5-year age bins. The cut_interval function splits a numeric variable into a given number of intervals of equal width. The data have ages from under 1 year old to over 100 that we want to split into 5-year age groups, therefore we need 20 intervals. We use the closed = 'right' argument to specify that the higher age in each bin is inclusive. We can also supply a vector of labels, with one label for each interval.
age_bins <- c('0-5','6-10','11-15','16-20','21-25',
'26-30','31-35','36-40','41-45','46-50',
'51-55','56-60','61-65','66-70','71-75',
'76-80','81-85','86-90','91-95','96+')
qld_pop$age_bin <- cut_interval(qld_pop$Age, n=20, closed='right', labels=age_bins)
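To check what the bins actually contain, a quick sketch on a plain vector of ages can help. It assumes the ages run from 0 to 100 in single years, as described above; `ages` and `bins` are names made up for this example and the labels are left at their defaults.

```r
library(ggplot2)

# Assumed stand-in for qld_pop$Age: single years of age from 0 to 100
ages <- 0:100

# Twenty equal-width intervals, with default interval labels
bins <- cut_interval(ages, n = 20)

levels(bins)[1:3]   # interval notation for the first three bins
table(bins)         # number of single-year ages falling in each bin
```

With these assumed ages the first interval includes both of its end points, so it holds six single-year ages (0 to 5), while each of the other bins holds five.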
Prepare the data for plotting
In this part of the exercise we perform a number of steps to convert the line list of population counts for each year, single year of age, and sex into an aggregated percentage for each sex, age group, and year. The steps are as follows:
- filter the population line list to select the three years we intend to plot,
- group_by to group by year, sex, and age group,
- summarise to sum the population in each group,
- spread the data to obtain male and female columns,
- mutate to calculate the percent in each group of males and females,
- mutate the male data to a negative value. This is important for plotting later on because of the way pyramid plots are constructed,
- remove the redundant male and female population count columns,
- rename the columns,
- gather the data set into long format.
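pop_pyr_data_pct <- qld_pop %>%
  filter(Year %in% c(1977, 1997, 2017)) %>%      # keep the three years to plot
  group_by(Year, Sex, age_bin) %>%
  summarise(count = sum(Population)) %>%         # total population per group
  spread(Sex, count) %>%                         # one column each for Female and Male
  mutate(pct_F = Female * 100 / sum(Female),
         pct_M = Male * 100 / sum(Male)) %>%     # percent of each sex within a year
  mutate(pct_M = -pct_M) %>%                     # males go to the left of zero
  select(-Female, -Male) %>%
  rename(AgeGrp = age_bin, Female = pct_F, Male = pct_M) %>%
  gather(Sex, Percent, -Year, -AgeGrp)           # back to long format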
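Newer tidyr releases offer pivot_wider and pivot_longer as replacements for spread and gather. As a rough sketch of the same reshaping with those functions, here is a version run on a tiny data set. The `toy` counts are invented purely for illustration and are not ABS figures, and the explicit `group_by(Year)` is my own choice to make the per-year percentages obvious.

```r
library(dplyr)
library(tidyr)

# Toy counts, invented for illustration only (not ABS data)
toy <- tibble(
  Year       = rep(c(1977, 2017), each = 4),
  Sex        = rep(c("Female", "Male"), times = 4),
  age_bin    = rep(rep(c("0-5", "6-10"), each = 2), times = 2),
  Population = c(100, 110, 90, 95, 80, 85, 120, 115)
)

toy_pct <- toy %>%
  group_by(Year, Sex, age_bin) %>%
  summarise(count = sum(Population), .groups = "drop") %>%
  pivot_wider(names_from = Sex, values_from = count) %>%   # Female and Male columns
  group_by(Year) %>%                                       # percentages within each year
  mutate(Female = Female * 100 / sum(Female),
         Male   = -(Male * 100 / sum(Male))) %>%           # negative so males plot left of zero
  ungroup() %>%
  rename(AgeGrp = age_bin) %>%
  pivot_longer(c(Female, Male), names_to = "Sex", values_to = "Percent")

toy_pct
```

A handy sanity check on the result is that the female percentages sum to 100 and the male percentages to -100 within each year.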
We obtain a data set that looks like this.
| Year | AgeGrp | Sex | Percent |
|---|---|---|---|
| 1971 | 0-5 | Female | 11.509087 |
| 1971 | 6-10 | Female | 9.987273 |
| 1971 | 11-15 | Female | 9.662507 |
| 1971 | 16-20 | Female | 8.697683 |
| 1971 | 21-25 | Female | 8.049023 |
| 1971 | 26-30 | Female | 6.629657 |
| 1971 | 31-35 | Female | 5.737124 |
| 1971 | 36-40 | Female | 5.451008 |
Plot the data
Now we can begin to plot the data using ggplot. Each pyramid plot is a plot of two sides, with male and female data displayed on either side of zero on the x-axis. That was the reason to convert the male data to negative values. The plot is simply a bar plot with some additional formatting.
Here are the steps to construct the plot:
- load the data and set the aesthetics for age group on x, percent on y, and colour the bars by sex,
- set stat = 'identity' and use a suitable width value (personal preference) to separate the bars,
- because we have negative values for male data we need to set the axis labels manually with labels and breaks. The breaks and labels need to be the same length to align,
- flip the chart on its side with coord_flip(),
- add a main label,
- tinker about with the format and colours of the theme,
- I used theme_tufte from ggthemes for a nice clean graph and modified the font in this plot,
- importantly, facet the plot by year so that each year has its own pyramid.
Here is the code for the plot.
ggplot(pop_pyr_data_pct, aes(x = AgeGrp, y = Percent, fill = Sex)) +   # fill bars by sex
  geom_bar(stat = "identity", width = .85) +   # draw the bars
  scale_y_continuous(breaks = seq(-10, 10, length.out = 5),
                     labels = c('10%', '5%', '0', '5%', '10%')) +   # unsigned axis labels
  coord_flip() +   # flip axes
  labs(title = "Changes in Queensland population structure since 1977") +
  scale_fill_manual(values = c("#899DA4", "#C93312")) +
  theme_tufte(base_size = 12, base_family = "Avenir") +
  # theme() comes after theme_tufte(), because a complete theme would otherwise reset these tweaks
  theme(plot.title = element_text(hjust = .5),   # centre plot title
        axis.ticks = element_blank()) +
  facet_grid(. ~ Year)
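One way to avoid keeping breaks and labels in step by hand is to pass a labelling function instead of a fixed vector. The sketch below is an optional variation rather than part of the plot above; `abs_pct` is a hypothetical helper name, and the two-row `toy` data frame is made up so that the example runs on its own.

```r
library(ggplot2)

# Hypothetical helper: show a signed percentage as an unsigned label
abs_pct <- function(x) paste0(abs(x), "%")

# Made-up values, just enough to draw one bar on each side of zero
toy <- data.frame(AgeGrp  = c("0-5", "0-5"),
                  Sex     = c("Female", "Male"),
                  Percent = c(8, -9))

p <- ggplot(toy, aes(x = AgeGrp, y = Percent, fill = Sex)) +
  geom_col(width = .85) +
  scale_y_continuous(breaks = seq(-10, 10, by = 5), labels = abs_pct) +
  coord_flip()

abs_pct(seq(-10, 10, length.out = 5))
```

Because the labels are computed from whatever breaks are in use, changing the breaks no longer requires editing a matching label vector. Note that zero is shown as "0%" here rather than "0".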