In this tutorial I will use Queensland population data already downloaded from the Australian Bureau of Statistics (ABS), and rearrange the data to produce population pyramids using ggplot. The population pyramid is one of the most popular methods to visualise population age structure. Each age group is represented by a bar, and the bars are arranged one above the other, youngest at the bottom and oldest at the top, with the sexes separated on either side. The result is a simple shape.
Over recent years, the structure of populations has markedly changed, and the population pyramids have taken on more of a dome-like structure. This is nicely represented in the Queensland data. You can read more in the article from The Economist.
The clean data are included here for convenience.
qld_pop <- read.csv(file = 'http://marquess.me/data/qld_pop.csv')
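Before going further it can help to confirm that the file has the columns the later code relies on (Year, Sex, Age and Population, as used in the code further down). The sketch below is only an illustration: `missing_cols` is a hypothetical helper, and `toy` is a made-up stand-in so the check can run without downloading anything.

```r
# Column names used later in this tutorial
expected_cols <- c("Year", "Sex", "Age", "Population")

# Hypothetical helper: returns the expected columns that a data frame lacks
missing_cols <- function(df) setdiff(expected_cols, names(df))

# Made-up one-row stand-in for qld_pop, used only to demonstrate the check
toy <- data.frame(Year = 2017, Sex = "Female", Age = 0, Population = 100)

# An empty result means every expected column is present
missing_cols(toy)
```

Running `missing_cols(qld_pop)` on the real data should likewise return an empty character vector if the download worked as expected.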
The objectives of this tutorial are to:
- Recode variables in the data into intervals, or bins,
- Perform some reasonably involved data manipulation,
- Make a plot using ggplot.
The tools you will use are:
- dplyr to manipulate the data,
- cut_interval to group individual ages into age bins,
- mutate to make new variables,
- spread to make useful variables in the data set,
- ggplot to make the pyramid.
The main tidyverse library covers most of this tutorial; kableExtra and ggthemes are also loaded, for table formatting and for the plot theme.
library(tidyverse)
library(kableExtra)
library(ggthemes)
Recode the individual ages into age bins
The rows of individual age data need to be grouped into age bins of 5 years. The cut_interval function allows us to make a set number of groups of equal range. The data have ages from under 1 year old to over 100 that we want to split into 5-year age groups, therefore we need 20 intervals. We use the closed right argument to specify that the higher age in each bin is inclusive. We can provide a vector of labels, one for each interval.
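# Labels for the 20 five-year bins (the first bin also includes age 0)
age_bins <- c('0-5','6-10','11-15','16-20','21-25',
              '26-30','31-35','36-40','41-45','46-50',
              '51-55','56-60','61-65','66-70','71-75',
              '76-80','81-85','86-90','91-95','96+')  # age 95 belongs to '91-95', so the last bin starts at 96

# Cut Age into 20 equal-range intervals and attach the labels
qld_pop$age_bin <- cut_interval(qld_pop$Age, n=20, closed='right', labels=age_bins)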
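To see what cut_interval does with these ages, here is a small sketch on a made-up vector holding every single year of age from 0 to 100 (an assumption about the range of the Age column, based on the description above). It leaves out the custom labels so the interval boundaries chosen by the function stay visible.

```r
library(ggplot2)  # cut_interval() is provided by ggplot2

# Made-up data: every single year of age from 0 to 100
ages <- 0:100

# 20 equal-range bins; with this range the breaks fall at 0, 5, 10, ..., 100
bins <- cut_interval(ages, n = 20)

# Number of single ages that fall in each bin
bin_sizes <- table(bins)
print(bin_sizes)
```

The lowest bin includes its lower boundary, so it holds six single ages (0 to 5) while every other bin holds five. This is why the first label is '0-5' rather than '0-4'.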
Prepare the data for plotting
In this part of the exercise we perform a number of steps to convert the line list of population counts for each year, single year of age, and sex into aggregated percentages by sex, age group, and year. The steps are as follows:
- filter the population line list to select the three years we intend to plot,
- group_by to group by year, sex, and age group,
- summarise to sum the population in each group,
- spread the data to obtain male and female columns,
- mutate to calculate the percent in each age group for males and females,
- mutate the male data to a negative value. This is important for plotting later on because of the way pyramid plots are constructed,
- select to remove the redundant male and female population count columns,
- rename the remaining columns,
- gather the data set into long format.
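pop_pyr_data_pct <- qld_pop %>%
  filter(Year %in% c(1977,1997,2017)) %>%        # keep the three years to plot
  group_by(Year, Sex, age_bin) %>%
  summarise(count=sum(Population)) %>%           # total population per group
  spread(Sex, count) %>%                         # one column per sex
  mutate(pct_F = Female*100/sum(Female),
         pct_M = Male*100/sum(Male)) %>%         # percent of each sex in each age group
  mutate(pct_M = -pct_M) %>%                     # males go to the negative side
  select(-Female, -Male) %>%                     # drop the raw counts
  rename(AgeGrp = age_bin, Female=pct_F, Male=pct_M) %>%
  gather(Sex, Percent, -Year, -AgeGrp)           # back to long format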
We obtain a data set that looks like this.
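As a rough illustration of that shape, here is a sketch of the same reshaping on a tiny made-up data set. The counts are invented and are not ABS figures. The sketch also swaps in pivot_wider() and pivot_longer(), which tidyr now recommends in place of spread() and gather(), so it assumes tidyr 1.0.0 or later; the logic otherwise mirrors the pipeline above.

```r
library(dplyr)
library(tidyr)

# Made-up counts: one year, two age groups, both sexes (already summed per group)
toy <- data.frame(
  Year = 2017,
  Sex = c("Female", "Female", "Male", "Male"),
  AgeGrp = c("0-5", "6-10", "0-5", "6-10"),
  Population = c(30, 70, 40, 60)
)

toy_pct <- toy %>%
  pivot_wider(names_from = Sex, values_from = Population) %>%  # one column per sex
  group_by(Year) %>%
  mutate(Female = Female * 100 / sum(Female),                  # percent within each sex
         Male = -(Male * 100 / sum(Male))) %>%                 # males on the negative side
  ungroup() %>%
  pivot_longer(c(Female, Male), names_to = "Sex", values_to = "Percent")

print(toy_pct)
```

The result has one row per year, age group and sex, with columns Year, AgeGrp, Sex and Percent. The female percentages sum to 100 within the year and the male percentages sum to -100.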
Plot the data
Now we can begin to plot the data using ggplot. Each pyramid plot is a plot of two sides, with male and female data displayed on either side of zero on the x-axis. That was the reason to convert the male data to negative values. The plot is simply a bar plot with some additional formatting.
Here are the steps to construct the plot:
- Load the data and set the aesthetics for age group on x, percent on y, and colour the bars by sex,
- Draw the bars with geom_bar using stat = 'identity' and use a suitable width value (personal preference) to separate the bars,
- Because we have negative values for male data we need to set the breaks and labels manually with scale_y_continuous. The breaks and labels need to be the same length to align,
- Flip the chart on its side with coord_flip,
- Add a main label,
- Tinker about with the format and colours of the theme,
- I used ggthemes for a nice clean graph and modified the font in this plot,
- Importantly, facet the plot by year so that each year has its own pyramid.
Here is the code for the plot.
ggplot(pop_pyr_data_pct, aes(x = AgeGrp, y = Percent, fill = Sex)) + # Fill column
  geom_bar(stat = "identity", width = .85) + # draw the bars
  scale_y_continuous(breaks = seq(-10, 10, length.out = 5),
                     labels = c('10%','5%','0','5%','10%')) + # unsigned labels on both sides
  coord_flip() + # Flip axes
  labs(title = "Changes in Queensland population structure since 1977") +
  scale_fill_manual(values = c("#899DA4", "#C93312")) +
  theme_tufte(base_size = 12, base_family = "Avenir") + # complete theme goes first so the tweaks below are kept
  theme(plot.title = element_text(hjust = .5), axis.ticks = element_blank()) + # Centre plot title
  facet_grid(. ~ Year)
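An alternative to typing the axis labels by hand is to pass a function to the labels argument of scale_y_continuous, so that whatever breaks ggplot picks are shown without their sign. The sketch below shows the idea on made-up data in the same long shape as pop_pyr_data_pct; `abs_pct` is a hypothetical helper name, and geom_col() is used as shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Made-up data in the same long shape as pop_pyr_data_pct
toy <- data.frame(
  AgeGrp = factor(rep(c("0-5", "6-10", "11-15"), times = 2),
                  levels = c("0-5", "6-10", "11-15")),
  Sex = rep(c("Female", "Male"), each = 3),
  Percent = c(40, 35, 25, -45, -30, -25)
)

# Hypothetical helper: format a break as an unsigned percentage
abs_pct <- function(v) paste0(abs(v), "%")

p <- ggplot(toy, aes(x = AgeGrp, y = Percent, fill = Sex)) +
  geom_col(width = .85) +                  # same as geom_bar(stat = "identity")
  scale_y_continuous(labels = abs_pct) +   # labels computed from the breaks
  coord_flip()

print(p)
```

With this approach the breaks and labels cannot fall out of step, at the cost of a little less control over the exact text (zero is shown as "0%" rather than "0").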