# Tutorial – make population pyramids with Queensland data

April 13, 2018

In this tutorial I will use Queensland population data already downloaded from the Australian Bureau of Statistics (ABS), and rearrange the data to produce population pyramids using `ggplot`. The population pyramid is one of the most popular ways to visualise the age structure of a population. Each age group is represented by a bar, and the bars are stacked one above the other with the youngest at the bottom and the oldest at the top. With the sexes shown on opposite sides, you get a simple shape.

Over recent years, the structure of populations has markedly changed, and the population pyramids have taken on more of a dome-like structure. This is nicely represented in the Queensland data. You can read more in the article from The Economist.

The data for this tutorial are the same as were used in the previous tutorial in which a spreadsheet was downloaded from the ABS website.

The clean data are included here for convenience.

``````
qld_pop <- read.csv(file = 'http://marquess.me/data/qld_pop.csv')
``````
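Before going further it can help to confirm that the data have the columns the rest of the tutorial relies on. The sketch below assumes the columns are named `Year`, `Age`, `Sex`, and `Population` (the names used in the code later on). It uses a small made-up data frame, `toy_pop`, as a stand-in so that it runs without the download; swap in `qld_pop` to check the real data.

```r
# Made-up stand-in for qld_pop; the values are illustrative only
toy_pop <- data.frame(
  Year       = rep(c(1977, 2017), each = 4),
  Age        = rep(c(0, 1), times = 4),
  Sex        = rep(rep(c("Female", "Male"), each = 2), times = 2),
  Population = c(100, 110, 105, 115, 120, 125, 130, 135)
)

# Columns the later steps expect (an assumption based on the tutorial code)
expected_cols <- c("Year", "Age", "Sex", "Population")
missing_cols  <- setdiff(expected_cols, names(toy_pop))

if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

str(toy_pop)  # column types at a glance
```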

The objectives of this tutorial are to:

• Recode variables in the data into intervals, or bins,
• Perform some reasonably involved data manipulation,
• Make a plot using `ggplot`.

The tools you will use are:

• `dplyr` to manipulate the data,
• `cut_interval` to group individual ages into age bins,
• `mutate` to make new variables,
• `gather` and `spread` to make useful variables in the data set,
• `ggplot` to make the pyramid.

Most of the work is done with the main `tidyverse` library. `ggthemes` is also loaded for `theme_tufte`, and `kableExtra` for formatting tables.

``````
library(tidyverse)
library(kableExtra)
library(ggthemes)
``````

#### Recode the individual ages into age bins

The rows of individual age data need to be grouped into age bins of 5 years. The `cut_interval` function splits a numeric variable into a given number of intervals of equal width. The data have ages from under 1 year old to over 100 that we want to split into 5-year age groups, therefore we need 20 intervals. Each interval is closed on the right (`right = TRUE`, which is passed through to base R's `cut`), so the higher age in each bin is inclusive. We can supply a vector of labels whose length equals the number of intervals.

``````
age_bins <- c('0-5','6-10','11-15','16-20','21-25',
              '26-30','31-35','36-40','41-45','46-50',
              '51-55','56-60','61-65','66-70','71-75',
              '76-80','81-85','86-90','91-95','96+')

# right = TRUE makes the upper age of each bin inclusive
qld_pop$age_bin <- cut_interval(qld_pop$Age, n = 20, right = TRUE, labels = age_bins)
``````
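To see what `cut_interval` does with the bin edges, here is a minimal sketch on a plain vector of ages. It assumes the ages run from 0 to 100 in single years; `ages` and `bins` are illustrative names, not part of the tutorial data.

```r
library(ggplot2)

ages <- 0:100  # one value per single year of age, assuming ages run 0 to 100

# 20 equal-width intervals over the range 0-100, so each spans 5 years
bins <- cut_interval(ages, n = 20)

levels(bins)[1:3]   # interval notation shows which end is inclusive
table(bins)[1:3]    # number of single ages falling in the first three bins
```

Because the lowest value is always included, the first bin covers ages 0 to 5 and holds six single ages, while every later bin holds five. That is why the first label is `0-5` and the second starts at 6.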

#### Prepare the data for plotting

In this part of the exercise we perform a number of steps to convert the line list of population counts for each year, single year of age, and sex into aggregated percentages by sex, age group, and year.

The steps are as follows:

• `filter` the population line list to select the three years we intend to plot,
• `group_by` to group by year, sex, and age group,
• `summarise` to sum the population in each group,
• `spread` the data to obtain male and female columns,
• `mutate` to calculate the percent in each group of males and females,
• remove the redundant male and female population count columns,
• `mutate` the male data to a negative value. This is important for plotting later on because of the way pyramid plots are constructed.
• `rename` the columns,
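• `gather` the data set into long format.

``````
pop_pyr_data_pct <-
  qld_pop %>%
  filter(Year %in% c(1977, 1997, 2017)) %>%       # keep the three years to plot
  group_by(Year, Sex, age_bin) %>%
  summarise(count = sum(Population)) %>%          # total per year, sex and age bin
  spread(Sex, count) %>%                          # one column each for Female and Male
  mutate(pct_F = Female * 100 / sum(Female),
         pct_M = Male * 100 / sum(Male)) %>%      # percent of each sex within a year
  mutate(pct_M = -pct_M) %>%                      # males go on the negative side of zero
  select(-Female, -Male) %>%                      # drop the raw counts
  rename(AgeGrp = age_bin, Female = pct_F, Male = pct_M) %>%
  gather(Sex, Percent, -Year, -AgeGrp)            # back to long format
``````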
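The reshaping in the middle of that pipeline is easier to follow on a tiny example. The sketch below uses a made-up data frame, `toy`, with one year and two age groups (the counts are invented for illustration) and walks through the `spread`, `mutate`, and `gather` steps only.

```r
library(dplyr)
library(tidyr)

# Made-up counts for one year and two age groups; illustrative only
toy <- data.frame(
  Year    = 2017,
  Sex     = c("Female", "Female", "Male", "Male"),
  age_bin = c("0-5", "6-10", "0-5", "6-10"),
  count   = c(30, 70, 40, 60),
  stringsAsFactors = FALSE
)

toy_pct <- toy %>%
  spread(Sex, count) %>%                        # one column per sex
  mutate(Female = Female * 100 / sum(Female),   # share of all females
         Male   = -Male * 100 / sum(Male)) %>%  # share of all males, negated
  gather(Sex, Percent, -Year, -age_bin)         # back to long format

toy_pct
```

Within each sex the percentages add up to 100 in absolute value, and every male value is negative, which is what the pyramid plot needs.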

We obtain a data set that looks like this.

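| Year | AgeGrp | Sex | Percent |
|------|--------|--------|-----------|
| 1971 | 0-5 | Female | 11.509087 |
| 1971 | 6-10 | Female | 9.987273 |
| 1971 | 11-15 | Female | 9.662507 |
| 1971 | 16-20 | Female | 8.697683 |
| 1971 | 21-25 | Female | 8.049023 |
| 1971 | 26-30 | Female | 6.629657 |
| 1971 | 31-35 | Female | 5.737124 |
| 1971 | 36-40 | Female | 5.451008 |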

#### Plot the data

Now we can begin to plot the data using `ggplot`. Each pyramid has two sides, with male and female data displayed on either side of zero on the percentage axis. This is why the male data were converted to negative values. The plot is simply a bar plot with some additional formatting.

Here are the steps to construct the plot:

• load the data and set the aesthetics: age group on x, percent on y, and sex as the fill colour of the bars,
• set `stat = 'identity'` and use a suitable `width` value (personal preference) to separate the bars,
• because the male data are negative, set the axis `breaks` and `labels` manually so that both sides read as positive percentages. The `breaks` and `labels` need to be the same length to align,
• Flip the chart on its side with `coord_flip()`,
• Tinker about with the format and colours of the theme,
• I used `theme_tufte` from `ggthemes` for a nice clean graph and modified the font in this plot,
• Importantly, facet the plot by year so that each year has its own pyramid.

Here is the code for the plot.

``````
ggplot(pop_pyr_data_pct, aes(x = AgeGrp, y = Percent, fill = Sex)) +   # fill bars by sex
  geom_bar(stat = "identity", width = .85) +   # draw the bars
  scale_y_continuous(breaks = seq(-10, 10, length.out = 5),
                     labels = c('10%', '5%', '0', '5%', '10%')) +   # both sides read as positive
  coord_flip() +   # flip axes
  labs(title = "Changes in Queensland population structure since 1977") +
  scale_fill_manual(values = c("#899DA4", "#C93312")) +
  theme_tufte(base_size = 12, base_family = "Avenir") +   # complete theme goes first
  theme(plot.title = element_text(hjust = .5),   # centre plot title
        axis.ticks = element_blank()) +          # tweaks come after so they are not overridden
  facet_grid(. ~ Year)   # one pyramid per year
``````
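As an alternative to typing the axis labels by hand, the `labels` argument also accepts a function. The sketch below is one possible approach, not the method used above: `abs_pct` is a hypothetical helper, and `toy_pyr` is a made-up data frame in the same shape as `pop_pyr_data_pct`.

```r
library(ggplot2)

# Made-up percentages in the same shape as pop_pyr_data_pct; illustrative only
toy_pyr <- data.frame(
  AgeGrp  = factor(rep(c("0-5", "6-10", "11-15"), times = 2),
                   levels = c("0-5", "6-10", "11-15")),
  Sex     = rep(c("Female", "Male"), each = 3),
  Percent = c(9, 8, 7, -10, -8, -6)
)

# Label each break with its absolute value so the male side reads as positive
abs_pct <- function(x) paste0(abs(x), "%")

p <- ggplot(toy_pyr, aes(x = AgeGrp, y = Percent, fill = Sex)) +
  geom_bar(stat = "identity", width = .85) +
  scale_y_continuous(breaks = seq(-10, 10, by = 5), labels = abs_pct) +
  coord_flip()

p
```

With a label function the labels always match the breaks, so changing the `breaks` does not require retyping the labels. Note that this version labels zero as `0%` rather than `0`.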
