
Tutorial – make population pyramids with Queensland data

[This article was first published on Home, and kindly contributed to R-bloggers].

In this tutorial I will use Queensland population data already downloaded from the Australian Bureau of Statistics (ABS), and rearrange the data to produce population pyramids using ggplot2. The population pyramid is one of the most popular methods to visualise population age structure. The chart represents each age group with a bar and stacks the bars one above the other, youngest at the bottom and oldest at the top, with the sexes on opposite sides, which gives a simple shape.

Over recent years, the structure of populations has markedly changed, and the population pyramids have taken on more of a dome-like structure. This is nicely represented in the Queensland data. You can read more in the article from The Economist.

The data for this tutorial are the same as were used in the previous tutorial in which a spreadsheet was downloaded from the ABS website.

The clean data are included here for convenience.

qld_pop <- read.csv(file = 'http://marquess.me/data/qld_pop.csv')

The objectives of this tutorial are to:

The tools you will use are:

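The tidyverse library does most of the work in this tutorial. The ggthemes package supplies the theme_tufte theme used in the final plot, and kableExtra is a table-formatting package.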

library(tidyverse)
library(kableExtra)
library(ggthemes)

Recode the individual ages into age bins

The rows of individual age data need to be grouped into age bins of 5 years. The cut_interval function splits a numeric vector into a given number of intervals of equal width. The data have ages from under 1 year old to over 100 that we want to split into 5-year age groups, therefore we need 20 intervals. cut_interval forwards extra arguments to base R's cut(), so we pass right = TRUE to specify that the higher age in each bin is inclusive (this is also the default in cut()). We can supply a vector of labels, one label per interval.

age_bins <- c('0-5','6-10','11-15','16-20','21-25',
              '26-30','31-35','36-40','41-45','46-50',
              '51-55','56-60','61-65','66-70','71-75',
              '76-80','81-85','86-90','91-95','96+')

qld_pop$age_bin <- cut_interval(qld_pop$Age, n=20, right=TRUE, labels=age_bins)
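As a cross-check on the binning, the sketch below uses base R's cut() with explicit break points rather than cut_interval. The ages vector is made-up toy data, not the ABS data, and the interval labels are R's defaults rather than the age_bins labels. It suggests that, with right-closed intervals, ages 0 to 5 all land in the first bin, so that bin spans six single years of age while each later bin spans five.

```r
# Toy ages chosen to sit on the bin edges (illustrative only, not ABS data)
ages <- c(0, 5, 6, 10, 11, 95, 96, 100)

# 21 break points (0, 5, ..., 100) define 20 bins of width 5
breaks <- seq(0, 100, by = 5)

# right = TRUE: each bin includes its upper bound
# include.lowest = TRUE: age 0 is kept in the first bin instead of becoming NA
bins <- cut(ages, breaks = breaks, right = TRUE, include.lowest = TRUE)

# Show which bin each toy age falls into
print(data.frame(age = ages, bin = bins))
```

Ages 5 and 6 fall into different bins here, which matches the '0-5' and '6-10' labels used in the tutorial.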

Prepare the data for plotting

In this part of the exercise we perform a number of steps to convert the line list of population counts, with one row per year, single year of age, and sex, into an aggregated percentage for each sex, age group, and year. The steps are as follows:

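pop_pyr_data_pct <-
  qld_pop %>%
  filter(Year %in% c(1977,1997,2017)) %>%   # keep the three years to compare
  group_by(Year, Sex, age_bin) %>%
  summarise(count=sum(Population)) %>%   # total population per year, sex and age bin
  spread(Sex, count) %>%   # one column each for Female and Male
  mutate(pct_F = Female*100/sum(Female),  pct_M = Male*100/sum(Male)) %>%   # percentage of each sex's total
  mutate(pct_M = -pct_M) %>%   # negative male values plot on the other side of zero
  select(-Female, -Male) %>%
  rename(AgeGrp = age_bin, Female=pct_F, Male=pct_M) %>%
  gather(Sex, Percent, -Year, -AgeGrp)   # back to long format: one row per year, age group and sex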
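The core of this pipeline is a within-group percentage followed by a sign flip for males. As a minimal sketch of just that arithmetic, the example below uses base R's ave() on a small made-up data frame. The counts are invented for illustration and are not ABS figures, and this is an alternative formulation rather than the pipeline itself.

```r
# Toy counts for one year, three age groups and two sexes (invented numbers)
toy <- data.frame(
  Year   = 2017,
  Sex    = rep(c("Female", "Male"), each = 3),
  AgeGrp = rep(c("0-5", "6-10", "11-15"), times = 2),
  count  = c(30, 50, 20, 40, 40, 20)
)

# Percentage of each age group within its Year and Sex total
group_total <- ave(toy$count, toy$Year, toy$Sex, FUN = sum)
toy$Percent <- 100 * toy$count / group_total

# Flip the sign for males so their bars extend to the other side of zero
toy$Percent <- ifelse(toy$Sex == "Male", -toy$Percent, toy$Percent)

print(toy)
```

Within each sex the absolute percentages sum to 100, which is a quick sanity check worth running on the real data as well.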

We obtain a data set that looks like this.

Year  AgeGrp  Sex     Percent
1971  0-5     Female  11.509087
1971  6-10    Female   9.987273
1971  11-15   Female   9.662507
1971  16-20   Female   8.697683
1971  21-25   Female   8.049023
1971  26-30   Female   6.629657
1971  31-35   Female   5.737124
1971  36-40   Female   5.451008
Plot the data

Now we can begin to plot the data using ggplot. Each pyramid plot is a plot of two sides, with male and female data displayed on either side of zero on the x-axis. That was the reason to convert the male data to negative values. The plot is simply a bar plot with some additional formatting.

Here are the steps to construct the plot:

Here is the code for the plot.

ggplot(pop_pyr_data_pct, aes(x = AgeGrp, y = Percent, fill = Sex)) +   # one bar per age group, filled by sex
  geom_bar(stat = "identity", width = .85) +   # draw the bars
  scale_y_continuous(breaks = seq(-10,10,length.out = 5),labels = c('10%','5%','0','5%','10%')) +   # label both sides as positive percentages
  coord_flip() +  # Flip axes
  labs(title="Changes in Queensland population structure since 1977") +
  scale_fill_manual(values=c("#899DA4", "#C93312")) +
  theme_tufte(base_size = 12, base_family="Avenir") +   # complete theme goes first so it does not overwrite the tweaks below
  theme(plot.title = element_text(hjust = .5),   # Centre plot title
        axis.ticks = element_blank()) +
  facet_grid(. ~ Year)
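The axis labels in the plot code are typed by hand to hide the minus sign on the male side. One possible alternative, sketched here, is a small labelling function. pct_label is a hypothetical helper name introduced for illustration and is not part of ggplot2. Because ggplot2 scales accept a function for the labels argument, it could be passed as labels = pct_label in scale_y_continuous.

```r
# Hypothetical helper: format axis breaks as unsigned percentages
pct_label <- function(x) {
  ifelse(x == 0, "0", paste0(abs(x), "%"))
}

# The same break points used in the plot code
breaks <- seq(-10, 10, length.out = 5)

pct_label(breaks)
# "10%" "5%" "0" "5%" "10%"
```

The advantage of a function over a fixed vector is that the labels stay correct if the breaks are later changed.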

