An R tutorial to download and plot some Queensland population data
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
In this tutorial I will download Queensland population data from the Australian Bureau of Statistics (ABS), tidy it up a bit, and plot some graphs with ggplot. The data in the spreadsheet are not in the most useful format so this tutorial will use tidyverse tools to manipulate the data into something a bit more useful.
The objectives are to:
- Download the spreadsheet from the ABS,
- Extract data from the spreadsheet,
- Wrangle the data to get into into a useful format,
- Plot some of the data using
ggplot.
The tools you will use are:
readxlto selectively use data from a spreadsheet,stringrto tidy up column names,kableandkableExtrato display data in a pretty tablelubridateto change a date format,gatherandseparateto make useful variables in the data set,ggplotto make a line graph,wesandersoncolour themes for nice plot colours,- you will also see how to make a custom
ggplottheme.
Load the required libraries. I use the wesanderson package for picking manual colours in my plots.
library(tidyverse) library(lubridate) library(readxl) library(stringr) library(wesanderson) library(kableExtra)
Make a custom ggplot theme
Making a theme is not too hard. You reaaly just have to type a ggplot theme name at the console without the parentheses and copy what you need into a function. This is my custom ggplot theme. It sets some ggplot options so I don’t have to add them to the plotting code.
theme_marquess <- function(){
theme_minimal(base_size = 12, base_family="Avenir") %+replace%
theme(rect = element_rect(fill = 'white', linetype = 0, colour = NA, size=NA),
panel.background = element_blank(),
panel.grid.minor.x = element_line(colour = NA, size = NA),
panel.grid.minor.y = element_line(colour = 'gray90', size = 0.2),
panel.grid.major.x = element_line(size = NA, colour = NA),
panel.grid.major.y = element_line(size = 0.3, linetype = 'solid', colour = "gray80"),
strip.background = element_rect(fill = "grey80", colour = "grey20"),
# Optional - sets x and y axis lines
# axis.line = element_line(colour="gray50", size=0.3, lineend="square"),
# Optional - sets a border around the chart. Leaving it off provides a cleab 538 style chart
# panel.border = element_rect(linetype = "solid", fill = NA, color='gray60', size=1),
legend.key = element_rect(fill="transparent", colour=NA)
)
}
Get the data.
The spreadsheet files are available on the ABS website. I am using TABLE 53. Estimated Resident Population By Single Year Of Age, Queensland. I tend to download the spreadsheet locally and read in the data with readxl because the spreadsheets are periodically updated.
When I have downloaded the file to my working directory I read it in with the code below. There are a number of worksheets in the spreadsheet and we are only interested in two of them. One is the main data worksheet (Data1), and the other is the Index worksheet that I will use for column names.
The data worksheet is relatively straightforward. The range argument allows a certain range of cells that have data to be loaded. This means eyeballing the spreadsheet in advance to identify what needs to be extracted. Obtaining the column names reads the Index worksheet, specifies a range of cells to read, defines the data type in the cells (as text), and gives the tibble column a name.
qld <- read_excel(path = '3101053.xls',
sheet = 'Data1',
range = 'A10:GU57')
columns <- read_excel(path = '3101053.xls',
sheet = 1,
range='A13:A214',
col_types = 'text',
col_names = 'Header')
Tidy up the data (part 1).
The data set you get is 47 rows and 203 columns of population data. The columns show the unit age population levels from males aged 1 years old and under to 100 years old and over. Following that is the same for females. So 101 columns for males and 101 for females, plus 1 column for years results in 203 columns.
The Index worksheet that was used to extract strings to use as column headers is long and contains redundant information and punctuation. These can all be removed using the stringr package as follows. Just remove what is not necessary and trim the whitespace as follows.
columns$Header <- str_remove(columns$Header, pattern = "Estimated Resident Population ; ")
columns$Header <- str_remove(columns$Header, pattern = "\ ; ")
columns$Header <- str_remove(columns$Header, pattern = "and over")
columns$Header <- str_remove(columns$Header, pattern = ";")
columns$Header <- str_trim(columns$Header, side = 'right')
colnames(qld) <- c('Year', columns$Header)
Look at the data.
You can use the kable and kableExtra packages to make a pretty table to look at bits of the data. Obviously in a regular R script you would just output to console. However, I just discovered kableExtra yesterday so I am keen to include the kable_styling function to make a Bootstrap style table.
kable(qld[1:5, 1:5], "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Year | Male 0 | Male 1 | Male 2 | Female 0 | Female 1 | Female 2 | Female 3 |
|---|---|---|---|---|---|---|---|
| 1971-06-01 | 19091 | 18909 | 18480 | 18378 | 17975 | 17982 | 17511 |
| 1972-06-01 | 20106 | 19460 | 19239 | 19380 | 18669 | 18349 | 18172 |
| 1973-06-01 | 19586 | 20438 | 19999 | 18819 | 19785 | 19122 | 18844 |
| 1974-06-01 | 19281 | 19988 | 20898 | 18266 | 19234 | 20287 | 19654 |
| 1975-06-01 | 18738 | 19513 | 20275 | 17834 | 18583 | 19582 | 20681 |
Tidy up the data (part 2).
As you can see the data in wide and untidy. Each column has tow components, sex and age, that should be separated out. Also, the data would benefit from being converted to long format. The Year column is not formatted as a year, it is still in a date format. We can fix this with a short number of steps.
- Gather the data to long format using the columns with
Sex_Agethe key,Populationas the value, and ignore theYearcolumn. - Covert the
Sex_Agecolumns to 2 columns,SexandAge, using separate. - Convert the date to a year.
- Convert the data in the
Agecolumnn to integers.
qld_long <- qld %>% gather(Sex_Age, Population, -'Year')
qld_pop <- qld_long %>% separate(Sex_Age, into = c('Sex', 'Age'), sep = " ")
qld_pop$Year <- year(qld_pop$Year)
qld_pop$Age <- as.integer(qld_pop$Age)
Plot the data
Now we can begin to plot the data using ggplot. There are many ways to slice the data and plot it, but to keep thing simple let’s just do a line plot of population count of males and females by year for each year of age, for a particular year.
We will use filter to pick out a particular year to plot and make a line graph using geom_line, and my theme that I included earier.
qld_pop %>% filter(Year == 2017) %>%
ggplot(aes(x=Age, y=Population, color=Sex)) +
geom_line(size=1.1, alpha=0.8) +
labs(x = "Age",
y = 'Number of people',
title = "Population of Queensland in 2017 by sex and age.") +
scale_color_manual(values=wes_palette("Royal1")) +
theme_marquess()
Click the image below to enlarge.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.