Using R/anomalize to identify delays in games of Australian Rules football

[This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers].

In which we generate a dataset of game durations, do a little exploratory data analysis and then try to identify unusual instances and their causes.

In my Twitter-whinge-post, I mentioned that one of its last redeeming features is the AFL statistics community. I always feel the need to apologise for the sports-related posts to long-time followers of this blog, but here’s the thing: sport can provide rich datasets, generate numerous interesting questions and can be a fun way to practice your data analysis skills.

For example, Emlyn writes:

and I immediately think (1) is there a dataset with the duration of quarters in AFL games, (2) if not, can we make one, (3) why limit ourselves to recent seasons and (4) this sounds like a job for one of my favourite R packages, anomalize.

The dataset

AFL games are played over four quarters which typically take around 30 minutes each. So far as I know, the duration of quarters is not available via the fitzRoy package, nor did I find a dataset online.

So here is some quick, dirty and brittle code that uses rvest to scrape the data from AFL Tables, and here is the resulting dataset in CSV format, with the quarter durations in seconds. We have quarter duration data from 2001 onwards. The first few rows look like this:

  url                                                              match_id match_date    Q1    Q2    Q3    Q4
  <chr>                                                               <dbl> <date>     <dbl> <dbl> <dbl> <dbl>
1 https://afltables.com/afl/stats/games/2001/051220010330.html  51220010330 2001-03-30  2094  1623  1842  2004
2 https://afltables.com/afl/stats/games/2001/041020010331.html  41020010331 2001-03-31  1722  1882  1816  1704
3 https://afltables.com/afl/stats/games/2001/071520010331.html  71520010331 2001-03-31  1772  1872  1978  1738
4 https://afltables.com/afl/stats/games/2001/030820010331.html  30820010331 2001-03-31  1783  2035  1926  1916
5 https://afltables.com/afl/stats/games/2001/131920010331.html 131920010331 2001-03-31  1611  1751  2100  1814
6 https://afltables.com/afl/stats/games/2001/011620010401.html  11620010401 2001-04-01  1779  1686  1871  1755
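As a rough sketch of how the CSV might be loaded: the code below assumes the file has the seven columns shown above and uses readr. The temporary file is only a stand-in for the downloaded CSV (its two rows are copied from the table), so point csv_path at wherever you saved the real file.

```r
library(readr)
library(dplyr)

# Stand-in for the downloaded CSV: a temporary file holding the header and
# the first two rows shown above. Replace csv_path with your own file path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "url,match_id,match_date,Q1,Q2,Q3,Q4",
  "https://afltables.com/afl/stats/games/2001/051220010330.html,51220010330,2001-03-30,2094,1623,1842,2004",
  "https://afltables.com/afl/stats/games/2001/041020010331.html,41020010331,2001-03-31,1722,1882,1816,1704"
), csv_path)

# Column types are an assumption based on the printed table
afl_quarter_lengths <- read_csv(
  csv_path,
  col_types = cols(
    url = col_character(),
    match_id = col_double(),
    match_date = col_date(format = "%Y-%m-%d"),
    Q1 = col_double(), Q2 = col_double(),
    Q3 = col_double(), Q4 = col_double()
  )
)

# Durations are stored in seconds; add minute versions for easier reading
afl_quarter_minutes <- afl_quarter_lengths %>%
  mutate(across(Q1:Q4, ~ .x / 60, .names = "{.col}_min"))
```

Keeping match_id as a double rather than an integer matters here, since some IDs are too large for R's 32-bit integers.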

Exploratory Data Analysis (EDA)

There are many good options for EDA using R. I like the simplicity of skimr.

skimr::skim(afl_quarter_lengths, starts_with("Q"))

── Data Summary ────────────────────────
                           Values             
Name                       afl_quarter_lengths
Number of rows             4926               
Number of columns          7                  
_______________________                       
Column type frequency:                        
  numeric                  4                  
________________________                      
Group variables            None               

── Variable type: numeric ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean   sd   p0  p25  p50  p75 p100 hist 
1 Q1                   17         0.997 1810. 134. 1213 1725 1807 1895 2571 ▁▅▇▁▁
2 Q2                   17         0.997 1817. 138. 1216 1730 1816 1906 2429 ▁▂▇▂▁
3 Q3                   17         0.997 1830. 142. 1223 1743 1828 1919 2395 ▁▂▇▃▁
4 Q4                   17         0.997 1820. 152. 1201 1728 1814 1911 3956 ▃▇▁▁▁

Not too many missing values, good.
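If you want to see which games are missing durations, something along these lines should work. This is a sketch on a toy data frame: the second row and its match_id are made up for illustration, and with the real data you would start from afl_quarter_lengths instead.

```r
library(dplyr)
library(tibble)

# Toy stand-in for afl_quarter_lengths; the second row is invented
toy_quarters <- tribble(
  ~match_id,   ~Q1,  ~Q2,  ~Q3,  ~Q4,
  51220010330, 2094, 1623, 1842, 2004,
  1,           NA,   NA,   NA,   NA
)

# Keep only games where at least one quarter has no recorded duration
missing_games <- toy_quarters %>%
  filter(if_any(Q1:Q4, is.na))
```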

We can look at the distribution of quarter lengths by season.


library(tidyverse)
library(lubridate)
theme_set(theme_bw())

afl_quarter_lengths %>%
  pivot_longer(4:7) %>%
  mutate(season = year(match_date)) %>%
  ggplot(aes(season, value)) +
  geom_boxplot(aes(group = season)) +
  facet_wrap(~name) + 
  labs(x = "Season",
       y = "Duration (seconds)",
       title = "Duration of quarters in AFL games 2001-2025",
       subtitle = "data from afltables.com")

Of note: (1) season 2020 was affected by COVID and the length of quarters was reduced; (2) the last 5 seasons have seen a steady increase in quarter lengths, although the median has risen by only around 100 seconds; (3) we can already see potential outliers.

The increase in game length seems to be of concern to AFL House, if almost no one else, so we can look at that. We can argue about the best way to process and visualise, but let’s just calculate the total game time, mean and median values by season, omit season 2020 and plot as points.

afl_quarter_lengths %>% 
  mutate(season = year(match_date),
         total = Q1 + Q2 + Q3 + Q4) %>% 
  group_by(season) %>%
  summarise(`mean duration` = mean(total, na.rm = TRUE),
            `median duration` = median(total, na.rm = TRUE)) %>%
  pivot_longer(-season) %>%
  filter(season != 2020) %>%
  ggplot(aes(season, value)) +
  geom_point() +
  facet_wrap(~name) +
  geom_smooth() +
  labs(x = "Season",
       y = "Duration (seconds)",
       title = "Mean and median AFL game duration 2001-2025",
       subtitle = "data from afltables.com")

Arguably an upward trend, especially in recent seasons. It’s worth noting that the gap between shortest and longest mean or median values is roughly 10 minutes, over a 2 1/2 hour game.

Using anomalize

Now let’s use anomalize to look for unusually short or long quarters. We can follow along with the quick start guide, making a few alterations to avoid error messages – see comments in the code below. Whether these are good practice for time series analysis would require more research but for now, they look to be generating interesting results.

library(anomalize)
library(tibbletime)

afl_quarter_lengths_anomalized <- afl_quarter_lengths %>%
  pivot_longer(4:7) %>%
  # remove missing values and the COVID-affected season 2020
  filter(!is.na(value),
         year(match_date) != 2020) %>%
  # group data by quarter
  group_by(name) %>%
  # convert to tibbletime
  as_tbl_time(index = match_date) %>%
  # required to prevent Error in `reconstruct()`
  as_period("1 day") %>%
  time_decompose(value, merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose()

afl_quarter_lengths_anomalized %>%
  plot_anomalies()

The code above throws some warnings about “Index not ordered”, which we can ignore as the match dates are in ascending order. Well, we have anomalies – six in total. The plot_anomalies() method is pretty good, but we can exert more control over the plot using ggplot2 methods – either by modifying the output from plot_anomalies() or working with the output data frame.

afl_quarter_lengths_anomalized %>%
  ggplot(aes(match_date, value)) +
  geom_point(aes(color = anomaly)) +
  facet_wrap(~name) +
  scale_color_manual(values = c("grey80", "red3")) + 
  theme(legend.position = "top") +
  labs(x = "Match date",
       y = "Duration (seconds)",
       title = "Anomalous quarter lengths in AFL games 2001-2025",
       subtitle = "data from afltables.com")
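For some intuition about what the default anomalize() step does with the remainder, here is a base R sketch of an IQR rule. As I understand the package documentation, the default "iqr" method flags values lying more than 0.15 / alpha times the interquartile range beyond the quartiles, which is 3 times the IQR at the default alpha = 0.05. The function name flag_iqr_outliers() and the example values are made up, and this is not the package's own code.

```r
# Flag values far outside the interquartile range.
# Hypothetical helper, written to illustrate the idea only.
flag_iqr_outliers <- function(x, alpha = 0.05) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr_factor <- 0.15 / alpha            # 3 when alpha = 0.05
  iqr <- q[2] - q[1]
  lower <- q[1] - iqr_factor * iqr
  upper <- q[2] + iqr_factor * iqr
  x < lower | x > upper
}

# Invented remainders in seconds, with one very long quarter
remainder <- c(-110, -60, -10, 10, 40, 90, 2146)
flagged <- flag_iqr_outliers(remainder)
```

Only the last value falls outside the limits here. Lowering alpha widens the limits and so flags fewer points.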

We can also experiment with different algorithms for time series decomposition and anomaly detection. Combining the "twitter" decomposition method with "gesd" anomaly detection, for example, finds 8 anomalies.

afl_quarter_lengths_anomalized_tw <- afl_quarter_lengths %>%
  pivot_longer(4:7) %>%
  filter(!is.na(value),
         year(match_date) != 2020) %>%
  group_by(name) %>%
  as_tbl_time(index = match_date) %>%
  as_period("1 day") %>%
  time_decompose(value, merge = TRUE, method = "twitter") %>%
  anomalize(remainder, method = "gesd") %>%
  time_recompose()

afl_quarter_lengths_anomalized_tw %>%
  plot_anomalies()

Reasons for game delays

Staying with the method that found 8 anomalies, we can filter the output data, visit the match URL at AFL Tables and then do some online research to see whether match reports indicate reasons for the delay. I’ve manually tacked that information on to the table below, with a link to the source.

afl_quarter_lengths_anomalized_tw %>%
  filter(anomaly == "Yes") %>%
  select(url, match_id, name, match_date)


  url                                                              match_id name  match_date  delay_reason
1 https://afltables.com/afl/stats/games/2018/141620180628.html 141620180628 Q1    2018-06-28  injury
2 https://afltables.com/afl/stats/games/2023/011220230701.html  11220230701 Q1    2023-07-01  injury
3 https://afltables.com/afl/stats/games/2024/192120240914.html 192120240914 Q1    2024-09-14  injury
4 https://afltables.com/afl/stats/games/2014/011920140810.html  11920140810 Q2    2014-08-10  injury
5 https://afltables.com/afl/stats/games/2001/030920010901.html  30920010901 Q3    2001-09-01  injury
6 https://afltables.com/afl/stats/games/2021/111820210809.html 111820210809 Q4    2021-08-09  weather
7 https://afltables.com/afl/stats/games/2022/091620220325.html  91620220325 Q4    2022-03-25  goal celebration
8 https://afltables.com/afl/stats/games/2024/041120240823.html  41120240823 Q4    2024-08-23  weather
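The delay_reason column was added by hand, but the same result could be produced in code by keeping the manual notes in a small lookup table and joining on match_id. A sketch using three rows from the table above (the object names are made up):

```r
library(dplyr)
library(tibble)

# Three of the anomalous quarters from the table above
anomalies <- tibble(
  match_id   = c(141620180628, 91620220325, 41120240823),
  name       = c("Q1", "Q4", "Q4"),
  match_date = as.Date(c("2018-06-28", "2022-03-25", "2024-08-23"))
)

# Manually researched reasons, keyed by match_id
delay_reasons <- tibble(
  match_id     = c(141620180628, 91620220325, 41120240823),
  delay_reason = c("injury", "goal celebration", "weather")
)

# Attach the reason to each anomaly
annotated <- anomalies %>%
  left_join(delay_reasons, by = "match_id")
```

A left join keeps every anomaly, so any game not yet researched shows up with NA in delay_reason.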

For example, the first game in the table, played in 2018 between the Sydney Swans and the Richmond Tigers, is the one where “Conca’s 100th game ends with sickening injury”: “Conca’s left leg was trapped underneath Lance Franklin midway through the first quarter, with the Tigers midfielder in agony as he was treated by the club’s medical team […] The game was stopped for several minutes”. Checks out.

What’s that, “goal celebration”? My personal favourite game delay, I was there!

Missing anomalies

What about delays which occurred during quarter time breaks, you ask? Good question and the answer is, we don’t have that data, sorry.

How about Round 2 2023, Brisbane Lions versus Melbourne Demons at The Gabba, when “The 2032 Olympic venue suddenly fell into darkness with 12 minutes left in the fourth quarter with the Lions 40 points ahead of the Demons”, delaying the game for 35 minutes? For some reason AFL Tables has the fourth quarter of this game listed as 31 minutes 54 seconds, proving once again that analysis is only as good as the data.
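When checking individual games like this, it helps to convert the recorded seconds into minutes and seconds. A small sketch, where format_mmss() is a made-up helper name:

```r
# Convert a duration in whole seconds to a "minutes:seconds" string
format_mmss <- function(seconds) {
  sprintf("%d:%02d", seconds %/% 60, seconds %% 60)
}

listed_q4  <- format_mmss(1914)  # the 31 min 54 s listed for this game
longest_q4 <- format_mmss(3956)  # the Q4 maximum from the skimr summary
```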

That’s it.

Hope you enjoyed it.
