How long since your team scored 100+ points? This blog’s first foray into the fitzRoy R package

March 21, 2019

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

When this blog moved from bioinformatics to data science I ran a Twitter poll to ask whether I should start afresh at a new site or continue here. “Continue here”, you said.

So let’s test the tolerance of the long-time audience and celebrate the start of the 2019 season as we venture into the world of – Australian football (AFL) statistics!

I’ve been hooked on the wonderful sport of AFL since attending my first game, the ANZAC Day match between the Sydney Swans and Melbourne in 2003, and have hardly missed a Swans home game since. However, I don’t think you need to be a sports fanatic – I certainly am not – to appreciate that sport is a rich source of data on which you can practice your R, statistics and data science skills. A large part of data science is figuring out what makes an interesting question, then querying the data to get the answer. Sport of course is full of trivia questions: the first, the last, the highest, the longest; and so provides many opportunities to devise questions and find answers. Sports fans also tend to hold strong opinions and make bold statements – not always backed up with evidence – which can be fun to engage with, armed with a little data.

As an example we’ll use this list of predictions for the 2019 season which tells us that:

Carlton will score 100 points
A gentle one off the bat. The Blues sub-ton streak stands at 55 games, making it one of the longest in league history.

Let’s look at some ways to visualise how long it’s been since a team scored 100+ points. For years the go-to site for AFL data has been the wonderful AFL Tables. The HTML and text files at this site are relatively easy to scrape into a dataframe using rvest. However, recent years have seen the development of another data source to which we bow down in awe and gratitude: the fitzRoy R package.

The results of every game since 1897 are stored in match_results and look like this:

# A tibble: 15,407 x 16
    Game Date       Round Home.Team Home.Goals Home.Behinds Home.Points Away.Team Away.Goals
 1     1 1897-05-08 R1    Fitzroy            6           13          49 Carlton            2
 2     2 1897-05-08 R1    Collingw~          5           11          41 St Kilda           2
 3     3 1897-05-08 R1    Geelong            3            6          24 Essendon           7
 4     4 1897-05-08 R1    Sydney             3            9          27 Melbourne          6
 5     5 1897-05-15 R2    Sydney             6            4          40 Carlton            5

To explain some basics of the game: a score in AFL can be a goal (between the two big posts) for 6 points, or a behind (the ball hits a big post, goes between big and small post, or is taken through the posts by a defender) for 1 point. So the total points = (6 x goals) + behinds.
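As a quick sanity check on the scoring rule, here is a small base R sketch using the goals, behinds and points of the five home teams printed above. The data frame `home` and the column `Computed` are names made up for this illustration; they are not part of fitzRoy.

```r
# Goals, behinds and points of the five home teams in the excerpt above
home <- data.frame(
  Team    = c("Fitzroy", "Collingwood", "Geelong", "Sydney", "Sydney"),
  Goals   = c(6, 5, 3, 3, 6),
  Behinds = c(13, 11, 6, 9, 4),
  Points  = c(49, 41, 24, 27, 40)
)

# A goal is worth 6 points and a behind is worth 1 point
home$Computed <- 6 * home$Goals + home$Behinds

# TRUE when the formula reproduces every recorded score
all(home$Computed == home$Points)
```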

Any match statistic can be viewed from the perspective of either the home or away team. For our example we don’t care whether teams were home or away – we just want their total score. So we can simplify the results like this:

# packages for this post
library(fitzRoy)
library(tidyverse)
library(lubridate)
library(ggrepel)

match_results %>% 
  select(Date, Team = Home.Team, Points = Home.Points) %>% 
  bind_rows(select(match_results, Date, Team = Away.Team, Points = Away.Points))

Result:

# A tibble: 30,814 x 3
   Date       Team        Points
 1 1897-05-08 Fitzroy         49
 2 1897-05-08 Collingwood     41
 3 1897-05-08 Geelong         24
 4 1897-05-08 Sydney          27
 5 1897-05-15 Sydney          40
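The row count doubles (15,407 games become 30,814 rows) because every game contributes one home row and one away row. A minimal sketch of the same reshaping on invented data, assuming only that dplyr is installed; `toy_results`, its teams and its scores are hypothetical and simply borrow the column names of `match_results`:

```r
library(dplyr)

# Invented toy data with the same column names as match_results
toy_results <- tibble(
  Date        = as.Date(c("2018-03-24", "2018-03-31")),
  Home.Team   = c("Team A", "Team C"),
  Home.Points = c(105, 88),
  Away.Team   = c("Team B", "Team A"),
  Away.Points = c(70, 112)
)

# One row per team per game: home rows first, then away rows
scores <- toy_results %>%
  select(Date, Team = Home.Team, Points = Home.Points) %>%
  bind_rows(select(toy_results, Date, Team = Away.Team, Points = Away.Points))

# Every game contributes exactly two rows
nrow(scores) == 2 * nrow(toy_results)
```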

So: how long since your team scored 100+ points? We can plot how long in days with a couple of simple filters:

match_results %>% 
  # one row per team per game: home scores stacked on away scores
  select(Date, Team = Home.Team, Points = Home.Points) %>% 
  bind_rows(select(match_results, Date, Team = Away.Team, Points = Away.Points)) %>% 
  # 100+ scores from the 2000 season onwards
  filter(year(Date) > 1999, 
         Points > 99) %>% 
  mutate(Days = as.numeric(Sys.Date() - Date)) %>% 
  # keep each team's most recent 100+ game
  group_by(Team) %>% 
  filter(Date == max(Date)) %>% 
  ggplot(aes(Date, Points)) + 
  geom_point() + 
  geom_text_repel(aes(label = paste(Team, Days)), 
                  size = 3, 
                  force = 2) + 
  scale_x_date(date_breaks = "3 months", 
               date_labels = "%b %Y") + 
  scale_y_continuous(breaks = seq(100, 180, 10)) + 
  labs(title = paste("Days since scoring 100+ points as of", format(Sys.Date(), "%b %d %Y")))

Days since AFL teams scored 100+ points

It has indeed been a long time for Carlton. Every other team scored 100+ points in at least one game during the 2018 season.
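To see the filtering logic without the plot, here is a sketch on invented data. It assumes dplyr is installed; the teams, dates and scores are hypothetical, and a fixed `as_of` date stands in for `Sys.Date()` so that the numbers do not change from day to day.

```r
library(dplyr)

# Invented scores in long format (Date, Team, Points)
scores <- tibble(
  Date   = as.Date(c("2018-04-01", "2018-06-10", "2018-08-19",
                     "2018-04-07", "2018-07-15")),
  Team   = c("Team A", "Team A", "Team A", "Team B", "Team B"),
  Points = c(101, 120, 85, 99, 130)
)

# Fixed reference date (an assumption made for reproducibility)
as_of <- as.Date("2019-03-21")

latest_100 <- scores %>%
  filter(Points > 99) %>%            # keep only 100+ scores
  group_by(Team) %>%
  filter(Date == max(Date)) %>%      # most recent 100+ game per team
  ungroup() %>%
  mutate(Days = as.numeric(as_of - Date))

latest_100
```

Team A's 85-point game on 2018-08-19 is more recent than its last 100+ score, but it is dropped by the first filter, so the day count runs from 2018-06-10.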

How unusual is this time between 100+ scores, for Carlton or any other club? Let’s filter for the maximum days between 100+ scores:

match_results %>% 
  select(Date, Team = Home.Team, Points = Home.Points) %>% 
  bind_rows(select(match_results, Date, Team = Away.Team, Points = Away.Points)) %>% 
  filter(Points > 99) %>% 
  group_by(Team) %>% 
  arrange(Date) %>% 
  mutate(Days = as.numeric(Date - lag(Date))) %>% 
  filter(Days == max(Days, na.rm = TRUE)) %>% 
  arrange(desc(Days))

Result:

# A tibble: 19 x 4
# Groups:   Team [19]
   Date       Team            Points  Days
 1 1914-06-08 Fitzroy            108  2844
 2 1910-07-02 Geelong            100  2506
 3 1921-09-17 Melbourne          115  2324
 4 1956-07-07 St Kilda           129  1862
 5 1919-06-28 Essendon           113  1799
 6 1968-06-29 Footscray          104  1512
 7 1964-05-02 North Melbourne    121  1365
 8 1918-07-20 Sydney             101  1134
 9 1933-06-10 Hawthorn           105  1134
10 1901-08-24 Collingwood        143  1092
11 1924-08-23 Richmond           121  1064
12 1918-06-15 Carlton            111  1029
13 2012-08-11 Gold Coast         109   462
14 2011-05-07 Brisbane Lions     116   364
15 2018-03-31 Fremantle          106   328
16 2014-06-14 GWS                125   315
17 2001-05-12 Adelaide           103   302
18 2000-04-02 Port Adelaide      149   301
19 2011-04-02 West Coast         116   259
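The key step is `Date - lag(Date)` within each team, which measures the gap between consecutive 100+ scores. A sketch on invented dates, assuming dplyr is installed (the tibble `hundreds` and its teams are hypothetical):

```r
library(dplyr)

# Invented dates of 100+ scores for two hypothetical teams
hundreds <- tibble(
  Date = as.Date(c("2000-04-01", "2000-04-15", "2001-04-15",
                   "2000-05-01", "2000-05-31")),
  Team = c("Team A", "Team A", "Team A", "Team B", "Team B")
)

gaps <- hundreds %>%
  arrange(Team, Date) %>%
  group_by(Team) %>%
  mutate(Days = as.numeric(Date - lag(Date))) %>%  # NA for a team's first 100+ score
  filter(!is.na(Days)) %>%
  filter(Days == max(Days)) %>%                    # longest gap per team
  ungroup()

gaps
```

Note that this approach only measures completed gaps between two 100+ scores. A drought that is still running, or one that precedes a team's first 100+ score, has no end or start date in the filtered data and so does not appear.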

Carlton’s current drought is certainly the longest of the modern era. It’s also likely that Carlton will break their previous record gap of 1029 days between 100+ scores, which ended in June 1918. Fitzroy hold the record, with 2844 days between 100+ scores. It might seem unlikely that a team could go almost 8 years without scoring 100+, but we can filter and plot the data to show that it is true:

match_results %>% 
  select(Date, Team = Home.Team, Points = Home.Points) %>% 
  bind_rows(select(match_results, Date, Team = Away.Team, Points = Away.Points)) %>% 
  filter(Team == "Fitzroy", 
         between(Date, as.Date("1900-01-01"), as.Date("1914-12-31"))) %>% 
  ggplot(aes(Date, Points)) + 
  geom_point(alpha = 0.4) + 
  geom_hline(yintercept = 100, color  = "red") + 
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") + 
  labs(title = "Fitzroy points scored in VFL matches 1900 - 1914")

Fitzroy scores 1900-1914

“Days since” is maybe not the best measure. The game is not played year-round and players who score goals don’t play every game.

A better measure, as in the linked article, could be “games since scoring 100+ points”. There are undoubtedly more elegant solutions to this question than mine, but here it is:

  • Group the data by team and arrange by ascending date of game
  • Create a new variable is100 with value 1 (Points >= 100) or 0 (Points < 100)
  • Create a second variable is100cs, the cumulative sum of is100
  • Filter for rows where is100cs = its maximum value (the most recent value)
  • That number of rows is the games since (and including) the most recent 100+ score

match_results %>% 
  select(Date, Team = Home.Team, Points = Home.Points) %>% 
  bind_rows(select(match_results, Date, Team = Away.Team, Points = Away.Points)) %>% 
  group_by(Team) %>% 
  arrange(Date) %>% 
  mutate(is100 = ifelse(Points > 99, 1, 0), 
         is100cs = cumsum(is100)) %>% 
  filter(is100cs == max(is100cs)) %>% 
  summarise(n = n()) %>%
  arrange(desc(n))

Result:

# A tibble: 20 x 2
   Team                n
 1 University        126
 2 Carlton            56
 3 Gold Coast         21
 4 Port Adelaide      11
 5 Sydney              8
 6 St Kilda            7
 7 Collingwood         6
 8 Hawthorn            6
 9 GWS                 5
10 Richmond            5
11 Brisbane Lions      4
12 Footscray           4
13 Fitzroy             3
14 Fremantle           3
15 Geelong             2
16 Melbourne           2
17 West Coast          2
18 Adelaide            1
19 Essendon            1
20 North Melbourne     1
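Here is the cumulative-sum trick on a small invented example, assuming dplyr is installed. The teams and scores are hypothetical, and the extra `drought` column is an addition for illustration that counts only the sub-100 games in the final run.

```r
library(dplyr)

# Invented scores: Team A has two 100+ games, Team B has none
scores <- tibble(
  Team   = c(rep("Team A", 5), rep("Team B", 3)),
  Date   = as.Date("2018-04-01") + c(0, 7, 14, 21, 28, 0, 7, 14),
  Points = c(110, 95, 102, 80, 77, 60, 70, 90)
)

games_since <- scores %>%
  arrange(Team, Date) %>%
  group_by(Team) %>%
  mutate(is100 = ifelse(Points > 99, 1, 0),
         is100cs = cumsum(is100)) %>%       # increases by 1 at every 100+ game
  filter(is100cs == max(is100cs)) %>%       # rows from the latest 100+ game onwards
  summarise(n = n(),                        # includes the 100+ game, if there is one
            drought = sum(is100 == 0))      # games in that run that fell short of 100

games_since
```

For Team A, `n` is 3 but `drought` is 2, because the run starts with the 100+ game itself. For Team B, which never reaches 100, the two columns agree. This is the difference between Carlton's 56 in the table and the 55-game streak quoted earlier.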

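Two teams in this list no longer play in the AFL: Fitzroy, which merged with Brisbane in 1996, and University. The latter played 7 seasons from 1908 to 1914 and, in fact, never scored 100+ points, so its count of 126 covers every game it played. Carlton’s 56 includes their most recent 100+ game, which gives the 55-game drought quoted earlier. Once again this leaves Carlton at the top of the current 100+ drought club.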

One last question: is Carlton’s current 55 games without scoring 100+ their longest ever? How about the longest ever of any team?

To find out, we can group by team and our is100cs variable and, again, count the rows and filter for the maximum count.

match_results %>% 
  select(Date, Team = Home.Team, Points = Home.Points) %>% 
  bind_rows(select(match_results, Date, Team = Away.Team, Points = Away.Points)) %>% 
  group_by(Team) %>% 
  arrange(Date) %>% 
  mutate(is100 = ifelse(Points > 99, 1, 0), 
         is100cs = cumsum(is100)) %>% 
  group_by(Team, is100cs) %>% 
  summarise(n = n()) %>% 
  filter(n == max(n)) %>%
  arrange(desc(n))

Result:

              Team is100cs   n
1         St Kilda       0 172
2          Fitzroy       5 140
3          Carlton       0 133
4       University       0 126
5           Sydney       0 122
6          Geelong       6 116
7        Melbourne       5 116
8         Essendon       6  85
9        Footscray     120  79
10        Hawthorn       3  60
11 North Melbourne      84  60
12     Collingwood       1  54
13        Richmond      15  48
14      Gold Coast       2  35
15  Brisbane Lions     233  21
16       Fremantle     147  17
17             GWS       0  17
18   Port Adelaide      16  17
19        Adelaide      39  13
20      West Coast     225  12
21      West Coast     303  12

This shows us that St Kilda did not score 100 or more in their first game, nor in the 171 games that followed. Fitzroy’s aforementioned longest drought began after their fifth 100+ score, while Carlton did not score 100+ in their first 133 games, so the current 55-game streak is well short of the club’s longest.

At the other end of the scale, West Coast have gone at most only 11 games without scoring 100+, following the 225th and 303rd occasions on which they did so.
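The grouping on `is100cs` works because every game between two 100+ scores shares the same cumulative count, so each value of `is100cs` labels one run of games. A sketch on invented scores for a single hypothetical team, assuming dplyr 1.0 or later (for the `.groups` argument); the `drought` column is an illustrative addition, not part of the original code:

```r
library(dplyr)

# Invented scores for one hypothetical team, one game per week
scores <- tibble(
  Team   = rep("Team A", 8),
  Date   = as.Date("2018-04-01") + 7 * (0:7),
  Points = c(90, 85, 101, 70, 60, 50, 120, 99)
)

longest <- scores %>%
  arrange(Team, Date) %>%
  group_by(Team) %>%
  mutate(is100 = ifelse(Points > 99, 1, 0),
         is100cs = cumsum(is100)) %>%
  group_by(Team, is100cs) %>%                 # one group per run of games
  summarise(n = n(),
            drought = sum(is100 == 0),        # run length without the 100+ game
            .groups = "drop_last") %>%        # stay grouped by Team
  filter(n == max(n))

longest
```

Here the longest run follows the first 100+ score (`is100cs` of 1) and has `n` of 4, which is one 100+ game plus a 3-game drought. Runs with `is100cs` of 0 contain no 100+ game, so filtering on `drought` rather than `n` would arguably compare the runs more fairly.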

We can get some sense of how often each team scored 100+ in the following chart:

match_results %>% 
  select(Date, Team = Home.Team, Points = Home.Points) %>% 
  bind_rows(select(match_results, Date, Team = Away.Team, Points = Away.Points)) %>% 
  group_by(Team) %>% 
  arrange(Date) %>% 
  mutate(Games = row_number(),
         is100 = ifelse(Points > 99, 1, 0), 
         is100cs = cumsum(is100)) %>% 
  ggplot(aes(Games, is100cs)) + 
    geom_line() + 
    facet_wrap(~Team) + 
    scale_x_continuous(breaks = seq(0, 3000, 1000)) + 
    labs(y = "Cumulative games scoring 100+", 
         title = "Games scoring 100+ progression by team")

Progression of 100+ scores by team

Summary
Tidy data in a nice package + tidyverse tools = much easier to slice, dice, query and aggregate in order to answer those burning AFL trivia questions.

Sports data science – give it a go! And if this has left you curious about AFL, I leave you with my own subjective assessment of its finest day in the last 10 years.
