Do Ruckman Mature later

[This article was first published on Analysis of AFL, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently I saw this tweet at James Coventry the author of footballistics. It got me thinking that this would be an interesting example now that we have players heights, position and weight to answer this kind of question.

This by no means is meant to be a rigorous statistical analysis full of models. But more of a fun one with a few graphs and hopefully for the reader picking up a few R skills along the way.

To answer this question, my first thought that I hope you will explore with me is to just take the player data we have from footywire, join on to the player data the players height, age and position.

Then we will see if we can see visually if it looks like ruckman have a peak at a later age.

The metric I will use will be supercoach scores and the aim would be to explore Supercoach score by age across the various positions.

As always the first step should be to create a dataset which involves two datasets the first one being already in fitzRoy and the second one being a fresh scrape of data

## Warning: package 'fitzRoy' was built under R version 3.5.1
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'tidyr' was built under R version 3.5.1
## Warning: package 'purrr' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1
## Warning: package 'stringr' was built under R version 3.5.1
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: xml2
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##     pluck
## The following object is masked from 'package:readr':
##     guess_encoding

  html_nodes("br+ a , .lnormtop a:nth-child(1)")%>%


cbind.fill <- function(...){
  nm <- list(...) 
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow)), lapply(nm, function (x) 
    rbind(x, matrix(, n-nrow(x), ncol(x))))) 

player_info <- function(x){
  # page <- read_html(x)
  player<-       page%>%
    html_nodes(".ldrow .hltitle")%>%
    html_text() %>% as.tibble()
  playing.for<-  page%>%
    html_nodes(".ldrow a b")%>%
    html_text() %>% as.tibble()
  number<-     page%>%
    html_nodes(".ldrow > b")%>%
    html_text() %>% as.tibble()
    html_nodes("form tr:nth-child(4) .ldrow")%>%
    str_replace_all("[\r\n]" , "")%>%
    str_extract(pattern =("(?<=Weight:).*(?=Position:)"))%>%as.tibble()
    html_nodes("form tr:nth-child(4) .ldrow")%>%
    str_replace_all("[\r\n]" , "")%>%
    str_extract(pattern =("(?<=Height:).*(?=Weight:)"))%>%as.tibble()
  draft_position <- page%>%
    html_nodes("tr:nth-child(5) .ldrow")%>%
    str_replace_all("[\r\n]" , "")%>%
    str_extract(pattern =("(?<=Drafted: ).*(?=by)"))%>%as.tibble()
  club_drafted <- page%>%
    html_nodes("tr:nth-child(5) .ldrow")%>%
    html_text()%>%str_replace_all("[\r\n]" , "")%>%
    str_remove(".*by") %>% as.tibble()
  position <-     page%>%
    html_nodes("form tr:nth-child(4) .ldrow")%>%
    str_replace_all("[\r\n]" , "")%>%
    str_remove(".*Position: ")%>%
    str_squish() %>% as.tibble()
    html_nodes("form tr:nth-child(3) .ldrow")%>%
    str_replace_all("[\r\n]" , "")%>%
    str_remove(".*Born: ")%>%
    str_squish() %>% as.tibble()
  player_information <- cbind.fill(player, playing.for, number, weight, height,draft_position, club_drafted, position,born)

  player_information <- as.tibble(player_information)
  # print(x)
  # return(x)
footywire <- purrr::map_df(url_players, player_info)

names(footywire) <- c("player", "club", "number","weight","height",  "draft_position", "club_drafted", "position","born")

Looking at the footywire table we have just created we have a few issues that we need to fix up before we do some graphs.

The first is that we can see by looking at the first row that players who have just been drafted have not had all their information filled out in the table so these players will need to be filtered out.

The easiest way to do this it would seem is to just filter out any row that contains an NA. We can do that using complete.cases


The next thing we have to do, is because we want to join our datasets together we need the team names to be consistent as we would like to join on both player and club.

So to do that what we need is a list from the fitzRoy data.

##  [1] "Richmond"         "Carlton"          "Geelong"         
##  [4] "Essendon"         "Melbourne"        "Hawthorn"        
##  [7] "Brisbane"         "West Coast"       "Sydney"          
## [10] "St Kilda"         "Port Adelaide"    "North Melbourne" 
## [13] "Western Bulldogs" "Collingwood"      "Fremantle"       
## [16] "Adelaide"         "Gold Coast"       "GWS"
##  [1] "Geelong Cats"              "St Kilda Saints"          
##  [3] "Brisbane Lions"            "Collingwood Magpies"      
##  [5] "West Coast Eagles"         "Gold Coast Suns"          
##  [7] "North Melbourne Kangaroos" "Sydney Swans"             
##  [9] "Essendon Bombers"          "Port Adelaide Power"      
## [11] "Richmond Tigers"           "Adelaide Crows"           
## [13] "Melbourne Demons"          "Fremantle Dockers"        
## [15] "Hawthorn Hawks"            "GWS Giants"               
## [17] "Western Bulldogs"          "Carlton Blues"

So basically lets replace one list with the other.

In this example I am going to replace the footywire dataset with the extra information team names with the team names used in the fitzRoy dataset.

footywire$club[footywire$club=="Richmond Tigers"] <- "Richmond"
footywire$club[footywire$club=="St Kilda Saints"] <- "St Kilda" 
footywire$club[footywire$club=="Gold Coast Suns" ] <- "Gold Coast"
footywire$club[footywire$club=="Port Adelaide Power"] <- "Port Adelaide"
footywire$club[footywire$club=="Melbourne Demons"] <- "Melbourne"
footywire$club[footywire$club=="Hawthorn Hawks" ] <- "Hawthorn" 
footywire$club[footywire$club=="Western Bulldogs"] <-"Western Bulldogs" 
footywire$club[footywire$club=="Geelong Cats"] <-  "Geelong"
footywire$club[footywire$club=="Brisbane Lions" ] <- "Brisbane"
footywire$club[footywire$club=="West Coast Eagles" ] <- "West Coast"
footywire$club[footywire$club=="North Melbourne Kangaroos"] <- "North Melbourne"
footywire$club[footywire$club=="Essendon Bombers"] <- "Essendon"
footywire$club[footywire$club=="Adelaide Crows" ] <- "Adelaide"
footywire$club[footywire$club=="Fremantle Dockers"  ] <- "Fremantle"
footywire$club[footywire$club=="GWS Giants" ] <- "GWS" 
footywire$club[footywire$club=="Carlton Blues"  ] <- "Carlton"
footywire$club[footywire$club=="Collingwood Magpies"] <-"Collingwood" 
footywire$club[footywire$club=="Sydney Swans"] <-  "Sydney"  

Now we can left_join

left_join(df, footywire, by=c("Player"="player","Team"="club"))%>%
  View("New issues with dataset")

So what are the new issues, well the page we have scraped from only listed currently active players, so if the player isn’t currently playing but had data for 2010 say then there isn’t any position information to join too so it appears as a NA.

However, it should be noted that you can construct past players positions from footywire.

But lets not get too hung up on that. Lets’ acknowledge it and if anyone wants to they can just go ahead and hopefully edit the scraper above to get past players.

So again lets filter out the NA rows.

data_joined<-left_join(df, footywire, by=c("Player"="player","Team"="club"))


Now to work out a rough age, lets take out the last 4 characters because thats the year the player was born in then we can calculate their rough age.

data_joined$year_born<-str_sub(data_joined$born, start= -4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1982    1989    1992    1992    1995    1999

Now we can start doing some simple comparisons of say the top ruckman by year age vs the top midfielders by years age.

data_joined%>%select(position, Player, Team, Season, SC, age)%>%
  filter(position %in% c("Midfield", "Ruck"))%>%
  group_by(Player, Team, Season, age, position)%>%
  group_by(position, Season) %>%
  top_n(n = 18, wt = meansc)%>%
  ggplot(aes(x=age,group=position))+geom_bar()+facet_wrap(~position)+ggtitle("Age of top 18 Ruckman and Midfielders within Season") 

So looking at the graph and with all the caveats such as incomplete datasets, only looking at game data from 2010 onwards but only including players that are currently active etc (there are more but I am sharing all code!)

It would seem as though there are more topline ruckman going around past 25 then there are midfielders.

What is going on here? Is this a data issue should we join on the data from players who have retired as well? Is there a confounder at play?

I am not terribly sure, but I hope that by sharing all this script that someone can do some digging.

To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)