Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

One of the best things about the online R community is there are lots and lots of great people to follow online especially. One person I follow dataandme consistently links great how to do things on R tutorials.

Recently I saw this post where she linked a post by drsimonj running through how to plot background data in R. My first thought as when reading most data viz things is can I do this with footy?

# Step 1 – Figuring out what I want to visualise

Luckily I received this message recently and thought that this would be a pretty good application of visualising background data.

I’m interested in seeing the way the opposition style affects a teams handball vs kick disposals. So for last x seasons, what was each teams unconditional average handball/kicks ratio. Then look at each game based on the opponent, and summarise the average incremental effect that each opposition team has on handballs/kicks. So 16 numbers where positive means they allow more handballs to kicks than league avg.

What I think its handy to do, is to draw using a bit of pen and paper what you think you want the graph to look like.

# Step 2a – Get the data for each team by round

First thing we do is we get the datasets, and then we create the variable we want (`hb2k`) which is our handball to kick ratio for a team.

`library(fitzRoy)`

`## Warning: package 'fitzRoy' was built under R version 3.5.1`

`library(tidyverse)`

`## Warning: package 'tidyverse' was built under R version 3.5.1`

`## -- Attaching packages ----------------------------------- tidyverse 1.2.1 --`

```## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0```

`## Warning: package 'ggplot2' was built under R version 3.5.1`

`## Warning: package 'tidyr' was built under R version 3.5.1`

`## Warning: package 'purrr' was built under R version 3.5.1`

`## Warning: package 'dplyr' was built under R version 3.5.1`

`## Warning: package 'stringr' was built under R version 3.5.1`

```## -- Conflicts -------------------------------------- tidyverse_conflicts() --

```df<-fitzRoy::player_stats

df1<-fitzRoy::get_footywire_stats(9514:9594) #gets 2018 data until end of round 9```

`## Getting data from footywire.com`

`## Finished getting data`

```df<-df%>%
filter(Season != 2018) #filters out the 2018 data (incomeplete that was downloaded when installing fitzRoy for first time)
df2<-rbind(df, df1) #stacks the datasets on top of each other

df2%>%
select(Season, Round, K,HB, Team, Opposition, Date) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk)```

```## # A tibble: 3,408 x 8
## # Groups:   Season, Round, Team, Opposition [3,406]
##    Season Round          Team      Opposition Date          tk   thb  hb2k
##     <dbl> <chr>          <chr>     <chr>      <date>     <int> <int> <dbl>
##  1   2010 Elimination F~ Carlton   Sydney     2010-09-05   191   128 0.670
##  2   2010 Elimination F~ Fremantle Hawthorn   2010-09-04   216   113 0.523
##  3   2010 Elimination F~ Hawthorn  Fremantle  2010-09-04   197   148 0.751
##  4   2010 Elimination F~ Sydney    Carlton    2010-09-05   173   177 1.02
##  5   2010 Grand Final    Collingw~ St Kilda   2010-09-25   198   128 0.646
##  6   2010 Grand Final    Collingw~ St Kilda   2010-10-02   250   129 0.516
##  7   2010 Grand Final    St Kilda  Collingwo~ 2010-09-25   193   145 0.751
##  8   2010 Grand Final    St Kilda  Collingwo~ 2010-10-02   178   148 0.831
##  9   2010 Preliminary F~ Collingw~ Geelong    2010-09-17   228   101 0.443
## 10   2010 Preliminary F~ Geelong   Collingwo~ 2010-09-17   179   229 1.28
## # ... with 3,398 more rows```

• From `df2` we `select` the columns we want which are `Season`, `Round`, `K` (which is kicks), `HB` which is handballs, `Team` `Opposition` and `Date`
• We then use `group_by` so we can `summarise` the total kicks `tk` and the total handballs `thb` by that a team has done vs a given oppostion (`Team, Opposition`) for a game (`Season, Round, Date`)
• As teams don’t play each other more than once per `Round` you would think `Date` would be redundant right? But no in 2010 we had a replay of a grand final
• lastly we create the variable we want using `mutate` `hb2k` which is the total handballs divided by the total kicks for a given game. . # Step 2b – Get the teams yearly average for the straight lines

```  df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
group_by(Season, Opposition) %>%
summarise(average_hb2k=mean(hb2k)) %>%
filter(Season==2018)```

```## # A tibble: 18 x 3
## # Groups:   Season 
##    Season Opposition       average_hb2k
##     <dbl> <chr>                   <dbl>
##  2   2018 Brisbane                0.810
##  3   2018 Carlton                 0.745
##  4   2018 Collingwood             0.795
##  5   2018 Essendon                0.788
##  6   2018 Fremantle               0.729
##  7   2018 Geelong                 0.701
##  8   2018 Gold Coast              0.721
##  9   2018 GWS                     0.760
## 10   2018 Hawthorn                0.824
## 11   2018 Melbourne               0.836
## 12   2018 North Melbourne         0.741
## 13   2018 Port Adelaide           0.718
## 14   2018 Richmond                0.801
## 15   2018 St Kilda                0.767
## 16   2018 Sydney                  0.657
## 17   2018 West Coast              0.753
## 18   2018 Western Bulldogs        0.717```

Remember we wanted to be able to reference for a given team what is their average handball to kick ratio conceded. One way to do this is to add a reference line. To get the values for the reference line, we use `group_by(Season, Opposition)` which allows us then to summarise by `Season` for each `Opposition` team their average handball to kick ratio. `summarise(average_hb2k=mean(hb2k))`. Then as we want a graph for 2018 we use `filter(Season==2018)` to achieve this. Which means we get the average values by `Oppostion` for 2018 only.

### Step 3 – What team do you want to see?

For tonights game lets look at Collingwood vs the Western Bulldogs. Lets have a look at what the teams handball to kick ratio has been when playing Collingwood and what it has been when playing Western Bulldogs

```df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition=="Collingwood")%>%
ggplot(aes(x=Date, y=hb2k)) +geom_point() +
geom_text(aes(label=Team), size=2)``` Our first check is that the teams that are playing the pies are in the right order. We can do this using `geom_text(aes(label=Team))`. We can check the pies footywire page and see that they do play hawks round 1

Our next check is that the datapoints are correct, we can check this by just adding a data label.

```df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition=="Collingwood")%>%
ggplot(aes(x=Date, y=hb2k)) +geom_point() +
geom_text(aes(label=Team), size=2) +
geom_text(aes(label=hb2k), vjust=-1, size=1.5)``` We check if Hawthorns handball to kick ratio vs the pies in round 1 was 0.7866.

# Step 4 Get the Competitions yearly average handball to kick ratio

```df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
group_by(Season)%>%
summarise(meanhb2k=mean(hb2k))```

```## # A tibble: 9 x 2
##   Season meanhb2k
##    <dbl>    <dbl>
## 1   2010    0.845
## 2   2011    0.761
## 3   2012    0.739
## 4   2013    0.737
## 5   2014    0.774
## 6   2015    0.788
## 7   2016    0.834
## 8   2017    0.819
## 9   2018    0.757```

From here we can see that the average handball to kick ratio for 2018 so far is 0.757.

# Step 5 – Putting it all together for one team

```df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition=="Collingwood")%>%
ggplot(aes(x=Date, y=hb2k)) +geom_point() +
geom_text(aes(label=Team), size=2) +
geom_hline(yintercept = 0.757) +ggtitle("Opponents handball 2 kick ratio vs Collingwood")``` # Step 6 – Compare the pair

From above, we cacn see all we need to do is change “Collingwood” to “Western Bulldogs”

```df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition=="Western Bulldogs")%>%
ggplot(aes(x=Date, y=hb2k)) +geom_point() +
geom_text(aes(label=Team), size=2) +
geom_hline(yintercept = 0.757) +ggtitle("Opponents handball 2 kick ratio vs Western Bulldogs")``` ```df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition=="Collingwood")%>%
ggplot(aes(x=Date, y=hb2k)) +geom_point() +
geom_text(aes(label=Team), size=2) +
geom_hline(yintercept = 0.757) +ggtitle("Opponents handball 2 kick ratio vs Collingwood")``` # Step 7 – But what about the background stuff?

First lets do Collingwood, I’m thinking lets see a similar plot, but lets change `Date` to `Round` as games are on different days of the week and lets make Collingwood a different colour to the rest of the competition.

```df3<-df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition == "Collingwood")

df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition != "Collingwood") %>%
ggplot(aes(x=Round,y=hb2k))+
geom_point(colour="grey", alpha=0.5)+
geom_point(data=df3,colour="black")``` We can check our black points are the points we want simply by labelling them.

```df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition != "Collingwood") %>%
ggplot(aes(x=Round, y=hb2k))+geom_point(data=df3)+
geom_point(colour="grey") +geom_text(data=df3, aes(label=round(hb2k,2),hjust=-0.2))``` We use `geom_text(data=df3,aes(label=round(hb2k,2),hjust=-.2))` to label our data, what this does is takes the Collingwood only subdataset `data=df3` and we label it by the corresponding `hb2k` variable we created earlier. We also round the numbers so it doesn’t look too messy to 2 decimals places using `round(hb2k, 2)` and we move the label to slightly right of the actual data point using `hjust=-0.2`

# So what are the steps involved in this?

After reading the post, you can think of it as two parts.

• Part one – plot the data faded without the subdata you want to emphasize.

• Part two – plot the subdata you want to emphasize and colour it differently to the rest to make it stand out.

So to do this, first lets get the subdata we want which is Collingwoods handball to kick ratio conceded to teams. We call this subdataset `df3`

```df3<-df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition == "Collingwood")```

Then we plot the full dataset without the Collingwood data, we then want to make this look faded we do this by colouring the points in grey `geom_point(colour="grey"` and fading the points by using `, alpha=0.5)`

```df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition != "Collingwood") %>%
ggplot(aes(x=Round,y=hb2k))+
geom_point(colour="grey", alpha=0.5)```

Lastly we then add in the Collingwood data and make it black `geom_point(data=df3,colour="black")` so it stands out a bit more from our faded grey earlier for the rest of the competition.

```df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)%>%
filter(Opposition != "Collingwood") %>%
ggplot(aes(x=Round,y=hb2k))+
geom_point(colour="grey", alpha=0.5)+
geom_point(data=df3,colour="black")```

# That’s cool but how do you do the cool small multiple plots that had a bit more colour to it?

```  df4<-df2%>%
select(Season, Round, K,HB, Team, Opposition, Date ) %>%
group_by(Season, Round, Team, Opposition, Date)%>%
summarise(tk=sum(K), thb=sum(HB)) %>%
mutate(hb2k=thb/tk) %>%
filter(Season==2018)
d_bg <- df4[, -4]

ggplot(df4, aes(x = thb, y = tk, colour=Opposition)) +
geom_point(data=d_bg, colour = "grey", alpha = .5)+
geom_point() + guides(colour = FALSE)+ geom_text(aes(label=Team), size=1.5)+

facet_wrap(~ Opposition)+
guides(colour = FALSE) +
theme_bw()``` Ok so what is actually going on here?

This post here has a great explanation and is by far better than anything I could do myself so please read it!