Plotting Background Data
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
One of the best things about the online R community is there are lots and lots of great people to follow online especially. One person I follow dataandme consistently links great how to do things on R tutorials.
Recently I saw this post where she linked a post by drsimonj running through how to plot background data in R. My first thought as when reading most data viz things is can I do this with footy?
Step 1 – Figuring out what I want to visualise
Luckily I received this message recently and thought that this would be a pretty good application of visualising background data.
I’m interested in seeing the way the opposition style affects a teams handball vs kick disposals. So for last x seasons, what was each teams unconditional average handball/kicks ratio. Then look at each game based on the opponent, and summarise the average incremental effect that each opposition team has on handballs/kicks. So 16 numbers where positive means they allow more handballs to kicks than league avg.
What I think its handy to do, is to draw using a bit of pen and paper what you think you want the graph to look like.
Step 2a – Get the data for each team by round
First thing we do is we get the datasets, and then we create the variable we want (hb2k
) which is our handball to kick ratio for a team.
library(fitzRoy) ## Warning: package 'fitzRoy' was built under R version 3.5.1 library(tidyverse) ## Warning: package 'tidyverse' was built under R version 3.5.1 ## -- Attaching packages ----------------------------------- tidyverse 1.2.1 -- ## v ggplot2 3.1.0 v purrr 0.2.5 ## v tibble 1.4.2 v dplyr 0.7.7 ## v tidyr 0.8.1 v stringr 1.3.1 ## v readr 1.1.1 v forcats 0.3.0 ## Warning: package 'ggplot2' was built under R version 3.5.1 ## Warning: package 'tidyr' was built under R version 3.5.1 ## Warning: package 'purrr' was built under R version 3.5.1 ## Warning: package 'dplyr' was built under R version 3.5.1 ## Warning: package 'stringr' was built under R version 3.5.1 ## -- Conflicts -------------------------------------- tidyverse_conflicts() -- ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() df<-fitzRoy::player_stats df1<-fitzRoy::get_footywire_stats(9514:9594) #gets 2018 data until end of round 9 ## Getting data from footywire.com ## Finished getting data df<-df%>% filter(Season != 2018) #filters out the 2018 data (incomeplete that was downloaded when installing fitzRoy for first time) df2<-rbind(df, df1) #stacks the datasets on top of each other df2%>% select(Season, Round, K,HB, Team, Opposition, Date) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) ## # A tibble: 3,408 x 8 ## # Groups: Season, Round, Team, Opposition [3,406] ## Season Round Team Opposition Date tk thb hb2k ## <dbl> <chr> <chr> <chr> <date> <int> <int> <dbl> ## 1 2010 Elimination F~ Carlton Sydney 2010-09-05 191 128 0.670 ## 2 2010 Elimination F~ Fremantle Hawthorn 2010-09-04 216 113 0.523 ## 3 2010 Elimination F~ Hawthorn Fremantle 2010-09-04 197 148 0.751 ## 4 2010 Elimination F~ Sydney Carlton 2010-09-05 173 177 1.02 ## 5 2010 Grand Final Collingw~ St Kilda 2010-09-25 198 128 0.646 ## 6 2010 Grand Final Collingw~ St Kilda 2010-10-02 250 129 0.516 ## 7 2010 Grand Final St Kilda Collingwo~ 2010-09-25 193 145 0.751 ## 8 2010 Grand Final St Kilda Collingwo~ 2010-10-02 178 148 0.831 ## 9 2010 Preliminary F~ Collingw~ Geelong 2010-09-17 228 101 0.443 ## 10 2010 Preliminary F~ Geelong Collingwo~ 2010-09-17 179 229 1.28 ## # ... with 3,398 more rows
- From
df2
weselect
the columns we want which areSeason
,Round
,K
(which is kicks),HB
which is handballs,Team
Opposition
andDate
- We then use
group_by
so we cansummarise
the total kickstk
and the total handballsthb
by that a team has done vs a given oppostion (Team, Opposition
) for a game (Season, Round, Date
) - As teams don’t play each other more than once per
Round
you would thinkDate
would be redundant right? But no in 2010 we had a replay of a grand final - lastly we create the variable we want using
mutate
hb2k
which is the total handballs divided by the total kicks for a given game. . # Step 2b – Get the teams yearly average for the straight lines
df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% group_by(Season, Opposition) %>% summarise(average_hb2k=mean(hb2k)) %>% filter(Season==2018) ## # A tibble: 18 x 3 ## # Groups: Season [1] ## Season Opposition average_hb2k ## <dbl> <chr> <dbl> ## 1 2018 Adelaide 0.763 ## 2 2018 Brisbane 0.810 ## 3 2018 Carlton 0.745 ## 4 2018 Collingwood 0.795 ## 5 2018 Essendon 0.788 ## 6 2018 Fremantle 0.729 ## 7 2018 Geelong 0.701 ## 8 2018 Gold Coast 0.721 ## 9 2018 GWS 0.760 ## 10 2018 Hawthorn 0.824 ## 11 2018 Melbourne 0.836 ## 12 2018 North Melbourne 0.741 ## 13 2018 Port Adelaide 0.718 ## 14 2018 Richmond 0.801 ## 15 2018 St Kilda 0.767 ## 16 2018 Sydney 0.657 ## 17 2018 West Coast 0.753 ## 18 2018 Western Bulldogs 0.717
Remember we wanted to be able to reference for a given team what is their average handball to kick ratio conceded. One way to do this is to add a reference line. To get the values for the reference line, we use group_by(Season, Opposition)
which allows us then to summarise by Season
for each Opposition
team their average handball to kick ratio. summarise(average_hb2k=mean(hb2k))
. Then as we want a graph for 2018 we use filter(Season==2018)
to achieve this. Which means we get the average values by Oppostion
for 2018 only.
Step 3 – What team do you want to see?
For tonights game lets look at Collingwood vs the Western Bulldogs. Lets have a look at what the teams handball to kick ratio has been when playing Collingwood and what it has been when playing Western Bulldogs
df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition=="Collingwood")%>% ggplot(aes(x=Date, y=hb2k)) +geom_point() + geom_text(aes(label=Team), size=2)
Our first check is that the teams that are playing the pies are in the right order. We can do this using geom_text(aes(label=Team))
. We can check the pies footywire page and see that they do play hawks round 1
Our next check is that the datapoints are correct, we can check this by just adding a data label.
df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition=="Collingwood")%>% ggplot(aes(x=Date, y=hb2k)) +geom_point() + geom_text(aes(label=Team), size=2) + geom_text(aes(label=hb2k), vjust=-1, size=1.5)
We check if Hawthorns handball to kick ratio vs the pies in round 1 was 0.7866.
Step 4 Get the Competitions yearly average handball to kick ratio
df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% group_by(Season)%>% summarise(meanhb2k=mean(hb2k)) ## # A tibble: 9 x 2 ## Season meanhb2k ## <dbl> <dbl> ## 1 2010 0.845 ## 2 2011 0.761 ## 3 2012 0.739 ## 4 2013 0.737 ## 5 2014 0.774 ## 6 2015 0.788 ## 7 2016 0.834 ## 8 2017 0.819 ## 9 2018 0.757
From here we can see that the average handball to kick ratio for 2018 so far is 0.757.
Step 5 – Putting it all together for one team
df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition=="Collingwood")%>% ggplot(aes(x=Date, y=hb2k)) +geom_point() + geom_text(aes(label=Team), size=2) + geom_hline(yintercept = 0.757) +ggtitle("Opponents handball 2 kick ratio vs Collingwood")
Step 6 – Compare the pair
From above, we cacn see all we need to do is change “Collingwood” to “Western Bulldogs”
df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition=="Western Bulldogs")%>% ggplot(aes(x=Date, y=hb2k)) +geom_point() + geom_text(aes(label=Team), size=2) + geom_hline(yintercept = 0.757) +ggtitle("Opponents handball 2 kick ratio vs Western Bulldogs")
df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition=="Collingwood")%>% ggplot(aes(x=Date, y=hb2k)) +geom_point() + geom_text(aes(label=Team), size=2) + geom_hline(yintercept = 0.757) +ggtitle("Opponents handball 2 kick ratio vs Collingwood")
Step 7 – But what about the background stuff?
First lets do Collingwood, I’m thinking lets see a similar plot, but lets change Date
to Round
as games are on different days of the week and lets make Collingwood a different colour to the rest of the competition.
df3<-df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition == "Collingwood") df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition != "Collingwood") %>% ggplot(aes(x=Round,y=hb2k))+ geom_point(colour="grey", alpha=0.5)+ geom_point(data=df3,colour="black")
We can check our black points are the points we want simply by labelling them.
df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition != "Collingwood") %>% ggplot(aes(x=Round, y=hb2k))+geom_point(data=df3)+ geom_point(colour="grey") +geom_text(data=df3, aes(label=round(hb2k,2),hjust=-0.2))
We use geom_text(data=df3,aes(label=round(hb2k,2),hjust=-.2))
to label our data, what this does is takes the Collingwood only subdataset data=df3
and we label it by the corresponding hb2k
variable we created earlier. We also round the numbers so it doesn’t look too messy to 2 decimals places using round(hb2k, 2)
and we move the label to slightly right of the actual data point using hjust=-0.2
So what are the steps involved in this?
After reading the post, you can think of it as two parts.
Part one – plot the data faded without the subdata you want to emphasize.
Part two – plot the subdata you want to emphasize and colour it differently to the rest to make it stand out.
So to do this, first lets get the subdata we want which is Collingwoods handball to kick ratio conceded to teams. We call this subdataset df3
df3<-df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition == "Collingwood")
Then we plot the full dataset without the Collingwood data, we then want to make this look faded we do this by colouring the points in grey geom_point(colour="grey"
and fading the points by using , alpha=0.5)
df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition != "Collingwood") %>% ggplot(aes(x=Round,y=hb2k))+ geom_point(colour="grey", alpha=0.5)
Lastly we then add in the Collingwood data and make it black geom_point(data=df3,colour="black")
so it stands out a bit more from our faded grey earlier for the rest of the competition.
df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018)%>% filter(Opposition != "Collingwood") %>% ggplot(aes(x=Round,y=hb2k))+ geom_point(colour="grey", alpha=0.5)+ geom_point(data=df3,colour="black")
That’s cool but how do you do the cool small multiple plots that had a bit more colour to it?
df4<-df2%>% select(Season, Round, K,HB, Team, Opposition, Date ) %>% group_by(Season, Round, Team, Opposition, Date)%>% summarise(tk=sum(K), thb=sum(HB)) %>% mutate(hb2k=thb/tk) %>% filter(Season==2018) d_bg <- df4[, -4] ggplot(df4, aes(x = thb, y = tk, colour=Opposition)) + geom_point(data=d_bg, colour = "grey", alpha = .5)+ geom_point() + guides(colour = FALSE)+ geom_text(aes(label=Team), size=1.5)+ facet_wrap(~ Opposition)+ guides(colour = FALSE) + theme_bw()
Ok so what is actually going on here?
This post here has a great explanation and is by far better than anything I could do myself so please read it!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.