Regression to the Mean

[This article was first published on Analysis of AFL, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We also see this in sports, Hollinger refers to it as the fluke rule. Bill James calls it the plexiglas principle, you might call it the Sports Illustrated Jinx. They all are different names from different people who have observed the same thing mean reversion or regression to the mean.

Regression to the mean as a concept was first introduced in 1886 by Galton who observed that on average taller parents have tall kids but not as tall as them. In otherwords their heights ‘shrunk’ towards the overall average at the time.

Regression to the mean is a powerful concept in statistics and how can we use it?

The following will be a simple goal-kicking example but if you wanted to, you could use something like supercoach scores (SC) to beat your friends or the bookie. Maybe you want to use it more at a team level? Will total scores return via some regression to the mean?

Step 1 create the dataset

Thankfully we can use fitzRoy for the data we need pretty easily. Here we will take goals (G), behinds (B) and create a variable using mutate called shots which just represent total shots on goals (we don’t have access to kicks that were out on the full or dropped short) so this becomes the sum of goals and behinds.

We are looking at regression to the mean for goal kicking accuracy on a year to year bases on a player level. Hence our group_by(Player, Team, Season)

library(tidyverse)
## -- Attaching packages ---------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(fitzRoy)
data<-fitzRoy::player_stats

data<-data%>%
  group_by(Player, Team, Season)%>%
  summarise(TG=sum(G), TB=sum(B))




data%>%
  group_by(Player,Team, Season)%>%
  mutate(shots= TG + TB)%>%
  ggplot(aes(x=shots))+geom_histogram(binwidth = 5)

data<-data%>%
  group_by(Player,Team, Season)%>%
  mutate(shots= TG + TB)

Step Two Subsetting our data, finding ours means

Looking at our earlier plot we see that we have lots of players who have failed to take a single shot at goal. We don’t want this to effect what we are doing down the track so lets arbitrary cut off shots at say 20 or more.

subset = data$shots > 19


good = data$TG[subset]
had_a_go = data$shots[subset]
names(good) = names(had_a_go) = paste(data$Player[subset], data$Season[subset], data$Team[subset])

I know the above is tidyverse, but sometimes its nice to contrast it with base.

indl_player_shot_pert = good/had_a_go

league_wide_pert = sum(good)/sum(had_a_go)
print(league_wide_pert)
## [1] 0.6007096
sigma2L = league_wide_pert*(1-league_wide_pert)/had_a_go

sigma2T = 0

for (i in 1:100000) {
  weight = 1/(2*(sigma2T + sigma2L)^2)
  sigma2T = sum(weight*((indl_player_shot_pert - league_wide_pert)^2 - sigma2L))/sum(weight)
}
print(sigma2T)
## [1] 0.001483404
sqrt(sigma2T)
## [1] 0.03851498
skill = (league_wide_pert/sigma2T + indl_player_shot_pert/sigma2L)/(1/sigma2T + 1/sigma2L)


head(sort(skill, decreasing = TRUE))
##          Luke Breust 2014 Hawthorn Tory Dickson 2015 Western Bulldogs 
##                          0.6681194                          0.6577340 
##   Christopher Mayne 2012 Fremantle   Drew Petrie 2012 North Melbourne 
##                          0.6554409                          0.6526506 
##   Barry Hall 2011 Western Bulldogs          Cyril Rioli 2016 Hawthorn 
##                          0.6509844                          0.6501355
tail(sort(skill, decreasing = TRUE))
## Marcus Bontempelli 2017 Western Bulldogs 
##                                0.5559301 
##              Dustin Martin 2013 Richmond 
##                                0.5543815 
##             Daniel Jackson 2011 Richmond 
##                                0.5518519 
##                  Lewis Jetta 2010 Sydney 
##                                0.5489428 
##              Jack Billings 2017 St Kilda 
##                                0.5443335 
##      Lindsay Thomas 2011 North Melbourne 
##                                0.5401664
plot(indl_player_shot_pert, skill, xlim = c(.2, .5), col = 'orange')
abline(0, 1, col = 'blue')

plot(density(indl_player_shot_pert), ylim = c(0, 30), col = 'orange',
     main = 'Distribution of Goal kicking skill in AFL',
     xlab = 'Goal Kicking  percentage')
lines(density(skill), col = 'blue')
legend('topright', c('True skill', 'Observed'),
       col = c('blue', 'orange'), lty = 1)

How do we interpret these results?

You might first look at Luke Breust 2014 and it looks as though hes leading in accuracy you see the number below him 0.6701927 and are now thinking maybe that’s how many shots he converted. That is easy to check and we can do that using the script below.

library(tidyverse)
library(fitzRoy)
data<-fitzRoy::player_stats

data%>%filter(Player=="Luke Breust" & Season==2014)%>%
  summarise(TG=sum(G), TB=sum(B))%>%
  mutate(shotpert=TG/(TG+TB))
##   TG TB shotpert
## 1 57 12 0.826087

We get 0.826087 which is not 0.6701927 so what is going on here?

Well our 0.6701927 is what we would call our luck adjusted percentage for Luke Breuest in 2014.

But what does this mean and how do we check it?

Our skill formula is as follow in R

skill = (league_wide_pert/sigma2T + indl_player_shot_pert/sigma2L)/(1/sigma2T + 1/sigma2L)

But what does this mean in english?

When we think about regression to the mean really we want to know these three things. * How good is the league or what is the population average * How good was the individual * After regressesing the indivual back what is left is their underlying true skill.

Lets look at the skill formula above, we have our

  • league_wide_pert = 0.6009457 (League wide scoring shot percentage)
  • indl_player_shot_pert= 57/(57+12) (Luke Breust shot percentage for 2014)
  • sigma2T =0.001543792 (underlying true skill standard deviation of league)
  • sigma2L = 0.003475507 (underlying Luke Breust standard deviation for 2014) subbing all that into our skill formula above we get
((0.6009457/0.001543792) + (57/(57+12))/0.003475507 )/((1/0.001543792) + (1/0.003475507))

Which just happens to be our skill for Luke Breust show earlier.

With an estimated 0.6701927 or 67.02% true skill for goal conversion, Luke Bruest is above average in true league average skill (0.6009457) and well outside one standard deviation of the population. His observed goal conversation percentage is 0.8261, which our estimate using regression to the mean is 0.6701927 which means he is kicking above his true talent level. It can be thought of as hes getting really lucky. Looking at his most recent completed year, he had a observed percentage of 0.66 (33/33+17) in other words as he aged and should have been getting better, he regressed towards his true mean which is still above league average. This year he is again below his best but still above league average.

 fitzRoy::player_stats%>%
filter(Player=="Luke Breust" &Season==2018)%>%
  summarise(TG=sum(G),TB=sum(B)) %>%
  mutate(accuracy=TG/(TG+TB))
##   TG TB  accuracy
## 1 45 21 0.6818182

This shouldn’t surprise us, to put in perspective just how good Luke Breusts raw accuracy is lets create a table ranking players by accuracy over the years. Where not surprisingly he comes out as top of the list, while the mean accuracy of that group is 0.6369235. So it should be no surprise that Lukes accuracy was not sustainable

fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date())%>%
  select(First.name, Surname, Playing.for, Season, ID, Goals,Behinds)%>%
  group_by(First.name, Surname, Playing.for, Season, ID)%>%
  summarise(TG=sum(Goals),TB=sum(Behinds))%>%
  mutate(accuracy=(TG/(TG+TB)))%>%
  filter(TG>49 & Season>1964)%>%
  arrange(desc(accuracy)) 
## Returning data from 1897-01-01 to 2018-09-04
## Downloading data
## 
## Finished downloading data. Processing XMLs
## Warning: Unknown columns: `Substitute`
## Finished getting afltables data
## Adding missing grouping variables: `Round`, `Home.team`, `Away.team`
## # A tibble: 504 x 8
## # Groups:   First.name, Surname, Playing.for, Season [504]
##    First.name Surname    Playing.for     Season    ID    TG    TB accuracy
##    <chr>      <chr>      <chr>            <dbl> <int> <dbl> <dbl>    <dbl>
##  1 Luke       Breust     Hawthorn          2014 11954    57    12    0.826
##  2 Stephen    Milne      St Kilda          2002   978    50    11    0.820
##  3 Tony       Lockett    St Kilda          1993   990    53    12    0.815
##  4 Tory       Dickson    Western Bulldo~   2015 12043    50    12    0.806
##  5 Matthew    Lloyd      Essendon          2008   336    62    16    0.795
##  6 Barry      Richardson Richmond          1971  2414    50    13    0.794
##  7 Tony       Lockett    St Kilda          1985   990    79    22    0.782
##  8 George     Young      St Kilda          1978  2299    70    21    0.769
##  9 Peter      Hudson     Hawthorn          1970  1822   146    44    0.768
## 10 Tony       Lockett    St Kilda          1989   990    78    24    0.765
## # ... with 494 more rows

To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)