FPL Point Projection in R – Part 2

[This article was first published on Sport Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hello, welcome to part 2 of the series building a model to predict fantasy premier league points. If you missed the first part which i did some initial EDA go check it out here

Building a Model in R to Predict FPL Points

As I identified before the first part of this they key parts of the model are predicting how many goals might be in the match and the type of player and therefore do they have a scoring impact on a match or not. So first things first I need to create a score forecasting model.


In order to do this I need to have a methodology for identifying how strong each team is in the match and then correlate that difference in team strength to an estimation of the scores for both teams in the match. Fundamentally I’m going to work out how the average team performs in the selected metric and then compare each teams performances to average and the difference between to two will be the variable to compare to the score.

Metric options

  • Goals scored – Teams goals scored is the truest sense of how good a team is. Can be effected by short term variability in performances. This is freely available data
  • xG – xG or expected goals is the expected scoring rate when you take into account various details about the shot in question
  • shots taken- better teams will probably take more shots

I have decided to use expected goals data for each team. As a prediction metric it is highly correlated to actual performance and there is a source for the data thanks to fivethirtyeight.com. They publish team level expected goals data for many leagues freely available on their GitHub. I can read the data into R using the below code:

matchdat <- read_csv("https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv")

Therefore I can use that to rate a team over the season and compare that to the average team. The data is easily read into R using the read_csv function.

## calaculating what the average performances 

homeav <- matchdat %>% filter(league_id == 2411) %>% #working out the average home xg 
                          summarise(avfor = mean(xg1, na.rm = T), avag = mean(xg2, na.rm = T)) 

homfor <- homeav[[1]] # extracting the number for home and away

homeag <- homeav[[2]]

# comparing each teams performance to average for xg for and against
homeratdat2 <- matchdat %>% filter(league_id == 2411) %>%
                        group_by(team1) %>%
                        summarise(avfor = mean(xg1, na.rm = T), avag = mean(xg2, na.rm = T)) %>% 
                        mutate(xgfh = avfor - homfor, xgah = avag - homeag) %>%
                        select(team1, xgfh, xgah)


The code above calculates what the average team performs like for expected goals for and against. This is then compared to the teams actual performance to give each team in each season a delta. The code above is the calculation for home teams but the same code is used for way teams.

## creating a data frame with matches the for and against scores and the xg deltas 
matches <- matchdat %>% filter(league_id == 2411) %>%
                              left_join(homeratdat2, by = "team1") %>%
                               left_join(awayratdat2, by = "team2") %>%
                                  mutate(deltafh =  xgfh + xgaa, deltagh = xgah+xgfa) %>%
                                    select(season, score1, score2, deltafh, deltagh)

## splitting it up for home and away

mat1 <- matches %>% select(season, score1, deltafh) %>%
                      mutate(loc = "home")
colnames(mat1)[2] <- "score"
colnames(mat1)[3] <- "delta"
mat2 <- matches %>% select(season, score2, deltagh) %>%
                      mutate(loc = "away")
colnames(mat2)[2] <- "score"
colnames(mat2)[3] <- "delta"

## putting it together and creating the score as a factor
matall <- mat1 %>% bind_rows(mat2) %>%
                  mutate(scorecat = as.factor(if_else(score > 5,"5", as.character(score)))) %>%
                    select(-score, -season) ### data for calculating the chance of goals scored 

Next, I joined each teams xG deltas to the actual scores for historical seasons. Therefore I have data which links each teams xG delta to the actual goals they scored and conceded in an historical match. I also split the data out so I could add the category if the team was home or away. Finally, as matches with more then 5 goals are rare I changed each time a team scored of conceded more then 5 to 5. Matches with more goals then that are so rare that the model will not be able to make good predictions on that.


##splitting the data into training and testing
score_split <- initial_split(matall, prop = 0.9, strata = scorecat) 
score_train <- training(score_split)
score_test <- testing(score_split)

##creating the classification random forest
rand1 <- rand_forest() %>% #type of model 
                set_engine("ranger") %>% # the engine used to fit the model. For randomm forest rf is a another one
                  set_mode("classification") %>% # the  mode for the model as random forest can do both classification and regression 
                  fit(scorecat ~., data = score_train) # the function to fit the model
### rand1 is the random forest for the score of the match

above I then used the functions from the tidy models packages to simply fit a random forest for the score prediction for a match. Based on if the team is home or away and the different in the expected goals ratings.

Above you can see the output from the initial model for the the Wolves vs Burnley game in 2019. For both teams there’s an estimate of how likely they are to score that amount of goals. Wolves are likely to score more then Burnley but there isn’t too much difference in the teams. Now for a player playing in this match I can have 10000 runs of this match and just over 30% of them Wolves will score 1 goal. This then becomes one of the arguments in the models to predict goals scored by a player and assists by the player. Defensive points can be directly calculated from the Oppositions estimated goals scored. Now I the basis an a rough prediction of match outcomes I can move onto the precise player predictions to make this relevant to FPL

To leave a comment for the author, please follow the link and comment on their blog: Sport Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)