Programming with R – Processing Football League Data Part II

[This article was first published on Software for Exploratory Data Analysis and Statistical Modelling, and kindly contributed to R-bloggers.]

Following on from the previous post about creating a football result processing function for data from the football-data.co.uk website, we will now add code to the function that generates a league table from the results to date.

To create the league table we need to count various things such as the number of games played, the number of wins, draws and losses, goals scored and so on. This information is available in the results object that the function, as it stands, loads from a CSV file.

To facilitate these calculations we create a data frame with a row for each team in the division and then calculate the statistics required – this was a reason for ordering the factors in the HomeTeam and AwayTeam columns of the results table. The data frame is created with the code below:

tmpTable = data.frame(Team = teams,
    Games = 0, Win = 0, Draw = 0, Loss = 0,
    HomeGames = 0, HomeWin = 0, HomeDraw = 0, HomeLoss = 0,
    AwayGames = 0, AwayWin = 0, AwayDraw = 0, AwayLoss = 0,
    Points = 0,
    HomeFor = 0, HomeAgainst = 0,
    AwayFor = 0, AwayAgainst = 0,
    For = 0, Against = 0, GoalDifference = 0)

Several of these columns may be redundant in the final league table but are used for intermediate calculations; HomeWin and AwayWin, for example, are combined to find the total number of victories for a team.

The number of games played by each team, home and away, is counted by applying the table command to the HomeTeam and AwayTeam columns respectively.
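As a minimal sketch of how the empty table is built, the example below uses a hypothetical three-team division (the team names are assumptions chosen for illustration, not taken from the data). It shows that a single value such as 0 is recycled by data.frame so that every team gets its own row of zeros.

```r
# Hypothetical three-team division, used only for illustration
teams = c("Arsenal", "Chelsea", "Everton")

# A scalar such as 0 is recycled to one value per team
demoTable = data.frame(Team = teams, Games = 0, Win = 0, Draw = 0, Loss = 0)

print(demoTable)
nrow(demoTable)   # one row per team
```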

tmpTable$HomeGames = as.numeric(table(tmpResults$HomeTeam))
tmpTable$AwayGames = as.numeric(table(tmpResults$AwayTeam))

The labels created by the table command are discarded using the as.numeric function to retain only the number of games. The table command is also used to count the number of wins, draws and losses at home and away for each team. The commands are shown here:

tmpTable$HomeWin =
    as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "H"]))
tmpTable$HomeDraw =
    as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "D"]))
tmpTable$HomeLoss =
    as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "A"]))
 
tmpTable$AwayWin =
    as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "A"]))
tmpTable$AwayDraw =
    as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "D"]))
tmpTable$AwayLoss =
    as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "H"]))

Note that we subset on the values in the FTR column, which holds the full-time result (H for a home win, D for a draw and A for an away win), and then count. The subsetting is reversed when looking at the away fixtures because a victory for the team is now an away win rather than a home win.
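The counting approach can be checked on a small made-up set of results. The sketch below assumes three hypothetical teams and three matches (none of this comes from the real data) and shows why the counts line up with the rows of the league table: table reports every factor level, including teams whose count is zero.

```r
# Hypothetical results, used only for illustration; Everton has no home game yet
teams = c("Arsenal", "Chelsea", "Everton")
demoResults = data.frame(
    HomeTeam = factor(c("Arsenal", "Chelsea", "Arsenal"), levels = teams),
    AwayTeam = factor(c("Chelsea", "Everton", "Everton"), levels = teams),
    FTR = c("H", "D", "A"))

# table() returns a count for every factor level, in the order of the levels,
# so teams with no matching games appear with a count of zero
homeGames = as.numeric(table(demoResults$HomeTeam))
homeWin = as.numeric(table(demoResults$HomeTeam[demoResults$FTR == "H"]))

# An away win is recorded as "A", so the subset is reversed for away fixtures
awayWin = as.numeric(table(demoResults$AwayTeam[demoResults$FTR == "A"]))

print(data.frame(Team = teams, homeGames, homeWin, awayWin))
```

If the two team columns were plain character vectors rather than factors with a shared set of levels, teams with no games would be dropped from the counts and the vectors would no longer match the rows of the table.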

This information is then combined to get total games played, won etc.

tmpTable$Games = tmpTable$HomeGames + tmpTable$AwayGames
tmpTable$Win = tmpTable$HomeWin + tmpTable$AwayWin
tmpTable$Draw = tmpTable$HomeDraw + tmpTable$AwayDraw
tmpTable$Loss = tmpTable$HomeLoss + tmpTable$AwayLoss

The total number of points is calculated by multiplying the number of wins, draws and losses by the number of points awarded for each match outcome.

tmpTable$Points = winPoints * tmpTable$Win +
    drawPoints * tmpTable$Draw + lossPoints * tmpTable$Loss
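To illustrate why the points values are kept as arguments, here is a small sketch with made-up win, draw and loss counts. The helper calcPoints is hypothetical and not part of the original function; it simply wraps the same formula so that it can be evaluated under two different scoring schemes.

```r
# Hypothetical win/draw/loss counts for three teams, used only for illustration
demoTable = data.frame(Team = c("Arsenal", "Chelsea", "Everton"),
    Win = c(2, 1, 0), Draw = c(1, 1, 2), Loss = c(0, 1, 1))

# Hypothetical helper (not in the original function) wrapping the same formula
calcPoints = function(tab, winPoints = 3, drawPoints = 1, lossPoints = 0)
{
    winPoints * tab$Win + drawPoints * tab$Draw + lossPoints * tab$Loss
}

pointsThree = calcPoints(demoTable)                # three points for a win
pointsTwo = calcPoints(demoTable, winPoints = 2)   # two points for a win

print(data.frame(Team = demoTable$Team, pointsThree, pointsTwo))
```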

The next set of calculations counts the number of goals scored, goals conceded and the goal difference. The tapply function is used for these calculations.

tmpTable$HomeFor =
    as.numeric(tapply(tmpResults$FTHG, tmpResults$HomeTeam, sum, na.rm = TRUE))
tmpTable$HomeAgainst =
    as.numeric(tapply(tmpResults$FTAG, tmpResults$HomeTeam, sum, na.rm = TRUE))
 
tmpTable$AwayFor =
    as.numeric(tapply(tmpResults$FTAG, tmpResults$AwayTeam, sum, na.rm = TRUE))
tmpTable$AwayAgainst =
    as.numeric(tapply(tmpResults$FTHG, tmpResults$AwayTeam, sum, na.rm = TRUE))

The tapply function applies sum to the goals scored and the goals conceded by each team in the division, separately for home and away fixtures. These are then combined to create overall totals across home and away fixtures:

tmpTable$For =
    ifelse(is.na(tmpTable$HomeFor), 0, tmpTable$HomeFor) +
    ifelse(is.na(tmpTable$AwayFor), 0, tmpTable$AwayFor)
tmpTable$Against =
    ifelse(is.na(tmpTable$HomeAgainst), 0, tmpTable$HomeAgainst) +
    ifelse(is.na(tmpTable$AwayAgainst), 0, tmpTable$AwayAgainst)

The ifelse statement is used to handle situations where a team hasn’t played a home and/or away fixture yet. The goal difference is easy to calculate:

tmpTable$GoalDifference = tmpTable$For - tmpTable$Against
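The need for the missing-value handling can be seen on a small made-up example. The sketch below assumes three hypothetical teams and two matches, so that one team has no home game and another has no away game; tapply returns NA for those empty groups, and adding the vectors directly would spread the NA into the totals.

```r
# Hypothetical results, used only for illustration:
# Everton has not played at home and Arsenal has not played away
teams = c("Arsenal", "Chelsea", "Everton")
demoResults = data.frame(
    HomeTeam = factor(c("Arsenal", "Chelsea"), levels = teams),
    AwayTeam = factor(c("Chelsea", "Everton"), levels = teams),
    FTHG = c(2, 1), FTAG = c(0, 1))

# tapply() gives NA for a factor level with no matching rows
homeFor = as.numeric(tapply(demoResults$FTHG, demoResults$HomeTeam, sum, na.rm = TRUE))
awayFor = as.numeric(tapply(demoResults$FTAG, demoResults$AwayTeam, sum, na.rm = TRUE))

# Adding directly propagates the NA values
naiveFor = homeFor + awayFor

# Replacing NA with zero first gives the intended totals
goalsFor = ifelse(is.na(homeFor), 0, homeFor) + ifelse(is.na(awayFor), 0, awayFor)

print(data.frame(Team = teams, homeFor, awayFor, naiveFor, goalsFor))
```

Note that na.rm = TRUE only removes missing goal values within a group; it does not change the result for a group that has no rows at all.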

Now that all of the statistics have been calculated we sort the table based on the number of points, goal difference and finally alphabetically. There might be different ways that we can order the teams but this is what we will use for the time being:

tmpTable =
  tmpTable[order(- tmpTable$Points, - tmpTable$GoalDifference, tmpTable$Team),]

The ordering might look odd, but we want to rank from highest to lowest on points and goal difference (hence the minus signs) and then in ascending alphabetical order of team name.
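A short sketch of the sort on a made-up table shows how ties are broken. The four teams and their figures are assumptions for illustration only: three teams are level on points, two of them are also level on goal difference, and those two are separated alphabetically.

```r
# Hypothetical standings, used only for illustration
demoTable = data.frame(
    Team = c("Everton", "Chelsea", "Arsenal", "Burnley"),
    Points = c(4, 7, 4, 4),
    GoalDifference = c(1, 3, 1, -2))

# Negating a numeric column sorts it in descending order;
# the team name is left as-is so that it sorts in ascending order
demoTable = demoTable[order(- demoTable$Points, - demoTable$GoalDifference, demoTable$Team),]

ranking = as.character(demoTable$Team)
print(demoTable)
```

The minus sign only works for numeric columns, which is why it is applied to the points and goal difference but not to the team names.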

The whole function is now:

football.process.v2 = function(datafile, country, divname, season, teams, winPoints = 3, drawPoints = 1, lossPoints = 0)
{
    ## Validate Function Arguments

    if (missing(datafile))
    {
        stop("Results csv file not specified.")
    }

    if (missing(country))
    {
        warning("Country of league not specified.")
        country = ""
    }

    if (missing(divname))
    {
        warning("Name of league division not specified.")
        divname = ""
    }

    ## season has no default value, so it is handled like country and divname;
    ## without this check the summary list at the end could not be built
    if (missing(season))
    {
        warning("Season not specified.")
        season = ""
    }

    ## Import Results

    tmpResults = read.csv(datafile)[,c("Date","HomeTeam","AwayTeam","FTR","FTHG","FTAG")]

    if (missing(teams))
    {
        warning("Team names not specified - extracted from results data.")
        teams = sort(unique(c(as.character(tmpResults$HomeTeam), as.character(tmpResults$AwayTeam))))
    }

    ## Use the same factor levels for both columns so counts align with teams
    tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
    tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)

    ## Create Empty League Table

    tmpTable = data.frame(Team = teams,
        Games = 0, Win = 0, Draw = 0, Loss = 0,
        HomeGames = 0, HomeWin = 0, HomeDraw = 0, HomeLoss = 0,
        AwayGames = 0, AwayWin = 0, AwayDraw = 0, AwayLoss = 0,
        Points = 0,
        HomeFor = 0, HomeAgainst = 0,
        AwayFor = 0, AwayAgainst = 0,
        For = 0, Against = 0, GoalDifference = 0)

    ## Count Number of Games Played

    tmpTable$HomeGames = as.numeric(table(tmpResults$HomeTeam))
    tmpTable$AwayGames = as.numeric(table(tmpResults$AwayTeam))

    ## Count Number of Wins/Draws/Losses

    tmpTable$HomeWin = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "H"]))
    tmpTable$HomeDraw = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "D"]))
    tmpTable$HomeLoss = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "A"]))

    tmpTable$AwayWin = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "A"]))
    tmpTable$AwayDraw = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "D"]))
    tmpTable$AwayLoss = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "H"]))

    tmpTable$Games = tmpTable$HomeGames + tmpTable$AwayGames
    tmpTable$Win = tmpTable$HomeWin + tmpTable$AwayWin
    tmpTable$Draw = tmpTable$HomeDraw + tmpTable$AwayDraw
    tmpTable$Loss = tmpTable$HomeLoss + tmpTable$AwayLoss
    tmpTable$Points = winPoints * tmpTable$Win + drawPoints * tmpTable$Draw + lossPoints * tmpTable$Loss

    ## Count Goals Scored and Conceded

    tmpTable$HomeFor = as.numeric(tapply(tmpResults$FTHG, tmpResults$HomeTeam, sum, na.rm = TRUE))
    tmpTable$HomeAgainst = as.numeric(tapply(tmpResults$FTAG, tmpResults$HomeTeam, sum, na.rm = TRUE))

    tmpTable$AwayFor = as.numeric(tapply(tmpResults$FTAG, tmpResults$AwayTeam, sum, na.rm = TRUE))
    tmpTable$AwayAgainst = as.numeric(tapply(tmpResults$FTHG, tmpResults$AwayTeam, sum, na.rm = TRUE))

    ## Treat NA (no home or away fixture played yet) as zero goals
    tmpTable$For = ifelse(is.na(tmpTable$HomeFor), 0, tmpTable$HomeFor) +
        ifelse(is.na(tmpTable$AwayFor), 0, tmpTable$AwayFor)
    tmpTable$Against = ifelse(is.na(tmpTable$HomeAgainst), 0, tmpTable$HomeAgainst) +
        ifelse(is.na(tmpTable$AwayAgainst), 0, tmpTable$AwayAgainst)

    tmpTable$GoalDifference = tmpTable$For - tmpTable$Against

    ## Sort Table
    ## By Points
    ## By Goal Difference
    ## By Team Name (Alphabetical)

    tmpTable = tmpTable[order(- tmpTable$Points, - tmpTable$GoalDifference, tmpTable$Team),]

    ## Keep only the columns shown in the final league table
    tmpTable = tmpTable[,c("Team", "Games", "Win", "Draw", "Loss", "Points", "For", "Against", "GoalDifference")]

    ## Return Division Information

    tmpSummary = list(Country = country, Division = divname, Season = season, Teams = teams,
        Results = tmpResults, Table = tmpTable)

    invisible(tmpSummary)
}

There is other functionality that we might want to add to the function.
