Programming with R – Processing Football League Data Part I

November 23, 2010

(This article was first published on Software for Exploratory Data Analysis and Statistical Modelling, and kindly contributed to R-bloggers)

In this post we will make use of football results data from the football-data.co.uk website to demonstrate creating functions in R to automate a series of standard operations that would be required for results data from various leagues and divisions.

The first step is to consider what control options should be available as part of the function and here is a list of some arguments that will be used for this implementation of a football result data processing function:

  • The name of a csv data file from the football-data.co.uk website.
  • A text string to specify the country and division for the data.
  • A text string specifying the season.
  • A list of teams in the division (optional), which could be used to test for data entry errors in the data file.
  • The number of points for a win, draw or loss. This might seem a strange option initially but different leagues might award different points for the three outcomes.

Some of this information might appear optional but is included so that we can write a custom print function at a later date to display a meaningful summary of the object (list) that will be created by the function.

The first part of our function is concerned with checking the various values provided to the function arguments. Our skeleton function is as follows:

football.process.v1 = function(datafile, country, divname, season,
  teams, winPoints = 3, drawPoints = 1, lossPoints = 0)
{
 
}

Here we have specified default options for three of the arguments with the most likely number of points for each match outcome, i.e. 3 points for a win and 1 point for a draw.
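As a rough illustration of how these default values behave, the sketch below uses a hypothetical helper, points.demo, which is not part of the function developed in this post. It simply returns the points values in effect, so that the defaults can be compared with a league that awards two points for a win.

```r
## Hypothetical helper (not part of football.process.v1) that simply
## reports which points values are in effect for a call.
points.demo = function(winPoints = 3, drawPoints = 1, lossPoints = 0)
{
    c(win = winPoints, draw = drawPoints, loss = lossPoints)
}

## No arguments supplied, so the defaults are used.
default.points = points.demo()

## Override a single default by name; the others are left alone.
old.points = points.demo(winPoints = 2)

print(default.points)
print(old.points)
```

Naming the argument in the call means that only the value that differs from the default has to be supplied.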

To illustrate the working of the result processing function we will use a small excerpt from the start of the 2010/2011 English Premiership season, which is shown below:

Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee
E0,14/8/2010,Aston Villa,West Ham,3,0,H,2,0,H,M Dean
E0,14/8/2010,Blackburn,Everton,1,0,H,1,0,H,P Dowd
E0,14/8/2010,Bolton,Fulham,0,0,D,0,0,D,S Attwell
E0,14/8/2010,Chelsea,West Brom,6,0,H,2,0,H,M Clattenburg
E0,14/8/2010,Sunderland,Birmingham,2,2,D,1,0,H,A Taylor
E0,14/8/2010,Tottenham,Man City,0,0,D,0,0,D,A Marriner
E0,14/8/2010,Wigan,Blackpool,0,4,A,0,3,A,M Halsey
E0,14/8/2010,Wolves,Stoke,2,1,H,2,0,H,L Probert
E0,15/8/2010,Liverpool,Arsenal,1,1,D,0,0,D,M Atkinson
E0,16/8/2010,Man United,Newcastle,3,0,H,2,0,H,C Foy
E0,21/8/2010,Arsenal,Blackpool,6,0,H,3,0,H,M Jones
E0,21/8/2010,Birmingham,Blackburn,2,1,H,0,0,D,M Oliver
E0,21/8/2010,Everton,Wolves,1,1,D,1,0,H,L Mason
E0,21/8/2010,Stoke,Tottenham,1,2,A,1,2,A,C Foy
E0,21/8/2010,West Brom,Sunderland,1,0,H,0,0,D,K Friend
E0,21/8/2010,West Ham,Bolton,1,3,A,0,0,D,A Marriner
E0,21/8/2010,Wigan,Chelsea,0,6,A,0,1,A,M Dean
E0,22/8/2010,Fulham,Man United,2,2,D,0,1,A,P Walton
E0,22/8/2010,Newcastle,Aston Villa,6,0,H,3,0,H,M Atkinson
E0,23/8/2010,Man City,Liverpool,3,0,H,1,0,H,P Dowd
E0,28/8/2010,Blackburn,Arsenal,1,2,A,1,1,D,C Foy
E0,28/8/2010,Blackpool,Fulham,2,2,D,0,1,A,M Oliver
E0,28/8/2010,Chelsea,Stoke,2,0,H,1,0,H,M Atkinson
E0,28/8/2010,Man United,West Ham,3,0,H,1,0,H,M Clattenburg
E0,28/8/2010,Tottenham,Wigan,0,1,A,0,0,D,P Dowd
E0,28/8/2010,Wolves,Newcastle,1,1,D,1,0,H,S Attwell
E0,29/8/2010,Aston Villa,Everton,1,0,H,1,0,H,M Jones
E0,29/8/2010,Bolton,Birmingham,2,2,D,0,1,A,K Friend
E0,29/8/2010,Liverpool,West Brom,1,0,H,0,0,D,L Probert
E0,29/8/2010,Sunderland,Man City,1,0,H,0,0,D,M Dean

This is stored in a file E0test.csv so that we can use the read.csv function to import the results data and then process it.

The first series of commands that we add to the function are for checking various function arguments specified by the user to ensure that they are sensible. First up we check whether a results data file has been specified as we cannot do any processing without any results. The simple test is for whether a file name has been specified:

if (missing(datafile))
{
    stop("Results csv file not specified.")
}
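To see this check in isolation, here is a minimal sketch using a hypothetical one-argument function, check.demo, that contains only the test above. tryCatch is used so that the error message can be captured rather than halting the session.

```r
## Hypothetical function holding only the missing-file check.
check.demo = function(datafile)
{
    if (missing(datafile))
    {
        stop("Results csv file not specified.")
    }
    datafile
}

## Calling without a file name raises an error; capture its message.
msg = tryCatch(check.demo(), error = function(e) conditionMessage(e))
print(msg)

## Supplying a file name passes the check and returns the name.
print(check.demo("E0test.csv"))
```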

It might be sensible to check whether the object datafile is actually a character string specifying a file, but this hasn’t been done for now. We then check whether the country name and division have been specified and set them to blank strings if they haven’t been set by the user.

if (missing(country))
{
    warning("Country of league not specified.")
    country = ""
}
 
if (missing(divname))
{
    warning("Name of league division not specified.")
    divname = ""
}
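The stricter check on datafile mentioned above is not part of the function, but one possible form is sketched below. The helper name check.datafile and its error messages are assumptions made for illustration; it relies only on the base R functions is.character and file.exists.

```r
## Hypothetical helper: stop unless datafile is a single character
## string naming a file that exists.
check.datafile = function(datafile)
{
    if (!is.character(datafile) || length(datafile) != 1)
    {
        stop("Results csv file must be a single character string.")
    }
    if (!file.exists(datafile))
    {
        stop("Results csv file not found: ", datafile)
    }
    invisible(TRUE)
}

## Try it on a temporary file that exists and on a non-character value.
tmpfile = tempfile(fileext = ".csv")
writeLines("Div,Date,HomeTeam,AwayTeam", tmpfile)

ok = check.datafile(tmpfile)
bad = tryCatch(check.datafile(42), error = function(e) conditionMessage(e))
print(ok)
print(bad)
```

A check of this kind would sit directly after the missing(datafile) test, so that a mistyped file name is reported before any attempt to import the data.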

Next up we import the data file and keep only the columns of interest (at this point in the development of the function, at least). The raw data from the website contains many more columns of information than we need:

tmpResults =
    read.csv(datafile)[,c("Date","HomeTeam","AwayTeam","FTR","FTHG","FTAG")]

The square brackets subset the imported data frame so that only the named columns are kept. Then we check whether the team names have been specified by the user and, if not, extract them from the data provided:

if (missing(teams))
{
    warning("Team names not specified - extracted from results data.")
    teams = sort(unique(c(as.character(tmpResults$HomeTeam),
        as.character(tmpResults$AwayTeam))))
}
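The import and team-extraction steps can be tried without a file on disk. The sketch below passes three rows of the excerpt to read.csv through its text argument (an assumption made to keep the example self-contained; the function in this post reads from a file) and then builds the sorted team list in the same way. The names csvText, demoResults and demoTeams are illustrative only.

```r
## Three rows from the excerpt, supplied as text instead of a file.
csvText = "Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee
E0,14/8/2010,Aston Villa,West Ham,3,0,H,2,0,H,M Dean
E0,14/8/2010,Blackburn,Everton,1,0,H,1,0,H,P Dowd
E0,15/8/2010,Liverpool,Arsenal,1,1,D,0,0,D,M Atkinson"

## Keep only the six columns of interest, in the order requested.
demoResults =
    read.csv(text = csvText)[,c("Date","HomeTeam","AwayTeam","FTR","FTHG","FTAG")]

## Pool home and away names, drop duplicates and sort alphabetically.
demoTeams = sort(unique(c(as.character(demoResults$HomeTeam),
    as.character(demoResults$AwayTeam))))

print(names(demoResults))
print(demoTeams)
```

The as.character calls matter on versions of R before 4.0.0, where read.csv converts strings to factors by default; on later versions they are harmless.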

The sort function is used to order the team names alphabetically which is the order often used in league tables, especially when no games have been played. We then convert the columns HomeTeam and AwayTeam into factors, which allows teams that haven’t played a fixture yet to be included in the table.

tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)
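A small sketch of what the explicit levels provide, using made-up vectors rather than the real results: teams with no fixtures still appear with a count of zero, and a name that is not in the team list becomes NA, which gives a simple way to spot the data entry errors mentioned at the start of the post.

```r
## Illustrative team list and home-team column ("Chelsae" is a
## deliberate misspelling).
demoTeams = c("Arsenal", "Chelsea", "Wigan")
demoHome = c("Arsenal", "Chelsae", "Arsenal")

homeFactor = factor(demoHome, levels = demoTeams)

## Every level is tabulated, including teams yet to play at home.
homeCounts = table(homeFactor)
print(homeCounts)

## Names that did not match any level have been turned into NA.
unmatched = demoHome[is.na(homeFactor)]
print(unmatched)
```

The function as written does not report such mismatches, so a test along these lines would have to be added if the team list is to be used for error checking.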

To round off the first part of creating the result processing function we create a list object to return at the end of the function.

tmpSummary = list(Country = country, Division = divname,
    Season = season, Teams = teams, Results = tmpResults)

The function so far:

football.process.v1 = function(datafile, country, divname, season,
  teams, winPoints = 3, drawPoints = 1, lossPoints = 0)
{
    ## Validation Function Arguments

    if (missing(datafile))
    {
        stop("Results csv file not specified.")
    }

    if (missing(country))
    {
        warning("Country of league not specified.")
        country = ""
    }

    if (missing(divname))
    {
        warning("Name of league division not specified.")
        divname = ""
    }

    ## Import Results

    tmpResults =
        read.csv(datafile)[,c("Date","HomeTeam","AwayTeam","FTR","FTHG","FTAG")]

    if (missing(teams))
    {
        warning("Team names not specified - extracted from results data.")
        teams = sort(unique(c(as.character(tmpResults$HomeTeam),
            as.character(tmpResults$AwayTeam))))
    }

    tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
    tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)

    ## Return Division Information

    tmpSummary = list(Country = country, Division = divname,
        Season = season, Teams = teams, Results = tmpResults)

    invisible(tmpSummary)
}

We then test this function with the data file shown above. First up we create our own list of teams in the English Premiership for 2010/2011 and specify some of the other function arguments while using the defaults for points.

> E0teams.1011 = c("Arsenal", "Aston Villa", "Birmingham", "Blackburn",
+ "Blackpool", "Bolton", "Chelsea", "Everton", "Fulham", "Liverpool",
+ "Man City", "Man United", "Newcastle", "Stoke", "Sunderland",
+ "Tottenham", "West Brom", "West Ham", "Wigan", "Wolves")
> print(football.process.v1("E0test.csv", "England", "Premiership",
+     "2010-2011", E0teams.1011))
$Country
[1] "England"
 
$Division
[1] "Premiership"
 
$Season
[1] "2010-2011"
 
$Teams
 [1] "Arsenal"     "Aston Villa" "Birmingham"  "Blackburn"   "Blackpool"  
 [6] "Bolton"      "Chelsea"     "Everton"     "Fulham"      "Liverpool"  
[11] "Man City"    "Man United"  "Newcastle"   "Stoke"       "Sunderland" 
[16] "Tottenham"   "West Brom"   "West Ham"    "Wigan"       "Wolves"     
 
$Results
        Date    HomeTeam    AwayTeam FTR FTHG FTAG
1  14/8/2010 Aston Villa    West Ham   H    3    0
2  14/8/2010   Blackburn     Everton   H    1    0
3  14/8/2010      Bolton      Fulham   D    0    0
4  14/8/2010     Chelsea   West Brom   H    6    0
5  14/8/2010  Sunderland  Birmingham   D    2    2
6  14/8/2010   Tottenham    Man City   D    0    0
7  14/8/2010       Wigan   Blackpool   A    0    4
8  14/8/2010      Wolves       Stoke   H    2    1
9  15/8/2010   Liverpool     Arsenal   D    1    1
10 16/8/2010  Man United   Newcastle   H    3    0
11 21/8/2010     Arsenal   Blackpool   H    6    0
12 21/8/2010  Birmingham   Blackburn   H    2    1
13 21/8/2010     Everton      Wolves   D    1    1
14 21/8/2010       Stoke   Tottenham   A    1    2
15 21/8/2010   West Brom  Sunderland   H    1    0
16 21/8/2010    West Ham      Bolton   A    1    3
17 21/8/2010       Wigan     Chelsea   A    0    6
18 22/8/2010      Fulham  Man United   D    2    2
19 22/8/2010   Newcastle Aston Villa   H    6    0
20 23/8/2010    Man City   Liverpool   H    3    0
21 28/8/2010   Blackburn     Arsenal   A    1    2
22 28/8/2010   Blackpool      Fulham   D    2    2
23 28/8/2010     Chelsea       Stoke   H    2    0
24 28/8/2010  Man United    West Ham   H    3    0
25 28/8/2010   Tottenham       Wigan   A    0    1
26 28/8/2010      Wolves   Newcastle   D    1    1
27 29/8/2010 Aston Villa     Everton   H    1    0
28 29/8/2010      Bolton  Birmingham   D    2    2
29 29/8/2010   Liverpool   West Brom   H    1    0
30 29/8/2010  Sunderland    Man City   H    1    0
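
The call above is wrapped in print because the function ends with invisible(tmpSummary), which returns the list without echoing it at the console. A minimal sketch of that behaviour, using a hypothetical function quiet.demo:

```r
## Hypothetical function that returns its result invisibly.
quiet.demo = function()
{
    invisible(list(Country = "England", Division = "Premiership"))
}

## At the console this call on its own would display nothing.
quiet.demo()

## The value is still returned, so it can be stored and inspected.
res = quiet.demo()
print(res$Division)

## withVisible reports whether a value would have been auto-printed.
vis = withVisible(quiet.demo())$visible
print(vis)
```

Returning the list invisibly keeps the console uncluttered when the result is processed further, while still allowing it to be assigned or printed on request.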
