Site icon R-bloggers

Which English County Has Won the Most Points

[This article was first published on rstats on Robert Hickman, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Every so often a question on The Guardian’s The Knowledge football trivia section piques my interest and is amenable to analysis using R. Previously, I looked at club name suffixes and young World Cup winners last August. This week (give or take), a question posed on twitter caught my attention:

@TheKnowledge_GU was just chatting to some colleagues in the kitchen at work about why Essex doesn’t have many big football clubs and it got me thinking. If you combined all the points from every league team in the ceremonial counties in England, which county would be on top?

— BoxBoron (@Rutland_Walker) January 8, 2019

To start with as always load the libraries needed to analyse this

#get data
library(engsoccerdata)
library(rvest)
#munging
library(tidyverse)
library(magrittr)
#spatial analysis
library(sf)
library(rgdal)
#for plotting maps
library(ggthemes)

The easiest way to get a total of points is using the engsoccerdata:: packages database of every English football match from the top four divisions (this does not include data from the 2017-2018, or 2018-2019 seasons). We can work out the points easily from the goals scored for each team

#load the data
match_data <- engsoccerdata::england %>%
  #select only the necessary columns and melt
  select(season = Season, home, visitor, hgoal, vgoal, tier) %>%
  reshape2::melt(id.vars = c("season", "hgoal", "vgoal", "tier"),
                 variable.name = "location",
                 value.name = "team") %>%
  #will need to match this to location data so some club names need cleaning
  mutate(team_subbed = case_when(
    team == "Yeovil" ~ "Yeovil Town",
    team == "AFC Bournemouth" ~ "A.F.C. Bournemouth",
    team == "Halifax Town" ~ "F.C. Halifax Town",
    team == "Aldershot" ~ "Aldershot Town F.C",
    team == "Wimbledon" ~ "A.F.C. Wimbledon",
    team == "AFC Wimbledon" ~ "A.F.C. Wimbledon",
    team == "Macclesfield" ~ "Macclesfield Town",
    team == "Rushden & Diamonds" ~ "A.F.C. Rushden & Diamonds",
    team == "Milton Keynes Dons" ~ "Milton Keynes",
    team == "Dagenham and Redbridge" ~ "Dagenham & Redbridge",
    team == "Stevenage Borough" ~ "Stevenage"
  )) %>%
  #if cleaning isnt required, take original
  mutate(team_subbed = ifelse(is.na(team_subbed), team, team_subbed))

#peek at the data
head(match_data)
##   season hgoal vgoal tier location            team     team_subbed
## 1   1888     1     1    1     home Accrington F.C. Accrington F.C.
## 2   1888     0     2    1     home Accrington F.C. Accrington F.C.
## 3   1888     2     3    1     home Accrington F.C. Accrington F.C.
## 4   1888     5     1    1     home Accrington F.C. Accrington F.C.
## 5   1888     6     2    1     home Accrington F.C. Accrington F.C.
## 6   1888     3     1    1     home Accrington F.C. Accrington F.C.

The 388k (194k matches) data.frame seems daunting, but actually only results in many fewer unique teams that have played at least one match in the top 4 divisions in England

length(unique(match_data$team_subbed))
## [1] 141

The location of each club can then be found using the wikipedia pages for them/their stadia. This matches 121 of the 141 clubs pretty nicely which is a fairly good percentage all things considered

#find the links to each clubs wikipedia page
wiki <- read_html("https://en.wikipedia.org/wiki/List_of_football_clubs_in_England") %>%
  html_nodes("td:nth-child(1)") %>%
  .[which(grepl("href", .))]

#get the names for each club
wiki_clubs <- wiki %>% html_text() %>% gsub(" \\(.*\\)$", "", .)

#can match 121/141 right off the bat
(unique(match_data$team_subbed) %in% wiki_clubs) %>%
  which() %>%
  length()
## [1] 121

We can find the location of these matching clubs by finding the page for their stadia and then finding the coordinates. It’s a bit of a messy function because I was just jamming stuff together to get data out as best as possible. This takes ~1 minute to run through all 121 teams (for the blog post I actually saved an RDS of the output from this and load it just to save time/server calls)

matching_club_locations <- wiki %>% 
  #take only the matching clubs
  .[which(wiki_clubs %in% unique(match_data$team_subbed))] %>%
  html_nodes("a") %>%
  #get the wiki page link
  html_attr("href") %>%
  paste0("https://en.wikipedia.org", .) %>%
  #for each club page find the stadium and its coordinates
  lapply(., function(team) {
    link <- read_html(team) %>%
      html_nodes(".label a") %>%
      .[1] %>%
      html_attr("href") %>% 
      paste0("https://en.wikipedia.org",. )
    coords <- link %>%
      read_html() %>% 
      html_nodes("#coordinates a") %>%
      html_attr("href") %>%
      .[2]
    #if coords not found use NA
    if(is.na(coords)) {
      coord_df <- data.frame(lat = NA,
                             lon = NA)
    } else {
      coords <- coords %>%
        paste0("https:", .) %>%
        read_html() %>%
        html_nodes(".geo") %>%
        html_text() %>%
        strsplit(., split = ", ")
      coord_df <- data.frame(lat = as.numeric(coords[[1]][1]),
                             lon = as.numeric(coords[[1]][2]))
    }
    return(coord_df)
  })  %>%
  #bind everything together
  do.call(rbind, .) %>%
  #add the club name as a new column
  mutate(team = wiki_clubs[
    which(wiki_clubs %in% unique(match_data$team_subbed))
    ]) %>%
  #filter out missing data
  filter(!is.na(lat) & !is.na(lon))

Which gives us the location of 114 of our 141 clubs. Most of the remaining ones are now-defunct clubs (e.g. Middlesbrough Ironopolis, Leeds City etc.)

missing_teams <- unique(match_data$team_subbed)[which(!unique(match_data$team_subbed) %in% matching_club_locations$team)]
missing_teams
##  [1] "Accrington F.C."           "Darwen"                   
##  [3] "Burton Swifts"             "Port Vale"                
##  [5] "Middlesbrough Ironopolis"  "Rotherham Town"           
##  [7] "Burton Wanderers"          "Loughborough"             
##  [9] "Blackpool"                 "New Brighton Tower"       
## [11] "Burton United"             "Leeds City"               
## [13] "Rotherham County"          "Bristol Rovers"           
## [15] "Darlington"                "Wigan Borough"            
## [17] "Aberdare Athletic"         "New Brighton"             
## [19] "Thames"                    "Aldershot Town F.C"       
## [21] "Hereford United"           "Scarborough"              
## [23] "Cheltenham"                "A.F.C. Rushden & Diamonds"
## [25] "Accrington"                "Crawley Town"             
## [27] "Fleetwood Town"

Given it was a Saturday morning where I had nothing better to do, I simply located these clubs home grounds manually and created a data.frame for their locations. It’s not really great practice but whatever.

These are then all bound together and converted to an sf spatial object with the correct projection

#add in the missing locations
missing_locations <- data.frame(
  lat = c(53.7646, 53.711772, 52.799, 53.049722, 54.5641, 53.42644, 52.8146,
          52.7743, 53.804722, 53.4359, 52.799, 53.7778, 53.428367, 51.48622,
          54.508425, 53.554914, 51.7127, 53.4292, 51.514431, 51.248386,
          52.060719, 54.265478, 51.906158, 52.328033, 53.7646, 51.405083, 53.9165),
  lon = c(-2.358, -2.477292, -1.6354, -2.1925, -1.2456, -1.34377, -1.6335, -1.1992,
          -3.048056, -3.0377, -1.6354, -1.5722, -1.370231, -2.583134, -1.534394,
          -2.650661, -3.4374, -3.0407, 0.034739, -0.754789, -2.717711, -0.418247,
          -2.060211, -0.5999, -2.358, -0.281944, -3.0247),
    team = as.character(missing_teams)
)

#bind together and convert to sf
all_locations <- rbind(matching_club_locations,
                       missing_locations) %>%
  st_as_sf(coords = c("lon", "lat"), crs = st_crs("+init=epsg:4326")) %T>%
  #make a quick plot of locations for sanity check
  plot()

Now that we have all the teams, we need the English historical county boundaries to group them by. I’d actually already used these for football analysis, looknig at if an independent Yorkshire could win the World Cup.

Load the data (the boundary file can be download from the Ordnance Survey) and make a quick plot of the boundaries and teams

(I also created an sf object engwal which is just the counties from England and Wales selected out for background plotting)

## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\Alaa\Desktop\geo_data\boundary\Data\Supplementary_Ceremonial", layer: "Boundary-line-ceremonial-counties_region"
## with 91 features
## It has 2 fields
#load the boundary file
counties <- readOGR(dsn = "path/to/file",
                    layer = "county_boundaries") %>%
  #convert to sf and project as northing/easting
  st_as_sf(., crs = st_crs("+init=epsg:27700")) %>%
  #only interested in the county name
  select(county = NAME) %>%
  #transform the projection to match that of the club locations
  st_transform(., crs = st_crs("+init=epsg:4326"))

engwal <- counties %>%
  .[c(1:54, 88, 90),]
#make a quick plot of counties and teams
ggplot() +
  geom_sf(data = counties, fill = NA) +
  geom_sf(data = all_locations) +
  ggtitle("Location of Teams to have Played Top\n 4 English Football Divisions") +
  theme_minimal() +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank())

(by the way the artifacts around Bristol and the Wirral are from the OS dataset- it’s very annoying)

Then we need to determine which teams are within which counties. The easiest way to do this is to use a spatial join of the team names in all_locations by which county they fall into (using st_contains from the sf package)

#bind the team names to each county
counties %<>%
  st_join(., all_locations, join = st_contains) %>%
  #remove counties that contain zero teams
  filter(!is.na(team)) %>%
  mutate(county = as.character(county))
## although coordinates are longitude/latitude, st_contains assumes that they are planar
#quick plot of number of teams per county (missing = 0)
counties %>%
  group_by(county) %>%
  summarise(n_clubs = n()) %>%
  ggplot(data = .) +
  geom_sf(data = engwal) +
  geom_sf(aes(fill = n_clubs), colour = "black") +
  scale_fill_viridis_c(option = "plasma", name = "# clubs") +
  ggtitle("Number of Top 4 Division Playing\n Teams in each Ceremonial County") +
  theme_minimal() +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank())

Which shows that most English historic counties (and a few Welsh ones due to teams like Cardiff City/ Swansea City etc.) have at least 1 team that has competed in the top 4 flights of English football at some point (those that do not are: Isle of Wight, Rutland, Surrey, Warwickshire, West Sussex and Cornwall).

To finally get the total number of points won by these teams, the county data needs to be joined back onto the match data from the top. First I clean it up a bit then make the left_join by team name. Finally the number of points per match is calculated using case_when and points are grouped by county and summed

county_match_data <- match_data %>% 
  mutate(team = team_subbed) %>%
  select(-team_subbed) %>%
  left_join(., counties, by = "team") %>%
  mutate(points = case_when(
    location == "home" & hgoal > vgoal ~ 3,
    location == "visitor" & vgoal > hgoal ~ 3,
    location == "home" & hgoal < vgoal ~ 0,
    location == "visitor" & vgoal < hgoal ~ 0,
    hgoal == vgoal ~ 1
  ))

county_points <- county_match_data %>%
  group_by(county) %>%
  summarise(total_points = sum(points))

Perhaps unsurprisingly, the county with the most points is Greater London, with Greater Manchester following and other footballing hotspots/ large counties in the West Midlands, Lancashire and around Yorkshire in the trailing group

head(arrange(county_points, -total_points))
## # A tibble: 6 x 2
##   county             total_points
##   <chr>                     <dbl>
## 1 Greater London            67189
## 2 Greater Manchester        47203
## 3 West Midlands             37413
## 4 Lancashire                30808
## 5 South Yorkshire           30061
## 6 West Yorkshire            24947

By contrast, Worcestshire and Northumberland barely have any points, with a few Welsh counties also struggling

head(arrange(county_points, total_points))
## # A tibble: 6 x 2
##   county          total_points
##   <chr>                  <dbl>
## 1 Worcestershire           275
## 2 Northumberland           398
## 3 Mid Glamorgan            744
## 4 Somerset                 813
## 5 Gloucestershire          994
## 6 Herefordshire           1739

If we group by tier as well as county, it’s possible to see how well each county has done at specific tiers.

county_match_data %>%
  group_by(county, tier) %>%
  summarise(total_points = sum(points)) %>%
  left_join(.,
            select(counties, county),
            by = "county") %>%
  ggplot(data = .) +
  geom_sf(data = engwal) +
  geom_sf(aes(fill = total_points), colour = "black") +
  scale_fill_viridis_c(option = "plasma", name = "total points") +
  ggtitle("Number of Points Won by each County\n per Tier of English Football") +
  facet_wrap(~tier) +
  theme_minimal() +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank())

And for the Premier League era this clears up to

county_match_data %>%
  filter(season > 1991) %>%
  group_by(county, tier) %>%
  summarise(total_points = sum(points)) %>%
  left_join(.,
            select(counties, county),
            by = "county") %>%
  ggplot(data = .) +
  geom_sf(data = engwal) +
  geom_sf(aes(fill = total_points), colour = "black", name = "total points") +
  scale_fill_viridis_c(option = "plasma") +
  ggtitle("Number of Points Won by each County\n per Tier of English Football",
          subtitle = "From Begining of 1992/1993 Season") +
  facet_wrap(~tier) +
  theme_minimal() +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank())
## Warning: Ignoring unknown parameters: name

Which shows just how dominant London has been in the top division of English football (especially as it is only competitive at lower levels).

I had wanted to weight points by the average ELO of that league and see which county has the most weight-adjusted points but got bored for this small blog post.

Best,

To leave a comment for the author, please follow the link and comment on their blog: rstats on Robert Hickman.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.