Analyzing English Team of the Year Data Since 1973

[This article was first published on World Soccer Analytics, and kindly contributed to R-bloggers.]

The Professional Footballers’ Association (PFA) Team of the Year is announced in England at the end of each season and names the 11 most influential players in each of the country’s professional divisions.

The Team of the Year award was launched in the 1973-1974 season, meaning there were 44 years’ worth of data to scrape. Using Wikipedia’s PFA Team of the Year pages (organized by decade) and the rvest package, I ended up with a data frame of 484 soccer players (44 years × 11 players per year).

Here are some visualizations I thought were cool:

I think that’s enough visualizations for today, but there’s definitely a lot more we can analyze with this data. Let me know if you have any questions or feedback.

R Code Snapshot (the full code can be found on GitHub):

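Step 1: Web scraping

# packages the snippets below rely on (inferred from the functions called)
library(rvest)     # read_html(), html_nodes(), html_table()
library(dplyr)     # bind_rows(), count(), mutate(), left_join()
library(stringr)   # str_remove_all(), str_sub(), str_trim()
library(forcats)   # fct_relevel()
library(ggplot2)
library(ggthemes)  # theme_few(), scale_fill_solarized()
library(reshape2)  # melt(); assumed to come from reshape2

# initialize an empty data.frame with the columns of the scraped tables
df <- data.frame(`Pos.` = character(),
                 Player = character(),
                 Club = character(),
                 `App.` = double(),
                 year = character(),
                 stringsAsFactors = FALSE)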

div_table_numbers <- c(1,5,9,13,17,21,25,29,33,37)

urls <- c("https://en.wikipedia.org/wiki/PFA_Team_of_the_Year_(2000s)",
"https://en.wikipedia.org/wiki/PFA_Team_of_the_Year_(1990s)",
"https://en.wikipedia.org/wiki/PFA_Team_of_the_Year_(1980s)")


for (i in seq_along(urls)) {
  # download and parse each decade page once, then reuse it for every table
  # (read_html() replaces the html() function from older rvest versions)
  page <- read_html(urls[i])

  for (j in seq_along(div_table_numbers)) {
    # table selected for the j-th season on the page
    xpath_base <- '//*[@id="mw-content-text"]/div/table['
    new_data <- page %>%
      html_nodes(xpath = paste0(xpath_base, div_table_numbers[j], "]")) %>%
      html_table()

    new_data <- new_data[[1]]

    # the j-th h3 heading holds the season label; drop its "[edit]" suffix
    year_xpath_base <- '//*[@id="mw-content-text"]/div/h3['
    year <- page %>%
      html_nodes(xpath = paste0(year_xpath_base, j, "]")) %>%
      html_text() %>%
      str_remove_all(fixed("[edit]"))

    new_data$year <- year

    df <- bind_rows(df, new_data)
  }
}
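
The loop above depends on the live Wikipedia layout, so it can help to try the same rvest calls on a small inline page first. The sketch below is only an illustration under stated assumptions: the HTML snippet, the player and club names, and the `toy_*` variable names are all made up, and `read_html()` is used where older rvest code would call `html()`.

```r
library(rvest)

# Made-up page: one heading and one table, loosely mimicking the XPath layout
toy_html <- '<div id="mw-content-text"><div>
  <h3>Season A[edit]</h3>
  <table>
    <tr><th>Pos.</th><th>Player</th><th>Club</th></tr>
    <tr><td>GK</td><td>Player One</td><td>Club X</td></tr>
    <tr><td>FW</td><td>Player Two</td><td>Club Y</td></tr>
  </table>
</div></div>'

toy_page <- read_html(toy_html)

# html_table() returns a list with one element per matched table node
toy_table <- toy_page %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>%
  html_table()
toy_table <- toy_table[[1]]

# pull the heading text and strip the "[edit]" suffix
toy_year <- toy_page %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/h3[1]') %>%
  html_text()
toy_year <- gsub("[edit]", "", toy_year, fixed = TRUE)

toy_table$year <- toy_year
print(toy_table)
```

If the XPath stops matching after a page redesign, `html_nodes()` returns an empty node set and `new_data[[1]]` fails, so checking the length of the result before indexing is a reasonable safeguard.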

Step 2: Exploratory data analysis and visualizations

club_count <- df %>% count(Club, sort = TRUE)

# bar chart of the 15 clubs with the most selections
club_count %>%
  top_n(n = 15, wt = n) %>%
  ggplot(aes(x = reorder(Club, n),
             y = n,
             fill = Club)) +
  geom_col() +
  coord_flip() +
  theme_few() +
  guides(fill = "none") +
  labs(y = "# of Players",
       x = "Club",
       title = "Number of Players in PFA Team of the Year (1973 - 2017)",
       caption = "Data Source: Wikipedia") +
  theme(plot.title = element_text(hjust = 0.5)) +
   scale_fill_manual(values = c("Manchester United" = "darkred", 
                                 "Liverpool" = "orange",
                                 "Arsenal" = "yellow",
                                 "Chelsea" = "blue",
                                 "Blackburn Rovers" = "lightblue",
                                 "Leeds United" = "grey",
                                 "Manchester City" = "dark green",
                               "Derby County" = "black",
                               "Everton" = "gold",
                               "Nottingham Forest" = "red",
                               "Ipswich Town" = "darkblue",
                               "Southampton" = "orange",
                               "Newcastle United" = "darkgrey",
                               "Aston Villa" = "purple", 
                               "Tottenham Hotspur" = "navy"
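                               ))

# short_year is the calendar year in which the season ended
# (first four characters of the season label, plus one)
df <- df %>%
  mutate(short_year = str_sub(year, 1, 4) %>%
           as.numeric() + 1)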

order_positions <- c("GK","DF","MF","FW")

df <- df %>% mutate(Pos. = fct_relevel(Pos., order_positions))

count_position <- df %>%
  filter(Pos. %in% c("MF","FW")) %>%
    count(Pos.,short_year, sort = TRUE)

count_position %>%
  ggplot(aes(x = short_year, y = n, color = Pos., group = Pos.)) +
  geom_point(position = position_jitter(height = 0.005)) +
  geom_smooth(method = "loess") +
  scale_x_continuous(breaks = seq(1973, 2017, 5)) +
  scale_y_continuous(breaks = seq(2, 4, 1)) +
  labs(x = "Year",
       y = "Number of Players per Position",
       title = "Count of Midfielders and Forwards in English PFA Team of the Year (1973 - 2017)",
       caption = "Data Source: Wikipedia") +
  theme_few() +
  theme(plot.title = element_text(hjust = 0.5))
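
The `short_year` column used on the x-axis comes from the season label: take the first four characters, convert them to a number, and add one to get the year the season ended. A minimal base R sketch of that step follows. The helper name `season_end_year` and the sample labels are hypothetical, and the labels are assumed to start with a four-digit year.

```r
# Hypothetical helper (not from the original code): season label -> end year
season_end_year <- function(season_label) {
  # first four characters are assumed to be the starting year of the season
  as.numeric(substr(season_label, 1, 4)) + 1
}

# made-up labels in an assumed "start-end" format
seasons <- c("1973-74", "1999-2000", "2016-17")
end_years <- season_end_year(seasons)
print(end_years)
```

A label that does not begin with four digits becomes `NA` (with a warning from `as.numeric()`), which makes malformed headings easy to spot before plotting.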

Step 3: More web scraping and merging datasets

first_div_top_three_url <- "https://en.wikipedia.org/wiki/List_of_English_football_champions"
first_div_top_three_xpath <- '//*[@id="mw-content-text"]/div/table[2]'

first_div_top_three <- first_div_top_three_url %>%
  read_html() %>%
  html_nodes(xpath = first_div_top_three_xpath) %>%
  html_table()

first_div_top_three <- first_div_top_three[[1]]           

first_div_top_three <- first_div_top_three %>% filter(!(Year %in% c("1915/16–1918/19",
                                                                    "1939/40–1945/46")))

first_div_top_three$Goals <- first_div_top_three$Goals %>% as.numeric()

first_div_top_three <- first_div_top_three %>% 
  rename(`Champions` = `Champions(number of titles)`,
         `Top goalscorer` = `Leading goalscorer`)

epl_top_three_url <- "https://en.wikipedia.org/wiki/List_of_English_football_champions"
epl_top_three_xpath <- '//*[@id="mw-content-text"]/div/table[3]'

epl_top_three <- epl_top_three_url %>%
  read_html() %>%
  html_nodes(xpath = epl_top_three_xpath) %>%
  html_table()

epl_top_three <- epl_top_three[[1]] 

epl_top_three <- epl_top_three %>% 
  rename(`Champions` = `Champions (number of titles)`)

english_top_three_total <- bind_rows(first_div_top_three,
                                     epl_top_three)

# drop the parenthesised title counts that follow each champion's name
english_top_three_total$Champions <- english_top_three_total$Champions %>% str_remove_all(regex("\\([^)]*\\)"))

# drop bracketed footnote markers
english_top_three_total$Champions <- english_top_three_total$Champions %>% str_remove_all(regex("\\[.*?\\]"))

english_top_three_total <- english_top_three_total %>%
  mutate(short_year = str_sub(Year,1,4) %>% as.numeric() + 1)

english_top_three_total <- english_top_three_total %>%
  filter(short_year > 1973) %>%
  select(-c(`Top goalscorer`,Goals))

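# reshape to long format: one row per season and ranking column
english_top_three_total_melted <- english_top_three_total %>%
  melt(id.vars = c("Year", "short_year"),
       value.name = "Club",
       variable.name = "Team_Ranking")

# remove trailing whitespace so club names match those in df
english_top_three_total_melted$Club <- english_top_three_total_melted$Club %>% str_trim(side = "right")

# players whose club did not finish in the top three get NA for Team_Ranking
df_merged <- df %>% left_join(english_top_three_total_melted,
                              by = c("Club", "short_year"))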
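
To make the cleaning, reshaping, and joining in Step 3 easier to follow, here is a small self-contained sketch on made-up data. It is not the code used for the post: it swaps in base R (`gsub()`, `trimws()`, `merge()`) for stringr, reshape2, and dplyr so it runs without extra packages, and the club names, the `Runners_up` column, and the `clean_club` helper are all hypothetical.

```r
# Made-up rows standing in for the scraped champions table
top_three <- data.frame(
  short_year = c(2001, 2002),
  Champions  = c("Club A (3)[a]", "Club B (1)"),
  Runners_up = c("Club B", "Club A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip "(...)" groups, "[...]" markers, trailing spaces
clean_club <- function(x) {
  x <- gsub("\\([^)]*\\)", "", x)
  x <- gsub("\\[.*?\\]", "", x, perl = TRUE)
  trimws(x, which = "right")
}
top_three$Champions <- clean_club(top_three$Champions)

# Wide to long: one row per season and ranking, built by hand here
top_three_long <- data.frame(
  short_year   = rep(top_three$short_year, times = 2),
  Team_Ranking = rep(c("Champions", "Runners_up"), each = nrow(top_three)),
  Club         = c(top_three$Champions, top_three$Runners_up),
  stringsAsFactors = FALSE
)

# Made-up Team of the Year rows
players <- data.frame(
  Player     = c("P1", "P2", "P3"),
  Club       = c("Club A", "Club C", "Club B"),
  short_year = c(2001, 2001, 2002),
  stringsAsFactors = FALSE
)

# Left join: keep every player, attach the ranking where club and year match
merged <- merge(players, top_three_long,
                by = c("Club", "short_year"), all.x = TRUE)
print(merged)
```

The join only matches when club names are spelled identically in both tables, which is why the footnote markers and trailing spaces have to be removed first.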

Step 4: More data visualizations with merged dataset

club_count_year <- df_merged %>% 
  count(Club, year, Team_Ranking, sort = TRUE) %>%
  mutate(club_year = paste(Club, year))

# ten club-seasons with the most selections; champions get their own color
club_count_year %>%
  top_n(n = 10, wt = n) %>%
  ggplot(aes(x = reorder(club_year, n),
             y = n)) +
  geom_col(aes(fill = factor(ifelse(Team_Ranking == "Champions",
                                    1,
                                    2)))) +
  coord_flip() +
  theme_few() +
  guides(fill = "none") +
  labs(y = "# of Players",
       x = "Club",
       title = "Teams with Most Representation in PFA Team of the Year (1973 - 2017)",
       caption = "Data Source: Wikipedia") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_solarized()

club_count_year_champions <- df_merged %>%
  count(Club, year, Team_Ranking, sort = TRUE) %>%
  mutate(club_year = paste(Club, year)) %>%
  filter(Team_Ranking == "Champions")

# negative n selects the ten champions with the fewest selections
club_count_year_champions %>%
  top_n(n = -10, wt = n) %>%
  ggplot(aes(x = reorder(club_year, n),
             y = n)) +
  geom_col(aes(fill = Club)) +
  coord_flip() +
  theme_few() +
  guides(fill = "none") +
  labs(y = "# of Players",
       x = "Club",
       title = "English Champions with Fewest Players in PFA Team of the Year (1973 - 2017)",
       caption = "Data Source: Wikipedia") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(breaks = seq(0, 2, 1)) +
  scale_fill_manual(values = c("Manchester United" = "darkred", 
                                 "Liverpool" = "orange",
                                 "Arsenal" = "yellow",
                                 "Chelsea" = "blue",
                                 "Blackburn Rovers" = "lightblue",
                                 "Leeds United" = "grey",
                                 "Manchester City" = "dark green",
                               "Derby County" = "black",
                               "Everton" = "gold",
                               "Nottingham Forest" = "red"))
