Analyzing the Greatest Strikers in Football I: Getting Data
I do not always come up with new ideas for my blog; often I get inspired by the great work of others.
In this case, it was a reddit post by u/Cheapo_Sam, who charted world football's greatest goal scorers in a marvelous way.
According to the post, the data was gathered manually, which I thought was too tedious (Ain’t nobody got time for that!).
So I decided to automate this step.
This post deals exclusively with the acquisition and cleaning of the data. The next (one or two) posts will handle the analysis.
    library(tidyverse) # for data wrangling
    library(rvest)     # for web scraping
    library(lubridate) # for date formats
As a data source, I found transfermarkt.co.uk to be a good place.
It has very detailed records of goals scored by players, dating back to the 1960s.
As an example, we will get the data of Zlatan Ibrahimovic.
    url <- "https://www.transfermarkt.co.uk/zlatan-ibrahimovic/alletore/spieler/3455/saison//verein/0/liga/0/wettbewerb//pos/0/trainer_id/0/minute/0/torart/0/plus/1"
    web <- read_html(url)
    doc <- html_table(web, fill = TRUE)
The page behind this URL contains a very detailed table of all goals scored by Zlatan.
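To see what the two calls in the snippet above return without hitting the network, here is a minimal sketch on a made-up HTML fragment. The markup, club names and file names are invented for illustration and are not transfermarkt's. It shows that html_table() returns a list with one data frame per table, and that attributes of images nested in table cells are not part of that data frame but can be collected separately with an XPath query such as //td/a/img.

```r
library(rvest)

# Hypothetical fragment: two image cells per row (own club, opponent) and a minute
html <- '
<table>
  <tr><th>club</th><th>vs</th><th>minute</th></tr>
  <tr>
    <td><a href="#"><img src="a_tiny.png" alt="Club A"></a></td>
    <td><a href="#"><img src="x_tiny.png" alt="Club X"></a></td>
    <td>12</td>
  </tr>
  <tr>
    <td><a href="#"><img src="a_tiny.png" alt="Club A"></a></td>
    <td><a href="#"><img src="y_tiny.png" alt="Club Y"></a></td>
    <td>64</td>
  </tr>
</table>'

web <- read_html(html)

# One list entry per <table> in the document
doc <- html_table(web)
tab <- doc[[1]]

# All images that sit inside a link inside a table cell, in document order
imgs <- html_nodes(web, xpath = "//td/a/img")
alts <- html_attr(imgs, "alt")
srcs <- html_attr(imgs, "src")

# Every row contributes two images, so keep entries 1, 3, 5, ...
club       <- alts[seq(1, length(alts), 2)]
club_crest <- srcs[seq(1, length(srcs), 2)]
```

Because the images are returned in document order, taking every other entry only works when each row holds exactly two such images, which is an assumption of this toy fragment.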
read_html() reads the HTML content of the page and html_table() puts all HTML tables (<table>) on the page into a list [...]

[...] table cells (<td>) which have a link (<a>) and within the link an image (<img>). The name of the club is stored as the alt attribute of the image, which we grab with html_attr(). The problem is that the vs. column above also contains the same structure. Since the table is read row-wise, we can skip every other entry in the club vector. The same principle can be used for the crests, yet here we need the src attribute, which points to the image file.

You may have noticed some empty entries in several columns. We can’t do anything about [...]

    for(i in 1:nrow(tab)){
      if(tab$competition[i]==""){
        tab$competition[i] <- tab$competition[i-1]
        tab$date[i] <- tab$date[i-1]
        tab$day[i] <- tab$day[i-1]
        tab$venue[i] <- tab$venue[i-1]
        tab$against[i] <- tab$against[i-1]
        club <- append(club,club[i-1],after = i-1)
        club_crest <- append(club_crest,club_crest[i-1],after = i-1)
      }
    }

We can also delete the last row of the data frame because it only contains the totals.

    tab <- tab[-nrow(tab),]
    tab$date <- as.Date(tab$date,format = "%m/%d/%y")
    tab$club <- club
    tab$club_crest <- club_crest
    idx <- (lubridate::year(tab$date)>2020)+0
    tab$date <- tab$date-lubridate::years(idx*100)

You may wonder about the last two lines. This is a dirty hack to get around the following issue.

    as.Date("10/02/65",format = "%m/%d/%y")
    ## [1] "2065-10-02"

transfermarkt displays the year with two digits, and as.Date() reads the two-digit years 00 to 68 as 2000 to 2068. So as soon as we get the data from a [...]

We finish the data cleaning process by adding and tweaking some variables.

    # add a goal count column
    tab <- tab %>% mutate(goals=row_number())
    # get bigger crests
    tab$club_crest <- tab$club_crest %>%
      str_remove("_.*") %>%
      paste0(".png") %>%
      str_replace("tiny","head")
    # convert club to factor
    tab$club <- factor(tab$club,levels = unique(tab$club))
    glimpse(tab)
    ## Observations: 426
    ## Variables: 12
    ## $ competition <chr> "Eredivisie", "Eredivisie", "UEFA Cup", "UEFA Cup"...
    ## $ day <chr> "3", "4", "First Round", "First Round", "10", "10"...
    ## $ date <date> 2001-08-26, 2001-09-08, 2001-09-20, 2001-09-27, 2...
    ## $ venue <chr> "A", "A", "H", "A", "H", "H", "A", "A", "A", "H", ...
    ## $ against <chr> "Feyenoord", "FC Twente", "Apol. Limassol", "Apol....
    ## $ minute <dbl> 64, 19, 4, 64, 68, 79, 68, 78, 95, 47, 11, 34, 58,...
    ## $ standing <chr> "0:1", "0:1", "1:0", "0:2", "3:0", "4:0", "1:1", "...
    ## $ type <chr> "Tap-in", "", "", "Header", "", "Tap-in", "", "", ...
    ## $ provider <chr> "Hatem Trabelsi", "", "Nikolaos Machlas", "Wambert...
    ## $ club <fct> Ajax Amsterdam, Ajax Amsterdam, Ajax Amsterdam, Aj...
    ## $ club_crest <chr> "https://tmssl.akamaized.net//images/wappen/head/6...
    ## $ goals <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...

That’s it! We now have a clean data frame of all goals scored by Zlatan Ibrahimovic with a lot [...]

You can wrap all the steps above into a function to easily get the data of other players.

    get_goals <- function(player = "",id = ""){
      url <- paste0("https://www.transfermarkt.co.uk/",player,"/alletore/spieler/",id,"/saison//verein/0/liga/0/wettbewerb//pos/0/trainer_id/0/minute/0/torart/0/plus/1")
      web <- read_html(url)
      #extract and clean data
      return(tab)
    }

All you need to do is navigate to the page of the player you want and get the player name [...] For Zlatan, the two arguments are "zlatan-ibrahimovic" and "3455", the parts of the URL used above.

You may also consider grabbing some additional data, for instance the player's birthday.

    birthdate <- html_node(web,".dataDaten .dataValue") %>%
      html_text() %>%
      str_squish() %>%
      as.Date(format="%b %d, %Y")

This is where you need some basic knowledge about css again. If you look at the [...]. The selector ".dataDaten .dataValue" matches an element of class dataValue nested inside an element of class dataDaten. html_node() in conjunction with html_text() allows us to extract the birthday as plain text by navigating along these css classes. str_squish() eliminates all the excessive whitespace.
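The chain from css selector to Date can be tried offline on a small stand-in fragment. This is only a sketch: the HTML below is invented and merely mimics the two class names used in the selector, and the date in it is a placeholder, not a real player's birthday.

```r
library(rvest)
library(stringr)

# Hypothetical markup: only the class names dataDaten and dataValue
# are taken from the selector in the text, everything else is made up
html <- '
<div class="dataDaten">
  <p>
    <span class="dataItem">Date of birth:</span>
    <span class="dataValue">
        Jan 5, 1990
    </span>
  </p>
</div>'

web <- read_html(html)

# html_node() returns the first match only; html_text() keeps the
# surrounding line breaks and indentation of the source
raw <- html_text(html_node(web, ".dataDaten .dataValue"))

# str_squish() trims both ends and collapses inner runs of whitespace
clean <- str_squish(raw)

# %b matches abbreviated month names in the current locale,
# so this assumes an English locale
birthdate <- as.Date(clean, format = "%b %d, %Y")
```

If the locale is not English, as.Date() returns NA here instead of raising an error, so it is worth checking the result before computing ages from it.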
This allows us now to add yet another column to the data frame which holds the exact age of [...]

    tab$age <- time_length(difftime(tab$date,birthdate),"years")

In a similar manner, you can get the clear name of the player and a link to his profile [...]

    player <- html_node(web,"h1") %>% html_text()
    portrait <- html_node(web,".dataBild img") %>% html_attr("src")

You can also put these three variables into the function and return everything as a list.

In the next post, we will use this data to explore the career goals of (not only!) [...]

Code

    get_goals <- function(player="zlatan-ibrahimovic",id="3455"){
      url <- paste0("https://www.transfermarkt.co.uk/",player,"/alletore/spieler/",id,"/saison//verein/0/liga/0/wettbewerb//pos/0/trainer_id/0/minute/0/torart/0/plus/1")
      web <- xml2::read_html(url)

      # player name, portrait link and birthday
      player <- rvest::html_text(rvest::html_node(web,"h1"))
      portrait <- rvest::html_attr(rvest::html_node(web,".dataBild img"),"src")
      birthdate <- rvest::html_node(web,".dataDaten .dataValue")
      birthdate <- stringr::str_squish(rvest::html_text(birthdate))
      birthdate <- as.Date(birthdate,format="%b %d, %Y")

      # goal table: take the second table, keep and name the needed columns
      doc <- rvest::html_table(web,fill=TRUE)
      tab <- doc[[2]]
      tab <- tab[,c(2,3,4,5,9,12,13,14,15)]
      names(tab) <- c("competition","day","date","venue","against","minute","standing","type","provider")
      tab <- as.data.frame(tab)
      # drop rows whose venue field contains "Season"
      tab <- tab[!grepl("Season",tab$venue),]
      # remove anything in parentheses from the opponent name
      tab$against <- stringr::str_remove_all(tab$against,"\\(.*\\)")
      tab$against <- stringr::str_trim(tab$against)

      # club names (alt) and crests (src), keeping every other image
      club <- web %>%
        rvest::html_nodes(xpath = "//td/a/img") %>%
        rvest::html_attr("alt")
      club <- club[seq(1,length(club),2)]
      club_crest <- web %>%
        rvest::html_nodes(xpath = "//td/a/img") %>%
        rvest::html_attr("src")
      club_crest <- club_crest[seq(1,length(club_crest),2)]

      # fill empty entries from the previous row
      for(i in 1:nrow(tab)){
        if(tab$competition[i]==""){
          tab$competition[i] <- tab$competition[i-1]
          tab$date[i] <- tab$date[i-1]
          tab$day[i] <- tab$day[i-1]
          tab$venue[i] <- tab$venue[i-1]
          tab$against[i] <- tab$against[i-1]
          club <- append(club,club[i-1],after = i-1)
          club_crest <- append(club_crest,club_crest[i-1],after = i-1)
        }
      }

      tab$minute <- readr::parse_number(tab$minute)
      # the last row only contains the totals
      tab <- tab[-nrow(tab),]
      tab$club <- club
      tab$club_crest <- club_crest
      # two-digit years: move dates after 2020 back by a century
      tab$date <- as.Date(tab$date,format = "%m/%d/%y")
      idx <- (lubridate::year(tab$date)>2020)+0
      tab$date <- tab$date-lubridate::years(idx*100)
      tab <- tab[!is.na(tab$minute),]

      tab$age <- lubridate::time_length(difftime(tab$date,birthdate),"years")
      tab <- tab %>% dplyr::mutate(goals=dplyr::row_number())
      # get bigger crests
      tab$club_crest <- tab$club_crest %>%
        stringr::str_remove("_.*") %>%
        paste0(".png") %>%
        stringr::str_replace("tiny","head")
      tab$club <- factor(tab$club,levels = unique(tab$club))

      return(list(data=tab,name=player,birthday=birthdate,portrait=portrait))
    }

To leave a comment for the author, please follow the link and comment on their blog: schochastics.
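Two of the cleaning steps in get_goals() can be checked in isolation with base R only. The sketch below is not the code from the post: fill_down() and fix_century() are hypothetical helper names, and fix_century() swaps the fixed cut-off year 2020 for "any date later than today", on the assumption that a goal cannot lie in the future. With the fixed cut-off, a goal scored in 2021 or later would also be moved back by a century.

```r
# Hypothetical helper: replace empty strings by the last non-empty value.
# The first element is left untouched because it has no predecessor.
fill_down <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (isTRUE(x[i] == "")) x[i] <- x[i - 1]
  }
  x
}

# Hypothetical helper: parse two-digit-year dates and move every date
# that lies after `today` back by 100 years.
fix_century <- function(x, format = "%m/%d/%y", today = Sys.Date()) {
  d <- as.Date(x, format = format)
  future <- !is.na(d) & d > today
  lt <- as.POSIXlt(d[future])
  lt$year <- lt$year - 100L   # POSIXlt stores the year as an integer offset
  d[future] <- as.Date(lt)
  d
}

competition <- fill_down(c("Eredivisie", "", "UEFA Cup", "", ""))

# `today` is pinned so that the result does not depend on the run date
dates <- fix_century(c("10/02/65", "08/26/01", "03/15/21"),
                     today = as.Date("2025-01-01"))
```

Pinning today makes the behaviour reproducible in tests; in real use the default Sys.Date() would be the natural choice.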