Soccer is all about money (?) – Part 1: Getting the Data

October 18, 2012
By

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

Teams with more money always win, right? At least they have a bigger chance of having success in their national championship. Let’s look into that…

NOTE: If you are not interested in the details of programming with R or getting data from the internet with it, you want to skip this post and have a look at Part 2.

First, we have to get some data. transfermarkt.de is a german site hosting statistics of all kinds of stuff around soccer. With a few simple calls, we can get a table from their site into R:

Let’s load some packages first:

library(RCurl)
library(XML)
library(gdata)
Now define the URL where the table is located:
buli.url <- “http://www.transfermarkt.de/de/1-bundesliga/startseite/wettbewerb_L1.html”
Now parse this URL and extract the table body from it:
parsedHTML <- htmlParse(getURL(buli.url), asText = T)
raw.tab <- xpathSApply(parsedHTML, “//tbody”, xmlValue)[1]

Now let’s clean up that messy HTML stuff:
cl.tab <- trim(gsub(“[\r\n]+”, ” “, raw.tab))
spl.tab <- trim(strsplit(cl.tab, “\t”, fixed = T)[[1]])
spl.tab <- spl.tab[spl.tab != “”]

And build a dataframe from spl.tab. The first block of calls only creates vectors of indices we need to get the correct information from the long character vector spl.tab into the dataframe table.

team.inds <- seq(1, length(spl.tab), 5)

squad.inds <- seq(2, length(spl.tab), 5)
age.inds <- seq(3, length(spl.tab), 5)
value.inds <- seq(4, length(spl.tab), 5)
mvalue.inds <- seq(5, length(spl.tab), 5)

We have to get rid of Euro signs and dots as 1000 separators!

spl.tab[value.inds] <- trim(gsub(“[.€]+”, “”, spl.tab[value.inds]))
spl.tab[mvalue.inds] <- trim(gsub(“[.€]+”, “”, spl.tab[mvalue.inds]))

Now build the dataframe.

table <- data.frame(Team = spl.tab[team.inds],
                    Squad = spl.tab[squad.inds],
                    Age = spl.tab[age.inds],
                    Value = spl.tab[value.inds],
                    mValue = spl.tab[mvalue.inds])

We do all this stuff again in a slightly modified script to get another table which holds the current standing of the national championship (e.g., 1. Bundesliga or Premier League).


buli.stand.url <- “http://www.transfermarkt.de/de/1-bundesliga/tabelle/wettbewerb_L1.html”


parsedHTML <- htmlParse(getURL(premleague.stand.url), asText = T)
raw.tab <- xpathSApply(parsedHTML, “//table”, xmlValue)[2]
cl.tab <- trim(gsub(“[\r\n#]+”, ” “, raw.tab))
spl.tab <- trim(strsplit(cl.tab, “\t”, fixed = T)[[1]])
spl.tab <- spl.tab[spl.tab != “”]
spl.tab <- spl.tab[(grep(“Pkt.”, spl.tab, fixed = T)+1):(grep(“Jahrestabelle”, spl.tab, fixed = T)-1)]

pos.inds <- seq(1, length(spl.tab), 9)
team.inds <- seq(2, length(spl.tab), 9)
games.inds <- seq(3, length(spl.tab), 9)
won.inds <- seq(4, length(spl.tab), 9)
draw.inds <- seq(5, length(spl.tab), 9)
lost.inds <- seq(6, length(spl.tab), 9)
goals.inds <- seq(7, length(spl.tab), 9)
diff.inds <- seq(8, length(spl.tab), 9)
points.inds <- seq(9, length(spl.tab), 9)

stand.table <- data.frame(Pos = as.numeric(spl.tab[pos.inds]),
                          Team = spl.tab[team.inds],
                          Games = as.numeric(spl.tab[games.inds]),
                          Won = as.numeric(spl.tab[won.inds]),
                          Draw = as.numeric(spl.tab[draw.inds]),
                          Lost = as.numeric(spl.tab[lost.inds]),
                          Goals = spl.tab[goals.inds],
                          Diff = as.numeric(spl.tab[diff.inds]),
                          Points = as.numeric(spl.tab[points.inds]))

The number of goals for and against a team is only available within the column “Goals”, so we have to extract them into two new columns.

stand.table$Goals.for <- as.numeric(sapply(stand.table$Goals, function (x) strsplit(x, “:”)[[1]][1]))
stand.table$Goals.aga <- as.numeric(sapply(stand.table$Goals, function (x) strsplit(x, “:”)[[1]][2]))
Now we sort both tables (table which holds the values of the teams and stand.table which holds the standing of the current championship season) by team name. Then we cbind the two tables and check if the team names from both tables are equal. They should be. Last step: Sort the big table by position in the current championship and get rid of the redundant “Team2” column.
stand.table <- stand.table[order(stand.table$Team),]
table <- table[order(table$Team),]
names(table)[names(table) == “Team”] <- “Team2”

big.tab <- cbind(stand.table, table)
all(big.tab$Team == big.tab$Team2) # should be TRUE

big.tab <- big.tab[order(big.tab$Pos),-grep(“Team2”, names(big.tab), fixed = T)]

After all this “magic”, we get a table looking like this (only first 5 rows are shown here, I used the Premier League as an example):

   Pos                 Team Games Won Draw Lost Goals Diff Points Goals.for Goals.aga Squad  Age     Value   mValue
1    1           FC Chelsea     7   6    1    0  15:4   11     19        15         4    23 27.0 387000000 16826087
2    2    Manchester United     7   5    0    2  17:9    8     15        17         9    28 26.1 441000000 15750000
3    3      Manchester City     7   4    3    0  15:8    7     15        15         8    25 26.5 498250000 19930000
4    4           FC Everton     7   4    2    1  14:8    6     14        14         8    22 27.1 128250000  5829545
5    5    Tottenham Hotspur     7   4    2    1  13:8    5     14        13         8    24 27.5 254500000 10604167

In the next post, I will demonstrate how to get nice plots and analyses out of this kind of dataframe… and maybe we will answer the question if soccer is really only about the money.

To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)