Soccer is all about money (?) – Part 1: Getting the Data

Posted on October 18, 2012 by Sascha W. in R bloggers | 0 Comments

[This article was first published on Rcrastinate, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Teams with more money always win, right? At least they have a bigger chance of having success in their national championship. Let’s look into that…

NOTE: If you are not interested in the details of programming with R or getting data from the internet with it, you want to skip this post and have a look at Part 2.

First, we have to get some data. transfermarkt.de is a german site hosting statistics of all kinds of stuff around soccer. With a few simple calls, we can get a table from their site into R:

Let’s load some packages first:

library(RCurl)

library(XML)

library(gdata)

Now define the URL where the table is located:

buli.url <- “http://www.transfermarkt.de/de/1-bundesliga/startseite/wettbewerb_L1.html”

Now parse this URL and extract the table body from it:

parsedHTML <- htmlParse(getURL(buli.url), asText = T)

raw.tab <- xpathSApply(parsedHTML, “//tbody”, xmlValue)[1]

Now let’s clean up that messy HTML stuff:

cl.tab <- trim(gsub(“[\r\n]+”, ” “, raw.tab))

spl.tab <- trim(strsplit(cl.tab, “\t”, fixed = T)[[1]])

spl.tab <- spl.tab[spl.tab != “”]

And build a dataframe from spl.tab. The first block of calls only creates vectors of indices we need to get the correct information from the long character vector spl.tab into the dataframe table.

team.inds <- seq(1, length(spl.tab), 5)

squad.inds <- seq(2, length(spl.tab), 5)

age.inds <- seq(3, length(spl.tab), 5)

value.inds <- seq(4, length(spl.tab), 5)

mvalue.inds <- seq(5, length(spl.tab), 5)

We have to get rid of Euro signs and dots as 1000 separators!

spl.tab[value.inds] <- trim(gsub(“[.€]+”, “”, spl.tab[value.inds]))

spl.tab[mvalue.inds] <- trim(gsub(“[.€]+”, “”, spl.tab[mvalue.inds]))

Now build the dataframe.

table <- data.frame(Team = spl.tab[team.inds],

                    Squad = spl.tab[squad.inds],

                    Age = spl.tab[age.inds],

                    Value = spl.tab[value.inds],

                    mValue = spl.tab[mvalue.inds])

We do all this stuff again in a slightly modified script to get another table which holds the current standing of the national championship (e.g., 1. Bundesliga or Premier League).

buli.stand.url <- “http://www.transfermarkt.de/de/1-bundesliga/tabelle/wettbewerb_L1.html”

parsedHTML <- htmlParse(getURL(premleague.stand.url), asText = T)

raw.tab <- xpathSApply(parsedHTML, “//table”, xmlValue)[2]

cl.tab <- trim(gsub(“[\r\n#]+”, ” “, raw.tab))

spl.tab <- trim(strsplit(cl.tab, “\t”, fixed = T)[[1]])

spl.tab <- spl.tab[spl.tab != “”]

spl.tab <- spl.tab[(grep(“Pkt.”, spl.tab, fixed = T)+1):(grep(“Jahrestabelle”, spl.tab, fixed = T)-1)]

pos.inds <- seq(1, length(spl.tab), 9)

team.inds <- seq(2, length(spl.tab), 9)

games.inds <- seq(3, length(spl.tab), 9)

won.inds <- seq(4, length(spl.tab), 9)

draw.inds <- seq(5, length(spl.tab), 9)

lost.inds <- seq(6, length(spl.tab), 9)

goals.inds <- seq(7, length(spl.tab), 9)

diff.inds <- seq(8, length(spl.tab), 9)

points.inds <- seq(9, length(spl.tab), 9)

stand.table <- data.frame(Pos = as.numeric(spl.tab[pos.inds]),

Team = spl.tab[team.inds],

Games = as.numeric(spl.tab[games.inds]),

Won = as.numeric(spl.tab[won.inds]),

Draw = as.numeric(spl.tab[draw.inds]),

Lost = as.numeric(spl.tab[lost.inds]),

Goals = spl.tab[goals.inds],

Diff = as.numeric(spl.tab[diff.inds]),

Points = as.numeric(spl.tab[points.inds]))

The number of goals for and against a team is only available within the column “Goals”, so we have to extract them into two new columns.

stand.table$Goals.for <- as.numeric(sapply(stand.table$Goals, function (x) strsplit(x, “:”)[[1]][1]))

stand.table$Goals.aga <- as.numeric(sapply(stand.table$Goals, function (x) strsplit(x, “:”)[[1]][2]))

Now we sort both tables (table which holds the values of the teams and stand.table which holds the standing of the current championship season) by team name. Then we cbind the two tables and check if the team names from both tables are equal. They should be. Last step: Sort the big table by position in the current championship and get rid of the redundant “Team2” column.

stand.table <- stand.table[order(stand.table$Team),]

table <- table[order(table$Team),]

names(table)[names(table) == “Team”] <- “Team2”

big.tab <- cbind(stand.table, table)

all(big.tab$Team == big.tab$Team2) # should be TRUE

big.tab <- big.tab[order(big.tab$Pos),-grep(“Team2”, names(big.tab), fixed = T)]

After all this “magic”, we get a table looking like this (only first 5 rows are shown here, I used the Premier League as an example):

Pos Team Games Won Draw Lost Goals Diff Points Goals.for Goals.aga Squad Age Value mValue

1 1 FC Chelsea 7 6 1 0 15:4 11 19 15 4 23 27.0 387000000 16826087

2 2 Manchester United 7 5 0 2 17:9 8 15 17 9 28 26.1 441000000 15750000

3 3 Manchester City 7 4 3 0 15:8 7 15 15 8 25 26.5 498250000 19930000

4 4 FC Everton 7 4 2 1 14:8 6 14 14 8 22 27.1 128250000 5829545

5 5 Tottenham Hotspur 7 4 2 1 13:8 5 14 13 8 24 27.5 254500000 10604167

In the next post, I will demonstrate how to get nice plots and analyses out of this kind of dataframe… and maybe we will answer the question if soccer is really only about the money.

To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Soccer is all about money (?) – Part 1: Getting the Data

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)