Charting the World Cup

July 12, 2010

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Now that Spain has won the World Cup, it’s interesting to go back and
look at some metrics from the matches and see if we can tease out what
characteristics made for a winning Cup team this time around.
Fortunately, the Guardian’s Data Blog has made a wealth of World Cup statistics available, with data on every player of every team (position, shots at goal, passes, tackles made, and saves), plus aggregate statistics for each team (goals, % shots on target, fouls, and much more). The data are ripe for analysis in R, especially given that you can download the data directly from the cloud as an R object with the following commands:

players <- read.csv("http://spreadsheets.google.com/pub?key=tOM2qREmPUbv76waumrEEYg&single=true&gid=1&range=A1%3AH596&output=csv")

teams <- read.csv("http://spreadsheets.google.com/pub?key=tOM2qREmPUbv76waumrEEYg&single=true&gid=0&range=A1%3AAG15&output=csv")

The method I’ve described before for accessing a Google Spreadsheet from R didn’t quite apply here, as those instructions assume you own the document (and have access to the Publish menu). But some experimentation and tweaking of the spreadsheet URL made it work: the key parameters seem to be "&gid=" (the sheet number), "&range=" (the cell range, with %3A encoding the colon), and "&output=csv" (to download in CSV format). It would be nice if Google published the specs for forming URLs like these, but as far as I know they don’t.
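As a rough sketch of how those URL parameters fit together, the URLs above can be assembled from their parts in R. The helper name gsheet_csv_url is hypothetical (it is not part of any package), and the parameter meanings are only what the experimentation above suggests, not a documented Google API:

```r
# Hypothetical helper: assemble a CSV-export URL from the parameters
# discussed above. This mirrors the observed URL pattern only; it is
# not based on any published Google specification.
gsheet_csv_url <- function(key, gid, range) {
  paste0(
    "http://spreadsheets.google.com/pub",
    "?key=", key,
    "&single=true",
    "&gid=", gid,                                  # sheet number
    "&range=", URLencode(range, reserved = TRUE),  # encodes ":" as %3A
    "&output=csv"                                  # ask for CSV output
  )
}

key <- "tOM2qREmPUbv76waumrEEYg"
players_url <- gsheet_csv_url(key, gid = 1, range = "A1:H596")
teams_url   <- gsheet_csv_url(key, gid = 0, range = "A1:AG15")

# The resulting strings can then be passed to read.csv(), e.g.:
# players <- read.csv(players_url)
```

Building the URL this way keeps the spreadsheet key in one place and lets URLencode() handle the colon in the cell range, so the range can be written in the usual "A1:H596" form.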

Anyway, a couple of bloggers have used these data to great effect to express the results of the World Cup visually using R graphics. For example, the R Charts blog used ggplot2 to look at the number of fouls committed by each team during the tournament:

[Chart: World Cup 2010 fouls by team, from the R Charts blog]

(Personally, I would have sorted the rows by descending number of fouls, rather than alphabetically.) Interesting to see that Cup champions Spain are in the middle of the pack on fouls, whereas runners-up Netherlands lead this table (boosted heavily by their performance in the final).
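The sorting suggested above might look something like the following in R. This is only a sketch: the data frame, its column names (team, fouls) and the values are placeholders made up for illustration, not figures from the Guardian spreadsheet, and it is not the code the R Charts blog used:

```r
# Placeholder data: team labels and foul counts are invented for
# illustration and do not come from the Guardian spreadsheet.
fouls <- data.frame(
  team  = c("Team A", "Team B", "Team C", "Team D"),
  fouls = c(12, 30, 7, 19),
  stringsAsFactors = FALSE
)

# Order the rows by descending number of fouls.
fouls_sorted <- fouls[order(-fouls$fouls), ]

# Fix the factor levels in that order so a plot keeps it. The levels are
# reversed because a flipped bar chart draws the first level at the bottom.
fouls_sorted$team <- factor(fouls_sorted$team, levels = rev(fouls_sorted$team))

# Optional: draw the chart if ggplot2 happens to be installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(fouls_sorted, ggplot2::aes(x = team, y = fouls)) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::coord_flip()
}
```

The point of converting team to a factor is that ggplot2 orders a discrete axis by factor levels rather than by row order, which is why character data ends up sorted alphabetically by default.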

Blogger Jason Priem also took a look at the data, this time with a scatterplot of goals per game by fouls per game, related to how far each team advanced in the competition:

[Chart: World Cup 2010 goals per game versus fouls per game, by Jason Priem]

(Download Jason’s code for this chart here.) Again it’s interesting to see the positions of the two finalists here, with Netherlands on the extreme frontier for both fouls and goals, while Spain is moderate on goals per game and near the lowest on fouls per game.

It’s a rich dataset and I’m sure other Revolutions readers could come up with some equally interesting visualizations. If you do, tell us about it in the comments.

Guardian Data Blog: World Cup 2010 statistics: every match and every player in data
