Opta agreed to let the UK Guardian Data Blog publish 2010 World Cup team and player statistics. The data is available in a Google Docs spreadsheet with two tabs: one holds the PLAYERS statistics, the other the TEAM statistics. I chose File -> Download As -> CSV to download each tab through a web browser, then moved the files to my working R directory. I named the first "World Cup 2010 data.csv" (player data) and the second "World Cup 2010 TEAM data.csv" (team data).
By the way, if anyone knows how individual Google Docs spreadsheets can be downloaded as CSVs via URLs, please let me know by commenting on this post. I could not figure out how to do this straight from R by reading a URL (which is my preference).
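One possible approach, offered as a sketch rather than a tested answer: as far as I know, current Google Sheets serve a single tab as CSV from an export URL, provided the sheet is readable by anyone with the link. The helper name sheet_csv_url and the placeholder key below are hypothetical, and the URL pattern is an assumption about the present-day service, not something taken from the original spreadsheet.

```r
# Hypothetical helper: build a CSV export URL for one tab of a Google Sheet.
# 'key' is the document ID from the sheet's address, 'gid' identifies the tab.
# Both values used below are placeholders, not the real World Cup sheet.
sheet_csv_url <- function(key, gid = 0) {
  paste0("https://docs.google.com/spreadsheets/d/", key,
         "/export?format=csv&gid=", gid)
}

url <- sheet_csv_url("YOUR_SHEET_KEY", gid = 0)

# Reading straight from the URL needs network access and a publicly
# readable sheet, so the call is left commented out here.
# DF <- read.csv(url)
```

If this works for your sheet, the manual download step can be skipped and each tab read with its own gid.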
The following are a few charts that can be created with the data. You might also want to do more sophisticated predictive analysis, but I will leave that to Paul.
The sheet with player data can be read in as a CSV:
DF=read.csv('World Cup 2010 data.csv')
The following attributes are available for each player.
The base graphics package can be used to produce the following chart of the USA team’s shots attempted by player.
# Create a smaller data frame that
# contains only USA player names
# and shots attempted.
# Make the player Names the rownames
# Flip the X axis labels and provide enough room in the margins to print the names
par(las=2,mar=c(8, 4, 1, 2) + 0.1)
# Pivot the table, print the barplot and add a title
title('2010 World Cup USA Total Shots Attempted')
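The steps described in the comments above can be sketched end to end as follows. This is a minimal illustration on a toy data frame, not the original code: the column names Player, Team and Total.Shots are assumptions (the real sheet may name them differently) and the values are made up.

```r
# Toy stand-in for the player sheet; column names and values are made up.
DF <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D"),
  Team = c("USA", "USA", "England", "USA"),
  Total.Shots = c(9, 4, 7, 2),
  stringsAsFactors = FALSE
)

# Create a smaller data frame with only USA player names and shots attempted
USA <- DF[DF$Team == "USA", c("Player", "Total.Shots")]

# Make the player names the row names, then drop the name column
rownames(USA) <- USA$Player
USA <- USA[, "Total.Shots", drop = FALSE]

# Flip the x axis labels and leave room in the margins for the names
par(las = 2, mar = c(8, 4, 1, 2) + 0.1)

# Pivot the table so players become columns, draw the bars, add a title
shots <- t(USA)
barplot(shots, ylab = "Shots")
title("2010 World Cup USA Total Shots Attempted")
```

Keeping drop = FALSE matters here: without it the single remaining column collapses to a plain vector and the row names are lost before the pivot.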
Now an example with the Team data. In this case, the column names are actually the names of the countries.
DF2=read.csv('World Cup 2010 TEAM data.csv')
The attributes about each team are available in the first column.
Ave Goals per game
Shots (excl blocked shots)
% Shots on Target
% Goals to Shots
Overall Pass Completion %
Cross Completion %
Ave goals conceded per game
Tackles Won %
I prefer these attributes as row names – so moved them there using the following:
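A minimal sketch of one way to do this, assuming the attribute labels sit in the first column as described. The toy data frame below stands in for DF2; its team names and values are placeholders, not World Cup figures.

```r
# Toy stand-in for the team sheet: attributes in column 1, teams after.
# Team names and values are placeholders.
DF2 <- data.frame(
  X = c("Ave Goals per game", "Tackles Won %"),
  TeamA = c(1.2, 70),
  TeamB = c(0.8, 75),
  stringsAsFactors = FALSE
)

# Copy the first column into the row names, then drop that column
rownames(DF2) <- DF2[, 1]
DF2 <- DF2[, -1]

# Attributes can now be looked up by name
DF2["Tackles Won %", "TeamA"]
```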
This time, we will use ggplot2 and create a horizontal bar chart with a color gradient that intensifies to highlight the countries with the most fouls. I think you will agree that ggplot2 produces much better results. The author of the package (Hadley Wickham) just released a new version. He has also written a book on it that goes into greater depth about its use and design (based upon Leland Wilkinson's Grammar of Graphics). The example that follows uses the simpler qplot call, with the team names on the x axis and the number of fouls on the y axis. The "geometry" specified indicates that we are using a bar chart, and we specify coord_flip to switch the x and y axes.
qplot(names(FOULS), as.numeric(FOULS), geom="bar", stat='identity', fill=as.numeric(FOULS)) + xlab('Country') + ylab('Fouls') + coord_flip() + scale_fill_continuous(low="black", high="red") + labs(fill='Fouls')
When t() was used to pivot the data.frame, it converted it to a matrix, and the numeric values became character strings. The as.numeric function was used to cast them back.
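The conversion is easy to reproduce on a small example. The sketch below uses a made-up two-row data frame with one text column; a matrix can hold only one type, so transposing turns every value into a string.

```r
# Toy data frame: one text column next to a numeric one. Values are made up.
df <- data.frame(
  Attribute = c("Fouls", "Goals"),
  TeamA = c(12, 3),
  stringsAsFactors = FALSE
)

# t() goes through as.matrix(), so the result is a matrix, not a data.frame
m <- t(df)

is.matrix(m)      # the data.frame structure is gone
is.character(m)   # the numbers are now stored as strings

# Cast the numeric row back before plotting or doing arithmetic
team_a <- as.numeric(m["TeamA", ])
```

Moving the text column into the row names before calling t() avoids the problem, because the remaining columns are then all numeric.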
The following is the same type of plot for Goals. This chart also includes a title. It appears at the top of this post.
qplot(names(GOALS), as.numeric(GOALS), geom="bar", stat='identity', fill=as.numeric(GOALS)) + xlab('Country') + ylab('Goals') + coord_flip() + scale_fill_continuous(low="yellow", high="blue") + labs(fill='Goals') + opts(title = "2010 World Cup Goals (as of 07/10/2010)")
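As far as I know, later ggplot2 releases removed opts() and no longer accept the stat argument in qplot, so the call above may not run on a current installation. The following is a rough equivalent using ggplot() directly, offered as a sketch. The data frame goals and its values are placeholders invented for illustration, not the World Cup figures.

```r
library(ggplot2)

# Toy values; team names and counts are placeholders, not World Cup data.
goals <- data.frame(
  Country = c("TeamA", "TeamB", "TeamC"),
  Goals = c(5, 2, 8)
)

# Bars sized and colored by the same value, flipped to run horizontally
p <- ggplot(goals, aes(x = Country, y = Goals, fill = Goals)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient(low = "yellow", high = "blue") +
  labs(x = "Country", y = "Goals", fill = "Goals") +
  ggtitle("Goals by team (toy data)")

print(p)
```

Here geom_col() plays the role of geom="bar" with stat='identity', and ggtitle() replaces opts(title = ...).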
Hope you enjoyed this little excursion.