World Cup 2010 Statistics Plotted with R

July 11, 2010

(This article was first published on R-Chart, and kindly contributed to R-bloggers)


Opta agreed to let the UK Guardian Data Blog publish 2010 World Cup team and player statistics. The data is available in a Google Docs spreadsheet with two tabs: one holds the PLAYERS statistics, the other the TEAM statistics. For each tab I chose File -> Download As -> CSV in a web browser, then moved the downloaded file to my R working directory. I named the first file 'World Cup 2010 data.csv' (player data) and the second 'World Cup 2010 TEAM data.csv' (team data).

By the way, if anyone knows how individual Google Docs spreadsheets can be downloaded as CSVs via URLs, please let me know by commenting on this post.  I could not figure out how to do this straight from R by reading a URL (which is my preference).
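For what it is worth, one pattern that is often reported to work for publicly shared sheets is a CSV export URL built from the spreadsheet key and the tab's gid. The sketch below only builds such a URL. The helper name sheet_csv_url, the URL pattern, and the placeholder key are all assumptions, and none of this has been checked against the Guardian spreadsheet.

```r
# Hypothetical helper: builds a CSV export URL for one tab of a
# publicly shared Google spreadsheet. The URL pattern is an assumption
# and may change; sheet_key and gid are placeholders, not real values.
sheet_csv_url <- function(sheet_key, gid = 0) {
  sprintf("https://docs.google.com/spreadsheets/d/%s/export?format=csv&gid=%s",
          sheet_key, gid)
}

url <- sheet_csv_url("YOUR_SHEET_KEY", gid = 0)
print(url)

# With a real key for a publicly shared sheet, the tab could then be
# read directly (commented out because the key above is a placeholder):
# DF <- read.csv(url)
```

If the sheet is not shared publicly, the request will most likely return a login page rather than CSV, so the manual download remains the safe fallback.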

The following are a few charts that can be created with the data. You might also want to do more sophisticated predictive analysis, but I will leave that to Paul.

The sheet with player data can be read in as a CSV.

DF=read.csv('World Cup 2010 data.csv')

The following attributes are available for each player.

names(DF)


Player.Surname
Team
Position
Time.Played
Total.Shots.Attempted
Total.Passes      
Tackles.Attempted
Saves.Made

The base graphics package can be used to produce the following chart of the USA team's shots attempted by player.

# Create a smaller data frame that 
# contains only USA player names 
# and shots attempted.
PS=DF[DF$Team=='USA',c('Player.Surname','Total.Shots.Attempted')]


# Make the player Names the rownames
rownames(PS)=PS[,1]
PS=PS[-1]


# Flip the X axis labels and provide enough room in the margins to print the names
par(las=2,mar=c(8, 4, 1, 2) + 0.1)


# Pivot the table, print the barplot and add a title
barplot(t(PS))
title('2010 World Cup USA Total Shots Attempted')
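barplot also accepts a named numeric vector, so a variation on the steps above is to skip the row names and the transpose altogether. The sketch below is only an illustration on a toy data frame: the surnames and shot counts are invented, and only the column names match the real file.

```r
# Toy stand-in for the player data; surnames and counts are invented.
toy_players <- data.frame(
  Player.Surname = c("Alpha", "Bravo", "Charlie", "Delta"),
  Team = c("USA", "USA", "England", "USA"),
  Total.Shots.Attempted = c(7, 3, 5, 0),
  stringsAsFactors = FALSE
)

# Keep only the USA rows, then build a named vector: values are the
# shot counts, names are the player surnames.
usa <- toy_players[toy_players$Team == "USA", ]
shots <- setNames(usa$Total.Shots.Attempted, usa$Player.Surname)

# Sort so the busiest shooters come first.
shots <- sort(shots, decreasing = TRUE)

# Rotate the axis labels and widen the bottom margin, remembering the
# previous settings so they can be restored afterwards.
old_par <- par(las = 2, mar = c(8, 4, 1, 2) + 0.1)
barplot(shots, main = "Toy example: shots attempted by player")
par(old_par)
```

Sorting is optional, but it makes the chart easier to scan than the order in which players happen to appear in the file.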

Now an example with the Team data.  In this case, the column names are actually the names of the countries.

DF2=read.csv('World Cup 2010 TEAM data.csv')
names(DF2)

The attributes about each team are available in the first column.

DF2[1]

Games Played
Goals
Ave Goals per game
Shots (excl blocked shots)
% Shots on Target
% Goals to Shots
Overall Pass Completion %
Cross Completion %
Goals Conceded
Ave goals conceded per game
Tackles Won %
Fouls
Yellow Cards
Red Cards

I prefer these attributes as row names, so I moved them there using the following:

rownames(DF2)=DF2[,1]
DF2=DF2[,-1]



This time, we will use ggplot2 to create a horizontal bar chart with a color gradient that intensifies to highlight the countries with the most fouls. I think you will agree that ggplot2 produces much better results. The author of the package (Hadley Wickham) just released a new version. He has also written a book on it, which goes into greater depth about its use and design (based upon Leland Wilkinson's Grammar of Graphics). The example that follows uses the simpler qplot call, with the team names on the x axis and the number of fouls on the y axis. The geom argument specifies a bar chart, and coord_flip switches the x and y axes.

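library(ggplot2)
# Transpose so countries become rows, then pull out the Fouls column
FOULS=t(DF2)[,c('Fouls')]
# Bars are filled according to the number of fouls
qplot(names(FOULS), as.numeric(FOULS), geom="bar", stat='identity', fill=as.numeric(FOULS)) + xlab('Country') + ylab('Fouls') + coord_flip() + scale_fill_continuous(low="black", high="red") + labs(fill='Fouls')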



When t() was used to pivot the data frame, it converted it to a matrix, and the numeric values became character strings. The as.numeric function was used to cast them back to numbers.
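The conversion is easy to reproduce on a small example. The toy data frame below is invented (countries AAA and BBB, made-up values) and assumes that at least one value in each column is not a plain number, such as a percentage written with a % sign, which forces the whole column to be text.

```r
# Toy stand-in for the team data; all names and values are invented.
toy <- data.frame(
  Stat = c("Goals", "Pass Completion %", "Fouls"),
  AAA = c("5", "71%", "40"),
  BBB = c("2", "65%", "52"),
  stringsAsFactors = FALSE
)

# Move the attribute names into the row names, as done for DF2.
rownames(toy) <- toy[, 1]
toy <- toy[, -1]

# t() returns a matrix, and a matrix holds a single type, so every
# cell ends up as a character string.
m <- t(toy)
print(is.matrix(m))
print(typeof(m))

# Pull out one attribute: a character vector named by country.
fouls <- m[, "Fouls"]
print(fouls)

# Cast back to numbers before plotting or doing arithmetic.
fouls_num <- as.numeric(fouls)
print(fouls_num)
```

Note that as.numeric drops the names, which is why the plotting calls take the labels from names() and the values from as.numeric() separately.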

The following is the same type of plot for Goals.  This chart also includes a title.  It appears at the top of this post.

GOALS=t(DF2)[,c('Goals')]
qplot(names(GOALS), as.numeric(GOALS), geom="bar", stat='identity', fill=as.numeric(GOALS)) + xlab('Country') + ylab('Goals') + coord_flip() + scale_fill_continuous(low="yellow", high="blue") + labs(fill='Goals') + opts(title = "2010 World Cup Goals (as of 07/10/2010)")
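The qplot calls above reflect the ggplot2 version available when this was written. As far as I am aware, later releases dropped opts() and the stat argument of qplot, so on a recent installation the same kind of chart would be built with ggplot() instead. The following is a minimal sketch under that assumption, using geom_col, scale_fill_gradient and ggtitle on a toy data frame whose countries and counts are invented.

```r
library(ggplot2)

# Toy stand-in for the transposed team data; countries and counts are invented.
fouls <- data.frame(
  Country = c("AAA", "BBB", "CCC"),
  Fouls = c(40, 52, 31)
)

p <- ggplot(fouls, aes(x = Country, y = Fouls, fill = Fouls)) +
  geom_col() +                                     # bar heights come straight from Fouls
  coord_flip() +                                   # horizontal bars
  scale_fill_gradient(low = "black", high = "red") +
  labs(x = "Country", y = "Fouls", fill = "Fouls") +
  ggtitle("Toy example: fouls by country")

print(p)
```

With the real data, the data frame could be assembled from names(FOULS) and as.numeric(FOULS) in place of the invented values.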

Hope you enjoyed this little excursion.
