
Craig Bellamy – quite dplyr

[This article was first published on PremierSoccerStats » R, and kindly contributed to R-bloggers.]

This weekend brought a couple of firsts in Cardiff’s win against Norwich.

After a wretched time at Manchester United, Wilfried Zaha recorded his first Premiership assist, whilst, more interestingly, Craig Bellamy became the first player in history to score for seven different Premier League clubs.

To celebrate, I thought it was worth taking a quick data dip with the new dplyr package for R, a souped-up version of plyr for data.frames.

A main advantage of dplyr is that it is much faster than plyr, but it also offers the option to chain operations using %.%. This encourages the good discipline of planning logically ahead of coding, something I am not naturally inclined to do, and should make the code more readable.
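As a minimal sketch of what chaining buys you, the toy example below computes the same result in nested and in chained form. The data frame and its values are made up for illustration and are not the playerGames data. It uses `%>%`, the operator that replaced `%.%` in later dplyr releases; with the dplyr version used in this post, `%.%` plays the same role.

```r
library(dplyr)

# toy data: hypothetical values, for illustration only
toy <- data.frame(
  playerID = c("A", "A", "B", "B", "B"),
  goals    = c(1, 0, 2, 0, 1),
  stringsAsFactors = FALSE
)

# nested form: has to be read from the inside out
nested <- arrange(
  summarise(group_by(toy, playerID), total = sum(goals)),
  desc(total)
)

# chained form: reads top to bottom, in the order the steps happen
chained <- toy %>%
  group_by(playerID) %>%
  summarise(total = sum(goals)) %>%
  arrange(desc(total))
```

Both objects hold one row per player with a goal total, sorted from highest to lowest; only the readability differs.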

I have loaded into R a largish (270,000-row) data.frame, playerGames, of players’ appearances in the English Premier League.

My target is a graph showing, for each of the players who have scored for the most different clubs, how many games it took them to score their first goal for each of those teams.

The process uses several of the dplyr functions. Firstly, I want to tidy up the data, reduce it to the variables of interest and then add some required columns. I then want to find out who these itinerant players are and ascertain when they got off the mark with each club. Finally, I will knock out a ggplot.

# load packages - make sure plyr is not running as this may cause issues
library(dplyr)
library(ggplot2)
library(scales)
 
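# convert the data.frame to a tbl_df:
# this is a wrapper around a data frame that won't accidentally print a lot of data to the screen
playerGames_df <- tbl_df(playerGames)

# the opening of this chain was garbled in publishing: it assigned the result
# to allGames and began with a filter() whose condition ended in "> 0"
# (the column is not recoverable, so that filter is omitted here)
allGames <- playerGames_df %.%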
 
# set to required columns
select(playerID,teamID,goals,gameDate) %.%
 
# sort on game date
arrange(gameDate) %.%
 
# group each player by team
group_by(playerID,teamID) %.%
 
# so that we can set a game order and cumulate goals for each player/team
# (the column is goals, lower case, as in the select() above)
mutate(
game = 1:NROW(goals),
cumGoals = cumsum(goals)
)
 
# example row
tail(allGames,1)
#        playerID teamID goals   gameDate game cumGoals
# 222249    OSCAR    CHL     0 2014-02-03   56       10
 
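# now we need to find these players
# (the opening of this chain was garbled in publishing; the goals-per-club
# total and the filter on it are a reconstruction, not the original code)
topPlayers <- (allGames %.%
summarise(clubGoals = sum(goals)) %.%
filter(clubGoals > 0) %.%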
 
# and sum the number of clubs by player
group_by(playerID) %.%
summarise(teams=n()) %.%
 
# now just show Bellamy and the others who were also on six teams
filter(teams==max(teams)|teams==max(teams)-1))$playerID
 
topPlayers
#[1] "BARMBYN" "COLEA1" "BENTD" "BELLAMC" "KEANER2" "CROUCHP" "ANELKAN" "FERDINL"
 
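# now for these players calculate the debut goal data
# (the opening of this chain was garbled in publishing; the filter below is
# a reconstruction, not the original code)
firstGoal <- allGames %.%
filter(playerID %in% topPlayers & cumGoals > 0) %.%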
 
# and then select first row for each player/club
group_by(playerID,teamID) %.%
summarise(first=min(game))
 
head(firstGoal,1)
#  playerID teamID first
#1    BENTD    ASV     1
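
For readers on a current dplyr release, here is a hedged, self-contained sketch of the same steps on a tiny made-up data set. The players, clubs and dates are hypothetical, `%>%` stands in for `%.%`, and the object names (`toyGames`, `toyAll`, `toyFirst`, `toyClubs`) are mine rather than the post's.

```r
library(dplyr)

# hypothetical appearances: one row per player per game
toyGames <- data.frame(
  playerID = c("X", "X", "X", "X", "Y", "Y", "Y"),
  teamID   = c("AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "CCC"),
  goals    = c(0, 1, 0, 0, 0, 0, 2),
  gameDate = as.Date(c("2013-08-17", "2013-08-24", "2014-01-11", "2014-01-18",
                       "2013-08-17", "2013-08-24", "2014-02-01")),
  stringsAsFactors = FALSE
)

# game counter and running goal tally within each player/team
toyAll <- toyGames %>%
  arrange(gameDate) %>%
  group_by(playerID, teamID) %>%
  mutate(game = row_number(), cumGoals = cumsum(goals)) %>%
  ungroup()

# first game in which each player had a goal to his name for each club
toyFirst <- toyAll %>%
  filter(cumGoals > 0) %>%
  group_by(playerID, teamID) %>%
  summarise(first = min(game), .groups = "drop")

# number of clubs each player has scored for
toyClubs <- toyFirst %>%
  count(playerID, name = "teams")
```

Filtering on `cumGoals > 0` before taking `min(game)` is what turns the running tally into a "games until first goal" figure; player/club pairs with no goals simply drop out.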

 

At this point, my computer, WordPress and the coding wrapper decided to screw up. The rest of the code just replaces playerID with real names and uses ggplot to create a chart.
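
The lost code is not recoverable, but a rough sketch of that kind of final step might look like the following. Everything here is an assumption: the stand-in data, the `playerNames` lookup table and the choice of a faceted bar chart are illustrative only and are not the original chart.

```r
library(dplyr)
library(ggplot2)

# hypothetical stand-in for the firstGoal result
firstGoalToy <- data.frame(
  playerID = c("P1", "P1", "P2", "P2"),
  teamID   = c("AAA", "BBB", "AAA", "CCC"),
  first    = c(1, 4, 2, 7),
  stringsAsFactors = FALSE
)

# hypothetical lookup table from playerID to a display name
playerNames <- data.frame(
  playerID = c("P1", "P2"),
  name     = c("Player One", "Player Two"),
  stringsAsFactors = FALSE
)

# replace IDs with names by joining on playerID
plotData <- left_join(firstGoalToy, playerNames, by = "playerID")

# one panel per player, one bar per club: games taken to score the first goal
p <- ggplot(plotData, aes(x = teamID, y = first)) +
  geom_col() +
  facet_wrap(~ name, scales = "free_x") +
  labs(x = "Club", y = "Games to first goal")
```

Printing `p` draws the chart; a join keeps the name lookup separate from the summary data, so the same `firstGoal` table can be relabelled without recomputing it.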

A few football points to note
