Craig Bellamy – quite dplyr

February 4, 2014

(This article was first published on PremierSoccerStats » R, and kindly contributed to R-bloggers)

This weekend brought a couple of firsts in Cardiff’s win against Norwich.

After a wretched time at Manchester United, Wilfried Zaha recorded his first Premiership assist, whilst, more interestingly, Craig Bellamy became the first player in history to score for seven different Premier League clubs.

To celebrate, I thought it was worth taking a quick data dip with the new dplyr package for R, a souped-up version of plyr for data.frames.

A main advantage of dplyr is that it is way faster than plyr, but it also offers the option to chain operations, utilizing %.%. This encourages the good discipline of planning logically ahead of coding, something I am not naturally inclined to, and should make the code more readable.
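
To make the chaining idea concrete, here is a minimal sketch on made-up data comparing nested calls with a chain. It assumes a recent dplyr, where the `%>%` pipe has replaced the `%.%` operator used in this post; the `games` data frame and its values are invented purely for illustration.

```r
library(dplyr)

# Made-up appearances: one row per player per game (hypothetical data)
games <- data.frame(
  playerID = c("A", "A", "B", "B", "B"),
  goals    = c(0, 2, 1, 0, 2),
  stringsAsFactors = FALSE
)

# Nested form: has to be read from the inside out
nested <- arrange(
  summarise(group_by(games, playerID), goals = sum(goals)),
  desc(goals)
)

# Chained form: reads top to bottom, one step per line
chained <- games %>%
  group_by(playerID) %>%
  summarise(goals = sum(goals)) %>%
  arrange(desc(goals))
```

Both forms give the same table; the chain simply lists the steps in the order they are carried out.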

I have loaded into R a largish (270,000-row) data.frame, playerGames, of players’ appearances in the English Premier League.

My target is a graph showing, for each of the players who have scored for the most different clubs, how many games it took them to score their first goal for each of these teams.

The process uses several of the dplyr functions. Firstly, I want to tidy up the data, reduce it to the variables of interest and then add some required columns. I then want to find out who these itinerant players are and ascertain when they got off the mark with each club. Finally, I will knock out a ggplot.

# load packages - make sure plyr is not loaded, as this may cause issues
library(dplyr)
library(ggplot2)
library(scales)
 
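# convert the data.frame to a tbl_df:
# this is a wrapper around a data frame that won't accidentally print a lot of data to the screen
playerGames_df <- tbl_df(playerGames)

# NB part of this line was lost when the post was published: only "0) %.%" survives,
# so the filter condition is left open here
allGames <- playerGames_df %.%
filter(... > 0) %.%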
 
# set to required columns
select(playerID,teamID,goals,gameDate) %.%
 
# sort on game date
arrange(gameDate) %.%
 
# group each player by team
group_by(playerID,teamID) %.%
 
# so that we can set a game order and cumulate goals for each player/team
mutate(
game = 1:NROW(goals),
cumGoals = cumsum(goals)
)
 
# example row
tail(allGames,1)
#        playerID teamID goals   gameDate game cumGoals
# 222249    OSCAR    CHL     0 2014-02-03   56       10
 
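# now we need to find these players
# NB everything between "topPlayers" and "0) %.%" was lost when the post was published,
# so the start of this chain is left open here
topPlayers <- (allGames %.%
filter(... > 0) %.%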
 
# and sum the number of clubs by player
group_by(playerID) %.%
summarise(teams=n()) %.%
 
# now just keep Bellamy (seven clubs) and the players who have scored for six
filter(teams==max(teams)|teams==max(teams)-1))$playerID
 
topPlayers
#[1] "BARMBYN" "COLEA1" "BENTD" "BELLAMC" "KEANER2" "CROUCHP" "ANELKAN" "FERDINL"
 
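# now for these players calculate the debut goal data
# NB everything between "firstGoal" and "0) %.%" was lost when the post was published,
# so the start of this chain is left open here
firstGoal <- allGames %.%
filter(... > 0) %.%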
 
# and then select first row for each player/club
group_by(playerID,teamID) %.%
summarise(first=min(game))
 
head(firstGoal,1)
#  playerID teamID first
#1    BENTD    ASV     1
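
Since parts of the listing above were mangled in publication, here is a self-contained sketch of the same three steps that should run on a recent dplyr. It is a reconstruction under assumptions, not the original code: it uses `%>%` rather than `%.%`, the toy `playerGames` rows are invented, and the two `goals > 0` filters are guesses at what the lost lines did.

```r
library(dplyr)

# Made-up stand-in for playerGames (hypothetical rows, same column names)
playerGames <- data.frame(
  playerID = c("X", "X", "X", "X", "X", "Y", "Y", "Y"),
  teamID   = c("AAA", "AAA", "BBB", "BBB", "BBB", "AAA", "AAA", "AAA"),
  goals    = c(0, 1, 0, 0, 2, 1, 0, 0),
  gameDate = as.Date("2013-08-17") + c(0, 7, 14, 21, 28, 0, 7, 14),
  stringsAsFactors = FALSE
)

# Step 1: tidy up, then number each player's games per club and cumulate goals
allGames <- playerGames %>%
  select(playerID, teamID, goals, gameDate) %>%
  arrange(gameDate) %>%
  group_by(playerID, teamID) %>%
  mutate(
    game     = row_number(),
    cumGoals = cumsum(goals)
  ) %>%
  ungroup()

# Step 2: count the clubs each player has scored for
# ASSUMPTION: a club counts once the player has at least one goal for it
topPlayers <- allGames %>%
  filter(goals > 0) %>%
  distinct(playerID, teamID) %>%
  count(playerID, name = "teams") %>%
  filter(teams >= max(teams) - 1) %>%
  pull(playerID)

# Step 3: game number of the first goal for each player/club
# ASSUMPTION: the first goal is the earliest game with goals > 0
firstGoal <- allGames %>%
  filter(playerID %in% topPlayers, goals > 0) %>%
  group_by(playerID, teamID) %>%
  summarise(first = min(game), .groups = "drop")
```

On the toy rows, player X scores in game 2 for AAA and game 3 for BBB, while player Y scores on debut for AAA, so `firstGoal` ends up with one row per player/club pair.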

 

At this point, my computer, WordPress and the coding wrapper decided to screw up. The rest of the code just replaces playerID with real names and uses ggplot to create a chart.
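
The plotting step might look something like the sketch below. The original plotting code did not survive, so the chart type, the labels and the toy `firstGoal` rows here are all assumptions for illustration only.

```r
library(ggplot2)

# Made-up table in the same shape as firstGoal (hypothetical values)
firstGoal <- data.frame(
  playerID = c("X", "X", "Y", "Y"),
  teamID   = c("AAA", "BBB", "AAA", "CCC"),
  first    = c(2, 3, 1, 5),
  stringsAsFactors = FALSE
)

# One panel per player, one bar per club, bar height = games to first goal
p <- ggplot(firstGoal, aes(x = teamID, y = first)) +
  geom_col() +
  facet_wrap(~ playerID, scales = "free_x") +
  labs(x = "Club", y = "Games to first goal")

# print(p) draws the chart in an interactive session
```

Faceting by player keeps each player's clubs together, which suits a comparison of how quickly they got off the mark at each stop.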

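[Chart: games taken to score a first Premier League goal for each club, for the players who have scored for the most different clubs]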

A few football points to note

  • Bellamy took 13 appearances to score his first Premiership goal for Cardiff, although he had scored plenty for them in the division below. This is the longest wait of the lot, due in part to many sub appearances, playing in a weak team and old age.
  • Darren Bent scored on his debut on four occasions. Anelka never managed it before game 4.
  • Out of roughly 4,000 players who have appeared in the Premiership, both of those with the surname Bent figure. One of the two A Coles and one of the two R Keanes also appear in the list of nine.
  • Liverpool and Tottenham figure the most with five stops. Crouch, Keane and Bellamy have each appeared for both clubs
  • All five Spurs players scored in their first four appearances. By contrast, none of the Liverpool five got off the mark before game 7 (Bellamy), with all the others in the 10-12 range.
