Quick off the mark

[This article was first published on PremierSoccerStats » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

With none of the top teams overimpressing this season, Alan Pardew’s performance with Newcastle – especially in the transfer market -is likely to see him receive coniderable recognition in the Manager of the Year award

Recent acquisition, Papiss Cissé, has proved particularly fruitful with a brace against Swansea last time out taking him to nine goals in just eight outings. At half-time, a question posed in the Guardian was whether or not he was the fastest in EPL careers to reach the eight strikes mark. Sounded like a good excuse to haul Rs. If you are only interested in results check out the tables below

The data I need to utilize is on a MSSQL database so first thing to do is load the RODBC package and execute a query
The query extracts every match played by all players (the Where statement ensures that bench-only appearances are not included) and the goals scored.

?View Code RSPLUS
library(RODBC)
channel <- odbcConnect("eplR")
 
goalGames <- sqlQuery(channel,paste(
"
SELECT     soccer.tblTeam_Names.TEAMNAME as team, soccer.tblPlayers.PLAYERID, 
CASE when soccer.tblPlayers.FIRSTNAME is null then soccer.tblPlayers.LASTNAME else soccer.tblPlayers.FIRSTNAME + ' ' + soccer.tblPlayers.LASTNAME end as name, soccer.tblMatch.DATE
as gameDate, soccer.tblPlayer_Match.GOALS as goals
FROM         soccer.tblPlayer_Match INNER JOIN
                      soccer.tblPlayers INNER JOIN
                      soccer.tblPlayerClub ON soccer.tblPlayers.PLAYERID = soccer.tblPlayerClub.PLAYERID ON soccer.tblPlayer_Match.PLAYER_TEAM = soccer.tblPlayerClub.PLAYER_TEAM INNER JOIN
                      soccer.tblMatchTeam ON soccer.tblPlayer_Match.TEAMMATCHID = soccer.tblMatchTeam.TEAMMATCHID INNER JOIN
                      soccer.tblMatch ON soccer.tblMatchTeam.MATCHID = soccer.tblMatch.MATCHID INNER JOIN
                      soccer.tblTeam_Names ON soccer.tblMatchTeam.TEAMID = soccer.tblTeam_Names.TEAMID
where (soccer.tblPlayer_Match.START + soccer.tblPlayer_Match.[ON]) > 0
ORDER BY soccer.tblMatch.DATE
"
));
odbcClose(channel)

I want to add columns to show both the game order – there are no double-headers in the EPL so the date reflect the game – and also a cumulative sum of the goals scored. To assist in this I utilize the popular, plyr package

?View Code RSPLUS
library(plyr)
 
# PLAYERID is unique, name is user-friendly
goalGames <- ddply(goalGames,c("PLAYERID","name"), transform,
games = 1:NROW(piece),cumGoals = cumsum(goals))
 
# this what the dataframe now looks like
head(goalGames,1)
team         PLAYERID            name   gameDate goals games cumGoals
1 Ipswich T  ABIDALN Nabil Abidallah 2001-02-24     0     1        0

The original question posed was whether or not Cisse was the quickest to reach the eight goal mark. The following code shows that his impressive goal per game mark still did not hack it

?View Code RSPLUS
# Note the inequality as a player might score more than one goal in a game
goalCount <- 8
minGames <- min(subset(goalGames,cumGoals>=goalCount)$games) #  Ans 5
 
# Now find the rows that fit the 8 goals in 5 games criteria
# and reduce the answer to relevant columns only
fastest <-subset(goalGames,cumGoals>=goalCount&games==minGames)[,c(2,3,6,7)]
 
print(fastest) # Uh-Oh we have a tie
PLAYERID          name games cumGoals
1215    AGUEROS Sergio Aguero     5        8
149644   QUINNM    Mick Quinn     5        8

In order to determine exactly who takes the biscuit, we need to obtain data on the actual time of goals scored, apply a ddply
to this result and then filter th dataframe for the players, games and goals under consideration

?View Code RSPLUS
 
channel <- odbcConnect("eplR")
goalTimes <- sqlQuery(channel,paste(
"
SELECT     soccer.tblTeam_Names.TEAMNAME as team, soccer.tblPlayers.PLAYERID,
 CASE when soccer.tblPlayers.FIRSTNAME is null then soccer.tblPlayers.LASTNAME else soccer.tblPlayers.FIRSTNAME + ' ' + soccer.tblPlayers.LASTNAME end as name, soccer.tblMatch.DATE as gameDate, soccer.tblGoals.time
 
FROM         soccer.tblPlayer_Match INNER JOIN
                      soccer.tblPlayers INNER JOIN
                      soccer.tblPlayerClub ON soccer.tblPlayers.PLAYERID = soccer.tblPlayerClub.PLAYERID ON soccer.tblPlayer_Match.PLAYER_TEAM = soccer.tblPlayerClub.PLAYER_TEAM INNER JOIN
                      soccer.tblMatchTeam ON soccer.tblPlayer_Match.TEAMMATCHID = soccer.tblMatchTeam.TEAMMATCHID INNER JOIN
                      soccer.tblMatch ON soccer.tblMatchTeam.MATCHID = soccer.tblMatch.MATCHID INNER JOIN
                      soccer.tblTeam_Names ON soccer.tblMatchTeam.TEAMID = soccer.tblTeam_Names.TEAMID  INNER JOIN
                      soccer.tblGoals ON soccer.tblPlayer_Match.PLAYER_MATCH = soccer.tblGoals.PLAYER_MATCH
WHERE     (soccer.tblPlayer_Match.START + soccer.tblPlayer_Match.[ON] > 0)  
ORDER BY soccer.tblMatch.DATE,soccer.tblGoals.time 
 
"
));
 
odbcClose(channel)
 
# obtain ID's of tied game players
ties <- fastest$PLAYERID
 
# add cumulative goals to the goalTimes dataframe
goalTimes <- ddply(goalTimes,c("PLAYERID","name"), transform,sumGoals = 1:NROW(piece) )
final <- arrange(subset(goalTimes,PLAYERID %in% ties&sumGoals==goalCount),time)
 
print(final[1,3,5,6]) # and ladies and gentleman the winner is...
team          name time sumGoals
1  Man. City Sergio Aguero   46        8
2 Coventry C    Mick Quinn   73        8

That’s a lot of work for one result. It is a pretty simple matter to extend the result to show data for fastest player to one, two, three goals etc. Open code box below to see full details

?View Code RSPLUS
library(RODBC) 
 
channel <- odbcConnect("eplR")
 
goalGames <- sqlQuery(channel,paste(
"
SELECT     soccer.tblTeam_Names.TEAMNAME as team, soccer.tblPlayers.PLAYERID, 
CASE when soccer.tblPlayers.FIRSTNAME is null then soccer.tblPlayers.LASTNAME else soccer.tblPlayers.FIRSTNAME + ' ' + soccer.tblPlayers.LASTNAME end as name, soccer.tblMatch.DATE
as gameDate, soccer.tblPlayer_Match.GOALS as goals
FROM         soccer.tblPlayer_Match INNER JOIN
                      soccer.tblPlayers INNER JOIN
                      soccer.tblPlayerClub ON soccer.tblPlayers.PLAYERID = soccer.tblPlayerClub.PLAYERID ON soccer.tblPlayer_Match.PLAYER_TEAM = soccer.tblPlayerClub.PLAYER_TEAM INNER JOIN
                      soccer.tblMatchTeam ON soccer.tblPlayer_Match.TEAMMATCHID = soccer.tblMatchTeam.TEAMMATCHID INNER JOIN
                      soccer.tblMatch ON soccer.tblMatchTeam.MATCHID = soccer.tblMatch.MATCHID INNER JOIN
                      soccer.tblTeam_Names ON soccer.tblMatchTeam.TEAMID = soccer.tblTeam_Names.TEAMID
where (soccer.tblPlayer_Match.START + soccer.tblPlayer_Match.[ON]) > 0
ORDER BY soccer.tblMatch.DATE
"
));
goalTimes <- sqlQuery(channel,paste(
"
SELECT     soccer.tblTeam_Names.TEAMNAME as team, soccer.tblPlayers.PLAYERID,
 CASE when soccer.tblPlayers.FIRSTNAME is null then soccer.tblPlayers.LASTNAME else soccer.tblPlayers.FIRSTNAME + ' ' + soccer.tblPlayers.LASTNAME end as name, soccer.tblMatch.DATE as gameDate, soccer.tblGoals.time
 
FROM         soccer.tblPlayer_Match INNER JOIN
                      soccer.tblPlayers INNER JOIN
                      soccer.tblPlayerClub ON soccer.tblPlayers.PLAYERID = soccer.tblPlayerClub.PLAYERID ON soccer.tblPlayer_Match.PLAYER_TEAM = soccer.tblPlayerClub.PLAYER_TEAM INNER JOIN
                      soccer.tblMatchTeam ON soccer.tblPlayer_Match.TEAMMATCHID = soccer.tblMatchTeam.TEAMMATCHID INNER JOIN
                      soccer.tblMatch ON soccer.tblMatchTeam.MATCHID = soccer.tblMatch.MATCHID INNER JOIN
                      soccer.tblTeam_Names ON soccer.tblMatchTeam.TEAMID = soccer.tblTeam_Names.TEAMID  INNER JOIN
                      soccer.tblGoals ON soccer.tblPlayer_Match.PLAYER_MATCH = soccer.tblGoals.PLAYER_MATCH
WHERE     (soccer.tblPlayer_Match.START + soccer.tblPlayer_Match.[ON] > 0)  
ORDER BY soccer.tblMatch.DATE,soccer.tblGoals.time 
 
"
));
 
odbcClose(channel)
 
goalGames <- ddply(goalGames,c("PLAYERID","name"), transform,
games = 1:NROW(piece), cumGoals = cumsum(goals))
 
goalTimes <- ddply(goalTimes,c("PLAYERID","name"), transform,
sumGoals = 1:NROW(piece) )
 
# create df to hold results. count is number of players reaching each goal mark
myTable <- data.frame(player=character(),goals=integer(),
game=integer(),time=integer(),count=integer())
 
for(goalCount in 1:200) {
#goalCount <- 8
minGames <- min(subset(goalGames,cumGoals>=goalCount)$games)
playerCount <-  subset(goalGames,cumGoals>=goalCount)
count <-length(unique(playerCount$name))
 
fastest <-subset(goalGames,cumGoals>=goalCount&games==minGames)[,c(2,3,6,7)]
 
if (nrow(fastest) > 1)
{
 
ties <- fastest$PLAYERID
final <- head(arrange(subset(goalTimes,PLAYERID %in% ties&sumGoals==goalCount),time),1)
answer <- data.frame(player=final$name,goals=goalCount,game=minGames,minute=final$time,count=count)
} else {
answer <- data.frame(player=fastest$name,goals=goalCount,game=minGames,minute="",count=count)
}
 
# add number of players who have achieved it
myTable <- rbind(myTable,answer)
 
}
print(myTable)

All time Leaders

An abbreviated result is shown below

There are several points of interest

  • Nasri was only 21 when he converted a penalty in his debut. Brian Deane scored the very first goal in the EPL five minutes in
  • Ravanelli is the only player to score a hat-trick on his debut
  • Pogrebybak was fastest to five goals earlier this season
  • Cole and Shearer vied for the lead at around the 50 goal mark. The last occasion where a time tie break is required was
    when both of them scored their 43rd goal in 52 games. Shearer left it late but his 82nd minute strike topped Cole by four minutes
  • van Nistelrooy grabbed his 56th goal in his 74th appearance but failed to score in his subsequent four outings. Shearer nabbed his 55th in game 74, powered in eight more in the next five and holds the record for all subsequent goals reached

Individual Club

Further analysis can be done by club by simply extending the ddply to include team

Here are Man U’s figures

  • 10 different Man U players scored on their debut
  • van Nistelrooy played for the team in his prime. Players like Ronaldo and Rooney started much younger
  • Uniteds two other 100 goal scorers, Scholes and Giggs, converted at half the rate Rooney has

Individual Season

To look at this season’s data I need to take the bin the game date data I have into its appropriate EPL season. For this I use the cut function and then apply a new ddply function

?View Code RSPLUS
years <- 1992:2012
goalGames$season <- cut(goalGames$gameDate,  breaks=as.POSIXct(paste(years,"-08-01",sep="")),
  labels=paste(years[-length(years)],years[-length(years)]+1,sep="/"))
goalTimes$season <- cut(goalTimes$gameDate,  breaks=as.POSIXct(paste(years,"-08-01",sep="")),
  labels=paste(years[-length(years)],years[-length(years)]+1,sep="/"))

  • Suarez actually tied Djibrill Cisse on the, surprisingly late, 12 minute mark of the season
  • Dzeko’s early season success was not maintained. typifying Man City’s campaign
  • Rooney has played in less games so his scoring has lagged van Persie. However, they both reached the 20 goal mark in their 24th game

To leave a comment for the author, please follow the link and comment on their blog: PremierSoccerStats » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)