Cluster Analysis of the NFL’s Top Wide Receivers

[This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers].

“The time has come to get deeply into football. It is the only thing we have left that ain’t fixed.”
Hunter S. Thompson, Hey Rube Column, November 9, 2004

I have to confess that I haven’t been following the NFL this year as much as planned or hoped.  On only 3 or 4 occasions this year have I been able to achieve a catatonic state while watching NFL RedZone.  Nonetheless, it is easy to envision how it is all going to end.  Manning will throw four picks on a cold, snowy day in Foxboro in the AFC Championship game, and the Seahawk defense will curb-stomp Aaron Rodgers and capture a second consecutive NFC crown.  As for the Super Bowl, well, who cares, other than the fact that we must cheer against the evil Patriot empire, rooting for their humiliating demise.  One can simultaneously hate and admire that team.  I prefer to do the former publicly and the latter in private.

We all seem to have a handle on the good and bad quarterbacks out there, but what about their wide receivers?  With the playoffs at the doorstep and my own ignorance of the receiver situation, I had to bring myself up to speed.  This is a great opportunity to do a cluster analysis of the top wide receivers and see who is worth keeping an eye on in the upcoming spectacle.

A good source for interesting statistics and articles on NFL players and teams is http://www.advancedfootballanalytics.com/ .  Here we can download the rankings for the top 40 wide receivers based on Win Probability Added or “WPA”.  To understand the calculation you should head over to the site and read the details.  The site provides a rationale on WPA by saying “WPA has a number of applications. For starters, we can tell which plays were truly critical in each game. From a fan’s perspective, we can call a play the ‘play of the week’ or the ‘play of the year.’ And although we still can’t separate an individual player’s performance from that of his teammates’, we add up the total WPA for plays in which individual players took part. This can help us see who really made the difference when it matters most. It can help tell us who is, or at least appears to be, “clutch.” It can also help inform us who really deserves the player of the week award, the selection to the Pro Bowl, or even induction into the Hall of Fame.”  

I put the website’s wide receiver data table in a .csv and we can start the analysis, reading in the file and examining its structure.

> receivers <- read.csv(file.choose())
> str(receivers)
'data.frame': 40 obs. of  19 variables:
 $ Rank     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Player   : Factor w/ 40 levels "10-E.Sanders",..: 16 33 1 23 37 24 36 2 4 13 ...
 $ Team     : Factor w/ 28 levels "ARZ","ATL","BLT",..: 11 23 10 22 9 12 12 28 17 19 ...
 $ G        : int  16 16 16 16 16 16 16 14 14 12 ...
 $ WPA      : num  2.43 2.4 2.33 2.33 2.3 2.27 2.19 1.91 1.89 1.76 ...
 $ EPA      : num  59 95.7 81.3 56.8 78.8 86.2 97.3 54.3 63.6 64.6 ...
 $ WPA_G    : num  0.15 0.15 0.15 0.15 0.14 0.14 0.14 0.14 0.14 0.15 ...
 $ EPA_P    : num  0.38 0.48 0.5 0.38 0.54 0.6 0.58 0.54 0.41 0.43 ...
 $ SR_PERC  : num  55.4 62.8 61.3 52 58.5 59.4 60.5 54.5 58.1 55 ...
 $ YPR      : num  13.4 13.2 13.9 15.5 15 14.1 15.5 20.2 10.6 14.3 ...
 $ Rec      : int  99 129 101 86 88 91 98 52 92 91 ...
 $ Yds      : int  1331 1698 1404 1329 1320 1287 1519 1049 972 1305 ...
 $ RecTD    : int  4 13 9 10 16 12 13 5 4 12 ...
 $ Tgts     : int  143 181 141 144 136 127 151 88 134 130 ...
 $ PER_Tgt  : num  24.2 30.2 23.4 23.5 28.9 23.9 28.4 16.2 22.3 21.7 ...
 $ YPT      : num  9.3 9.4 10 9.2 9.7 10.1 10.1 11.9 7.3 10 ...
 $ C_PERC   : num  69.2 71.3 71.6 59.7 64.7 71.7 64.9 59.1 68.7 70 ...
 $ PERC_DEEP: num  19.6 26.5 36.2 30.6 30.9 23.6 31.8 33 16.4 28.5 ...
 $ playoffs : Factor w/ 2 levels "n","y": 2 2 2 1 2 2 2 1 2 1 ...
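
A side note for anyone reproducing the str() output above on a recent R: since R 4.0.0, read.csv() reads text columns as character rather than factor unless told otherwise, so Player, Team and playoffs may show up as "chr". The sketch below is only an illustration; it uses a four-row inline CSV (the player labels, WPA values and playoff flags are copied from the output above, the reduced column set is my own simplification) instead of the real file.

```r
# Four-row stand-in for the real .csv (reduced column set, for illustration only)
csv_text <- "Rank,Player,WPA,playoffs
1,15-G.Tate,2.43,y
2,84-A.Brown,2.40,y
3,10-E.Sanders,2.33,y
4,18-J.Maclin,2.33,n"

# Ask for factors explicitly so the structure matches the output shown in the post
demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(demo)
```

The rest of the analysis should work with either column type, since the clustering only uses the numeric columns and the player names serve as row labels.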

> head(receivers$Player)
[1] 15-G.Tate    84-A.Brown   10-E.Sanders 18-J.Maclin  88-D.Bryant
[6] 18-R.Cobb

Based on WPA, Golden Tate of Detroit is the highest-ranked wide receiver; in contrast, his highly regarded teammate is ranked 13th.

We talked about WPA so here is a quick synopsis on the other variables; again, please go to the website for detailed explanations:

  • EPA – Expected Points Added
  • WPA_G – WPA per game
  • EPA_P – Expected Points Added per Play
  • SR_PERC – Success Rate, the percentage of plays involving the receiver that are considered successful
  • YPR – Yards Per Reception
  • Rec – Total Receptions
  • Yds – Total Reception Yards
  • RecTD – Receiving Touchdowns
  • Tgts – The times a receiver was targeted in the passing game
  • PER_Tgt – Percentage of a team’s passes that were targeted to the receiver
  • YPT – Yards per target
  • C_PERC – Completion percentage
  • PERC_DEEP – Percent of passes targeted deep
  • playoffs – A factor I coded on whether the receiver’s team is in the playoffs or not

To do hierarchical clustering with this data, we can use the hclust() function available in base R.  In preparation for that, we should scale the data and we must create a distance matrix.

> r.df <- receivers[,c(4:18)]
> rownames(r.df) <- receivers[,2]
> scaled <- scale(r.df)
> d <- dist(scaled)
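
To make the two preparation steps above a bit more concrete, here is a minimal sketch on a made-up three-row data frame (the object `toy` and its values are invented for illustration and have nothing to do with the receiver table). It shows that scale() centers each column at 0 with a standard deviation of 1, and that dist() returns Euclidean distances between rows by default.

```r
# Hypothetical toy data: three made-up players, two stats on very different scales
toy <- data.frame(Yds = c(1000, 1200, 1400),
                  YPR = c(12, 15, 18),
                  row.names = c("A", "B", "C"))

toy_scaled <- scale(toy)      # subtract each column's mean, divide by its sd
colMeans(toy_scaled)          # every column is now centered at 0
apply(toy_scaled, 2, sd)      # and has a standard deviation of 1

toy_d <- dist(toy_scaled)     # pairwise Euclidean distances between rows (the default)
toy_d
```

Without the scaling step, a variable measured in the thousands (such as Yds) would dominate the distances and variables like YPR would barely matter.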

With the data prepared, produce the cluster object and plot it.

> hc <- hclust(d, method="ward.D")
> plot(hc, hang=-1, xlab="", sub="")
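
One caveat on the linkage method, offered as a side note rather than a correction: as I read the hclust() help page, "ward.D" only corresponds to Ward's criterion when it is fed squared distances, while "ward.D2" applies that criterion to ordinary distances. The sketch below uses simulated data (not the receiver table; `x`, `d_toy` and the `hc_*` objects are made up here) to show the relationship and to compare the partitions the two options give on the same distance matrix.

```r
set.seed(42)
# Simulated stand-in data: 20 rows, 4 numeric columns (not the receiver table)
x <- scale(matrix(rnorm(80), nrow = 20))
d_toy <- dist(x)

hc_d  <- hclust(d_toy,   method = "ward.D")   # the option used in this post
hc_d2 <- hclust(d_toy,   method = "ward.D2")  # Ward's criterion on plain distances
hc_sq <- hclust(d_toy^2, method = "ward.D")   # "ward.D" on squared distances

# "ward.D2" and "ward.D" on squared distances should yield the same partition
same_partition <- identical(cutree(hc_d2, k = 5), cutree(hc_sq, k = 5))
same_partition

# Cross-tabulate the 5-cluster solutions of the two options on the same distances
table(ward.D = cutree(hc_d, k = 5), ward.D2 = cutree(hc_d2, k = 5))
```

Whether the two options disagree on the receiver data is something you would have to check on the real table; the results in this post are all based on "ward.D".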


This is the standard dendrogram produced with hclust.  We now need to select the proper number of clusters and produce a dendrogram that is easier to examine.  For this, I found some interesting code to adapt on Gaston Sanchez’s blog: http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html .  Since I am leaning towards 5 clusters, let’s first create a vector of colors.  (Note: you can find/search for color codes on colorhexa.com)

> labelColors = c("#FF0000", "#800080", "#0000ff", "#ff8c00", "#013220")

Then use the cutree() function to cut the tree into 5 clusters.

> clusMember = cutree(hc, 5)
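
If you want to see what cutree() hands back, and one informal way to sanity-check a choice of k, here is a small sketch on simulated points (the `pts` matrix is invented: three well-separated groups of five points each). A large jump in the last few merge heights is one common hint about where to cut; it is a heuristic, not the procedure used for the receivers.

```r
set.seed(1)
# Simulated data with three well-separated groups of 5 points each (illustrative only)
pts <- rbind(matrix(rnorm(10, mean = 0),  ncol = 2),
             matrix(rnorm(10, mean = 10), ncol = 2),
             matrix(rnorm(10, mean = 20), ncol = 2))
rownames(pts) <- paste0("p", 1:15)

hc_toy <- hclust(dist(pts), method = "ward.D")

# The last few merge heights: a big jump suggests a natural place to cut
tail(hc_toy$height, 4)

# cutree() returns a named integer vector: one cluster id per row label
members <- cutree(hc_toy, k = 3)
head(members)
table(members)
```

The names on that vector are what make the color lookup in the next step possible.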

Now, we create a function (courtesy of Gaston) to apply colors to the clusters in the dendrogram.

> colLab <- function(n) {
+   if (is.leaf(n)) {
+     a <- attributes(n)
+     labCol <- labelColors[clusMember[which(names(clusMember) == a$label)]]
+     attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
+   }
+   n
+ }
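
The heart of colLab() is a two-step lookup: leaf label to cluster id, then cluster id to color. Here is that lookup in isolation, using a made-up membership vector (the labels in `grp` are hypothetical and `pal` is a shortened palette), in case the nested indexing is hard to read.

```r
pal <- c("#FF0000", "#800080", "#0000ff")   # one color per cluster id
# Hypothetical membership vector, shaped like the output of cutree()
grp <- c("81-X.Alpha" = 1L, "12-Y.Beta" = 3L, "19-Z.Gamma" = 1L)

label <- "12-Y.Beta"
cluster_id <- grp[which(names(grp) == label)]   # same lookup as inside colLab()
leaf_color <- pal[cluster_id]                   # the cluster id indexes the palette
leaf_color
```

Because the palette is indexed by cluster id, it needs at least as many entries as there are clusters.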

Finally, we turn “hc” into a dendrogram object and plot the new results.

> hcd <- as.dendrogram(hc)
> clusDendro = dendrapply(hcd, colLab)
> plot(clusDendro, main = "NFL Receiver Clusters", type = "triangle")



That is much better.  For more in-depth analysis you can put the clusters back into the original dataframe.

> receivers$cluster <- as.factor(cutree(hc, 5))

It is now rather interesting to plot the variables by cluster to examine the differences.  In the interest of time and space, I present just a boxplot of WPA by cluster.

> boxplot(WPA~cluster, data=receivers, main="Receiver Rank by Cluster")
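
For a quick numeric companion to the boxplot, one option is to average every variable within each cluster using aggregate(). The sketch below runs on a simulated stand-in (`sim` borrows three column names from the post, but all values are invented), so treat it as a pattern to adapt rather than as results.

```r
set.seed(7)
# Simulated stand-in for the receiver table (column names borrowed, values invented)
sim <- data.frame(WPA = round(runif(12, 0.5, 2.5), 2),
                  YPR = round(runif(12, 10, 20), 1),
                  Rec = sample(50:130, 12))

# Cluster on the scaled numeric columns, then attach the membership as a factor
sim$cluster <- as.factor(cutree(hclust(dist(scale(sim)), method = "ward.D"), 3))

# Mean of every numeric variable within each cluster
cluster_means <- aggregate(. ~ cluster, data = sim, FUN = mean)
cluster_means
```

On the real data, something like aggregate(cbind(WPA, YPR, Rec) ~ cluster, data = receivers, FUN = mean) would presumably be the analogous call.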



Before moving on, I present this simple table of the receiver clusters by playoff qualification.  It is interesting to note that cluster 1, with the high WPA, also has 10 of its 13 receivers in the playoffs.  One thing that I think would be worth a look is adjusting wide receiver WPA by some weight based on QB quality.  Note that Randall Cobb and Jordy Nelson of the Packers have high WPA, ranked 6th and 7th respectively, but have the privilege of having Rodgers as QB.  Remember, as the quote above says, WPA cannot separate an individual’s performance from that of his teammates.  This raises some interesting questions for me that require further inquiry.

> table(receivers$cluster, receivers$playoffs)
 
        n     y
  1    3   10
  2    4     2
  3    7     4
  4    4     0
  5    3     3
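
The raw counts are easier to compare as row proportions. Here is a small sketch that re-enters the counts from the table above by hand (the object names `tab` and `playoff_share` are mine) and uses prop.table() to get the share of each cluster whose team made the playoffs.

```r
# Counts copied from the table above (rows = cluster, columns = playoffs n / y)
tab <- as.table(matrix(c(3, 4, 7, 4, 3,
                         10, 2, 4, 0, 3),
                       ncol = 2,
                       dimnames = list(cluster = as.character(1:5),
                                       playoffs = c("n", "y"))))

# Row-wise proportions: within each cluster, the share in each playoff category
playoff_share <- prop.table(tab, margin = 1)
round(playoff_share, 2)
```

With the real data frame, prop.table(table(receivers$cluster, receivers$playoffs), margin = 1) should give the same thing without retyping the counts.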

In closing the final blog of the year, I must make some predictions for the College Football playoffs.  I hate to say it, but I think Alabama will roll over Ohio State.  In the Rose Bowl, FSU comes from behind to win…again!  I’d really like to see the Ducks win it all, but I just don’t see their defense being of the quality to stop Winston when it counts, which will be in the fourth quarter.  ‘Bama has that defense, well, the defensive line and backers anyway.  Therefore, I have to give the Crimson Tide the nod in the championship.  The news is not all bad.  Nebraska finally let go of Bo Pelini.  I was ecstatic about his hire, but the paucity of top-notch recruits finally manifested itself in perpetual high-level mediocrity.  His best years were with Callahan’s recruits, Ndamukong Suh among many others.  They should have hired Paul Johnson from Georgia Tech; at least it would have been fun and somewhat nostalgic to watch Husker football again.

Mahalo,

Cory
