Are you ready for some Football? (No, not soccer)

[This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers.]

With two weeks of NFL football under our belts, it is time to start peeking under the proverbial hood at some of the statistics. What better way than with R? If you want the best stats out there, I recommend the website http://www.advancednflstats.com/. To understand the variables, you will need to spend some time with its glossary. An excellent in-depth companion book to these advanced statistics is Mathletics, authored by Wayne Winston of Indiana University. Wayne also publishes a blog at http://waynewinston.com/wordpress/. With those resources available, I’m not going to get into the nitty-gritty of these variables.

I’ve downloaded the quarterback stats through week 2 and will do some simple data visualization: a scatterplot matrix and a correlation heatmap. This is simple code to get you on your way to multivariate visualization.

> str(qb)  #structure of the data named "qb"
'data.frame':   33 obs. of  18 variables:
 $ Rank   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Player : Factor w/ 33 levels "1-C.Newton","10-E.Manning",..: 13 22 8 11 31 28 12 18 7 5 ...
 $ Team   : Factor w/ 32 levels "ARZ","ATL","BLT",..: 10 6 12 26 20 24 17 4 14 16 ...
 $ G      : int  2 2 2 2 2 2 2 2 2 2 ...
 $ WPA    : num  1.03 1.02 0.91 0.85 0.82 0.76 0.64 0.53 0.53 0.51 ...
 $ EPA    : num  41.2 6.3 42.1 41.5 25.7 10.4 14.6 2.1 19.5 6.4 ...
 $ WPA.G  : num  0.52 0.51 0.46 0.43 0.41 0.38 0.32 0.27 0.27 0.26 ...
 $ EPA.P  : num  0.44 0.08 0.48 0.48 0.28 0.12 0.17 0.03 0.23 0.07 ...
 $ SR...  : num  55.9 55.3 61.4 55.2 50.5 50 51.7 45.5 52.4 42.7 ...
 $ Att    : int  85 72 79 76 81 62 72 66 66 70 ...
 $ Cmp    : int  57 49 55 50 52 39 47 45 43 42 ...
 $ Cmp.   : num  67.1 68.1 69.6 65.8 64.2 62.9 65.3 68.2 65.2 60 ...
 $ PassYds: int  769 532 813 614 679 631 591 446 499 396 ...
 $ Sk     : int  3 1 6 3 6 4 9 1 7 5 ...
 $ SkYds  : int  17 8 50 18 42 29 39 9 37 26 ...
 $ Int    : int  0 3 1 1 3 0 1 1 1 0 ...
 $ X.Deep : num  17.6 15.3 17.7 22.4 24.7 30.6 12.5 21.2 25.8 8.6 ...
 $ AYPA   : num  8.5 5.3 8.4 7 5.8 9.1 6.3 5.9 5.7 4.9 ...
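
Names such as Cmp. and X.Deep are what R produces when it sanitizes column headers on import. Here is a minimal sketch of that step. The raw headers below are guesses made for illustration, not labels taken from the site.

```r
# Hypothetical raw headers; the labels in the actual download are an assumption.
raw_headers <- c("WPA/G", "EPA/P", "Cmp%", "%Deep")

# read.csv() passes headers through make.names() by default (check.names = TRUE).
# make.names() replaces invalid characters with "." and prepends "X" when a
# name would otherwise start with an invalid character.
clean_headers <- make.names(raw_headers)
print(clean_headers)
```

If you would rather keep the original labels, read.csv() accepts check.names = FALSE, at the cost of having to quote those names with backticks later.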

Of the 18 variables, 16 are numeric, but we are not concerned with “Rank” (at least not in week 2) or “G”, which is the number of games played. That leaves the 14 statistics in columns 5 through 18.
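
Picking columns by position works, but it breaks silently if the download ever changes its column order. As a rough alternative, the sketch below selects the numeric columns by type and drops the two unwanted ones by name. The toy data frame is a stand-in built from the first three rows of a few columns in the str() output, and the Player labels are placeholders.

```r
# Toy stand-in for qb; Player labels are placeholders, not real values.
toy <- data.frame(
  Rank   = 1:3,
  Player = c("qb_a", "qb_b", "qb_c"),
  G      = c(2L, 2L, 2L),
  WPA    = c(1.03, 1.02, 0.91),
  Sk     = c(3L, 1L, 6L)
)

# Flag the numeric columns, then drop the two we are not interested in.
is_num <- vapply(toy, is.numeric, logical(1))
keep   <- setdiff(names(toy)[is_num], c("Rank", "G"))
print(keep)
```

Assuming the real qb has the layout shown by str(), qb[, keep] built this way should pick out the same columns as qb[ ,5:18].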

> pairs(qb[ ,5:18])  #base package scatterplot matrix

[Figure: scatterplot matrix of the statistics in columns 5 through 18 of qb]

Yawn! 

We could improve this with more code, but it still just won’t “pop” visually.  An option would be to use the lattice package, which I describe in a previous post.  However, I’m intrigued by heatmaps, in particular as a way to portray correlations.

For this, you will need to load the ggplot2 and reshape2 packages.

> library(ggplot2)
> library(reshape2)
> # simple code to create a correlation matrix and put it into a heatmap
> corqb = cor(qb[ ,5:18])
> qplot(x=Var1, y=Var2, data=melt(corqb), fill=value, geom="tile")  #Note: Var1 and Var2 are the column names produced by melt() in reshape2; with the older reshape package, use X1 and X2 instead
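
A note for readers on a recent ggplot2: qplot() is deprecated as of version 3.4.0, so the same heatmap may be better written with ggplot() and geom_tile(). The sketch below is one way to do it, not the code used for this post. It uses the built-in mtcars data as a stand-in for qb, and the diverging colour scale is my own choice.

```r
library(ggplot2)
library(reshape2)

# mtcars is a stand-in here; with the article's data, use cor(qb[, 5:18]).
cor_mat  <- cor(mtcars[, 1:7])
cor_long <- melt(cor_mat)   # long format with columns Var1, Var2, value

heat <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  # Diverging scale centred on zero, so positive and negative
  # correlations get different colours.
  scale_fill_gradient2(limits = c(-1, 1), midpoint = 0) +
  labs(x = NULL, y = NULL, fill = "r")
print(heat)
```

Fixing the scale limits at -1 and 1 keeps colours comparable between heatmaps drawn from different subsets of the data.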

[Figure: correlation heatmap of the statistics in columns 5 through 18 of qb]

Let’s take a look at a very simple correlation on this chart.  Find the variables “Sk” and “SkYds” and look at their high level of correlation.  This should be no surprise as Sk is for the number of times sacked and yes, you guessed it, SkYds is the total yards lost as a result of those sacks.
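
You can check this relationship without the full download. The sketch below uses only the first ten values of Sk and SkYds as they appear in the str() output, so it covers a subset of the 33 quarterbacks and will not match the full-sample figure exactly.

```r
# First ten values of Sk and SkYds, copied from the str() output.
sk    <- c(3, 1, 6, 3, 6, 4, 9, 1, 7, 5)
skyds <- c(17, 8, 50, 18, 42, 29, 39, 9, 37, 26)

# Pearson correlation for these ten quarterbacks only.
r_first10 <- cor(sk, skyds)
print(r_first10)

# Average yards lost per sack, one value per quarterback.
print(round(skyds / sk, 1))
```

The yards-per-sack values show why the correlation is high but not perfect: each sack costs a different number of yards.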

Let’s look at QB rank, sacks, yards lost to sacks, and interceptions:
> corqb2 = qb[c(1,14,15,16)]  #columns Rank, Sk, SkYds and Int
> qplot(x=Var1, y=Var2, data=melt(cor(corqb2)), fill=value, geom="tile")

[Figure: correlation heatmap of Rank, Sk, SkYds and Int]

And, here are the correlation numbers…

> cor(corqb2)
           Rank         Sk     SkYds        Int
Rank  1.0000000 0.27489633 0.3330972 0.32814607
Sk    0.2748963 1.00000000 0.9117308 0.06699875
SkYds 0.3330972 0.91173078 1.0000000 0.13743870
Int   0.3281461 0.06699875 0.1374387 1.00000000


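At this point in the season, QB rank is only weakly correlated with these bad things happening: its correlations with sacks, sack yards and interceptions all fall between roughly 0.27 and 0.33. It will be interesting to see whether this changes as the season progresses.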
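
For a rough sense of how weak these correlations are, the sketch below computes the usual t statistic for a Pearson correlation from the printed numbers alone. It takes the largest Rank correlation (with SkYds) and n = 33 from the str() output. Rank is ordinal, so treat this as a back-of-the-envelope check rather than a formal test.

```r
n <- 33          # quarterbacks in the data set, per the str() output
r <- 0.3330972   # cor(Rank, SkYds) from the printed matrix

# t statistic for the null hypothesis of zero correlation,
# with n - 2 degrees of freedom.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value

print(c(t = t_stat, p = p_value))
```

With the full data frame loaded, cor.test(qb$Rank, qb$SkYds) runs the same calculation and adds a confidence interval.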
