[This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers].

With two weeks of NFL football under our belts, it is time to start peeking under the proverbial hood at some of the statistics. What better way than with R? If you want the best stats out there, I recommend the website http://www.advancednflstats.com/. To understand the variables, you will need to spend some time looking at the glossary there. An excellent in-depth companion book to these advanced statistics is Mathletics, authored by Wayne Winston of Indiana University. Wayne also publishes a blog at http://waynewinston.com/wordpress/. As such, I’m not going to get into the nitty-gritty of these variables.

I’ve downloaded the quarterback stats through week 2 and will do some simple data visualization: a scatterplot matrix and a correlation heatmap. This is some simple code to get you on your way to multivariate visualization.

> str(qb)  # structure of the data frame named "qb"
'data.frame':   33 obs. of  18 variables:
 $ Rank   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Player : Factor w/ 33 levels "1-C.Newton","10-E.Manning",..: 13 22 8 11 31 28 12 18 7 5 ...
 $ Team   : Factor w/ 32 levels "ARZ","ATL","BLT",..: 10 6 12 26 20 24 17 4 14 16 ...
 $ G      : int  2 2 2 2 2 2 2 2 2 2 ...
 $ WPA    : num  1.03 1.02 0.91 0.85 0.82 0.76 0.64 0.53 0.53 0.51 ...
 $ EPA    : num  41.2 6.3 42.1 41.5 25.7 10.4 14.6 2.1 19.5 6.4 ...
 $ WPA.G  : num  0.52 0.51 0.46 0.43 0.41 0.38 0.32 0.27 0.27 0.26 ...
 $ EPA.P  : num  0.44 0.08 0.48 0.48 0.28 0.12 0.17 0.03 0.23 0.07 ...
 $ SR...  : num  55.9 55.3 61.4 55.2 50.5 50 51.7 45.5 52.4 42.7 ...
 $ Att    : int  85 72 79 76 81 62 72 66 66 70 ...
 $ Cmp    : int  57 49 55 50 52 39 47 45 43 42 ...
 $ Cmp.   : num  67.1 68.1 69.6 65.8 64.2 62.9 65.3 68.2 65.2 60 ...
 $ PassYds: int  769 532 813 614 679 631 591 446 499 396 ...
 $ Sk     : int  3 1 6 3 6 4 9 1 7 5 ...
 $ SkYds  : int  17 8 50 18 42 29 39 9 37 26 ...
 $ Int    : int  0 3 1 1 3 0 1 1 1 0 ...
 $ X.Deep : num  17.6 15.3 17.7 22.4 24.7 30.6 12.5 21.2 25.8 8.6 ...
 $ AYPA   : num  8.5 5.3 8.4 7 5.8 9.1 6.3 5.9 5.7 4.9 ...

Of the 18 variables, 16 are numeric, but we are not concerned with “Rank” (at least not in week 2) or “G”, which is the number of games played. That leaves columns 5 through 18.
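Picking columns by position works, but it breaks quietly if the download ever changes its column order. Here is a minimal sketch of selecting by type and name instead. The tiny `qb` below is only a stand-in with a handful of columns and made-up player and team labels so that the snippet runs on its own; it is not the real data set.

```r
# Stand-in data frame (NOT the real week 2 download), just to make this runnable.
qb <- data.frame(
  Rank   = 1:5,
  Player = factor(c("A", "B", "C", "D", "E")),      # made-up labels
  Team   = factor(c("T1", "T2", "T3", "T4", "T5")), # made-up labels
  G      = rep(2L, 5),
  WPA    = c(1.03, 1.02, 0.91, 0.85, 0.82),
  Sk     = c(3L, 1L, 6L, 3L, 6L)
)

# Flag the numeric columns (integer or double), then skip the two we don't need.
is_num <- vapply(qb, is.numeric, logical(1))
skip   <- c("Rank", "G")
qb_num <- qb[, is_num & !(names(qb) %in% skip), drop = FALSE]

names(qb_num)
```

With the full data frame, `qb_num` should end up holding the same columns as `qb[ ,5:18]`, assuming the layout shown by `str()` above.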

> pairs(qb[ ,5:18])  #base package scatterplot matrix

Yawn!

We could improve this with more code, but it still just won’t “pop” visually.  An option would be to use the lattice package, which I describe in a previous post.  However, I’m intrigued by heatmaps, in particular as a way to portray correlations.
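For the curious, here is roughly what "more code" might look like with base graphics: a sketch that prints the correlation in the upper panels and adds a smoothed trend line to the lower ones. The built-in `mtcars` data stands in for `qb[ ,5:18]`, and `panel_cor` is a helper name I made up for this example.

```r
# Built-in mtcars stands in for qb[ ,5:18]; swap in the real data frame.
dat <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Upper panels: print the correlation instead of repeating the scatterplot.
panel_cor <- function(x, y, digits = 2, ...) {
  old <- par(usr = c(0, 1, 0, 1))  # temporary 0-1 coordinate system for the panel
  on.exit(par(old))                # restore the previous coordinates on exit
  r <- cor(x, y, use = "pairwise.complete.obs")
  text(0.5, 0.5, format(r, digits = digits), cex = 1.5)
}

# Lower panels: points plus a smoothed trend line (panel.smooth ships with R).
pairs(dat, upper.panel = panel_cor, lower.panel = panel.smooth,
      pch = 19, col = "steelblue")
```

It is more informative than the default, but with 14 variables the panels get tiny, which is part of why a heatmap is appealing.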

For this, you will need to load the ggplot2 and reshape2 packages.

> library(ggplot2)
> library(reshape2)
> # simple code to create a correlation data set and put it into a heatmap
> corqb = cor(qb[ ,5:18])
> qplot(x=Var1, y=Var2, data=melt(corqb), fill=value, geom="tile")  #corqb is already a correlation matrix, so melt it directly. Note: depending on your system, you may need to use X1 and X2 in place of Var1 and Var2
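Recent ggplot2 releases mark `qplot()` as deprecated, so here is a sketch of the same heatmap built with `ggplot()`. It also reshapes the matrix with base R's `as.data.frame(as.table())`, which names the first two columns `Var1` and `Var2`, so the X1/X2 question does not come up. As an assumption for the sake of a runnable example, the built-in `mtcars` data stands in for `qb[ ,5:18]`; the colour choices are just one option.

```r
library(ggplot2)

# Built-in mtcars stands in for qb[ ,5:18]; swap in the real data frame.
cormat <- cor(mtcars[, 1:5])

# Base-R way to get a matrix into long format:
# the columns come out as Var1, Var2 and Freq.
cor_long <- as.data.frame(as.table(cormat))
names(cor_long)[3] <- "value"

# Diverging colour scale centred on zero, fixed to the full -1..1 range.
p <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "r")
print(p)
```

Fixing the scale to run from -1 to 1 keeps the colours comparable from one heatmap to the next.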

Let’s take a look at a very simple correlation on this chart.  Find the variables “Sk” and “SkYds” and look at their high level of correlation.  This should be no surprise as Sk is for the number of times sacked and yes, you guessed it, SkYds is the total yards lost as a result of those sacks.

Let’s look at QB rank, sacks, yards lost by sacks and interceptions:
> corqb2 = qb[c(1,14,15,16)]  #columns Rank, Sk, SkYds and Int
> qplot(x=Var1, y=Var2, data=melt(cor(corqb2)), fill=value, geom="tile")

And, here are the correlation numbers…

> cor(corqb2)
           Rank         Sk     SkYds        Int
Rank  1.0000000 0.27489633 0.3330972 0.32814607
Sk    0.2748963 1.00000000 0.9117308 0.06699875
SkYds 0.3330972 0.91173078 1.0000000 0.13743870
Int   0.3281461 0.06699875 0.1374387 1.00000000
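Reading a matrix by eye is fine for four variables, but it gets tedious for fourteen. Here is a sketch that ranks the pairs by the strength of their correlation. To keep it self-contained, I have typed in the matrix printed above rather than recomputing it from `qb`; with the data loaded you would simply start from `cor(corqb2)` or `corqb`.

```r
# The correlation matrix printed above, typed in by hand.
vars <- c("Rank", "Sk", "SkYds", "Int")
cormat <- matrix(c(
  1.0000000, 0.27489633, 0.3330972, 0.32814607,
  0.2748963, 1.00000000, 0.9117308, 0.06699875,
  0.3330972, 0.91173078, 1.0000000, 0.13743870,
  0.3281461, 0.06699875, 0.1374387, 1.00000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once: blank out the diagonal and the lower triangle.
cormat[lower.tri(cormat, diag = TRUE)] <- NA

# Long format, one row per remaining pair.
pairs_long <- as.data.frame(as.table(cormat), stringsAsFactors = FALSE)
names(pairs_long) <- c("var1", "var2", "r")
pairs_long <- pairs_long[!is.na(pairs_long$r), ]

# Strongest relationships first, regardless of sign.
pairs_long <- pairs_long[order(-abs(pairs_long$r)), ]
pairs_long
```

The Sk and SkYds pair lands at the top, as you would expect.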

At this point in the season, QB rank is only weakly correlated with these bad things happening (roughly 0.27 to 0.33). It will be interesting to see this change as the season progresses.
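To put a rough number on "not highly correlated", the usual t-test for a Pearson correlation can be run straight from the figures above, using the 33 quarterbacks reported by `str()`. Treat this as a back-of-the-envelope sketch: Rank is an ordinal variable, so a rank-based test such as `cor.test(..., method = "spearman")` on the raw data may well be the better choice.

```r
n <- 33  # quarterbacks in the data set

# Correlations with Rank, copied from the matrix above.
r <- c(Sk = 0.2748963, SkYds = 0.3330972, Int = 0.3281461)

# t statistic for H0: true correlation is zero, with n - 2 degrees of freedom.
t_stat  <- r * sqrt((n - 2) / (1 - r^2))
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

round(cbind(r, t_stat, p_value), 3)
```

None of the three clears the conventional 0.05 threshold, which fits the picture of weak relationships after only two games.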