Why Do the New Orleans Saints Lose? Data Visualization II

December 26, 2012

(This article was first published on Climate Change Ecology » R, and kindly contributed to R-bloggers)

I’m going to continue with my ‘making data visually appealing to the masses’ kick. I happen to like graphics and graphing data. I also happen to like American football (for the record, however, I’m a soccer player first, a rugby player second, an Aussie rules player third, and an American football player never). Specifically, being from the area, I am a big New Orleans Saints fan. That said, they weren’t exactly lighting it up this year. In fact, for the first half of the season, they were downright horrible.

I like looking for trends in data, and I like football, so I put the two things together to see if there were potential explanations. I went through ESPN box scores and collected a few cursory statistics for each game (yards allowed, yards gained, pass attempts, rush attempts, etc.). Granted, this is the most shallow analysis of football stats ever, but it’s a great vehicle for data presentation. You can find the data here.

This post will continue using graphics to explore data and using ‘ggplot2’ to make visually appealing graphs. I’ve discovered ways to improve my workflow, mess with colors, and other tricks.

In my last post, I said that you use aes( ) to specify the x and y variables in the original ggplot call. For example:

```
p <- ggplot(SaintsStats, aes(Result, ydsAllowed))
p + geom_boxplot()
```

That’s useful if you know exactly what you want to graph beforehand. You can change the variables by resetting aes( ) in the geom_boxplot( ) call. There are times, like this one, during exploratory data analysis that you might want to make many graphs of many different relationships (walking a fine line between EDA and data dredging). If that’s the case, then the original call can be just the dataframe.

```
p <- ggplot(SaintsStats)
```

In this case, every geom you add MUST get its own variables through aes( ) (x and y for a boxplot). For example, I’m inclined to believe that the Saints give up more yardage when they lose. I can check this with a boxplot:

```
p + geom_boxplot(aes(Result, ydsAllowed))
```
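Eyeballing a boxplot is easy to get wrong, so it can help to print the medians next to it. This is only a minimal sketch: `toy` is a made-up data frame standing in for SaintsStats, and its numbers are invented for illustration, not taken from the real box scores.

```r
# Hypothetical toy data; these are NOT the real Saints box scores
toy <- data.frame(
  Result     = factor(c("L", "L", "L", "W", "W", "W")),
  ydsAllowed = c(420, 460, 390, 440, 410, 450)
)

# Median yards allowed for each game result
med <- aggregate(ydsAllowed ~ Result, data = toy, FUN = median)
print(med)
```

With the real data you would swap `toy` for SaintsStats; the call itself stays the same.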

Doesn’t look to be the case. Let’s focus on the graph properties for a second. I’m down with the colors; I like them for the most part. I have a couple of tweaks I’d like to make. The axis titles and text need to be bigger because they’re hard to read. I’d like the axis text in black. I also want to put a black box around the plot panel. You can change these settings using the theme( ) command as follows:

```
p + geom_boxplot(aes(Result, ydsAllowed)) +
  theme(axis.text.y=element_text(size=14, color='black'),
        axis.text.x=element_text(size=14, color='black'),
        axis.title.x=element_text(size=16),
        axis.title.y=element_text(size=16),
        panel.background=element_rect(color='black'))
```

There are MANY elements you can control directly and each element has a large number of things you can modify. For example, text elements can be altered by size, color, font face (bold, italics), angle, etc. I’ll leave it to you to find the specifics. The end result is a much nicer graph.
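To give a flavor of those text options, here is a small sketch that makes the x-axis title bold and tilts the x-axis tick labels. The data frame `toy` is hypothetical (invented numbers standing in for SaintsStats), and the particular sizes and angle are just values I picked for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for SaintsStats
toy <- data.frame(
  Result     = factor(c("L", "L", "L", "W", "W", "W")),
  ydsAllowed = c(420, 460, 390, 440, 410, 450)
)

g <- ggplot(toy) +
  geom_boxplot(aes(Result, ydsAllowed)) +
  theme(
    # bold axis title
    axis.title.x = element_text(size = 16, face = "bold"),
    # black tick labels rotated 45 degrees, right-aligned to the tick
    axis.text.x  = element_text(size = 14, color = "black", angle = 45, hjust = 1)
  )
print(g)
```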

I’ve noticed something else. The x- and y-axis titles are too close to the axis text. ‘ydsAllowed’ is almost right on top of the text. I want to scoot it back a little using vjust as follows. The actual value of vjust is something you’ll have to manually play with for your own graphics device and plot size.

```
p + geom_boxplot(aes(Result, ydsAllowed)) +
  theme(axis.text.y=element_text(size=14, color='black'),
        axis.text.x=element_text(size=14, color='black'),
        axis.title.x=element_text(size=16, vjust=0.2),
        axis.title.y=element_text(size=16, vjust=0.2),
        panel.background=element_rect(color='black'))
```

Much better!

But here’s the thing. I’m lazy. I’m also convinced laziness is a great trait for a programmer because it means you look for ways to be efficient. I don’t want to have to type all of that theme( ) nonsense for every graph I make, and I’ll be making a lot because I have a number of relationships to examine. I also want these same theme settings regardless of boxplot, line graph, scatterplot, etc. You can ‘update’ the theme in ggplot to do this:

```
theme_old <- theme_update(
  axis.text.y=element_text(size=14, color='black'),
  axis.text.x=element_text(size=14, color='black'),
  axis.title.x=element_text(size=16, vjust=0.2),
  axis.title.y=element_text(size=16, vjust=0.2, angle=90),
  panel.background=element_rect(color='black', fill='grey90')
)
```

Notice that panel.background now has the ‘fill’ argument. When you update the current theme, the new element overwrites the OLD one, so you’re replacing the old panel.background wholesale with the new one. If you leave out the ‘fill’ argument, the theme will have a blank background, i.e. no fill (this is the behavior of the ggplot2 version used for this post; more recent releases merge new settings into the existing element, so the fill would be kept). I happen to like the grey, so I stuck it in there. Any graphs you make will now have these default settings, and you can continue to use theme( ) to tweak things for specific graphs. If you want to revert to the default grey theme at any time, type theme_set(theme_grey()).
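Because theme_update( ) returns the theme that was active before the update, the saved object can also be used to undo the change. A minimal sketch of that round trip, assuming you start from the default grey theme (the variable names `changed` and `restored` are mine, purely for illustration):

```r
library(ggplot2)

theme_set(theme_grey())          # start from the default theme

# theme_update() returns the previously active theme, so keep it around
theme_old <- theme_update(
  axis.title.x = element_text(size = 16)
)

# The active theme now differs from the saved one
changed <- !identical(theme_get(), theme_old)

# Put the saved theme back
theme_set(theme_old)
restored <- identical(theme_get(), theme_old)
```

This is handy when the previous theme was not the default, since theme_set(theme_grey()) would throw away any earlier customizations.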

Now I can examine a large number of relationships my lazy way, with the theme already set:

```
p + geom_boxplot(aes(Result, ydsAllowed))
p + geom_boxplot(aes(Result, ydsGained))
p + geom_boxplot(aes(Result, rushAtt))
p + geom_boxplot(aes(Result, rushYds))
```
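If even that is too much typing, the same set of boxplots can be built in a loop. This is a sketch under a few assumptions: `toy` is a hypothetical stand-in for SaintsStats with invented numbers, and the `.data[[v]]` lookup needs a reasonably recent ggplot2 (3.0.0 or later).

```r
library(ggplot2)

# Hypothetical toy data standing in for SaintsStats
toy <- data.frame(
  Result     = factor(c("L", "L", "W", "W")),
  ydsAllowed = c(420, 460, 440, 410),
  ydsGained  = c(350, 380, 430, 400),
  rushAtt    = c(18, 20, 29, 31)
)

vars <- c("ydsAllowed", "ydsGained", "rushAtt")

# Build one boxplot per column; .data[[v]] looks the column up by name
plots <- lapply(vars, function(v) {
  ggplot(toy) +
    geom_boxplot(aes(Result, .data[[v]])) +
    ylab(v)
})
names(plots) <- vars

print(plots$rushAtt)   # draw one of them
```

The plots live in a named list, so you can flip through them one at a time without retyping the geom call.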

What’s this? It looks like the Saints rush more when they win!

That’s a great exploratory graph, but it’s nothing I want to present to the public. It’s kind of bland. I want to make a few changes: 1) plot the results by week, 2) color code the points for wins and losses, 3) change the axis labels, 4) change the legend key to say ‘Loss’ and ‘Win’ rather than ‘L’ and ‘W’, and 5) remove the minor gridlines between weeks because they are meaningless in this case. I also want to report the percent of total plays that are rushing. I do this for a couple of reasons: 1) most people (I think) find percentages easier to grasp than proportions because the numbers are easier to understand (50% vs. 0.5), and 2) I want to remove any possible effect of an increased number of offensive plays in general. Fortunately, we can transform this in the ggplot call. The full code is as follows:

```
p + geom_line(aes(Week, rushAtt/(passAtt+rushAtt)*100), linetype=2) + # Make the line
  geom_point(aes(Week, rushAtt/(passAtt+rushAtt)*100, fill=Result), shape=21, size=10) + # Make the points
  ylab('Percent of Rushing Play Calls') + # y-axis label
  xlab('Week') + # x-axis label
  scale_x_continuous(breaks=1:15) + # x-axis tick marks
  scale_fill_hue(labels=c('Loss', 'Win')) + # legend key definitions
  theme(
    panel.grid.minor=element_blank() # remove minor gridlines
  )
```
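The percentage itself is just rushing attempts divided by all offensive plays, times 100. If you would rather not repeat that expression inside every aes( ), you can compute it once as a column. A small worked sketch with made-up numbers (`toy` and the column name `rushPct` are hypothetical, not part of the real data set):

```r
# Hypothetical numbers, not the real Saints data
toy <- data.frame(
  Week    = 1:4,
  rushAtt = c(10, 20, 30, 25),
  passAtt = c(40, 30, 30, 25)
)

# Rushing plays as a percentage of all offensive plays
toy$rushPct <- with(toy, rushAtt / (rushAtt + passAtt) * 100)
print(toy[, c("Week", "rushPct")])
```

In week 1 of the toy data, 10 rushes out of 50 plays works out to 20 percent. The geoms could then use aes(Week, rushPct) directly.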

Now that’s a graph I’d be proud of. It shows many things: the weekly wins and losses, the percentage of rushing play calls, and it’s easy to see that the Saints win when there is a greater percentage of rushing plays! When the Saints rely too heavily on Brees and become one-dimensional, they lose (but see below).

Notice that, as I said above, I can update my new theme with a second theme( ) call to remove the gridlines, just as I would ordinarily. There’s also one more thing. I’m not crazy about the near-pastel colors. I want bold, eye-catching colors. You can modify the code to set the color palette or define one yourself. I prefer to use preset color palettes, and ggplot2 will accept any named palettes from the RColorBrewer package. Set1 provides bold colors that differ distinctly among levels, and would be good for discrete groups.

```
p + geom_line(aes(Week, rushAtt/(passAtt+rushAtt)*100), linetype=2) +
  geom_point(aes(Week, rushAtt/(passAtt+rushAtt)*100, fill=Result), shape=21, size=10) +
  ylab('Percent of Rushing Play Calls') +
  xlab('Week') +
  scale_x_continuous(breaks=1:15) +
  scale_fill_brewer(labels=c('Loss', 'Win'), palette='Set1') + # Notice the change from above to scale_fill_brewer
  theme(
    panel.grid.minor=element_blank()
  )
```
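If you would rather define the palette yourself instead of using a Brewer preset, scale_fill_manual( ) takes a vector of colors. This is only a sketch: the data frame `toy` is hypothetical, and the two color names are arbitrary picks of mine, not anything from the original graphs.

```r
library(ggplot2)

# Hypothetical toy data with the percentage already computed
toy <- data.frame(
  Week    = 1:4,
  rushPct = c(20, 40, 50, 50),
  Result  = factor(c("L", "L", "W", "W"))
)

# Hand-picked colors, keyed by factor level
my_cols <- c(L = "firebrick", W = "steelblue")

g <- ggplot(toy) +
  geom_point(aes(Week, rushPct, fill = Result), shape = 21, size = 10) +
  scale_fill_manual(values = my_cols,
                    breaks = c("L", "W"),
                    labels = c("Loss", "Win"))
print(g)
```

Naming the color vector by factor level means the mapping does not depend on the order of the levels.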

Voila!

Side note: I mentioned a caveat above. I can’t strictly interpret this as ‘the Saints win when they rush more’. It could just as easily be ‘the Saints rush more when they are winning’. This is also a cursory look at their stats that neglects turnovers and other crucial information (aside from the Madden-esque ‘the Saints win by outscoring the other team’ trend).
