sab-R-metrics: Beginning with Boxplots, Scatterplots, and Histograms

[This article was first published on The Prince of Slides, and kindly contributed to R-bloggers.]

Today I decided to begin more with visualizations and less with basic statistical analysis for sabermetrics using R. I’m not really here to teach the ins and outs of regressions and statistical tests, so once I get there, I’m hoping that those who have read this already have a decent understanding of those subjects before implementing them. I’ll begin with scatterplots and boxplots, then finish with histograms.

First, go ahead and load in your data from last time (the Albert Pujols data). Use the following code, making sure to subset out the data without Pitch f/x data in it:

##working with pujols pitch fx data

pujols <- read.csv(file="pujols.csv", header=TRUE)

#subsetting data into pitches with f/x data only
pitchfx <- subset(pujols, start_speed > 0)
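If you want to see what that subset step does before trusting it, here is a tiny sketch on made-up data. The data frame "toy" and its values are hypothetical stand-ins, and I am assuming that untracked pitches show up with a start_speed of 0 or NA. Either way, subset() keeps only the rows where the condition is TRUE, so both kinds get dropped:

```r
# Made-up stand-in for pujols.csv (hypothetical values):
# a start_speed of NA or 0 means the pitch has no Pitch f/x tracking
toy <- data.frame(
  inning      = c(1, 1, 3, 5, 8),
  start_speed = c(91.2, NA, 0, 88.5, 95.1)
)

# subset() drops rows where the condition is FALSE *or* NA
tracked <- subset(toy, start_speed > 0)

nrow(toy)      # 5
nrow(tracked)  # 3
```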

Alright, now we’re ready to go. I’ll begin with the most basic function, “plot()”. Let’s do a quick, unedited scatterplot just to get things working. There are two ways to do this, one using the ‘formula’ version and another using just (x, y). I’ll show below (note, this won’t work because we don’t have variables x and y in our data set, this is just for demonstration):

plot(x, y)
plot(y ~ x)
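If you would like to actually run those two calls, here is a minimal sketch with invented x and y vectors (they are not part of the Pujols data). Both calls should draw the same points, just with the arguments in a different order:

```r
# Invented data purely for demonstration (not from the Pujols file)
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)

# (x, y) version: horizontal variable first, vertical variable second
plot(x, y)

# formula version: read as "y as a function of x"
plot(y ~ x)
```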

If you are familiar with middle school algebra, you know that the y-axis is the vertical axis, while the x-axis is the horizontal axis. I prefer the formula version for most plotting when I already have the variables, but when we want to customize our axes, I’ll get away from this a bit. I’ll talk about that a little more in the “Intermediate Visualization” post. So, let’s plot the starting speed (Y) of all pitches that Pujols sees by the inning (X) and also end speed by start speed:

#plot velocity by inning
plot(pitchfx$start_speed ~ pitchfx$inning)

#plot ending speed as a function of starting speed
plot(pitchfx$end_speed ~ pitchfx$start_speed)
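The formula version also accepts a “data=” argument, which saves typing the data frame name over and over. Here is a sketch on simulated speeds (the data frame "toy" and the 0.92 slope are made up for illustration), along with cor() to put a number on how tightly the two speeds move together:

```r
# Simulated speeds (hypothetical): end speed is roughly a fixed
# fraction of start speed plus a little noise
set.seed(42)
toy <- data.frame(start_speed = runif(200, min = 75, max = 98))
toy$end_speed <- 0.92 * toy$start_speed + rnorm(200, sd = 0.5)

# Same scatterplot as the pitchfx$... version, using data= instead
plot(end_speed ~ start_speed, data = toy)

# Pearson correlation between the two columns
r <- cor(toy$start_speed, toy$end_speed)
r
```

With the real data, the equivalent call would be plot(end_speed ~ start_speed, data=pitchfx).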

Not surprisingly, starting speed and ending speed are strongly correlated. You’ll notice that because inning is not a continuous variable, the first plot looks like stacks of points. Perhaps a bar plot or box plot would work better for this type of data. Let’s try both below:

#barplot of mean start speed by inning
vel_by_inn <- tapply(pitchfx$start_speed, pitchfx$inning, mean)
barplot(vel_by_inn)

#boxplot of start speed by inning
boxplot(pitchfx$start_speed ~ pitchfx$inning)

#OR horizontally
boxplot(pitchfx$start_speed ~ pitchfx$inning, horizontal=TRUE)
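The tapply() call in the barplot code is doing the grouping work: it splits the first vector by the values of the second and applies the function to each group. Here is a small worked sketch on made-up numbers (the "toy" data frame is hypothetical) so you can check the arithmetic by hand:

```r
# Six made-up pitches across three innings
toy <- data.frame(
  inning      = c(1, 1, 3, 3, 3, 8),
  start_speed = c(90, 92, 85, 87, 89, 96)
)

# Mean start speed within each inning
means <- tapply(toy$start_speed, toy$inning, mean)
means
# names are the innings ("1", "3", "8"); values are 91, 87, 96

# barplot() draws one bar per named element
barplot(means)
```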

As you can see, these graphs are pretty boring. The worst part is that we don’t even know what the graph is showing us because there is NO TITLE! In addition, the axes are ugly (do we really want to call them pitchfx$start_speed?). We can clean this portion of the plots up using simple commands within the plot function(s): “xlab=”, “ylab=”, and “main=”. These are pretty straightforward, as we already know which are the X and Y axes. Be sure to use plain straight quotes when writing your axis names and titles, since R will not accept the curly ones that word processors like to insert:

#adding axis labels and a title to the plot

plot(pitchfx$end_speed ~ pitchfx$start_speed, xlab="Speed Out of Hand", ylab="Speed Crossing the Plate", main="Starting Speed vs. Ending Speed (Albert Pujols' Pitches Seen)")

barplot(vel_by_inn, main="Starting Speed by Inning (Albert Pujols' Pitches Seen)", xlab="Inning", ylab="Speed Out of Hand")

boxplot(pitchfx$start_speed ~ pitchfx$inning, xlab="Inning", ylab="Speed Out of Hand", main="Boxplot of Pitch Speed by Inning")

I really like boxplots for this exercise because they not only give us the median speed but also the variability in velocity (the barplot isn’t very useful here, especially in the form shown above, and is generally pretty “bleh!”). In general, it seems more instructive to do this by pitch type, rather than for all pitches, so let’s recall the bracket conditions that restrict which observations are shown:

#boxplots of fastballs (generic FA version) only

boxplot(pitchfx$start_speed[pitchfx$pitch_type=="FA"] ~ pitchfx$inning[pitchfx$pitch_type=="FA"], xlab="Inning", ylab="Speed Out of Hand", main="Boxplot of Fastball Speed by Inning")
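Repeating the bracket condition on both sides of the formula gets long quickly. One alternative, sketched here on a made-up data frame ("toy" and its values are hypothetical), is to subset the fastballs once and then hand the result to the “data=” argument:

```r
# Made-up pitches (hypothetical values)
toy <- data.frame(
  inning      = c(1, 1, 1, 3, 3, 8, 8, 8),
  pitch_type  = c("FA", "SL", "FA", "FA", "CU", "FA", "FA", "SL"),
  start_speed = c(91, 84, 92, 90, 78, 96, 95, 87)
)

# Keep only the generic fastballs, once
fastballs <- subset(toy, pitch_type == "FA")

# The formula can now use bare column names
boxplot(start_speed ~ inning, data = fastballs,
        xlab = "Inning", ylab = "Speed Out of Hand",
        main = "Boxplot of Fastball Speed by Inning")
```

This should draw the same boxes as the bracket version; it is just easier to read and harder to mistype.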

We can see from the plots that Pujols likely sees more hard-throwing relievers late in the game. But we can actually get a bit more information out of our boxplots. One of the options for this type of plot includes making the width of the boxes a function of the number of observations in each group. We use the “varwidth=TRUE” command for this. My prediction is that the boxes after the 8th inning will begin to get skinnier and skinnier if we do this. Let’s check it out:

#boxplots of fastballs (generic FA version) only and box width using num. of observations

boxplot(pitchfx$start_speed[pitchfx$pitch_type=="FA"] ~ pitchfx$inning[pitchfx$pitch_type=="FA"], xlab="Inning", ylab="Speed Out of Hand", main="Boxplot of Fastball Speed by Inning", varwidth=TRUE)

Interestingly, there aren’t many pitches classified as generic fastballs in the 2nd inning for Pujols in 2008. We’d have to do a little more snooping to figure out what is going on here (is he swinging more, or are pitchers throwing him junk after he rips a seed off them in the first inning?). Adden(DUH)m: most likely it’s simply that Pujols bats early in the order, so he usually gets up in the 1st inning and only in the 2nd if the team quickly bats around! There are all sorts of things to play with here. R has an insane number of graphical commands and parameters for most of its visuals. I won’t go into all of them here, and will leave those for next time.
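Rather than eyeballing box widths, you can count the observations behind each box directly with table(). This is a sketch on made-up data (the "toy" data frame is hypothetical); note that with varwidth the box widths are proportional to the square roots of these counts, not to the counts themselves:

```r
# Made-up pitches (hypothetical values)
toy <- data.frame(
  inning     = c(1, 1, 1, 1, 2, 3, 3, 3, 8, 8),
  pitch_type = c("FA", "FA", "SL", "FA", "FA", "FA", "CU", "FA", "FA", "SL")
)

# Number of generic fastballs seen in each inning
fa_counts <- table(toy$inning[toy$pitch_type == "FA"])
fa_counts
# innings 1, 2, 3, 8 have 3, 1, 2, 1 fastballs respectively
```

On the real data, the same call with pitchfx in place of toy should show whether the 2nd inning really is thin.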

Finally, I’d like to get into histograms. Again, these are very easy and straightforward using the “hist()” function. A histogram shows how many observations fall at different levels of a single variable. It sorts the observations into ‘bins’, using a rule (Sturges’ rule by default) to pick a sensible number of bins for the data. For example, if you have pitches between 90 and 95 mph, it may choose 5 bins (90 to 90.9, 91 to 91.9, 92 to 92.9, 93 to 93.9, 94 to 94.9) or something of that sort. You can adjust what R uses as the bins if you’d like as well, through the “breaks” argument (see the help file with ?hist). Here’s the basic histogram for fastballs seen by Pujols:

#histogram of fastballs

hist(pitchfx$start_speed[pitchfx$pitch_type=="FA"], xlab="Speed Out of Hand", main="Histogram of Fastball Speed")
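If you want to choose the bins yourself rather than take R’s default, the “breaks” argument of hist() accepts a vector of bin edges. Here is a sketch on simulated speeds (the vector "speeds" is invented, not the Pujols data) that forces bins exactly 1 mph wide:

```r
# Simulated fastball speeds (hypothetical)
set.seed(7)
speeds <- rnorm(300, mean = 91, sd = 2)

# Bin edges at every whole mph, wide enough to cover all the data
edges <- seq(floor(min(speeds)), ceiling(max(speeds)), by = 1)

# hist() returns the binning it used, so we can inspect it afterwards
h <- hist(speeds, breaks = edges,
          xlab = "Speed Out of Hand",
          main = "Histogram of Fastball Speed (1 mph bins)")

h$breaks  # the bin edges actually used
h$counts  # number of pitches in each bin
```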

Above, we see the frequency (count) of each bin on the y-axis, with the speed of the pitches on the x-axis. We can also adjust this code to put the y-axis on a density scale, where the areas of the bars sum to 1. If the bins happen to be exactly 1 mph wide, the bar heights are then also the proportion of pitches within each bin. Just use an additional command, “freq=FALSE”.

#histogram of fastballs (density scale)

hist(pitchfx$start_speed[pitchfx$pitch_type=="FA"], freq=FALSE, xlab="Speed Out of Hand", main="Histogram of Fastball Speed")
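The difference between density and proportion is easy to check numerically. In this sketch on simulated speeds (the vector "speeds" is invented), the bins are half a mph wide, so each density value comes out at twice the proportion of pitches in that bin:

```r
# Simulated fastball speeds (hypothetical)
set.seed(7)
speeds <- rnorm(300, mean = 91, sd = 2)

# Half-mph bins; plot = FALSE just returns the numbers
edges <- seq(floor(min(speeds)), ceiling(max(speeds)), by = 0.5)
h <- hist(speeds, breaks = edges, plot = FALSE)

# Proportion of pitches in each bin
proportions <- h$counts / sum(h$counts)

# Density is proportion divided by bin width, so the bar AREAS sum to 1
total_area <- sum(h$density * diff(h$breaks))
total_area
```

If what you really want is percent on the y-axis, one option is to compute the proportions like this and draw them with barplot() instead.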

Those of you who have looked at graphics I’ve posted here before may notice that the above plots aren’t indicative of how I normally use visuals from R. These are about as bland as it gets. However, we have to start with the very basics to work up to blending colors, drawing loess lines, using text as data points, and so on. Once you have this portion of the graphing down, the rest is simply adding some new commands for pretty things in the plot window and/or getting comfortable with setting your own graphing parameters before the plotting takes place in the window.

In my next post, I’ll work with some more intermediate graphing options for scatterplots and line plots. That will include formatting your axes, using color, changing the points you use, drawing text, lines and shapes on your plots, and putting more than one plot in a single window. Finally, I’ll also talk about the best way to size and save your plots for use later on. However, the above code should provide a decent jumping off point for anyone interested in making their own graphics. Always check the HELP files for each function, as there are more options than I can go through here in a single post. The best way to figure this stuff out is to practice and try it out on your own. See what you can do!
