sab-R-metrics: Subsetting, Conditional Statements, ‘tapply()’, and VERY simple ‘for loops’

[This article was first published on The Prince of Slides, and kindly contributed to R-bloggers.]

In my last sab-R-metrics post, I went over some basics of calling data and creating vectors or new data from those. Here, I want to extend that to full subsets of data and go on to use some of the basic functions in R so that we can begin plotting in the next tutorial.

Before I begin, I want to first give a data example that you can all follow along exactly. Since Blogger isn’t very helpful with uploading things to this page, I’ll refer you to this link and Joe Lefkowitz’s Pitch F/X tool (the best one on the web if you’re interested in doing your own analysis). The link directs you to data for Albert Pujols. The first thing you should do is click on the button that says “Download Excel File”. Be sure to open it up in Excel. For ease of use, I am going to change the variable names to all lower case. If you want the code below to work, you should do the same. The file also has variable names with spaces in them. I’ll use the same variable names with all lower case and an underscore (_) instead of a space. I note the changes below:

1. All lowercase variables.
2. Use underscore (_) for variable names.

Next, go to “Save As”, choose your favorite convenient directory, and save it as a comma-separated values (CSV) file as I showed you in the first post. Save it as “pujols” if you want everything to work from the code here. Go ahead and open up R and a new script to play with. At the top, name it something using the “#” sign to indicate a comment, set your working directory, and load in the data, naming it “pujols” as below:

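##working with pujols pitch fx data

#set your working directory (replace this path with the folder where you saved pujols.csv)
setwd("c:/Users/bmmillsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")

#load data
pujols <- read.csv(file="pujols.csv", header=TRUE)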

Make sure everything loaded in correctly so that there aren’t any problems. The first thing I want to do is subset my data. You’ll notice that a lot of data is missing for certain variables. These are pitches from the 2008 season for which no Pitch F/X data was available. Because we want to work specifically with Pitch F/X data, let’s subset that. We’ll use the “subset()” function. Below, I’m using a crude method of sub-setting the data, where I only select those rows that have a ‘start_speed’ number in them. I do this by indicating that I only want rows with a starting pitch speed greater than 0 mph, which also drops the rows where the speed is missing:

#subsetting data into pitches with f/x data only

pitchfx <- subset(pujols, start_speed > 0)


pitchfx <- subset(pujols, pujols$start_speed > 0)

Notice above I did the sub-setting two ways: one with the “$” sign and one without. Because both the data name and the sub-setting requirements are within the “subset()” function, we don’t necessarily have to use the “$”, but if you have multiple data sets, it can be a good idea to get into the habit of always using the dollar sign.
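If you want to convince yourself that the two forms really do the same thing, here is a minimal sketch on a tiny made-up data frame. The name “toy” and all of the values in it are assumptions for illustration only, not the Pujols file.

```r
# Toy stand-in for the pitch data (values are invented for illustration)
toy <- data.frame(
  start_speed = c(91.2, NA, 84.5, NA, 88.0),
  pitch_type  = c("FA", NA, "SL", NA, "CH"),
  stringsAsFactors = FALSE
)

# Column name is looked up inside the data frame
with_name <- subset(toy, start_speed > 0)

# Same condition written with the dollar sign
with_dollar <- subset(toy, toy$start_speed > 0)

# Both calls keep the same rows; rows with a missing speed are dropped
identical(with_name, with_dollar)
nrow(with_name)
```

Note that “subset()” quietly drops rows where the condition evaluates to NA, which is why the crude “greater than 0” rule also removes the pitches with missing speeds.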

Similarly, we can subset the data by the number of balls, strikes, or pitch type. I’ll begin by sub-setting from our already made “pitchfx” subset. Below, I subset the data into pitches with 2 strikes, then separately for pitches with 3 balls, and finally for pitches that are sliders only:

#other subsets of interest

k2 <- subset(pitchfx, strikes==2)

b3 <- subset(pitchfx, balls==3)

sliders <- subset(pitchfx, pitch_type=="SL")

Notice that we had to use quotations for selecting by pitch type. This is because it is indicated in our data set as text. Also, remember to use the double-equal sign “==” when stating a cell should be equal to your selection for sub-setting. We can also use our data to grab only change-ups:

#grabbing just change-ups using the dollar sign

changeup <- pitchfx[pitchfx$pitch_type=="CH", ]

Here, we HAVE to use the dollar sign (unless we want to use the “attach()” function, which is not recommended). The “subset()” function is much more convenient, but when we don’t want to have lots and lots of data subsets in our memory, the above can become very useful. Mainly, we can use the above code to set criteria for plots and functions without creating too many subsets of the data. Below, we take the mean starting speed of change-ups that Pujols saw in 2008:

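#average change-up or slider speed to Pujols

mean(pitchfx$start_speed[pitchfx$pitch_type=="CH"])

mean(pitchfx$start_speed[pitchfx$pitch_type=="SL"])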

Hopefully you also got 82.037 mph as the average change-up speed and 83.86 mph as the average slider speed. See how we combined all these things together? There is another important R function that comes in handy called “tapply()”. This allows us to use a basic function like “mean()” across different criteria in one fell swoop, rather than writing a line of code for every pitch type. This function should be quicker, in general, than ‘for loops’ (covered further below), but sometimes it can be a pain. Begin with the variable you want to apply your function to, then say what variable you want to use to categorize the ‘types’ (here, it’s pitch type), and then just write the name of the function in the last portion.

#use tapply to get some descriptives of all pitch types individually

tapply(pitchfx$start_speed, pitchfx$pitch_type, mean)

tapply(pitchfx$start_speed, pitchfx$pitch_type, sd)

tapply(pitchfx$start_speed, pitchfx$pitch_type, max)

tapply(pitchfx$pz, pitchfx$pitch_type, mean)

This gives us a nice list of the average speed of each pitch Pujols sees, the standard deviation of each, and the average height of each. You can see that, on average, fastballs (FA) are the fastest pitches, while splitters (FS) are on average the lowest in the zone (“pz” is the height of the pitch). Using the “max()” function, we can also see the fastest of each pitch type that Pujols saw in 2008. Pretty easy stuff!
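To see what “tapply()” is doing under the hood, here is a small sketch on invented numbers. The “toy” data frame and its speeds are assumptions, chosen only so the group means are easy to check by hand.

```r
# Made-up pitches: two of each type so the means are easy to verify
toy <- data.frame(
  start_speed = c(92, 94, 83, 85, 81, 79),
  pitch_type  = c("FA", "FA", "SL", "SL", "CH", "CH")
)

# One call returns a named result with one mean per pitch type
by_type <- tapply(toy$start_speed, toy$pitch_type, mean)
by_type

# The same number for a single group, done by hand with bracket indexing
ch_mean <- mean(toy$start_speed[toy$pitch_type == "CH"])
ch_mean
```

The result is named by pitch type, so you can pull out a single value with something like by_type["CH"] instead of rerunning the calculation.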


********BEGIN SIDETRACK********

One thing I have not mentioned is using conditional statements. The simplest case of using a conditional statement is creating a single dummy variable. Perhaps we’re interested in creating a column vector (a variable) that indicates whether or not the pitch was faster than 90 mph. To indicate that “Yes”, the pitch was above 90 mph, we’ll use the number “1”, and to indicate that “No”, the pitch was at or below 90 mph, we’ll use the number “0”. For this, we can use the “ifelse()” function, which is a vectorized extension of the “if()” statement often used for writing your own functions. Here, you’ll need to recall how to add a variable to the end of your data set:

#adding an “above 90 mph” indicator variable to our data set

pitchfx$above_90 <- ifelse(pitchfx$start_speed > 90, 1, 0)

#or similarly

pitchfx$above_90_b <- ifelse(pitchfx$start_speed <= 90, 0, 1)

In this code, we begin by picking a name for the variable and using our assignment operator “<-”. Then, we use our function, beginning with the criterion (that the pitch is above 90 mph), followed by the number to assign to the variable if it is true for that row, then the number to assign if the statement is not true. Similarly, the second version gives a “0” to the column if the pitch is less than or equal to 90 mph, and a “1” if this statement is false (that is, the pitch is above 90 mph). We can also do the same using a text identifier within the column:

#using Yes and No

pitchfx$above_90_c <- ifelse(pitchfx$start_speed > 90, "Yes", "No")
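Here is a quick sketch showing that the two numeric versions agree, using a handful of invented velocities (the “speeds” vector is an assumption, and it deliberately includes a pitch at exactly 90 mph to show the boundary case).

```r
# Made-up velocities, including one exactly at the 90 mph cutoff
speeds <- c(95.1, 90.0, 84.3, 90.5)

# "Above 90" coded two equivalent ways
above_a <- ifelse(speeds > 90, 1, 0)
above_b <- ifelse(speeds <= 90, 0, 1)

# Text labels instead of numbers (note the plain straight quotes)
above_c <- ifelse(speeds > 90, "Yes", "No")

identical(above_a, above_b)
above_a
above_c
```

The pitch at exactly 90.0 mph gets a “0” (or “No”) under both versions, since it is not strictly above 90.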

********END SIDETRACK********

While the “tapply()” function is usually the more efficient choice, we can also run ‘for loops’ in order to assess things in our data. A ‘for loop’ runs the same code repeatedly, once for each element of a sequence that you specify. Here, I will also make use of the “length()” function, which tells you the number of elements in a vector (so, for a single column, the number of rows) or, when applied to a whole data frame, the number of columns. I also use the “sample()” function that randomly samples from your data, but I’ll only briefly discuss this. If you use:

#looking at the number of observations (pitches) in the dataset

length(pitchfx$start_speed)

#looking at the number of variables

length(pitchfx)

We see that there are 2,364 pitches (rows) and 41 variables (columns) in our data set.
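The difference between calling “length()” on one column and on the whole data set is easy to check on a small made-up example. The “toy” data frame below is an assumption for illustration. The “nrow()”, “ncol()”, and “dim()” functions are base R alternatives that are not used elsewhere in this post, shown here only for comparison.

```r
# Toy data frame: 4 pitches (rows) and 3 variables (columns)
toy <- data.frame(
  start_speed = c(92, 83, 81, 88),
  pz          = c(2.5, 1.9, 2.1, 3.0),
  pitch_type  = c("FA", "SL", "CH", "FA")
)

# A single column is a vector, so length() counts its elements (the rows)
length(toy$start_speed)

# A data frame is a list of columns, so length() counts the columns
length(toy)

# Base R alternatives that say what they mean
nrow(toy)
ncol(toy)
dim(toy)
```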


********BEGIN SIDETRACK********

I sometimes use “sample()” when I need some random selection in R. I won’t get too in-depth with this, but let’s grab a random sample of velocities of pitches in the data set. First, I’ll also use the “set.seed()” function. We use this so that when we randomly sample something, we can be sure to reproduce the same exact sample next time. If you do not set the seed, you’ll get a different sample every time. Just plug in your favorite number between the parentheses. I use my birthday so I can always remember which one I used.

#set your random seed


#randomly sample start speeds of 10 pitches with replacement

speed_samp_repl <- sample(pitchfx$start_speed, 10, replace=TRUE)


#randomly sample start speeds of 10 pitches without replacement

speed_samp_norepl <- sample(pitchfx$start_speed, 10, replace=FALSE)


So, you start with the variable you want to randomly sample, then indicate the number of observations to sample, and finally indicate whether you want to sample with or without replacement. Easy enough for those of you that want to deal with permutations of data.
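As a rough illustration of why the seed matters, the sketch below draws the same sample twice. The “speeds” vector and the seed value 1234 are both assumptions made up for this example, so plug in your own.

```r
# Twelve made-up, distinct velocities
speeds <- c(92, 83, 81, 88, 95, 79, 86, 90, 84, 91, 87, 93)

# Setting the same seed before each draw reproduces the same sample
set.seed(1234)  # hypothetical seed; any number works
first_draw <- sample(speeds, 10, replace = FALSE)

set.seed(1234)
second_draw <- sample(speeds, 10, replace = FALSE)

identical(first_draw, second_draw)

# Without replacement, no pitch can be drawn twice
anyDuplicated(first_draw) == 0

# With replacement, you can even draw more values than you have
with_repl <- sample(speeds, 20, replace = TRUE)
length(with_repl)
```

Asking for 20 values without replacement from only 12 would throw an error, which is one quick way to remember the difference between the two settings.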

********END SIDETRACK********

In a ‘for loop’, the objective is to run the same code on multiple parts of a data set (like “tapply()”), or to do lots of samples of data. For loops are useful when you want to make up your own bootstrap or other function (for those familiar with the technique). I mostly use the for loop to permute data for head-to-head (H2H) fantasy leagues (i.e. randomly matching up weekly category totals in order to gauge the probability of winning a category, given the data from the previous season). In words, the loop below says that for each row from the first to the last in the column called “start_speed”, multiply that value by 10 and add it to a new vector called “dumb”, then take the mean of the vector:

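#create useless for loop multiplying start speed by 10
dumb <- NULL
for (i in 1:length(pitchfx$start_speed)) {
  #take the speed in row i and multiply it by 10
  stupid <- pitchfx$start_speed[i]*10
  #append the result to the end of the vector
  dumb <- c(dumb, stupid)
}
#average of all the speeds multiplied by 10
mean(dumb)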

It’s always good to indent code for easier reading, especially with loops and functions. Above, I first created an empty vector called “dumb”. Then, I created the loop using the “for()” function indicating that I wanted to do it from Row 1 through the last row of the data (the colon is read as “to”). Then I define “stupid” as the speed in the current row times 10, and use the “c()” function to put each result into the previously empty vector “dumb”. The entire loop is surrounded by the “{}” brackets and I take the average of the vector of speeds multiplied by 10 after it runs.
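For comparison, here is a small sketch of the same “times 10” idea on invented numbers, written once as a loop and once without one. The “speeds” vector is an assumption. Creating the result vector at full size up front with “numeric()” and looping with “seq_along()” are my own substitutions for the grow-as-you-go approach above, not something from the original script.

```r
# Made-up velocities
speeds <- c(92, 83, 81, 88)

# Loop version: create the result vector at full size, then fill it in
times_ten <- numeric(length(speeds))
for (i in seq_along(speeds)) {
  times_ten[i] <- speeds[i] * 10
}

# Vectorized version: R multiplies every element at once
vectorized <- speeds * 10

identical(times_ten, vectorized)
mean(times_ten)
```

For something this simple, the vectorized line is all you need. The loop only earns its keep when each pass does something that cannot be written as one operation on the whole vector.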

I’m not going to get too in depth with this, as there are a number of things to do with the for loops…none of which are pertinent to the type of sabermetric analysis I’ll be discussing at this site. Just know that you can define “stupid” above as any mathematical function you’d like (as well as a conditional statement). Those of you who are creative programmers can probably figure out where to go from here with the for loops.
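Since I mentioned the bootstrap earlier, here is one possible sketch of how a for loop and “sample()” can fit together for that purpose. Everything specific in it is an assumption for illustration: the “speeds” values, the seed, and the choice of 1000 resamples.

```r
# Ten made-up velocities standing in for a column of real data
speeds <- c(92, 83, 81, 88, 95, 79, 86, 90, 84, 91)

set.seed(2008)   # hypothetical seed so the result can be reproduced
n_boot <- 1000   # hypothetical number of bootstrap resamples
boot_means <- numeric(n_boot)

for (b in seq_len(n_boot)) {
  # Resample the data with replacement, keeping the original sample size
  resample <- sample(speeds, length(speeds), replace = TRUE)
  # Store the mean of this resample
  boot_means[b] <- mean(resample)
}

# Spread of the resampled means: a rough standard error for the mean
sd(boot_means)

# Rough 95% percentile interval for the mean
quantile(boot_means, c(0.025, 0.975))
```

You can swap “mean()” inside the loop for any other statistic you care about, which is where the flexibility of the loop comes in.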

Next time, we can actually start doing the fun stuff. In my next post, I’ll go over some basic graphics in R and doing a quick t-test or regression using the same Albert Pujols data. Below, I again post my script from today’s tutorial:

##working with pujols pitch fx data
setwd("c:/Users/bmmillsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")
#load data
