Calculate umpire strikezone sizes

January 11, 2012

(This article was first published on pitchR/x, and kindly contributed to R-bloggers)

Head on over to THT, and you should see a post by yours truly about calculating the size of umpire strikezones. Here I will open source the code for how to do that. The hard part though is that in order to make use out of the code, you need to have every single pitch from 2011 available at your disposal. If you have that, then you can use the following code to generate umpire strikezone sizes. The script makes very heavy use of the “plyr” package (packages load libraries of extra functions and/or data into R), which may be my favorite R package. It was designed by Hadley Wickham, who is a celebrity of sorts in the R community. The plyr package is great for data manipulation and group-wise summaries. You can install it with:


You also need the “mgcv” package for creating the model that is used in the calculations.
You can find the code below in blue, and comments in green. Hope it’s readable:
#load library

#initalize — add necessary variables. Best way is with vectorized ifelse function. 
pitcher$babipn = ifelse(pitcher$type==”X” & pitcher$event %in% c(“Single”, “Double”, “Triple”), 1, 0)
pitcher$inplay = ifelse(pitcher$type==”X” & pitcher$event != “Home Run”, 1, 0)
pitcher$swing = ifelse(pitcher$type==’X’ | pitcher$des %in% c(‘Foul’, ‘Foul Tip’, ‘Foul (Runner Going)’, 
            ‘Swinging Strike’, ‘Swinging Strike (Blocked)’), 1, 0)
pitcher$walk = ifelse(pitcher$des %in% c(“Ball”, “Ball In Dirt”, “Intent Ball”) 
            & pitcher$event %in% c(“Intent Walk”, “Walk”) & pitcher$ball==3, 1, 0)
pitcher$strikeout = ifelse(pitcher$des %in% c(“Called Strike”, “Foul Tip”, “Strike”,
            “Swinging Strike”, “Swinging Strike (Blocked)”)
            & pitcher$event %in% c(“Strikeout”,”Strikeout – DP”) & pitcher$strike==2, 1, 0)
pitcher$homerun = ifelse(pitcher$des == “In play, run(s)” & pitcher$event == ‘Home Run’, 1, 0)
pitcher$full_name = paste(pitcher$first, pitcher$last)

#find number of plate appearances for each umpire, don’t know why I named it atbats
#calculate kwERA, FIP, babip, and swing rate
pitcher = ddply(pitcher, .(ump_id), transform, atbats = length(unique(ab_id)))
pitcher = ddply(pitcher, .(ump_id), transform, fip = ((13 * sum(homerun)
              + 3 * sum(walk) – 1.93 * sum(strikeout)) / (.23 * atbats)) + 3.10,
              kwERA = 5.24 – (12 * (sum(strikeout) – sum(walk)) / atbats),
              babip = sum(babipn) / sum(inplay), swing = mean(swing))

#subset data to only look at pitches where umpire makes a call
#this won’t affect the stats we just calculated
pitcher = subset(pitcher, des %in% c(“Automatic Ball”, “Ball”,”Ball In Dirt”, “Called Strike”,
                                   “Intent Ball”, “Pitchout”, “Strike”))
#add in variable to describe called strike
pitcher$cs = ifelse(pitcher$des == ‘Called Strike’, 1, 0)

#find number of pitches seen by each umpire
pitcher = ddply(pitcher, .(ump_id), transform, n = length(px))

#only look at umpires with greater than 3000 pitches
pitcher = subset(pitcher, n > 3000)

#define function to calculate strikezone area
csarea = function(x) {require(‘mgcv’)
  #create model with mgcv package, use smoothing                  
  model <- gam(cs ~ s(px) + s(pz), data = x, family = binomial(link=’logit’))
  mygrid <- expand.grid(x=seq(-4, 4, length=50), y=seq(-2, 6, length=50))
  #find predicted called strike rates for a bunch of tiny bins                  
mygrid$z <- as.numeric(predict(model, data.frame(px=mygrid$x, pz=mygrid$y), type=’response’))
thresh <- 0.5
  #find number of tiny bins where predicted rate is at least .5                  
nbins=length(mygrid$z[mygrid$z >= thresh])
  #multiply number of tiny bins by size of each bin                  
areas=nbins * 0.02665556
  #add in fields to data.frame                  
x$area <-rep(areas, length(x[,1]))
x$nbins <- rep(nbins, length(x[,1]))

#call + evaluate csarea() on each subset of umpire data
pitcher=ddply(pitcher, .(ump_id), “csarea”)

#create a summary dataset for each umpire, then sort by area
sum.pitch=ddply(pitcher, .(ump_id), summarize, area=mean(area), full_name=full_name[1],
                fip=mean(fip), kwERA=mean(kwERA), babip=mean(babip), swing=mean(swing))
sum.pitch=arrange(sum.pitch, area)

To leave a comment for the author, please follow the link and comment on their blog: pitchR/x. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)