Carson Sievert (creator of the really neat pitchRx package) and Steamer Projections posed a question about reasonable run times of the mgcv package on large data in R yesterday, and promised my Pitch F/X friends I would post here with a quick tip on speeding things up. Most of this came from Google searching, so I do not claim to be the original creator of the code used below.
A while back, I posted on using Generalized Additive Models instead of the basic loess package in R to fit Pitch F/X data, specifically with a binary dependent variable (for example, probability of some pitch being called a strike). Something went weird with the gam package, so I switched over to the mgcv package, which has now provided the basis of analysis for my 2 most recent academic publications. I like to fancy this the best way to work with binary F/X data–but I am biased. Note that the advantages of the mgcv package can also be leveraged in fitting other data distributions besides the binomial. This includes the negative binomial distribution, which can be more helpful for data that are zero-inflated (probably most of the other binomial data we want to model in baseball pitches).
One advantage of mgcv is that it uses generalized cross-validation in order to estimate the smoothing parameter. Why is this important? Well, because we have different sample sizes when making comparisons–for example, across umpire strike zones–and we also have different variances, we might not want to fit each one the same. Additionally, smoothing by looking at the plot until it “looks good” can create biases. Therefore, this allows a more objective way to fit the data. I also like the ability to fit interactions of the vertical and horizontal location variables. If you fit them separately and additively, you end up missing out on some of the asymmetries of the strike zone. These ultimately tend to be pretty important, with the zone tipping down and away from the batter (See the Appendix for comparison code, see below for picture of tilt; figure from this paper).
One thing that I did note on Twitter is that for the binary models, a larger sample size tends to be necessary. My rule of thumb–by no means a hard line–is that about 5,000 pitches are necessary to make a reasonable model of the strike zone. The is close to the bare minimum without having things go haywire and look funky like the example below, but depending on the data you might be able to fit a model with fewer observations.
Also, if you know a good way to integrate regression to the mean in some sort of Bayesian fashion in these models, that might help (simply some weighted average of all umpire calls and the pitches called by Umpire X that does not have enough experience yet).
Because R tends to work on a single thread, instead of using all the cores on your computer, the models can become rather cumbersome. Believe me, I know. For a while, I was fitting models with 1.3 million pitches, 125 dummy fixed effects, and some 30 other control variables at a time for this paper. It took anywhere from 1-3 hours, depending on whether my computer felt happy that day–and I kept forgetting to include a variable here, change something there, etc.
OK, so parallelization. It’s actually incredibly easy in the mgcv package. You first want to know if your computer has multiple cores, and if so, how many. You can do this through the code below (note that I first load all the necessary packages for what I want to do):
That last ‘cl’ just tells you how many cores you will be using. Note that this leaves one of your cores ready for processing other things. You can use all of them, but it could end up keeping you from being able to do anything else on your computer while your model is running. You can also use less. Simply change the second line from ‘-1′ to ‘-2′, or whatever you want to do. From here, mgcv has a single command for using multiple cores. You’ll want to use the ‘cl’ designation as the cores to use.
One should also note that, in R, large data sets and massive matrix inversions take up a significant amount of RAM. When I came to Florida I had to convince our IT people that I needed at least 32 GB of RAM, specifically to run the models in the paper linked above. Running the single model got me up to 8-10 GB, while doing multiple models in a single instance in R subsequently maxed me out at around 28 GB before I closed R and opened another instance. This is a limitation that can be addressed to some extent with mgcv, but if you’re not running every single pitch available in the F/X database, you probably won’t have to worry about this.
In case you do, mgcv also has a nice option that breaks the data up into chunks and has a much lower memory footprint. It is called bam() and works just as the gam() function does, but allows analysis on larger data sets when you have more limited memory by breaking it into chunks. The help file claims that it can work much faster on its own in addition to saving memory. And–most relevant to this post–this is the function that includes the option to parallelize your analyses. The code is exactly the same with the extra command using our ‘cl’ defined above. Note that I use the combined smooth and limit the degrees of freedom of the smooth to 50. Those are, of course, choices of the modeler and dependent on the type of data you are analyzing.
Boom. That’s it. You can also consider fitting smooths based on handedness. You can do one for each type of batter by breaking up the data and the modeling, or you can do the following below:
And of course you can add covariates to your model that you want to estimate parametrically, such as the impact of count or pitch type:
With the model, creating figures is as easy as using the predict() function and using code as I have shown here before. And, thanks to Carson, much of the figure production is now automated in the pitchRx package.
Note that much of my reading about this package comes from an excellent book by its creator, Simon Wood, called Generalized Additive Models: An Introduction with R. If these models are interesting to you, this is a must have resource.
Appendix: The reason I use the interaction term is that the UBRE score is significantly better by doing so, as suggested in the previously cited text. The code to compare the two models is also included below. Note that your variable names and data name may differ, so change accordingly:
###Model with separate smooths
fit <- bam(strike_call ~ s(px, k=51) + s(pz, k=51), method="GCV.Cp", data=called,
###Model with combined smooth
fit.add <- bam(strike_call ~ s(px, pz, k=51), method="GCV.Cp",
data=called, familiy=binomial(link="logit"), cluster=cl)
###combined smooth UBRE score is lower
###compare models with Wald test