While working with some pitch location data recently, I ran across something strange when using my new computer (with R-2.12.2 64-bit) versus my work computer (with R-2.11.1 x64). Both are 64-bit computers, but I got the new one for portability (it’s a laptop) and speed.
Anyway, I had been doing some work in the office with Pitch F/X data, just trying to map out the strike zone for all umpires in the aggregate. This was a bit slow on the computer (it’s about 400,000 called pitches for 2010) so I decided to run a program for generating multiple heat maps on my new computer back home. But there was a problem.
For data sets that are large enough (usually at least 5,000 observations with this type of data), I like to use a GAM (the ‘gam’ package) to estimate the probability of a strike call at a given 2-dimensional location. I have shown these before, and I use “filled.contour()” to turn the model output into a heat map (for those in the saber world, it’s an adaptation of Dave Allen’s presentation at the 2009 Pitch F/X summit). Choosing the right bandwidth for these models is tricky, and the visualizations can be too, but I won’t get into that here, as it should be beside the point for this post.
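For anyone who wants to follow along, the general approach looks roughly like this. This is a minimal sketch, not my actual script: the data frame name `pitches`, the column names `px`, `pz`, and `strike` (a 0/1 called-strike indicator), and the grid ranges are all placeholders.

```r
library(gam)   # Hastie's 'gam' package, with lo() loess smoothers

# Fit a binomial GAM: probability of a called strike as a smooth
# function of horizontal (px) and vertical (pz) pitch location.
# The span argument is the "bandwidth" I mention eyeballing later.
fit <- gam(strike ~ lo(px, pz, span = 0.5),
           family = binomial, data = pitches)

# Predict on a regular grid over the plate area (ranges are guesses)
xs <- seq(-2, 2, length = 50)
ys <- seq(0.5, 4.5, length = 50)
grid <- expand.grid(px = xs, pz = ys)
grid$prob <- predict(fit, newdata = grid, type = "response")

# expand.grid varies px fastest, so filling the matrix by rows of
# length(xs) lines z[i, j] up with xs[i] and ys[j] for filled.contour()
filled.contour(x = xs, y = ys,
               z = matrix(grid$prob, nrow = length(xs)),
               xlab = "Horizontal location (ft)",
               ylab = "Vertical location (ft)")
```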
The problem seems to stem from something else, and the only thing different seems to be the computer the code is run on (and the R version). Let’s start simple. Below I map out every called strike from 2010 just using a scatter plot. This plot turns out the same on both computers (keep in mind I used the EXACT same script for both), so no problem here.
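The scatter plot step is nothing fancy; a sketch of it is below. I’m assuming a data frame `pitches` with the usual Pitch F/X location columns `px` and `pz` and a `des` column holding the pitch description — adjust the names to your own data.

```r
# Plot every called strike as a point in the 2-D location space
strikes <- subset(pitches, des == "Called Strike")
plot(strikes$px, strikes$pz, pch = ".",
     xlab = "Horizontal location (ft)",
     ylab = "Vertical location (ft)",
     main = "Called strikes, 2010")
```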
When I used my code at the office in order to map the probability onto a 2-D space, I got the perfectly reasonable solution shown below:
However, when I did this on my home computer, things got weird. I’ve gone over it again and again and can’t figure out what is going on. But when I run the identical code on my personal laptop computer, I end up with the following:
So I guess my question is: Has anyone else run into this sort of problem? And would this be a problem with R or the ‘gam’ package (I’m not sure if I have different ‘gam’ package versions on my computers, but both were installed within the last 6 months or so)?
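One way to rule the version question in or out is to run the following on both machines and compare the output line by line:

```r
library(gam)                            # attach so it appears in sessionInfo()
sessionInfo()                           # R version plus versions of attached packages
installed.packages()["gam", "Version"]  # just the installed 'gam' version
```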
I think it’s pretty obvious that the latter model is not a good representation of the data. But my hope was to use my new computer to run big R projects that I have, so I want to be sure whatever I do on it is not a mess. If anyone has any suggestions, I’d be grateful for them. I hope this isn’t wasting anyone’s time, but I’m stumped.
I have the code below (you can find a smaller version of the data set at Joe Lefkowitz’s site on the sidebar). Keep in mind that I haven’t yet attempted to choose any sort of optimal bandwidth for the given data; I’m just ‘eyeballing it’. But the bandwidth does not seem to affect the flipped-axis result on my new version of R.
###########make color palette
library(RColorBrewer)
brewer.pal(11, "RdYlBu")
buylrd