# MLB Baseball Pitching Matchups ~ manipulating pitch f/x data using the RMySQL package in R

May 31, 2010
By

(This article was first published on mind of a Markov chain » R, and kindly contributed to R-bloggers)

But because the pitch f/x data is very elaborate, R can easily get overwhelmed by copying the dataset back and forth in memory, as you manipulate the data. So the natural progression is to use relational database systems.

Here, I show a simple example of using the RMySQL package to plot the pitch types of Tim Lincecum’s 2010 season, extracted from the source pitch f/x data from MLB Gameday. So first, I downloaded the 2010 data using the following script (with a weird error, I was only able to download till 5/4/10, but I can easily concatenate the dataset by running the script again for later dates):

DownloadPitchFX(fileloc = "./pitch2010Apr04May04.txt", start.date = "2010-04-04", end.date = "2010-05-04")

Then, I start MySQL from the BASH shell to create a database called “pitchfx” within the MySQL shell (this on Mac OS 10.6, not really familiar w/ Windows).

bash-3.2$mysql -u root mysql> create database pitchfx;  Then we connect to the MySQL database (“pitchfx”) from R from root. From there, I can write the text file to the MySQL database by specifying a table (“2010Apr04May04″) and the file location (“./pitch2010Apr04May04.txt”). It’s important to know dbWriteTable() can include certain input arguments from read.table() so you can specify how the data will be translated into a MySQL database. library(RMySQL) con <- dbConnect(dbDriver("MySQL"), dbname = "pitchfx", user = "root") dbWriteTable(con, "2010Apr04May04", "./pitch2010Apr04May04.txt", overwrite = F, sep = "t", head = T)  We can get the pitcher code for Tim Lincecum by looking online to then call a SQL query to get all the pitch info from Lincecum (hopefully I will have script to get pitcher/batter codes shortly). pitcher.Lince <- dbGetQuery(con, "SELECT * FROM 2010Apr04May04 WHERE pitcher = "453311"") > head(pitcher.Lince) Year Month Day Inning Home Away b s o batter 1 2010 4 5 1 Houston Astros San Francisco Giants 0 0 1 456422 2 2010 4 5 1 Houston Astros San Francisco Giants 2 3 2 430565 3 2010 4 5 1 Houston Astros San Francisco Giants 2 3 2 430565 4 2010 4 5 1 Houston Astros San Francisco Giants 2 3 2 430565 5 2010 4 5 1 Houston Astros San Francisco Giants 2 3 2 430565 6 2010 4 5 1 Houston Astros San Francisco Giants 2 3 2 430565 pitcher stand p_throws event des type x y 1 453311 L R Groundout In play, out(s) X 115.88 138.15 2 453311 L R Strikeout Ball B 142.49 120.88 3 453311 L R Strikeout Called Strike S 121.89 156.29 4 453311 L R Strikeout Ball B 144.21 121.75 5 453311 L R Strikeout Swinging Strike S 136.48 148.51 6 453311 L R Strikeout Swinging Strike (Blocked) S 84.98 188.23 start_speed end_speed sz_top sz_bot pfx_x pfx_z px pz x0 y0 1 93.1 84.2 3.2 1.5 -7.77 6.53 -0.601 2.845 -0.913 50.0 2 92.2 83.1 3.06 1.38 -5.39 11.07 -1.235 3.635 -0.738 50.0 3 93.3 84.8 2.94 1.38 -1.83 10.63 -0.674 1.873 -1.006 50.0 4 75.6 68.4 2.95 1.27 5.77 -10.3 -1.347 3.501 -0.605 50.0 5 93.6 85.1 3.26 1.63 -5.67 9.8 -1.206 2.215 -0.823 50.0 6 86.2 79.8 3.26 1.63 -5.05 0.56 0.444 0.739 -0.861 50.0 z0 vx0 vy0 vz0 ax ay az break_y break_angle 1 6.056 3.516 -136.487 -4.813 -14.322 36.387 -20.055 23.7 32.2 2 6.084 0.531 -135.161 -4.183 -9.735 35.878 -12.1 23.7 36.0 3 6.014 1.522 -136.494 -8.778 -3.397 34.418 -12.366 23.7 11.7 4 6.394 -3.209 -110.727 4.07 6.939 25.572 -44.48 23.7 -8.2 5 5.974 0.946 -137.032 -7.542 -10.617 34.766 -13.754 23.7 33.6 6 5.784 4.875 -126.238 -6.276 -8.113 26.87 -31.195 23.8 12.5 break_length pitch_type type_confidence 1 5.6 FT .920 2 3.7 FF .625 3 3.3 FF .909 4 15.0 CU .905 5 4.1 FT .860 6 8.4 CH .902  It’s somewhat of a pain to go through extracting info from the data frame, but here it is: # now separate out the dates starts.Lince <- paste(pitcher.Lince$Month, pitcher.Lince$Day) starts.Lince <- factor(starts.Lince, levels = unique(starts.Lince)) # grab number of pitches wrt type wrt start pitch_type.Lince <- by(pitcher.Lince, starts.Lince, function(x) by(x, x$pitch_type, function(x)
length(x\$pitch_type)))

# construct zero matrix with proper dims, pitch type rows, dates columns
p1 <- matrix(0,
nrow = length(unique(unlist(lapply(pitch_type.Lince, names)))),
ncol = length(pitch_type.Lince),
dimnames = list(unique(unlist(lapply(pitch_type.Lince, names))),
names(pitch_type.Lince)))

# count number of pitches thrown wrt pitch type for each start
p1.num <- lapply(pitch_type.Lince, function(x) as.matrix(x, colnames = names(x)))

# input uneven named list into matrix
for (i in 1:length(p1.num)) p1[row.names(p1.num[[i]]), i] <- p1.num[[i]]

# graph it!
barplot(apply(p1, 2, function(x) x / sum(x)), legend = rownames(p1),
main = "Tim Lincecum 2010 Pitch Type Distribution")


Where “FF” is the four seam fastball, “FT” the two seam fastball and “FA” when Gameday can’t figure out which.

Filed under: Baseball, R, RMySQL