MLB Baseball Pitching Matchups ~ manipulating pitch f/x data using the RMySQL package in R

[This article was first published on mind of a Markov chain » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

After downloading some pitch f/x data using my R script, we can finally have some fun.

But because the pitch f/x data is very elaborate, R can easily get overwhelmed by copying the dataset back and forth in memory, as you manipulate the data. So the natural progression is to use relational database systems.

Here, I show a simple example of using the RMySQL package to plot the pitch types of Tim Lincecum’s 2010 season, extracted from the source pitch f/x data from MLB Gameday. So first, I downloaded the 2010 data using the following script (with a weird error, I was only able to download till 5/4/10, but I can easily concatenate the dataset by running the script again for later dates):

DownloadPitchFX(fileloc = "./pitch2010Apr04May04.txt", start.date = "2010-04-04", end.date = "2010-05-04")

Then, I start MySQL from the BASH shell to create a database called “pitchfx” within the MySQL shell (this on Mac OS 10.6, not really familiar w/ Windows).

bash-3.2$ mysql -u root
mysql> create database pitchfx;

Then we connect to the MySQL database (“pitchfx”) from R from root. From there, I can write the text file to the MySQL database by specifying a table (“2010Apr04May04″) and the file location (“./pitch2010Apr04May04.txt”). It’s important to know dbWriteTable() can include certain input arguments from read.table() so you can specify how the data will be translated into a MySQL database.

library(RMySQL)

con <- dbConnect(dbDriver("MySQL"), dbname = "pitchfx", user = "root")

dbWriteTable(con, "2010Apr04May04", "./pitch2010Apr04May04.txt", overwrite = F, sep = "t", head = T)

We can get the pitcher code for Tim Lincecum by looking online to then call a SQL query to get all the pitch info from Lincecum (hopefully I will have script to get pitcher/batter codes shortly).

pitcher.Lince <- dbGetQuery(con, "SELECT * FROM 2010Apr04May04 WHERE pitcher = "453311"")

> head(pitcher.Lince)
 Year Month Day Inning           Home                 Away b s o batter
1 2010     4   5      1 Houston Astros San Francisco Giants 0 0 1 456422
2 2010     4   5      1 Houston Astros San Francisco Giants 2 3 2 430565
3 2010     4   5      1 Houston Astros San Francisco Giants 2 3 2 430565
4 2010     4   5      1 Houston Astros San Francisco Giants 2 3 2 430565
5 2010     4   5      1 Houston Astros San Francisco Giants 2 3 2 430565
6 2010     4   5      1 Houston Astros San Francisco Giants 2 3 2 430565
 pitcher stand p_throws     event                       des type      x      y
1  453311     L        R Groundout           In play, out(s)    X 115.88 138.15
2  453311     L        R Strikeout                      Ball    B 142.49 120.88
3  453311     L        R Strikeout             Called Strike    S 121.89 156.29
4  453311     L        R Strikeout                      Ball    B 144.21 121.75
5  453311     L        R Strikeout           Swinging Strike    S 136.48 148.51
6  453311     L        R Strikeout Swinging Strike (Blocked)    S  84.98 188.23
 start_speed end_speed sz_top sz_bot pfx_x pfx_z     px    pz     x0   y0
1        93.1      84.2    3.2    1.5 -7.77  6.53 -0.601 2.845 -0.913 50.0
2        92.2      83.1   3.06   1.38 -5.39 11.07 -1.235 3.635 -0.738 50.0
3        93.3      84.8   2.94   1.38 -1.83 10.63 -0.674 1.873 -1.006 50.0
4        75.6      68.4   2.95   1.27  5.77 -10.3 -1.347 3.501 -0.605 50.0
5        93.6      85.1   3.26   1.63 -5.67   9.8 -1.206 2.215 -0.823 50.0
6        86.2      79.8   3.26   1.63 -5.05  0.56  0.444 0.739 -0.861 50.0
 z0    vx0      vy0    vz0      ax     ay      az break_y break_angle
1 6.056  3.516 -136.487 -4.813 -14.322 36.387 -20.055    23.7        32.2
2 6.084  0.531 -135.161 -4.183  -9.735 35.878   -12.1    23.7        36.0
3 6.014  1.522 -136.494 -8.778  -3.397 34.418 -12.366    23.7        11.7
4 6.394 -3.209 -110.727   4.07   6.939 25.572  -44.48    23.7        -8.2
5 5.974  0.946 -137.032 -7.542 -10.617 34.766 -13.754    23.7        33.6
6 5.784  4.875 -126.238 -6.276  -8.113  26.87 -31.195    23.8        12.5
 break_length pitch_type type_confidence
1          5.6         FT            .920
2          3.7         FF            .625
3          3.3         FF            .909
4         15.0         CU            .905
5          4.1         FT            .860
6          8.4         CH            .902

It’s somewhat of a pain to go through extracting info from the data frame, but here it is:

# now separate out the dates
starts.Lince <- paste(pitcher.Lince$Month, pitcher.Lince$Day)
starts.Lince <- factor(starts.Lince, levels = unique(starts.Lince))

# grab number of pitches wrt type wrt start
pitch_type.Lince <- by(pitcher.Lince, starts.Lince, function(x)
                       by(x, x$pitch_type, function(x)
                          length(x$pitch_type)))

# construct zero matrix with proper dims, pitch type rows, dates columns
p1 <- matrix(0,
             nrow = length(unique(unlist(lapply(pitch_type.Lince, names)))),
             ncol = length(pitch_type.Lince),
             dimnames = list(unique(unlist(lapply(pitch_type.Lince, names))),
               names(pitch_type.Lince)))

# count number of pitches thrown wrt pitch type for each start
p1.num <- lapply(pitch_type.Lince, function(x) as.matrix(x, colnames = names(x)))

# input uneven named list into matrix
for (i in 1:length(p1.num)) p1[row.names(p1.num[[i]]), i] <- p1.num[[i]]

# graph it!
barplot(apply(p1, 2, function(x) x / sum(x)), legend = rownames(p1),
        main = "Tim Lincecum 2010 Pitch Type Distribution")

Where “FF” is the four seam fastball, “FT” the two seam fastball and “FA” when Gameday can’t figure out which.


Filed under: Baseball, R, RMySQL

To leave a comment for the author, please follow the link and comment on their blog: mind of a Markov chain » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)