MLB Baseball Pitching Matchups ~ manipulating pitch f/x data using the RMySQL package in R

May 31, 2010
By

(This article was first published on mind of a Markov chain » R, and kindly contributed to R-bloggers)

After downloading some pitch f/x data using my R script, we can finally have some fun.

But because the pitch f/x data is very elaborate, R can easily get overwhelmed by copying the dataset back and forth in memory, as you manipulate the data. So the natural progression is to use relational database systems.

Here, I show a simple example of using the RMySQL package to plot the pitch types of Tim Lincecum’s 2010 season, extracted from the source pitch f/x data from MLB Gameday. So first, I downloaded the 2010 data using the following script (with a weird error, I was only able to download till 5/4/10, but I can easily concatenate the dataset by running the script again for later dates):

DownloadPitchFX(fileloc = "./pitch2010Apr04May04.txt", start.date = "2010-04-04", end.date = "2010-05-04")

Then, I start MySQL from the BASH shell to create a database called “pitchfx” within the MySQL shell (this on Mac OS 10.6, not really familiar w/ Windows).

bash-3.2$ mysql -u root
mysql> create database pitchfx;

Then we connect to the MySQL database (“pitchfx”) from R from root. From there, I can write the text file to the MySQL database by specifying a table (“2010Apr04May04″) and the file location (“./pitch2010Apr04May04.txt”). It’s important to know dbWriteTable() can include certain input arguments from read.table() so you can specify how the data will be translated into a MySQL database.

library(RMySQL)

con <- dbConnect(dbDriver("MySQL"), dbname = "pitchfx", user = "root")

dbWriteTable(con, "2010Apr04May04", "./pitch2010Apr04May04.txt", overwrite = F, sep = "t", head = T)

We can get the pitcher code for Tim Lincecum by looking online to then call a SQL query to get all the pitch info from Lincecum (hopefully I will have script to get pitcher/batter codes shortly).

pitcher.Lince <- dbGetQuery(con, "SELECT * FROM 2010Apr04May04 WHERE pitcher = "453311"")

> head(pitcher.Lince)
 Year Month Day Inning           Home                 Away b s o batter
1 2010     4   5      1 Houston Astros San Francisco Giants 0 0 1 456422
2 2010     4   5      1 Houston Astros San Francisco Giants 2 3 2 430565
3 2010     4   5      1 Houston Astros San Francisco Giants 2 3 2 430565
4 2010     4   5      1 Houston Astros San Francisco Giants 2 3 2 430565
5 2010     4   5      1 Houston Astros San Francisco Giants 2 3 2 430565
6 2010     4   5      1 Houston Astros San Francisco Giants 2 3 2 430565
 pitcher stand p_throws     event                       des type      x      y
1  453311     L        R Groundout           In play, out(s)    X 115.88 138.15
2  453311     L        R Strikeout                      Ball    B 142.49 120.88
3  453311     L        R Strikeout             Called Strike    S 121.89 156.29
4  453311     L        R Strikeout                      Ball    B 144.21 121.75
5  453311     L        R Strikeout           Swinging Strike    S 136.48 148.51
6  453311     L        R Strikeout Swinging Strike (Blocked)    S  84.98 188.23
 start_speed end_speed sz_top sz_bot pfx_x pfx_z     px    pz     x0   y0
1        93.1      84.2    3.2    1.5 -7.77  6.53 -0.601 2.845 -0.913 50.0
2        92.2      83.1   3.06   1.38 -5.39 11.07 -1.235 3.635 -0.738 50.0
3        93.3      84.8   2.94   1.38 -1.83 10.63 -0.674 1.873 -1.006 50.0
4        75.6      68.4   2.95   1.27  5.77 -10.3 -1.347 3.501 -0.605 50.0
5        93.6      85.1   3.26   1.63 -5.67   9.8 -1.206 2.215 -0.823 50.0
6        86.2      79.8   3.26   1.63 -5.05  0.56  0.444 0.739 -0.861 50.0
 z0    vx0      vy0    vz0      ax     ay      az break_y break_angle
1 6.056  3.516 -136.487 -4.813 -14.322 36.387 -20.055    23.7        32.2
2 6.084  0.531 -135.161 -4.183  -9.735 35.878   -12.1    23.7        36.0
3 6.014  1.522 -136.494 -8.778  -3.397 34.418 -12.366    23.7        11.7
4 6.394 -3.209 -110.727   4.07   6.939 25.572  -44.48    23.7        -8.2
5 5.974  0.946 -137.032 -7.542 -10.617 34.766 -13.754    23.7        33.6
6 5.784  4.875 -126.238 -6.276  -8.113  26.87 -31.195    23.8        12.5
 break_length pitch_type type_confidence
1          5.6         FT            .920
2          3.7         FF            .625
3          3.3         FF            .909
4         15.0         CU            .905
5          4.1         FT            .860
6          8.4         CH            .902

It’s somewhat of a pain to go through extracting info from the data frame, but here it is:

# now separate out the dates
starts.Lince <- paste(pitcher.Lince$Month, pitcher.Lince$Day)
starts.Lince <- factor(starts.Lince, levels = unique(starts.Lince))

# grab number of pitches wrt type wrt start
pitch_type.Lince <- by(pitcher.Lince, starts.Lince, function(x)
                       by(x, x$pitch_type, function(x)
                          length(x$pitch_type)))

# construct zero matrix with proper dims, pitch type rows, dates columns
p1 <- matrix(0,
             nrow = length(unique(unlist(lapply(pitch_type.Lince, names)))),
             ncol = length(pitch_type.Lince),
             dimnames = list(unique(unlist(lapply(pitch_type.Lince, names))),
               names(pitch_type.Lince)))

# count number of pitches thrown wrt pitch type for each start
p1.num <- lapply(pitch_type.Lince, function(x) as.matrix(x, colnames = names(x)))

# input uneven named list into matrix
for (i in 1:length(p1.num)) p1[row.names(p1.num[[i]]), i] <- p1.num[[i]]

# graph it!
barplot(apply(p1, 2, function(x) x / sum(x)), legend = rownames(p1),
        main = "Tim Lincecum 2010 Pitch Type Distribution")

Where “FF” is the four seam fastball, “FT” the two seam fastball and “FA” when Gameday can’t figure out which.


Filed under: Baseball, R, RMySQL

To leave a comment for the author, please follow the link and comment on his blog: mind of a Markov chain » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , ,

Comments are closed.