MLB Baseball Pitching Matchups ~ downloading pitch f/x data using the XML package in R [updatedx6]

[This article was first published on mind of a Markov chain » R, and kindly contributed to R-bloggers.]

Update x6 (Jul 27): so I guess people want pitch counts. The data at MLB seems to give only the final ball/strike/out count of each at-bat, plus the result of each individual pitch (strike, ball, or in play). Of course you can combine them to get the count before every pitch. WordPress comments strip out the HTML needed to display code properly, so I am posting the code below instead. pitcher.Lince is the data frame downloaded from MLB, restricted to Tim Lincecum’s pitches, but the code should work in general. Of course, if you have a lot of data, it will take time.

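# number the at-bats: a new at-bat starts whenever the batter id changes
pitcher.Lince$atbat <- cumsum(c(0, diff(pitcher.Lince$batter) != 0))

# get counts: row i holds the strikes (S), balls (B) and balls in play (X)
# recorded before pitch i of the at-bat
aho <- tapply(pitcher.Lince$type, pitcher.Lince$atbat, function(x) {
  x <- as.character(x)  # index columns by name even if type was read as a factor
  count <- matrix(0, length(x), 3,
                  dimnames = list(NULL, c("S","B","X")))
  for (i in seq_along(x)) {
    if (i >= 2) {
      count[i,] <- count[i-1,]
      count[i,x[i-1]] <- count[i-1,x[i-1]] + 1
      # a foul with two strikes does not add a third strike
      if (x[i-1] == "S" && count[i,"S"] > 2) count[i,"S"] <- 2
    }
  }
  count
})

# stack the per-at-bat matrices in at-bat order
aho2 <- data.frame(do.call(rbind, aho))

pitcher.Lince <- data.frame(pitcher.Lince, aho2)
rm(aho, aho2)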
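
To see what the count columns look like, here is a minimal sketch on a made-up data frame. It is a vectorised variant using cumsum rather than the loop above, and it only tracks balls and strikes. The batter ids, the pitch results and the helper name CountBefore are all invented for illustration.

```r
# Toy data: two at-bats; batter ids and pitch results are made up
toy <- data.frame(batter = c(101, 101, 101, 101, 202, 202),
                  type   = c("B", "S", "S", "S", "S", "X"),
                  stringsAsFactors = FALSE)

# a new at-bat starts whenever the batter id changes
toy$atbat <- cumsum(c(0, diff(toy$batter) != 0))

# CountBefore (hypothetical helper): balls and strikes before each pitch
CountBefore <- function(x) {
  x <- as.character(x)
  balls   <- cumsum(c(0, head(x == "B", -1)))
  # cap strikes at 2, since a foul with two strikes adds nothing
  strikes <- pmin(cumsum(c(0, head(x == "S", -1))), 2)
  data.frame(B = balls, S = strikes)
}

counts <- do.call(rbind, lapply(split(toy$type, toy$atbat), CountBefore))
toy <- cbind(toy, counts)
print(toy)
```

The fourth pitch of the first at-bat is thrown on a 1-2 count, and the count resets to 0-0 when the second batter comes up.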

Update x5 (Jun 13): More bug fixes. Now the default value for end.date is start.date.

Update x4 (Jun 05): More bug fixes (especially for 2008). I’ve also realized that it is advisable to run the script each time for a new day instead of a range of dates, which really bloats memory usage.

Update x3 (Jun 01): New code (version 0.4) fixed some bugs, grabs team info and checks game.type to choose type of game (regular season, world series, etc.).

Update x2 (May 27): New code (version 0.3) has replaced old buggy one (version 0.2).

Update: Please see the comments below for some problems with the code. I will post an update shortly(?).

MLB Gameday collects massive amounts of data from each at-bat of each Major League Baseball game. Gameday also doubles as a web application for watching discrete events in action, for those who don’t have a TV, are too lazy to buy MLB.tv, or are mad at MLB’s blackout policies. Using this detailed data, one can do a myriad of interesting analyses. Gameday later added pitch f/x, which includes data on every pitch and its characteristics (speed, location, release point, break, etc.). There are existing tools to use this data:

  1. pitch f/x online tool to graph w/o downloading, by Josh Kalk
  2. online tool by Dan Brooks
  3. R tool to download pitch f/x data by Erik Iverson
  4. Perl tool to download pitch f/x data by Mike Fast
  5. download via Microsoft tools by Sean Smith
  6. Gameday glossary by Alan Nathan

I’ve decided to construct my own R code that is more comprehensive than Erik’s. The current version simply downloads the data from the website containing the data, given a range of dates. Ultimately, I would like the user to be able to put it in a database, to query data for different pitchers, batters, teams, matchups, etc.

The only package that is necessary is XML. I used three functions of importance:

xmlInternalTreeParse
xpathSApply
htmlParse

The first function takes the URL of an XML file and parses it into an XML document object. The second function takes that object and searches it with the XPath language to extract certain elements. The example below finds every game tag (//game) that has a gameday_link attribute ([@gameday_link]) and extracts the attribute’s value(s) (xmlGetAttr) into a vector.

URL.scoreboard <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/miniscoreboard.xml"
XML.scoreboard <- xmlInternalTreeParse(URL.scoreboard)
parse.scoreboard <- xpathSApply(XML.scoreboard, "//game[@gameday_link]", xmlGetAttr, "gameday_link")
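
The same XPath query can be tried without hitting the MLB server. Below is a minimal sketch that parses an inline string instead of a URL (asText = TRUE); the XML fragment is made up and only imitates the shape of miniscoreboard.xml.

```r
library(XML)

# Made-up scoreboard fragment: two games carry a gameday_link, one does not
xml.text <- '<games>
  <game id="game-a" gameday_link="2009_05_02_sdnmlb_lanmlb_1"/>
  <game id="game-b" gameday_link="2009_05_02_aaamlb_bbbmlb_1"/>
  <game id="game-c"/>
</games>'

doc <- xmlInternalTreeParse(xml.text, asText = TRUE)

# keep only <game> nodes that have the attribute, then pull its value
links <- xpathSApply(doc, "//game[@gameday_link]", xmlGetAttr, "gameday_link")
print(links)
```

The third game is skipped because the predicate [@gameday_link] filters out nodes without that attribute, so links is a plain character vector of length two.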

The third function parses HTML rather than XML: given a directory URL, it reads the HTML index page that the server returns for it. I use this to check that certain files exist (the data is imperfect: some game directories exist with no information filled in, which would otherwise crash the code). You can also use xpathSApply on the HTML object.

URL.game <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/gid_2009_05_02_sdnmlb_lanmlb_1/"
HTML.game <- htmlParse(URL.game)
parse.game.exists <- xpathSApply(HTML.game, "//a[@*]", xmlGetAttr, "href")
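
As a rough offline illustration of that existence check, the sketch below parses a made-up directory listing from a string and tests whether game.xml is among the links. The HTML is an assumption about what such an index page looks like, not a copy of the real one.

```r
library(XML)

# Made-up directory index page for one game
html.text <- '<html><body>
<a href="../">Parent Directory</a>
<a href="game.xml">game.xml</a>
<a href="inning/">inning/</a>
</body></html>'

listing <- htmlParse(html.text, asText = TRUE)

# collect every link target on the page
hrefs <- xpathSApply(listing, "//a[@href]", xmlGetAttr, "href")

# only go on to parse game.xml if the listing actually contains it
has.game.xml <- "game.xml" %in% hrefs
print(has.game.xml)
```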

The code grabs the attributes picked by the user (namely pitch and atbat attributes) iteratively for the selected dates, and writes them directly to a tab-delimited text file.
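
The grab-and-write step can be tried offline on a small example. The sketch below assumes a made-up inning fragment with two at-bats, repeats each at-bat's attributes once per pitch, writes everything to a temporary tab-delimited file and reads it back. The player ids, speeds and the reduced attribute lists are invented for illustration.

```r
library(XML)

# Made-up inning fragment: two at-bats, three pitches in total
inning.text <- '<inning num="1">
  <top>
    <atbat b="1" s="0" o="1" batter="100001" pitcher="200001" event="Groundout">
      <pitch des="Ball" type="B" start_speed="92.4"/>
      <pitch des="In play, out(s)" type="X" start_speed="84.1"/>
    </atbat>
    <atbat b="0" s="0" o="1" batter="100002" pitcher="200001" event="Single">
      <pitch des="In play, no out" type="X" start_speed="91.8"/>
    </atbat>
  </top>
</inning>'
doc <- xmlInternalTreeParse(inning.text, asText = TRUE)

grab.atbat <- c("batter", "pitcher", "event")
grab.pitch <- c("des", "type", "start_speed")
n.col <- length(c(grab.atbat, grab.pitch))

# one row of attributes per at-bat, then repeat each row once per pitch
atbats <- getNodeSet(doc, "//atbat")
n.pitches <- sapply(atbats, function(x) sum(names(xmlChildren(x)) == "pitch"))
atbat.attrs <- t(sapply(atbats, function(x) xmlAttrs(x)[grab.atbat]))
atbat.attrs <- atbat.attrs[rep(seq_len(nrow(atbat.attrs)), times = n.pitches), ,
                           drop = FALSE]

# one column per requested pitch attribute
pitch.attrs <- sapply(grab.pitch, function(a)
                      xpathSApply(doc, "//pitch", xmlGetAttr, a))

# header first, then the data row by row (write() fills by column, hence t())
out <- tempfile(fileext = ".txt")
write(c(grab.atbat, grab.pitch), file = out, ncolumns = n.col, sep = "\t")
write(t(cbind(atbat.attrs, pitch.attrs)), file = out, ncolumns = n.col,
      append = TRUE, sep = "\t")

pitches <- read.delim(out, stringsAsFactors = FALSE)
print(pitches)
```

Each row of pitches is one pitch, with the batter, pitcher and event of its at-bat copied alongside.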

I haven’t really tested my code over a range of dates, so it’s not fully robust (though I fixed a couple of defects). The code also takes a long time due to the volume of the data (~20 seconds for a single day on my 2.5 GHz machine).

Code:

# DownloadPitchFX.R
# downloads the massive MLB Gameday data.
# Author: apeecape
# Email: achikule at gmail dot com
# Updated: Jun 13 2010
# Version 0.4
# Version History
# 0.5 ~ grab player data, both pitchers and batters, ability to pick team
# 0.4 ~ get team data, and ability to grab team info, checks to see if regular season
# 0.3 ~ updated so 2010 works, fixed some bugs, and saves as tab delimited file
# 0.2 ~ inputs are start and end dates
# 0.1 ~ grab Pitch f/x data from MLB Gameday, specify date ranges (takes half a minute for a day's worth of data on my 2.5Ghz machine)

# Future Versions:
# ~ ability to pick pitchers, batters, teams
# - ability to grab matchups
# - better searching instead of tediously parsing through each XML file
# ~ connect to mysql database
# ~ don't overheat computer!
# ~ document Gameday Code

# downloading pitch f/x data from MLB website
# Get data from http://gd2.mlb.com/components/game/mlb/
# XML package http://www.omegahat.org/RSXML/shortIntro.html
# Perl script of same application by Mike Fast:
# http://fastballs.files.wordpress.com/2007/09/hack_28_parser_mikefast_test_pl.txt
# Less general R code from Erik Iverson of Blogistic Reflections:
# http://blogisticreflections.wordpress.com/2009/10/04/using-r-to-analyze-baseball-games-in-real-time/
# listing of pitch f/x tools by Baseball Analysts
# http://baseballanalysts.com/archives/2010/03/how_can_i_get_m.php
# downloadable pitch f/x database from Darrell Zimmerman
# http://www.wantlinux.net/category/baseball-data/

# I think gameday data starts 2005
# I think enhanced gameday (pitch fx) has all of 2009, most of 2008, some 2007, tiny bit 2006

# required libraries:
library(XML)

# code for <game type> in game.xml (input game.type in code)
# "S" ~ spring training, "R" ~ regular season, "D" ~ Division Series
# "L" ~ League Championship Series "W" ~ World Series

# code for <game gameday_sw> in game.xml
# http://sports.dir.groups.yahoo.com/group/RetroSQL/message/320
# "N" ~ missing, no pitch info
# "Y" ~ standard w/ pitch locations
# "E" ~ w/ pitch f/x
# "P" ~ for 2010, whatever that's supposed to mean

# code for teams

# code for players

# code for gameday

# code for pitch type

# code for atbat type

# checks for:
# gameday type
# home, away
# player, batter, pitch type

# -----------------------------------------------------------

DownloadPitchFX <- function(fileloc = "./pitchfx.txt",
                            start.date = "2009-05-02", end.date = start.date,
                            URL.base = "http://gd2.mlb.com/components/game/mlb/",
                            game.type = "R",
                            grab.pitch = c("des", "type", "x", "y",
                              "start_speed", "end_speed",
                              "sz_top", "sz_bot", "pfx_x", "pfx_z", "px", "pz",
                              "x0", "y0", "z0", "vx0", "vy0", "vz0", "ax", "ay", "az",
                              "break_y", "break_angle", "break_length", "pitch_type",
                              "type_confidence"),
                            grab.atbat = c("b", "s", "o", "batter", "pitcher", "b_height",
                              "stand", "p_throws", "event")) {
  # write initial variables on file
  meta <- c("Year", "Month", "Day", "Inning", "Home", "Away")
  write(c(meta, grab.atbat, grab.pitch), file = fileloc,
        ncolumns = length(c(grab.atbat, grab.pitch)) + length(meta), sep = "\t")

  # transfer date info (count whole calendar days so DST changes cannot skew the range)
  start.date <- as.Date(start.date); end.date <- as.Date(end.date)
  diff.date <- as.numeric(difftime(end.date, start.date, units = "days"))
  date.range <- as.POSIXlt(seq(start.date, by = "day",
                               length.out = 1 + diff.date))

  for (i in 1:(diff.date+1)) {
    year <- date.range[i]$year + 1900
    month <- date.range[i]$mon + 1
    day <- date.range[i]$mday
    URL.date <- paste(URL.base, "year_", year, "/",
                      ifelse(month >= 10, "month_", "month_0"), month, "/",
                      ifelse(day >= 10, "day_", "day_0"), day, "/", sep = "")

    # grab matchups for today
    ##     URL.scoreboard <- paste(URL.date, "miniscoreboard.xml", sep = "")
    ##     XML.scoreboard <- xmlInternalTreeParse(URL.scoreboard)
    ##     parse.scoreboard <- xpathSApply(XML.scoreboard, "//game[@gameday_link]",
    ##                                     xmlGetAttr, "gameday_link")
    HTML.day <- htmlParse(URL.date)
    parse.day <- xpathSApply(HTML.day, "//a[@*]", xmlGetAttr, "href")
    parse.day <- parse.day[grep("^gid_", parse.day)]

    # if games exists today
    if (length(parse.day) >= 1) {

      # for each game
      for (game in 1:length(parse.day)) {
        print(game)
        URL.game <- paste(URL.date, parse.day[game], sep = "")
        HTML.game <- htmlParse(URL.game)
        parse.game.exists <- xpathSApply(HTML.game, "//a[@*]", xmlGetAttr, "href")

        # if game.xml exists
        if ("game.xml" %in% parse.game.exists) {

          # grab game type (regular season, etc.) and gameday type (pitch f/x, etc.)
          XML.game <- xmlInternalTreeParse(paste(URL.game, "game.xml", sep = ""))
          parse.game <- sapply(c("type", "gameday_sw"), function (x)
                               xpathSApply(XML.game, "//game[@*]", xmlGetAttr, x))

          # if proper game type: "R" ~ regular season, "S" ~ spring, "D" ~ division series
          # "L" ~ league championship series, "W" ~ world series
          if (parse.game['type'] == game.type) {
            # grab team names
            parse.teams <- sapply(c("abbrev"), function (x)
                                  xpathSApply(XML.game, "//team[@*]", xmlGetAttr, x))
            home <- parse.teams[1]; away <- parse.teams[2]

            # if pitch f/x data exists
            if (parse.game["gameday_sw"] == "E" | parse.game["gameday_sw"] == "P") {

              # grab number of innings played
              HTML.Ninnings <- htmlParse(paste(URL.game, "inning/", sep = ""))
              parse.Ninnings <- xpathSApply(HTML.Ninnings, "//a[@*]", xmlGetAttr, "href")

              # check that inning data exists by requiring more than one inning file
              if (length(grep("^inning_[0-9]", parse.Ninnings)) > 1) {

                # for each inning
                for (inning in 1:length(grep("^inning_[0-9]", parse.Ninnings))) {

                  # grab inning info
                  URL.inning <- paste(URL.game, "inning/", "inning_", inning,
                                      ".xml", sep = "")
                  XML.inning <- xmlInternalTreeParse(URL.inning)
                  parse.atbat <- xpathSApply(XML.inning, "//atbat[@*]")
                  parse.Npitches.atbat <- sapply(parse.atbat, function(x)
                                                 sum(names(xmlChildren(x)) == "pitch"))

                  # check to see if atbat exists
                  if (length(parse.atbat) > 0) {
                    print(paste(parse.day[game], "inning =", inning))

                    # parse attributes from pitch and atbat (ugh, ugly)
                    parse.pitch <- sapply(grab.pitch, function(x)
                                          as.character(xpathSApply(XML.inning, "//pitch[@*]",
                                                                   xmlGetAttr, x)))
                    # a single pitch or at-bat comes back as a plain vector, so
                    # turn it into a one-row matrix (is.matrix is safe in R >= 4.0)
                    parse.pitch <- if (!is.matrix(parse.pitch)) {
                      t(parse.pitch)
                    } else apply(parse.pitch, 2, as.character)
                    results.atbat <- t(sapply(parse.atbat, function(x)
                                              xmlAttrs(x)[grab.atbat]))
                    results.atbat <- results.atbat[rep(seq(nrow(results.atbat)),
                                                       times = parse.Npitches.atbat),]
                    results.atbat <- if (!is.matrix(results.atbat)) {
                      t(results.atbat)
                    } else results.atbat

                    # write results
                    write(t(cbind(year, month, day, inning, home, away,
                                  results.atbat, parse.pitch)), file = fileloc,
                          ncolumns = length(c(grab.atbat, grab.pitch)) + length(meta),
                          append = TRUE, sep = "\t")
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
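
The date-to-URL step inside the loop can also be written more compactly with format(). This is only a sketch: BuildDateURL is a hypothetical helper, not part of the script above, and it assumes the same year_/month_/day_ directory layout that the script uses.

```r
# BuildDateURL (hypothetical helper): one Gameday directory URL per date
BuildDateURL <- function(date,
                         URL.base = "http://gd2.mlb.com/components/game/mlb/") {
  d <- as.Date(date)
  # %m and %d are zero-padded, which matches month_05 / day_02
  paste0(URL.base, format(d, "year_%Y/month_%m/day_%d/"))
}

# a single day
one.day <- BuildDateURL("2009-05-02")

# a short range of days, built as calendar dates
days <- seq(as.Date("2009-05-02"), as.Date("2009-05-04"), by = "day")
urls <- BuildDateURL(days)
print(urls)
```

Because the helper is vectorised, a range of dates yields one URL per day, which makes it easy to process and save each day separately.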


Filed under: Baseball, R, XML
