# MLB Baseball Pitching Matchups ~ downloading pitch f/x data using the XML package in R [updatedx6]

May 18, 2010
[This article was first published on mind of a Markov chain » R, and kindly contributed to R-bloggers.]

Update x6 (Jul 27): so I guess people want pitch counts. The data at MLB seems to give only the count at the end of each at bat, plus whether each particular pitch was a strike, a ball, or put in play (`type` is S, B, or X). Of course, you can combine them to get the count before each pitch. Stupid WordPress comments strip out the HTML needed to display code properly, so I am posting it here instead! `pitcher.Lince` is the data frame downloaded from MLB, restricted to Tim Lincecum's pitches. It should work in general. Of course, if you have a lot of data, it will take time.

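```
# number the at bats: a new one starts whenever the batter changes
pitcher.Lince$atbat <- cumsum(c(0, diff(pitcher.Lince$batter) != 0))

# get counts
# as.character() guards against `type` being a factor, which would index
# the count matrix by level code instead of by column name
aho <- tapply(as.character(pitcher.Lince$type), pitcher.Lince$atbat, function(x) {
  count <- matrix(0, length(x), 3,
                  dimnames = list(NULL, c("S", "B", "X")))
  for (i in seq_along(x)) {
    if (i >= 2) {
      # carry the previous count forward, then add the previous pitch
      count[i, ] <- count[i - 1, ]
      count[i, x[i - 1]] <- count[i - 1, x[i - 1]] + 1
      # cap strikes at two (e.g. fouls with two strikes)
      if (x[i - 1] == "S" && count[i, "S"] > 2) count[i, "S"] <- 2
    }
  }
  count
})

# stack the per-at-bat matrices in at-bat order
aho2 <- do.call(rbind, aho)

pitcher.Lince <- data.frame(pitcher.Lince, aho2)
rm(aho, aho2)
```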
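To make the idea of "the count before each pitch" concrete, here is a minimal sketch for a single at bat using made-up pitch results. The helper `count_before_pitch` is hypothetical (it is not part of the script); it assumes the same `type` codes as above ("S", "B", "X") and caps strikes at two in the same way.

```r
# Hypothetical helper, not from the original script: balls and strikes
# *before* each pitch of one at bat, given the pitch result codes.
count_before_pitch <- function(type) {
  type <- as.character(type)
  prev <- c("", head(type, -1))             # result of the previous pitch
  balls <- cumsum(prev == "B")
  strikes <- pmin(cumsum(prev == "S"), 2)   # strikes never go past two
  data.frame(type = type, B = balls, S = strikes)
}

# Made-up at bat: ball, strike, strike, foul (coded "S"), ball in play
toy <- count_before_pitch(c("B", "S", "S", "S", "X"))
print(toy)
```

The fourth pitch is thrown on a 1-2 count, and the foul leaves the count at 1-2 for the fifth pitch, which is the behaviour the loop above reproduces row by row.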

Update x5 (Jun 13): More bug fixes. Now the default value for `end.date` is `start.date`.

Update x4 (Jun 05): More bug fixes (especially for 2008). I’ve also realized that it is advisable to run the script each time for a new day instead of a range of dates, which really bloats memory usage.

Update x3 (Jun 01): New code (version 0.4) fixed some bugs, grabs team info and checks `game.type` to choose type of game (regular season, world series, etc.).

Update x2 (May 27): New code (version 0.3) has replaced old buggy one (version 0.2).

Update: Please see the comments below for some problems with the code. I will post an update shortly(?).

MLB Gameday collects massive amounts of data from each at bat of each Major League Baseball game. Gameday also doubles as a web application for following discrete events in action, for those who don't have a TV, are too lazy to buy MLB.tv and are mad at MLB's blackout policies. Using this detailed data, one can do a myriad of interesting analyses. Gameday added pitch f/x, which includes data on every pitch and its characteristics (speed, location, release point, break, etc.). There are existing tools for using this data, including a Perl script by Mike Fast and R code by Erik Iverson of Blogistic Reflections (links are in the code comments below).

I’ve decided to construct my own R code that is more comprehensive than Erik’s. The current version simply downloads the data from the website containing the data, given a range of dates. Ultimately, I would like the user to be able to put it in a database, to query data for different pitchers, batters, teams, matchups, etc.

The only package that is necessary is XML. I used three functions of importance:

```
xmlInternalTreeParse
xpathSApply
htmlParse
```

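The first function takes the URL of an XML file and parses it into an XML object. The second function takes that XML object and searches it for certain elements using the XPath language. In the example below, `xpathSApply` finds each `game` tag that has a `gameday_link` attribute and extracts its value(s) into a vector.

```
URL.scoreboard <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/miniscoreboard.xml"
XML.scoreboard <- xmlInternalTreeParse(URL.scoreboard)
# last step reconstructed from the description above (the line was lost from the post);
# parse.scoreboard is a placeholder name
parse.scoreboard <- xpathSApply(XML.scoreboard, "//game[@gameday_link]", xmlGetAttr, "gameday_link")
```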

The third function parses the HTML page served at a URL rather than a downloadable XML file. I use it to check that certain files exist (the data is imperfect: some game directories exist with no information filled in, which would crash the code). You can also apply `xpathSApply` to the HTML object.

```
URL.game <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/gid_2009_05_02_sdnmlb_lanmlb_1/"
HTML.game <- htmlParse(URL.game)
parse.game.exists <- xpathSApply(HTML.game, "//a[@*]", xmlGetAttr, "href")
```
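The three functions can be tried without touching the network by parsing text directly with `asText = TRUE`. The sketch below uses made-up, heavily trimmed stand-ins for a mini scoreboard and a directory listing; the real files carry many more attributes, and the second game ID is invented for illustration.

```r
library(XML)

# Made-up stand-in for miniscoreboard.xml
xml.text <- '<games>
  <game id="a" gameday_link="2009_05_02_sdnmlb_lanmlb_1"/>
  <game id="b" gameday_link="2009_05_02_aaamlb_bbbmlb_1"/>
</games>'
doc.xml <- xmlInternalTreeParse(xml.text, asText = TRUE)

# every <game> node with a gameday_link attribute, then that attribute's value
links <- xpathSApply(doc.xml, "//game[@gameday_link]", xmlGetAttr, "gameday_link")
print(links)

# Made-up stand-in for the HTML listing of a game directory
html.text <- '<html><body><ul>
  <li><a href="game.xml">game.xml</a></li>
  <li><a href="inning/">inning/</a></li>
</ul></body></html>'
doc.html <- htmlParse(html.text, asText = TRUE)

# all link targets on the page, used as an existence check for game.xml
hrefs <- xpathSApply(doc.html, "//a[@href]", xmlGetAttr, "href")
has.game.xml <- "game.xml" %in% hrefs
print(has.game.xml)
```

The same XPath expressions work on the objects returned for real URLs, so a query can be debugged on a small snippet like this before running it against a whole day of games.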

The code grabs the attributes picked by the user (namely the pitch and atbat attributes) iteratively for the selected dates, and writes them directly to a tab-delimited text file.
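Iterating over the selected dates comes down to turning each day into a Gameday directory URL. Here is a minimal sketch of that step on its own; `gameday_urls` is a hypothetical helper (not part of the script) that uses `as.Date` and `format` rather than the `POSIXlt` fields and `ifelse` padding used in the full code.

```r
# Hypothetical helper, not in the original script: per-day Gameday
# directory URLs for an inclusive range of dates.
gameday_urls <- function(start.date, end.date = start.date,
                         URL.base = "http://gd2.mlb.com/components/game/mlb/") {
  days <- seq(as.Date(start.date), as.Date(end.date), by = "day")
  # %m and %d are zero-padded, so single-digit months and days need no special case
  paste0(URL.base, format(days, "year_%Y/month_%m/day_%d/"))
}

urls <- gameday_urls("2009-05-02", "2009-05-04")
print(urls)
```

Working with `Date` objects also sidesteps time-zone and daylight-saving questions, since there is no time of day involved.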

I haven't really tested my code over a range of dates, so it's not fully robust (though I fixed a couple of defects). The code also takes a long time due to the volume of the data (~20 seconds for a single day on my 2.5 GHz machine).

Code:

```
# DownloadPitchFX.R
# Author: apeecape
# Email: achikule at gmail dot com
# Updated: Jun 13 2010
# Version 0.4
# Version History
# 0.5 ~ grab player data, both pitchers and batters, ability to pick team
# 0.4 ~ get team data, and ability to grab team info, checks to see if regular season
# 0.3 ~ updated so 2010 works, fixed some bugs, and saves as tab delimited file
# 0.2 ~ inputs are start and end dates
# 0.1 ~ grab Pitch f/x data from MLB Gameday, specify date ranges (takes half a minute for a day's worth of data on my 2.5Ghz machine)

# Future Versions:
# ~ ability to pick pitchers, batters, teams
# - ability to grab matchups
# - better searching instead of tediously parsing through each XML file
# ~ connect to mysql database
# ~ don't overheat computer!
# ~ document Gameday Code

# Get data from http://gd2.mlb.com/components/game/mlb/
# XML package http://www.omegahat.org/RSXML/shortIntro.html
# Perl script of same application by Mike Fast:
# http://fastballs.files.wordpress.com/2007/09/hack_28_parser_mikefast_test_pl.txt
# Less general R code from Erik Iverson of Blogistic Reflections:
# http://blogisticreflections.wordpress.com/2009/10/04/using-r-to-analyze-baseball-games-in-real-time/
# listing of pitch f/x tools by Baseball Analysts
# http://baseballanalysts.com/archives/2010/03/how_can_i_get_m.php
# http://www.wantlinux.net/category/baseball-data/

# I think gameday data starts 2005
# I think enhanced gameday (pitch fx) has all of 2009, most of 2008, some 2007, tiny bit 2006

# required libraries:
library(XML)

# code for the type attribute of <game> in game.xml (input game.type in code)
# "S" ~ spring training, "R" ~ regular season, "D" ~ Division Series
# "L" ~ League Championship Series, "W" ~ World Series

# code for the gameday_sw attribute of <game> in game.xml
# http://sports.dir.groups.yahoo.com/group/RetroSQL/message/320
# "N" ~ missing, no pitch info
# "Y" ~ standard w/ pitch locations
# "E" ~ w/ pitch f/x
# "P" ~ for 2010, whatever that's supposed to mean

# code for teams

# code for players

# code for gameday

# code for pitch type

# code for atbat type

# checks for:
# gameday type
# home, away
# player, batter, pitch type

# -----------------------------------------------------------

# Function header reconstructed (the line was lost from the post): the name
# follows the file name, and fileloc is the output file path used below.
DownloadPitchFX <- function(fileloc,
                            start.date = "2009-05-02", end.date = start.date,
                            URL.base = "http://gd2.mlb.com/components/game/mlb/",
                            game.type = "R",
                            grab.pitch = c("des", "type", "x", "y",
                              "start_speed", "end_speed",
                              "sz_top", "sz_bot", "pfx_x", "pfx_z", "px", "pz",
                              "x0", "y0", "z0", "vx0", "vy0", "vz0", "ax", "ay", "az",
                              "break_y", "break_angle", "break_length", "pitch_type",
                              "type_confidence"),
                            grab.atbat = c("b", "s", "o", "batter", "pitcher", "b_height",
                              "stand", "p_throws", "event")) {
  # write initial variables on file (tab-delimited header row)
  meta <- c("Year", "Month", "Day", "Inning", "Home", "Away")
  write(c(meta, grab.atbat, grab.pitch), file = fileloc,
        ncolumns = length(c(grab.atbat, grab.pitch)) + length(meta), sep = "\t")

  # transfer date info
  start.date <- as.POSIXlt(start.date)
  end.date <- as.POSIXlt(end.date)
  # force the units to days so short ranges are not measured in secs or hours
  diff.date <- round(as.numeric(difftime(end.date, start.date, units = "days")))
  # "DSTday" keeps the clock time fixed, so no day is skipped across a DST change
  date.range <- as.POSIXlt(seq(start.date, by = "DSTday",
                               length.out = 1 + diff.date))

  for (i in 1:(diff.date + 1)) {
    year <- date.range[i]$year + 1900
    month <- date.range[i]$mon + 1
    day <- date.range[i]$mday
    URL.date <- paste(URL.base, "year_", year, "/",
                      ifelse(month >= 10, "month_", "month_0"), month, "/",
                      ifelse(day >= 10, "day_", "day_0"), day, "/", sep = "")

    # grab matchups for today
    ##     URL.scoreboard <- paste(URL.date, "miniscoreboard.xml", sep = "")
    ##     XML.scoreboard <- xmlInternalTreeParse(URL.scoreboard)
    HTML.day <- htmlParse(URL.date)
    parse.day <- xpathSApply(HTML.day, "//a[@*]", xmlGetAttr, "href")
    parse.day <- parse.day[grep("^gid_", parse.day)]

    # if games exist today
    if (length(parse.day) >= 1) {

      # for each game
      for (game in 1:length(parse.day)) {
        print(game)
        URL.game <- paste(URL.date, parse.day[game], sep = "")
        HTML.game <- htmlParse(URL.game)
        parse.game.exists <- xpathSApply(HTML.game, "//a[@*]", xmlGetAttr, "href")

        # if game.xml exists
        if ("game.xml" %in% parse.game.exists) {

          # grab game type (regular season, etc.) and gameday type (pitch f/x, etc.)
          XML.game <- xmlInternalTreeParse(paste(URL.game, "game.xml", sep = ""))
          parse.game <- sapply(c("type", "gameday_sw"), function (x)
            xpathSApply(XML.game, "//game[@*]", xmlGetAttr, x))

          # if proper game type: "R" ~ regular season, "S" ~ spring, "D" ~ division series
          # "L" ~ league championship series, "W" ~ world series
          if (parse.game["type"] == game.type) {
            # grab team names
            parse.teams <- sapply(c("abbrev"), function (x)
              xpathSApply(XML.game, "//team[@*]", xmlGetAttr, x))
            home <- parse.teams[1]; away <- parse.teams[2]

            # if pitch f/x data exists
            if (parse.game["gameday_sw"] %in% c("E", "P")) {

              # grab number of innings played
              HTML.Ninnings <- htmlParse(paste(URL.game, "inning/", sep = ""))
              parse.Ninnings <- xpathSApply(HTML.Ninnings, "//a[@*]", xmlGetAttr, "href")
              Ninnings <- length(grep("^inning_[0-9]", parse.Ninnings))

              # check that inning data exists (more than one inning file)
              if (Ninnings > 1) {

                # for each inning
                for (inning in 1:Ninnings) {

                  # grab inning info
                  URL.inning <- paste(URL.game, "inning/", "inning_", inning,
                                      ".xml", sep = "")
                  XML.inning <- xmlInternalTreeParse(URL.inning)
                  parse.atbat <- xpathSApply(XML.inning, "//atbat[@*]")
                  parse.Npitches.atbat <- sapply(parse.atbat, function(x)
                    sum(names(xmlChildren(x)) == "pitch"))

                  # check to see if atbat exists
                  if (length(parse.atbat) > 0) {
                    print(paste(parse.day[game], "inning =", inning))

                    # parse attributes from pitch and atbat (ugh, ugly)
                    parse.pitch <- sapply(grab.pitch, function(x)
                      as.character(xpathSApply(XML.inning, "//pitch[@*]",
                                               xmlGetAttr, x)))
                    # a single pitch comes back as a vector: make it a one-row matrix
                    # (is.matrix() replaces class(x) == "character", which fails in newer R)
                    parse.pitch <- if (is.matrix(parse.pitch)) {
                      apply(parse.pitch, 2, as.character)
                    } else t(parse.pitch)
                    results.atbat <- t(sapply(parse.atbat, function(x)
                      xmlAttrs(x)[grab.atbat]))
                    # repeat each atbat row once per pitch thrown in that atbat
                    results.atbat <- results.atbat[rep(seq(nrow(results.atbat)),
                                                       times = parse.Npitches.atbat),]
                    results.atbat <- if (is.matrix(results.atbat)) {
                      results.atbat
                    } else t(results.atbat)

                    # write results (one tab-delimited row per pitch)
                    write(t(cbind(year, month, day, inning, home, away,
                                  results.atbat, parse.pitch)), file = fileloc,
                          ncolumns = length(c(grab.atbat, grab.pitch)) + length(meta),
                          append = TRUE, sep = "\t")
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```
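The innermost step, pairing every pitch with the attributes of its at bat, is easier to follow on a tiny example. The sketch below runs offline on a made-up, heavily trimmed inning (real inning files carry far more attributes, and the player IDs are invented), and it uses `getNodeSet` plus a data frame instead of the script's `write` call, so treat it as an illustration of the idea rather than the script's exact code path.

```r
library(XML)

# Made-up stand-in for an inning_N.xml file
inning.text <- '<inning num="1"><top>
  <atbat b="1" s="3" o="1" batter="111" pitcher="999" event="Strikeout">
    <pitch des="Called Strike" type="S" start_speed="92.0"/>
    <pitch des="Ball" type="B" start_speed="91.2"/>
    <pitch des="Swinging Strike" type="S" start_speed="84.5"/>
    <pitch des="Swinging Strike" type="S" start_speed="93.1"/>
  </atbat>
  <atbat b="0" s="0" o="2" batter="222" pitcher="999" event="Groundout">
    <pitch des="In play, out(s)" type="X" start_speed="90.1"/>
  </atbat>
</top></inning>'
XML.inning <- xmlInternalTreeParse(inning.text, asText = TRUE)

grab.atbat <- c("batter", "pitcher", "event")
grab.pitch <- c("des", "type", "start_speed")

# one node per at bat, and the number of <pitch> children in each
atbats <- getNodeSet(XML.inning, "//atbat")
n.pitches <- sapply(atbats, function(x) sum(names(xmlChildren(x)) == "pitch"))

# at-bat attributes, with each row repeated once per pitch
atbat.rows <- t(sapply(atbats, function(x) xmlAttrs(x)[grab.atbat]))
atbat.rows <- atbat.rows[rep(seq_len(nrow(atbat.rows)), times = n.pitches), ,
                         drop = FALSE]

# pitch attributes in document order, one column per attribute
pitch.rows <- sapply(grab.pitch, function(a)
  xpathSApply(XML.inning, "//pitch", xmlGetAttr, a))

pitches <- data.frame(atbat.rows, pitch.rows, stringsAsFactors = FALSE)
print(pitches)
```

Because `//pitch` returns pitches in document order, repeating the at-bat rows by their pitch counts lines the two tables up row for row, which is what lets the script bind them side by side before writing.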

Filed under: Baseball, R, XML
