MLB Baseball Pitching Matchups ~ grabbing pitcher and/or batter codes by specifying game date using R XML

[This article was first published on mind of a Markov chain » R, and kindly contributed to R-bloggers.]

MLB Gameday stores its game data in XML format, with players identified by ID numbers. To find out who is who, look in the pitchers/ or batters/ directory of each game, which holds one XML file per player.

My DownloadPitchFX.R script can download the ID numbers, but it doesn't look up which player each ID belongs to because of the extra processing time. To use the data (say, in RMySQL), it helps to have another script that finds the ID number for any given player.

The following script (GetPitcherBatterCodes.R) takes the last and/or first name of the player, the team he plays for, and a date on which he is assumed to have played. It outputs a data frame with every matching name and the corresponding ID numbers. You can also set just.player = FALSE to return all of the players listed for that game (the function downloads all of them either way).
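
The game date decides which Gameday directory gets queried. A minimal sketch of that mapping, using format() to zero-pad the month and day; build_date_url is a hypothetical helper name used only for illustration, and the base URL is the default from the script:

```r
# Hypothetical helper: turn a game date into the Gameday directory URL.
# format() zero-pads month and day, so no manual padding is needed.
build_date_url <- function(game.date,
                           URL.base = "http://gd2.mlb.com/components/game/mlb/") {
  d <- as.Date(game.date)
  paste0(URL.base, format(d, "year_%Y/month_%m/day_%d/"))
}

build_date_url("2009-05-20")
```

This is meant to yield the same string as the paste()/ifelse() construction inside the function, just in one step.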

The input for the team name is fairly general. You can use the codes that are specified in Gameday (“SF”, “sfn”), or the actual city of the team (“San Francisco”), or its team name (“Giants”).
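
The flexible team input works because the supplied string is compared against every team attribute of every game on the scoreboard. A rough sketch of that lookup, using a hand-built matrix in place of the parsed miniscoreboard.xml; the first row uses the SF/SD attribute values listed in the script's comments, while the gameday_link values, the second row, and the find_game name are made up for illustration:

```r
# Toy stand-in for the parsed scoreboard: one row per game,
# one column per attribute (values partly illustrative, see above).
fields <- c("gameday_link", "away_name_abbrev", "home_name_abbrev",
            "away_code", "home_code", "away_team_city", "home_team_city",
            "away_team_name", "home_team_name")
scoreboard <- matrix(
  c("2009_05_20_sfnmlb_sdnmlb_1", "SF", "SD", "sfn", "sdn",
    "San Francisco", "San Diego", "Giants", "Padres",
    "2009_05_20_aaamlb_bbbmlb_1", "AAA", "BBB", "aaa", "bbb",
    "City A", "City B", "Team A", "Team B"),
  nrow = 2, byrow = TRUE, dimnames = list(NULL, fields)
)

# Hypothetical helper: gameday_link of the first game whose
# attributes contain `team` (exact, case-sensitive match).
find_game <- function(team, scoreboard) {
  hit <- apply(scoreboard, 1, function(x) team %in% x)
  if (!any(hit)) stop("no game found for team: ", team)
  unname(scoreboard[hit, "gameday_link"][1])
}

find_game("Giants", scoreboard)
find_game("sfn", scoreboard)
find_game("San Diego", scoreboard)
```

Because the comparison is exact, "giants" or "San Fran" would not match.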

## GetPitcherBatterCodes.R
## get pitcher batter codes for pitch f/x

library(XML)

# -- Outputs
# data frame of all matching names, OR
# data frame of all batters or pitchers in game

# -- Inputs
# game.date ~ date of the game the player appears in, as a string that as.POSIXlt() can parse, e.g. "2009-05-20"
# is.pitcher ~ TRUE for pitcher, FALSE for batter
# last_name ~ a character string for the last name
# first_name ~ a character string for the first name;
#   spelling must be exact; a player is returned if EITHER name matches,
#   so pass "" for whichever name you do not want to match on
# team ~ the team the player plays for,
#   given as any of the following values, in quotes; example for SF Giants at SD Padres:
#   away_name_abbrev="SF" home_name_abbrev="SD" away_code="sfn" away_file_code="sf" away_team_city="San Francisco" away_team_name="Giants" home_code="sdn" home_file_code="sd" home_team_city="San Diego" home_team_name="Padres"
# just.player ~ TRUE to get ID for player, FALSE to grab all pitchers OR batters in game

GetPitcherBatterCodes <- function(game.date = "2009-05-20",
                                  is.pitcher = TRUE,
                                  last_name = "Lincecum", first_name = "Tim",
                                  team = "sfn",
                                  just.player = TRUE,
                                  URL.base = "http://gd2.mlb.com/components/game/mlb/") {
  # extract date
  game.date <- as.POSIXlt(game.date)
  year <- game.date$year + 1900
  month <- game.date$mon + 1
  day <- game.date$mday
  URL.date <- paste(URL.base, "year_", year, "/",
                    ifelse(month >= 10, "month_", "month_0"), month, "/",
                    ifelse(day >= 10, "day_", "day_0"), day, "/", sep = "")

  # extract miniscoreboard.xml
  URL.scoreboard <- paste(URL.date, "miniscoreboard.xml", sep = "")
  XML.scoreboard <- xmlInternalTreeParse(URL.scoreboard)
  parse.scoreboard <- sapply(c("gameday_link",
                               "away_name_abbrev", "home_name_abbrev",
                               "away_code", "home_code",
                               "away_file_code", "home_file_code",
                               "away_team_city", "home_team_city",
                               "away_team_name", "home_team_name"), function(x)
                             xpathSApply(XML.scoreboard, "//game[@*]", xmlGetAttr, x))

  # get game URL of specified team
  team.index <- apply(parse.scoreboard, 1, function(x) team %in% x)
  if (!any(team.index)) stop("no game found for team '", team, "' on ", format(game.date, "%Y-%m-%d"))
  team.URL <- parse.scoreboard[team.index, 1][1] # doubleheaders: use the first game
  URL.game <- paste(URL.date, "gid_", team.URL, "/", sep = "")

  # get player data
  URL.players <- paste(URL.game, if (is.pitcher) "pitchers/" else "batters/", sep = "")
  HTML.players <- htmlParse(URL.players)
  codes.players <- xpathSApply(HTML.players, "//a[@*]", xmlGetAttr, "href")[-1]

  # loop through the player files and collect each player's attributes
  info.players <- sapply(codes.players, function(code) {
    URL.player <- paste(URL.players, code, sep = "")
    XML.player <- xmlInternalTreeParse(URL.player)
    print(code) # progress indicator: one line per downloaded player file
    sapply(c("team", "id", "type", "first_name", "last_name"), function(attr)
           xpathSApply(XML.player, "//Player[@*]", xmlGetAttr, attr))
  })

  # get results and match player names if necessary
  if (just.player) {
    # a player is kept if EITHER the last name OR the first name matches
    last.index <- last_name == info.players["last_name", ]
    first.index <- first_name == info.players["first_name", ]
    matched.index <- last.index | first.index
    matched.players <- data.frame(id = info.players["id", matched.index],
                                  first_name = info.players["first_name", matched.index],
                                  last_name = info.players["last_name", matched.index])
    return(matched.players)
  } else {
    return(info.players)
  }
}
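
One thing to watch in the matching step: the two names are combined with OR, and both arguments have defaults, so leaving one out can pull in extra players. A small sketch using a few pitchers from the 2009-05-20 game; match_players is a hypothetical stand-in for the matching step only, with empty-string defaults chosen here for illustration:

```r
# A few pitchers from the game, laid out like the function's internal matrix
players <- rbind(
  id         = c("217096", "453311", "430912", "460044", "459987"),
  first_name = c("Barry", "Tim", "Matt", "Cesar", "Cesar"),
  last_name  = c("Zito", "Lincecum", "Cain", "Carrillo", "Ramos")
)

# Hypothetical stand-in for the matching step: keep a player if
# EITHER name matches; "" never matches a real name.
match_players <- function(players, last_name = "", first_name = "") {
  hit <- players["last_name", ] == last_name |
         players["first_name", ] == first_name
  data.frame(id = players["id", hit],
             first_name = players["first_name", hit],
             last_name = players["last_name", hit],
             stringsAsFactors = FALSE)
}

match_players(players, last_name = "Zito", first_name = "Tim") # Zito and Lincecum
match_players(players, last_name = "Zito")                     # Zito only
match_players(players, first_name = "Cesar")                   # both Cesars
```

With the real function, the equivalent of the second call would be GetPitcherBatterCodes(last_name = "Zito", first_name = ""), since the default first_name is "Tim".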

Some output from GetPitcherBatterCodes(), using the default arguments:

> aho <- GetPitcherBatterCodes()
> aho
      id first_name last_name
1 453311        Tim  Lincecum

> aho2 <- GetPitcherBatterCodes(just.player = FALSE)
> aho2
           116615.xml 133982.xml 217096.xml 277405.xml 346793.xml 408241.xml
team       "sfn"      "sfn"      "sfn"      "sfn"      "sfn"      "sdn"
id         "116615"   "133982"   "217096"   "277405"   "346793"   "408241"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Randy"    "Bob"      "Barry"    "Justin"   "Jeremy"   "Jake"
last_name  "Johnson"  "Howry"    "Zito"     "Miller"   "Affeldt"  "Peavy"
           425514.xml 429718.xml 429723.xml 429781.xml 429985.xml 430161.xml
team       "sdn"      "sdn"      "sfn"      "sdn"      "sdn"      "sfn"
id         "425514"   "429718"   "429723"   "429781"   "429985"   "430161"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Heath"    "Shawn"    "Merkin"   "Kevin"    "Chad"     "Noah"
last_name  "Bell"     "Hill"     "Valdez"   "Correia"  "Gaudin"   "Lowry"
           430606.xml 430650.xml 430657.xml  430665.xml 430912.xml 432934.xml
team       "sdn"      "sdn"      "sdn"       "sfn"      "sfn"      "sdn"
id         "430606"   "430650"   "430657"    "430665"   "430912"   "432934"
type       "pitcher"  "pitcher"  "pitcher"   "pitcher"  "pitcher"  "pitcher"
first_name "Mike"     "Edwin"    "Cha Seung" "Brandon"  "Matt"     "Chris"
last_name  "Adams"    "Moreno"   "Baek"      "Medders"  "Cain"     "Young"
           435619.xml 445995.xml 446207.xml 448592.xml 450312.xml 450527.xml
team       "sfn"      "sdn"      "sdn"      "sdn"      "sdn"      "sfn"
id         "435619"   "445995"   "446207"   "448592"   "450312"   "450527"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Pat"      "Arturo"   "Josh"     "Cla"      "Mark"     "Alex"
last_name  "Misch"    "Lopez"    "Geer"     "Meredith" "Worrell"  "Hinshaw"
           450832.xml 451216.xml 452724.xml 453281.xml 453311.xml 456043.xml
team       "sfn"      "sfn"      "sfn"      "sdn"      "sfn"      "sfn"
id         "450832"   "451216"   "452724"   "453281"   "453311"   "456043"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Jesse"    "Brian"    "Billy"    "Wade"     "Tim"      "Jonathan"
last_name  "English"  "Wilson"   "Sadler"   "LeBlanc"  "Lincecum" "Sanchez"
           457117.xml 457566.xml 458155.xml 459987.xml 460044.xml 464351.xml
team       "sdn"      "sdn"      "sfn"      "sdn"      "sdn"      "sfn"
id         "457117"   "457566"   "458155"   "459987"   "460044"   "464351"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Ernesto"  "Greg"     "Joe"      "Cesar"    "Cesar"    "Kelvin"
last_name  "Frieri"   "Burke"    "Martinez" "Ramos"    "Carrillo" "Pichardo"
           464400.xml 465629.xml 466412.xml 467683.xml 471183.xml 477581.xml
team       "sfn"      "sdn"      "sdn"      "sfn"      "sfn"      "sdn"
id         "464400"   "465629"   "466412"   "467683"   "471183"   "477581"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Henry"    "Edward"   "Luis"     "Osiris"   "Waldis"   "Walter"
last_name  "Sosa"     "Mujica"   "Perdomo"  "Matos"    "Joaquin"  "Silva"
           489265.xml 491159.xml 502381.xml  503355.xml
team       "sfn"      "sdn"      "sdn"       "sdn"
id         "489265"   "491159"   "502381"    "503355"
type       "pitcher"  "pitcher"  "pitcher"   "pitcher"
first_name "Sergio"   "Joe"      "Luke"      "Jackson"
last_name  "Romo"     "Thatcher" "Gregerson" "Quezada"
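
The full listing comes back as a character matrix with one column per player. For loading into a database table it is usually handier to have one row per player. A sketch of that reshaping, using a three-pitcher subset of the output above rebuilt by hand as a stand-in for the real return value (players.df is just an illustrative name):

```r
# Subset of the matrix shown above, rebuilt by hand for illustration
info.players <- matrix(
  c("sfn", "116615", "pitcher", "Randy", "Johnson",
    "sdn", "408241", "pitcher", "Jake",  "Peavy",
    "sfn", "453311", "pitcher", "Tim",   "Lincecum"),
  nrow = 5,
  dimnames = list(c("team", "id", "type", "first_name", "last_name"),
                  c("116615.xml", "408241.xml", "453311.xml"))
)

# Transpose so each player is a row, then drop the file-name row names
players.df <- as.data.frame(t(info.players), stringsAsFactors = FALSE)
rownames(players.df) <- NULL

# Example query: all Giants pitchers in the subset
players.df[players.df$team == "sfn", c("id", "first_name", "last_name")]
```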


Filed under: Baseball, R, XML
