MLB Gameday stores its game data in XML format, with players identified by ID numbers. To find out who is who, each game has a pitchers/ and a batters/ directory holding one XML file per player ID.
My DownloadPitchFX.R script can download the ID numbers, but it doesn’t resolve them to player names because of the extra processing time. To use the data (say, in RMySQL), it helps to have a separate script that looks up the ID number for any player.
The following script (GetPitcherBatterCodes.R) takes the last and/or first name of the player, the team he plays on, and a date on which he is assumed to have played. It outputs a data frame with every matching name (however many there are) and the corresponding ID numbers. Set just.player = FALSE to return all of the pitchers or batters listed for that game instead; the function downloads all of them either way, so this costs no extra time.
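The and/or matching can be tried offline with a few of the pitchers that appear in the sample output at the end of this post. This is only a minimal sketch, not the script itself: `players` and `match_names` are hypothetical names, and it assumes a player counts as a match when either the first or the last name is equal.

```r
# Three pitchers copied from the sample output at the end of this post.
players <- data.frame(
  id         = c("458155", "491159", "453311"),
  first_name = c("Joe", "Joe", "Tim"),
  last_name  = c("Martinez", "Thatcher", "Lincecum"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where the first OR the last name matches.
match_names <- function(players, last_name = "", first_name = "") {
  hit <- players$last_name == last_name | players$first_name == first_name
  players[hit, ]
}

both <- match_names(players, last_name = "Lincecum", first_name = "Tim")
joes <- match_names(players, first_name = "Joe")
```

Searching on a first name alone can return several rows (both Joes here), which is why the result is a data frame rather than a single ID.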
The input for the team name is fairly general. You can use the codes that are specified in Gameday (“SF”, “sfn”), or the actual city of the team (“San Francisco”), or its team name (“Giants”).
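The reason any of these spellings works is that the team string is simply compared against every scoreboard attribute of every game. A rough offline sketch of that idea follows; `scoreboard` and `find_game` are hypothetical names, the first row uses the SF/SD values quoted in the script's comments, and the `gameday_link` values and the entire second row are placeholders, not real Gameday data.

```r
# One row per game, one column per scoreboard attribute.
# Row 1: SF at SD, values from the script's comments (gameday_link is a placeholder).
# Row 2: placeholder values only, to have a second game to search through.
scoreboard <- rbind(
  c(gameday_link = "game_1", away_name_abbrev = "SF", home_name_abbrev = "SD",
    away_code = "sfn", home_code = "sdn",
    away_team_city = "San Francisco", home_team_city = "San Diego",
    away_team_name = "Giants", home_team_name = "Padres"),
  c("game_2", "AA", "BB", "aaa", "bbb", "City A", "City B", "Team A", "Team B")
)

# Hypothetical helper: row index of the first game in which `team` appears
# under any attribute; NA if the team did not play.
find_game <- function(scoreboard, team) {
  hit <- apply(scoreboard, 1, function(row) team %in% row)
  which(hit)[1]  # first match only, so a doubleheader yields one game
}

find_game(scoreboard, "Giants")     # matches on team name
find_game(scoreboard, "San Diego")  # matches on city
find_game(scoreboard, "sfn")        # matches on Gameday code
```

The comparison is exact and case-sensitive, so "sf" and "SF" hit different attributes and "giants" would not match at all.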
## GetPitcherBatterCodes.R
## get pitcher and batter codes for PITCHf/x
library(XML)
# -- Outputs
# data frame of all matching names (just.player = TRUE), OR
# character matrix of all batters or pitchers in the game (just.player = FALSE)
# -- Inputs
# game.date ~ date of a game the player appears in; anything as.POSIXlt() can parse, e.g. "2009-05-20"
# is.pitcher ~ TRUE for pitcher, FALSE for batter
# last_name ~ character string with the last name
# first_name ~ character string with the first name
#   names must be spelled exactly, but a match on either first or last name is enough
# team ~ team the player plays for, as a quoted string
#   any of the following scoreboard values works; example for SF Giants at SD Padres:
#   away_name_abbrev = "SF"            home_name_abbrev = "SD"
#   away_code = "sfn"                  home_code = "sdn"
#   away_file_code = "sf"              home_file_code = "sd"
#   away_team_city = "San Francisco"   home_team_city = "San Diego"
#   away_team_name = "Giants"          home_team_name = "Padres"
# just.player ~ TRUE to get the ID for the player, FALSE to grab all pitchers OR batters in the game
GetPitcherBatterCodes <- function(game.date = "2009-05-20",
                                  is.pitcher = TRUE,
                                  last_name = "Lincecum", first_name = "Tim",
                                  team = "sfn",
                                  just.player = TRUE,
                                  URL.base = "http://gd2.mlb.com/components/game/mlb/") {
  # extract date
  game.date <- as.POSIXlt(game.date)
  year <- game.date$year + 1900
  month <- game.date$mon + 1
  day <- game.date$mday
  # zero-pad month and day, e.g. ".../year_2009/month_05/day_20/"
  URL.date <- paste(URL.base, "year_", year, "/",
                    "month_", sprintf("%02d", month), "/",
                    "day_", sprintf("%02d", day), "/", sep = "")
  # extract miniscoreboard.xml: one row per game, one column per attribute
  URL.scoreboard <- paste(URL.date, "miniscoreboard.xml", sep = "")
  XML.scoreboard <- xmlInternalTreeParse(URL.scoreboard)
  parse.scoreboard <- sapply(c("gameday_link",
                               "away_name_abbrev", "home_name_abbrev",
                               "away_code", "home_code",
                               "away_file_code", "home_file_code",
                               "away_team_city", "home_team_city",
                               "away_team_name", "home_team_name"), function(x)
    xpathSApply(XML.scoreboard, "//game[@*]", xmlGetAttr, x))
  # get game URL of specified team (team may match any attribute)
  team.index <- apply(parse.scoreboard, 1, function(x) team %in% x)
  team.URL <- parse.scoreboard[team.index, 1][1] # first match only; protects against doubleheaders
  if (is.na(team.URL)) stop("no game found for team: ", team)
  URL.game <- paste(URL.date, "gid_", team.URL, "/", sep = "")
  # get player data
  URL.players <- paste(URL.game, if (is.pitcher) "pitchers/" else "batters/", sep = "")
  HTML.players <- htmlParse(URL.players)
  # links in the directory listing; drop the first one, which is not a player file
  codes.players <- xpathSApply(HTML.players, "//a[@*]", xmlGetAttr, "href")[-1]
  # loop through player files; result is a matrix with one column per player
  info.players <- sapply(codes.players, function(x) {
    URL.player <- paste(URL.players, x, sep = "")
    XML.player <- xmlInternalTreeParse(URL.player)
    print(x) # progress indicator
    sapply(c("team", "id", "type", "first_name", "last_name"), function(attr)
      xpathSApply(XML.player, "//Player[@*]", xmlGetAttr, attr))
  })
  # get results and match player names if necessary
  if (just.player) {
    # match on last name OR first name
    last.index <- last_name == info.players["last_name", ]
    first.index <- first_name == info.players["first_name", ]
    matched.index <- last.index | first.index
    matched.players <- data.frame(id = info.players["id", matched.index],
                                  first_name = info.players["first_name", matched.index],
                                  last_name = info.players["last_name", matched.index])
    return(matched.players)
  } else {
    return(info.players)
  }
}
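The zero-padded date portion of the URL can also be produced in a single step with format(), which may be handy when building Gameday URLs outside this function. This is a small sketch rather than part of the script; `date_path` is a hypothetical helper name, and the base URL is the same default used above.

```r
# Hypothetical helper: build the dated part of a Gameday URL.
# format() zero-pads %m and %d and passes the literal text through unchanged.
date_path <- function(game.date,
                      URL.base = "http://gd2.mlb.com/components/game/mlb/") {
  d <- as.Date(game.date)
  paste0(URL.base, format(d, "year_%Y/month_%m/day_%d/"))
}

may_game <- date_path("2009-05-20")
oct_game <- date_path("2009-10-03", URL.base = "")
```

Here `may_game` ends in "year_2009/month_05/day_20/", the same path the function assembles from the year, month and day pieces.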
Some output:
> aho <- GetPitcherBatterCodes()
> aho
id first_name last_name
1 453311 Tim Lincecum
> aho2 <- GetPitcherBatterCodes(just.player = FALSE)
> aho2
116615.xml 133982.xml 217096.xml 277405.xml 346793.xml 408241.xml
team "sfn" "sfn" "sfn" "sfn" "sfn" "sdn"
id "116615" "133982" "217096" "277405" "346793" "408241"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Randy" "Bob" "Barry" "Justin" "Jeremy" "Jake"
last_name "Johnson" "Howry" "Zito" "Miller" "Affeldt" "Peavy"
425514.xml 429718.xml 429723.xml 429781.xml 429985.xml 430161.xml
team "sdn" "sdn" "sfn" "sdn" "sdn" "sfn"
id "425514" "429718" "429723" "429781" "429985" "430161"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Heath" "Shawn" "Merkin" "Kevin" "Chad" "Noah"
last_name "Bell" "Hill" "Valdez" "Correia" "Gaudin" "Lowry"
430606.xml 430650.xml 430657.xml 430665.xml 430912.xml 432934.xml
team "sdn" "sdn" "sdn" "sfn" "sfn" "sdn"
id "430606" "430650" "430657" "430665" "430912" "432934"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Mike" "Edwin" "Cha Seung" "Brandon" "Matt" "Chris"
last_name "Adams" "Moreno" "Baek" "Medders" "Cain" "Young"
435619.xml 445995.xml 446207.xml 448592.xml 450312.xml 450527.xml
team "sfn" "sdn" "sdn" "sdn" "sdn" "sfn"
id "435619" "445995" "446207" "448592" "450312" "450527"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Pat" "Arturo" "Josh" "Cla" "Mark" "Alex"
last_name "Misch" "Lopez" "Geer" "Meredith" "Worrell" "Hinshaw"
450832.xml 451216.xml 452724.xml 453281.xml 453311.xml 456043.xml
team "sfn" "sfn" "sfn" "sdn" "sfn" "sfn"
id "450832" "451216" "452724" "453281" "453311" "456043"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Jesse" "Brian" "Billy" "Wade" "Tim" "Jonathan"
last_name "English" "Wilson" "Sadler" "LeBlanc" "Lincecum" "Sanchez"
457117.xml 457566.xml 458155.xml 459987.xml 460044.xml 464351.xml
team "sdn" "sdn" "sfn" "sdn" "sdn" "sfn"
id "457117" "457566" "458155" "459987" "460044" "464351"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Ernesto" "Greg" "Joe" "Cesar" "Cesar" "Kelvin"
last_name "Frieri" "Burke" "Martinez" "Ramos" "Carrillo" "Pichardo"
464400.xml 465629.xml 466412.xml 467683.xml 471183.xml 477581.xml
team "sfn" "sdn" "sdn" "sfn" "sfn" "sdn"
id "464400" "465629" "466412" "467683" "471183" "477581"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Henry" "Edward" "Luis" "Osiris" "Waldis" "Walter"
last_name "Sosa" "Mujica" "Perdomo" "Matos" "Joaquin" "Silva"
489265.xml 491159.xml 502381.xml 503355.xml
team "sfn" "sdn" "sdn" "sdn"
id "489265" "491159" "502381" "503355"
type "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Sergio" "Joe" "Luke" "Jackson"
last_name "Romo" "Thatcher" "Gregerson" "Quezada"
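With just.player = FALSE the result is a character matrix with one column per player, which is awkward to load into a database table. One possible way to reshape it is sketched below, using two columns copied from the output above in place of a live download; `players.df` is a hypothetical name, and treating the ID as an integer is an assumption about how it will be stored.

```r
# Two columns copied from the output above; in practice this would be the
# matrix returned by GetPitcherBatterCodes(just.player = FALSE).
info.players <- cbind(
  "116615.xml" = c(team = "sfn", id = "116615", type = "pitcher",
                   first_name = "Randy", last_name = "Johnson"),
  "408241.xml" = c(team = "sdn", id = "408241", type = "pitcher",
                   first_name = "Jake", last_name = "Peavy")
)

# Transpose so each player is a row, then drop the file-name row names.
players.df <- as.data.frame(t(info.players), stringsAsFactors = FALSE)
rownames(players.df) <- NULL

# Assumption: IDs are stored as integers rather than strings.
players.df$id <- as.integer(players.df$id)
```

The resulting data frame has one row per player with columns team, id, type, first_name and last_name, a shape that maps directly onto a database table.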
