A Baseball Dashboard in Time for Opening Weekend (part one)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
@ctrent I’ve always thought that it’d be fun to debate the @Reds HOF like the real one. Set a high , but debatable, bar. Bring on the JAWS
— Craig Wales (@C_Dubs1) August 22, 2017
it would be interesting, I don't have the math or computer skills to do so https://t.co/p5n9O5NjGT
— C. Trent Rosecrans (@ctrent) August 22, 2017
Baseball fan? ☑️ Math? ☑️ Computers skills? ☑️
It’s always nice when project ideas fall into your lap. Let’s build an dashboard that can be used to evaluate nominees for a franchise’s hall of fame.
JAWS
For the uninitiated, the first thing you need to understand is WAR. WAR attempts to encapsulate a player’s yearly contribution into one statistic. JAWS is calculated by taking the average of a player’s total WAR over their career and the sum of their seven highest WAR values. The JAWS statistic is meant to be a starting point in the discussion of a nominee’s creditials for the Hall of Fame. Jaffe provides a thorough explanation here. There’s also a more succinct description at Baseball-Reference.
The question is can JAWS, whose domain of applicability is the MLB Hall of Fame, be applied to a franchise’s Hall of Fame. Instead of career WAR, we’ll be using only the WAR the player accrued while he played for the Cincinnati Reds. The likely sticking point is the seven year qualification since free agency makes it less likely that players remain with a team for that length of duration. The number of inductees in a franchise hall of fame probably is considerably less than the MLB hall of fame, so the trick will be to chose a tenure long enough to qualify a high percentage of the inductees and still allow JAWS to provide an adequate measure.
Scrape Hall of Fame Inductee Names
First, we need to get the names of the members of the Reds Hall of Fame. There are a few places with this information including the Reds Hall of Fame website, Baseball-Reference, and Wikipedia. None had the information available to download or presented it in a tidy format. Wikipedia is a source that has wide-ranging utility. This is a good opportunity to become familiar with scraping the website.
Honestly, I expected this part to be more of a hassle, but rvest
made the process quite painless. Use Google Chrome’s inspect feature to obtain the xpath by scrolling down to the table with the member names, right-clicking on the page, and clicking inspect. In the left panel under the Elements tab, you’ll see HTML code. Hovering over each line will shade an object on the page. Once we find the line of code that corresponds to the table we want, we right-click that line of code, choose copy, and copy xpath.
library(tidyverse) library(rvest) url <- "https://en.wikipedia.org/wiki/Cincinnati_Reds_Hall_of_Fame_and_Museum#Cincinnati_Reds_Hall_of_Fame_members" members <- url %>% read_html() %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>% html_table() members <- members[[1]] glimpse(members) ## Observations: 86 ## Variables: 5 ## $ Year <chr> "1958", "1958", "1958", "1958", "1958", "1959", "1959... ## $ No. <chr> "30", "4", "10", "33", "31", "24", "18", "44, 47", "—... ## $ Inductee <chr> "Paul Derringer", "Ernie Lombardi", "Frank McCormick"... ## $ Position <chr> "P", "C", "1B", "P", "P/3B\nManager", "RF", "P", "P",... ## $ Tenure <chr> "1933–1942", "1932–1941", "1934–1945", "1937–1943\n19...
- Year is the year inducted into the Reds HOF.
- No. is the number on the back of player’s jersey.
- Inductee is the player’s name.
- Position is the positions played.
- Tenure is the number of years played as a Red and is formatted as a range
For our calculation, we’ll need Inductee, Position, and Tenure. There are formatting problems with Position and Tenure, but that information can be obtained elsewhere.
There are only six nominees in the 2018 class so we can just copy/paste their information from the announcement: third baseman Aaron Boone (1997-2003), outfielder Adam Dunn (2001-2008), pitcher John Franco (1984-1989),pitcher Danny Graves (1997-2005), third baseman Scott Rolen (2009-2012) and outfielder Reggie Sanders (1991-1998).
WAR Values
To get our WAR values, we’ll utilize two data sets from the openWARData
package: idTT (player IDs) and rWAR (Baseball-Reference WAR). Our member names are used to filter the idTT data to get IDs and the IDs to filter the rWAR data to get the WAR values.
library(openWARData) # Combining first and last names to match the member names we scraped idTTa <- idTT %>% select(key_bbref, name_last, name_first) %>% mutate(name_whole = paste(name_first, name_last)) # Missing values come along for the ride so they need removed indID <- map_dfr(members[,"Inductee"], function(x) { filter(idTTa, name_whole == x & key_bbref != "")})
Some players were excluded from indID
because in members
, they have accent marks in their names. Likewise excluded, a father/son duo who have the same names in idTT
but have Sr/Jr suffixes in members
. All of these players will be added to indID
. Also, administrative personnel were removed during the filtering process, since obviously, they have no WAR values.
missNamList <- list("Dolf Luque", "Leo Cardenas", "Tony Perez", "Dave Concepcion", "Ken Griffey", "Jose Rijo", "Cesar Geronimo", "Pedro Borbon") indID <- map_dfr(missNamList, function(x) { filter(idTTa, name_whole == x & key_bbref != "")}) %>% bind_rows(indID) %>% mutate(name_whole = if_else(key_bbref == "griffke02", "Ken Griffey Jr", name_whole))
Not many people have original names, including baseball players, so we need to remove the extra Pete Rose (Jr.), Joe Morgan, Mike McCormick, Pedro Borbon, and George Wright. Sparky Anderson and Fred Hutchinson were managers so they can be dropped as well.
indID <- filter(indID, key_bbref != "rosepe02" & key_bbref != "morgajo01" & key_bbref != "mccormi03" & key_bbref != "andersp01" & key_bbref != "wrighge03" & key_bbref != "hutchfr01" & key_bbref != "borbope02" ) glimpse(indID) ## Observations: 81 ## Variables: 4 ## $ key_bbref <fct> luquedo01, cardele01, perezto01, conceda01, griffke... ## $ name_last <fct> Luque, Cardenas, Perez, Concepcion, Griffey, Griffe... ## $ name_first <fct> Dolf, Leo, Tony, Dave, Ken, Ken, Jose, Cesar, Pedro... ## $ name_whole <chr> "Dolf Luque", "Leo Cardenas", "Tony Perez", "Dave C...
Now for the nominees…
nomNamList <- list("Aaron Boone", "Adam Dunn", "John Franco", "Danny Graves", "Scott Rolen", "Reggie Sanders") nomID <- map_dfr(nomNamList, function(x) { filter(idTTa, name_whole == x & key_bbref != "")}) # Snagged an extra Reggie Sanders nomID <- filter(nomID, key_bbref != "sandere01")
We have IDs for the inductees and nominees, so now we can get those Reds WAR values.
# Inductees indWar <- map_dfr(as.character(indID$key_bbref), function(x) { filter(rWAR, playerId == x)}) %>% select(playerId, yearId, teamId, rWAR) %>% mutate_if(is.factor, as.character) %>% filter(teamId == "CIN") # add Name column indWar <- indID %>% select(name_whole, key_bbref) %>% rename(Name = name_whole, playerId = key_bbref) %>% inner_join(indWar, by = 'playerId') ## Warning: Column `playerId` joining factor and character vector, coercing ## into character vector # Nominees nomWar <- map_dfr(as.character(nomID$key_bbref), function(x) { filter(rWAR, playerId == x)}) %>% select(playerId, yearId, teamId, rWAR) %>% mutate_if(is.factor, as.character) %>% filter(teamId == "CIN") # add Name column nomWar <- nomID %>% select(name_whole, key_bbref) %>% rename(Name = name_whole, playerId = key_bbref) %>% inner_join(nomWar, by = 'playerId') ## Warning: Column `playerId` joining factor and character vector, coercing ## into character vector glimpse(indWar) ## Observations: 721 ## Variables: 5 ## $ Name <chr> "Dolf Luque", "Dolf Luque", "Dolf Luque", "Dolf Luque... ## $ playerId <chr> "luquedo01", "luquedo01", "luquedo01", "luquedo01", "... ## $ yearId <int> 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926,... ## $ teamId <chr> "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN... ## $ rWAR <dbl> -0.14, 1.14, 4.56, 4.96, 3.10, 10.77, 2.55, 6.56, 1.1...
EDA
Tenure
We can next turn our attention to making a decision on our central quandary: how long should our tenure requirement be? If we look at tables of the different values, we can calculate the percentage of players that would remain at each cutoff.
# Inductees indYrs <- indWar %>% group_by(playerId) %>% summarize(tenure = n()) table(indYrs$tenure) ## ## 2 3 4 5 6 7 8 9 10 11 12 13 15 16 17 18 19 ## 1 2 3 5 6 9 12 8 10 8 6 2 1 1 1 1 3 # Nominees nomYrs <- nomWar %>% group_by(playerId) %>% summarize(tenure = n()) table(nomYrs$tenure) ## ## 4 6 7 8 9 ## 1 1 1 2 1
Cutoff | % Remaining |
---|---|
7 yrs | 67% |
6 yrs | 78% |
5 yrs | 86% |
4 yrs | 92% |
Looking at the nominees, it turns out Scott Rolen only played four seasons for the Reds. The goal of this project is to evaluate nominees, so four years would be the necessary cutoff in order for Rolen to be included. If Rolen wasn’t part of this class, I’d consider five years but not above five.
So the inductees that didn’t make the cut were the following: Billy Werber, Bill McKechnie, and Wayne Granger. The Wright boys, George and Harry, also aren’t in there. They played with the Reds prior to 1871 and their WAR wasn’t available. Our final inductee pool has 76 players.
indWar <- filter(indWar, playerId != "grangwa01" & playerId != "mckecbi01" & playerId != "werbebi01" & playerId != "wrighge01" & playerId != "wrighha01") warDat <- indWar %>% bind_rows(nomWar)
Position
We need to figure out the primary positions for players as they tend to play multiple positions during their career. When comparing JAWS scores between inductees and nominees, it’s usually done by position. So, if we were to examine Scott Rolen’s case, we’d look at his JAWS score and compare it to other Reds third basemen such as Frank Robinson or Chris Sabo.
Jaffe makes this determination by calculating the total WAR at each position and selecting the position with the greatest value. Currently, I can’t find a relatively convenient way to obtain the necessary data to make that calculation. We’ll make our determination by using the Lahman
package and its Fielding
data set to find the position which has the most games played as a Red.
library(Lahman) head(Fielding, 3) ## playerID yearID stint teamID lgID POS G GS InnOuts PO A E DP PB WP ## 1 abercda01 1871 1 TRO NA SS 1 NA NA 1 3 2 0 NA NA ## 2 addybo01 1871 1 RC1 NA 2B 22 NA NA 67 72 42 5 NA NA ## 3 addybo01 1871 1 RC1 NA SS 3 NA NA 8 14 7 0 NA NA ## SB CS ZR ## 1 NA NA NA ## 2 NA NA NA ## 3 NA NA NA # Fielding dataset has different designations for 1800's Reds teams: CN1 and CN2. posDat <- map2_dfr(warDat$playerId, warDat$yearId, function(x,y) { filter(Fielding, playerID == x & yearID == y)}) %>% filter(teamID == "CIN" | teamID == "CN1" | teamID == "CN2") # Getting position with most games as a Red posDat <- posDat %>% select(playerID, POS, G) %>% group_by(playerID, POS) %>% summarize(sumG = sum(G)) %>% filter(sumG == max(sumG)) %>% ungroup() %>% select(playerID, POS) # Jim O'Toole's Baseball-Reference ID in the Fielding data set is incorrect but he was a pitcher his whole career. Adding him to the df. setdiff(warDat$playerId, posDat$playerID) ## [1] "o'tooji01" posDat <- posDat %>% add_row(playerID = "o'tooji01", POS = "P") %>% rename(playerId = playerID) glimpse(posDat) ## Observations: 82 ## Variables: 2 ## $ playerId <chr> "becklja01", "bellgu01", "benchjo01", "billija01", "b... ## $ POS <chr> "1B", "OF", "C", "P", "P", "3B", "P", "OF", "P", "C",...
The outfield position can be further divided into right field, center field, and left field using the Appearances
data set in Lahman
. It would be desirable to also split the pitching position into relief and starting but unfortunately the Lahman
package doesn’t afford us this capability explicitly. I think the information could be derived from the games started statistic and some others but that task will have be left for another time.
head(Appearances, 3) ## yearID teamID lgID playerID G_all GS G_batting G_defense G_p G_c G_1b ## 1 1871 TRO NA abercda01 1 NA 1 1 0 0 0 ## 2 1871 RC1 NA addybo01 25 NA 25 25 0 0 0 ## 3 1871 CL1 NA allisar01 29 NA 29 29 0 0 0 ## G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr ## 1 0 0 1 0 0 0 0 NA NA NA ## 2 22 0 3 0 0 0 0 NA NA NA ## 3 2 0 0 0 29 0 29 NA NA NA # Get outfielder IDs ofDat <- posDat %>% filter(POS == "OF") # Get yearId from warDat ofYears <- map_dfr(ofDat$playerId, function(x) { filter(warDat, playerId == x) }) # Number of games played at each OF position for each season ofSplit <- map2_dfr(ofYears$playerId, ofYears$yearId, function(x,y) { filter(Appearances, playerID == x & yearID == y)}) %>% rename(LF = G_lf, CF = G_cf, RF = G_rf) %>% gather('LF', 'CF', 'RF', key = "POS", value = "G") # Primary outfield position = most games played at that position splitSum <- ofSplit %>% select(playerID, POS, G) %>% rename(playerId = playerID) %>% group_by(playerId, POS) %>% summarize(sumG = sum(G)) %>% filter(sumG == max(sumG)) %>% ungroup() %>% select(playerId, POS) # Replacing "OF" values in posDat ofPos <- posDat %>% filter(POS == "OF") %>% select(-POS) %>% inner_join(splitSum, by = "playerId") posDat <- posDat %>% filter(POS != "OF") %>% bind_rows(ofPos) # Add POS column warDat <- warDat %>% inner_join(posDat, by = 'playerId')
Save Objects
We’ve generated quite a few objects in part one of this series. If you’d like to keep your environment relatively clean, we’ll only need a few of these going into the calculations of part two: indWar
, nomWar
, posDat
, and warDat
.
Conclusion
We have the ingredients for the JAWS calculation and our dashboard. In this part, we scraped Wikipedia to get the Hall of Fame members’ names. Those names were used to get Baseball-Reference IDs which in turn were used to obtain WAR values. Next, by examining the data, we determined our tenure qualification, and primary player positions were determined by the greatest number of games played at a position. Next, we’ll perform the calculations in part two and visualize them in a shinydashboard
in part three.
References
[1] B. Baumer and G. Matthews. openWARData: Data Associated with openWAR. R package version 0.1.1.9004. 2015. URL: https://github.com/beanumber/openWARData.
[2] M. Friendly. Lahman: Sean ‘Lahman’ Baseball Database. R package version 6.0-0. 2017. URL: https://CRAN.R-project.org/package=Lahman.
[3] H. Wickham. rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.2. 2016. URL: https://CRAN.R-project.org/package=rvest.
[4] H. Wickham. tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. 2017. URL: https://CRAN.R-project.org/package=tidyverse.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.