How to Add WAR Metrics to your Lahman Database

[This article was first published on R Tricks – Data Science Riot!, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I get a lot of questions on how to calculate WAR in the Lahman database. In the past I’ve discussed ways to calculate wOBA and FIP in Lahman but WAR has always been difficult due to the “closed-source” nature of the calculation.

But there is an answer

While stumbling around on Baseball Reference one day, I found that BR makes their WAR tables open to the public in the form of text documents. They can be found here, go to the middle of the page and look for “WAR-Position Player” and “WAR-Pitchers.”

These tables contain much more than just WAR. They also contain a myriad of other metrics including WAA, Pythagorean exponent and player salary; just to name a few.

The data is great but how to make it “talk” to the Lahman database? One could haphazardly load these .csv files into the Lahman schema but they still wouldn’t be keyed correctly. BR has a different id system for players and teams (among other things.)

R to the rescue!

If you’re a fan of sabermetrics and you aren’t already using R, you should really start. What will R do? In this case we can automatically scrape the data, pull indexes from the Lahman database, match them with BR indexes, parse the data and load everything into the database directly form R. And oh yeah, you can set it up as an automated task if you like!

The biggest challenge is the BR player_ID and team_ID, which are different than Lahman’s. The good news is, Sean Lahman figured people like us would be doing this, so he included pairings for all these keys! In the 2014 edition of Lahman, you can find “bbrefID” on the Master table and teamIDBR on the Teams table. The script below will use these ids to match those from BR and replace them with the correct Lahman ids.

The end result

In the end you get two additional tables in your Lahman database. These tables are neatly indexed by Lahman teamID and playerID, so they can be easily joined to other tables. In the example below, I included player name and team name just to make sure the indexing in R worked correctly. As you can see, it matches the list from BR’s website (although I have an extra decimal.)

dbasewar

brwar

The technical stuff

I have the tendency to make all my column names lowercase, which isn’t the default in Lahman. I tried to set all the column names in the script to the Lahman standard, but you may have to change one or two.

This script requires a database connection, you’ll have to enter your username and password at line 16.

Since so many saber-types like MySQL, the below script uses that type of database connection. I personally think MySQL is a tool of the devil, and if anyone is interested in a PostgreSQL equivalent, it can be found on my GitHub page here.

To the MySQL pundits, there are several reasons that Postgres is superior for sabermetrics!

I took great pains to ensure the data types crossed over correctly, but it would be a good idea to check those in our database (especially the floats.) Also, once in the db, it would be a good idea to create a primary key for playerID. Not necessary, just a good idea.

This is just for batting, the script for pitching can be found on my GitHub page here.

### Open source script for scraping data from BaseballReference.com and laoding in Sean Lahman's Baseball Database.
### This script is the work of Kris Eberwein and can be found on GitHub at https://github.com/keberwein/Data_Science_Riot/tree/master/Lahman_WAR

#Load packages and install if you don't have them
require(DBI)
require(RMySQL)
require(dplyr)

#Get the data from Baseball Reference
if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "http://www.baseball-reference.com/data/war_daily_bat.txt"
download.file(fileUrl, destfile="war_daily_bat.csv", method="curl")

#Write the download to a data frame
df <- read.csv("war_daily_bat.csv", header=TRUE)

#Connect to your Lahman instance so we can grab some data from the master table
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host= "localhost", dbname = "Lahman", 
                 user="YOUR USERNAME", password="YOUR PASSWORD")

#SQL the database for playerid on the master table "bbrefid" are the Baseball Reference ids
master <- dbSendQuery(con, "SELECT playerID, bbrefID FROM master")
m <- fetch(master, n = -1)

#SQL on the teams table to match Baseball Ref team_ID
teams <- dbSendQuery(con, "SELECT yearID, teamID, teamIDBR FROM teams")
t <- fetch(teams, n = -1)

#Join master and war data frames
df2 <- left_join(df, m, by = c("player_ID" = "bbrefID"))

# Convert and rename a few things in the teams dataframe to make the join smooth
t$teamidbr <- as.factor(t$teamidbr)
names(t)[names(t)=="teamIDBR"] <- "team_ID"
names(t)[names(t)=="yearID"] <- "year_ID"

#Now we index the teams
df3 <- left_join(df2, t)

#Reorder data frame
final <- subset(df3, select = c(playerID, year_ID, age, team_ID, stint_ID, lg_ID, PA, G, 
                              Inn, runs_bat, runs_br, runs_dp, runs_field, runs_infield, 
                              runs_outfield, runs_catcher, runs_good_plays, runs_defense, 
                              runs_position, runs_position_p, runs_replacement, runs_above_rep, 
                              runs_above_avg, runs_above_avg_off, runs_above_avg_def, WAA, WAA_off, 
                              WAA_def, WAR, WAR_def, WAR_off, WAR_rep, salary, pitcher, teamRpG, 
                              oppRpG, oppRpPA_rep, oppRpG_rep, pyth_exponent, pyth_exponent_rep, 
                              waa_win_perc, waa_win_perc_off, waa_win_perc_def, waa_win_perc_rep))

#Rename a couple of columns in our new tidy data set to fit to Lahman standards
names(final)[names(final)=="stint_ID"] <- "stint"
names(final)[names(final)=="lg_ID"] <- "lgID"
names(final)[names(final)=="year_ID"] <- "yearID"
names(final)[names(final)=="team_ID"] <- "teamID"

# Clean up the data types before loading into Lahman
# This part is nausiating but necessary...
final$age <- as.integer(as.character(final$age))
final$stint <- as.integer(as.character(final$stint))
final$PA <- as.integer(as.character(final$PA))
final$G <- as.integer(as.character(final$G))
final$Inn <- as.integer(as.character(final$Inn))
final$runs_bat <- as.double(as.character(final$runs_bat))
final$runs_br <- as.double(as.character(final$runs_br))
final$runs_dp <- as.double(as.character(final$runs_dp))
final$runs_field <- as.double(as.character(final$runs_field))
final$runs_infield <- as.double(as.character(final$runs_infield))
final$runs_outfield <- as.double(as.character(final$runs_outfield))
final$runs_catcher <- as.double(as.character(final$runs_catcher))
final$runs_good_plays <- as.double(as.character(final$runs_good_plays))
final$runs_position <- as.double(as.character(final$runs_position))
final$replacement <- as.double(as.character(final$replacement))
final$runs_above_rep <- as.double(as.character(final$runs_above_rep))
final$runs_above_avg <- as.double(as.character(final$runs_above_avg))
final$runs_above_avg_off <- as.double(as.character(final$runs_above_avg_off))
final$runs_above_avg_def <- as.double(as.character(final$runs_above_avg_def))
final$WAA <- as.double(as.character(final$WAA))
final$WAA_off <- as.double(as.character(final$WAA_off))
final$WAA_def <- as.double(as.character(final$WAA_def))
final$WAR <- as.double(as.character(final$WAR))
final$WAR_def <- as.double(as.character(final$WAR_def))
final$WAR_off <- as.double(as.character(final$WAR_off))
final$WAR_rep <- as.double(as.character(final$WAR_rep))
final$salary <- as.integer(as.character(final$salary))
final$teamrRpG <- as.double(as.character(final$teamRpG))
final$oppRpG <- as.double(as.character(final$oppRpG))
final$oppRpPA_rep <- as.double(as.character(final$oppRpPA_rep))
final$oppRpG_rep <- as.double(as.character(final$oppRpG_rep))
final$pyth_exponent <- as.double(as.character(final$pyth_exponent))
final$pyth_exponent_rep <- as.double(as.character(final$pyth_exponent_rep))
final$waa_win_perc <- as.double(as.character(final$waa_win_perc))
final$waa_win_perc_off <- as.double(as.character(final$waa_win_perc_off))
final$waa_win_perc_def <- as.double(as.character(final$waa_win_perc_def))
final$waa_win_perc_rep <- as.double(as.character(final$waa_win_perc_rep))

# At this point you can do a write.csv() and load that into your Lahman instance
# OR
# Use the database connection that you established earlier to wirte a new table directly to Lahman
#Write your data frame back to the dbase. I like to write it as a test table first.
dbWriteTable(con, name='war_batting', value=final, row.names = FALSE, overwrite = TRUE)

##Now go to the Baseball Reference WAR tables and admire your work!

Photo by John Leon-Guerrero

To leave a comment for the author, please follow the link and comment on their blog: R Tricks – Data Science Riot!.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)