# Major League Baseball run scoring trends with R’s Lahman package

This article was first published on **Bayes Ball** and kindly contributed to R-bloggers.


The statistical software R has an ever-expanding array of packages that provide pre-programmed functions and datasets. One such package is named Lahman, bundling the contents of the Lahman database into a quick-and-easy resource for R users. In addition to the data tables, the package resources also contain a variety of analyses and graphics undertaken using the package, providing some examples of how the package can be used.

*Full disclosure: I am now one of the Lahman package project members.*

This is my first blog post using the Lahman package, and as a first step I will simply recreate the league run scoring trends graphs that I generated previously. Originally, I had used data from Baseball Reference, for the simple reason that the Lahman database does not, in its source form, contain any league-level aggregations.

The process for loading the Lahman package is as simple as any other R package; this simplicity is even greater if you are using an IDE such as RStudio. Once loaded, you have access to all the tables in the database, without any of the futzing that is sometimes required in tidying up a raw flat file (I find that variable names are sometimes lost or changed in translation).

The code (available as a gist here, downloadable as an R script file) creates a pair of tables, calculating each league’s run scoring rates by year. Then, recycling my earlier code, it calculates a series of trend lines using the loess method, and graphs those trend lines. For simplicity’s sake, only the final version of each graph is shown.

Step 1: install the package (if you haven’t already), access the library, and load the data table “Teams”.

```r
# load the package into R, and open the data table 'Teams' into the
# workspace
library("Lahman")
data(Teams)
```
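Step 1 mentions installing the package, but the snippet above only loads it. Here is a minimal sketch of an install-if-missing helper. The name `ensure_package` is hypothetical (it is not part of Lahman), and the sketch assumes a CRAN mirror is configured for your R session.

```r
# Hypothetical helper (not part of the Lahman package): install a package
# from CRAN only if it is not already available, then report availability.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # assumes a CRAN mirror is configured
  }
  requireNamespace(pkg, quietly = TRUE)
}

# Usage (commented out so that nothing is installed by accident):
# if (ensure_package("Lahman")) {
#   library("Lahman")
#   data(Teams)
#   str(Teams[, c("yearID", "lgID", "teamID", "G", "R", "RA")])
# }
```

The `str()` call in the usage lines is just a quick way to confirm that the columns used in the rest of this post are present.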

The second step is to use the individual team season results to calculate each league’s totals for every season. We start with 1901, the American League’s first season as a major league. Once those tables are created, the loess function is used to calculate trend lines for each league’s run scoring environment.

```r
# ===== CREATE LEAGUE SUMMARY TABLES
#
# select a sub-set of teams from 1901 [the establishment of the American
# League] forward to 2012; the upper bound keeps the range fixed if a newer
# release of the package adds later seasons
Teams_sub <- as.data.frame(subset(Teams, yearID > 1900 & yearID <= 2012))
# calculate each team's average runs and runs allowed per game
Teams_sub$RPG <- Teams_sub$R / Teams_sub$G
Teams_sub$RAPG <- Teams_sub$RA / Teams_sub$G
# create new data frame with season totals for each league
LG_RPG <- aggregate(cbind(R, RA, G) ~ yearID + lgID, data = Teams_sub, sum)
# calculate league + season runs and runs allowed per game
LG_RPG$LG_RPG <- LG_RPG$R / LG_RPG$G
LG_RPG$LG_RAPG <- LG_RPG$RA / LG_RPG$G
# read the data into separate league tables
ALseason <- subset(LG_RPG, lgID == "AL")
NLseason <- subset(LG_RPG, lgID == "NL")
```
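The league rate above is computed by summing runs and games across teams and then dividing, rather than by averaging each team's own runs-per-game figure. The two can differ when teams play different numbers of games. Here is a small sketch of the difference, using an invented three-team league (the numbers are made up for illustration and are not from the Lahman data):

```r
# Hypothetical three-team league season; the numbers are invented.
toy <- data.frame(
  yearID = c(1901, 1901, 1901),
  lgID   = c("AL", "AL", "AL"),
  R      = c(600, 700, 500),
  G      = c(140, 140, 100)
)

# Pooled league rate: sum the runs and games first, then divide.
lg <- aggregate(cbind(R, G) ~ yearID + lgID, data = toy, sum)
lg$LG_RPG <- lg$R / lg$G

# Unweighted mean of the team-level rates, for comparison.
mean_of_rates <- mean(toy$R / toy$G)

print(lg)
print(mean_of_rates)
```

Pooling weights each team by the games it actually played, which is why I use it for the league tables.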

```r
# ===== TRENDS: RUNS SCORED PER GAME
#
# AMERICAN LEAGUE
# create new object ALRunScore.LO for loess model (default span)
ALRunScore.LO <- loess(LG_RPG ~ yearID, data = ALseason)
ALRunScore.LO.predict <- predict(ALRunScore.LO)
# create new objects ALRunScore.LO.XX for loess models with 'span' control
# span = 0.25
ALRunScore.LO.25 <- loess(LG_RPG ~ yearID, data = ALseason, span = 0.25)
ALRunScore.LO.25.predict <- predict(ALRunScore.LO.25)
# span = 0.5
ALRunScore.LO.5 <- loess(LG_RPG ~ yearID, data = ALseason, span = 0.5)
ALRunScore.LO.5.predict <- predict(ALRunScore.LO.5)
# NATIONAL LEAGUE
# create new object NLRunScore.LO for loess model (default span)
NLRunScore.LO <- loess(LG_RPG ~ yearID, data = NLseason)
NLRunScore.LO.predict <- predict(NLRunScore.LO)
# loess models with 'span' control (0.25 and 0.5)
NLRunScore.LO.25 <- loess(LG_RPG ~ yearID, data = NLseason, span = 0.25)
NLRunScore.LO.25.predict <- predict(NLRunScore.LO.25)
NLRunScore.LO.5 <- loess(LG_RPG ~ yearID, data = NLseason, span = 0.5)
NLRunScore.LO.5.predict <- predict(NLRunScore.LO.5)
```
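The `span` argument controls how much of the data each local fit uses: a smaller span gives a wigglier curve that tracks the points more closely. The first model for each league above omits `span`, so it uses the `loess()` default of 0.75. Here is a rough sketch of the effect on a synthetic series (invented for illustration, not baseball data):

```r
# Synthetic series (invented for illustration, not baseball data):
# a slow wave plus noise over 112 "seasons".
set.seed(42)
toy <- data.frame(year = 1901:2012)
toy$rpg <- 4.5 + 0.5 * sin((toy$year - 1901) / 8) + rnorm(nrow(toy), sd = 0.15)

# Fit the same model with a wide and a narrow smoothing window.
fit_wide   <- loess(rpg ~ year, data = toy, span = 0.75)  # loess() default span
fit_narrow <- loess(rpg ~ year, data = toy, span = 0.25)

# Residual sum of squares: the narrow span follows the data more closely.
rss <- c(wide   = sum(residuals(fit_wide)^2),
         narrow = sum(residuals(fit_narrow)^2))
print(rss)
```

A closer fit is not automatically a better one; a very small span starts to chase single-season noise, which is why I show both 0.25 and 0.50 below.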

Now that we have calculated the league averages and trend lines (using the loess method), we can start the plots. First, a simple plot of the actual values:

```r
# MULTI-PLOT -- MERGING AL AND NL RESULTS
# plot individual years as lines
ylim <- c(3, 6)
# start with AL line
plot(ALseason$LG_RPG ~ ALseason$yearID, type = "l", lty = "solid", col = "red",
     lwd = 2, main = "Runs per team per game, 1901-2012", ylim = ylim, xlab = "year",
     ylab = "runs per game")
# add NL line
lines(NLseason$yearID, NLseason$LG_RPG, lty = "solid", col = "blue", lwd = 2)
# chart additions
grid()
legend(1900, 3.5, c("AL", "NL"), lty = c("solid", "solid"), col = c("red", "blue"),
       lwd = c(2, 2))
```

Next, compare the league trends, using the loess curves with spans of 0.50 and 0.25.

```r
# plot multiple loess curves (span=0.50 and 0.25)
ylim <- c(3, 6)
# start with AL line
plot(ALRunScore.LO.5.predict ~ ALseason$yearID, type = "l", lty = "solid", col = "red",
     lwd = 2, main = "Runs per team per game, 1901-2012", ylim = ylim, xlab = "year",
     ylab = "runs per game")
# add NL line
lines(NLseason$yearID, NLRunScore.LO.5.predict, lty = "solid", col = "blue",
      lwd = 2)
# add 0.25 lines
lines(ALseason$yearID, ALRunScore.LO.25.predict, lty = "dashed", col = "red",
      lwd = 2)
lines(NLseason$yearID, NLRunScore.LO.25.predict, lty = "dashed", col = "blue",
      lwd = 2)
# chart additions
legend(1900, 3.5, c("AL (span=0.50)", "NL (span=0.50)", "AL (span=0.25)", "NL (span=0.25)"),
       lty = c("solid", "solid", "dashed", "dashed"), col = c("red", "blue", "red",
       "blue"), lwd = c(2, 2, 2, 2))
grid()
```

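Next, calculate the difference between the two leagues: both the absolute (year-by-year) difference in runs per game and the difference between the loess trend lines. In both cases the National League value is subtracted from the American League value, so a positive number means the AL was the higher-scoring league.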

```r
# 1. absolute
RunDiff <- (ALseason$LG_RPG - NLseason$LG_RPG)
# 2. LOESS span=0.25
RunDiffLO <- (ALRunScore.LO.25.predict - NLRunScore.LO.25.predict)
```
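The subtraction above works because both league tables cover the same seasons in the same order. Where that is not guaranteed, joining on `yearID` first is a safer pattern. Here is a minimal sketch with invented values (the `al` and `nl` tables are hypothetical stand-ins, not the real league tables):

```r
# Hypothetical league tables; the values are invented for illustration.
# The NL table is deliberately out of order and missing 1903.
al <- data.frame(yearID = c(1901, 1902, 1903), LG_RPG = c(5.3, 4.9, 4.1))
nl <- data.frame(yearID = c(1902, 1901),       LG_RPG = c(4.0, 4.6))

# Join on yearID so each AL value is paired with the NL value for the same year.
both <- merge(al, nl, by = "yearID", suffixes = c("_AL", "_NL"))
both$RunDiff <- both$LG_RPG_AL - both$LG_RPG_NL
print(both)
```

By default `merge()` keeps only the years present in both tables, so a season missing from one league drops out instead of silently shifting the comparison.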

And plot the differences.

```r
# plot each year absolute difference as bar, difference in trend as line
ylim <- c(-1, 1.5)
plot(RunDiff ~ ALseason$yearID, type = "h", lty = "solid", col = "blue", lwd = 2,
     main = "Run scoring trend: AL difference from NL, 1901-2012", ylim = ylim,
     xlab = "year", ylab = "runs per game")
# add loess trend difference (RunDiffLO) line
lines(ALseason$yearID, RunDiffLO, lty = "solid", col = "black", lwd = 2)
# add line at zero
abline(h = 0, lty = "dotdash")
# chart additions
grid()
legend(1900, 1.5, c("AL difference from NL: absolute", "AL difference from NL, LOESS (span=0.25)"),
       lty = c("solid", "solid"), col = c("blue", "black"), lwd = c(2, 2))
```

For the next “using R” post, I’ll take a look at the ways to plot the residuals from the loess method.

The one after that: ggplot2 versions of the graphs.

-30-
