(This article was first published on **Bayes Ball**, and kindly contributed to R-bloggers)

The statistical software R has an ever-expanding array of packages that provide pre-programmed functions and datasets. One such package is named Lahman, bundling the contents of the Lahman database into a quick-and-easy resource for R users. In addition to the data tables, the package also ships with a variety of example analyses and graphics that show how it can be used.

*Full disclosure: I am now one of the Lahman package project members.*

This is my first blog post using the Lahman package, and as a first step I will simply recreate the league run scoring trends graphs that I generated previously. Originally, I had used data from Baseball Reference, for the simple reason that the Lahman database does not, in its source form, contain any league-level aggregations.

The process for loading the Lahman package is as simple as any other R package; this simplicity is even greater if you are using an IDE such as RStudio. Once loaded, you have access to all the tables in the database, without any of the futzing that is sometimes required in tidying up a raw flat file (I find that variable names are sometimes lost or changed in translation).

The code (available as a gist here, downloadable as an R script file) creates a pair of tables, calculating each league’s run scoring rates by year. Then, recycling my earlier code, it calculates a series of trend lines using the loess method, and graphs those trend lines. For simplicity’s sake, only the final version of each graph is shown.

Step 1: install the package (if you haven’t already), access the library, and load the data table “Teams”.

```r
# load the package into R, and open the data table 'Teams' into the
# workspace
library("Lahman")
data(Teams)
```
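Step 1 mentions installing the package, but the snippet only loads it. On a fresh machine, one option is a small helper that installs a package only when it is missing. This is a minimal sketch: `ensure_package` is a hypothetical name, not something provided by Lahman or base R, and it assumes a CRAN mirror is already configured.

```r
# Hypothetical helper: install a package only if it is missing,
# then report whether it can now be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# Intended use at the top of the script (commented out here so that running
# the sketch does not trigger a download):
# ensure_package("Lahman")
# library("Lahman")
# data(Teams)
```

Checking with `requireNamespace()` first means the script can be re-run without reinstalling the package every time.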

The second step is to use the individual team season results to calculate the aggregate of each league’s year. We start with 1901, the year the American League was formed. Once those tables are created, the loess function is used to calculate trend lines for each league’s run scoring environment.

```r
# ===== CREATE LEAGUE SUMMARY TABLES
#
# select a sub-set of teams from 1901 [the establishment of the American
# League] forward to 2012
# (the upper bound only matters if the installed Lahman release includes
# later seasons)
Teams_sub <- as.data.frame(subset(Teams, yearID > 1900 & yearID <= 2012))
# calculate each team's average runs and runs allowed per game
Teams_sub$RPG <- Teams_sub$R/Teams_sub$G
Teams_sub$RAPG <- Teams_sub$RA/Teams_sub$G
# create new data frame with season totals for each league
LG_RPG <- aggregate(cbind(R, RA, G) ~ yearID + lgID, data = Teams_sub, sum)
# calculate league + season runs and runs allowed per game
LG_RPG$LG_RPG <- LG_RPG$R/LG_RPG$G
LG_RPG$LG_RAPG <- LG_RPG$RA/LG_RPG$G
# read the data into separate league tables
ALseason <- (subset(LG_RPG, yearID > 1900 & lgID == "AL"))
NLseason <- (subset(LG_RPG, yearID > 1900 & lgID == "NL"))
```
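The aggregation step is easy to check on a small table before running it on the full data. The sketch below uses made-up numbers (they are not taken from the Lahman database) and the name `toy` is purely illustrative. It shows that `aggregate()` with `cbind()` on the left-hand side sums several columns at once for each year and league combination.

```r
# Toy data: invented values, only to show the shape of the calculation
toy <- data.frame(
  yearID = c(1901, 1901, 1901, 1901, 1902, 1902),
  lgID   = c("AL", "AL", "NL", "NL", "AL", "NL"),
  R      = c(700, 600, 650, 550, 720, 610),
  RA     = c(650, 640, 600, 620, 700, 630),
  G      = c(140, 136, 140, 138, 140, 140)
)

# sum runs, runs allowed and games within each year + league combination
toy_lg <- aggregate(cbind(R, RA, G) ~ yearID + lgID, data = toy, sum)

# league rate = total runs / total games (for AL 1901: R = 1300, G = 276)
toy_lg$LG_RPG <- toy_lg$R / toy_lg$G
toy_lg
```

Dividing the summed runs by the summed games weights each team by the number of games it played, which is not quite the same as averaging the team-level rates when teams play different numbers of games.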

```r
# ===== TRENDS: RUNS SCORED PER GAME
#
# AMERICAN LEAGUE
# create new object ALRunScore.LO for loess model (default span)
ALRunScore.LO <- loess(LG_RPG ~ yearID, data = ALseason)
ALRunScore.LO.predict <- predict(ALRunScore.LO)
# create new objects ALRunScore.LO.XX for loess models with 'span' control
# span = 0.25
ALRunScore.LO.25 <- loess(LG_RPG ~ yearID, data = ALseason, span = 0.25)
ALRunScore.LO.25.predict <- predict(ALRunScore.LO.25)
# span = 0.5
ALRunScore.LO.5 <- loess(LG_RPG ~ yearID, data = ALseason, span = 0.5)
ALRunScore.LO.5.predict <- predict(ALRunScore.LO.5)
# NATIONAL LEAGUE
# create new object NLRunScore.LO for loess model (default span)
NLRunScore.LO <- loess(LG_RPG ~ yearID, data = NLseason)
NLRunScore.LO.predict <- predict(NLRunScore.LO)
# loess models with 'span' control (0.25 and 0.5)
NLRunScore.LO.25 <- loess(LG_RPG ~ yearID, data = NLseason, span = 0.25)
NLRunScore.LO.25.predict <- predict(NLRunScore.LO.25)
NLRunScore.LO.5 <- loess(LG_RPG ~ yearID, data = NLseason, span = 0.5)
NLRunScore.LO.5.predict <- predict(NLRunScore.LO.5)
```
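To get a feel for what the `span` argument does, it can help to fit loess curves to a series where the underlying shape is known. The sketch below uses simulated values (a sine wave plus noise, not real run-scoring data), and the names `toy`, `fit_default`, `fit_25` and `rss` are illustrative only. A smaller span uses a narrower window of neighbouring years for each local fit, so the curve follows the points more closely and leaves smaller residuals.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# simulated "runs per game" series: smooth wave plus random noise
toy <- data.frame(yearID = 1901:2012)
toy$RPG <- 4.5 + 0.5 * sin((toy$yearID - 1901) / 10) +
  rnorm(nrow(toy), sd = 0.2)

fit_default <- loess(RPG ~ yearID, data = toy)               # span = 0.75
fit_25      <- loess(RPG ~ yearID, data = toy, span = 0.25)

# residual sum of squares: how far the points sit from each curve
rss <- function(fit) sum(residuals(fit)^2)
c(default = rss(fit_default), span_25 = rss(fit_25))
```

A lower residual sum of squares is not automatically better: a very small span starts to chase year-to-year noise, which is why comparing a few spans side by side is useful.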

Now that we have calculated the league averages and trend lines (using the loess method), we can start the plots. First, a simple plot of the actual values:

```r
# MULTI-PLOT -- MERGING AL AND NL RESULTS
# plot individual years as lines
ylim <- c(3, 6)
# start with AL line
plot(ALseason$LG_RPG ~ ALseason$yearID, type = "l", lty = "solid", col = "red",
     lwd = 2, main = "Runs per team per game, 1901-2012", ylim = ylim, xlab = "year",
     ylab = "runs per game")
# add NL line
lines(NLseason$yearID, NLseason$LG_RPG, lty = "solid", col = "blue", lwd = 2)
# chart additions
grid()
legend(1900, 3.5, c("AL", "NL"), lty = c("solid", "solid"), col = c("red", "blue"),
       lwd = c(2, 2))
```

Next, comparing the league trends.

```r
# plot multiple loess curves (span=0.50 and 0.25)
ylim <- c(3, 6)
# start with AL line
plot(ALRunScore.LO.5.predict ~ ALseason$yearID, type = "l", lty = "solid", col = "red",
     lwd = 2, main = "Runs per team per game, 1901-2012", ylim = ylim, xlab = "year",
     ylab = "runs per game")
# add NL line
lines(NLseason$yearID, NLRunScore.LO.5.predict, lty = "solid", col = "blue",
      lwd = 2)
# add 0.25 lines
lines(ALseason$yearID, ALRunScore.LO.25.predict, lty = "dashed", col = "red",
      lwd = 2)
lines(NLseason$yearID, NLRunScore.LO.25.predict, lty = "dashed", col = "blue",
      lwd = 2)
# chart additions
legend(1900, 3.5, c("AL (span=0.50)", "NL (span=0.50)", "AL (span=0.25)", "NL (span=0.25)"),
       lty = c("solid", "solid", "dashed", "dashed"), col = c("red", "blue", "red",
       "blue"), lwd = c(2, 2, 2, 2))
grid()
```

Next, calculate the difference between the two leagues: both the actual year-by-year difference (labelled “absolute” in the code, although it keeps its sign) and the difference between the loess trend lines.

```r
# 1. absolute
RunDiff <- (ALseason$LG_RPG - NLseason$LG_RPG)
# 2. LOESS span=0.25
RunDiffLO <- (ALRunScore.LO.25.predict - NLRunScore.LO.25.predict)
```
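Subtracting one vector from the other works because both league tables hold the same seasons in the same order. If that were ever in doubt, one alternative is to join the two tables on `yearID` first, so the pairing is explicit. This is a sketch on invented values; the names `al`, `nl` and `both` are illustrative and not part of the original script.

```r
# Toy league tables: invented values, with NL rows deliberately out of order
al <- data.frame(yearID = c(1901, 1902, 1903), LG_RPG = c(5.0, 4.5, 4.0))
nl <- data.frame(yearID = c(1903, 1901, 1902), LG_RPG = c(4.5, 4.5, 4.0))

# join on year so each AL season is paired with the matching NL season
both <- merge(al, nl, by = "yearID", suffixes = c(".AL", ".NL"))

# AL minus NL, now guaranteed to compare like with like
both$RunDiff <- both$LG_RPG.AL - both$LG_RPG.NL
both
```

A plain subtraction of `al$LG_RPG - nl$LG_RPG` on these toy tables would silently pair 1901 with 1903; the merged version avoids that.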

And plot the differences.

```r
# plot each year absolute difference as bar, difference in trend as line
ylim <- c(-1, 1.5)
plot(RunDiff ~ ALseason$yearID, type = "h", lty = "solid", col = "blue", lwd = 2,
     main = "Run scoring trend: AL difference from NL, 1901-2012", ylim = ylim,
     xlab = "year", ylab = "runs per game")
# add RunDiffLO line (difference between the loess trends)
lines(ALseason$yearID, RunDiffLO, lty = "solid", col = "black", lwd = 2)
# add line at zero
abline(h = 0, lty = "dotdash")
# chart additions
grid()
legend(1900, 1.5, c("AL difference from NL: absolute", "AL difference from NL, LOESS (span=0.25)"),
       lty = c("solid", "solid"), col = c("blue", "black"), lwd = c(2, 2))
```

For the next “using R” post, I’ll take a look at the ways to plot the residuals from the loess method.

The one after that: ggplot2 versions of the graphs.

-30-
