Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The new Major League Baseball season will soon begin, which means it’s time to look back and update my run scoring trends data visualization application, built using RStudio’s shiny package.
You can find the app here: https://monkmanmh.shinyapps.io/MLBrunscoring_shiny/
The github repo for this app is https://github.com/MonkmanMH/MLBrunscoring_shiny
This update gave me the opportunity to make some cosmetic tweaks to the front end, and more consequential changes to the code under the hood.

## 1. retired reshape, using tidyr

At one point in the app’s code, I had used the now-retired reshape package to melt a data table. Although this still works, I opted to update the code to use the gather() function in the package tidyr, part of the tidyverse.

## 2. feather instead of csv

The app relied on some pre-wrangled csv files; these have been replaced by files stored using the .feather format, which makes for a signficant performance improvement.

## 3. wrangling: the calculation of team and league run scoring averages

The goal is to create data tables that minimize the amount of processing the app has to do.
In previous versions of the app, the filtering of rows (records or observations) and selecting of columns (variables), the calculation of each team’s average runs scored and runs allowed per game, the league average runs per game, and the comparison of the team to the league, was done first using base R’s apply family of functions.
Then I switched to using dplyr, and although the steps were now in a pipe, this approach still required creating a separate data table with the league average, and then joining that table back into the main team table so that the team result could be compared to the league average.
For this iteration, preparing the data for the app is now done using tidyr::nest() and purrr::map(). What follows is a detailed explanation of how I approached this.
It’s always valuable to have your end-state in mind when working through a multi-step data wrangle like this. My goal is the values shown on the “team plot” tab of the app – an index value (i.e. a percentage) of a team’s average runs scored (and runs allowed) compared to the league run scoring rate, for a single season.

First, load the necessary packages, the first four of which are part of the tidyverse.

# tidyverse packages
library(dplyr)
library(purrr)
library(tidyr)

library(feather)

Teams <- read_csv("Teams.csv",
col_types = cols(
divID = col_character(),
DivWin = col_character(),
SF = col_character(),
WCWin = col_character()
))

## # A tibble: 6 x 48
##   yearID lgID  teamID franchID divID  Rank     G Ghome     W     L DivWin
##
## 1   1871   BS1    BNA            3    31    NA    20    10
## 2   1871   CH1    CNA            2    28    NA    19     9
## 3   1871   CL1    CFC            8    29    NA    10    19
## 4   1871   FW1    KEK            7    19    NA     7    12
## 5   1871   NY2    NNA            5    33    NA    16    17
## 6   1871   PH1    PNA            1    28    NA    21     7
## # ... with 37 more variables: WCWin , LgWin , WSWin ,
## #   R , AB , H , 2B , 3B , HR ,
## #   BB , SO , SB , CS , HBP , SF , RA ,
## #   ER , ERA , CG , SHO , SV , IPouts ,
## #   HA , HRA , BBA , SOA , E , DP ,
## #   FP , name , park , attendance , BPF ,
## #   PPF , teamIDBR , teamIDlahman45 , teamIDretro 

The table above has far more variables than what we need, and some that we’ll have to calculate.

### b. Create league summary tables

A short set of instructions that starts with the “Teams” table in the Lahman database and summarizes it for MLB run scoring trends Shiny app
Thus rather than having the app do the work of

1. remove unnecessary records (rows) and fields (columns) and
2. run the calculations for the runs-per-game, runs-allowed-per-game, and indexed versions of those,

the calculations are conducted here. This will vastly improve the performance of the app.

#### i. create nested table

I started with the “Many Models”” chapter of Wickham and Grolemund, R for Data Science. (And thanks to Dr. Charlotte Wickham, whose training course was invaluable in helping me wrap my head around this.)
At this point, the code

• filters out the years prior to 1901 and the misbegotten Federal League.
• and then creates a nested data table, starting with the group_by() year and league (lgID)
# select a sub-set of teams from 1901 [the establishment of the American League] forward to most recent year
Teams_lgyr <- Teams %>%
filter(yearID > 1900, lgID != "FL") %>%
group_by(yearID, lgID) %>%
nest()

Teams_lgyr
## # A tibble: 236 x 3
##    yearID lgID  data
##
##  1   1901 AL
##  2   1901 NL
##  3   1902 AL
##  4   1902 NL
##  5   1903 AL
##  6   1903 NL
##  7   1904 AL
##  8   1904 NL
##  9   1905 AL
## 10   1905 NL
## # ... with 226 more rows

Here’s a quick peek inside the first entry of the “data” column…the American League, 1901.

Teams_lgyr$data[[1]] ## # A tibble: 8 x 46 ## teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin ## ## 1 BLA NYY 5 134 66 68 65 N ## 2 BOS BOS 2 138 69 79 57 N ## 3 CHA CHW 1 137 71 83 53 Y ## 4 CLE CLE 7 138 69 54 82 N ## 5 DET DET 3 135 70 74 61 N ## 6 MLA BAL 8 139 70 48 89 N ## 7 PHA OAK 4 137 66 74 62 N ## 8 WS1 MIN 6 138 68 61 72 N ## # ... with 35 more variables: WSWin , R , AB , H , ## # 2B , 3B , HR , BB , SO , SB , ## # CS , HBP , SF , RA , ER , ERA , ## # CG , SHO , SV , IPouts , HA , HRA , ## # BBA , SOA , E , DP , FP , name , ## # park , attendance , BPF , PPF , teamIDBR , ## # teamIDlahman45 , teamIDretro  #### ii – functional programming This step creates a league run scoring function, and then applies that using the purrr::map() function. Note: • In the gapminder example in R for Data Science, the variables were called using their names. In this case, for a reason I have not yet determined, we have to specify the data object they are coming from; e.g. for the runs variable R, we have to use df$R (not just R).

First, a simple test, calculating runs scored, and checking to see if we got the right answer, b comparing that to the value calculated using dplyr:

# base R format
leagueRuns_fun <- function(df) {
sum(data = df$R) } league_year_runs <- map(Teams_lgyr$data, leagueRuns_fun)

league_year_runs[[1]]
## [1] 5873
#check the answer by old school dplyr method
Teams %>%
filter(yearID == 1901,
lgID == "AL") %>%
summarise(leagueruns = sum(R))
## # A tibble: 1 x 1
##   leagueruns
##
## 1       5873

Now we move on to the calculation of league averages.
For the first approach, the sum calculation is part of the function.

• There are two functions, one for Runs and the other for Runs Allowed. This is because I have not yet figured out how to specify two different variables (i.e. the name of the data object and the variable to be used in the function) in the map_() function and successfully have them carried into my calculation functions
• Also note that in order to be consistent with other sources, the number of games played is calculated using the sum of wins (W) and losses (L), rather than the number of games reported in the G variable.
# functions
leagueRPG_fun <- function(df) {
sum(data = df$R) / (sum(data = df$W) + sum(data = df$L)) } leagueRAPG_fun <- function(df) { sum(data = df$RA) / (sum(data = df$W) + sum(data = df$L))
}

# simple map version
league_year_RPG <- map(Teams_lgyr$data, leagueRPG_fun) # embed as new columns in nested data object Teams_lgyr <- Teams_lgyr %>% mutate(lgRPG = map_dbl(Teams_lgyr$data, leagueRPG_fun),
lgRAPG = map_dbl(Teams_lgyr$data, leagueRAPG_fun)) Teams_lgyr ## # A tibble: 236 x 5 ## yearID lgID data lgRPG lgRAPG ## ## 1 1901 AL 5.43 5.43 ## 2 1901 NL 4.69 4.69 ## 3 1902 AL 4.97 4.97 ## 4 1902 NL 4.09 4.09 ## 5 1903 AL 4.15 4.15 ## 6 1903 NL 4.85 4.85 ## 7 1904 AL 3.65 3.65 ## 8 1904 NL 3.98 3.98 ## 9 1905 AL 3.75 3.75 ## 10 1905 NL 4.16 4.16 ## # ... with 226 more rows In the second approach: • the league and year total runs, runs allowed, and games are first calculated using separate functions • RPG and RAPG for each league and year combination are then calculated outside the nested tibbles # more functions - individual league by year totals leagueR_fun <- function(df) { sum(data = df$R)
}

leagueRA_fun <- function(df) {
sum(data = df$RA) } leagueG_fun <- function(df) { (sum(data = df$W) + sum(data = df$L)) } Teams_lgyr <- Teams_lgyr %>% mutate(lgR = map_dbl(Teams_lgyr$data, leagueR_fun),
lgRA = map_dbl(Teams_lgyr$data, leagueRA_fun), lgG = map_dbl(Teams_lgyr$data, leagueG_fun))

Teams_lgyr <- Teams_lgyr %>%
mutate(lgRPG = (lgR / lgG),
lgRAPG = (lgRA / lgG))

Teams_lgyr
## # A tibble: 236 x 8
##    yearID lgID  data              lgRPG lgRAPG   lgR  lgRA   lgG
##
##  1   1901 AL      5.43   5.43  5873  5873  1082
##  2   1901 NL      4.69   4.69  5194  5194  1108
##  3   1902 AL      4.97   4.97  5407  5407  1088
##  4   1902 NL      4.09   4.09  4494  4494  1098
##  5   1903 AL      4.15   4.15  4543  4543  1096
##  6   1903 NL      4.85   4.85  5349  5349  1102
##  7   1904 AL      3.65   3.65  4433  4433  1216
##  8   1904 NL      3.98   3.98  4872  4872  1224
##  9   1905 AL      3.75   3.75  4547  4547  1212
## 10   1905 NL      4.16   4.16  5092  5092  1224
## # ... with 226 more rows

#### iii. save LG_RPG files

And then write csv and feather versions. As noted above, the shiny app now uses the feather format.
Notes:

• rounding of variables after all the calculations, simply to make the tables as viewed more legible.
• renaming of variables to correspond with shiny app names.
LG_RPG <- Teams_lgyr %>%
mutate(lgRPG = round(lgRPG, 2),
lgRAPG = round(lgRAPG, 2)) %>%
select(yearID, lgID, R = lgR, RA = lgRA, G = lgG,
leagueRPG = lgRPG, leagueRAPG = lgRAPG)

LG_RPG
## # A tibble: 236 x 7
##    yearID lgID      R    RA     G leagueRPG leagueRAPG
##
##  1   1901 AL     5873  5873  1082      5.43       5.43
##  2   1901 NL     5194  5194  1108      4.69       4.69
##  3   1902 AL     5407  5407  1088      4.97       4.97
##  4   1902 NL     4494  4494  1098      4.09       4.09
##  5   1903 AL     4543  4543  1096      4.15       4.15
##  6   1903 NL     5349  5349  1102      4.85       4.85
##  7   1904 AL     4433  4433  1216      3.65       3.65
##  8   1904 NL     4872  4872  1224      3.98       3.98
##  9   1905 AL     4547  4547  1212      3.75       3.75
## 10   1905 NL     5092  5092  1224      4.16       4.16
## # ... with 226 more rows
write_csv(LG_RPG, "LG_RPG.csv")
write_feather(LG_RPG, "LG_RPG.feather")

### c. Repeat for MLB total

This only differs from the league summaries in the level of nesting; instead of grouping by year and league, it’s only year (yearID).

Teams_lgyr <- Teams_lgyr %>%
unnest() %>%
group_by(yearID) %>%
nest()

Teams_lgyr
## # A tibble: 118 x 2
##    yearID data
##
##  1   1901
##  2   1902
##  3   1903
##  4   1904
##  5   1905
##  6   1906
##  7   1907
##  8   1908
##  9   1909
## 10   1910
## # ... with 108 more rows
Teams_lgyr <- Teams_lgyr %>%
mutate(mlbR = map_dbl(Teams_lgyr$data, leagueR_fun), mlbRA = map_dbl(Teams_lgyr$data, leagueRA_fun),
mlbG = map_dbl(Teams_lgyr\$data, leagueG_fun),
mlbRPG = (mlbR / mlbG),
mlbRAPG = (mlbRA / mlbG))

Teams_lgyr
## # A tibble: 118 x 7
##    yearID data                mlbR mlbRA  mlbG mlbRPG mlbRAPG
##
##  1   1901  11067 11067  2190   5.05    5.05
##  2   1902   9901  9901  2186   4.53    4.53
##  3   1903   9892  9892  2198   4.50    4.50
##  4   1904   9305  9305  2440   3.81    3.81
##  5   1905   9639  9639  2436   3.96    3.96
##  6   1906   8881  8878  2416   3.68    3.67
##  7   1907   8703  8703  2406   3.62    3.62
##  8   1908   8422  8422  2456   3.43    3.43
##  9   1909   8810  8810  2436   3.62    3.62
## 10   1910   9584  9584  2446   3.92    3.92
## # ... with 108 more rows

And again, we save the files for use in the shiny app.

MLB_RPG <- Teams_lgyr %>%
mutate(mlbRPG = round(mlbRPG, 2),
mlbRAPG = round(mlbRAPG, 2)) %>%
select(yearID, R = mlbR, RA = mlbRA, G = mlbG,
leagueRPG = mlbRPG, leagueRAPG = mlbRAPG)

write_csv(MLB_RPG, "MLB_RPG.csv")
write_feather(MLB_RPG, "MLB_RPG.feather")

### d. Individual team values

Calculate index of team run scoring against league average
Note that we start with unnest() and create a new object, Teams_append … a tibble with all of the variables exposed.

Teams_append <- Teams_lgyr %>%
unnest() %>%
mutate(teamRPG=(R / (W + L)),
teamRAPG=(RA / (W + L)),
WLpct=(W / (W + L))) %>%
# runs scored index where 100=the league average for that season
mutate(R_index = (teamRPG / lgRPG) * 100) %>%
mutate(R_index.sd = sd(R_index)) %>%
mutate(R_z = (R_index - 100) / R_index.sd) %>%
# runs allowed
mutate(RA_index = (teamRAPG / lgRAPG) * 100) %>%
mutate(RA_index.sd = sd(RA_index)) %>%
mutate(RA_z = (RA_index - 100) / RA_index.sd)

Teams_append
## # A tibble: 2,496 x 67
##    yearID  mlbR mlbRA  mlbG mlbRPG mlbRAPG lgID  lgRPG lgRAPG   lgR  lgRA
##
##  1   1901 11067 11067  2190   5.05    5.05 AL     5.43   5.43  5873  5873
##  2   1901 11067 11067  2190   5.05    5.05 AL     5.43   5.43  5873  5873
##  3   1901 11067 11067  2190   5.05    5.05 AL     5.43   5.43  5873  5873
##  4   1901 11067 11067  2190   5.05    5.05 AL     5.43   5.43  5873  5873
##  5   1901 11067 11067  2190   5.05    5.05 AL     5.43   5.43  5873  5873
##  6   1901 11067 11067  2190   5.05    5.05 AL     5.43   5.43  5873  5873
##  7   1901 11067 11067  2190   5.05    5.05 AL     5.43   5.43  5873  5873
##  8   1901 11067 11067  2190   5.05    5.05 AL     5.43   5.43  5873  5873
##  9   1901 11067 11067  2190   5.05    5.05 NL     4.69   4.69  5194  5194
## 10   1901 11067 11067  2190   5.05    5.05 NL     4.69   4.69  5194  5194
## # ... with 2,486 more rows, and 56 more variables: lgG ,
## #   teamID , franchID , divID , Rank , G ,
## #   Ghome , W , L , DivWin , WCWin , LgWin ,
## #   WSWin , R , AB , H , 2B , 3B ,
## #   HR , BB , SO , SB , CS , HBP , SF ,
## #   RA , ER , ERA , CG , SHO , SV ,
## #   IPouts , HA , HRA , BBA , SOA , E ,
## #   DP , FP , name , park , attendance ,
## #   BPF , PPF , teamIDBR , teamIDlahman45 ,
## #   teamIDretro , teamRPG , teamRAPG , WLpct ,
## #   R_index , R_index.sd , R_z , RA_index ,
## #   RA_index.sd , RA_z 

In this the final step, we first create a new data object Teams_merge.
Notes:

• rounding of a variety of the calculated variables, to address readability concerns.
• selection and renaming of variables to correspond with shiny app names.
• then write csv and feather versions.
Teams_merge <- Teams_append %>%
mutate(lgRPG = round(lgRPG, 2),
lgRAPG = round(lgRAPG, 2),
WLpct = round(WLpct, 3),
teamRPG = round(teamRPG, 2),
teamRAPG = round(teamRAPG, 2),
R_index = round(R_index, 1),
RA_index = round(RA_index, 1)
) %>%
select(yearID, lgID, franchID, teamID, name,
W, L, WLpct, R.x = R, RA.x = RA,
teamRPG, leagueRPG = lgRPG, R_index,
teamRAPG, leagueRAPG = lgRAPG, RA_index)

Teams_merge
## # A tibble: 2,496 x 16
##    yearID lgID  franchID teamID name      W     L WLpct   R.x  RA.x teamRPG
##
##  1   1901 AL    NYY      BLA    Balt~    68    65 0.511   760   750    5.71
##  2   1901 AL    BOS      BOS    Bost~    79    57 0.581   759   608    5.58
##  3   1901 AL    CHW      CHA    Chic~    83    53 0.61    819   631    6.02
##  4   1901 AL    CLE      CLE    Clev~    54    82 0.397   666   831    4.9
##  5   1901 AL    DET      DET    Detr~    74    61 0.548   741   694    5.49
##  6   1901 AL    BAL      MLA    Milw~    48    89 0.35    641   828    4.68
##  7   1901 AL    OAK      PHA    Phil~    74    62 0.544   805   760    5.92
##  8   1901 AL    MIN      WS1    Wash~    61    72 0.459   682   771    5.13
##  9   1901 NL    LAD      BRO    Broo~    79    57 0.581   744   600    5.47
## 10   1901 NL    ATL      BSN    Bost~    69    69 0.5     531   556    3.85
## # ... with 2,486 more rows, and 5 more variables: leagueRPG ,
## #   R_index , teamRAPG , leagueRAPG , RA_index
write_csv(Teams_merge, "Teams_merge.csv")
write_feather(Teams_merge, "Teams_merge.feather")

And the feather files can now be incorproated into the shiny app.
-30-