Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## FiveThirtyEight’s Riddler Express

From Taylor Firman comes an opportunity to make baseball history:

This year, Major League Baseball announced it will play a shortened 60-game season, as opposed to the typical 162-game season. Baseball is a sport of numbers and statistics, and so Taylor wondered about the impact of the season’s length on some famous baseball records.

Some statistics are more achievable than others in a shortened season. Suppose your true batting average is .350, meaning you have a 35 percent chance of getting a hit with every at-bat. If you have four at-bats per game, what are your chances of batting at least .400 over the course of the 60-game season?2 And how does this compare to your chances of batting at least .400 over the course of a 162-game season?

## Plan

This riddle should be pretty straight forward to solve statistically and with simulations, so I will do both.

knitr::opts_chunk$set(echo = TRUE, comment = "#>", cache = FALSE, dpi = 400) library(mustashe) library(tidyverse) library(conflicted) # Handle any namespace conflicts. conflict_prefer("filter", "dplyr") conflict_prefer("select", "dplyr") # Default 'ggplot2' theme. theme_set(theme_minimal()) # For reproducibility. set.seed(123) ## Statistical solution This is a simple binomial system: at each at bat, the player either gets a hit or not. If their real batting average is 0.350, that means the probability of them getting a hit during each at bat is 35%. Thus, according the the Central Limit Theorem, given a sufficiently large number of at bats, the frequency of a hit should be 35%. This is because the distribution converges towards the mean. However, at smaller sample-sizes, the distribution is more broad, meaning that the observed batting average has a greater chance of being further away from the true value. First, let’s answer the riddle. The solution is just the probability of observing a batting average of 0.400 or greater. The first value is computed using dbinom() and the second cumulative probability is calculated using pbinom(), setting lower.tail = FALSE to get the tail above 0.400. num_at_bats <- 60 * 4 real_batting_average <- 0.350 target_batting_average <- 0.400 prob_at_400 <- dbinom(x = target_batting_average * num_at_bats, size = num_at_bats, prob = real_batting_average) prob_above_400 <- pbinom(q = target_batting_average * num_at_bats, size = num_at_bats, prob = real_batting_average, lower.tail = FALSE) prob_at_400 + prob_above_400 #> [1] 0.06083863 Under the described assumptions, there is a 6.1% chance of reaching a batting average of 0.400 in the shorter season. For comparison, the chance for a normal 162-game season is calculated below. Because$0.400 \times 162 \times 4\$ is a non-integer value, an exact 0.400 batting average cannot be obtained. Therefore, only the probability of a batting average greater than 0.400 needs to be calculated.

num_at_bats <- 162 * 4
real_batting_average <- 0.350
target_batting_average <- 0.400
prob_above_400 <- pbinom(q = target_batting_average * num_at_bats,
size = num_at_bats,
prob = real_batting_average,
lower.tail = FALSE)
prob_above_400

#> [1] 0.003789922

Over 162 games, there is a 0.4% chance of achieving a batting average of at least 0.400.

## Simulation

The solution to this riddle could also be found by simulating a whole bunch of seasons with the real batting average of 0.350 and then just counting how frequently the simulations resulted in an observed batting average of 0.400.

A single season can be simulated using the rbinom() function where n is the number of seasons to simulate, size takes the number of at bats, and prob takes the true batting average. The returned value is a sampled number of hits (“successes”) over the season from the binomial distribution.

The first example shows the observed batting average from a single season.

num_at_bats <- 60 * 4
real_batting_average <- 0.350
target_batting_average <- 0.400
rbinom(n = 1, size = num_at_bats, prob = real_batting_average) / num_at_bats

#> [1] 0.3375

The n = 1 can just be replaced with a large number to simulate a bunch of seasons. The average batting average over these seasons should be close to the true batting average.

n_seasons <- 1e6 # 1 million simulations.
sim_res <- rbinom(n = n_seasons,
size = num_at_bats,
prob = real_batting_average)
sim_res <- sim_res / num_at_bats
# The average batting average is near the true batting average of 0.350.
mean(sim_res)

#> [1] 0.3500121

The full distribution of batting averages over the 1 million simulations is shown below.

tibble(sims = sim_res) %>%
ggplot(aes(x = sims)) +
geom_density(color = "black", fill = "black", adjust = 2,
alpha = 0.2, size = 1.2, ) +
geom_vline(xintercept = target_batting_average,
color = "tomato", lty = 2, size = 1.2) +
scale_y_continuous(expand = expansion(mult = c(0.01, 0.02))) +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "1 million simulated season batting averages",
y = "probability density",
title = "Distribution of simulated batting averages in a 60-game season")

The answer from the simulation is pretty close to the actual answer.

mean(sim_res >= 0.40)

#> [1] 0.060813

One last visualization I want to do demonstrates why the length of the season matters to the distribution. Instead of using rbinom() to simulate the number of successes over the entire season, I use it below to simulate a season’s-worth of individual at bats, returning a vector of 0’s and 1’s. I then plotted the cumulative number of hits at each at bat and colored the line by the running batting average.

The coloring shows how the batting average was more volatile when there were fewer at bats.

sampled_at_bats <- rbinom(60*4, 1, 0.35)
tibble(at_bat = sampled_at_bats) %>%
mutate(i = row_number(),
cum_total = cumsum(at_bat),
running_avg = cum_total / i) %>%
ggplot(aes(x = i, y = cum_total)) +
geom_line(aes(color = running_avg), size = 1.2) +
scale_color_viridis_c() +
theme(plot.title = element_text(hjust = 0.5),
legend.position = c(0.85, 0.35)) +
labs(x = "at bat number",
y = "total number of hits",
color = "batting average",
title = "Running batting average over a simulated season")

The following two plots do the same analysis many times to simulate many seasons and color the lines by whether or not the final batting average was at or above 0.400. As there are more games, the running batting averages, which are essentially biased random walks, regress towards the true batting average. (Note that I had to do 500 simulations for the 162-game season to get any simulations with a final batting average above 0.400.)

simulate_season_at_bats <- function(num_at_bats) {
sampled_at_bats <- rbinom(num_at_bats, size = 1, prob = 0.35)
tibble(result = sampled_at_bats) %>%
mutate(at_bat = row_number(),
cum_total = cumsum(result),
running_avg = cum_total / at_bat)
}
tibble(season = 1:100) %>%
mutate(season_results = map(season, ~ simulate_season_at_bats(60 * 4))) %>%
unnest(season_results) %>%
group_by(season) %>%
mutate(
above_400 = 0.4 <= running_avg[which.max(at_bat)],
above_400 = ifelse(above_400, "BA ≥ 0.400", "BA < 0.400")
) %>%
ungroup() %>%
ggplot(aes(x = at_bat, y = cum_total)) +
geom_line(aes(group = season, color = above_400,
alpha = above_400),
size = 0.8) +
geom_hline(yintercept = 0.4 * 60 * 4,
color = "tomato", lty = 2, size = 1) +
scale_color_manual(values = c("grey50", "dodgerblue")) +
scale_alpha_manual(values = c(0.1, 1.0), guide = FALSE) +
scale_x_continuous(expand = c(0, 0)) +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position = c(0.85, 0.25)) +
labs(x = "at bat number",
y = "total number of hits",
color = NULL,
title = "Running batting averages over simulated 60-game seasons",
subtitle = "Blue lines indicate a simulation with a final batting average of at least 0.400.")