# Riddler: Can You Beat MLB Recods?

**Posts | Joshua Cook**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## FiveThirtyEight’s Riddler Express

From Taylor Firman comes an opportunity to make baseball history:

This year, Major League Baseball announced it will play a shortened

60-game season, as opposed to the typical 162-game season. Baseball is

a sport of numbers and statistics, and so Taylor wondered about the

impact of the season’s length on some famous baseball records.Some statistics are more achievable than others in a shortened season.

Suppose your true batting average is .350, meaning you have a 35

percent chance of getting a hit with every at-bat. If you have four

at-bats per game, what are your chances of batting at least .400 over

the course of the 60-game season?2 And how does this compare to your

chances of batting at least .400 over the course of a 162-game season?

## Plan

This riddle should be pretty straight forward to solve statistically and

with simulations, so I will do both.

## Setup

knitr::opts_chunk$set(echo = TRUE, comment = "#>", cache = FALSE, dpi = 400) library(mustashe) library(tidyverse) library(conflicted) # Handle any namespace conflicts. conflict_prefer("filter", "dplyr") conflict_prefer("select", "dplyr") # Default 'ggplot2' theme. theme_set(theme_minimal()) # For reproducibility. set.seed(123)

## Statistical solution

This is a simple binomial system: at each at bat, the player either gets

a hit or not. If their real batting average is 0.350, that means the

probability of them getting a hit during each at bat is 35%. Thus,

according the the Central Limit Theorem, given a sufficiently large

number of at bats, the frequency of a hit should be 35%. This is because

the distribution converges towards the mean. However, at smaller

sample-sizes, the distribution is more broad, meaning that the observed

batting average has a greater chance of being further away from the true

value.

First, let’s answer the riddle. The solution is just the probability of

observing a batting average of 0.400 or greater. The first value is

computed using `dbinom()`

and the second cumulative probability is

calculated using `pbinom()`

, setting `lower.tail = FALSE`

to get the

tail above 0.400.

num_at_bats <- 60 * 4 real_batting_average <- 0.350 target_batting_average <- 0.400 prob_at_400 <- dbinom(x = target_batting_average * num_at_bats, size = num_at_bats, prob = real_batting_average) prob_above_400 <- pbinom(q = target_batting_average * num_at_bats, size = num_at_bats, prob = real_batting_average, lower.tail = FALSE) prob_at_400 + prob_above_400 #> [1] 0.06083863

**Under the described assumptions, there is a 6.1% chance of reaching a
batting average of 0.400 in the shorter season.**

For comparison, the chance for a normal 162-game season is calculated

below. Because $0.400 \times 162 \times 4$ is a non-integer value, an

exact 0.400 batting average cannot be obtained. Therefore, only the

probability of a batting average greater than 0.400 needs to be

calculated.

num_at_bats <- 162 * 4 real_batting_average <- 0.350 target_batting_average <- 0.400 prob_above_400 <- pbinom(q = target_batting_average * num_at_bats, size = num_at_bats, prob = real_batting_average, lower.tail = FALSE) prob_above_400 #> [1] 0.003789922

Over 162 games, there is a 0.4% chance of achieving a batting average of

at least 0.400.

## Simulation

The solution to this riddle could also be found by simulating a whole

bunch of seasons with the real batting average of 0.350 and then just

counting how frequently the simulations resulted in an observed batting

average of 0.400.

A single season can be simulated using the `rbinom()`

function where `n`

is the number of seasons to simulate, `size`

takes the number of at

bats, and `prob`

takes the true batting average. The returned value is a

sampled number of hits (“successes”) over the season from the binomial

distribution.

The first example shows the observed batting average from a single

season.

num_at_bats <- 60 * 4 real_batting_average <- 0.350 target_batting_average <- 0.400 rbinom(n = 1, size = num_at_bats, prob = real_batting_average) / num_at_bats #> [1] 0.3375

The `n = 1`

can just be replaced with a large number to simulate a bunch

of seasons. The average batting average over these seasons should be

close to the true batting average.

n_seasons <- 1e6 # 1 million simulations. sim_res <- rbinom(n = n_seasons, size = num_at_bats, prob = real_batting_average) sim_res <- sim_res / num_at_bats # The average batting average is near the true batting average of 0.350. mean(sim_res) #> [1] 0.3500121

The full distribution of batting averages over the 1 million simulations

is shown below.

tibble(sims = sim_res) %>% ggplot(aes(x = sims)) + geom_density(color = "black", fill = "black", adjust = 2, alpha = 0.2, size = 1.2, ) + geom_vline(xintercept = target_batting_average, color = "tomato", lty = 2, size = 1.2) + scale_y_continuous(expand = expansion(mult = c(0.01, 0.02))) + theme(plot.title = element_text(hjust = 0.5)) + labs(x = "1 million simulated season batting averages", y = "probability density", title = "Distribution of simulated batting averages in a 60-game season")

The answer from the simulation is pretty close to the actual answer.

mean(sim_res >= 0.40) #> [1] 0.060813

One last visualization I want to do demonstrates why the length of the

season matters to the distribution. Instead of using `rbinom()`

to

simulate the number of successes over the entire season, I use it below

to simulate a season’s-worth of individual at bats, returning a vector

of 0’s and 1’s. I then plotted the cumulative number of hits at each at

bat and colored the line by the running batting average.

The coloring shows how the batting average was more volatile when there

were fewer at bats.

sampled_at_bats <- rbinom(60*4, 1, 0.35) tibble(at_bat = sampled_at_bats) %>% mutate(i = row_number(), cum_total = cumsum(at_bat), running_avg = cum_total / i) %>% ggplot(aes(x = i, y = cum_total)) + geom_line(aes(color = running_avg), size = 1.2) + scale_color_viridis_c() + theme(plot.title = element_text(hjust = 0.5), legend.position = c(0.85, 0.35)) + labs(x = "at bat number", y = "total number of hits", color = "batting average", title = "Running batting average over a simulated season")

The following two plots do the same analysis many times to simulate many

seasons and color the lines by whether or not the final batting average

was at or above 0.400. As there are more games, the running batting

averages, which are essentially biased random walks, regress towards the

true batting average. (Note that I had to do 500 simulations for the

162-game season to get any simulations with a final batting average

above 0.400.)

simulate_season_at_bats <- function(num_at_bats) { sampled_at_bats <- rbinom(num_at_bats, size = 1, prob = 0.35) tibble(result = sampled_at_bats) %>% mutate(at_bat = row_number(), cum_total = cumsum(result), running_avg = cum_total / at_bat) } tibble(season = 1:100) %>% mutate(season_results = map(season, ~ simulate_season_at_bats(60 * 4))) %>% unnest(season_results) %>% group_by(season) %>% mutate( above_400 = 0.4 <= running_avg[which.max(at_bat)], above_400 = ifelse(above_400, "BA ≥ 0.400", "BA < 0.400") ) %>% ungroup() %>% ggplot(aes(x = at_bat, y = cum_total)) + geom_line(aes(group = season, color = above_400, alpha = above_400), size = 0.8) + geom_hline(yintercept = 0.4 * 60 * 4, color = "tomato", lty = 2, size = 1) + scale_color_manual(values = c("grey50", "dodgerblue")) + scale_alpha_manual(values = c(0.1, 1.0), guide = FALSE) + scale_x_continuous(expand = c(0, 0)) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), legend.position = c(0.85, 0.25)) + labs(x = "at bat number", y = "total number of hits", color = NULL, title = "Running batting averages over simulated 60-game seasons", subtitle = "Blue lines indicate a simulation with a final batting average of at least 0.400.")

**leave a comment**for the author, please follow the link and comment on their blog:

**Posts | Joshua Cook**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.