Uncanny X-Men: Bayesian take on Dr. Silge’s analysis

[This article was first published on Posts | Joshua Cook, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The other day, Dr. Silge from RStudio posted the screencast and blog post of her #TidyTuesday analysis of the Uncanny X-Men data set from Claremont Run Project. In her analysis, she used logistic regression to model the effect of various features of each comic book issue on the likelihood of the characters to visit the X-Mansion at least once. She also built a similar model for whether or not the comic book issue passed the Bechdel test.

One thing that caught my eye was that she used bootstrap re-sampling to build a distribution of values for each parameter for the models. To me, this resembled using Markov Chain Monte Carlo (MCMC) sampling methods for fitting models in Bayesian statistics. Therefore, I thought it would be interesting to fit the same logistic model (I only analyzed the first one on visiting the X-Mansion) using Bayesian methods and compare the results and interpretations.

Just to be clear, this article is not meant as a criticism of Dr. Silge or her work. I greatly enjoy watching her TidyTuesday screencasts, and she seems like a delightful person and strong data scientist.

Dr. Silge’s analysis

The following was taken from Dr. Silge’s blog post. I provide brief explanations about each step, though more information and explanation can be found in the original article.

Data preparation

First the data was downloaded from the TidyTuesday GitHub repository and loaded into R.

character_visualization <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/character_visualization.csv")

#> Parsed with column specification:
#> cols(
#> issue = col_double(),
#> costume = col_character(),
#> character = col_character(),
#> speech = col_double(),
#> thought = col_double(),
#> narrative = col_double(),
#> depicted = col_double()
#> )

xmen_bechdel <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/xmen_bechdel.csv")

#> Parsed with column specification:
#> cols(
#> issue = col_double(),
#> pass_bechdel = col_character(),
#> notes = col_character(),
#> reprint = col_logical()
#> )

locations <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/locations.csv")

#> Parsed with column specification:
#> cols(
#> issue = col_double(),
#> location = col_character(),
#> context = col_character(),
#> notes = col_character()
#> )

Dr. Silge first created the per_issue data frame that is a aggregation over all of the main characters summarizing number of speech bubbles (speech), number of thought bubbles (thought), number of times the characters were involved in narrative statements (narrative), and the total number of depictions (depicted) in each issue.

per_issue <- character_visualization %>%
group_by(issue) %>%
summarise(across(speech:depicted, sum)) %>%
ungroup()
per_issue

#> # A tibble: 196 x 5
#> issue speech thought narrative depicted
#>     
#> 1 97 146 13 71 168
#> 2 98 172 9 29 180
#> 3 99 105 22 29 124
#> 4 100 141 28 7 122
#> 5 101 158 27 58 191
#> 6 102 78 27 33 133
#> 7 103 91 6 25 121
#> 8 104 142 15 25 165
#> 9 105 83 12 24 128
#> 10 106 20 6 20 16
#> # … with 186 more rows

She also made the x_mansion data frame which just says whether each issue visited the X-Mansion at least once and then joined that with per_issue to create locations_joined.

x_mansion <- locations %>%
group_by(issue) %>%
summarise(mansion = "X-Mansion" %in% location)
locations_joined <- per_issue %>%
inner_join(x_mansion)

#> Joining, by = "issue"

Modeling

To get a distribution of parameter estimates, Dr. Silge bootstrapped 1,000 versions of locations_joined and fit a separate logistic model to each. She then extracted the coefficients of each model and used the percentile interval method (int_pctl()) to gather estimates and confidence intervals for the bootstraps.

set.seed(123)
boots <- bootstraps(locations_joined, times = 1000, apparent = TRUE)
boot_models <- boots %>%
mutate(
model = map(
splits,
~ glm(mansion ~ speech + thought + narrative + depicted,
family = "binomial", data = analysis(.)
)
),
coef_info = map(model, tidy)
)
boot_coefs <- boot_models %>%
unnest(coef_info)
int_pctl(boot_models, coef_info)

#> # A tibble: 5 x 6
#> term .lower .estimate .upper .alpha .method
#>      
#> 1 (Intercept) -2.42 -1.29 -0.277 0.05 percentile
#> 2 depicted 0.00193 0.0103 0.0196 0.05 percentile
#> 3 narrative -0.0106 0.00222 0.0143 0.05 percentile
#> 4 speech -0.0148 -0.00716 0.000617 0.05 percentile
#> 5 thought -0.0143 -0.00338 0.00645 0.05 percentile

The boostrapped distributions are shown below.

boot_coefs %>%
filter(term != "(Intercept)") %>%
mutate(term = fct_inorder(term)) %>%
ggplot(aes(estimate, fill = term)) +
geom_vline(
xintercept = 0, color = "gray50",
alpha = 0.6, lty = 2, size = 1.5
) +
geom_histogram(alpha = 0.8, bins = 25, show.legend = FALSE) +
facet_wrap(~term, scales = "free") +
labs(
title = "Which issues contain the X-Mansion as a location?",
subtitle = "Comparing the top 25 characters' speech, thought, narrative portrayal, and total depictions",
caption = "Data from the Claremont Run Project"
)


The Bayesian way

Bayesian modeling is the practice of updating our prior beliefs using observed data to produce a probability distribtion for the values of unknown parameters. Thus, unlike the single point-estimates provided by “frequentist” approaches, the results of a Bayesian analysis are the distributions of estimated parameters. This is why Dr. Silge’s bootstrapping analysis reminded by of Bayesian regression modeling.

The libraries

The ‘rstanarm’ package was used to fit the model, and ‘tidybayes’, ‘bayestestR’, and ‘see’ were used for investigating the model’s estimates (‘bayestestR’ and ‘see’ are both from the ‘easystats’ suite of packages).

library(rstanarm)
library(tidybayes)
library(bayestestR)
library(see)

Fitting the model

The stan_glm() function is the ‘rstanarm’ equivalent of glm(). The only additional arguments to include are the prior distributions for the predictor coefficients and intercept. Here, I kept it simple by using normal distributions that were not too biased. A thorough analysis would include a section where the impact of different prior distributions would be assessed.

bayes_mansion <- stan_glm(
mansion ~ speech + thought + narrative + depicted,
family = binomial(link = "logit"),
data = locations_joined,
prior = normal(location = 0, scale = 0.5),
prior_intercept = normal(location = 0, scale = 3)
)

#>
#> SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
#> Chain 1:
#> Chain 1: Gradient evaluation took 0.000213 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.13 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1:
#> Chain 1:
#> Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
#> Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
#> Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
#> Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
#> Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
#> Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
#> Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
#> Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
#> Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
#> Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
#> Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
#> Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
#> Chain 1:
#> Chain 1: Elapsed Time: 0.22106 seconds (Warm-up)
#> Chain 1: 0.172703 seconds (Sampling)
#> Chain 1: 0.393763 seconds (Total)
#> Chain 1:
#>
#> SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
#> Chain 2:
#> Chain 2: Gradient evaluation took 6.3e-05 seconds
#> Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.63 seconds.
#> Chain 2: Adjust your expectations accordingly!
#> Chain 2:
#> Chain 2:
#> Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
#> Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
#> Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
#> Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
#> Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
#> Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
#> Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
#> Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
#> Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
#> Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
#> Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
#> Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
#> Chain 2:
#> Chain 2: Elapsed Time: 0.188406 seconds (Warm-up)
#> Chain 2: 0.310647 seconds (Sampling)
#> Chain 2: 0.499053 seconds (Total)
#> Chain 2:
#>
#> SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
#> Chain 3:
#> Chain 3: Gradient evaluation took 5.3e-05 seconds
#> Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.53 seconds.
#> Chain 3: Adjust your expectations accordingly!
#> Chain 3:
#> Chain 3:
#> Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
#> Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
#> Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
#> Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
#> Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
#> Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
#> Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
#> Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
#> Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
#> Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
#> Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
#> Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
#> Chain 3:
#> Chain 3: Elapsed Time: 0.159436 seconds (Warm-up)
#> Chain 3: 0.220296 seconds (Sampling)
#> Chain 3: 0.379732 seconds (Total)
#> Chain 3:
#>
#> SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
#> Chain 4:
#> Chain 4: Gradient evaluation took 1.5e-05 seconds
#> Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.15 seconds.
#> Chain 4: Adjust your expectations accordingly!
#> Chain 4:
#> Chain 4:
#> Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
#> Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
#> Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
#> Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
#> Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
#> Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
#> Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
#> Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
#> Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
#> Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
#> Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
#> Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
#> Chain 4:
#> Chain 4: Elapsed Time: 0.169659 seconds (Warm-up)
#> Chain 4: 0.129054 seconds (Sampling)
#> Chain 4: 0.298713 seconds (Total)
#> Chain 4:

Model evaluation

Now that the model is fit, the next step is to inspect the posterior distributions of the coefficients.

plot(bayes_mansion, prob = 0.50, prob_outer = 0.89)

Each dot represents the mean of the posterior distribution for the coefficient along with the 50% and 89% density intervals. We can see that the intercept is quite large and negative, indicating that, on average, the X-Men tended to not visit the X-Mansion. Comparably, the distributions for the other coefficients are very small and located close to 0. This suggests that they do not poses much additional information on whether or not the X-Men visited the X-Mansion.

Another useful plot in Bayesian analysis is of the Highest Density Interval (HDI), the smallest range of parameter values that hold a given density of the distribution. With the 89% HDI for a posterior distribution, we can say that, given the structure of the model and observed data, there is an 89% chance that the real parameter value lies within the range. This is one method for understanding the confidence of the estimated value.

plot(bayestestR::hdi(bayes_mansion, ci = c(0.5, 0.75, 0.89, 0.95)))

From the HDI shown above, we can see that the coefficients for the number of speech bubles (speech) and number of times the characters were depicted depicted in an issue were the strongest predictors. The 89% HDI for speech includes 0 while the 95% HDI for depicted includes 0. Therefore, none of these posterior distributions are particularly exciting as they are all very small (in conjunction with a strong intercept) and have a fair chance of actually being 0.

Two other measurements that are useful for Bayesian analysis are the propbability of direction (PD) and the region of practical equivalence (ROPE). Without going too in-depth, the PD is the probability that a parameter is positive or negative. It ranges from 0.5 to 1, a value of 1 indicates it is definitely positive or negative (i.e. non-zero). The ROPE is a similar value but accounts for effect size by measuring how much of the posterior distribution lies within a region that we (the analysts) would say is effectively zero. Thus, if the ROPE is high, then it is unlikely that the parameter’s value has much importance.

The following table provides a summary of the posterior distributions for this model.

bayestestR::describe_posterior(bayes_mansion)

#> Possible multicollinearity between depicted and speech (r = 0.7). This might lead to inappropriate results. See 'Details' in '?rope'.
#> # Description of Posterior Distributions
#>
#> Parameter | Median | 89% CI | pd | 89% ROPE | % in ROPE | Rhat | ESS
#> ------------------------------------------------------------------------------------------------
#> (Intercept) | -1.267 | [-2.095, -0.467] | 0.997 | [-0.181, 0.181] | 0 | 1.000 | 5612.356
#> speech | -0.004 | [-0.010, 0.001] | 0.900 | [-0.181, 0.181] | 100 | 1.001 | 2401.035
#> thought | -0.003 | [-0.011, 0.006] | 0.697 | [-0.181, 0.181] | 100 | 1.001 | 2865.077
#> narrative | 0.004 | [-0.005, 0.013] | 0.771 | [-0.181, 0.181] | 100 | 1.000 | 3264.987
#> depicted | 0.007 | [ 0.001, 0.014] | 0.961 | [-0.181, 0.181] | 100 | 1.000 | 2219.555

We can see that even though the PD for speech and depicted are close to 1.0 indicating they are likely non-zero, the “% in ROPE” is 100% suggesting the differences are unimportant.

Posterior predictive checks

The last step of this analysis is to make predictions using the model. First, we can make predictions on the provided data to see how well the model fit the data. Second, we can input new data points that are interesting to us to see how they impact the model’s predictions.

The plot below shows the distribution of posterior predictions on the original data. The plotted values are the predicted probability that the X-Mansion was visited in the comic book issue, separated by whether or not the X-Mansion was actually visited. The two distributions look almost identical. This is not surprising because the coefficients fit to the variables were so small, they do not provide much additional information for the prediction. Therefore, the model is primarily relying upon the intercept to calculate an estimate.

# From the 'rethinking' package.
logistic <- function (x) {
p <- 1/(1 + exp(-x))
p <- ifelse(x == Inf, 1, p)
return(p)
}
locations_joined %>%
mutate(mansion_predict = logistic(predict(bayes_mansion))) %>%
ggplot(aes(x = mansion_predict, color = mansion, fill = mansion)) +
geom_density(size = 1.2, alpha = 0.2) +
geom_vline(xintercept = 0.5, size = 1.2, lty = 2, color = grey) +
scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = c(0.65, 0.73)) +
labs(x = "predicted probability of being in the X-Mansion",
y = "probability density",
title = "The Bayesian logistic model's posterior predictions",
color = "was in the\nX-mansion",
fill = "was in the\nX-mansion")

The second common type of posterior prediction is to make data that varies one or two variables and holds the rest of the variables constant. Since depicted had the largest predicted effect of the non-intercept coefficients, I decided to conduct this posterior predictive check on the values of this variable. Therefore, I created the pred_data data frame which has 100 values across the range of depicted and just the average values for the rest of the variables. Making predictions on this artifical data set will show the effect of the depicted variable while holding the other variables constant.

Using the add_fitted_draws() function from ‘tidybayes’, 200 predictions were made for the artificial data.

pred_data <- locations_joined %>%
summarise(across(issue:narrative, mean)) %>%
mutate(depicted = list(modelr::seq_range(locations_joined$depicted, n = 100))) %>%
unnest(depicted) %>%
add_fitted_draws(bayes_mansion, n = 200)
pred_data

#> # A tibble: 20,000 x 10
#> # Groups: issue, speech, thought, narrative, depicted, .row [100]
#> issue speech thought narrative depicted .row .chain .iteration .draw .value
#>          
#> 1 189. 143. 44.2 48.7 16 1 NA NA 1 0.125
#> 2 189. 143. 44.2 48.7 16 1 NA NA 6 0.143
#> 3 189. 143. 44.2 48.7 16 1 NA NA 15 0.117
#> 4 189. 143. 44.2 48.7 16 1 NA NA 47 0.0890
#> 5 189. 143. 44.2 48.7 16 1 NA NA 86 0.272
#> 6 189. 143. 44.2 48.7 16 1 NA NA 118 0.101
#> 7 189. 143. 44.2 48.7 16 1 NA NA 131 0.158
#> 8 189. 143. 44.2 48.7 16 1 NA NA 133 0.279
#> 9 189. 143. 44.2 48.7 16 1 NA NA 140 0.0990
#> 10 189. 143. 44.2 48.7 16 1 NA NA 154 0.109
#> # … with 19,990 more rows

The following plot shows the logistic curves for the artificial data. Each curve represents an individual prediction over the range of depicted values. The original data is also plotted on top.

# Just to shift the `mansion` values for plotting purposes.
locations_joined_mod <- locations_joined %>%
mutate(mansion_num = as.numeric(mansion) + ifelse(mansion, -0.1, 0.1))
pred_data %>%
ggplot(aes(x = depicted, y = .value)) +
geom_line(aes(group = .draw), alpha = 0.1) +
geom_jitter(aes(y = mansion_num, color = mansion),
data = locations_joined_mod,
height = 0.08, width = 0,
size = 2.2, alpha = 0.5) +
scale_color_brewer(palette = "Dark2") +
scale_x_continuous(expand = expansion(mult = c(0.02, 0.02))) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
labs(x = "depicted",
y = "probability of being in the X-mansion",
color = "was in the\nX-Mansion",
title = "Posterior predictions of the effect of the number\nof depictions of the main characters",
subtitle = "All other predictors were held constant at their average value.")

We can see that there is a general tendency for the model to predict that the episode visited the X-Mansion as the value for depicted increases, but it is a very gradual sinusoidal curve because the posterior distribution was located so close to zero. We can also see how the curve is shifted closer to 0 for most values of depicted. This is because of the strong intercept value.


Wrapping-up

Overall, I think that is was an interesting comparison between a frequentist approach to building a distribution of coefficient values and the Bayesian method of analyzing a model. I am not in the position to provide a theoretical comparison between the two approaches, though I would say that, personally, interpreting the Bayesian posterior distribution is more intuitive than interpreting the bootstrapped distribution.

To leave a comment for the author, please follow the link and comment on their blog: Posts | Joshua Cook.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)