# Uncanny X-Men: Bayesian take on Dr. Silge’s analysis

*By Joshua Cook, kindly contributed to R-bloggers.*

The other day, Dr. Silge from RStudio posted the screencast and blog post of her `#TidyTuesday` analysis of the Uncanny X-Men data set from the Claremont Run Project. In her analysis, she used logistic regression to model the effect of various features of each comic book issue on the likelihood that the characters visit the X-Mansion at least once. She also built a similar model for whether or not the comic book issue passed the Bechdel test.

One thing that caught my eye was that she used bootstrap re-sampling to build a distribution of values for each parameter of the models. To me, this resembled using Markov chain Monte Carlo (MCMC) sampling methods for fitting models in Bayesian statistics. Therefore, I thought it would be interesting to fit the same logistic model (I only analyzed the first one, on visiting the X-Mansion) using Bayesian methods and compare the results and interpretations.

Just to be clear, this article is not meant as a criticism of Dr. Silge or her work. I greatly enjoy watching her TidyTuesday screencasts, and she seems like a delightful person and strong data scientist.

## Dr. Silge’s analysis

The following was taken from Dr. Silge's blog post. I provide brief explanations of each step, though more detail can be found in the original article.

### Data preparation

First, the data was downloaded from the TidyTuesday GitHub repository and loaded into R.

```r
character_visualization <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/character_visualization.csv")
#> Parsed with column specification:
#> cols(
#>   issue = col_double(),
#>   costume = col_character(),
#>   character = col_character(),
#>   speech = col_double(),
#>   thought = col_double(),
#>   narrative = col_double(),
#>   depicted = col_double()
#> )

xmen_bechdel <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/xmen_bechdel.csv")
#> Parsed with column specification:
#> cols(
#>   issue = col_double(),
#>   pass_bechdel = col_character(),
#>   notes = col_character(),
#>   reprint = col_logical()
#> )

locations <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/locations.csv")
#> Parsed with column specification:
#> cols(
#>   issue = col_double(),
#>   location = col_character(),
#>   context = col_character(),
#>   notes = col_character()
#> )
```

Dr. Silge first created the `per_issue` data frame, an aggregation over all of the main characters summarizing the number of speech bubbles (`speech`), the number of thought bubbles (`thought`), the number of times the characters were involved in narrative statements (`narrative`), and the total number of depictions (`depicted`) in each issue.

```r
per_issue <- character_visualization %>%
  group_by(issue) %>%
  summarise(across(speech:depicted, sum)) %>%
  ungroup()

per_issue
#> # A tibble: 196 x 5
#>    issue speech thought narrative depicted
#>    <dbl>  <dbl>   <dbl>     <dbl>    <dbl>
#>  1    97    146      13        71      168
#>  2    98    172       9        29      180
#>  3    99    105      22        29      124
#>  4   100    141      28         7      122
#>  5   101    158      27        58      191
#>  6   102     78      27        33      133
#>  7   103     91       6        25      121
#>  8   104    142      15        25      165
#>  9   105     83      12        24      128
#> 10   106     20       6        20       16
#> # … with 186 more rows
```

She also made the `x_mansion` data frame, which just says whether the X-Mansion was visited at least once in each issue, and then joined that with `per_issue` to create `locations_joined`.

```r
x_mansion <- locations %>%
  group_by(issue) %>%
  summarise(mansion = "X-Mansion" %in% location)

locations_joined <- per_issue %>%
  inner_join(x_mansion)
#> Joining, by = "issue"
```

### Modeling

To get a distribution of parameter estimates, Dr. Silge bootstrapped 1,000 versions of `locations_joined` and fit a separate logistic model to each. She then extracted the coefficients of each model and used the percentile interval method (`int_pctl()`) to gather estimates and confidence intervals from the bootstraps.

```r
set.seed(123)
boots <- bootstraps(locations_joined, times = 1000, apparent = TRUE)

boot_models <- boots %>%
  mutate(
    model = map(
      splits,
      ~ glm(mansion ~ speech + thought + narrative + depicted,
        family = "binomial",
        data = analysis(.)
      )
    ),
    coef_info = map(model, tidy)
  )

boot_coefs <- boot_models %>%
  unnest(coef_info)

int_pctl(boot_models, coef_info)
#> # A tibble: 5 x 6
#>   term         .lower .estimate   .upper .alpha .method
#>   <chr>         <dbl>     <dbl>    <dbl>  <dbl> <chr>
#> 1 (Intercept) -2.42    -1.29    -0.277     0.05 percentile
#> 2 depicted     0.00193  0.0103   0.0196    0.05 percentile
#> 3 narrative   -0.0106   0.00222  0.0143    0.05 percentile
#> 4 speech      -0.0148  -0.00716  0.000617  0.05 percentile
#> 5 thought     -0.0143  -0.00338  0.00645   0.05 percentile
```
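To make the percentile interval method concrete, here is a minimal sketch of what `int_pctl()` computes for a single coefficient: the empirical quantiles of the bootstrap estimates. The `boot_estimates` vector below is a simulated stand-in, not the real bootstrap results.

```r
# Toy illustration of the percentile interval method: the 95% interval is
# just the 2.5% and 97.5% empirical quantiles of the bootstrap estimates,
# with the median serving as the point estimate.
set.seed(123)
boot_estimates <- rnorm(1000, mean = 0.01, sd = 0.005) # stand-in for one coefficient

pctl_interval <- quantile(boot_estimates, probs = c(0.025, 0.5, 0.975))
round(pctl_interval, 4)
```

No distributional assumptions are needed here, which is part of the appeal of the bootstrap: the interval comes straight from the empirical distribution of the re-sampled estimates.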

The bootstrapped distributions are shown below.

```r
boot_coefs %>%
  filter(term != "(Intercept)") %>%
  mutate(term = fct_inorder(term)) %>%
  ggplot(aes(estimate, fill = term)) +
  geom_vline(
    xintercept = 0, color = "gray50",
    alpha = 0.6, lty = 2, size = 1.5
  ) +
  geom_histogram(alpha = 0.8, bins = 25, show.legend = FALSE) +
  facet_wrap(~term, scales = "free") +
  labs(
    title = "Which issues contain the X-Mansion as a location?",
    subtitle = "Comparing the top 25 characters' speech, thought, narrative portrayal, and total depictions",
    caption = "Data from the Claremont Run Project"
  )
```

## The Bayesian way

Bayesian modeling is the practice of updating our prior beliefs using observed data to produce a probability distribution for the values of unknown parameters. Thus, unlike the single point estimates provided by "frequentist" approaches, the results of a Bayesian analysis are the distributions of the estimated parameters. This is why Dr. Silge's bootstrapping analysis reminded me of Bayesian regression modeling.
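As a toy illustration of this updating process (not part of Dr. Silge's analysis), here is a grid approximation of the posterior for a single Bernoulli probability, using made-up visit data. Both the prior and the data below are assumptions for demonstration only.

```r
# Grid approximation of Bayesian updating for p, the probability that an
# issue visits the X-Mansion, using made-up binary data.
p_grid <- seq(0, 1, length.out = 101)
prior <- dbeta(p_grid, 2, 2)              # weakly informative prior on p
visits <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 1) # toy data: 6 visits in 10 issues

likelihood <- dbinom(sum(visits), size = length(visits), prob = p_grid)
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)   # normalize into a probability distribution

# The result is a full distribution over p, not a single point estimate;
# its mode sits near the sample proportion, pulled slightly toward the prior.
p_grid[which.max(posterior)]
```

MCMC methods, as used by 'rstanarm' below, produce draws from this same kind of posterior distribution when a grid is no longer feasible.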

### The libraries

The 'rstanarm' package was used to fit the model, and 'tidybayes', 'bayestestR', and 'see' were used for investigating the model's estimates ('bayestestR' and 'see' are both from the 'easystats' suite of packages).

```r
library(rstanarm)
library(tidybayes)
library(bayestestR)
library(see)
```

### Fitting the model

The `stan_glm()` function is the 'rstanarm' equivalent of `glm()`. The only additional arguments to include are the prior distributions for the predictor coefficients and intercept. Here, I kept it simple by using normal distributions that were not too biased. A thorough analysis would include a section where the impact of different prior distributions is assessed.
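For a quick sense of how informative these priors are (a rough check, not the full sensitivity analysis mentioned above), we can compare the densities of the two normal priors on the log-odds scale:

```r
# Compare the prior on the coefficients (normal(0, 0.5)) with the much
# flatter prior on the intercept (normal(0, 3)), both on the log-odds scale.
x <- seq(-10, 10, length.out = 500)
prior_df <- data.frame(
  x = x,
  coefficient = dnorm(x, mean = 0, sd = 0.5),
  intercept = dnorm(x, mean = 0, sd = 3)
)

# Roughly 95% of prior mass for a coefficient lies within +/- 1 log-odds,
# a mild constraint; the intercept prior is far more diffuse.
matplot(prior_df$x, prior_df[, c("coefficient", "intercept")],
  type = "l", lty = 1, xlab = "log-odds", ylab = "prior density"
)
```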

```r
bayes_mansion <- stan_glm(
  mansion ~ speech + thought + narrative + depicted,
  family = binomial(link = "logit"),
  data = locations_joined,
  prior = normal(location = 0, scale = 0.5),
  prior_intercept = normal(location = 0, scale = 3)
)
#>
#> SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
#> Chain 1:
#> Chain 1: Gradient evaluation took 0.000213 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.13 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1:
#> Chain 1:
#> Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
#> Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
#> Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
#> Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
#> Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
#> Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
#> Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
#> Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
#> Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
#> Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
#> Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
#> Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
#> Chain 1:
#> Chain 1:  Elapsed Time: 0.22106 seconds (Warm-up)
#> Chain 1:                0.172703 seconds (Sampling)
#> Chain 1:                0.393763 seconds (Total)
#> Chain 1:
#>
#> SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
#> Chain 2:
#> Chain 2: Gradient evaluation took 6.3e-05 seconds
#> Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.63 seconds.
#> Chain 2: Adjust your expectations accordingly!
#> Chain 2:
#> Chain 2:
#> Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
#> Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
#> Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
#> Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
#> Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
#> Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
#> Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
#> Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
#> Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
#> Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
#> Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
#> Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
#> Chain 2:
#> Chain 2:  Elapsed Time: 0.188406 seconds (Warm-up)
#> Chain 2:                0.310647 seconds (Sampling)
#> Chain 2:                0.499053 seconds (Total)
#> Chain 2:
#>
#> SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
#> Chain 3:
#> Chain 3: Gradient evaluation took 5.3e-05 seconds
#> Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.53 seconds.
#> Chain 3: Adjust your expectations accordingly!
#> Chain 3:
#> Chain 3:
#> Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
#> Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
#> Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
#> Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
#> Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
#> Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
#> Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
#> Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
#> Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
#> Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
#> Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
#> Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
#> Chain 3:
#> Chain 3:  Elapsed Time: 0.159436 seconds (Warm-up)
#> Chain 3:                0.220296 seconds (Sampling)
#> Chain 3:                0.379732 seconds (Total)
#> Chain 3:
#>
#> SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
#> Chain 4:
#> Chain 4: Gradient evaluation took 1.5e-05 seconds
#> Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.15 seconds.
#> Chain 4: Adjust your expectations accordingly!
#> Chain 4:
#> Chain 4:
#> Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
#> Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
#> Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
#> Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
#> Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
#> Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
#> Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
#> Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
#> Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
#> Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
#> Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
#> Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
#> Chain 4:
#> Chain 4:  Elapsed Time: 0.169659 seconds (Warm-up)
#> Chain 4:                0.129054 seconds (Sampling)
#> Chain 4:                0.298713 seconds (Total)
#> Chain 4:
```

### Model evaluation

Now that the model is fit, the next step is to inspect the posterior distributions of the coefficients.

```r
plot(bayes_mansion, prob = 0.50, prob_outer = 0.89)
```

Each dot represents the mean of the posterior distribution for the coefficient, along with the 50% and 89% density intervals. We can see that the intercept is quite large and negative, indicating that, on average, the X-Men tended not to visit the X-Mansion. By comparison, the distributions for the other coefficients are very small and located close to 0. This suggests that they do not possess much additional information on whether or not the X-Men visited the X-Mansion.

Another useful plot in Bayesian analysis is of the highest density interval (HDI), the smallest range of parameter values that holds a given density of the distribution. With the 89% HDI for a posterior distribution, we can say that, given the structure of the model and the observed data, there is an 89% chance that the real parameter value lies within the range. This is one method for understanding the confidence of the estimated value.
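'bayestestR' computes HDIs for us below, but as a rough sketch of the idea, an HDI can be found by scanning every window that contains the desired share of the sorted draws and keeping the narrowest one. The `draws` vector here is a simulated stand-in for a coefficient's posterior draws, not output from the fitted model.

```r
# Sketch of an HDI computed directly from posterior draws: among all windows
# holding `ci` of the sorted draws, keep the narrowest.
hdi_from_draws <- function(draws, ci = 0.89) {
  sorted <- sort(draws)
  n_in <- ceiling(ci * length(sorted))      # draws each candidate window must hold
  n_starts <- length(sorted) - n_in + 1     # number of candidate windows
  widths <- sorted[n_in:length(sorted)] - sorted[1:n_starts]
  i <- which.min(widths)                    # narrowest window = highest density
  c(lower = sorted[i], upper = sorted[i + n_in - 1])
}

set.seed(123)
draws <- rnorm(4000) # stand-in for one coefficient's posterior draws
hdi_from_draws(draws)
```

For a symmetric posterior the HDI nearly matches the equal-tailed interval; the two differ when the posterior is skewed, which is when the HDI is most useful.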

```r
plot(bayestestR::hdi(bayes_mansion, ci = c(0.5, 0.75, 0.89, 0.95)))
```

From the HDI plot shown above, we can see that the coefficients for the number of speech bubbles (`speech`) and the number of times the characters were depicted (`depicted`) in an issue were the strongest predictors. The 89% HDI for `speech` includes 0, while the 95% HDI for `depicted` includes 0. Therefore, none of these posterior distributions is particularly exciting, as they are all very small (in conjunction with a strong intercept) and have a fair chance of actually being 0.

Two other measurements that are useful in a Bayesian analysis are the *probability of direction* (PD) and the *region of practical equivalence* (ROPE). Without going too in-depth, the PD is the probability that a parameter is positive or negative. It ranges from 0.5 to 1, where a value of 1 indicates the parameter is definitely positive or negative (i.e. non-zero). The ROPE is a similar value but accounts for effect size by measuring how much of the posterior distribution lies within a region that we (the analysts) would say is effectively zero. Thus, if the ROPE is high, it is unlikely that the parameter's value has much importance.
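Both quantities can be approximated directly from posterior draws, which is roughly what 'bayestestR' does (with more care). The `draws` vector below is a simulated stand-in shaped like the `speech` coefficient's posterior, and the ROPE bounds are taken from the `describe_posterior()` output shown later.

```r
# Rough sketch of PD and "% in ROPE" computed from posterior draws.
set.seed(123)
draws <- rnorm(4000, mean = -0.004, sd = 0.004) # stand-in for the speech coefficient

# PD: proportion of draws sharing the dominant sign (ranges from 0.5 to 1).
pd <- max(mean(draws > 0), mean(draws < 0))

# ROPE: proportion of draws inside a region treated as practically zero.
rope <- c(-0.181, 0.181)
pct_in_rope <- mean(draws >= rope[1] & draws <= rope[2])

c(pd = pd, pct_in_rope = pct_in_rope)
```

Note that the two can disagree, as they do for this model: a parameter can be confidently negative (high PD) while still being too small to matter (high % in ROPE).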

The following table provides a summary of the posterior distributions for this model.

```r
bayestestR::describe_posterior(bayes_mansion)
#> Possible multicollinearity between depicted and speech (r = 0.7). This might lead to inappropriate results. See 'Details' in '?rope'.
#> # Description of Posterior Distributions
#>
#> Parameter   | Median |           89% CI |    pd |        89% ROPE | % in ROPE |  Rhat |      ESS
#> ------------------------------------------------------------------------------------------------
#> (Intercept) | -1.267 | [-2.095, -0.467] | 0.997 | [-0.181, 0.181] |         0 | 1.000 | 5612.356
#> speech      | -0.004 | [-0.010,  0.001] | 0.900 | [-0.181, 0.181] |       100 | 1.001 | 2401.035
#> thought     | -0.003 | [-0.011,  0.006] | 0.697 | [-0.181, 0.181] |       100 | 1.001 | 2865.077
#> narrative   |  0.004 | [-0.005,  0.013] | 0.771 | [-0.181, 0.181] |       100 | 1.000 | 3264.987
#> depicted    |  0.007 | [ 0.001,  0.014] | 0.961 | [-0.181, 0.181] |       100 | 1.000 | 2219.555
```

We can see that even though the PD values for `speech` and `depicted` are close to 1.0, indicating they are likely non-zero, the "% in ROPE" is 100%, suggesting the differences are unimportant.

### Posterior predictive checks

The last step of this analysis is to make predictions using the model. First, we can make predictions on the provided data to see how well the model fit the data. Second, we can input new data points that are interesting to us to see how they impact the model's predictions.

The plot below shows the distribution of posterior predictions on the original data. The plotted values are the predicted probability that the X-Mansion was visited in the comic book issue, separated by whether or not the X-Mansion was actually visited. The two distributions look almost identical. This is not surprising: because the coefficients fit to the variables were so small, they do not provide much additional information for the prediction. Therefore, the model is relying primarily upon the intercept to calculate an estimate.

```r
# From the 'rethinking' package.
logistic <- function(x) {
  p <- 1 / (1 + exp(-x))
  p <- ifelse(x == Inf, 1, p)
  return(p)
}

locations_joined %>%
  mutate(mansion_predict = logistic(predict(bayes_mansion))) %>%
  ggplot(aes(x = mansion_predict, color = mansion, fill = mansion)) +
  geom_density(size = 1.2, alpha = 0.2) +
  geom_vline(xintercept = 0.5, size = 1.2, lty = 2, color = "gray50") +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = c(0.65, 0.73)) +
  labs(
    x = "predicted probability of being in the X-Mansion",
    y = "probability density",
    title = "The Bayesian logistic model's posterior predictions",
    color = "was in the\nX-mansion",
    fill = "was in the\nX-mansion"
  )
```

The second common type of posterior prediction is to make data that varies one or two variables and holds the rest constant. Since `depicted` had the largest predicted effect of the non-intercept coefficients, I decided to conduct this posterior predictive check on the values of this variable. Therefore, I created the `pred_data` data frame, which has 100 values across the range of `depicted` and just the average values for the rest of the variables. Making predictions on this artificial data set will show the effect of the `depicted` variable while holding the other variables constant. Using the `add_fitted_draws()` function from 'tidybayes', 200 predictions were made for the artificial data.

```r
pred_data <- locations_joined %>%
  summarise(across(issue:narrative, mean)) %>%
  mutate(depicted = list(modelr::seq_range(locations_joined$depicted, n = 100))) %>%
  unnest(depicted) %>%
  add_fitted_draws(bayes_mansion, n = 200)

pred_data
#> # A tibble: 20,000 x 10
#> # Groups:   issue, speech, thought, narrative, depicted, .row [100]
#>    issue speech thought narrative depicted  .row .chain .iteration .draw .value
#>    <dbl>  <dbl>   <dbl>     <dbl>    <dbl> <int>  <int>      <int> <int>  <dbl>
#>  1  189.   143.    44.2      48.7       16     1     NA         NA     1 0.125
#>  2  189.   143.    44.2      48.7       16     1     NA         NA     6 0.143
#>  3  189.   143.    44.2      48.7       16     1     NA         NA    15 0.117
#>  4  189.   143.    44.2      48.7       16     1     NA         NA    47 0.0890
#>  5  189.   143.    44.2      48.7       16     1     NA         NA    86 0.272
#>  6  189.   143.    44.2      48.7       16     1     NA         NA   118 0.101
#>  7  189.   143.    44.2      48.7       16     1     NA         NA   131 0.158
#>  8  189.   143.    44.2      48.7       16     1     NA         NA   133 0.279
#>  9  189.   143.    44.2      48.7       16     1     NA         NA   140 0.0990
#> 10  189.   143.    44.2      48.7       16     1     NA         NA   154 0.109
#> # … with 19,990 more rows
```

The following plot shows the logistic curves for the artificial data. Each curve represents an individual prediction over the range of `depicted` values. The original data is also plotted on top.

```r
# Just to shift the `mansion` values for plotting purposes.
locations_joined_mod <- locations_joined %>%
  mutate(mansion_num = as.numeric(mansion) + ifelse(mansion, -0.1, 0.1))

pred_data %>%
  ggplot(aes(x = depicted, y = .value)) +
  geom_line(aes(group = .draw), alpha = 0.1) +
  geom_jitter(
    aes(y = mansion_num, color = mansion),
    data = locations_joined_mod,
    height = 0.08, width = 0, size = 2.2, alpha = 0.5
  ) +
  scale_color_brewer(palette = "Dark2") +
  scale_x_continuous(expand = expansion(mult = c(0.02, 0.02))) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 1)) +
  labs(
    x = "depicted",
    y = "probability of being in the X-mansion",
    color = "was in the\nX-Mansion",
    title = "Posterior predictions of the effect of the number\nof depictions of the main characters",
    subtitle = "All other predictors were held constant at their average value."
  )
```

We can see that there is a general tendency for the model to predict that the issue visited the X-Mansion as the value of `depicted` increases, but it is a very gradual sigmoidal curve because the posterior distribution was located so close to zero. We can also see that the curve is shifted closer to 0 for most values of `depicted` because of the strong intercept value.

## Wrapping-up

Overall, I think this was an interesting comparison between a frequentist approach to building a distribution of coefficient values and the Bayesian method of analyzing a model. I am not in a position to provide a theoretical comparison of the two approaches, though I would say that, personally, interpreting the Bayesian posterior distribution is more intuitive than interpreting the bootstrapped distribution.
