# An Educational Stroll With Stan – Part 2

[This article was first published on **R on Everyday Is A School Day**, and kindly contributed to R-bloggers.]


I learned a great deal throughout this journey. In this second part, I learned how to implement logistic regression in Stan. I also learned the significance of data type declarations for obtaining accurate estimates, how to use the posterior to predict new data, and what Stan's `generated quantities` block is for. Moreover, having a friend who is well-versed in Bayesian statistics proves invaluable when delving into the Bayesian realm! Very fun indeed!

We looked at linear regression previously; now let's take a look at logistic regression.

## Objectives

- Load Library & Simulate Simple Data
- What Does A Simple Logistic Regression Look Like In Stan?
- Visualize It!
- How To Predict Future Data ?
- Acknowledgement/Fun Fact
- Lessons Learnt

## Load Library & Simulate Simple Data

```r
library(tidyverse)
library(cmdstanr)
library(bayesplot)
library(kableExtra)

set.seed(1)
n <- 1000
w <- rnorm(n)
x <- rbinom(n, 1, plogis(-1 + 2*w))
y <- rbinom(n, 1, plogis(-1 + 2*x + 3*w))
collider <- -0.5*x + -0.6*y + rnorm(n)

df <- list(N = n, x = x, y = y, w = w, collider = collider) # cmdstanr uses a list
df2 <- tibble(x, y, collider, w) # this is for the simple logistic regression check
```

Same DAG as before, but this time both `x` and `y` are binary, generated via the binomial distribution.

#### Look At GLM summary

```r
model <- glm(y ~ x + w, df2, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = y ~ x + w, family = "binomial", data = df2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.93224  -0.42660  -0.06937   0.29296   2.79793  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0352     0.1240  -8.348   <2e-16 ***
## x             2.1876     0.2503   8.739   <2e-16 ***
## w             2.9067     0.2230  13.035   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1375.9  on 999  degrees of freedom
## Residual deviance:  569.2  on 997  degrees of freedom
## AIC: 575.2
## 
## Number of Fisher Scoring iterations: 6
```

Nice! The `x` and `w` coefficients and the intercept are close to our simulated model! The `x` coefficient is 2.1875657 (true: 2), and the `w` coefficient is 2.9066911 (true: 3).

#### What About Collider?

```r
model2 <- lm(collider ~ x + y, df2)
summary(model2)
## 
## Call:
## lm(formula = collider ~ x + y, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5744 -0.6454 -0.0252  0.7058  2.8273 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.007275   0.044300   0.164     0.87    
## x           -0.461719   0.090672  -5.092 4.23e-07 ***
## y           -0.610752   0.086105  -7.093 2.48e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.032 on 997 degrees of freedom
## Multiple R-squared:  0.1753, Adjusted R-squared:  0.1736 
## F-statistic: 105.9 on 2 and 997 DF,  p-value: < 2.2e-16
```

Not too shabby. The `x` coefficient is -0.4617194 (true: -0.5), and the `y` coefficient is -0.6107524 (true: -0.6). Perfect! But how do we do this in Stan?

Let’s break down what exactly we want to estimate.

$$
\begin{gather}
y \sim \text{bernoulli}(p) \\
\text{logit}(p) = a_y + b_{yx} x + b_{yw} w \\
\\
collider \sim \text{normal}(\mu_{collider}, \sigma_{collider}) \\
\mu_{collider} = a_{collider} + b_{collider\_x} x + b_{collider\_y} y
\end{gather}
$$

We're basically interested in \(a_y\), \(b_{yx}\), \(b_{yw}\), \(a_{collider}\), \(b_{collider\_x}\), and \(b_{collider\_y}\), to see if they reflect the parameters set in our simulation.

## Logistic Regression via Stan

```stan
data {
  int N;
  array[N] int x;
  array[N] int y;
  array[N] real w;        // note that this is not int but real
  array[N] real collider; // same here
}
parameters {
  // parameters for y (bernoulli)
  real alpha_y;
  real beta_yx;
  real beta_yw;
  // parameters for collider (normal)
  real alpha_collider;
  real beta_collider_x;
  real beta_collider_y;
  real<lower=0> sigma_collider; // constrain the standard deviation to be positive
}
model {
  // prior
  // default for real will be a uniform (flat) distribution
  // likelihood
  y ~ bernoulli_logit(alpha_y + beta_yx * to_vector(x) + beta_yw * to_vector(w));
  collider ~ normal(alpha_collider + beta_collider_x * to_vector(x)
                    + beta_collider_y * to_vector(y), sigma_collider);
}
```

Save the above as `log_sim.stan`. Note that we didn't have to use `inv_logit`: `bernoulli_logit` nicely applies the inverse-logit transformation for us.

Did you also notice that the data declaration uses `array[N] (data type) (variable)` instead of `(variable)[N]`? This is the new way of declaring arrays in Stan.
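For reference, here is the same declaration in the old syntax (deprecated and removed in recent Stan versions) next to the current one:

```stan
// Old syntax (no longer accepted by recent Stan versions):
// int x[N];

// Current syntax:
array[N] int x;
```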

### Run The Model in R and Analyze

```r
mod <- cmdstan_model("log_sim.stan")

fit <- mod$sample(
  data = df,
  chains = 4,
  iter_sampling = 2000,
  iter_warmup = 1000,
  seed = 123,
  parallel_chains = 4
)

fit$summary()
```

Not too shabby either! The Stan model accurately estimated the `alpha_y`, `beta_yx`, `beta_yw`, `alpha_collider`, `beta_collider_x`, and `beta_collider_y` parameters. `Rhat` is 1, below the 1.05 threshold, and `ess_bulk` & `ess_tail` are > 100. The model diagnostics look good!

## Visualize The Beautiful Convergence

```r
mcmc_trace(fit$draws())
```

## How To Predict Future Data ?

You know how in `R` or `python` you can save the model and then type something like `predict`, and the probability will miraculously appear? Well, to my knowledge, you can't in Stan or cmdstanr. I heard you can with rstanarm, rethinking, brms, etc. But let's roll with it and see how we can do this.

Instructions:

- Extract your coefficient of interest from posterior
- Write another Stan model with generated quantities
- Feed New Data onto the Stan model and extract the expected value.
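As a sanity check on the last step, the expected value can also be computed directly in R, since Stan's `inv_logit` is the same function as R's `plogis`. A minimal sketch, using made-up coefficient values standing in for the posterior means:

```r
# Hypothetical posterior-mean coefficients (close to the simulation's true values)
alpha_y <- -1
beta_yx <- 2
beta_yw <- 3

# A few hypothetical new observations
x_new <- c(0, 1, 1)
w_new <- c(0.5, -0.2, 1.1)

# plogis() is R's inverse-logit, matching Stan's inv_logit()
p_hat <- plogis(alpha_y + beta_yx * x_new + beta_yw * w_new)
p_hat
```

This gives the same expected values the `generated quantities` approach below produces, so it is a useful cross-check.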

Recall this was our formula:
`\begin{gather} y\sim\text{bernoulli}(p) \\ \text{logit}(p) = a_y+b_{yx}.x+b_{yw}.w \end{gather}`

We want to extract `alpha_y`, `beta_yx`, and `beta_yw`.

#### Let’s Estimate `y`

```r
# Extract parameters and take their posterior means
alpha_y <- fit$draws(variables = "alpha_y") |> mean()
beta_yx <- fit$draws(variables = "beta_yx") |> mean()
beta_yw <- fit$draws(variables = "beta_yw") |> mean()

# New data
set.seed(2)
n <- 100
x <- rbinom(n, 1, 0.5) # randomly assign 0 or 1 to x
w <- rnorm(n)          # randomly generate continuous data for w

df_new <- list(N = n, x = x, w = w,
               alpha_y = alpha_y, beta_yx = beta_yx, beta_yw = beta_yw)
```

#### New Stan model

```stan
data {
  int<lower=0> N;
  array[N] int x;
  array[N] real w;
  real alpha_y;
  real beta_yx;
  real beta_yw;
}
// parameters {
//   array[N] real y_pred;
// }
generated quantities {
  array[N] real<lower=0,upper=1> y_pred;
  for (i in 1:N) {
    y_pred[i] = inv_logit(alpha_y + beta_yx * x[i] + beta_yw * w[i]);
  }
}
```

Save the above as `log_sim_pred.stan`.

Note that this time, instead of declaring a model block, we provided the equation in `generated quantities`, which essentially calculates `y_pred` according to our formula.

#### Load Prediction Stan Model

```r
mod <- cmdstan_model("log_sim_pred.stan")

fit2 <- mod$sample(
  data = df_new,
  iter_sampling = 1,
  chains = 1,
  fixed_param = TRUE
)
```

Notice that we provided `df_new` as the data and changed `iter_sampling` to 1; otherwise we'd just get a bunch of identical numbers. Give it a try yourself! The same goes for `chains`: additional chains of the same values provide no additional value. Lastly, we have to specify `fixed_param = TRUE`.

#### Merge Predicted `y` and True `y`

```r
# Create df with y_pred
df_pred <- as.data.frame(fit2$draws(variables = "y_pred")) |>
  pivot_longer(cols = everything(), names_to = "y_pred1", values_to = "y_pred") |>
  select(y_pred)

# Create df with y_actual
df_actual <- tibble(x = x, w = w, y_actual = plogis(-1 + 2*x + 3*w)) |>
  select(y_actual)

# Merge the 2 dfs and check the diff
df_combined <- df_pred |>
  add_column(df_actual) |>
  mutate(diff = y_actual - y_pred)

# Show the first 5 rows
df_combined |>
  head(5) |>
  kable()
```

| y_pred | y_actual | diff |
|---|---|---|
| 0.029370 | 0.0288923 | -0.0004777 |
| 0.999270 | 0.9992532 | -0.0000168 |
| 0.382207 | 0.3347584 | -0.0474486 |
| 0.936812 | 0.9441253 | 0.0073133 |
| 0.129852 | 0.1050137 | -0.0248383 |

Not too shabby! The differences are quite small for the first 5 rows. Let's plot a histogram and see.

```r
df_combined |>
  ggplot(aes(x = diff)) +
  geom_histogram() +
  theme_minimal()
```

#### Visualize Predicted and Actual `y`

```r
df_combined |>
  ggplot(aes(x = y_actual, y = y_pred)) +
  geom_point() +
  geom_smooth(formula = "y ~ x") +
  theme_minimal()
```

Almost a straight line through. Awesomeness! I'm not really sure why there is an up-and-down pattern between 0.25 and 0.75. If you know, please let me know!

## Acknowledgement/Fun Fact:

I truly want to thank Alec Wong for helping me with a problem I encountered. Initially my estimates were off, especially the collider parameters. I spent a few days and was not able to find the cause. When you run into a problem, stop and work on something else, or ask a friend! I did both! Alec wrote a JAGS model and was able to extract accurate estimates. He sent me the script, we fed it into ChatGPT to spit out a Stan model, and then realized that my data declaration had a mistake! Well, 2 mistakes! I had put both `w` and `collider` as `int` instead of `real`. Here are the estimates when both `w` and `collider` were declared as `int`.

Notice how the `collider` parameters are off!?!? The 95% credible intervals don't even contain the true values.

Again, THANK YOU ALEC !!!

## Things To Improve On:

- Will explore using informative priors in the future
- Will explore simulating the sensitivity/specificity of a diagnostic test and then applying a prior to obtain the posterior
- Will explore how Stan behaves with NULL data variables

## Lessons learnt:

- Learnt how to do logistic regression using cmdstanr
- Declaration of data types is important in Stan to get accurate estimates
- Stan has changed `y[N]`-style declarations to `array[N] ... y`
- Learnt from Alec that `rnorm(n, mu, sigma)` is equivalent to `mu + rnorm(n, 0, sigma)`
- The Stan Manual is a good reference to go to
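That last identity about `rnorm()` can be checked directly: with the same seed, the two calls produce identical draws, because R generates `rnorm(n, mu, sigma)` as `mu + sigma * z` from the same stream of standard normals.

```r
set.seed(42)
a <- rnorm(5, mean = 10, sd = 2)

set.seed(42)
b <- 10 + rnorm(5, mean = 0, sd = 2)

all.equal(a, b) # TRUE
```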

If you like this article:

- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on twitter, GitHub or Mastodon
- if you would like to collaborate, please feel free to contact me
