Bayesian linear regression


Introduction

For statistical inference we have two general approaches or frameworks:

  • Frequentist approach, in which the data sampled from the population are treated as random while the population parameter values, stated as a null hypothesis, are fixed (but unknown). To estimate them we look for the sample parameter values that maximize the likelihood of the data. However, the data at hand, even though sampled randomly from the population, are now fixed, so how can we treat them as random? The answer is that we assume the population distribution is known and work out the likelihood of the data under that distribution, or we imagine repeating the study many times with different samples and averaging the results. If we get a very small value for the likelihood of the data, known as the p-value, we tend to reject the null hypothesis. The main problem, however, is the misunderstanding and misuse of this p-value when we decide to reject the null hypothesis based on some threshold, wrongly interpreting it as the probability of rejecting the null hypothesis. For more detail about the p-value click here.

  • Bayesian approach, in contrast, provides true probabilities to quantify the uncertainty about a certain hypothesis, but it requires a first belief about how likely this hypothesis is, known as the prior, in order to derive the probability of the hypothesis after seeing the data, known as the posterior probability. This approach is called Bayesian because it is based on Bayes’ theorem: if we have a population parameter to estimate \(\theta\), and some data \(D\) sampled randomly from this population, the posterior probability is \[\overbrace{p(\theta \mid D)}^{Posterior}=\frac{\overbrace{p(D \mid \theta)}^{Likelihood}\cdot\overbrace{p(\theta)}^{Prior}}{\underbrace{p(D)}_{Evidence}}\] The evidence is the probability of the data at hand regardless of the parameter \(\theta\). The small sketch below illustrates the mechanics on a toy example.
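As a minimal, purely illustrative sketch (the data, prior, and grid below are hypothetical and unrelated to the Boston example), Bayes’ theorem can be applied numerically on a grid of candidate values for \(\theta\):

# Toy grid approximation: posterior is proportional to likelihood x prior
set.seed(123)
D <- rnorm(20, mean = 5, sd = 2)                 # hypothetical sample
theta <- seq(0, 10, length.out = 1000)           # grid of candidate values for theta
prior <- dnorm(theta, mean = 0, sd = 10)         # a weak normal prior belief
likelihood <- sapply(theta, function(t) prod(dnorm(D, mean = t, sd = 2)))
evidence <- sum(likelihood * prior)              # discrete stand-in for p(D) on the grid
posterior <- likelihood * prior / evidence       # p(theta | D) over the grid
theta[which.max(posterior)]                      # posterior mode, close to mean(D)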

Data preparation

For simplicity we use the BostonHousing data from the mlbench package. For more detail about this data, run the command ?BostonHousing after loading the package. But first, let’s load all the packages that we need throughout this article.

suppressPackageStartupMessages(library(mlbench))
suppressPackageStartupMessages(library(rstanarm))
suppressPackageStartupMessages(library(bayestestR))
suppressPackageStartupMessages(library(bayesplot))
suppressPackageStartupMessages(library(insight))
suppressPackageStartupMessages(library(broom))
data("BostonHousing")
str(BostonHousing)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : num  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ b      : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

To better understand how Bayesian regression works, we keep only three features: two numeric variables, age and dis, and one categorical variable, chas, with the target variable medv, the median value of owner-occupied homes.

bost <- BostonHousing[,c("medv","age","dis","chas")]
summary(bost)
##       medv            age              dis         chas   
##  Min.   : 5.00   Min.   :  2.90   Min.   : 1.130   0:471  
##  1st Qu.:17.02   1st Qu.: 45.02   1st Qu.: 2.100   1: 35  
##  Median :21.20   Median : 77.50   Median : 3.207          
##  Mean   :22.53   Mean   : 68.57   Mean   : 3.795          
##  3rd Qu.:25.00   3rd Qu.: 94.08   3rd Qu.: 5.188          
##  Max.   :50.00   Max.   :100.00   Max.   :12.127

The summary shows no particular issues, such as missing values.

Classical linear regression model

To highlight the difference between Bayesian regression and traditional linear regression (the frequentist approach), let’s first fit the latter to our data.

model_freq<-lm(medv~., data=bost)
tidy(model_freq)
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   32.7      2.25      14.6   2.33e-40
## 2 age           -0.143    0.0198    -7.21  2.09e-12
## 3 dis           -0.246    0.265     -0.928 3.54e- 1
## 4 chas1          7.51     1.46       5.13  4.16e- 7

Judging by the p.value of each regressor, all the regressors are significant except for the dis variable. Since the variable chas is categorical with two levels, the coefficient of chas1 is the difference between the median price of houses bounding the Charles River and that of the others, so the median price of the former is higher by about 7.513.

Bayesian regression

To fit a Bayesian regression we use the function stan_glm from the rstanarm package. Like the lm function above, it requires the formula and the data; we leave all the following arguments at their default values (an illustrative call that sets them explicitly is sketched after the list):

  • family : by default this function uses the gaussian distribution, as we do with the classical glm function to fit an lm model.
  • prior : the prior distribution for the regression coefficients; by default a normal prior is used. rstanarm provides a set of functions for priors, such as the student t family, the laplace family, etc. To get the full list with all the details, run ?priors. If we want a flat uniform prior we set this to NULL.
  • prior_intercept : the prior for the intercept; it can be normal, student_t, or cauchy. If we want a flat uniform prior we set this to NULL.
  • prior_aux : the prior for auxiliary parameters such as the error standard deviation for the gaussian family.
  • algorithm : the estimating approach to use. The default is "sampling", i.e. MCMC1.
  • QR : FALSE by default; if TRUE, a QR decomposition is applied to the design matrix, which helps when we have a large number of predictors.
  • iter : the number of iterations if the MCMC method is used; the default is 2000.
  • chains : the number of Markov chains; the default is 4.
  • warmup : also known as burn-in, the number of iterations used for adaptation, which should not be used for inference. By default it is half of the iterations.
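For illustration only, here is a hedged sketch of what a call setting these arguments explicitly might look like; the prior scales below are arbitrary examples, not recommendations, and are not the ones used in the rest of this post:

model_explicit <- stan_glm(
  medv ~ ., data = bost,
  family = gaussian(),              # default likelihood
  prior = normal(0, 2.5),           # normal prior on the coefficients (illustrative scale)
  prior_intercept = normal(0, 10),  # normal prior on the intercept (illustrative scale)
  prior_aux = exponential(1),       # prior on sigma, the error standard deviation
  algorithm = "sampling",           # MCMC
  chains = 4, iter = 2000,          # 4 chains, 2000 iterations each (half used as warmup)
  seed = 111
)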
Leaving all these arguments at their defaults, we fit the model:

model_bayes <- stan_glm(medv~., data=bost, seed=111)

If we print the model we get the following:

print(model_bayes, digits = 3)
## stan_glm
##  family:       gaussian [identity]
##  formula:      medv ~ .
##  observations: 506
##  predictors:   4
## ------
##             Median MAD_SD
## (Intercept) 32.656  2.221
## age         -0.143  0.020
## dis         -0.244  0.268
## chas1        7.496  1.428
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 8.319  0.261 
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

The Median estimate is the median computed from the MCMC simulation, and MAD_SD is the median absolute deviation computed from the same simulation (scaled so that it is comparable to a standard deviation).
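As a quick, hedged check of where these summaries come from, we can extract the posterior draws ourselves and summarise them (assuming as.matrix on the fitted stanreg object, which returns one column per parameter, including sigma):

draws <- as.matrix(model_bayes)   # one row per posterior draw, one column per parameter
apply(draws, 2, median)           # should roughly reproduce the Median column
apply(draws, 2, mad)              # should roughly reproduce the MAD_SD column (scaled MAD)

To better understand these outputs, let’s plot the MCMC simulation of each predictor using bayesplot.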

mcmc_dens(model_bayes, pars = c("age"))+
  vline_at(-0.143, col="red")

As you see, the point estimate of age falls on the median of this distribution (red line). The same thing is true for the dis and chas predictors.

mcmc_dens(model_bayes, pars=c("chas1"))+
  vline_at(7.496, col="red")

mcmc_dens(model_bayes, pars=c("dis"))+
  vline_at(-0.244, col="red")

Now, how can we evaluate the model parameters? By analyzing the posteriors using some specific statistics. To get the full set of statistics provided by the bayestestR package, we use the function describe_posterior.

describe_posterior(model_bayes)
## Possible multicollinearity between dis and age (r = 0.75). This might lead to inappropriate results. See 'Details' in '?rope'.
## # Description of Posterior Distributions
## 
## Parameter   | Median | CI | CI_low | CI_high |    pd | ROPE_CI | ROPE_low | ROPE_high | ROPE_Percentage |  Rhat |  ESS
## ----------------------------------------------------------------------------------------------------------------------
## (Intercept) | 32.656 | 89 | 29.244 |  36.332 | 1.000 |      89 |   -0.920 |     0.920 |               0 | 1.000 | 2315
## age         | -0.143 | 89 | -0.175 |  -0.112 | 1.000 |      89 |   -0.920 |     0.920 |               1 | 1.000 | 2417
## dis         | -0.244 | 89 | -0.662 |   0.168 | 0.824 |      89 |   -0.920 |     0.920 |               1 | 1.000 | 2452
## chas1       |  7.496 | 89 |  5.348 |   9.944 | 1.000 |      89 |   -0.920 |     0.920 |               0 | 1.000 | 3752

Before analyzing the table, we should first understand the various statistics it reports, which are commonly used in Bayesian regression:

  • CI : Credible Interval, used to quantify the uncertainty about the regression coefficients. There are two methods to compute it: the Highest Density Interval (HDI), which is the default, and the Equal-Tailed Interval (ETI). With an 89% credible interval, there is an 89% probability (given the data) that the coefficient lies above the CI_low value and below the CI_high value. This straightforward probabilistic interpretation is completely different from the confidence interval used in classical linear regression, where the interval would contain the true coefficient in 95 out of 100 repetitions of the study (for a 95% confidence level).
  • pd : Probability of Direction, the probability that the effect goes in the direction of its median (positive or negative); it is considered the best equivalent of the p-value.
  • ROPE_CI : Region of Practical Equivalence. Since the Bayesian method deals with true probabilities, it does not make sense to compute the probability that the effect equals exactly zero (the null hypothesis), because the probability of a single point in a continuous distribution is zero. Instead we define a small range around zero that can be considered practically equivalent to no effect; this range is called the ROPE. By default (following Cohen, 1988) the ROPE is [-0.1, 0.1] on the standardized scale, i.e. ±0.1 times the standard deviation of the response (a quick check of this is shown after the list).
  • Rhat : the potential scale reduction factor \(\hat R\), computed for each scalar quantity of interest as the standard deviation of that quantity across all chains pooled together, divided by the root mean square of the separate within-chain standard deviations. When this value is close to 1 we do not have any convergence problem with MCMC.
  • ESS : effective sample size, which captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm; the higher the ESS the better. The threshold commonly used in practice is 400.
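As a hedged sanity check of the ROPE bounds reported in the table (±0.920), the default range should correspond to ±0.1 times the standard deviation of the outcome medv:

# Default ROPE: +/- 0.1 standard deviations of the response
c(-0.1, 0.1) * sd(bost$medv)
# should be approximately -0.92 and 0.92, matching ROPE_low / ROPE_high above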

Alternatively, we can get the coefficient estimates (which are the medians by default) separately by using the insight package.

post <- get_parameters(model_bayes)
print(purrr::map_dbl(post,median),digits = 3)
## (Intercept)         age         dis       chas1 
##      32.656      -0.143      -0.244       7.496

We can also compute the maximum a posteriori (MAP) estimate and the mean as follows:

print(purrr::map_dbl(post, map_estimate),digits = 3)
## (Intercept)         age         dis       chas1 
##      32.273      -0.143      -0.255       7.192
print(purrr::map_dbl(post, mean),digits = 3)
## (Intercept)         age         dis       chas1 
##      32.734      -0.143      -0.244       7.506

As we can see, the values are close to each other because the posterior distributions are approximately normal, so all the central statistics (mean, median, mode) nearly coincide. The following plot visualizes the age coefficient with these different statistics:

mcmc_dens(model_bayes, pars=c("age"))+
  vline_at(median(post$age), col="red")+
  vline_at(mean(post$age), col="yellow")+
  vline_at(map_estimate(post$age), col="green")

As expected they are approximately on top of each other.

Bayesian inferences

As with classical (frequentist) regression, we can test the significance of the Bayesian regression coefficients by checking whether the corresponding credible interval contains zero; if it does not, the coefficient is significant. Let’s go back to our model and check the significance of each coefficient, using credible intervals based on the default HDI.

hdi(model_bayes)
## # Highest Density Interval
## 
## Parameter   |        89% HDI
## ----------------------------
## (Intercept) | [29.24, 36.33]
## age         | [-0.17, -0.11]
## dis         | [-0.66,  0.17]
## chas1       | [ 5.35,  9.94]

And based on the ETI:

eti(model_bayes)
## # Equal-Tailed Interval
## 
## Parameter   |        89% ETI
## ----------------------------
## (Intercept) | [29.21, 36.32]
## age         | [-0.17, -0.11]
## dis         | [-0.66,  0.17]
## chas1       | [ 5.17,  9.79]

Using both methods, the only non-significant coefficient is the dis variable, which is in line with the classical regression.

Note: this similarity between the frequentist and Bayesian results may be due to the normality assumption of the former being well satisfied here, and to the normal prior used in the latter. In the real world, however, we are less often sure about the normality assumption, which may lead to contradictory conclusions between the two approaches.

Another way to test significance is by checking what part of the credible interval falls inside the ROPE interval. We can get this by calling the rope function from the bayestestR package.

rope(post$age)
## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 0.00 %

For age, none of the posterior samples (and hence none of the credible interval, HDI) fall inside the ROPE range, which means this coefficient is highly significant.

rope(post$chas1)
## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 0.00 %
rope(post$`(Intercept)`)
## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 0.00 %

The same is true for chas1 and the intercept.

rope(post$dis)
## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 23.28 %

In contrast, almost a quarter of the posterior samples of the dis variable fall inside the ROPE interval. In other words, the probability that this coefficient is practically equivalent to zero is 23.28%. The ROPE bounds themselves can be retrieved with rope_range:

rope_range(model_bayes)
## [1] -0.9197104  0.9197104

PD and P-value

Sometimes we are only interested in checking the direction of the coefficient (positive or negative); this is the role of the pd statistic in the table above. A high value means that the associated effect is concentrated on the same side as the median. For our model, since the pd values equal 1, virtually all the posterior draws of age, chas1, and the intercept lie on the same side as their medians (if the median is negative, essentially all other draws are negative too). However, it should be noted that this statistic does not assess the significance of the effect. More importantly, there is a strong relation between this probability and the frequentist p-value, approximately \(p\text{-value} \approx 1 - pd\) for a one-sided test (and \(\approx 2(1 - pd)\) for a two-sided test). Let’s check this with our variables.

df1 <-dplyr::select(tidy(model_freq), c(term,p.value))
df1$p.value <- round(df1$p.value, digits = 3)
df2 <- 1- purrr::map_dbl(post, p_direction)
df <- cbind(df1,df2)
df
##                    term p.value   df2
## (Intercept) (Intercept)   0.000 0.000
## age                 age   0.000 0.000
## dis                 dis   0.354 0.176
## chas1             chas1   0.000 0.000

Conclusion

Within the last decade, more practitioners, especially in fields such as medicine and psychology, have been turning towards Bayesian analysis, since almost everything can be interpreted straightforwardly in a probabilistic manner. However, Bayesian analysis also has some drawbacks, like the subjective way of defining the priors (which play an important role in computing the posterior), and the fact that, for problems without a conjugate prior, the MCMC algorithm does not always converge easily to the right values (especially with complex data).


  1. Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, 2012, p. 589.
