Bayesian linear regression
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Introduction
For statistical inferences we have tow general approaches or frameworks:

Frequentist approach in which the data sampled from the population is considered as random and the population parameter values, known as null hypothesis, as fixed (but unknown). To estimate thus this null hypothesis we look for the sample parameters that maximize the likelihood of the data. However, the data at hand, even it is sampled randomly from the population, it is fixed now, so how can we consider this data as random. The answer is that we assume that the population distribution is known and we work out the maximum likelihood of the data using this distribution. Or we repeat the study many times with different samples then we average the results. So if we get very small value for the likelihood of the data which is known as pvalue we tend to reject the null hypothesis.
The main problem, however, is the misunderstanding and misusing of this pvalue when we decide to reject the null hypothesis based on some threshold, from which we wrongly interpreting it as the probability of rejecting the null hypothesis. For more detail about pvalue click here. 
Bayesian approach, in contrast, provides true probabilities to quantify the uncertainty about a certain hypothesis, but requires the use of a first belief about how likely this hypothesis is true, known as prior, to be able to derive the probability of this hypothesis after seeing the data known as posterior probability. This approach called bayesian because it is based on the bayes’ theorem, for instance if a have population parameter to estimate \(\theta\) , and we have some data sampled randomly from this population \(D\), the posterior probability thus will be \[\overbrace{p(\theta/D)}^{Posterior}=\frac{\overbrace{p(D/\theta)}^{Likelihood}.\overbrace{p(\theta)}^{Prior}}{\underbrace{p(D)}_{Evidence}}\]
The Evidence is the probability of the data at hand regardless the parameter \(\theta\).
Data preparation
For simplicity we use the BostonHousing data from mlbench package, For more detail about this data run this command ?BostonHousing
after calling the package. But first Let’s call all the packages that we need throughout this article.
suppressPackageStartupMessages(library(mlbench)) suppressPackageStartupMessages(library(rstanarm)) suppressPackageStartupMessages(library(bayestestR)) suppressPackageStartupMessages(library(bayesplot)) suppressPackageStartupMessages(library(insight)) suppressPackageStartupMessages(library(broom)) data("BostonHousing") str(BostonHousing) ## 'data.frame': 506 obs. of 14 variables: ## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ... ## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ... ## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ... ## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... ## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ... ## $ rm : num 6.58 6.42 7.18 7 7.15 ... ## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ... ## $ dis : num 4.09 4.97 4.97 6.06 6.06 ... ## $ rad : num 1 2 2 3 3 3 5 5 5 5 ... ## $ tax : num 296 242 242 222 222 222 311 311 311 311 ... ## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ... ## $ b : num 397 397 393 395 397 ... ## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ... ## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
To well understand how the bayesian regression works we keep only three features, two numeric variables age, dis and one categorical chas, with the target variable medv the median value of owneroccupied homes.
bost < BostonHousing[,c("medv","age","dis","chas")] summary(bost) ## medv age dis chas ## Min. : 5.00 Min. : 2.90 Min. : 1.130 0:471 ## 1st Qu.:17.02 1st Qu.: 45.02 1st Qu.: 2.100 1: 35 ## Median :21.20 Median : 77.50 Median : 3.207 ## Mean :22.53 Mean : 68.57 Mean : 3.795 ## 3rd Qu.:25.00 3rd Qu.: 94.08 3rd Qu.: 5.188 ## Max. :50.00 Max. :100.00 Max. :12.127
From the summary we do not have any special issues like missing values for example.
Classical linear regression model
To highlight the difference between the bayesian regression and the traditional linear regression (frequentist approach), Let’s first fit the latter to our data.
model_freq<lm(medv~., data=bost) tidy(model_freq) ## # A tibble: 4 x 5 ## term estimate std.error statistic p.value #### 1 (Intercept) 32.7 2.25 14.6 2.33e40 ## 2 age 0.143 0.0198 7.21 2.09e12 ## 3 dis 0.246 0.265 0.928 3.54e 1 ## 4 chas1 7.51 1.46 5.13 4.16e 7
Using the p.value of each regressor, all the regressors ar significant. except for the dis variable. Since the variable chas is categorical with twolevels The coefficient of chas1 is the different between the madian price of houses on the bounds charles River and that of the others, so the median price of the former are higher about 7.513.
Bayesian regression
To fit a bayesian regresion we use the function stan_glm
from the rstanarm package. This function as the above lm function requires providing the formula and the data that will be used, and leave all the following arguments with their default values:
 family : by default this function uses the gaussian distribution as we do with the classical
glm
function to performlm
model.  prior : The prior distribution for the regression coefficients, By default the normal prior is used. There are subset of functions used for the prior provided by rstanarm like , student t family, laplace family…ect. To get the full list with all the details run this command
?priors
. If we want a flat uniform prior we set this to NULL.  prior_intercept: prior for the intercept, can be normal, student_t , or cauchy. If we want a flat uniform prior we set this to NULL.
 prior_aux: prior fo auxiliary parameters such as the error standard deviation for the gaussion family.
 algorithm: The estimating approach to use. The default is "sampling MCMC^{1}.
 QR: FALSE by default, if true QR decomposition applied on the design matrix if we have large number of predictors.
 iter : is the number of iterations if the MCMC method is used, the default is 2000.
 chains : the number of Markov chains, the default is 4.
 warmup : also known as burnin, the number of iterations used for adaptation, and should not be used for inference. By default it is half of the iterations.
model_bayes< stan_glm(medv~., data=bost, seed=111)
if we print the model we get the following
print(model_bayes, digits = 3) ## stan_glm ## family: gaussian [identity] ## formula: medv ~ . ## observations: 506 ## predictors: 4 ##  ## Median MAD_SD ## (Intercept) 32.656 2.221 ## age 0.143 0.020 ## dis 0.244 0.268 ## chas1 7.496 1.428 ## ## Auxiliary parameter(s): ## Median MAD_SD ## sigma 8.319 0.261 ## ##  ## * For help interpreting the printed output see ?print.stanreg ## * For info on the priors used see ?prior_summary.stanreg
The Median estimate is the median computed from the MCMC simulation, and MAD_SD is the median absolute deviation computed from the same simulation. To well understand how getting these outputs let’s plot the MCMC simulation of each predictor using bayesplot
mcmc_dens(model_bayes, pars = c("age"))+ vline_at(0.143, col="red")
As you see the point estimate of age falls on the median of this distribution (red line). The same thing is true for dis and shas predictors.
mcmc_dens(model_bayes, pars=c("chas1"))+ vline_at(7.496, col="red")
mcmc_dens(model_bayes, pars=c("dis"))+ vline_at(0.244, col="red")
Now how can we evaluate the model parameters? The answer is by analyzing the posteriors using some specific statistics. To get the full statistics provided by bayestestR package, we make use of the function describe_posterior
.
describe_posterior(model_bayes) ## Possible multicollinearity between dis and age (r = 0.75). This might lead to inappropriate results. See 'Details' in '?rope'. ## # Description of Posterior Distributions ## ## Parameter  Median  CI  CI_low  CI_high  pd  ROPE_CI  ROPE_low  ROPE_high  ROPE_Percentage  Rhat  ESS ##  ## (Intercept)  32.656  89  29.244  36.332  1.000  89  0.920  0.920  0  1.000  2315 ## age  0.143  89  0.175  0.112  1.000  89  0.920  0.920  1  1.000  2417 ## dis  0.244  89  0.662  0.168  0.824  89  0.920  0.920  1  1.000  2452 ## chas1  7.496  89  5.348  9.944  1.000  89  0.920  0.920  0  1.000  3752
Before starting analyzing the table we shoud first understanding the above various statistics commonly used in bayes regression.
 CI : Credible Interval, it is used to quantify the uncertainty about the regression coefficients. Ther are tow methods to compute CI, the highest density interval
HDI
which is the default, and the Equaltailed IntervalETI
. with 89% probability (given the data) that a coefficient lies above the CI_low value and under CI_high value. This strightforward probabilistic interpretation is completely diffrent from the confidence interval used in classical linear regression where the coefficient fall inside this confidence interval (if we choose 95% of confidence) 95 times if we repeat the study 100 times.  pd : Probability of Direction , which is the probability that the effect goes to the positive or to the negative direction, and it is considered as the best equivalent for the pvalue.
 ROPE_CI: Region of Practical Equivalence, since bayes method deals with true probabilities , it does not make sense to compute the probability of getting the effect equals zero (the null hypothesis) as a point (probability of a point in continuous intervals equal zero ). Thus, we define instead a small range around zero which can be considered practically the same as no effect (zero), this range therefore is called ROPE. By default (according to Cohen, 1988) The Rope is [0.1,0.1] from the standardized coefficients.
 Rhat: scale reduction factor \(\hat R\), it is computed for each scalar quantity of interest, as the standard deviation of that quantity from all the chains included together, divided by the root mean square of the separate withinchain standard deviations. When this value is close to 1 we do not have any convergence problem with MCMC.
 ESS : effective sample size, it captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm, the higher the ESS the better. The threshold used in practice is 400.
Aternatively, we can get the coefficeient estimates (which are the medians by default) separatly by using the package insight
post < get_parameters(model_bayes) print(purrr::map_dbl(post,median),digits = 3) ## (Intercept) age dis chas1 ## 32.656 0.143 0.244 7.496
We can also compute the Maximum A posteriori (MAP), and the mean as follows
print(purrr::map_dbl(post, map_estimate),digits = 3) ## (Intercept) age dis chas1 ## 32.273 0.143 0.255 7.192 print(purrr::map_dbl(post, mean),digits = 3) ## (Intercept) age dis chas1 ## 32.734 0.143 0.244 7.506
As we see the values are closer to each other due to the like normality of the distribution of the posteriors where all the central statistics (mean, median, mode) are closer to each other.
Using the following plot to visualize the age coefficient using different statistics as follows
mcmc_dens(model_bayes, pars=c("age"))+ vline_at(median(post$age), col="red")+ vline_at(mean(post$age), col="yellow")+ vline_at(map_estimate(post$age), col="green")
As expected they are approximately on top of each other.
Bayesian inferences
As we do with classical regression (frequentist), we can test the significance of the bayesian regression coefficients by checking whether the corresponding credible interval contains zero or not, if no then this coefficient is significant. Let’s go back to our model and check the significance of each coefficient (using credible based on the default hdi
).
hdi(model_bayes) ## # Highest Density Interval ## ## Parameter  89% HDI ##  ## (Intercept)  [29.24, 36.33] ## age  [0.17, 0.11] ## dis  [0.66, 0.17] ## chas1  [ 5.35, 9.94]
And based on the eti
eti(model_bayes) ## # EqualTailed Interval ## ## Parameter  89% ETI ##  ## (Intercept)  [29.21, 36.32] ## age  [0.17, 0.11] ## dis  [0.66, 0.17] ## chas1  [ 5.17, 9.79]
Using both methods, the only non significant coefficient is dis variable, which is inline with the classical regression.
Note: this similar result between frequentist and bayesian regression may due to the normality assumption for the former that is well satisfied which gives satisfied results and due to the normal prior used in the latter. However, in real world it is less often to be sure about the normality assumption which may give contradict conclusions between the two approaches.
Another way to test the significance by checking the part of the credible interval that falls inside the ROPE interval. we can get this by calling the rope
from bayestestR package
rope(post$age) ## # Proportion of samples inside the ROPE [0.10, 0.10]: ## ## inside ROPE ##  ## 0.00 %
For age almost all the credible interval (HDI) is outside the ROPE range, which means that coefficient is highly significant.
rope(post$chas1) ## # Proportion of samples inside the ROPE [0.10, 0.10]: ## ## inside ROPE ##  ## 0.00 % rope(post$`(Intercept)`) ## # Proportion of samples inside the ROPE [0.10, 0.10]: ## ## inside ROPE ##  ## 0.00 %
The same thing is true for the chas and intercept variable.
rope(post$dis) ## # Proportion of samples inside the ROPE [0.10, 0.10]: ## ## inside ROPE ##  ## 23.28 %
In contrast, almost the quarter of the credible interval of dis variable is inside the ROPE interval. In other words, the probability of this coefficient to be zero is 23.28%.
rope_range(model_bayes) ## [1] 0.9197104 0.9197104
PD and Pvalue
Sometimes we are only interested to check the direction of the coefficient (positive or negative). this is the role of pd statistic in the above table, high value means that the associated effect is concentrated on the same side as the median. For our model, since pd’s equal to 1, almost all the posteriors of the two variables age and chas1 and the intercept are on the same side (if median negative all other values are negatives). However, it should be noted that this statistic does not assess the significance of the effect.
Something more important to mention is that it exists a strong relation between this probability and the pvalue approximated as follows: \(pvalue=1pd\). let’s check this with our variables.
df1 <dplyr::select(tidy(model_freq), c(term,p.value)) df1$p.value < round(df1$p.value, digits = 3) df2 < 1 purrr::map_dbl(post, p_direction) df < cbind(df1,df2) df ## term p.value df2 ## (Intercept) (Intercept) 0.000 0.000 ## age age 0.000 0.000 ## dis dis 0.354 0.176 ## chas1 chas1 0.000 0.000
## Conclusion
within the last decade more practitioner , specially in some fields such as medicine and psychology, are turning towards bayesian analysis since almost every thing can be interpreted straightforwardly with a probabilistic manner. However, the bayesian analysis has also some drawback , like the subjective way to define the priors (which play an important role to compute the posterior), or for problems that do not have conjugate prior, not always the mcmc alghoritm converges easily to the right values (specially with complex data).

Kevin P.murphy, Machine Learning: A Probabilistic Perspective, 2012, page 589↩︎
Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job.
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.