Bayesian data analysis follows a very simple and general recipe:
1. Specify a model and likelihood, i.e. what process do you think is generating your data?
2. Specify a prior distribution, i.e. quantify what you know about a problem before having seen the data.
3. Apply Bayes’ rule and out comes the posterior distribution, i.e. what you know about the problem after having seen the data. This is typically what you would report as the result of your analysis.
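To make the recipe concrete, here is a minimal sketch of the three steps for a toy problem: estimating the heads probability of a coin on a grid of candidate values. The data (7 heads in 10 flips), the grid, and the names `likelihood`, `grid`, `prior` and `posterior` are all made up for illustration and have nothing to do with the analysis discussed later.

```python
# Step 1: model and likelihood. Assume independent coin flips with an
# unknown heads probability theta. The data below are invented.
heads, flips = 7, 10

def likelihood(theta):
    """Probability of the observed flips (up to a constant) given theta."""
    return theta**heads * (1 - theta)**(flips - heads)

# Step 2: prior. Here a flat prior over a grid of candidate theta values.
grid = [i / 100 for i in range(101)]
prior = [1 / len(grid)] * len(grid)

# Step 3: Bayes' rule. The posterior is proportional to likelihood times prior.
unnormalized = [likelihood(t) * p for t, p in zip(grid, prior)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

posterior_mean = sum(t * p for t, p in zip(grid, posterior))
print(f"posterior mean of theta: {posterior_mean:.3f}")
```

Swapping in a different `prior` list and rerunning step 3 is all it takes to see how much the result depends on the prior.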
Because of step 2, Bayesian methods are sometimes criticized for being “subjective”: you and I may have different ideas about a particular problem and therefore we may have different prior distributions, leading to different posterior results in step 3. Such criticism is largely unfounded since step 1 almost always has much more influence on the results and since there exist good methods for specifying “objective priors”. Nevertheless it does point to a problem in communicating the results of a Bayesian data analysis: Should I report only the results obtained using my personal (subjective) prior? Should I try to make the analysis as objective as possible by specifying an “uninformative” prior? Or should I report the results obtained under various different prior specifications? This problem often comes up in practice, with one of the most common questions in academic referee reports being: “Are your results robust to changing the prior distribution on variable X?”.
Ideally, the reader of a data analysis report should be able to easily obtain the posterior results under different prior specifications without having to repeat the complete analysis. In his book, Bayesian econometrician John Geweke points out that this is often easy to do: if the reader has access to draws from the posterior distribution resulting from the data analysis, the results under a different prior can be obtained by simply re-weighting these draws, with weights proportional to the ratio of the new prior to the original prior, using the principle of importance sampling. Although this is a simple and powerful idea, I have never seen it used in practice. Presumably that is because until recently such functionality was difficult to offer online, and required running specialized software such as R or Matlab on the reader’s PC. All of this changed, however, with the recent announcement of a new R package called ‘Shiny’ by the folks at RStudio. In their own words: “Shiny makes it super simple for R users like you to turn analyses into interactive web applications that anyone can use.”
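The re-weighting idea can be sketched in a few lines. This is an illustration under stated assumptions, not Geweke's code or the code behind this post: the posterior draws are simulated stand-ins (a normal posterior obtained under a flat prior), the new prior is a standard normal, and the helper names `normal_pdf` and `reweight` are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def reweight(draws, old_prior_pdf, new_prior_pdf):
    """Normalized importance weights: new prior over old prior, per draw."""
    raw = [new_prior_pdf(d) / old_prior_pdf(d) for d in draws]
    total = sum(raw)
    return [w / total for w in raw]

random.seed(1)
# Stand-in for published posterior draws (assumption: flat prior was used,
# and the resulting posterior is normal with mean 1.0 and sd 0.5).
draws = [random.gauss(1.0, 0.5) for _ in range(20000)]

flat_prior = lambda theta: 1.0                         # original prior
new_prior = lambda theta: normal_pdf(theta, 0.0, 1.0)  # reader's prior

weights = reweight(draws, flat_prior, new_prior)
reweighted_mean = sum(w * d for w, d in zip(weights, draws))

# Effective sample size: a rough check on how well the old draws cover
# the new posterior. A very small value means the re-weighting is unreliable.
ess = 1.0 / sum(w * w for w in weights)
print(f"posterior mean under new prior: {reweighted_mean:.3f}, ESS: {ess:.0f}")
```

In this toy setup the exact answer is known (a normal posterior with mean 0.8), so the weighted mean can be checked against it. With real draws no such check exists, which is why the effective sample size is worth reporting alongside the re-weighted results.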
Today I have been playing around with Shiny, and I can say: they’re not lying! I produced the interactive graph below using just a few lines of R code. The graph is from my recent paper analyzing economic growth data for a cross-section of countries. One of the problems in studying economic growth is that we have relatively little data and many potential explanatory variables: in my case, 88 observations of economic growth rates and 67 explanatory variables. Because it is hard to be sure which and how many of these explanatory variables to include in our model, it is customary to perform Bayesian model averaging, which averages over all possible selections of variables. The graph below shows the posterior distribution of the model size in this model averaging, i.e. the distribution of the number of explanatory variables that ended up being included in the regression models. In my paper I show this graph for a flat prior on the model size (represented by the black line in the graph), but really this choice is just as arbitrary as any other. By playing with the sliders at the bottom of the graph you can now easily see how this posterior distribution changes with the prior specification:
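For a discrete quantity like the model size, the re-weighting reduces to multiplying two lists and renormalizing. The sketch below shows one way this could be done; it is not the R code behind the graph. It assumes that only the prior on the model size changes, while the priors on which variables enter (given the size) and on the regression parameters stay the same. The binomial prior family, the function names, and the toy numbers are all my own assumptions for illustration.

```python
import math

def binomial_prior(max_size, expected_size):
    """Hypothetical prior on model size: each variable enters independently
    with probability expected_size / max_size."""
    p = expected_size / max_size
    return [math.comb(max_size, k) * p**k * (1 - p)**(max_size - k)
            for k in range(max_size + 1)]

def reweight_model_size(posterior_flat, new_prior):
    """Posterior over model size under new_prior, given the posterior that
    was computed under a flat prior on model size. The flat prior is a
    constant, so it drops out when normalizing."""
    unnormalized = [post * pr for post, pr in zip(posterior_flat, new_prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Toy posterior over model sizes 0..5 under a flat prior (invented numbers).
posterior_flat = [0.05, 0.15, 0.30, 0.25, 0.15, 0.10]

# A prior that expects about one included variable.
new_prior = binomial_prior(max_size=5, expected_size=1.0)
posterior_new = reweight_model_size(posterior_flat, new_prior)

mean_flat = sum(k * p for k, p in enumerate(posterior_flat))
mean_new = sum(k * p for k, p in enumerate(posterior_new))
print(f"mean model size: flat prior {mean_flat:.2f}, new prior {mean_new:.2f}")
```

A slider would then simply change `expected_size` and redraw `posterior_new`.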
Refresh your browser if the graph is not showing.
The graph above is generated by an R script that is currently hosted free of charge, under beta testing, on the Shiny server platform, and it will stay up at least until they decide to charge me. Overall, I’m very impressed with the capabilities and ease of use offered by this package, and I’m looking forward to seeing how other data scientists will use it to present data analyses in an interactive way.