Bayes for President!

[This article was first published on Gianluca Baio's blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I couldn’t resist getting sucked into the hype associated with the US election and debates, and so I thought I had a little fun of my own and played around a bit with the numbers. [OK: you may disagree with the definition of “fun” $-$ but then again, if you’re reading this you probably don’t…]

So, I looked on the internet to find reasonable data on the polls. Of course there are a lot of limitations to this strategy. First, I’ve not bothered doing some sort of proper evidence synthesis, taking into account the different polls and pooling them in a suitable way. There are two reasons why I didn’t: the first one is that not all the data are publicly available (as far as I can tell), so you have to make do with what you can find; second, I did find some info here, which seems to have accounted for this issue anyway. In particular, this website contains some sort of pooled estimates for the proportion of people who are likely to vote for either candidate, by state, together with a “confidence” measure (more on this later). Because not all the states have data, I have also looked here and found some additional info. 

Leaving aside representativeness issues (which I’m assuming are not a problem, but may well be, if this were a real analysis!), the second limitation is of course that voting intentions may not directly translate into actual votes. I suppose there are some studies out there to quantify this, but again, I’m making life (too) easy and discount this effect.

The data on the polls that I have collected in a single spreadsheet look like this
ID State Dem Rep   State_Name Voters Confidence
1    AK  38  62       Alaska      3    99.9999
2    AL  36  54      Alabama      9    99.9999
3    AR  35  56     Arkansas      6    99.9999
4    AZ  44  52      Arizona     11    99.9999
5    CA  53  38   California     55    99.9999
…      …       …      …       …

The columns Dem and Rep represent the pooled estimation of the proportion of voters for the two main parties (of course, they may not sum to 100%, due to other possible minor candidates or undecided voters). The column labelled Voters gives the number of Electoral Votes (EVs) in each state (eg if you win at least 50% of the votes in Alaska, this is associated with 3 votes overall, etc). Finally, the column Confidence indicates the level of (lack of) uncertainty associated with the estimation. States with high confidence are “nearly certain” $-$ for example, the 62% estimated for Republicans in Alaska is associated with a very, very low uncertainty (according to the polls and expert opinions). In most states, the polls are (assumed to be) quite informative, but there are some where the situation is not so clear cut.

I’ve saved the data in a file, which can be imported in R using the command
polls <- read.csv(“”)

At this point, I need to compute the 2-party share for each state (which I’m calling $m$) and fix the number of states at 51
m <- Dem/(Dem+Rep)
Nstates <- 51

Now, in truth this is not a “proper” Bayesian model, since I’m only assuming informative priors (which are supposed to reflect all the available knowledge on the proportion of voters, without any additional observed data). Thus, all I’m doing is a relatively easy analysis. The idea is to first define a suitable informative prior distribution based on the point estimation of the democratic share and with uncertainty defined in terms of the confidence level. Then I can use Monte Carlo simulations to produce a large number of “possible futures”; in each future and for each state, the Democrats will have an estimated share of the popular vote. If that is greater than 50%, Obama will have won that state and the associated EVs. I can then use the induced predictive distribution on the number of EVs to assess the uncertainty underlying an Obama win (given that at least 272 votes are necessary to become president).

In their book, Christensen et al show a simple way of deriving a Beta distribution based on an estimation of the mode, an upper limit and a confidence level that the variable is below that upper threshold. I’ve coded this in a function betaPar2, which I’ve made available from here (so you need to download it, if you want to replicate this exercise).

Using this bit of code, one can estimate the parameters of a Beta distribution centered on the point estimate and for which the probability of exceeding the threshold 0.5 is given by the level of confidence.
a <- b <- numeric()
for (s in 1:Nstates) {
if (m[s] < .5) {
bp <- betaPar2(m[s],.499999,Confidence[s]/100)
a[s] <- bp$res1
b[i] <- bp$res2
if (m[s] >=.5) {
bp <- betaPar2(1-m[s],.499999,Confidence[s]/100)
a[s] <- bp$res2
b[s] <- bp$res1

The function
betaPar2 has several outputs, but the main ones are res1 and res2, which store the values of the parameters $\alpha$ and $\beta$, which define the suitable Beta distribution. In fact, the way I’m modelling is to say that if the point estimate is below 0.5 (a state $s$ where Romney is more likely to win), then I want to derive a suitable pair $(\alpha_s,\beta_s)$ so that the resulting Beta distribution is centered on $m_s$ and for which the probability of not exceeding 0.5 is given by $c_s$ (which is defined as the level of confidence for state $s$, reproportioned in [0;1]). However, for states in which Obama is more likely to win ($m_s\geq 0.5$), I basically do it the other way around (ie working with 1$-m_s$). In these cases, the correct Beta distribution has the two parameters swapped (notice that I assign the element res2 to $\alpha_s$ and the element res1 to $\beta_s$).

For example, for Alaska (the first state), the result is an informative prior like this.

In line with the information from the polls, the estimated average proportion of Democratic votes is around 38% and effectively there’s no chance of Obama getting a share that is greater than 50%.

Now, I can simulate the $n_{\rm{sims}}=10000$ “futures”, based on the uncertainty underlying the estimations using the code
nsims <- 10000
prop.dem <- matrix(NA,nsims,Nstates)
for (i in 1:Nstates) {
prop.dem[,i] <- rbeta(nsims,a[i],b[i])
The matrix prop.dem has 10000 rows (possible futures) and 51 columns (one for each state).

I can use the package coefplot2 and produce a nice summary graph
means <- apply(prop.dem,2,mean)
sds <- apply(prop.dem,2,sd)
low <- means-2*sds
upp <- means+2*sds
reps <- which(upp<.5)         # definitely republican states
dems <- which(low>.5)         # definitely democratic states
m.reps <- which(means<.5 & upp>.5) # most likely republican states
m.dems <- which(means>.5 & low<.5) # most likely democratic states
cols <- character()
cols[reps] <- "red"
cols[dems] <- "blue"
cols[m.reps] <- "lightcoral"
cols[m.dems] <- "deepskyblue"
vn <- paste(as.character(State)," (",Voters,")",sep="")
coefplot2(means,sds,varnames=vn,col.pts=cols,main=”Predicted probability of democratic votes”)

This gives me the following graph showing the point estimate (the dots), 95% and 50% credible intervals (the light and dark lines, respectively). Those in dark blue and bright red are the “definites” (ie those that are estimated to be definitely Obama or Romney states, respectively). Light blues and reds are those undecided (ie for which the credible intervals cross 0.5).

Finally, for each simulation, I can check that the estimated proportion of votes for the Democrats exceeds 0.5 and if so allocate the EVs to Obama, to produce a distribution of possible futures for this variable.
obama <- numeric()
for (i in 1:nsims) {
obama[i] <- (prop.dem[i,]>=.5)%*%Voters
hist(obama,30,main=”Predicted number of Electoral Votes for Obama”,xlab=””,ylab=””)
abline(v=270,col=”dark grey”,lty=1,lwd=3)

So, based on this (veeeeery simple and probably not-too-realistic!!) model, Obama has a pretty good chance of being re-elected. In almost all the simulations his share of the votes guarantees he gets at least the required 272 EVs $-$ in fact, in many possible scenarios he actually gets many more than that.

Well, I can only hope I’m not jinxing it!

To leave a comment for the author, please follow the link and comment on their blog: Gianluca Baio's blog. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)