(This article was first published on **Gianluca Baio's blog**, and kindly contributed to R-bloggers)

I couldn’t resist getting sucked into the hype associated with the US election and debates, and so I thought I’d have a little fun of my own and play around a bit with the numbers. *[OK: you may disagree with the definition of “fun” $-$ but then again, if you’re reading this you probably don’t…]*

So, I looked on the internet to find reasonable data on the polls. Of course there are a lot of limitations to this strategy. First, I’ve not bothered doing some sort of proper evidence synthesis, taking into account the different polls and pooling them in a suitable way. There are two reasons why I didn’t: the first one is that not all the data are publicly available (as far as I can tell), so you have to make do with what you can find; second, I did find some info here, which seems to have accounted for this issue anyway. In particular, this website contains some sort of pooled estimates for the proportion of people who are likely to vote for either candidate, by state, together with a “confidence” measure (more on this later). Because not all the states have data, I have also looked here and found some additional info.

Leaving aside representativeness issues (which I’m assuming are not a problem, but may well be, if this were a real analysis!), the second limitation is of course that voting intentions may not directly translate into actual votes. I suppose there are some studies out there to quantify this, but again, I’m making life (too) easy and discounting this effect.

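The data on the polls that I have collected in a single spreadsheet look like this:

| ID | State | Dem | Rep | State_Name | Voters | Confidence |
|----|-------|-----|-----|------------|--------|------------|
| 1 | AK | 38 | 62 | Alaska | 3 | 99.9999 |
| 2 | AL | 36 | 54 | Alabama | 9 | 99.9999 |
| 3 | AR | 35 | 56 | Arkansas | 6 | 99.9999 |
| 4 | AZ | 44 | 52 | Arizona | 11 | 99.9999 |
| 5 | CA | 53 | 38 | California | 55 | 99.9999 |
| … | … | … | … | … | … | … |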

… the available knowledge on the proportion of voters, without any additional observed data). Thus, all I’m doing is a relatively easy analysis. The idea is to first define a suitable informative prior distribution based on the point estimate of the Democratic share, with uncertainty defined in terms of the confidence level. Then I can use Monte Carlo simulations to produce a large number of “possible futures”; in each future and for each state, the Democrats will have an estimated share of the popular vote. If that is greater than 50%, Obama will have won that state and the associated electoral votes (EVs). I can then use the induced predictive distribution on the number of EVs to assess the uncertainty underlying an Obama win (given that at least 270 votes are necessary to become president).
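As a rough sketch of the simulation step described above (not the code used for the actual analysis), the following R snippet draws a Democratic share for each state from a Beta distribution and adds up the electoral votes of the states won in each simulated future. The function name `simulate_ev`, the column names and, in particular, the `alpha` and `beta` values are assumptions made up for illustration; only the electoral votes are taken from the table above.

```r
set.seed(1)  # for reproducibility

# Toy inputs: electoral votes from the table above; alpha/beta are made-up values
states <- data.frame(
  state = c("AK", "AL", "AR", "AZ", "CA"),
  ev    = c(3, 9, 6, 11, 55),
  alpha = c(38, 36, 35, 44, 53),  # hypothetical, for illustration only
  beta  = c(62, 54, 56, 52, 38)   # hypothetical, for illustration only
)

simulate_ev <- function(states, n_sims = 10000) {
  n_states <- nrow(states)
  # One row per state, one column per simulated "future";
  # rbeta recycles the shape vectors, so row r always uses state r's parameters
  share <- matrix(
    rbeta(n_states * n_sims, states$alpha, states$beta),
    nrow = n_states
  )
  # The Democrats take a state's EVs when their simulated share exceeds 50%
  colSums((share > 0.5) * states$ev)
}

ev <- simulate_ev(states)
# Proportion of futures with a majority of the 84 EVs in this toy example
mean(ev >= 43)
```

With all the states included, the same vector `ev` approximates the predictive distribution of the Democratic EVs, and the threshold in the last line would be replaced by the number of votes needed to win.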

The function betaPar2 has several outputs, but the main ones are res1 and res2, which store the values of the parameters $\alpha$ and $\beta$ that define the suitable Beta distribution. In fact, my modelling approach is as follows: if the point estimate is below 0.5 (a state $s$ where Romney is more likely to win), then I want to derive a suitable pair $(\alpha_s,\beta_s)$ so that the resulting Beta distribution is centered on $m_s$ and the probability of not exceeding 0.5 is given by $c_s$ (which is defined as the level of confidence for state $s$, rescaled to $[0,1]$). However, for states in which Obama is more likely to win ($m_s\geq 0.5$), I basically do it the other way around (i.e. working with $1-m_s$). In these cases, the correct Beta distribution has the two parameters swapped (notice that I assign the element res2 to $\alpha_s$ and the element res1 to $\beta_s$).
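To make the construction concrete, here is a minimal sketch of one way to obtain such a pair of parameters. It is not the implementation of betaPar2: the helper names `beta_from_mode` and `state_beta` are hypothetical, and the sketch assumes that "centered on $m_s$" means that $m_s$ is the mode of the Beta distribution. Writing $\alpha = 1 + k\,m$ and $\beta = 1 + k\,(1-m)$ keeps the mode at $m$ for any $k>0$, so a one-dimensional root search over $k$ is enough to match the required probability.

```r
# Hypothetical helper (not the author's betaPar2): find Beta(alpha, beta) with
# mode m and P(theta < 0.5) = conf, for 0 < m < 0.5 and 0.5 < conf < 1
beta_from_mode <- function(m, conf) {
  stopifnot(m > 0, m < 0.5, conf > 0.5, conf < 1)
  # alpha = 1 + k*m and beta = 1 + k*(1 - m) give mode m for every k > 0
  gap <- function(log10_k) {
    k <- 10^log10_k
    pbeta(0.5, 1 + k * m, 1 + k * (1 - m)) - conf
  }
  # Search k between 1e-6 and 1e7; uniroot stops with an error if the
  # solution lies outside this range (e.g. m extremely close to 0.5)
  log10_k <- uniroot(gap, interval = c(-6, 7), tol = 1e-10)$root
  k <- 10^log10_k
  c(alpha = 1 + k * m, beta = 1 + k * (1 - m))
}

# For m >= 0.5, solve the mirrored problem for 1 - m and swap the parameters
# (m exactly equal to 0.5 has no solution and is rejected by the check above)
state_beta <- function(m, conf) {
  if (m < 0.5) return(beta_from_mode(m, conf))
  p <- beta_from_mode(1 - m, conf)
  c(alpha = p[["beta"]], beta = p[["alpha"]])
}

# Alaska row of the table: 38% Democratic, confidence 99.9999%
ak <- state_beta(0.38, 0.999999)
# Check: probability that the Democratic share stays below 50%
pbeta(0.5, ak[["alpha"]], ak[["beta"]])
```

For a Democratic-leaning state, the swapped parameters mean that the confidence level applies to the probability of exceeding 0.5 rather than staying below it.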
