**Statistical Research » R**, and kindly contributed to R-bloggers)

The New York City mayoral Democratic primary election is taking place this coming Tuesday (Sep. 10th) and there are several candidates in the running. Bill de Blasio is the front runner and is expected to win. However, there is a catch. Even if he takes the plurality of the vote he may not actually win. In New York if the candidate does not get at least 40% of the vote then the election goes into a run-off on October 1st. Meaning that Bill Thompson or Christine Quinn (whoever takes 2nd) will then have a second chance to win. This type of problems can be a little difficult to handle using standard univariate statistics. When there are only two candidates then a simple Beta distribution can be used to adequately model the data. However, with there being nine declared candidates the Beta distribution just doesn’t work very well. The Dirichlet distribution is the multivariate generalization of the Beta and this can be used to simulate the outcome when there are multiple candidates. A convenient feature of the Dirichlet is that the marginal distributions are distributed as a Beta making computations on the marginals fairly simple.

I gave a presentation at an AAPOR conference a while back where I used this approach but expanded the model to all 50 states and then applied winner-take-all electoral votes (except for Maine and Nebraska which have a proportion style). In this way I was able to calculate the posterior distribution of the electoral vote and calculate credible intervals and other distributional characteristics. This approach has proved to be a very good way to make estimates and a great way to visual the distribution of the data.

So then the question is whether Bill de Blasio will get more votes than the next candidate and what is the probability that he will take it all without a run-off. There have been several polls by many well respected polling organizations including Quinnipiac, NY Times and NBC, among others.

This post is designed to be a more technical discussion of the underlying and distributional components of the 2013 NYC Democratic Mayoral election (and really elections in general) rather than for a comprehensive discussion. For demonstration simplicity sake I will take the most recent available poll on www.realclearpolitics.com which was Quinnipiac who fielded from Aug 8th to Sep 1. More polls are certain to arrive after this weekend so it will be interesting to see what happens to the probability distributions prior to Election Day.

Here we can see that de Blasio comes very close to obtaining not only a plurality of votes but the necessary 40%. However, the one item that may not necessarily be emphasized in these polls is that there is still a group of undecided voters. Normally, that does not make a major difference if it is expected that the undecided voters will break proportionally (or even somewhat proportionally) among other voters. Because when the candidates take the most votes they simply win. In this case even if a small portion of the voters start to break for de Blasio then that could put him over the top and prevent a run-off. So then what is the probability that he can avoid a run-off? Well, by running a simulation we can put in those constraints that he must beat the next candidate AND get more than 40% of the vote. In this case I set Thompson as the next candidate to compare.

Based on the given data his chances of avoiding a run-off are not that great as can be seen in the graph below. However, if he gains his proportion of the undecided voters then that will increase his probability of avoiding a run-off. Not but much but it ends up being just short of 50/50 chance. The given model can, of course, be improved with some work.

Maybe I’ll put a little bit of work into the upcoming New Jersey Senate election with Cory Booker and the New Jersey (with Chris Christie) & Virginia (with Cuccinelli & McAuliffe) Governor elections. I’ll see if time permits.

library(MCMCpack) ## Set up several of the recent polls but will only work with the most recent on raw.1 = NULL raw.1 = data.frame( rbind( Quinnipiac = c(.43,.20,.18,750), NYT = c(.32,.18,.17,505), Quinnipiac2= c(.36,.20,.21,602), amNY = c(.29, .24,.17, 600), NBC = c(.24, .18,.24, 355), Quinnipiac.adj = c(.465, .226, .194, 750) ## adjust for undecided ) ) names(raw.1) < - c("Cand1","Cand2","Cand3","size") raw.1$Other <- 1-raw.1$Cand1-raw.1$Cand2-raw.1$Cand3 raw <- raw.1 ################################################################### ## More than two candidates so Beta distribution won't work ## Function to randomly generate data from a dirichlet distribution ################################################################### prob.win <- function(j,export=1){ p=rdirichlet(100000, raw$size[j] * c(raw$Cand1[j],raw$Cand2[j],raw$Cand3[j],1-raw$Cand1[j]-raw$Cand2[j]-raw$Cand3[j]) /100+1) if(export==1){ mean(p[,1]>p[,2] & p[,1]>=.4) ## de Blasio need to beat Thompson AND get >= .40 } else { return(p) } } ( cand1.win.probs < - sapply(1:nrow(raw),prob.win) ) sim.dir = prob.win(1,export=2) ## set simulated data for plotting and determining parameters ## The shape parameters (shape1 and shape2) might need some manual adjusting fit.distr.1 = fitdistr(sim.dir[,1], "beta", start=list(shape1=5,shape2=5)) fit.distr.2 = fitdistr(sim.dir[,2], "beta", start=list(shape1=4,shape2=7)) fit.distr.3 = fitdistr(sim.dir[,3], "beta", start=list(shape1=2,shape2=9)) ## Could also draw a histogram of simulated data curve(dbeta(x,fit.distr.1$estimate[1],fit.distr.1$estimate[2]), ylim=c(0,4), col=1, lty=1, lwd=2, ylab="Density", xlab="theta", main="Distribution of the Democratic NYC Mayor Election 2013", sub=paste("Probability that de Blasio beats Thompson AND gets > 40% of Vote: ", round(cand1.win.probs[1],2) ) ) ## Candidate 1 curve(dbeta(x,fit.distr.2$estimate[1],fit.distr.2$estimate[2]), add=T, col=3, lty=2, lwd=2) ## Candidate 2 curve(dbeta(x,fit.distr.3$estimate[1],fit.distr.3$estimate[2]), add=T, col=4, lty=3, lwd=2) ## Candidate 3 abline(v=c(median(sim.dir[,1]), median(sim.dir[,2]), median(sim.dir[,3])), col=c(1,3,4), lwd=2, lty=c(1,2,3)) legend("topright",c("de Blasio","Thompson","Quinn"), lwd=2, col=c(1,3,4), lty=c(1,2,3))

**leave a comment**for the author, please follow the link and comment on their blog:

**Statistical Research » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...