(This article was first published on **Statistical Research » R**, and kindly contributed to R-bloggers)

The beta distribution is a highly flexible distribution that applies to many situations. It works particularly well for quantities expressed as proportions or percentages, and the upcoming New Jersey U.S. Senate election on Wednesday fits that criterion. Here I apply the beta distribution to pre-election polls, with the numbers taken from the poll aggregator www.realclearpolitics.com.

The candidates in this Wednesday's New Jersey election, held to fill the seat left vacant by the death of Frank Lautenberg, are Cory Booker and Steve Lonegan. Although third-party candidates are also running, the race is effectively between Booker and Lonegan. More complex models could be used, but once the field is reduced to two candidates the beta distribution can be applied to these data, and a simple simulation of the outcome can be run from the polling numbers.

**Some Historical Notes**

This general election falls on a non-standard Election Day (Wednesday, October 16th). It happens to be the first time that a New Jersey general election has been held on a Wednesday. Aside from the current Republican senator, who was appointed by Chris Christie, the last time New Jersey had a Republican U.S. Senator was in the early 1980s, and he too was appointed to the office.

**The Beta Distribution**

As the results below show, in the elections since 1990 the Democratic candidate has won by an average of about 8.9 percentage points.

2012 — Menendez: 58.9% v. Kyrillos: 39.4%

2008 — Lautenberg: 55.5% v. Zimmer: 42.5%

2006 — Menendez: 53.3% v. Kean Jr.: 44.3%

2002 — Lautenberg: 53.9% v. Forrester: 44.0%

2000 — Corzine: 50.1% v. Franks: 47.1%

1996 — Torricelli: 52.7% v. Zimmer: 42.6%

1994 — Lautenberg: 50.3% v. Haytaian: 47.0%

1990 — Bradley: 50.5% v. Whitman: 47.4%
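As a quick check on the 8.9-point figure, the average margin can be recomputed from the results listed above. This is only a small sketch; the vector names `dem` and `gop` are my own labels, not part of the original analysis.

```r
## Vote shares (%) for the Democratic and Republican candidates,
## copied from the list of results above (2012 back to 1990).
year <- c(2012, 2008, 2006, 2002, 2000, 1996, 1994, 1990)
dem  <- c(58.9, 55.5, 53.3, 53.9, 50.1, 52.7, 50.3, 50.5)
gop  <- c(39.4, 42.5, 44.3, 44.0, 47.1, 42.6, 47.0, 47.4)

margin <- dem - gop          # winning margin in percentage points
names(margin) <- year
avg.margin <- mean(margin)   # average Democratic margin
round(avg.margin, 1)         # 8.9
```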

Based on recent pre-election polling, Booker looks likely to win by a somewhat larger margin than the historical average of 8.9 percentage points, closer to 12 points. The margin between Booker and Lonegan is modeled here with a beta distribution, and the threshold of zero (0) sits far out in the tail of that distribution. So, based on both historical elections and current pre-election polling, the likelihood that Booker will win appears very high.
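To make the two-candidate reduction concrete, here is a minimal sketch using the figures of the `Mon` poll that appears in the example code further down (53% v. 40%, 571 respondents). It assumes a flat Beta(1, 1) prior and treats the poll shares times the sample size as counts; both are simplifying assumptions of mine, and the variable names are illustrative only.

```r
## Two-candidate reduction: ignore "other/undecided" and look only at
## Booker's share of the two-way vote.
n      <- 571                 # poll sample size
booker <- 0.53 * n            # approximate number of Booker respondents
loneg  <- 0.40 * n            # approximate number of Lonegan respondents

## Posterior for Booker's two-way share under a flat Beta(1, 1) prior
shape1 <- booker + 1
shape2 <- loneg + 1

## Probability that Booker's two-way share exceeds 50%
p.win <- 1 - pbeta(0.5, shape1, shape2)
p.win

## 95% credible interval for the two-way share
ci <- qbeta(c(0.025, 0.975), shape1, shape2)
ci
```

If the 50% line falls below the lower end of the interval, this simplified version points the same way as the full simulation.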

**Example Code**

```r
library(MCMCpack)  ## rdirichlet()
library(MASS)      ## fitdistr()

## Set up several of the recent polls, but only work with the most recent one.
## Columns: candidate 1 share, candidate 2 share, sample size.
raw.1 = data.frame(rbind(
  Quinnipiac = c(.53, .41, 899),
  RSC        = c(.50, .39, 729),
  FD         = c(.45, .29, 702),
  Mon        = c(.53, .40, 571)
))

## Row 5: all polls pooled, weighted by sample size
raw.1 = rbind(raw.1, c(apply(raw.1[,1:2], 2, weighted.mean, raw.1[,3]), sum(raw.1[,3])))
names(raw.1) = c("Cand1", "Cand2", "size")
raw.1$Other.und = 1 - raw.1$Cand1 - raw.1$Cand2

## Row 6: pooled polls with other/undecided split proportionally between the two candidates
raw.1.no.und = data.frame(raw.1[5,1:2] + raw.1[5,1:2]/sum(raw.1[5,1:2])*raw.1[5,4], size=raw.1[5,3], Other.und=0)
raw = rbind(raw.1, raw.1.no.und)
```

```r
###################################################################
## More than two candidates, so the beta distribution won't work directly.
## Function to randomly generate data from a Dirichlet distribution.
###################################################################
j = 4

prob.win = function(j, export=1){
  ## Dirichlet parameters: poll shares times sample size, plus 1
  p = rdirichlet(100000,
    raw$size[j] *
    c(raw$Cand1[j], raw$Cand2[j], 1-raw$Cand1[j]-raw$Cand2[j]) + 1
  )
  if(export==1){
    mean(p[,1] > p[,2])  ## proportion of draws in which candidate 1 leads
  } else {
    return(p)
  }
}

( cand1.win.probs = sapply(1:nrow(raw), prob.win) )

sim.dir = prob.win(4, export=2)  ## simulated data for plotting and determining parameters
sim.dir.diff = sim.dir[,1] - sim.dir[,2]  ## the margin between the candidates; fitted with a beta distribution below
sim.dir = cbind(sim.dir, sim.dir.diff)  ## column 4 holds the margin
```
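For readers who want to see what the Dirichlet draw involves, the sketch below generates Dirichlet samples in base R by normalizing independent gamma draws, which is a standard construction. The helper name `rdirichlet.gamma` is hypothetical (it is not part of MCMCpack), and the parameters are the `Mon` poll values used in the code above.

```r
## Draw n samples from Dirichlet(alpha) by normalizing gamma variates.
## rdirichlet.gamma is an illustrative helper, not an MCMCpack function.
rdirichlet.gamma <- function(n, alpha) {
  k <- length(alpha)
  ## column i holds n gamma draws with shape alpha[i]
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n, ncol = k)
  g / rowSums(g)   # each row now sums to 1
}

set.seed(2013)
alpha <- 571 * c(0.53, 0.40, 0.07) + 1   # poll counts plus 1, as in prob.win()
p <- rdirichlet.gamma(100000, alpha)

mean(p[, 1] > p[, 2])                            # share of draws in which candidate 1 leads
quantile(p[, 1] - p[, 2], c(0.025, 0.5, 0.975))  # simulated margin
```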

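```r
## The starting values (shape1 and shape2) might need some manual adjusting and tweaking.
## In this case I ran the function a few times to set the start values close to the output.
fit.distr.1 = fitdistr(sim.dir[,1], "beta", start=list(shape1=302, shape2=270))
fit.distr.2 = fitdistr(sim.dir[,2], "beta", start=list(shape1=229, shape2=343))

## The margin lies in (-1, 1), outside the support of the beta distribution,
## so it is rescaled to (0, 1) before fitting. Start values are set near the
## moment estimates of the rescaled margin.
fit.distr.margin = fitdistr((sim.dir[,4] + 1)/2, "beta", start=list(shape1=350, shape2=270))
```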
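Rather than tuning the starting values by hand, one option is to derive them from the sample mean and variance (method of moments). The helper `beta.mom` below is a hypothetical name of mine, and it is demonstrated on draws from a known beta distribution rather than on the election simulation.

```r
## Method-of-moments estimates for a beta distribution on (0, 1).
## beta.mom is an illustrative helper; its output has the form expected
## by the `start` argument of fitdistr().
beta.mom <- function(x) {
  m <- mean(x)
  v <- var(x)
  k <- m * (1 - m) / v - 1          # estimates shape1 + shape2
  list(shape1 = m * k, shape2 = (1 - m) * k)
}

set.seed(1)
x <- rbeta(100000, 302, 270)        # toy data with known shape parameters
start <- beta.mom(x)
str(start)                          # should be close to 302 and 270
```

With the simulation above, this could be used as `fitdistr(sim.dir[,1], "beta", start = beta.mom(sim.dir[,1]))`.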

```r
## Could also draw a histogram of simulated data
curve(dbeta(x, fit.distr.1$estimate[1], fit.distr.1$estimate[2]),
  ylim=c(0,20), xlim=c(.3,.6), col='blue', lty=1, lwd=2, ylab="Density", xlab="theta",
  main="Distribution of the NJ U.S. Senate Election 2013",
  sub=paste("Probability that Booker beats Lonegan: ", round(cand1.win.probs[6],2)))  ## Candidate 1
curve(dbeta(x, fit.distr.2$estimate[1], fit.distr.2$estimate[2]), add=TRUE, col='red', lty=2, lwd=2)  ## Candidate 2

## Vertical lines at the median of each candidate's simulated share
abline(v=c(median(sim.dir[,1]), median(sim.dir[,2])), col=c('blue','red'), lwd=2, lty=c(1,2))
legend("topleft", c("Booker","Lonegan"), lwd=2, col=c('blue','red'), lty=c(1,2))

## Draw a histogram of simulated data
hist(sim.dir[,4], nclass=100, main="Histogram of the Candidate Differences", xlab="Candidate Difference")
abline(v=0, col='black', lwd=2, lty=1)  ## zero margin: draws to the left are Lonegan wins
```
