Venezuelan Parliamentary Election: What do the Polls Say?

[This article was first published on Daniel, and kindly contributed to R-bloggers].

There are not many opinion polls covering this parliamentary election in Venezuela, but those I have found can be used to gauge public opinion as measured by the local polling houses. This post asks an obvious question: how has the mood in Venezuela varied over time with respect to voting intentions for the two political blocs? And can we detect any biases among those publishing polls?

The data

I collected polls available on the internet dating back to January 2014 and, after some data cleaning, made them available here.

Polls over time

After filling in the missing date values, we can visualize the poll trends over time. Given the sample sizes, sampling error, and other sources of noise, a loess model does a reasonable job of picking out the long-term trends.
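As a minimal sketch of these two steps, the snippet below fills missing end dates with the average fieldwork length and then fits a loess curve. The data frame `toy` and all of its numbers are invented for illustration; only the column names `begin`, `end` and `MUD` mirror those used in the code at the end of this post.

```r
# Toy data: invented numbers, for illustration only
toy <- data.frame(
  begin = as.Date(c("2014-01-10", "2014-03-05", "2014-06-01", "2014-09-15",
                    "2014-12-01", "2015-02-20", "2015-05-10", "2015-08-01")),
  end   = as.Date(c("2014-01-20", NA, "2014-06-11", "2014-09-25",
                    "2014-12-11", NA, "2015-05-20", "2015-08-11")),
  MUD   = c(40, 42, 45, 44, 48, 50, 53, 55)
)

# Average number of days in the field, ignoring polls without an end date
days <- round(mean(toy$end - toy$begin, na.rm = TRUE))

# Fill each missing end date with begin + average duration
mask <- is.na(toy$end)
toy$end[mask] <- toy$begin[mask] + days

# Smooth the series; loess needs a numeric predictor, hence as.numeric()
toy$t <- as.numeric(toy$begin)
fit <- loess(MUD ~ t, data = toy)
toy$trend <- predict(fit)  # fitted trend at each poll date
```

With real data the default loess span may need tuning; the default is used here only to keep the sketch short.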

Loess model

Pollster biases

Let's pretend we can trust all of these polls despite the huge variability among them, as already mentioned here. In fact, the problem is not the variability as such but, to put it plainly, my lack of knowledge about who the pollsters are and how they have performed in the past, so I can't judge them up front. Nonetheless, if we accept the loess models above as a sound estimate of the expected poll response at a given time, we can analyze the residuals of the actual poll results and look for systematic biases. In theory, with a decent sample size (all have about 1,300 respondents) and a reasonably stratified sampling method (I'm not even assuming random samples here), we might expect poll results to be roughly normally distributed around the expected result, regardless of who performed or commissioned the poll.
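As a rough yardstick for how much scatter sampling error alone could produce, here is the textbook 95% margin of error for a sample of 1,300. This is only a sketch: it assumes simple random sampling and a worst-case proportion of 0.5, neither of which is claimed for these polls.

```r
# 95% margin of error for a proportion under simple random sampling.
# Assumptions (not from the polls themselves): SRS and worst case p = 0.5.
n <- 1300
p <- 0.5
moe <- qnorm(0.975) * sqrt(p * (1 - p) / n)
round(100 * moe, 1)  # margin of error in percentage points, about 2.7
```

Residuals much larger than this are therefore unlikely to be explained by sampling error alone, though design effects and question wording would widen the band in practice.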

The graph below shows the distribution of residuals per polling house, for houses with more than one poll in this dataset.

House Effects

Keep in mind that there are important caveats we're not addressing here, such as the fact that different polls used different question sets, methods, and so on, so this isn't evidence of anything underhanded per se. It seems reasonable to expect that, while parties might have good reasons to publish polls in their favor, the pollsters conducting the polls should generally be more or less indifferent.

The results are hampered by the small number of data points per pollster, which makes it hard to claim that any of them polls significantly above or below expectation. The exception is Hercón, which is noticeably more pro-opposition (MUD) than expected, although its p-value is just above the conventional 5% threshold. With a little research, I found that Datanálisis performed well in previous elections; here it sits near the center of the distribution, leaning toward the Socialists (PSUV).

pollster              p
Meganalisis           0.1875000
Venebarómetro         0.2500000
IVAD                  0.4375000
ICS                   0.5000000
Delphos               1.0000000
Datanálisis           0.7646484
VARIANZAS             1.0000000
Consultores           0.2500000
DatinCorp             0.5625000
Keller y Asociados    0.2500000
Hercón                0.0625000
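A note on reading the table above: with so few polls per pollster, the exact Wilcoxon signed-rank test can only return a handful of distinct p-values. The sketch below uses invented residuals, not the actual data, to show that five residuals all on the same side of zero give exactly 0.0625, the value reported for Hercón. That Hercón has five polls in the dataset is an assumption here, not something the table states.

```r
# Smallest attainable two-sided p-value of the exact Wilcoxon signed-rank
# test when all n residuals fall on the same side of zero: 2 / 2^n
min_p <- function(n) 2 / 2^n
sapply(1:5, min_p)

# Check against wilcox.test() with five invented, all-positive residuals
x <- c(1.2, 2.5, 3.1, 4.7, 5.3)
p5 <- wilcox.test(x, mu = 0)$p.value
p5
```

In other words, a pollster with five or fewer polls cannot fall below the 5% threshold with this test, however one-sided its residuals are.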


What do the polls say? Well, the majority of Venezuelans favor opposition candidates, and this has been the trend for at least the last two years; however, polls appear to have been more variable in recent months. This election is expected to give the opposition control of the National Assembly after 16 years of losing elections in the country. Venezuela's Socialists seem to be at risk, but predicting the final number of seats is a tough task that I'm not attempting in this post. In fact, it might be really difficult to set forth a range of seats won, as the government recently redrew some districts in order to weaken an eventual absolute majority for the opposition. Still, the polls suggest this will be a significant symbolic defeat for the government, one showing that it lost despite all its advantages in state power and control over the media.


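library(reshape2)  # melt()
library(ggplot2)   # plotting
library(scales)    # date_format()
library(dplyr)     # group_by(), summarise(), %>%
# Note: theme_538() used below is a user-defined ggplot2 theme not included in this post

source = ""

data <- read.csv(source, sep="\t", encoding = "UTF-8")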

# Correcting for empty date values

data[,2:3]<-lapply(data[,2:3],as.Date, format = "%d-%m-%Y")

times <- function(x)(x*100)


days = round(mean(data$end-data$begin, na.rm=TRUE))
mask = is.na(data$end)

data$end[mask] = data$begin[mask]+days

# Find middle time point
DaysInField = round(mean(data$end-data$begin, na.rm=TRUE))
data$date = data$begin+DaysInField
polls <- melt(data, id.vars=c("house", "date"), 
     measure.var=c("MUD", "PSUV", "Others", "Undecided"))
colnames(polls)[1] <- "pollster"  # the code below refers to the polling house as "pollster"
colnames(polls)[3] <- "response"
levels(polls$response) <- c("MUD", "PSUV", "Others", "Swing")

ggplot(polls, aes(x=date, y=value, col=response, fill=response)) + 
  geom_point() + geom_smooth(method="loess", alpha=I(.2)) +
  theme_538() + 
 theme(legend.position=c(.5,.95), legend.direction="horizontal") +
  scale_color_manual(values = c("blue", "red", "orange", "grey40")) +
  scale_fill_manual(values = c("blue", "red", "orange", "grey40")) +
  scale_x_date(labels = date_format("%b '%y")) +
  scale_y_continuous(breaks=seq(0, 70, 10), limits=c(0,70)) +
  geom_hline(yintercept=0,size=1.2,colour="#535353") +
  ggtitle("Vote Intention Among Venezuelans") +
  labs(x="", y="%", fill="Poll response:", col="Poll response:")
# credits
## Residual analysis per pollster
l.MUD <- loess(value ~ as.numeric(date), data=subset(polls, response=="MUD"))
l.PSUV <- loess(value ~ as.numeric(date), data=subset(polls, response=="PSUV"))
l.Others <- loess(value ~ as.numeric(date), data=subset(polls, response=="Others"))
l.Swing <- loess(value ~ as.numeric(date), data=subset(polls, response=="Swing"))

with(polls, plot(as.numeric(date), value))
lines(as.numeric(polls[polls$response == "MUD",]$date),
      predict(l.MUD, as.numeric(polls[polls$response == "MUD",]$date)))

# Calculate predicted values per row, 
polls$predicted <- NA

loessPrediction <- function(resp, model){
  rows <- polls$response == resp
  curr <- polls[rows,]
  preds <- with(curr, predict(model, as.numeric(date)))
  polls[rows,]$predicted <<- preds
}

loessPrediction("MUD", l.MUD)
loessPrediction("PSUV", l.PSUV)
loessPrediction("Others", l.Others)
loessPrediction("Swing", l.Swing)

polls$residual <- polls$value - polls$predicted

## Order pollster by median residual:
ordering <- group_by(polls, pollster) %>%
  filter(response == "MUD") %>%
  summarize(med = median(residual, na.rm=T), count=n()) %>%
  arrange(med)

polls$pollster <- factor(polls$pollster, levels=ordering$pollster)

## Testing for biases by a given pollster 
ggplot(subset(polls, response == "MUD"), 
       aes(x=pollster, y=residual)) +
  geom_hline(aes(yintercept=0)) +
  geom_violin(scale="width", fill=I("grey50"), col=I("grey50")) + 
  geom_jitter(position=position_jitter(width=.05)) + 
  stat_summary(geom = "crossbar", width=0.75, fatten=2, 
               color="grey20", fun.y=median, fun.ymin=median, fun.ymax=median) +
  coord_flip() + theme_538() + ggtitle("Relative MUD voting intentions") +
  labs(x="Polling house",
       y="Comparison with other polls at the time")
## Stats significance, is it any?
sigTable <- polls %>% filter(response == "MUD") %>%
  group_by( pollster) %>%
  summarise(p=wilcox.test(residual, mu=0)$p.value) 
