Modelling individual party vote from the 2014 New Zealand election study

[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Someone asked on Twitter about the characteristics of New Zealand First voters. While some crude conclusions can be drawn from examining votes by location cast and then comparing that with census data, we really need individual level data to answer the question properly. I set myself this task as a motivation for exploring the New Zealand Election Study.

This turned into a fairly lengthy blog post and includes more than 500 lines of code. Skip down to the very end if you just want to see how things turn out for the four main parties and the main conclusions from that. But first here’s the best version of the result with relation to New Zealand First:

Later down there’s a chart comparing this to the voters for the National, Labour and Green parties.

Functionality

Here’s the R packages I needed to do this analysis, plus a few convenience functions I wrote specifically for it.

library(tidyverse)
library(scales)
library(foreign)   # for importing SPSS data
library(survey)    # for survey weighting and analysis
library(forcats)   # for manipulating factor levels
library(mgcv)      # for generalized additive models
library(glmnet)    # for elastic net regularisation
library(mice)      # for imputation
library(testthat)
library(broom)     # for reshaping model outputs
library(stringr)   # for str_wrap
library(boot)
library(doParallel)
library(gridExtra)
library(nzelect)   # for party colours
library(lme4)      # for mixed effects / multilevel modelling

#-------------convenience functions--------------
camel_to_english <- function(camelCase){
   return(gsub("([a-z])([A-Z])", "\\1 \\L\\2", camelCase, perl = TRUE))
}

# Convert five category membership question (for trade unions, business
# associations, etc) into Yes or No.
membership <- function(x){
   tmp <- fct_recode(x,
                     Yes = "I belong, but no one else in the house does",
                     Yes = "I do, plus another in the house",
                     Yes = "Another person belongs, but not me",
                     No  = "No, no one belongs",
                     No  = "Don't know") 
   tmp <- ifelse(tmp == "Yes", 1, 0)
   # Uncomment the next line if we want to turn NA into 0.
   # tmp <- ifelse(is.na(tmp), 0, tmp)
   return(tmp)
}

# Draw confidence interval segment chart for two types of models:
# those with variance-covariance matrices, and hence can use confint()
# on them; and pooled models created by with.mice() after multiple
# imputation.
sum_plot <- function(model, 
                     title = NULL, 
                     party = gsub("PartyVote", "", as.character(model$formula)[2]),
                     colour = "#000000",
                     type = c("vcov", "pool")){
   type <- match.arg(type)
   if(type == "vcov"){
      conf_ints <- cbind(tidy(confint(model)), coef(model))
   } else {
      conf_ints <- tidy(summary(pool(model))) %>%
         select(.rownames, lo.95, hi.95, est)
   }
   
   tmp <- conf_ints %>%
      filter(.rownames != "(Intercept)")
   names(tmp)    <- c("var", "lower", "upper", "point")

   p <- tmp %>%
      mutate(var = camel_to_english(var),
             var = fct_reorder(var, point)) %>%
      ggplot(aes(y = var))  +
      geom_vline(xintercept= 0) +
      geom_segment(aes(x = lower, xend = upper, yend = var), 
                   size = 3, colour = colour, alpha = 0.5) +
      geom_point(aes(x = point)) +
      ggtitle(title) +
      labs(caption = "New Zealand Election Survey, analysed at https://ellispgithub.io",
           x = paste("Impact on log odds of voting", party),
           y = str_wrap("Compared to 'no religion', ''age30-55', 'high household income', 'school qualification', 'working full time", 50))
   return(p)
}

Sourcing the New Zealand Election Study data

The New Zealand Election Study generously makes its data available for free. The code snippet below assumes the zip file with the SPSS version of the data has been downloaded into the location specified on d: drive; obviously change this to wherever you have it saved.

#------------------data download-----------------

# Data downloaded from http://www.nzes.org/exec/show/data and because
# they want you to fill in a form to know who is using the data, I
# won't re-publish it myself
unzip("D:/Downloads/NZES2014GeneralReleaseApril16.sav.zip")

nzes_orig <- read.spss("NZES2014GeneralReleaseApril16.sav", 
                  to.data.frame = TRUE, trim.factor.names = TRUE)
varlab <- cbind(attributes(nzes_orig)$variable.labels)

DT::datatable(varlab)

SPSS data includes both short column names and full verbose questions, so the data frame of metadata called varlab in the R snippet above looks like this:

Data prep

The sample size is 2,835 and it includes 234 respondents who claim they party-voted for New Zealand First (note that nine of these were identified by the researchers as actually not voting at all; I made the call to count people on the basis of their self-report). I’m interested in a demographic characterisation of the New Zealand First voters compared to the rest of the New Zealand population (rather than, for example, compared to the voters of a single other party) so I’m going to be using a logistic regression with a response variable of 1 if the respondent voted for NZ First and 0 otherwise.

It’s often not appreciated that, on a rule of thumb, you need more data per explanatory variable for logistic regression than a continuous response: twice as much if the response is split 50/50, 10 times as much if it’s split 90/10.

Based on Frank Harrell’s sample size advice in Regression Modeling Strategies I’ve got somewhere between 11 and 23 degrees of freedom to “spend” on explanatory variables; that is 1/10 or 1/20 of the number of cases of the least frequent value of the categorical response variable. I certainly can’t afford to use all 437 columns in the data. Many of those columns are attitudinal variables that are out of scope for today (hope to come back later), but there’s still far too much socio-economic and demographic data to use it all in a model without more data. So I follow Harrell’s general approach of working out how many degrees of freedom I have to spend; which explanatory variables are committed to be in (no sneaky stepwise reductions based on p values); and how to allocate the degrees of freedom to those variables.

My options were restricted by what is available, and with what is available I had to make choices. For example, I collapsed down the nine household income categories into just two: “lower” and “higher”. In some cases, not being an expert in the exactly best data for theoretically interesting questions, I made some fairly arbitrary choices eg to use the answer to “what is your housing status?” to identify owner-occupiers, rather than “do you or any family member own a residence?”

Here’s the code that created the explanatory variables I wanted, a data set that is just a subset of the explanatory variables, the survey weights, and four possible response variables (party vote for the four highest voting parties).

#============rationalised version of feature creation===========
# This is a single 100 line command to aggregate various answers into
# simpler cateogrisations, because we don't have enough data to 
# analyse the original granular detail.
nzes <- nzes_orig %>%
   
   # party vote:
   mutate(dpartyvote2 = ifelse(is.na(dpartyvote), "Did not vote", as.character(dpartyvote)),
          dpartyvote2 = gsub("M.ori", "Maori", dpartyvote2),
          dpartyvote2 = gsub("net.Man", "net-Man", dpartyvote2),
          dpartyvote2 = fct_infreq(dpartyvote2)) %>%
   
   mutate(PartyVoteNZFirst = (dpartyvote2 == "NZ First"),
          PartyVoteLabour =  (dpartyvote2 == "Labour"),
          PartyVoteNational = (dpartyvote2 == "National"),
          PartyVoteGreen    = (dpartyvote2 == "Green")) %>%
   # voted at all (needed for weighting):
   mutate(Voted = 1 * (ddidvote == 1)) %>%
   
   # Two degrees of freedom for ethnicity:
   mutate(NotEuropean = 1 - dethnicity_e,
          Maori = dethnicity_m) %>%
   
   # Two degrees of freedom for income (lower, higher, don't know):
   mutate(HHIncome = fct_recode(dhhincome,
                                Lower = "No income",
                                Lower = "$NZ23,000 or less",
                                Lower = "$NZ23,001-$NZ31,000",
                                Lower = "$NZ31,001-$NZ39,800",
                                Lower = "$NZ39,801-$NZ55,000",
                                Higher = "$NZ55,001-$NZ76,100",
                                Higher = "$NZ76,101-$NZ110,800",
                                Higher = "$NZ110,801-$NZ147,699",
                                Higher = "$NZ147,700 or over",
                                `Don't know / NA` = "Don't know"),
          HHIncome = ifelse(is.na(HHIncome), "Don't know / NA", as.character(HHIncome)),
          HHIncome = fct_infreq(HHIncome)) %>%
   
   ## Two - four degrees of freedom for associations?
   mutate(HHMemberTradeUnion = membership(dtradeunion),
          HHMemberProfAssoc = membership(dprofassoc)) %>%
   
   ## One degree of freedom for born in NZ
   mutate(NZBorn = ifelse(dnzborn == "New Zealand", 1, 0)
          # uncomment the next line if you want to turn NA into zero:
          # , NZBorn = ifelse(is.na(NZBorn), 0, NZBorn)
   ) %>%
   
   ## One for sex
   mutate(Male = 1 * (dsex == "Male"),
          Male = ifelse(dsex == "Transsexual or transgender", NA, Male)) %>%
   
   ## Two-four for age
   mutate(Age = fct_collapse(as.character(dage),
                             `18-29` = as.character(18:29),
                             `30-55` = as.character(30:55),
                             `56+` = as.character(56:100)),
          # Uncomment the next line if you want to explicitly code the NAs
          # Age = ifelse(is.na(dage), "unknown", as.character(Age)),
          Age = fct_relevel(Age, "30-55")
   ) %>%
   
   ## One for housing.  Note there is also an alternative question "do you or any family member own a residence"
   mutate(OwnHouseOrFlat = 1 * grepl("Own house or flat", dhousing)) %>%
   
   # Two for religion
   mutate(Religion = fct_lump(dreligion, 2)) %>%
   
   # One for marital status
   mutate(Marital = 1 * (dmarital == "Married, in a civil union, or living with a partner")) %>%
   
   # One for self-identified class
   mutate(IdentifyWorkingClass = 1 * (dclass == "Working class")) %>%
   
   ## Two for education (University, None, Other)
   mutate(HighestQual = fct_collapse(dedcons, University = c("Undergraduate Degree", "Postgraduate degree", "PhD")),
          HighestQual = fct_lump(HighestQual, 2),
          HighestQual = ifelse(dedcons == "Not known", NA, as.character(HighestQual)),
          HighestQual = fct_relevel(HighestQual, "Other")
   ) %>%
   
   ## Two for working status
   mutate(WorkStatus = ifelse(!is.na(dwkpt), "Part time", NA),
          WorkStatus = ifelse(!is.na(dwkft), "Full time", WorkStatus),
          WorkStatus = ifelse(is.na(WorkStatus), "Other or unknown", WorkStatus),
          WorkStatus = fct_infreq(WorkStatus),
          Student = 1 * (!is.na(dwksch))) %>%
   
   ## One for occupation
   mutate(SuperviseAnyone = 1 * (dsupervise == "Yes")) %>%
   # Note there is detailed occupation information (for respondent and partner)
   # but I don't think we hav eneough data to use this in the model.
   
   ## None for industry.
   # We have industry of employhment for respondent and partner
   # but I don't think we have enough data to use this.  Plus many people
   # don't know what industry they are in anyway.
   #
   
   ## One degree of freedom for area lived in?
   # Five nice categories here, not enough data so we'll split into one
   mutate(City = 1 * (dregsize == "A major city (over 100,000 population)")) 

#============filter and stockcheck missing data===========
# nzes_subset is a subset of all the columns in the original nzes.
# we want a dataframe with only the variables we intend to use;
# will use it down the track for imputation.
nzes_subset <- nzes %>%
   select(PartyVoteNZFirst,
          PartyVoteNational,
          PartyVoteLabour,
          PartyVoteGreen,
          NotEuropean, Maori,
          HHIncome, 
          OwnHouseOrFlat,
          HHMemberTradeUnion, HHMemberProfAssoc,
          NZBorn,
          Religion,
          Marital,
          HighestQual,
          IdentifyWorkingClass,
          WorkStatus,
          Student,
          SuperviseAnyone,
          City,
          Male,
          Age,
          dage,
          dwtfin,
          Voted)

# 29% rows missing at least one value:
sum(complete.cases(nzes_subset)) / nrow(nzes_subset)

As the last comment in the code above notes, about 29% of rows are missing at least one value. There are three ways of dealing with this:

  1. knock out all rows missing a value
  2. explicitly include a “missing” dummy variable for each factor (which means about a dozen or more extra degrees of freedom)
  3. multiple imputation (creating multiple datasets with different imputed values, reflecting uncertainty in their expected values as well as population randomness - and pooling models estimated from each dataset).

Imputation is definitely the best and I will come back to it, but in the early stages I am knocking out those 29% of the data, for workflow simplicity.

glm versus svyglm versus glm with weights

The NZES data deliberately oversampled some groups (conspicuously, Māori) and used weights to control for this and for the incidental imbalance by sex, age, education, and non-voting (voters were also oversampled).

My first model was deliberately naive and I chose to ignore the weights, something I wouldn’t do but I wanted to see what impact it had. This gave me these results:

When I correctly used the survey weights and estimated the model via Thomas Lumley’s survey package I got this, slightly differing set of results:

To check this, I also tried the base function glm with a weights = argument. The publishers normalised the weights so their mean value is nearly exactly 1 (which wouldn’t be the case with a survey used to estimate population totals for official statistics purposes), so in principle this should be basically ok:

The naive method returns different results from the two that correctly adjust for the weights. In particular, it over-estimates the evidence linking Māori ethnicity and unknown household income to a New Zealand First vote. The two methods that adjust for weights give basically identical results to eachother.

Here’s the code that fitted those first three trial models:

# For convenience, here is a model formulation we'll be using several times:
form <- as.formula(PartyVoteNZFirst ~ 
                      NotEuropean + Maori + 
                      HHIncome + 
                      OwnHouseOrFlat +
                      HHMemberTradeUnion + HHMemberProfAssoc +
                      NZBorn +
                      Religion +
                      Marital + 
                      HighestQual +
                      IdentifyWorkingClass +
                      WorkStatus +
                      Student +
                      SuperviseAnyone +
                      City +
                      Male +
                      Age)

#---------glm v svyglm with knocking out the NAs--------------------
model_naive <- glm(form, data = nzes, family = "binomial")
sum_plot(model_naive, "Generalized linear model, no weights or imputation")

nzes_svy1 <- svydesign(~1, data = nzes, weights = ~dwtfin)
model_svy1 <- svyglm(form, design = nzes_svy1, family = "quasibinomial")
sum_plot(model_svy1, "Survey-specific `svyglm``, survey weights, no imputation")

model_glmw <- glm(form, data = nzes, family = "quasibinomial", weights = dwtfin)
sum_plot(model_glmw, "'Usual' `glm`, survey weights, no imputation")

Non-linear age effect

The data provide an age in years for each respondent (except missing values), which unlike the other variables leaves open the possibility of modelling a non-lineary curved relationship between age and the logarithm of the odds of voting New Zealand First. In my standard models I had split age into three categories (18-29, 30-55, and 56+, with 30-55 the reference point) which would allow some crude non-linearity but it is of interest to see if there is something else going on. To test this, I fit a generalized additive model (GAM) with the mgcv package, which allows the user to specify weights. The results for all the variables of interest were very similar and there are complexities in producing confidence intervals for fixed effects out of a GAM, so I’m not presenting them here. Instead, here’s the relationship between age and voting New Zealand First:

There’s a definite non-linearity here, but it’s not strong or defined enough for me to worry about, and I decided it’s easier to stick to my three simple categories of young, middle and old from here on. It looks like the strong impact really kicks in at about aged 65 rather 55, but I can’t go back and change my category coding just to accommodate that or I’ll be guilty of p-hacking.

Here’s the code for fitting the GAM:

#---------------GAM with weights------------------------------
model_gam <- gam(PartyVoteNZFirst ~ 
                    NotEuropean + Maori + 
                    HHIncome + 
                    OwnHouseOrFlat +
                    HHMemberTradeUnion + HHMemberProfAssoc +
                    NZBorn +
                    Religion +
                    Marital + 
                    HighestQual +
                    IdentifyWorkingClass +
                    WorkStatus +
                    Student +
                    SuperviseAnyone +
                    City +
                    Male +
                    s(dage),
              data = nzes, family = "quasibinomial", weights = dwtfin)
summary(model_gam)

plot(model_gam, shade = TRUE, main = "Non-linear impact of age on voting for NZ First")

Elastic net regularization

To get a feel for the robustness of the results, I also tried using both the “lasso” and “ridge regression” to shrink coefficient estimates towards zero. Results not shown here but basically all coefficient estimates were reduced to zero in either method. This intrigues me… but a thorough investigation will have to wait for a later post (maybe).

Mixed-effects (multilevel) modelling with a regional electorate effect

I’ve just finished Andrew Gelman’s Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do. One of my key takeaways was that while richer states vote Democrats over Republicans (and poor states vice versa), within each state richer individuals vote Republican over Democrats (and poorer individuals vice versa); and the slopes vary in each state. That is, there’s an interaction effect between region and individual income as explanatory variables, and also quite a noticeable region effect that is independent of individual characteristics.

More generally, Gelman uses multi-level models with a regional (and district, perhaps?) random effect as well as individual randomness, when he can get the data. My first ever analysis of voting behaviour was working with David Charnock when he was building on similar research he’d done with Australian federal politics.

We don’t have anywhere near enough data to look into the interaction effects, but we can look to see if there is a regional voting pattern that can’t be explained by individual characteristics.

The NZES data has too many missing values for Regional Council, but has near-complete data on which electorate each respondent is in. So I fit a multi-level model with an electorate-level random effect. There was a noticeable amount of variation by electorate - standard deviation of 0.33 - in the tendency to vote New Zealand First in a model with no individual explanatory variables. However, this went all the way down to zero when the individual variables were included.

Interestingly, once we control for individual characteristics, we see no evidence of a residual region-based electorate effect in tendency to vote New Zealand First.

As an aside, this doesn’t apply to the tendency to vote National, Labour or Green (ie there is a residual electorate effect for those three parties), but investigating why would take me too far afield.

Code for the multilevel models:

#------------------regional multilevel model-----------------------
table(nzes$kregcon, useNA = "always") # too many missing regions to use
table(nzes$delect, useNA = "always")  # only four missing electorates

# Note that glmer interprets "weights" as meaning number of trials
# for a binomial glm, so I'm ignoring them

nzes <- nzes %>%
   mutate(Electorate = ifelse(is.na(delect), "unknown", delect))

model_ml_null <- glmer(PartyVoteNZFirst ~ (1 | Electorate), 
                       data = nzes, family = "binomial")
summary(model_ml_null) # standard deviation of 0.3297, quite noticeable


model_ml <- glmer(PartyVoteNZFirst ~ NotEuropean + Maori + HHIncome + OwnHouseOrFlat + 
                     HHMemberTradeUnion + HHMemberProfAssoc + NZBorn + Religion + 
                     Marital + HighestQual + IdentifyWorkingClass + WorkStatus + 
                     Student + SuperviseAnyone + City + Male + Age + (1 | Electorate), 
                  data = nzes, family = "binomial")

summary(model_ml)
# but "electorate effect" completely disappears to nearly zero when individual variables were included
# - at least for NZ First.  For other parties there is a) problem converging and b)
# there does seem to be a residual electorate effect

Bootstrap with imputation and recalibration of survey weights each resample

All the various models described above I regarded as basically exploratory. My “real” model and estimation process was always going to involve a bootstrap, and imputation of the missing values in a way that captures the randomness that would have been expected to be seen if we had a full set of data.

Recalibrating survey weights

The bootstrap involves a resample with replacement of a full sized sample from the sample we’ve got. One result of doing such a resample is that the weights will no longer correctly represent the proportions of people of various age, sex, ethnicity and voting status (did or didn’t vote) as the original weighted sample did. Fixing this requires creating a new set of weights for each bootstrap resample.

To calculate those weights, I need some more information on how the original weights were set. I haven’t been able to find (admittedly, after minimal effort) a definitive description of the weighting scheme of the NZES, but it definitely includes ethnicity, age, sex, and voting status; and I thought it included education. In fact, I particularly hope it included education, because a much-awaited review out today on the polling performance leading up to the 2016 US Presidential election pinged the failure of many state polls to weight by education as a major contributor to error (more educated people respond more to surveys; and in that particular year, education was more strongly related to vote than had been the case in previous elections).

We can get a glimpse of the sampling scheme (both deliberate and incidental under and oversampling) by a regression of the published survey weight on our explanatory variables. Looking at this, I’m sure that the weights are taking education into account:

This graphic nicely shows that Maori, voters, older people, and highly educated people were over-sampled and hence have low weights; whereas young people, men and students were under-sampled (probably through lower response rates rather than research design) and hence get slightly higher weights than usual.

mod <- lm(dwtfin ~ . -dage, data = nzes_subset)

tidy(mod) %>%
   filter(term != "(Intercept)") %>%
   mutate(term = fct_reorder(term, estimate)) %>%
   ggplot(aes(x = estimate, y = term)) + 
   geom_point() +
   labs(x = "Impact on weight") +
   ggtitle("Relative weights of different demographics, NZ Election Study 2014",
           "Voters, M\u0101ori, women, older, and university qualified people are over-sampled and have low weights to compensate")

Most commonly, in a bootstrap of a complex survey one would calculate a set of “replicate weights”, which have a one-off calibration to the correct population totals using Lumley’s survey package. However, I wanted to do the calibration for each resample separately, for convenience, because I was going to perform imputation separately for each resample, which adds quite a complication. Imputation depends on the observed data, hence should be performed for each resample to ensure the full randomness of the approach is taken into account.

To be ready for this recalibration, I needed the sum of weights for various marginal totals from the very first sample - these totals will reveal to me the implicit known, correct population totals the original researchers used in their weighting scheme. I won’t be doing it exactly the way they did, becuase I’m using my own simplified versions of age, qualifications, etc.; but it should be close enough for my purposes. Here’s the code that calculates those marginal totals for me, for later use in the bootstrap cycle:

#--------------define population marginal totals that we will force weights to add up to-------------
age_pop <- nzes_subset %>%
   group_by(Age) %>%
   summarise(Freq = sum(dwtfin)) %>%
   filter(!is.na(Age))

male_pop <- nzes_subset %>%
   group_by(Male) %>%
   summarise(Freq = sum(dwtfin)) %>%
   filter(!is.na(Male))

qual_pop <- nzes_subset %>%
   group_by(HighestQual) %>%
   summarise(Freq = sum(dwtfin)) %>%
   filter(!is.na(HighestQual))

maori_pop <- nzes_subset %>%
   group_by(Maori) %>%
   summarise(Freq = sum(dwtfin)) %>%
   filter(!is.na(Maori))

voted_pop <- nzes_subset %>%
   group_by(Voted) %>%
   summarise(Freq = sum(dwtfin)) 

Bootstrap results

Now, I’m ready to run my loop of bootstrap resamples, to get a decent “best possible” estimate of the relationship between demographic explanatory variables and tendency to party-vote New Zealand First, compared to the rest of the New Zealand adult population. Each resample will:

  • have a single set of imputed values calculated to replace its NAs, using the mice package that is usually used for multiple imputation even though I only want one imputation set per resample. mice adequately captures the existing observed structure in the data (much better than just imputing average values), as well as the observed randomness.
  • have its weights recalibrated to the correct marginal totals
  • have a model fit to it with that full set of data and new set of weights

The end result has already been shown at the top of this post, but as this is all so long here it is reproduced:

The code that does this exercise took about 90 minutes to run on 7 cores on my laptop:

#---------------define a bootstrap function-------------
# Function that does imputation, calibration of weights, and fits
# a survey GLM to the result:
imp_cal_fit <- function(x, i){
   # for dev:
   # x <- nzes_subset; i <- sample(1:nrow(x), nrow(x), replace = TRUE)
   
   # Resample the data
   nzes_res <- x[i, ]
   
   # Create a single complete, imputed version of the data
   nzes_imp <- complete(mice(nzes_res, 1, printFlag = FALSE))
   attributes(nzes_imp$Religion)$contrasts    <- NULL
   attributes(nzes_imp$WorkStatus)$contrasts  <- NULL
   attributes(nzes_imp$HHIncome)$contrasts    <- NULL
   attributes(nzes_imp$HighestQual)$contrasts <- NULL
   attributes(nzes_imp$Age)$contrasts         <- NULL
   
   # Set up as a survey object
   nzes_svy <- svydesign(~1, data = nzes_imp, weights = ~dwtfin)
   
   # force the marginal totals to match those from the original weighting
   nzes_svy_cal <- calibrate(nzes_svy, 
                         list(~Maori, ~Age, ~HighestQual, ~Male, ~Voted), 
                         list(maori_pop, age_pop, qual_pop, male_pop, voted_pop))
   
   # Fit the model
   model_svy_cal <- svyglm(form, design = nzes_svy_cal, family = "quasibinomial")
   
   
   # Return the results
   return(coef(model_svy_cal))
}

# Set up a cluster for parallel computing
cluster <- makeCluster(7) # only any good if you have at least 7 processors
registerDoParallel(cluster)

clusterEvalQ(cluster, {
   library(survey)
   library(mice)
})

# to mimic the original deliberate over-sampling of Maori, I make the bootstrap
# resampling stratify itself on Maori too.
# Note that this takes about 90 minutes, even with the parallel computing.  Each resample
# has to do the iterative imputation from scratch, then calibrate survey weights.
system.time({
nzes_booted <- boot(nzes_subset, imp_cal_fit, 
                    R = 999, strata = nzes_subset$Maori, 
                    parallel = "snow", cl = cluster)
})

# It's not common to do it this way.  More usually you make a set of many replicate
# weights and calibrate them as a once-off job, then use those weights for any future
# estimation.  But this doesn't conveniently fit in a workflow where we are doing a 
# new imputation for each bootstrap resample (which is usually admitted to be good 
# practice, but very rarely done, precisely because it's inconvenient for workflow).


# extract the 95% confidence intervals
k <- ncol(nzes_booted$t)
results <- as.data.frame(t(sapply(1:k, function(i){boot.ci(nzes_booted, type = "perc", index = i)$percent[4:5]})))
names(results) <- c("lower", "upper")
results$term <- names(coef(model_naive))

# draw graphic:
results %>%
   filter(term != "(Intercept)") %>%
   mutate(term = camel_to_english(term),
          mid = (lower + upper ) / 2,
          term = fct_reorder(term, mid)) %>%
   ggplot(aes(y = term)) +
   geom_vline(xintercept = 0) +
   geom_segment(aes(x = lower, xend = upper, yend = term), 
                colour = parties_v["NZ First"], alpha = 0.5, size = 3) +
   ggtitle("Party vote for New Zealand First in the 2014 election",
      "Bootstrap, imputation, recalibrated survey weights, svyglm") +
   labs(caption = "New Zealand Election Survey, analysed at https://ellispgithub.io",
        x = "Impact on log odds of voting New Zealand First",
        y = str_wrap("Compared to 'no religion', 'age30-55', 'high household income', 'school qualification', 'working full time'", 50))

Multiple imputation without bootstrap

The bootstrapped, multiply-imputed model gives a warm glow of thorough justificationness (if there is such a word) but is expensive in processor time. I was left with the impression that the standard model was fitting ok enough that even analytically calculated confidence intervals are probably good enough - after all, the bootstrap results looked pretty similar to the original non-bootstrap versions with no imputation at all.

To round out the analysis with fitting similar models to all parties, I skipped the bootstrap part and just used the original sample, but with five different sets of imputed values using the core functionality provided by mice. I didn’t bother to re-calibrate the weights for each set of imputed values because I’m fairly confident changes from the original weights would have been minimal. This means I have five different complete (no NA values) datasets, each of my four models is fit to each dataset, and the five results for each party are pooled using the Barnard-Rubin method.

Here’s the end results (click image to enlarge):

… and the code to do it is pretty simple:

#============multiple imputation all four main parties===============
# see http://r-survey.r-forge.r-project.org/survey/svymi.html for an alternative way to do this
# Also see https://stats.stackexchange.com/questions/149053/questions-on-multiple-imputation-with-mice-for-a-multigroup-sem-analysis-inclu

nzes_mi <- mice(nzes_subset)

attributes(nzes_mi$data$Age)$contrasts <- NULL
attributes(nzes_mi$data$Religion)$contrasts <- NULL
attributes(nzes_mi$data$HighestQual)$contrasts <- NULL
attributes(nzes_mi$data$WorkStatus)$contrasts <- NULL
attributes(nzes_mi$data$HHIncome)$contrasts <- NULL

responses <- paste0("PartyVote", c("National", "Labour", "NZFirst", "Green"))
colours <- parties_v[c("National", "Labour", "NZ First", "Green")]
p <- list()

form_gen <- "XXX ~ NotEuropean + Maori + HHIncome + OwnHouseOrFlat +
HHMemberTradeUnion + HHMemberProfAssoc + NZBorn + Religion +
Marital + HighestQual + IdentifyWorkingClass + WorkStatus +
Student + SuperviseAnyone + City + Male + Age"

for(i in 1:length(responses)){
   form2 <- gsub("XXX", responses[i], form_gen)
   model_mi <- with(data = nzes_mi,
                    glm(as.formula(form2), 
                        family = "quasibinomial", weights = dwtfin))
   p[[i]] <- sum_plot(model_mi, 
                      title = paste(gsub("PartyVote", "Party vote for ", responses[i]),
                                    "compared to rest of population"),
                      colour = colours[i],
                      type = "pool",
                      party = gsub("PartyVote", "", responses[i]))
}

grid.arrange(p[[1]], p[[2]], p[[3]], p[[4]])

Discussion

While I do a chunk of analysis on voting behaviour (eg see my election forecasts), I’m the wrong person for all sorts of reason to give actual commentary on New Zealand politics. So I’m going to make only the most obvious comments:

  • To answer the original question at the beginning of this post, New Zealand First voters were more likely than the general population to be older, born in New Zealand, identifying as working class and male.
  • No surprise that people such as home owners, higher income people, European heritage, those who don’t identify as working class, don’t have a trade union member in the household voted for the centre-right National party
  • Perhaps more surprising that being born in New Zealand is estimated to make people less likely to vote for National after controlling for education, ethnicity, income etc (in fact this one is surprising enough for me I’d like to see it replicated with fresh data before making too much of it)
  • In addition to the variables that would be expected to lead to Labour vote (union membership, lower income, less qualification), being Maori and female are both estimated to point in that direction
  • The characteristics of Green voters are an interesting but not that surprising list - one or more of university qualifications, beign a student, part time, union member in the household, living in a city, not identifying as Christian, not of European ethnicity.

So there we go. Relationship of individual socio-economi status on voting behaviour in the 2014 New Zealand General Election. Rather sadly, I don’t see much of this amazing dataset in the academic literature. Maybe I’m looking in the wrong spot.

Corrections and other comments welcomed. Beware - all the above code isn’t checked by anyone except me.

Big thanks to the Electoral Commission for funding the NZES, the NZES researchers themselves, and (as always) everyone behind the R core project and the 16 R packages listed near the beginning of this post.

> thankr::shoulders() %>%
+    mutate(maintainer = gsub("<.*>", "", maintainer)) %>%
+    group_by(maintainer) %>%
+    summarise(no_packages = sum(no_packages)) %>%
+    arrange(desc(no_packages)) %>%
+    as.data.frame()
             maintainer no_packages
1       Hadley Wickham           18
2          R Core Team           12
3         Brian Ripley            4
4        Kirill Müller            3
5         Rich Calaway            3
6           Yixuan Qiu            3
7    Dirk Eddelbuettel            2
8       Jennifer Bryan            2
9      Martin Maechler            2
10     "Thomas Lumley"            1
11       Achim Zeileis            1
12    Adelchi Azzalini            1
13     Baptiste Auguie            1
14          Ben Bolker            1
15   Charlotte Wickham            1
16      David Robinson            1
17     Deepayan Sarkar            1
18  Duncan Temple Lang            1
19        Gábor Csárdi            1
20        James Hester            1
21         Jelmer Ypma            1
22         Jeroen Ooms            1
23          Jim Hester            1
24          JJ Allaire            1
25           Joe Cheng            1
26 Katharine M. Mullen            1
27        Luke Tierney            1
28    Marek Gagolewski            1
29         Peter Ellis            1
30              R-core            1
31          Simon Wood            1
32     Stef van Buuren            1
33 Stefan Milton Bache            1
34    Terry M Therneau            1
35       Trevor Hastie            1
36       Vitalie Spinu            1
37     William Revelle            1
38       Winston Chang            1
39           Yihui Xie            1

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)