Aggregate electoral targeting with R

October 22, 2009

(This article was first published on Offensive Politics, and kindly contributed to R-bloggers)

Introduction

Electoral targeting is the process of quantifying the partisan bias of a single voter or subset of voters in a geographic region. Bias can be calculated using an individual’s demographic and voting behavior or by aggregating results from an entire election precinct. Targeting is traditionally performed by national committees (e.g., National Committee for an Effective Congress, National Republican Congressional Committee), state political parties, interest groups (e.g., EMILY’s List, National Rifle Association), or campaign consultants. Targeting data is consumed by campaign managers and analysts, and it is used along with polling data to build strategy, direct resources, and project electoral outcomes.

While aggregate electoral targeting can build a sophisticated picture of a district, the mathematics behind targeting are very simple. Targeting can be performed by anyone with previous electoral data, and calculations can be done using 3×5 note cards, with simple spreadsheets, or high-end software packages like SPSS. The targeting methods discussed in this post are taken from academic publications on electioneering: Campaign Craft (Burton, Shea 2006) and The Campaign Manager (Shaw, 2004).

Although targeting data is usually inexpensive or free, a down-ballot campaign or a primary challenger might not have the connections or support of a PAC or party to obtain the data. In these cases, a campaign will probably purchase one of the books listed above to perform its own analysis. Even an established campaign may run its own analysis, possibly to test different turnout theories or to integrate additional data. This post is directed towards these groups.

Together, we will assume the role of campaign consultant and perform an aggregate electoral analysis on the 13th House of Delegates seat (HOD#13) in the Commonwealth of Virginia. In HOD#13, the 18-year Republican incumbent Bob Marshall is being challenged by Democrat John Bell. This analysis will compute and visualize turnout, partisan bias, and a precinct ranking based on projected turnout and historical Democratic support.

The analysis of HOD#13 will be performed using R, an open-source computing platform. R is free, extensible, and interactive, making it an ideal platform for experimentation. The R package aggpol was created specifically for this tutorial, and it contains all the data and operations required to execute an aggregate electoral analysis. Readers can execute the provided R code to reproduce the analysis or simply follow along to learn how it was performed. Readers unfamiliar with R should read Introduction to R, which is available on the R project homepage.

The electoral and registration data used were compiled from the Virginia State Board of Elections using several custom written parsers and two different PDF-to-text engines. Please contact me for source data or more information at: [email protected].

Prerequisites

This section only applies to readers interested in recreating the analysis and graphics produced in this tutorial. To completely recreate this analysis you will need the following:

  1. The latest version of the R statistical computing environment. Binaries, source, and installation instructions can be downloaded from the R homepage. Install the appropriate R environment for your system and run the program.
  2. Additional R packages. This analysis requires several packages that provide additional functionality on top of the base R system: plyr, ggplot2, and RColorBrewer. To install these packages, execute the following in your R environment:

      install.packages(c("plyr", "RColorBrewer", "ggplot2"))

      Next you’ll need to install the aggpol package for calculating aggregate political statistics. You will need to download the latest version (for Unix-style systems or for Windows systems). Installation of local packages is detailed in the R Manual on package installation.
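
Before installing anything, it can help to see which of these packages are already present. The following is a minimal sketch using only base R; the variable names are ours, and aggpol is included in the check on the assumption that it installs under that name.

```r
# Packages this tutorial loads; "aggpol" assumes the package installs under that name
needed <- c("plyr", "ggplot2", "RColorBrewer", "aggpol")

# requireNamespace() returns FALSE (rather than raising an error) when a package is absent
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not.installed <- needed[!installed]

if (length(not.installed) > 0) {
  message("Still to install: ", paste(not.installed, collapse = ", "))
} else {
  message("All required packages are available.")
}
```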

Getting Started

Now that the prerequisites are installed we can get started with our data analysis. Start up your R environment and load the required libraries by typing in the following commands:

library(plyr)
library(aggpol)
library(ggplot2)
library(RColorBrewer)

We need to attach the VAHOD data set that comes with aggpol. This data set contains precinct-level electoral returns for state and federal elections in the Commonwealth of Virginia from 2001 to 2008. Since we are focusing on HOD#13, we’ll need to select just the records that have to do with that seat.

data(VAHOD)
hd013 <- vahod[which(vahod$seat == "HD-013"),]

The data set contains precinct-level electoral results for the following races: U.S. President, U.S. Senate, U.S. House of Representatives, Virginia Governor, Senate of Virginia, and Virginia House of Delegates. This breadth of electoral returns allows us to build a very detailed profile of the partisan bias of a district.

We will first determine the historical partisanship in HOD#13. Since partisanship can fluctuate over the years and different seats have different turnout expectations, we’ll first need to see the major party support for every seat in each election for precincts in HOD#13. We can use the historical.election.summary function from the aggpol package to group the precinct results into district results, and then break them down by seat and year.

esum <- historical.election.summary(hd013)

esum now contains:

  year district_type total.turnout rep.turnout rep.turnout.percent dem.turnout dem.turnout.percent oth.turnout oth.turnout.percent
1 2001            GV          5527        3266              0.5909        2207              0.3993          54              0.0097
2 2001            HD          5399        3475              0.6436        1924              0.3563           0                   0
3 2001            LG          5432        3291              0.6058        2025              0.3727         116              0.0213
4 2003            HD         10299       10103              0.9809         110              0.0106          86              0.0083
13 more lines…

We now have major-party turnout for every election in our data set. To best visualize the results we’ll build a bar graph comparing major-party turnout in each seat over time. We first need to transpose the election summary object (esum) from a summary format to an observation format, one line per distinct year+district+party. The plyr package makes this task extremely simple.

elx <- ddply(esum,c("year", "district_type"), function(x) 
  rbind(  data.frame(party="REP",turnout=x$rep.turnout.percent),
  data.frame(party="DEM",turnout=x$dem.turnout.percent)))
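
For readers who want to see what this reshaping does without installing plyr or aggpol, here is a sketch in base R. It rebuilds only the four summary rows printed earlier (the remaining 13 are omitted), so esum.sample is a partial, hand-typed stand-in for esum rather than real aggpol output, and the object names are ours.

```r
# Hand-typed stand-in for the first four rows of esum shown earlier
esum.sample <- data.frame(
  year = c(2001, 2001, 2001, 2003),
  district_type = c("GV", "HD", "LG", "HD"),
  rep.turnout.percent = c(0.5909, 0.6436, 0.6058, 0.9809),
  dem.turnout.percent = c(0.3993, 0.3563, 0.3727, 0.0106)
)

# One row per year + district + party: stack a REP copy on top of a DEM copy
keys <- esum.sample[, c("year", "district_type")]
elx.sample <- rbind(
  data.frame(keys, party = "REP", turnout = esum.sample$rep.turnout.percent),
  data.frame(keys, party = "DEM", turnout = esum.sample$dem.turnout.percent)
)

print(elx.sample)
```

The rows come out grouped by party rather than by year and district as ddply would order them, which makes no difference to the plot that follows.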

We will now use the powerful ggplot2 package to view the Republican and Democratic support for each election, in each seat, for our subset:

ggplot(elx,aes(year,turnout,fill=factor(party))) + 
  geom_bar(stat="identity") + 
  facet_wrap(~district_type,scales="free_x") + 
  scale_fill_brewer(palette="Set1")

Result:
[Figure: HD#013 major party percentages]

This graphic gives us a decent understanding of district-level electoral trends. For U.S. federal elections (figs.: PVP, USH, USS), we can see a distinct drop in Republican support moving towards 2008; the results for U.S. House (USH) and U.S. Senate (USS), in particular, show a strong increase in Democratic support. This growth correlates with statewide trends that resulted in the election of two Democratic Senators representing Virginia for the first time since 1970. General Democratic gains notwithstanding, the House of Delegates (fig.: HD) results aren’t as promising for a Democratic challenger. The incumbent, Del. Marshall, saw more than 60% support in three of the last four elections and saw no challenger at all in 2003. While the district may be trending more Democratic over time, the voters of HOD#13 are obviously big fans of Del. Marshall.

Now that we understand the historical partisanship of this district we need to understand historical turnout, allowing us to project the number of votes required to win. We will utilize the historical.turnout.summary function from the aggpol package to produce a summary of turnout for this district.

historical.turnout.summary(hd013, district.type="HD", district.number="013", years=c(2001,2003,2005,2007))
  year total.turnout total.registration
1 2001          5399              13275
2 2003         10031              45769
3 2005         23592              62497
4 2007         26110              78028

Looking at this table one can see some data collection problems in the 2001 HD elections. In recent years, precincts belonged to only one House of Delegates seat, but in 2001 (and somewhat less so in 2003) some precincts are split, some have duplicate names, and there is no information on how to allocate results from different races to precincts. The turnout numbers are slightly affected by these problems, but the aggpol package attempts to correct this by substituting alternate years or even races if possible.

The takeaway from the previous table is that turnout for the last four House of Delegates elections has hovered around 30%. This makes some political sense, because Virginia holds state elections in odd-numbered years with no federal elections to drive up turnout. This leaves a lot of registered voters to be activated, but we need to delve down to the precinct level to find them.
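
The turnout rate for each year is simply total turnout divided by total registration. A quick sketch with the four rows from the table above (the rate column and the object names are ours, not part of aggpol):

```r
# Turnout and registration for HOD#13, copied from the table above
turnout <- data.frame(
  year = c(2001, 2003, 2005, 2007),
  total.turnout = c(5399, 10031, 23592, 26110),
  total.registration = c(13275, 45769, 62497, 78028)
)

# Share of registered voters who cast a ballot in each election
turnout$rate <- turnout$total.turnout / turnout$total.registration
print(round(turnout$rate, 3))

# Unweighted average across the four elections
mean.rate <- mean(turnout$rate)
print(round(mean.rate, 3))
```

The per-year rates range from roughly 22% to 41%, and the 2001 figure is the one most exposed to the data problems described above.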

We use the district.analyze function of aggpol to aggregate all electoral results into a summary for each precinct.

hd013s <- district.analyze(hd013)

hd013s is a data frame with columns calculated for every precinct; several values for each major party and other values for the precinct as a whole. Those statistics are:

  • Aggregate base partisan vote – The lowest non-zero turnout for a major party, in all electoral years.
  • Average Party Performance – The average percentage of the vote a party receives in the closest 3 elections in recent years.
  • Swing vote – The part of the electorate not included in the aggregate base partisan vote.
  • Soft-partisan vote – The average worst a party has performed, minus the actual worst.
  • Toss-up – The portion of the electorate not included in the Aggregate base or soft-base partisan vote.
  • Partisan base – The combined aggregate-base and soft-partisan vote for each major party.
  • Partisan swing – The combined major party swing vote.
  • Projected turnout – The portion of the electorate that is projected to turn out given previous turnout and current registration data.

These variables can be visualized with the following graphic, adapted (along with the definitions above) from Campaign Craft (Burton, Shea).

[Figure: targeting variables, adapted from Campaign Craft]
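
To make some of these definitions concrete, here is a small sketch for a single made-up precinct. The vote shares are invented for illustration, and the code reflects one reading of the definitions (in particular, "closest" elections are taken to mean those with the smallest two-party margin); it is not aggpol's implementation.

```r
# Hypothetical major-party vote shares in one precinct across five elections
results <- data.frame(
  year = c(2001, 2003, 2005, 2006, 2007),
  dem  = c(0.40, 0.38, 0.47, 0.51, 0.46),
  rep  = c(0.59, 0.61, 0.52, 0.48, 0.53)
)

# Aggregate base partisan vote: the lowest non-zero share a party has received
base.dem <- min(results$dem[results$dem > 0])
base.rep <- min(results$rep[results$rep > 0])

# Swing vote: the part of the electorate outside both parties' bases
swing <- 1 - (base.dem + base.rep)

# Average Party Performance: mean share in the three closest elections,
# where "closest" is assumed to mean the smallest two-party margin
margin <- abs(results$dem - results$rep)
closest <- order(margin)[1:3]
app.dem <- mean(results$dem[closest])
app.rep <- mean(results$rep[closest])

print(c(base.dem = base.dem, base.rep = base.rep, swing = swing,
        app.dem = app.dem, app.rep = app.rep))
```

In this toy precinct the three closest elections are 2005, 2006, and 2007, so the two landslide years do not drag the Democratic APP down.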

The actual columns in the data frame returned from district.analyze are:

  • proj.turnout.percent – The projected turnout percent for a hypothetical next election.
  • proj.turnout.count – The projected number of voters who will turn out for a hypothetical next election.
  • current.reg – Current number of registered voters in a precinct.
  • partisan.base – The combined aggregate-base and soft-partisan vote for both major parties (Partisan base).
  • partisan.swing – All non-base voters (1.0 – partisan.base).
  • tossup – The portion of the electorate not in the base or soft support of either major party.
  • app.rep – The average party performance of a Republican candidate in this precinct.
  • base.rep – The aggregate base partisan vote for a Republican candidate in this precinct.
  • soft.rep – The soft partisan vote for a Republican candidate in this precinct.
  • app.dem – The average party performance of a Democratic candidate in this precinct.
  • base.dem – The aggregate base partisan vote for a Democratic candidate in this precinct.
  • soft.dem – The soft partisan vote for a Democratic candidate in this precinct.
  • partisan.rep – Combination of aggregate base and soft vote percentages for the Republican.
  • partisan.dem – Combination of aggregate base and soft vote percentages for the Democrat.

The most useful statistic above is the Average Party Performance (APP), which is an average of major-party turnout in the 3 closest recent elections. The APP describes supporter levels for a best-case scenario in a close election. We’ve already calculated the APP of each major party (app.dem, app.rep), but when a race doesn’t have a third party candidate what we’ll usually visualize is the share of the combined partisan performance that each party receives. We’ll add these variables to our summary data frame generated previously, one for each major party.

hd013s$dem.share <- hd013s$app.dem/(hd013s$app.dem+hd013s$app.rep)
hd013s$rep.share <- hd013s$app.rep/(hd013s$app.dem+hd013s$app.rep)

Now that we have the APP and partisan vote share for each party, we can visualize the precinct-level terrain for the Democratic challenger Mr. Bell. This visualization should show us the Democratic support for each precinct and give us an idea of which precincts could be competitive. We’ll produce this visualization using a density plot + 1d histogram, adapted from the seatsVotes plot in the pscl package. We’ll also draw a cut-line down the 50% vote mark to help find competitive precincts.

qplot(dem.share, data=hd013s, geom=c("density","rug"),
    xlab="Dem Vote Share",
    main="Democratic vote share, by precinct")  + 
  geom_vline(xintercept=.50)

[Figure: Democratic vote share by precinct, density plot]

We can see a lot of precincts are between 48% and 53% Democratic, which means those precincts could potentially go for either candidate. We need to classify these results into something more solid. Let’s say precincts with less than 48% Democratic share are Safe Republican, 48-52% are Tossup, and greater than 52% are Safe Democrat. This is a simple representation but can be refined later. We’ll add a seat classification to our data frame using the cut function:

hd013s$cl <- cut(hd013s$dem.share, breaks=c(0,.48,.52,1), labels=c("Safe Rep", "Tossup", "Safe Dem"))
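
One detail of cut worth knowing is how it treats values that land exactly on a break. The sketch below applies the same breaks to a handful of shares, some taken from the precinct listing in this post and two chosen to sit on the boundaries; the vector names are ours.

```r
# A few Democratic shares, including values exactly on the 0.48 and 0.52 breaks
dem.share.sample <- c(0.4323, 0.48, 0.4998, 0.52, 0.5285)

# Same breaks and labels as above; intervals are right-closed by default: (0,0.48], (0.48,0.52], (0.52,1]
cl.sample <- cut(dem.share.sample,
                 breaks = c(0, 0.48, 0.52, 1),
                 labels = c("Safe Rep", "Tossup", "Safe Dem"))

print(data.frame(dem.share = dem.share.sample, cl = cl.sample))
print(table(cl.sample))
```

Because the intervals are closed on the right, a share of exactly 0.48 is rated Safe Rep and exactly 0.52 is rated Tossup. A share of exactly 0 would come back as NA unless include.lowest=TRUE is passed.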

Now we need to visualize how many precincts fall into which classification, using a histogram this time instead of a density curve.

ggplot(hd013s, aes(x=dem.share)) + 
  geom_histogram(aes(fill=cl),binwidth=0.01) + 
  scale_fill_brewer("Precinct Rating", palette="RdYlBu") + 
  scale_x_continuous("Democratic Vote Share") + 
  scale_y_continuous("Frequency")

[Figure: Democratic vote share by precinct, histogram colored by precinct rating]

From the histogram we see that not only does a Republican candidate enjoy more “Safe” precincts, but even the majority of the tossup precincts have less than 50% Democratic share. While the precinct breakdown looks bad, a Democratic win in this district is theoretically possible if these tossup precincts are held. A Democratic candidate will face a tough challenge, so the next step will be identifying Democratic and Democrat-leaning precincts to target.

To make this target precinct list we’ll need a method to prioritize the precincts so that we can reach the most persuadable voters while spending the least resources. A popular method to identify a precinct as high-value is to sort precincts by lowest projected turnout with highest Democratic vote share. Lower turnout means there are registered voters waiting to be convinced to show up, and high Democratic vote share means more of those voters will be Democrats.

Since we measured both of these values (turnout %, Democratic vote share), it is very easy to order our data by turnout (ascending) and Democratic average party performance (descending) using R.

hd013s[order(hd013s$proj.turnout.percent,-hd013s$app.dem),c(1:2,20:21)]
   precinct_name                 proj.turnout.percent dem.share rep.share
25 153 - 409 - SUDLEY NORTH                    0.1959    0.5105    0.4894
27 153 - 411 - MULLEN                          0.2218    0.5026    0.4973
 4 107 - 111 - BRIAR WOODS                     0.2256    0.4837    0.5162
 6 107 - 212 - CLAUDE MOORE PARK               0.2279    0.5285    0.4714
26 153 - 410 - MOUNTAIN VIEW                   0.2319    0.4945    0.5054
16 153 - 110 - BUCKLAND MILLS                  0.2448    0.4891    0.5108
13 153 - 106 - ELLIS                           0.2475    0.5038    0.4961
 5 107 - 112 - FREEDOM                         0.2509    0.5028    0.4971
15 153 - 108 - VICTORY                         0.2645    0.5005    0.4994
24 153 - 408 - GLENKIRK                        0.2837    0.4998    0.5001
 1 107 - 106 - EAGLE RIDGE                     0.2856    0.4992    0.5007
18 153 - 112 - CEDAR POINT                     0.2876    0.4855    0.5144
14 153 - 107 - MARSTELLER                      0.3067    0.4775    0.5224
 3 107 - 109 - HUTCHISON                       0.3168    0.4857    0.5142
 2 107 - 108 - MERCER                          0.3281    0.5034    0.4965
17 153 - 111 - BRISTOW RUN                     0.3324    0.4822    0.5177
23 153 - 406 - ALVEY                           0.3460    0.4736    0.5263
21 153 - 402 - BATTLEFIELD                     0.3546    0.4323    0.5676
10 153 - 102 - BENNETT                         0.3896    0.4959    0.5040
19 153 - 209 - WOODBINE                        0.4014    0.4651    0.5348
 7 107 - 307 - MIDDLEBURG                      0.4043    0.4953    0.5046
 9 153 - 101 - BRENTSVILLE                     0.4180    0.4904    0.5095
22 153 - 403 - BULL RUN                        0.4226    0.4860    0.5139
20 153 - 401 - EVERGREEN                       0.4283    0.5006    0.4993
12 153 - 104 - NOKESVILLE                      0.4537    0.4960    0.5039
11 153 - 103 - BUCKHALL                        0.4636    0.4773    0.5226
 8 107 - 309 - ALDIE                           0.4687    0.4881    0.5118

This sorted list is our critical intelligence for finding persuadable voters, but we need a better way to visualize the output. Since we have two continuous variables (turnout %, Democratic vote share) we can use a scatter plot with the Democratic vote share on the Y axis and Turnout % on the X. We’ll also color each precinct with the seat classification we defined earlier (Safe Republican, Tossup, Safe Democrat):

ggplot(aes(x=proj.turnout.percent, y=dem.share), data=hd013s) + 
  geom_point(aes(colour=cl)) + 
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type")

[Figure: Democratic vote share vs. projected turnout, colored by seat type]

This chart echoes what we’ve seen previously: the Democratic challenger faces an uphill battle, but there is room for a win. We see a single “Safe Democrat” precinct with very low turnout, and five “Safe Republican” precincts spread across the turnout range. Given the high number of “Tossup” precincts, and the fact that they also run the gamut as far as turnout is concerned, we’ll need to incorporate additional information into our prioritization. If we also rank precincts by current voter registration, we can focus on precincts where we stand to gain the most ground.

Before we continue, we need to make sure there is enough difference in precinct-to-precinct registration to have an impact. Let’s look at some statistics for the current registration in this district.

mean(hd013s$current.reg)
[1] 2970.111
sd(hd013s$current.reg)
[1] 1014.072

There are on average 2,970 current registered voters in each precinct, but the standard deviation is 1,014 voters. A standard deviation that high tells us we need to take into account registration if we want to focus on the precincts with 4000 people and not 1000 people. A histogram of current registration will help us clarify this finding:

qplot(current.reg, data=hd013s, geom="histogram",binwidth=500,xlab="Current Registration") + 
  scale_y_continuous("Frequency")

[Figure: Current registration histogram]

The histogram bears out the standard deviation: we see some very small precincts and some large precincts, but the majority are somewhere in the 2000-4000 range. The difference looks to be large enough to include current registration in our ranking.
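
These figures can be checked without aggpol by typing in the current.reg values from the ranked precinct listing in this post. The sketch below does that and adds two summary numbers of our own choosing: the coefficient of variation and the share of precincts in the 2000-4000 range.

```r
# current.reg for the 27 precincts, hand-copied from the ranked precinct table in this post
current.reg <- c(2497, 3555, 2288, 3115, 3749, 3646, 1303, 3929, 4874,
                 2175, 2531, 3497, 3669, 3722, 3229, 3031, 4403, 3851,
                 4440, 2406, 1239, 1708, 3111, 2535, 2501, 2287, 902)

reg.mean <- mean(current.reg)
reg.sd <- sd(current.reg)

# Coefficient of variation: spread relative to the typical precinct size
reg.cv <- reg.sd / reg.mean

# Share of precincts with between 2000 and 4000 registered voters
in.range <- mean(current.reg >= 2000 & current.reg <= 4000)

print(round(c(mean = reg.mean, sd = reg.sd, cv = reg.cv, in.range = in.range), 3))
```

Twenty of the 27 precincts fall in the 2000-4000 range, and the standard deviation is about a third of the mean, which supports treating registration as a ranking factor.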

We need to look at the Democratic Vote Share vs Turnout % scatter plot again, but with the points scaled to the current precinct registration.

qplot(proj.turnout.percent,dem.share,size=current.reg, data=hd013s,colour=cl)+ 
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type",size="Current Registration")

[Figure: Democratic vote share by Turnout %]

This plot is almost complete and ready to be analyzed. The last job is to label the points with their precinct names. Our current precinct_name variable is actually a unique identifier with a FIPS county code, a precinct code, and a name, and it is too long for a point label. We’ll shrink it down to just the name and then we’ll recreate the scatter plot with the label:

# replace the fips code and precinct number w/ an empty string
hd013s$precinct.label <- sub("^[0-9]+ - [0-9]+ - ",'',as.character(hd013s$precinct_name))
# plot the previous graph again but this time use precinct.label as the label
ggplot(hd013s, aes(x=proj.turnout.percent, y=dem.share,label=precinct.label)) + 
  geom_point(aes(colour=cl,size=current.reg)) + 
  geom_text(size=2.5,vjust=1.5,angle=25) + 
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type",size="Current Registration")

[Figure: Democratic vote share vs. projected turnout, colored by seat type, sized by current registration, labeled by precinct]

From the chart we can see that a Democrat in HD#013 will want to focus contact efforts on the precincts in the upper-left corner of the plot (low projected turnout, high Democratic vote share) and will want to target larger precincts before smaller ones. Integrating the current registration into our previous sort command leaves us with the following sort order:

# registration is sorted descending so that larger precincts come first when the other keys tie
hd013s[order(hd013s$proj.turnout.percent,-hd013s$app.dem,-hd013s$current.reg),c(1:2,4,20:22)]
   precinct_name                 proj.turnout.percent current.reg dem.share rep.share cl
25 153 - 409 - SUDLEY NORTH                    0.1959        2497    0.5105    0.4894 Tossup
27 153 - 411 - MULLEN                          0.2218        3555    0.5026    0.4973 Tossup
 4 107 - 111 - BRIAR WOODS                     0.2256        2288    0.4837    0.5162 Tossup
 6 107 - 212 - CLAUDE MOORE PARK               0.2279        3115    0.5285    0.4714 Safe Dem
26 153 - 410 - MOUNTAIN VIEW                   0.2319        3749    0.4945    0.5054 Tossup
16 153 - 110 - BUCKLAND MILLS                  0.2448        3646    0.4891    0.5108 Tossup
13 153 - 106 - ELLIS                           0.2475        1303    0.5038    0.4961 Tossup
 5 107 - 112 - FREEDOM                         0.2509        3929    0.5028    0.4971 Tossup
15 153 - 108 - VICTORY                         0.2645        4874    0.5005    0.4994 Tossup
24 153 - 408 - GLENKIRK                        0.2837        2175    0.4998    0.5001 Tossup
 1 107 - 106 - EAGLE RIDGE                     0.2856        2531    0.4992    0.5007 Tossup
18 153 - 112 - CEDAR POINT                     0.2876        3497    0.4855    0.5144 Tossup
14 153 - 107 - MARSTELLER                      0.3067        3669    0.4775    0.5224 Safe Rep
 3 107 - 109 - HUTCHISON                       0.3168        3722    0.4857    0.5142 Tossup
 2 107 - 108 - MERCER                          0.3281        3229    0.5034    0.4965 Tossup
17 153 - 111 - BRISTOW RUN                     0.3324        3031    0.4822    0.5177 Tossup
23 153 - 406 - ALVEY                           0.3460        4403    0.4736    0.5263 Safe Rep
21 153 - 402 - BATTLEFIELD                     0.3546        3851    0.4323    0.5676 Safe Rep
10 153 - 102 - BENNETT                         0.3896        4440    0.4959    0.5040 Tossup
19 153 - 209 - WOODBINE                        0.4014        2406    0.4651    0.5348 Safe Rep
 7 107 - 307 - MIDDLEBURG                      0.4043        1239    0.4953    0.5046 Tossup
 9 153 - 101 - BRENTSVILLE                     0.4180        1708    0.4904    0.5095 Tossup
22 153 - 403 - BULL RUN                        0.4226        3111    0.4860    0.5139 Tossup
20 153 - 401 - EVERGREEN                       0.4283        2535    0.5006    0.4993 Tossup
12 153 - 104 - NOKESVILLE                      0.4537        2501    0.4960    0.5039 Tossup
11 153 - 103 - BUCKHALL                        0.4636        2287    0.4773    0.5226 Safe Rep
 8 107 - 309 - ALDIE                           0.4687         902    0.4881    0.5118 Tossup

Now that we have our ranking, we can figure out how much each precinct might offer. Let’s first see the number of votes required to win the seat and the number of votes we’re projected to receive given the calculated APP, previous turnout, and current registration. The district.summary function will provide us with all this information:

district.summary(hd013s)[,c(1,2,9,10,11)]
  current.reg proj.turnout.count votes.to.win proj.turnout.rep proj.turnout.dem
1       80193              25401      12701.5            12499            12074

We can see that the projected turnout (proj.turnout.count) is about 25,401, so the number of votes needed to win this district is only 12,702. Using the Democratic APP, we can project Democratic turnout at 12,074, so we need to find 628 votes to win. How do we find these votes?
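
The arithmetic behind those numbers is short enough to spell out. In the sketch below the formula for votes.to.win (half of projected turnout, plus one) is our assumption; it reproduces the 12701.5 reported by district.summary but is not taken from the aggpol source.

```r
# Values from the district.summary output above
proj.turnout.count <- 25401
proj.turnout.dem <- 12074

# Assumed formula: half of projected turnout plus one vote
votes.to.win <- proj.turnout.count / 2 + 1

# Round up to whole votes, then subtract the votes the Democrat is already projected to get
votes.needed <- ceiling(votes.to.win) - proj.turnout.dem

print(c(votes.to.win = votes.to.win, votes.needed = votes.needed))
```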

Let’s go back to our sorted precinct list, take the top third, and call them our target.precincts.

sorted.precincts <- hd013s[order(hd013s$proj.turnout.percent,-hd013s$app.dem,-hd013s$current.reg),]
target.precincts <- sorted.precincts[1:(nrow(sorted.precincts)/3),]

We’ve got our target list, and we know we need 628 votes from them to bring our total to 50% + 1. Adding a small buffer to that number, we’ll take 640 target votes and allocate them across our target precincts, proportional to the number of registered voters in the precinct. Hopefully, this will set more realistic goals for larger and smaller precincts.

target.precincts$inc <- as.integer(640 * target.precincts$current.reg/sum(target.precincts$current.reg))
 
target.precincts[,c(2,3,17,23,18,20:22,24)]
precinct.label    proj.turnout.percent proj.turnout.count proj.turnout.dem proj.turnout.rep dem.share rep.share cl       inc
SUDLEY NORTH                    0.1959                489              248              238    0.5105    0.4894 Tossup    55
MULLEN                          0.2218                788              391              387    0.5026    0.4973 Tossup    78
BRIAR WOODS                     0.2256                516              243              259    0.4837    0.5162 Tossup    50
CLAUDE MOORE PARK               0.2279                709              366              326    0.5285    0.4714 Safe Dem  68
MOUNTAIN VIEW                   0.2319                869              427              437    0.4945    0.5054 Tossup    82
BUCKLAND MILLS                  0.2448                892              431              450    0.4891    0.5108 Tossup    80
ELLIS                           0.2475                322              160              158    0.5038    0.4961 Tossup    28
FREEDOM                         0.2509                986              492              487    0.5028    0.4971 Tossup    86
VICTORY                         0.2645               1289              638              637    0.5005    0.4994 Tossup   107

The final column in the result is the target increase for that precinct (column: ‘inc’). With this information in hand the campaign field operations can devise a contact strategy to bring these voters to the polls on election day.
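
Because as.integer truncates each share, the nine increments above add up to 634 rather than 640 (still above the 628 we need). If the targets should sum to the goal exactly, one option is a largest-remainder allocation, sketched here with the registration figures from the table; this is a variation of ours, not something aggpol provides.

```r
# current.reg for the nine target precincts, copied from the ranked table in this post
reg <- c("SUDLEY NORTH" = 2497, "MULLEN" = 3555, "BRIAR WOODS" = 2288,
         "CLAUDE MOORE PARK" = 3115, "MOUNTAIN VIEW" = 3749,
         "BUCKLAND MILLS" = 3646, "ELLIS" = 1303, "FREEDOM" = 3929,
         "VICTORY" = 4874)
goal <- 640

# Proportional share of the goal for each precinct, before rounding
raw <- goal * reg / sum(reg)

# Start from the truncated shares (what as.integer produces)
inc <- floor(raw)
shortfall <- goal - sum(inc)

# Hand the leftover votes to the precincts with the largest fractional parts
bump <- order(raw - inc, decreasing = TRUE)[seq_len(shortfall)]
inc[bump] <- inc[bump] + 1

print(inc)
print(sum(inc))
```

Each precinct's target moves by at most one vote, so the practical difference is small; the benefit is simply that the precinct goals and the district goal agree.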

Conclusion

Playing the role of campaign consultant, we have analyzed previous electoral outcomes in the 13th seat of the House of Delegates in Virginia. We have shown how a Democratic candidate can leverage increasing Democratic support and low turnout to make this race competitive. We have also created a precinct targeting methodology that provides a high-level blueprint for resource planning. The analysis we performed is very standard, but using R makes our methodology unique. A down-ballot or primary-challenger campaign taking advantage of this methodology will spend less money and can experiment more on its targeting, potentially leading it to a win.

Are you a Democrat running for the Virginia House of Delegates who would like to see the same data for your race? Or, are you a Democratic congressional candidate preparing for the 2010 cycle? Contact me at [email protected] for robust targeting data or other analysis.
