Aggregate electoral targeting with R
Electoral targeting is the process of quantifying the partisan bias of a single voter or subset of voters in a geographic region. Bias can be calculated using an individual’s demographic and voting behavior or by aggregating results from an entire election precinct. Targeting is traditionally performed by national committees (e.g., National Committee for an Effective Congress, National Republican Congressional Committee), state political parties, interest groups (e.g., EMILY’s List, National Rifle Association), or campaign consultants. Targeting data is consumed by campaign managers and analysts, and it is used along with polling data to build strategy, direct resources, and project electoral outcomes.
While aggregate electoral targeting can build a sophisticated picture of a district, the mathematics behind targeting are very simple. Targeting can be performed by anyone with previous electoral data, and calculations can be done using 3×5 note cards, with simple spreadsheets, or high-end software packages like SPSS. The targeting methods discussed in this post are taken from academic publications on electioneering: Campaign Craft (Burton, Shea 2006) and The Campaign Manager (Shaw, 2004).
Although targeting data is usually inexpensive or free, a down-ballot campaign or a primary challenger might not have the connections or support of a PAC or party to obtain the data. In these cases, a campaign will probably purchase one of the books listed above to perform its own analysis. Even an established campaign may run its own analysis, possibly to test different turnout theories or to integrate additional data. This post is directed towards these groups.
Together, we will assume the role of campaign consultant and perform an aggregate electoral analysis on the 13th House of Delegates seat (HOD#13) in the Commonwealth of Virginia. In HOD#13, the 18-year Republican incumbent Bob Marshall is being challenged by Democrat John Bell. This analysis will compute and visualize turnout, partisan bias, and a precinct ranking based on projected turnout and historical Democratic support.
The analysis of HOD#13 will be performed using R, an open-source computing platform. R is free, extensible, and interactive, making it an ideal platform for experimentation. The R package aggpol was created specifically for this tutorial, and it contains all the data and operations required to execute an aggregate electoral analysis. Readers can execute the provided R code to reproduce the analysis or simply follow along to learn how it was performed. Readers unfamiliar with R should read Introduction to R, which is available on the R project homepage.
The electoral and registration data used were compiled from the Virginia State Board of Elections using several custom written parsers and two different PDF-to-text engines. Please contact me for source data or more information at: [email protected].
This section only applies to readers interested in recreating the analysis and graphics produced in this tutorial. To completely recreate this analysis you will need the following:
- The latest version of the R statistical computing environment. Binaries, source, and installation instructions can be downloaded from the R homepage.
- Additional R packages. This analysis requires several packages that provide additional functionality on top of the existing R system. Install the appropriate R environment for your system and run the program.
- plyr, ggplot2, RColorBrewer. To install these packages execute the following in your R environment:
install.packages(c("plyr", "RColorBrewer", "ggplot2"))
- Next you’ll need to install the aggpol package for calculating aggregate political statistics. You will need to download the latest version: For Unix-style systems or For Windows systems. Installation of local packages is detailed in the R Manual on package installation.
Now that the prerequisites are installed we can get started with our data analysis. Start up your R environment and load the required libraries by typing in the following commands:
library(plyr)
library(aggpol)
library(ggplot2)
library(RColorBrewer)
We need to attach the VAHOD data set that comes with aggpol. This data set contains precinct-level electoral returns for state and federal elections in the Commonwealth of Virginia from 2001 to 2008. Since we are focusing on HOD#13, we’ll need to select just the records that have to do with that seat.
data(VAHOD)
hd013 <- vahod[which(vahod$seat == "HD-013"),]
The data set contains precinct-level electoral results for the following races: U.S. President, U.S. Senate, U.S. House of Representatives, Virginia Governor, Senate of Virginia, and Virginia House of Delegates. This breadth of electoral returns allows us to build a very detailed profile of the partisan bias of a district.
We will first determine the historical partisanship in HOD#13. Since partisanship can fluctuate over the years and different seats have different turnout expectations, we’ll first need to see the major party support for every seat in each election for precincts in HOD#13. We can use the historical.election.summary function from the aggpol package to group the precinct results into district results, and then break them down by seat and year.
esum <- historical.election.summary(hd013)
esum now contains:
|13 more lines…|
We now have major-party turnout for every election in our data set. To best visualize the results we’ll build a bar graph comparing major-party turnout in each seat over time. We first need to transpose the election summary object (esum) from a summary format to an observation format, one line per distinct year+district+party. The plyr package makes this task extremely simple.
elx <- ddply(esum,c("year", "district_type"), function(x)
  rbind( data.frame(party="REP",turnout=x$rep.turnout.percent),
         data.frame(party="DEM",turnout=x$dem.turnout.percent)))
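To make the reshaping step concrete, here is a minimal base-R sketch of the same wide-to-long transformation on a toy summary. The numbers in esum.toy are invented for illustration and are not real results; only the column names mirror the ones used above.

```r
# Toy stand-in for esum: one row per year + district type.
# The values below are made up purely to show the reshape.
esum.toy <- data.frame(
  year = c(2005, 2007),
  district_type = c("HD", "HD"),
  rep.turnout.percent = c(0.60, 0.58),
  dem.turnout.percent = c(0.40, 0.42),
  stringsAsFactors = FALSE
)

# Stack one block of rows per party so each row is a single observation
keys <- esum.toy[, c("year", "district_type")]
elx.toy <- rbind(
  data.frame(keys, party = "REP", turnout = esum.toy$rep.turnout.percent,
             stringsAsFactors = FALSE),
  data.frame(keys, party = "DEM", turnout = esum.toy$dem.turnout.percent,
             stringsAsFactors = FALSE)
)
print(elx.toy)
```

Two summary rows become four observation rows, one per year, district type, and party, which is the shape ggplot2 expects for a filled bar chart.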
We will now use the powerful ggplot2 package to view the Republican and Democratic support for each election, in each seat, for our subset:
ggplot(elx,aes(year,turnout,fill=factor(party))) +
  geom_bar(stat="identity") +
  facet_wrap(~district_type,scales="free_x") +
  scale_fill_brewer(palette="Set1")
This graphic gives us a decent understanding of district-level electoral trends. For U.S. federal elections (figs.: PVP, USH, USS), we can see a distinct drop in Republican support moving towards 2008; the results for U.S. House (USH) and U.S. Senate (USS), in particular, show a strong increase in Democratic support. This growth correlates to statewide trends that resulted in the election of two Democratic Senators representing Virginia for the first time since 1970. General Democratic gains notwithstanding, the House of Delegates (fig.: HD) results aren’t as promising for a Democratic challenger. The incumbent Del. Marshall saw more than 60% support in three of the last four elections and saw no challenger at all in 2003. While the district may be trending more Democratic over time, the voters of HOD#13 are obviously big fans of Del. Marshall.
Now that we understand the historical partisanship of this district we need to understand historical turnout, allowing us to project the number of votes required to win. We will utilize the historical.turnout.summary function from the aggpol package to produce a summary of turnout for this district.
historical.turnout.summary(hd013, district.type="HD", district.number="013", years=c(2001,2003,2005,2007))
Looking at this table, one can see some data collection problems in the 2001 HD elections. In recent years, precincts belonged to only one House of Delegates seat, but in 2001, and somewhat less so in 2003, some precincts are split, some have duplicate names, and there is no information on how to allocate results from different races to precincts. The turnout numbers are slightly affected by these problems, but the aggpol package attempts to correct this by substituting alternate years or even races if possible.
The takeaway from the previous table is that turnout for the last four House of Delegates elections has hovered around 30%. This makes some political sense, because Virginia holds state elections in odd-numbered years with no federal elections to drive up turnout. This leaves a lot of registered voters to be activated, but we need to delve down to the precinct level to find them.
We use the district.analyze function of aggpol to aggregate all electoral results into a summary for each precinct.
hd013s <- district.analyze(hd013)
hd013s is a data frame with columns calculated for every precinct; several values for each major party and other values for the precinct as a whole. Those statistics are:
- Aggregate base partisan vote – The lowest non-zero turnout for a major party, in all electoral years.
- Average Party Performance – The average percentage of the vote a party receives in the closest 3 elections in recent years.
- Swing vote – The part of the electorate not included in the aggregate base partisan vote.
- Soft-partisan vote – The average worst a party has performed, minus the actual worst.
- Toss-up – The portion of the electorate not included in the Aggregate base or soft-base partisan vote.
- Partisan base – The combined aggregate-base and soft-partisan vote for each major party.
- Partisan swing – The combined major party swing vote.
- Projected turnout – The portion of the electorate that is projected to turn out given previous turnout and current registration data.
These variables can be visualized with the following graphic, adapted, along with the definitions above, from Campaign Craft (Burton, Shea).
The actual columns in the data frame returned from district.analyze are:
- proj.turnout.percent – The projected turnout percent for a hypothetical next election.
- proj.turnout.count – The projected number of voters who will turn out for a hypothetical next election.
- current.reg – Current number of registered voters in a precinct.
- partisan.base – The combined aggregate-base and soft-partisan vote for both major parties ( Partisan base ).
- partisan.swing – All non-base voters (1.0 – partisan.base).
- tossup – The portion of the electorate not in the base or soft support of either major party.
- app.rep – The average party performance of a Republican candidate in this precinct.
- base.rep – The aggregate base partisan vote for a Republican candidate in this precinct.
- soft.rep – The soft partisan vote for a Republican candidate in this precinct.
- app.dem – The average party performance of a Democratic candidate in this precinct.
- base.dem – The aggregate base partisan vote for a Democratic candidate in this precinct.
- soft.dem – The soft partisan vote for a Democratic candidate in this precinct.
- partisan.rep – Combination of aggregate base and soft vote percentages for the Republican.
- partisan.dem – Combination of aggregate base and soft vote percentages for the Democrat.
The most useful statistic above is the Average Party Performance (APP), which is an average of major-party turnout in the 3 closest recent elections. The APP describes supporter levels for a best-case scenario in a close election. We’ve already calculated the APP of each major party (app.dem, app.rep), but when a race doesn’t have a third party candidate what we’ll usually visualize is the share of the combined partisan performance that each party receives. We’ll add these variables to our summary data frame generated previously, one for each major party.
hd013s$dem.share <- hd013s$app.dem/(hd013s$app.dem+hd013s$app.rep)
hd013s$rep.share <- hd013s$app.rep/(hd013s$app.dem+hd013s$app.rep)
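For readers who want to see how these pieces fit together, the following is a rough sketch of the base vote, APP, and two-party share for a single hypothetical precinct. It follows the definitions given above, but it is not the aggpol implementation: the vote percentages are invented, and treating "closest" as the smallest Democratic-Republican margin is our assumption.

```r
# Hypothetical vote percentages for one precinct (invented numbers).
# 2003 stands in for an uncontested race, so the Democratic share is zero.
history <- data.frame(
  year = c(2001, 2003, 2005, 2006, 2007, 2008),
  dem  = c(0.38, 0.00, 0.44, 0.49, 0.41, 0.52),
  rep  = c(0.62, 1.00, 0.56, 0.51, 0.59, 0.48)
)

# Aggregate base partisan vote: lowest non-zero result for each party
base.dem <- min(history$dem[history$dem > 0])
base.rep <- min(history$rep[history$rep > 0])

# Assumption: "closest" means the smallest margin between the two parties
margin  <- abs(history$dem - history$rep)
closest <- history[order(margin)[1:3], ]

# Average Party Performance over the three closest elections
app.dem <- mean(closest$dem)
app.rep <- mean(closest$rep)

# Swing vote: the electorate outside a party's base
swing.dem <- 1 - base.dem

# Two-party share, using the same formula as the hd013s columns above
dem.share <- app.dem / (app.dem + app.rep)
rep.share <- app.rep / (app.dem + app.rep)

print(round(c(base.dem = base.dem, app.dem = app.dem, dem.share = dem.share), 4))
```

Filtering out the zero before taking the minimum matters here: without it, the uncontested year would drag the Democratic base down to nothing.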
Now that we have the APP and partisan vote share for each party, we can visualize the precinct-level terrain for the Democratic challenger Mr. Bell. This visualization should show us the Democratic support for each precinct and give us an idea of which precincts could be competitive. We’ll produce this visualization using a density plot + 1d histogram, adapted from the seatsVotes plot in the pscl package. We’ll also draw a cut-line down the 50% vote mark to help find competitive precincts.
qplot(dem.share, data=hd013s, geom=c("density","rug"),
  xlab="Dem Vote Share",
  main="Democratic vote share, by precinct") +
  geom_vline(xintercept=.50)
We can see a lot of precincts are between 48% and 53% Democratic, which means those precincts could potentially go for either candidate. We need to classify these results into something more solid. Let’s say precincts with less than 48% Democratic share are Safe Republican, 48-52% are Tossup, and greater than 52% are Safe Democrat. This is a simple representation but can be refined later. We’ll add a seat classification to our data frame using the cut function:
hd013s$cl <- cut(hd013s$dem.share, breaks=c(0,.48,.52,1), labels=c("Safe Rep", "Tossup", "Safe Dem"))
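To see how cut assigns the labels, here is a small self-contained sketch that applies the same breaks to a handful of Democratic vote shares copied from the sorted precinct table later in this post. It also checks the boundary values, since cut closes each interval on the right by default.

```r
# Democratic vote shares for five precincts, taken from the sorted table
dem.share <- c("SUDLEY NORTH"      = 0.5105,
               "CLAUDE MOORE PARK" = 0.5285,
               "MARSTELLER"        = 0.4775,
               "BATTLEFIELD"       = 0.4323,
               "EVERGREEN"         = 0.5006)

breaks <- c(0, .48, .52, 1)
labels <- c("Safe Rep", "Tossup", "Safe Dem")

# Same call as above, on a plain vector instead of the hd013s column
cl <- cut(dem.share, breaks = breaks, labels = labels)
print(data.frame(precinct = names(dem.share), dem.share = unname(dem.share), cl = cl))

# Intervals are (0,0.48], (0.48,0.52], (0.52,1], so a share of exactly
# 0.48 is rated Safe Rep and a share of exactly 0.52 is rated Tossup
edge <- cut(c(0.48, 0.52), breaks = breaks, labels = labels)
print(edge)
```

If the boundaries should fall the other way, cut accepts right=FALSE to close the intervals on the left instead.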
Now we need to visualize how many precincts fall into which classification, using a histogram this time instead of a density curve.
ggplot(hd013s, aes(x=dem.share)) +
  geom_bar(aes(fill=cl),binwidth=0.01) +
  scale_fill_brewer("Precinct Rating", palette="RdYlBu") +
  scale_x_continuous("Democratic Vote Share") +
  scale_y_continuous("Frequency")
From the histogram we see that not only does a Republican candidate enjoy more “Safe” precincts, but even the majority of the tossup precincts have less than 50% Democratic share. While the precinct breakdown looks bad, a Democratic win in this district is theoretically possible if these tossup precincts are held. A Democratic candidate will face a tough challenge, so the next step will be identifying Democratic and Democrat-leaning precincts to target.
To make this target precinct list we’ll need a method to prioritize the precincts so that we can reach the most persuadable voters while spending the least resources. A popular method to identify a precinct as high-value is to sort precincts by lowest projected turnout with highest Democratic vote share. Lower turnout means there are registered voters waiting to be convinced to show up, and high Democratic vote share means more of those voters will be Democrats.
Since we measured both of these values (turnout%, democratic vote share), it is very easy to order our data by turnout (ascending) and democratic average party performance (descending) using R.
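A minimal sketch of that ordering, using base R's order function on five rows copied from the table below. The data frame name and the use of dem.share as the second sort key are our choices for illustration; the full analysis would sort the hd013s data frame itself.

```r
# Five precincts copied from the sorted table, deliberately out of order
precincts <- data.frame(
  precinct = c("ALDIE", "CLAUDE MOORE PARK", "SUDLEY NORTH",
               "BATTLEFIELD", "MULLEN"),
  proj.turnout.percent = c(0.4687, 0.2279, 0.1959, 0.3546, 0.2218),
  dem.share = c(0.4881, 0.5285, 0.5105, 0.4323, 0.5026),
  stringsAsFactors = FALSE
)

# Turnout ascending first; the negated dem.share sorts descending and
# only comes into play when two precincts have identical turnout
ranked <- precincts[order(precincts$proj.turnout.percent,
                          -precincts$dem.share), ]
print(ranked)
```

Because projected turnout is a continuous value that rarely ties, the first key does nearly all of the work in this ordering.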
|row||precinct_name||proj.turnout.percent||dem.share||rep.share|
|25||153 – 409 – SUDLEY NORTH||0.1959||0.5105||0.4894|
|27||153 – 411 – MULLEN||0.2218||0.5026||0.4973|
|4||107 – 111 – BRIAR WOODS||0.2256||0.4837||0.5162|
|6||107 – 212 – CLAUDE MOORE PARK||0.2279||0.5285||0.4714|
|26||153 – 410 – MOUNTAIN VIEW||0.2319||0.4945||0.5054|
|16||153 – 110 – BUCKLAND MILLS||0.2448||0.4891||0.5108|
|13||153 – 106 – ELLIS||0.2475||0.5038||0.4961|
|5||107 – 112 – FREEDOM||0.2509||0.5028||0.4971|
|15||153 – 108 – VICTORY||0.2645||0.5005||0.4994|
|24||153 – 408 – GLENKIRK||0.2837||0.4998||0.5001|
|1||107 – 106 – EAGLE RIDGE||0.2856||0.4992||0.5007|
|18||153 – 112 – CEDAR POINT||0.2876||0.4855||0.5144|
|14||153 – 107 – MARSTELLER||0.3067||0.4775||0.5224|
|3||107 – 109 – HUTCHISON||0.3168||0.4857||0.5142|
|2||107 – 108 – MERCER||0.3281||0.5034||0.4965|
|17||153 – 111 – BRISTOW RUN||0.3324||0.4822||0.5177|
|23||153 – 406 – ALVEY||0.3460||0.4736||0.5263|
|21||153 – 402 – BATTLEFIELD||0.3546||0.4323||0.5676|
|10||153 – 102 – BENNETT||0.3896||0.4959||0.5040|
|19||153 – 209 – WOODBINE||0.4014||0.4651||0.5348|
|7||107 – 307 – MIDDLEBURG||0.4043||0.4953||0.5046|
|9||153 – 101 – BRENTSVILLE||0.4180||0.4904||0.5095|
|22||153 – 403 – BULL RUN||0.4226||0.4860||0.5139|
|20||153 – 401 – EVERGREEN||0.4283||0.5006||0.4993|
|12||153 – 104 – NOKESVILLE||0.4537||0.4960||0.5039|
|11||153 – 103 – BUCKHALL||0.4636||0.4773||0.5226|
|8||107 – 309 – ALDIE||0.4687||0.4881||0.5118|
This sorted list is our critical intelligence for finding persuadable voters, but we need a better way to visualize the output. Since we have two continuous variables (turnout %, Democratic vote share) we can use a scatter plot with the Democratic vote share on the Y axis and Turnout % on the X. We’ll also color each precinct with the seat classification we defined earlier (Safe Republican, Tossup, Safe Democrat):
ggplot(aes(x=proj.turnout.percent, y=dem.share), data=hd013s) +
  geom_point(aes(colour=cl)) +
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type")
This chart echoes what we’ve seen previously: the Democratic challenger faces an uphill battle, but there is room for a win. We see a single “Safe Democrat” precinct with very low turnout, and five “Safe Republican” precincts that run the board in turnout. Given the high number of “Tossup” precincts, and the fact that they run the gamut as far as turnout is concerned, we’ll need to incorporate additional information into our prioritization. If we also rank precincts by current voter registration, we can focus on precincts where we stand to gain the most ground.
Before we continue, we need to make sure there is enough difference in precinct-to-precinct registration to have an impact. Let’s look at some statistics for the current registration in this district.
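One way to get those statistics is with base R's mean, sd, and summary. The sketch below types in the 27 current registration figures from the final sorted table in this post so that it runs on its own; with the aggpol data loaded, the same calls would apply to hd013s$current.reg.

```r
# Current registration per precinct, copied from the final sorted table
current.reg <- c(2497, 3555, 2288, 3115, 3749, 3646, 1303, 3929, 4874,
                 2175, 2531, 3497, 3669, 3722, 3229, 3031, 4403, 3851,
                 4440, 2406, 1239, 1708, 3111, 2535, 2501, 2287, 902)

reg.mean <- mean(current.reg)  # average registered voters per precinct
reg.sd   <- sd(current.reg)    # sample standard deviation

print(round(c(mean = reg.mean, sd = reg.sd)))
print(summary(current.reg))    # min, quartiles, and max for context
```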
There are on average 2,970 current registered voters in each precinct, but the standard deviation is 1,014 voters. A standard deviation that high tells us we need to take into account registration if we want to focus on the precincts with 4000 people and not 1000 people. A histogram of current registration will help us clarify this finding:
qplot(current.reg, data=hd013s, geom="bar",binwidth=500,xlab="Current Registration") + scale_y_continuous("Frequency")
The standard deviation was correct: we see some very small precincts and some large precincts, but the majority are somewhere in the 2000-4000 range. The difference looks to be large enough to include current registration in our ranking.
We need to look at the Democratic Vote Share vs Turnout % scatter plot again, but with the points scaled to the current precinct registration.
qplot(proj.turnout.percent,dem.share,size=current.reg, data=hd013s,colour=cl) +
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type",size="Current Registration")
This plot is almost complete and ready to be analyzed. The last job is to label the points with their precinct names. Our current precinct_name variable is actually a unique identifier with a FIPS county code, a precinct code, and a name, and it is too long for a point label. We’ll shrink it down to just the name and then we’ll recreate the scatter plot with the label:
# replace the fips code and precinct number w/ an empty string
hd013s$precinct.label <- sub("^[0-9]+ - [0-9]+ - ",'',as.character(hd013s$precinct_name))
# plot the previous graph again but this time use precinct.label as the label
ggplot(hd013s, aes(x=proj.turnout.percent, y=dem.share,label=precinct.label)) +
  geom_point(aes(colour=cl,size=current.reg)) +
  geom_text(size=2.5,vjust=1.5,angle=25) +
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type",size="Current Registration")
From the chart we can see that a Democrat in HOD#13 will want to focus contact efforts on the precincts in the upper-left corner of the plot and will want to target larger precincts before smaller ones. Integrating the current registration into our previous sort command leaves us with the following sort order:
|row||precinct_name||proj.turnout.percent||current.reg||dem.share||rep.share||cl|
|25||153 – 409 – SUDLEY NORTH||0.1959||2497||0.5105||0.4894||Tossup|
|27||153 – 411 – MULLEN||0.2218||3555||0.5026||0.4973||Tossup|
|4||107 – 111 – BRIAR WOODS||0.2256||2288||0.4837||0.5162||Tossup|
|6||107 – 212 – CLAUDE MOORE PARK||0.2279||3115||0.5285||0.4714||Safe Dem|
|26||153 – 410 – MOUNTAIN VIEW||0.2319||3749||0.4945||0.5054||Tossup|
|16||153 – 110 – BUCKLAND MILLS||0.2448||3646||0.4891||0.5108||Tossup|
|13||153 – 106 – ELLIS||0.2475||1303||0.5038||0.4961||Tossup|
|5||107 – 112 – FREEDOM||0.2509||3929||0.5028||0.4971||Tossup|
|15||153 – 108 – VICTORY||0.2645||4874||0.5005||0.4994||Tossup|
|24||153 – 408 – GLENKIRK||0.2837||2175||0.4998||0.5001||Tossup|
|1||107 – 106 – EAGLE RIDGE||0.2856||2531||0.4992||0.5007||Tossup|
|18||153 – 112 – CEDAR POINT||0.2876||3497||0.4855||0.5144||Tossup|
|14||153 – 107 – MARSTELLER||0.3067||3669||0.4775||0.5224||Safe Rep|
|3||107 – 109 – HUTCHISON||0.3168||3722||0.4857||0.5142||Tossup|
|2||107 – 108 – MERCER||0.3281||3229||0.5034||0.4965||Tossup|
|17||153 – 111 – BRISTOW RUN||0.3324||3031||0.4822||0.5177||Tossup|
|23||153 – 406 – ALVEY||0.3460||4403||0.4736||0.5263||Safe Rep|
|21||153 – 402 – BATTLEFIELD||0.3546||3851||0.4323||0.5676||Safe Rep|
|10||153 – 102 – BENNETT||0.3896||4440||0.4959||0.5040||Tossup|
|19||153 – 209 – WOODBINE||0.4014||2406||0.4651||0.5348||Safe Rep|
|7||107 – 307 – MIDDLEBURG||0.4043||1239||0.4953||0.5046||Tossup|
|9||153 – 101 – BRENTSVILLE||0.4180||1708||0.4904||0.5095||Tossup|
|22||153 – 403 – BULL RUN||0.4226||3111||0.4860||0.5139||Tossup|
|20||153 – 401 – EVERGREEN||0.4283||2535||0.5006||0.4993||Tossup|
|12||153 – 104 – NOKESVILLE||0.4537||2501||0.4960||0.5039||Tossup|
|11||153 – 103 – BUCKHALL||0.4636||2287||0.4773||0.5226||Safe Rep|
|8||107 – 309 – ALDIE||0.4687||902||0.4881||0.5118||Tossup|
Now that we have our ranking, we can figure out how much each precinct might offer. Let’s first see the number of votes required to win the seat and the number of votes we’re projected to receive given the calculated APP, previous turnout, and current registration. The district.summary function will provide us with all this information:
We can see that the projected turnout (proj.turnout.count) is about 25,401, so the votes projected to win this district is only 12,702. Using the Democratic APP, we can project Democratic turnout at 12,074, so we need to find 628 votes to win. How do we find these votes?
Let’s go back to our sorted precinct list, take the top third, and call them our target.precincts.
sorted.precincts <- hd013s[order(hd013s$proj.turnout.percent,-hd013s$app.dem,hd013s$current.reg),]
target.precincts <- sorted.precincts[1:(nrow(sorted.precincts)/3),]
We’ve got our target list, and we know we need 628 votes from them to bring our total to 50% + 1. Adding a small buffer to that number, we’ll take 640 target votes and allocate them across our target precincts, proportional to the number of registered voters in the precinct. Hopefully, this will set more realistic goals for larger and smaller precincts.
target.precincts$inc <- as.integer(640 *
  target.precincts$current.reg/sum(target.precincts$current.reg))
target.precincts[,c(2,3,17,23,18,20:22,24)]
|CLAUDE MOORE PARK||0.2279||709||366||326||0.5285||0.4714||Safe Dem||68|
The final column in the result is the target increase for that precinct (column: ‘inc’). With this information in hand the campaign field operations can devise a contact strategy to bring these voters to the polls on election day.
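The allocation can be reproduced without the aggpol package from numbers already shown in this post. The sketch below types in the first nine precincts of the sorted table (the top third of 27) with their current registration and applies the same proportional formula; the data frame name target is ours.

```r
# Top third of the sorted precinct list, with current registration
target <- data.frame(
  precinct = c("SUDLEY NORTH", "MULLEN", "BRIAR WOODS",
               "CLAUDE MOORE PARK", "MOUNTAIN VIEW", "BUCKLAND MILLS",
               "ELLIS", "FREEDOM", "VICTORY"),
  current.reg = c(2497, 3555, 2288, 3115, 3749, 3646, 1303, 3929, 4874),
  stringsAsFactors = FALSE
)

votes.needed <- 628  # gap between the win number and projected Dem votes
votes.goal   <- 640  # the gap plus a small buffer

# Split the goal in proportion to each precinct's share of registration;
# as.integer drops the fractional part of every allocation
target$inc <- as.integer(votes.goal * target$current.reg /
                         sum(target$current.reg))
print(target)

# Truncation means the pieces add up to a little less than the goal
total.inc <- sum(target$inc)
print(total.inc)
```

With these figures Claude Moore Park is assigned 68 votes, matching the row shown above. The truncated allocations total 634 rather than 640, which still clears the 628 votes needed, but a campaign that wants the full buffer could use round instead of as.integer or hand the remainder to the largest precincts.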
Playing the role of campaign consultant, we have analyzed previous electoral outcomes in the 13th seat of the House of Delegates in Virginia. We have shown how a Democratic candidate can leverage increasing Democratic support and low turnout to make this race competitive. We have also created a precinct targeting methodology that provides a high-level blueprint for resource planning. The analysis we performed is very standard, but using R makes our methodology unique. A down-ballot or primary-challenger campaign taking advantage of this methodology will spend less money and can experiment more on its targeting, potentially leading it to a win.
Are you a Democrat running for the Virginia House of Delegates who would like to see the same data for your race? Or, are you a Democratic congressional candidate preparing for the 2010 cycle? Contact me at [email protected] for robust targeting data or other analysis.