Guilford Registered Voters: An R Sampler

[This article was first published on R on Can I Blog Too, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Attention conservation notice: Unless you are interested in R-related detail, skip this post and
read the next post instead.

The Data Source

It’s primary season and recently I learned about a somewhat strange
site that collects a lot of voter registration data.
It includes birth date and the full dataset also includes phone numbers.
I was a bit surprised that this data is publicly available.
Apparently I am not the only one. Here
is an article from TV News in Connecticut treating this data as a controversy.

The author of this site is a fellow from new Hampshire. His primary interest is genealogy.
He is unapologetic about putting all this data out onto the web and includes a robust defense of his site. In his about page, he offers the following comment:

Perhaps you have become disillusioned, because you had a false sense of privacy.
Welcome to the twenty-first century and the Information Age.

I recovered from my shock and decided to play with this data as a vehicle to become more familiar with my new home in Guilford while exercising and developing my R skills at the same time.

An R Coding Sampler

This won’t be a coherent description of registered voters in Guilford. Instead it is
a collection of various bits of loosely related R code.
I grabbed onto the Guilford voter statistics because it was a way to connect to my
new home. But the primary purpose was to exercise my R muscles rather than to
learn about Guilford voters.
I spent way more time on this than I expected. Each bit would lead me to try one more thing.

Because I enjoy mapping, I wandered off into some time-consuming experimentation with geotagging.

As I spent an unreasonable time fooling around with this data,
I cam to regard this activity as being like a cross-stitch sampler that
I inherited from my mother. The point of samplers of this sort was just to practice the craft.
(I note that this one was done in 1971, the year I graduated from college.) This blog post has something
in common with my mother’s sampler. Like hers, it was done for the simple pleasure of doing it
(even though that involved interludes of frustration and tedium) and as an
exercise to practice my skill. The result may be about as useful as a cross-stitch sampler, but
“useful” isn’t the point.

One reason this has taken an unreasonable amount of time is that I started out with a mistake.
When I first went to the voter registration web site, I used the interface on the page to
select voters in the Town of Guilford. I then copied the list of about 20,000 names and pasted
them into a text file. That was the dataset I started with. But as I progressed, I noticed some
odd things about this data. It had a a noticeable number of cases where I knew people were no longer
registered to vote, including names at my house who I knew no longer lived there and people who I knew
were deceased or who were implausibly old. As a last step, I compared the registration data to
census population data by age and in many cases saw substantially more voter registrations than
population.

It just didn’t seem right. So I went back to the web site and realized that I could download a bunch
of files that together constituted the complete list for Connecticut. In fact he offers the option
to download data from several points in time.
When I downloaded the data for June 2018, I ended up with about 15,000 active voter registrations
rather than the 20,628 names I got when I just copied the results of the query on his main page.
His interest is in genealogy. I suspect that what he has done is to combine all the names he
found in 2013, 2014, 2016, 2017, and 2018. That’s how he includes people who have moved away or died.
Perhaps for genealogy purposes he wants a list as inclusive as possible. Voting is not his focus.
The downloaded data also includes a few more details than he shows in his interactive query.
The upshot was that I needed to reload the larger dataset and redo my analyses. This turned out
to be a case where relying on RMarkdown provided a real advantage.

As a note on the archaeology of IT, the layouts of the Connecticut dataset look to me like they were done
in COBOL. One wonders how much COBOL code is still active.

Skills I Used Doing this Post

  • The ggmap package and geocode function
  • The terms of service for using Google map data
  • Using kable to display simple tables in RMarkdown
  • How hard it can be to setup a simple table in R (compared with Proc Tabulate in SAS)
  • Using the gender package to estimate gender from first name
  • Calculating age from lubridate date information
  • Various ways to recode data, including the recode function and the not well documented form of str_replace_all
  • Using the tidycensus package to get some basic data from the Census (a potentially large subject)
  • Using American Fact Finder (and other sources) to get variable IDs for the decennial census and the American Community Survey
  • Creating a population pyramid in R
  • Using the ggrepel package to get more readable text labels than geom_text.

Explore Basic Stats

After reading in the downloaded file of voter registrations for the Town of Guilford, we
will look at some basic statistics by party.

# note: sampler picture is 1395 × 1597
# Code that reads in the 500MB chunk that includes Guilford:
EXT2 <- read_csv("downloaded_files/SSP-2/ELCT/VOTER/EXT2",
col_names = FALSE, col_types = fix_cols) %>%
filter(X1 == "060")
EXT2 <- EXT2 %>% arrange(X23, X21, X3)
#
# fl <- read_lines("downloaded_files/SSP-2/ELCT/VOTER/EXT2")
construct_names <- c(paste0("elect", seq(1:20)),
paste0("elect_type", seq(1:20)),
paste0("elect_absentee", seq(1:20)))
construct_names <- construct_names[rep(c(1, 21, 41), 20) + rep(seq(0, 19), each = 3)]
# a sample layout line:
# 001300 10 WS-TOWN-ID PIC X(03). 00121000
alayout <- read_table2("downloaded_files/alayout.txt", col_names = FALSE) %>%
filter(X3 != "PIC") %>%
mutate(X3 = str_to_lower(str_replace_all(X3, "-", "_"))) %>% select(X3)
# next apply names from layout plus constructed names at the end of the record
names(EXT2) <- c(str_replace(alayout$X3, "ws_", "")[1:43], construct_names, "mystery")
guilford_only <- EXT2 %>% filter(town_id == "060") %>%
mutate(first_name = str_trim(paste(vtr_nm_first, vtr_nm_mid)),
addresses = paste0(paste(vtr_ad_num, nm_street),
ifelse(is.na(vtr_ad_unit), "", paste(" Unit", vtr_ad_unit)),
", Guilford, CT")) 
load(paste0(data_folder, "regs from full dataset.RData"))
regs <- regs %>% filter(status == "Active")
regs %>%
group_by(party) %>%
tally() %>% arrange(desc(n)) %>%
mutate(pct =paste0(round(100 * n / nrow(regs), 0),'%')) %>%
bind_rows(regs %>% tally() %>% mutate(party = "Total", pct = "100%")) %>%
kable(col.names = c("Party", "n", "%"), format = "html", align = c("l", "r", "r"),
format.args = list(big.mark = ","))
Party n %
Unaffiliated 5,988 40%
Democratic 5,440 36%
Republican 3,450 23%
Independent 144 1%
Libertarian 19 0%
Green 18 0%
Total 15,059 100%

The winning party in Guilford is…Unaffiliated!

This is reasonably close to the stats provided by Secretary of State from 2017 which show 5,379
Democrats, 3,494 Republicans, and 5,956 unaffiliated.

Gender

The downloaded dataset included gender, but many values were missing. One can easily get an approximate version of gender based on first name. The rOpenSci includes a package called gender to estimate gender based on first name.
For the US it uses a social security database of name and gender by birth year. Based on
name it gives a probability of male or female.

# I clean up the first_name data to eliminate initials and then pick first word
# Also interprets hyphenated first name: Mary-Beth picks out Mary
regs_fn <- regs %>%
mutate(dob = mdy(dt_birth), yob = year(dob),
fn = map2_chr(str_split(
str_replace(first_name, "^.\\.* ", ""), "[ -]"), 1, first))
# could try to use year of birth to refine gender guess,
# but decided it wasn't worth pursuing
fn_year <- regs_fn %>% select(fn, yob) %>%
mutate(yob = (yob %/% 5) * 5) %>%
unique()
all_fn <- unique(fn_year$fn)
# call the gender function to get percent male/female
simple_gender <- gender::gender(all_fn) %>%
mutate(gender_guess = ifelse(proportion_male > 0.5, "Male", "Female"))

Here are some names that fall between 45% and 55%
male (i.e., close to a coin flip).

iffy_gender <- simple_gender %>% filter(proportion_male < 0.55, proportion_male > 0.45) %>%
mutate(how_close = abs(proportion_male - 0.5)) %>% arrange(desc(how_close))
kable(iffy_gender %>% select(`first name` = name, `proportion male` = proportion_male, `estimated gender` = gender))
first nameproportion maleestimated gender
Sol0.5407male
Tian0.5316male
Bao0.5289male
Yu0.4737female
Krishna0.4740female
Justice0.5190male
Lavon0.4820female
Jerzy0.4855female
Arden0.5144male
Kerry0.4896female
Kris0.4904female
Riley0.5063male
Blair0.5032male
Jessie0.4993female

And here is the distribution of observed first names compared with the baseline
stats supplied by the gender package. You can’t get more bi-modal than this.
The gender guesses are not 100% accurate, but most are quite unambiguous.

name_found <- left_join(regs_fn, simple_gender, by = c("fn" = "name")) %>% filter(!is.na(proportion_male))
for_subtitle <- paste(sprintf("(The `gender` package found an entry in the gender database for %4.1f%%",
(nrow(name_found) / nrow(regs_fn)) * 100),
"of the first names.)")
ggplot(data = name_found, aes(x = proportion_male)) +
geom_histogram() + scale_x_continuous(labels = scales::percent) +
labs(title = "Estimate of Gender Based on First Name",
subtitle = for_subtitle)

Of the small number of first names that were not found by the gender package, I would
guess that more than 90% are non-European or at least non-English-speaking names.

Now we are ready to do a basic table showing the number of registered voters
by party and gender. Women are less likely to be unaffiliated and more likely to be
Democrats than are men.

# gosh, it takes a lot of fooling around to produce a straightforward table
regs_fn <- regs_fn %>%
left_join(simple_gender %>% select(fn = name, sex = gender_guess))
## Joining, by = "fn"
miss_codes <- regs_fn %>% filter((sex == "Female" & (vtr_sex == "M")) |
((sex == "Male") & (vtr_sex == "F")))
# % of cases where vtr_sex is present where vtr_sex is different than guess of sex
# note that sprintf format %3.1f rounds.
rate_of_misscodes <- sprintf("%3.1f%%",
(nrow(miss_codes) /
nrow(regs_fn %>% filter(vtr_sex != "U", !is.na(sex)))) * 100)
regs_fn <- regs_fn %>% mutate(sex = ifelse(vtr_sex == "M", "Male",
ifelse(vtr_sex == "F", "Female",
ifelse(is.na(sex), "Unknown", sex))))
by_party <- regs_fn %>%
mutate(Party = factor(case_when(
party %in% c("Democratic", "Republican", "Unaffiliated") ~ party,
TRUE ~ "Other"
), levels = c("Democratic", "Republican", "Unaffiliated",
"Other", "Total"))) %>%
group_by(Party) %>%
summarise(Men = sum(sex == "Male"), Women = sum(sex == "Female"),
Unknown = sum(sex == "Unknown"),
Count = n())
totals = by_party %>%
summarise(Party = factor("Total", levels = c("Democratic", "Republican",
"Unaffiliated", "Other", "Total")),
Men = sum(Men), Women = sum(Women),
Unknown = sum(Unknown), Count = sum(Count))
by_party <- bind_rows(by_party, totals) %>%
mutate(`% Women` = sprintf(" %4.f%% ",
round((Women / (Count - Unknown)) * 100, 1)),
`% of Total` = sprintf(" %4.f%% ",
round((Count / totals$Count) * 100, 1)))
kable(by_party, align = c("l", "r", "r", "r", "r", "c", "c"),
format.args = list(big.mark = ",", width = 6, justify = "r"))
PartyMenWomenUnknownCount% Women% of Total
Democratic2,1783,252105,44060%36%
Republican1,8561,58593,45046%23%
Unaffiliated2,9073,070115,98851%40%
Other10774018141%1%
Total7,0487,9813015,05953%100%

Age

Given that birth date is included in the dataset, let’s take a look at age.
The first thing to do is to check the outliers. There are some people
who are under 18. I was surprised to learn that in Connecticut a 17 year old can register to vote in a
primary if they will turn 18 by the time of the general election.

Here is the code that converts the text date of birth into a value for
age and that does some other recoding.

library(lubridate)
# https://stackoverflow.com/questions/31126726/efficient-and-accurate-age-calculation-in-years-months-or-weeks-in-r-given-b
compute_age <- function(from, to) {
from_lt = as.POSIXlt(from)
to_lt = as.POSIXlt(to)
age = to_lt$year - from_lt$year
ifelse(to_lt$mon < from_lt$mon |
(to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday),
age - 1, age)
}
# create named vector to use with str_replace that will find name and replace with value.
# Full dataset has age as mm/dd/yyyy so I didn't need this code.
# fnd <- c("born: ", " January ", " February ", " March ", " April ",
# " May ", " June ", " July ", " August ",
# " September ", " October ", " November ", " December ")
# repl <- c("", "-01-", "-02-", "-03-", "-04-",
# "-05-", "-06-", "-07-", "-08-",
# "-09-", "-10-", "-11-", "-12-")
# names(repl) <- fnd # repl will be used in str_replace_all
regs_age <- regs_fn %>%
left_join(simple_gender, by = c("fn" = "name")) %>%
mutate(sex = ifelse(is.na(proportion_male), "Unknown",
ifelse(proportion_male > 0.5, "Male", "Female")),
Party = factor(case_when(
party %in% c("Democratic", "Republican", "Unaffiliated") ~ party,
TRUE ~ "Other"
), levels = c("Democratic", "Republican", "Unaffiliated",
"Other", "Total")),
age = compute_age(dob, ymd("2017-11-03")),
age = ifelse(age > 105, NA_real_, age),
# create Age groups:
Age = paste0((age %/% 5) * 5,"-",(age %/% 5) * 5 + 4)) %>%
select(first_name, last_name, yob, age, Age, sex, Party, addresses,
vtr_sex, poll_place)

Population Pyramid of Voter Registrations

Next we will do a “population pyramid” based on a StackOverflow anwer and informed by another StackOverflow answer by Didzis Elferts.

for_pop_pyramid <- regs_age %>% filter(sex != "Unknown", !is.na(age), age < 105, Party != "Other") %>%
mutate( Age = if_else(age >= 90, "90+", Age)) %>%
arrange(age) %>% mutate(Age = fct_inorder(factor(Age)))
party_pyramid <- ggplot(data = for_pop_pyramid,
aes(x = Age, fill = sex)) +
geom_bar(data = subset(for_pop_pyramid, sex == "Female"), width = 1) +
geom_bar(data = subset(for_pop_pyramid, sex == "Male"), , width = 1,
mapping = aes(y = - ..count.. ),
position = "identity") +
scale_y_continuous(labels = abs) +
coord_flip() +
labs(title = "Guilford Registered Voters Population Pyramid by Party 2018") + xlab("Age") +
facet_wrap(~ Party)
print(party_pyramid) 

There are differences by party. For Democrats 65-69 is the largest age category. For Republicans and Unaffiliated it’s 55-59.
Women outnumber men throughout pyramid for the Democratic Party. For the Republicans,
men consistently outnumber women except for 85 and older. Republicans look thin in the under
50 categories.

US Census Data for Guilford

Next we will retrieve US Census data to do the same thing for the total population of Guilford in 2000 and 2010.1

library(tidycensus)
census_age_sex <- function(census_year = 2010, towns = c("Guilford")) {
dec2010 <- load_variables(2010, "sf1", cache = TRUE)
vars <- dec2010$name[985:1031]
# dec2010$label[985:1031]
# we need to weed out the Total Female variabe
vars <- vars[vars != first(dec2010$name[dec2010$label == "Total!!Female"])]
# get my key for the census API
source("~/Dropbox/Programming/R_Stuff/my_census_api_key.R")
age_x_sex <- get_decennial(variables = vars, # table = "P12",
summary_var = "P012001",
geography = "county subdivision",
county = "New Haven",
state = "CT", cache = TRUE, year = census_year)
# Note the interesting filter using map_lgl to take advantage of passing vector to str_detect
# Filter can also be written as: filter(map_lgl(NAME, ~ any(str_detect(., towns))))
guilford_age <- age_x_sex %>%
filter(map_lgl(NAME, function(x) {any(str_detect(x, towns))})) %>%
left_join(dec2010, by = c("variable" = "name")) %>%
arrange(NAME, variable)
# format and coding based on https://walkerke.github.io/2017/10/geofaceted-pyramids/
agegroups <- c("0-4", "5-9", "10-14", "15-19", "15-19", "20-24", "20-24",
"20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54",
"55-59", "60-64", "60-64", "65-69", "65-69", "70-74", "75-79",
"80-84", "85+")
agesex <- c(paste("Male", agegroups),
paste("Female", agegroups))
guilford_age$group <- rep(agesex, length(unique(guilford_age$NAME)))
guilford_age$year <- census_year
return(guilford_age)
}
towns <- paste0("^", c("New Haven", "Guilford",
"Branford", "Hamden",
"Madison",
"East Haven",
"West Haven",
"North Haven",
"Woodbridge"))
x1 <- census_age_sex(census_year = 2000)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
## Getting data from the 2000 decennial Census
all_towns <- census_age_sex(census_year = 2010, towns = towns)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
## Getting data from the 2010 decennial Census
x2 <- all_towns %>% filter(str_detect(NAME, "Guilford"))
guilford_age <- bind_rows(x1, x2)
guilford_age <- bind_rows(census_age_sex(census_year = 2000), census_age_sex(census_year = 2010))
## To install your API key for use in future sessions, run this function with `install = TRUE`.
## Getting data from the 2000 decennial Census
## To install your API key for use in future sessions, run this function with `install = TRUE`.
## Getting data from the 2010 decennial Census
#guilford_age$group <- str_replace(guilford_age$group, "Male |Female ", "")
age2 <- guilford_age %>%
group_by(year, group) %>%
mutate(value = sum(value)) %>%
distinct(year, group, .keep_all = TRUE) %>%
ungroup() %>%
#mutate(percent = 100 * (value / summary_value)) %>%
select(year, group, value) %>%
separate(group, into = c("sex", "age"), sep = " ") %>%
mutate(age = factor(age, levels = unique(age)),
n = ifelse(sex == "Female", value, -value))
xlabs = c("0-4" = "0-4", "5-9" = "", "10-14" = "", "15-19" = "", "20-24" = "",
"25-29" = "", "30-34" = "", "35-39" = "", "40-44" = "", "45-49" = "",
"50-54" = "", "55-59" = "", "60-64" = "", "65-69" = "", "70-74" = "",
"75-79" = "", "80-84" = "", "85+" = "85+")
xlabs2 <- c("0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80-84", "85+")
guilford_two_year <- ggplot(data = age2, aes(x = age, y = n, fill = sex)) +
geom_bar(stat = "identity", width = 1) +
scale_y_continuous(breaks=c(-1000, -500, 0, 500, 1000),labels=c("1,000", "500", "0", "500", "1,000")) +
coord_flip() +
theme_minimal(base_family = "Tahoma") +
scale_x_discrete(labels = xlabs2) +
scale_fill_manual(values = c("red", "navy")) +
# theme(panel.grid.major = element_blank(),
# panel.grid.minor = element_blank(),
# strip.text.x = element_text(size = 6)) +
labs(x = "", y = "", fill = "",
title = "Demographic Structure of Guilford -- 2000 and 2010",
caption = paste("Data source:", "US Census, tidycensus R package.")) +
facet_wrap(~ year)
print(guilford_two_year)

The census data has a top age category of 85+.

Adults in the 20 to 34 age category are relatively scarce in Guilford. Comparing 2000 and 2010, one
can see that the population of Guilford is getting older. In both years age 50-54 is the
largest group. But one can see the distribution moving older. The 55-64 categories have gotten
much bigger between 2000 and 2010 and the 30-39 categories have gotten smaller.

National Age Distribution

The Census blog has a discussion of the national demographic
structure.

US Demograpic Structure

US Demograpic Structure

One can see the peak of the baby boom and then an echo of the baby boom in the 20-24 category.

Comparing Guilford to Other Area Towns

The population structure in Guilford is much different than the nation-wide pattern. This
made me curious about how Guilford compared with some other area towns. I picked eight other
towns centered around New Haven. That gives me two large plots, each with nine towns. The
first plot shows the population pyramid scaled by the size of the population. It emphasizes
how much larger New Haven is that each surrounding town by itself. The second plot is scaled by
the total population in each town. Looking at the percentage of population in each age category
emphasizes differences in the structure of the population rather than the absolute size.

One can see that Guilford is similar to Madison and to Woodbridge (and to a lesser extent
North Haven) in terms of the thin band in
the 20 to 40 age range. I assume this is related to housing costs for young families.
Note the post-secondary student age bulges for New Haven, Hamden, and West Haven because of
their college populations (Yale, Quinnipiac, and University of New Haven).

age <- all_towns %>% rename(town = NAME) %>%
mutate(town = str_replace(str_replace(town, " town, New Haven County, Connecticut", ""), " town", ""),
town = factor(town, levels = c("Woodbridge", "Hamden", "North Haven",
"West Haven", "New Haven", "East Haven",
"Branford", "Guilford", "Madison"))) %>%
group_by(town, group) %>%
mutate(value = sum(value)) %>%
distinct(town, group, .keep_all = TRUE) %>%
ungroup() %>%
mutate(percent = 100 * (value / summary_value)) %>%
select(town, group, value, percent) %>%
separate(group, into = c("sex", "age"), sep = " ") %>%
mutate(age = factor(age, levels = unique(age)),
n = ifelse(sex == "Female", value, -value),
pct = ifelse(sex == "Female", percent, -percent))
xlabs = c("0-4" = "0-4", "5-9" = "", "10-14" = "", "15-19" = "", "20-24" = "",
"25-29" = "", "30-34" = "", "35-39" = "", "40-44" = "", "45-49" = "",
"50-54" = "", "55-59" = "", "60-64" = "", "65-69" = "", "70-74" = "",
"75-79" = "", "80-84" = "", "85+" = "85+")
# xlabs2 <- c("0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80-84", "85+")
xlabs2 <- c("0-4", "", "10-14", "", "20-24", "", "30-34", "", "40-44", "", "50-54", "", "60-64", "", "70-74", "", "", "85+")
town_pyramids_pop <- ggplot(data = age, aes(x = age, y = n, fill = sex)) +
geom_bar(stat = "identity", width = 1) +
scale_y_continuous(breaks=c(-6000, -5000, -4000, -3000, -2000, -1000, 0,
1000, 2000, 3000, 4000, 5000, 6000),
labels=c("", "5K", "", "", "2K", "", "", "", "2K", "", "", "5K", "")) +
coord_flip() +
theme_minimal(base_family = "Tahoma") +
scale_x_discrete(labels = xlabs2) +
scale_fill_manual(values = c("red", "navy")) +
# theme(panel.grid.major = element_blank(),
# panel.grid.minor = element_blank(),
# strip.text.x = element_text(size = 6)) +
labs(x = "", y = "", fill = "",
title = "Demographic Structure Area Towns - Population by Age 2010",
caption = paste("Data source:", "US Census, tidycensus R package.")) +
facet_wrap(~ town)
print(town_pyramids_pop)

town_pyramids_pct <- ggplot(data = age, aes(x = age, y = pct, fill = sex)) +
geom_bar(stat = "identity", width = 1) +
scale_y_continuous(breaks=c(-6, -4, -2, 0, 2, 4, 6),labels=c("", "4%", "2%","0", "2%", "4%", "")) +
coord_flip() +
theme_minimal(base_family = "Tahoma") +
scale_x_discrete(labels = xlabs2) +
scale_fill_manual(values = c("red", "navy")) +
# theme(panel.grid.major = element_blank(),
# panel.grid.minor = element_blank(),
# strip.text.x = element_text(size = 6)) +
labs(x = "", y = "", fill = "",
title = "Demographic Structure Area Towns - Percent by Age 2010",
caption = paste("Data source:", "US Census, tidycensus R package.")) +
facet_wrap(~ town)
print(town_pyramids_pct)

Guiford Regisration Rates

Let’s directly compare the Guilford registered voter counts of 2018 to the 2010 census. That’s not an exact comparison because
the population changed between 2010 and 2018.2 And the registered voter list appears
to include some people who moved away or died.

regs_count <- regs_age %>% group_by(Age) %>% tally() %>%
mutate(group = case_when(
Age %in% c("100-104", "105-109", "85-89", "90-94", "95-99") ~ "85+",
Age == "NA-NA" ~ NA_character_,
TRUE ~ Age
)) %>%
filter(!is.na(group), group != "15-19") %>%
group_by(group) %>% summarise(n = sum(n, na.rm = TRUE)) %>%
select(group, regs = n)
census_count <- guilford_age %>% filter(year == 2010) %>%
mutate(group = str_replace(group, "Male |Female ","")) %>%
group_by(group) %>% summarise(n = sum(value)) %>%
filter(!(group %in% c("0-4", "5-9", "10-14", "15-19")))
both <- left_join(regs_count, census_count) %>%
mutate(`%` = sprintf(" %8.f%% ", round(((regs / n) * 100), 0))) %>%
select(group, `2010 population` = n, `2018 voters` = regs, `%`) %>%
arrange(desc(group))
## Joining, by = "group"
kable(both, format.args = list(big.mark = ",", justify = "right"))
group2010 population2018 voters%
85+55451292%
80-8449347296%
75-79665793119%
70-748171,301159%
65-691,3841,563113%
60-641,8401,57085%
55-591,9711,66785%
50-542,0741,49772%
45-492,0101,33266%
40-441,69089953%
35-391,08078072%
30-3478565684%
25-2967864395%
20-24709935132%

Remember that the numerator (voters) is from 2018 and the denominator (population) is from 2010.
That’s how we are able to have some categories where we have more voters than population.
Presumably the 70-74 category shows as 159% of the population in 2010 because a large chunk of
people who were in the 65-69 category in 2010 were in the 70-74 category in 2018 for registrations.
The same would be true for some other categories as a bulge of older people moved through the
age categories.

Geotagging – Where are the Voters?

I did some simple geotagging years ago, and this dataset of addresses cried out for more.
The ggmap package includes a function called geocode that
takes an address and returns longitude and latitude.

In theory geo-coding involves a straightforward call to the geocode function. That was true
if I relied on the default source. But when I used Google as a source, it was much more
problematic. I had to call geocode one address at a time and wrap that call with code that
would survive failures. Google limits the number of calls and also the speed of the calls. As
a result it doesn’t return a result and geocode does not handle those failures gracefully. I
had to divide up the 8,000+ addresses into batches and feed them to Google in bits and pieces
over a period of several days.

First I’ll show the basic registration counts by polling place.
Then I’ll display a basic map showing where the voters are located, separately by party.

Here are the basic voter registration counts by polling place. The percentage of Republican voters
ranges between 21% and 25%. There’s a somewhat larger range for Democratic voters (30% to 42%).

poll_loc <- tribble(
~poll_place, ~lon, ~lat,
"Guilford Fire Headquarters", -72.68924, 41.29657,
"Melissa Jones School", -72.72633, 41.36891,
"Abraham Baldwin School", -72.71898, 41.33782,
"Calvin Leete School", -72.66886, 41.28287,
"A.W. Cox School", -72.69658, 41.28663)
by_place <- regs_age %>% group_by(poll_place) %>% tally() %>% rename(subtotal = n)
by_place_party <- regs_age %>%
group_by(poll_place, Party) %>% tally() %>%
left_join(by_place) %>% mutate(pct = (n / subtotal) * 100)
## Joining, by = "poll_place"
place_summary <- by_place_party %>% group_by(poll_place) %>%
select(poll_place, Party, pct) %>%
spread(key = Party, value = pct) %>%
mutate(summary = sprintf("D:%2.f%% R:%2.f%% U:%2.f%%", Democratic, Republican, Unaffiliated)) %>%
left_join(poll_loc) %>% left_join(by_place)
## Joining, by = "poll_place"
## Joining, by = "poll_place"
kable(place_summary %>% select(`Polling Place` = poll_place, n = subtotal, summary),
format.args = list(big.mark = ","),
caption = "Guilford Voter Registratins by Polling Place")
(#tab:polling_place)Guilford Voter Registratins by Polling Place
Polling Placensummary
A.W. Cox School2,852D:40% R:21% U:37%
Abraham Baldwin School3,032D:35% R:22% U:41%
Calvin Leete School2,745D:42% R:21% U:37%
Guilford Fire Headquarters3,466D:34% R:24% U:40%
Melissa Jones School2,964D:30% R:25% U:43%

There seems to be a tendency for the polling locations close to the Sound to have a higher
percentage of Democrats and a lower percentage of Unaffiliated. The far north is the most
Republican. None of these effects are large.

# I'm loading in data already geo-tagged because geocoding was a bit messy in practice.
load(paste0(data_folder, "google_addr.RData"))
load(paste0(data_folder, "dsk_addr.RData"))
# At the end of this post I'll append a long block of code that shows how geocoded addresses.
guilford_center_lat <- 41.32
guilford_center_lon <- -72.699986
guilford_left_bottom_lon <- -72.749185
guilford_left_bottom_lat <- 41.242489
guilford_right_top_lon <- -72.631308
guilford_right_top_lat <- 41.436585
guilford_right_lower_lat <- 41.34
guilford_boundaries <- c(left = guilford_left_bottom_lon, bottom = guilford_left_bottom_lat,
right = guilford_right_top_lon, top = guilford_right_top_lat)
guilford_base <- ggmap(get_map(location = guilford_boundaries, zoom = 11,
maptype = "roadmap", source = "google"))
## Source : http://tile.stamen.com/terrain/11/610/764.png
## Source : http://tile.stamen.com/terrain/11/610/765.png
## Source : http://tile.stamen.com/terrain/11/610/766.png
regs_loc <- regs_age %>%
left_join(google_addr %>% select(lat, lon, addresses = in_address)) %>%
filter(!is.na(lat), !is.na(lon),
lat <= guilford_right_top_lat, lat >= guilford_left_bottom_lat,
lon <= guilford_right_top_lon, lon >= guilford_left_bottom_lon)
## Joining, by = "addresses"
p <- guilford_base +
geom_point(data = regs_loc %>%
filter(Party %in% c("Democratic", "Republican", "Unaffiliated")) %>%
sample_frac(size = 0.5),
aes(x = lon, y = lat, color = Party), size = 0.1) +
facet_wrap(~ Party) +
xlim(c(guilford_left_bottom_lon, guilford_right_top_lon)) +
ylim(c(guilford_left_bottom_lat, guilford_right_top_lat)) + # guilford_right_top_lat
scale_color_manual(values=c("blue", "red", "purple"))
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
p <- p + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.text.y = element_blank(), axis.ticks.y = element_blank(),
axis.line = element_blank()) + theme(legend.position="none") +
xlab(NULL) + ylab(NULL)
# dusing the ggrepel package to keep the polling place summary visible
p <- p + geom_label_repel(data = place_summary, aes(label = summary), colour = "black", size = 2.5)
print(p)
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf

The plot shows each party separately. With a
dot plot such as this sometimes one can have too much data. Things get lost in a swarm of dots.
To thin things out a bit I am showing only a 50% sample. This is a ggplot2 dot plot
with a Google map as background (based on the ggmap package).
To my eye I don’t see much sign that voters are concentrated by party in different parts of
Guilford. The summary of the party distribution by polling place is overlaid on the plot.
The text labels are same on each plot. The location of the text corresponds roughly
to the location of the polling place, with some adjustment to keep the summaries readable.

Geotagging: Into the Weeds

There is a lot of information on the web about geocoding. The geocode function
in ggmap provides an
interface to two sources of geocoding: the Data Science Toolkit and the Google Maps Platform. The Google terms of services
has a number of restrictions. Unless one gets set up to pay for the Google service,
there is a limit on the total number of searches per day (which may be 2,500) and
a limit on the number of searches per second. For most purposes the Data Science Toolkit
is probably good enough. Out of curiosity I tried both. In practice I had a difficult time geocoding
the 8,000+ addresses via Google. I was tripped up by both the daily limit and the
search per second limit. The geocode function did not seem to fail gracefully so I had
to wrap it in try-catch code. It ended up taking me almost a week to get all the addresses
geocoded via Google.

Via ggmap it’s easy to throw up a quick plot showing
the results of the geocoding. Watch out for outliers. With both Data Science Toolkit and Google
I had two addresses that ended up in Turkey. They were both addresses that had a street
address that ended with something like “Unit #5”. All the addresses ended with “Guilford, CT.” Tacking on “USA” after the state helped eliminate the wackiest results. But
I found a number of other cases that produced a latitude and longitude well outside of the
boundaries of Guilford.

For most purposes the default result from geocode will be adequate. But I was curious about
the quality of the results and where they came from. A number of academic geography-type programs
get into these issues. See Texas A&M GeoServices as an example.
I think you can submit batch files of addresses to them as an alternative to using something
like the geocode function in ggmap. Out of personal interest, I decided to evaluate
my geocodes by directly comparing the results based on Google and on the Data Science Toolkit.

p_dsk_google <- google_dsk_map <- guilford_base +
geom_point(data = google_addr ,
aes(x = lon, y = lat), size = 0.1, colour = "red") +
geom_point(data = dsk_addr,
aes(x = lon, y = lat), size = 0.1, colour = "blue") +
xlim(c(guilford_left_bottom_lon, guilford_right_top_lon)) +
ylim(c(guilford_left_bottom_lat, 41.34)) + # guilford_right_top_lat
ggtitle("Comparison of Data Science Toolkit (blue) and Google (red)") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.text.y = element_blank(), axis.ticks.y = element_blank(),
axis.line = element_blank()) + theme(legend.position="none") +
xlab(NULL) + ylab(NULL)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
print(p_dsk_google)
## Warning: Removed 1 rows containing missing values (geom_rect).
## Warning: Removed 1856 rows containing missing values (geom_point).
## Warning: Removed 1854 rows containing missing values (geom_point).

This plot uses the geocodes provided by Google (in red) and then Data Science Toolkit (in blue).
Because of the order in which the points were applied, if an address has the same location from
both sources, the blue point will overwrite the red point, that is, only blue will be visible.
Blue points are along roads only. That means that the Data Science Toolkit geocodes show the location
along the road only. Google codes are often on top of the building (as seen on a close-up view using
the satellite map). In my immediate neighborhood one can only see blue points, meaning the Data
Science Toolkit and Google are giving exactly the same latitude and longitude.

I don’t draw much of any practical conclusion from this.

Zooming in on my local neighborhood

detail_left_bottom_lon <- -72.710723
detail_left_bottom_lat <- 41.269414
detail_right_top_lon <- -72.692658
detail_right_top_lat <- 41.280105
detail_boundaries <- c(left = detail_left_bottom_lon, bottom = detail_left_bottom_lat,
right = detail_right_top_lon, top = detail_right_top_lat)
home_base <- ggmap(get_map(location = detail_boundaries, zoom = 15,
maptype = "hybrid", source = "google"))
## Source : http://tile.stamen.com/terrain/15/9765/12251.png
## Source : http://tile.stamen.com/terrain/15/9766/12251.png
## Source : http://tile.stamen.com/terrain/15/9767/12251.png
## Source : http://tile.stamen.com/terrain/15/9765/12252.png
## Source : http://tile.stamen.com/terrain/15/9766/12252.png
## Source : http://tile.stamen.com/terrain/15/9767/12252.png
## Source : http://tile.stamen.com/terrain/15/9765/12253.png
## Source : http://tile.stamen.com/terrain/15/9766/12253.png
## Source : http://tile.stamen.com/terrain/15/9767/12253.png
p_myhome <- home_base +
geom_point(data = google_addr,
aes(x = lon, y = lat), size = 1, colour = "red") +
geom_point(data = dsk_addr,
aes(x = lon, y = lat), size = 1, colour = "blue") +
xlim(c(detail_left_bottom_lon, detail_right_top_lon)) +
ylim(c(detail_left_bottom_lat, detail_right_top_lat)) +
ggtitle("Local Detail of Data Science Toolkit (blue) and Google (red)") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.text.y = element_blank(), axis.ticks.y = element_blank(),
axis.line = element_blank()) + theme(legend.position="none") +
xlab(NULL) + ylab(NULL)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
print(p_myhome)
## Warning: Removed 8470 rows containing missing values (geom_point).
## Warning: Removed 8457 rows containing missing values (geom_point).

The satellite map of the area around my house shows the results of geocoding in detail. In
this area there are lots of addresses for which Google and Data Science Toolkit give exactly
the same longitude and latitude, in which case only a blue dot appears (because the red dot
is underneath). In the satellite map one can see the actual houses, so one can see places
(such as Three Corners Road) where geocoding is approximately correct, but the houses are
more strung out than the geocodes imply. One interesting street is Greenwood Lane. Google
correctly shows the locations of the houses on this small and relatively new street. The
Data Science Toolkit shows all of the addresses as being at the location where Greenwood Lane
intersects with Three Corners Road. It illustrates that the services that are providing geocodes
have significant technical details when it comes down to identifying the precise location of
address. Plus one wonders whether Google distinguishes the location of the house from the
location of the entrance of driveway, given that the latter is what is relevant for providing
turn-by-turn directions.

detail_left_bottom_lon <- -72.683863
detail_left_bottom_lat <- 41.302653
detail_right_top_lon <- -72.667307
detail_right_top_lat <- 41.312671
detail_boundaries <- c(left = detail_left_bottom_lon, bottom = detail_left_bottom_lat,
right = detail_right_top_lon, top = detail_right_top_lat)
home_base <- ggmap(get_map(location = detail_boundaries, zoom = 15,
maptype = "hybrid", source = "google"))
## Source : http://tile.stamen.com/terrain/15/9768/12247.png
## Source : http://tile.stamen.com/terrain/15/9769/12247.png
## Source : http://tile.stamen.com/terrain/15/9768/12248.png
## Source : http://tile.stamen.com/terrain/15/9769/12248.png
p_example <- home_base +
geom_point(data = google_addr,
aes(x = lon, y = lat), size = 1, colour = "red") +
geom_point(data = dsk_addr,
aes(x = lon, y = lat), size = 1, colour = "blue") +
xlim(c(detail_left_bottom_lon, detail_right_top_lon)) +
ylim(c(detail_left_bottom_lat, detail_right_top_lat)) +
ggtitle("Another Local Example of Data Science Toolkit (blue) and Google (red)") +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.text.y = element_blank(), axis.ticks.y = element_blank(),
axis.line = element_blank()) + theme(legend.position="none") +
xlab(NULL) + ylab(NULL)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
print(p_example)
## Warning: Removed 8471 rows containing missing values (geom_point).
## Warning: Removed 8467 rows containing missing values (geom_point).

Finally, here is another example that zooms in on a small local neighborhood. In this case you
can see clearly how Google locates individual homes fairly well while the Digital Science Toolkit
has blue dots strung out along the road, and not always lined up well with the location of the houses.
The only identifying information on this bit of Google maps is Sullivans Pits Septic Lagoons. Not
a very nice name for a neighborhood.

Conclusion

Has anyone read this far? Why?

I wouldn’t say I have learned a lot of new R techniques doing this exercise. What I did do was to
solidify some bits and techniques I already knew. The way this post will be useful for me is that
it contains lots of details that will come up again in other work that I will do. I’ll remember
that I did something here and be able to go back and look at the code I used. That’s why I have
left in every bit of R code here. Perhaps that makes the post unreadable for any general user (should
such a critter actually encounter it), but that’s exactly what may make the post useful for me in the
future.

Sample Function to Apply Geocode Function

As promised, below I have pasted the functions I wrote to try to cope with the difficulties of
using the geocode function with Google as a source.

# Geocoding a csv column of "addresses" in R
# from: http://www.storybench.org/geocode-csv-addresses-r/
#test: wrap_geocode("248 Sam Hill Rd, Guilford, CT", 28)
#test: geocode("248 Sam Hill Rd, Guilford, CT", output = "latlona", source = "dsk")
#test: xx <- wrap_geocode("ksdfjsdfaf",23)
wrap_geocode_single <- function(an_address, an_id, src2 = "dsk") {
result <- data_frame(in_address = an_address, id = an_id, lat = NA_real_, lon = NA_real_, out_address = "", rc = "")
try_result = tryCatch({
geo_rc <- 0
geo_result <- geocode(an_address, output = "latlona", source = src2, messaging = FALSE)
if (ncol(geo_result) <= 2) {
result$rc[1] <- paste(nrow(geo_result), "rows", "and", ncol(geo_result), "columns returned from geocode.")
return(result)
} else if (nrow(geo_result) == 1) {
geo_rc <- 1
result$lon[1] <- geo_result$lon[1]
result$lat[1] <- geo_result$lat[1]
result$out_address[1] <- as.character(geo_result$address[1])
result$id[1] <- an_id
result$in_address[1] <- an_address
result$rc[1] <- "OK"
return(result)
} else {
result$rc[1] <- paste(nrow(geo_result), "rows returned from geocode.")
return(result)
}
}, error = function(e) {
result$rc[1] <- as.character(e)
return(result)
})
return(try_result)
}
# If at first you don't succeed, try again
# I'm hoping this introduces enough of a delay to take care of google problem
wrap_geocode <- function(an_address, an_id, src = "dsk") {
if (src == "google") if (as.numeric(geocodeQueryCheck()) == 0) return(NULL)
xx <- wrap_geocode_single(an_address, an_id, src2 = src)
if (src == "google") Sys.sleep(0.2)
if (xx$rc[1] != "OK") {
Sys.sleep(0.2)
if (src == "google") if (as.numeric(geocodeQueryCheck()) == 0) return(NULL)
xx <- wrap_geocode(an_address, an_id, src = src)
}
return(xx)
}

  1. The population pyramids using Census data are based a Geo-faceted population pyramids with tidycensus 0.3 by Kyle Walker.

  2. I tried looking at the American Community Survey, which is based on a sample taken by the Census Bureau in between the decennial census. Guilford seems to be too small to provide a reliable breakdown by age and sex using the American Community Survey. It might require something at the county level or higher to provide decent estimates for the 5-year age categories.

To leave a comment for the author, please follow the link and comment on their blog: R on Can I Blog Too.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)