Working with New Haven Area Census Data Using R

[This article was first published on R on Can I Blog Too, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post will work through a basic setup to look at census data for the New Haven area. It will rely on the tidycensus and tigris packages by Kyle Walker. The emphasis will be on the R code for getting and using Census data rather than on the data itself.

I should emphasize that I am in no way an expert on the Census. On the contrary I would rate myself as a novice. The point of this post is to record my exploration of how to apply the Census to my area of the state. I hope to provide some of the information I wish I had when I first started working with Census data.

American Community Survey

The tidycensus package is designed to retrieve data via the Census Bureau API’s either from the decennial census or from the American Community Survey (ACS). For this exercise I will focus on the ACS.

The ACS started in 2005 and has replaced the “long form” of the census survey. Between 1970 and 2000, the U.S. Census Bureau used two questionnaires. Most households received a short-form questionnaire asking a minimum number of questions. A sample of households received a long-form questionnaire that included additional questions about the household. In 2005 the American Community Survey was created to replace the long form census. The 2010 Census had just one questionnaire consisting of ten questions. (Census Bureau)

By law one is required to respond to Census Bureau surveys, but in practice the Bureau does not levy fines for non-response to the ACS.

Each year about 3% of households are included in the sample for the American Community Survey. It covers a wide range of topics. There’s a page that describes the many topics on the survey and why they are included. It also leads to links where you can see the actual questions as they appear on the ACS. There’s lots of data about race and citizenship and household demographics and money. But there are also quite a few items on other aspects of life. It seems like it has everything, including the kitchen sink (covered by table B25051 Kitchen Facilities for All Housing Units).

As a side note, learning about the American Community Survey has given me a much stronger understanding of the recent controversy (and legal wrangling) about adding a citizenship question to the decennial census. I now appreciate that the Census Bureau has worked to made the decennial census bare bones and simple to respond to. There are a lot of things one might want to know about the US population, and the American Community Survey is their vehicle for getting that depth of information. It is already has questions about citizenship and also lots of detail about national origin. The ACS is designed so that the decennial census does not need to ask about citizenship. The ACS is sufficient and a better vehicle for getting a full picture of citizenship and immigration.

What Variables are in the ACS?

For someone like myself who is unfamiliar with the ACS it can be downright bewildering to sift through what’s available. Using the load_variables function in tidycensus (or by going to the Census Bureau web site) one can download the variable numbers that describe each count that is available from the ACS results. There are over 25,000 rows in the table that lists all these variable numbers. It took me a while to get used to what a “variable” is in this sense. Remember, via the API you are not seeing raw results (i.e., one row per person’s response). The ACS API gives you various tabulations and cross-tabulations. Each variable is one of those tabulations, that is, one number in a table. In fact, variables are organized in terms of topics and tables. There is an ID for each table, (such as B17010) and then separate variables for each cell in that table (and for many of the cross-tabulations as well). A tool that helped me to get used to the meaning of variables is the American Factfinder, an online tool at the Census Bureau web site. The Factfinder has been superseded by a new data exploration platform at census.data.com. Unfortunately, as far as I can tell, the American Factfinder doesn’t display the table numbers used by the API. (It feels like I should be able to get that; perhaps I just don’t know how.) The new data.census.gov does show the overall table number, but not the variable names for individual cells.

When I click on Table Notes on a Factfinder table, I get a popup window with a link to technical documentation. (The same link is also available at data.census.gov.) On the left side of that page there is a guide to other items related to technical documentation. Of particular value is the link to Table Shells. On that page one can find Table Shells for All Detailed Tables as a single 1.3MB excel spreadsheet

As an example, here is the table shell for Table C17010 (which is a collapsed version of the even more detailed Table B17010)1:

POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE BY PRESENCE OF RELATED CHILDREN UNDER 18 YEARS

Line# Variable Description
1 C17010_001 Total:
2 C17010_002 Income in the past 12 months below poverty level:
3 C17010_003   Married-couple family:
4 C17010_004       With related children of the householder under 18 years
5 C17010_005   Other family:
6 C17010_006     Male householder, no wife present
7 C17010_007       With related children of the householder under 18 years
8 C17010_008     Female householder, no husband present
9 C17010_009       With related children of the householder under 18 years
10 C17010_010 Income in the past 12 months at or above poverty level:
11 C17010_011   Married-couple family:
12 C17010_012       With related children of the householder under 18 years
13 C17010_013   Other family:
14 C17010_014     Male householder, no wife present
15 C17010_015       With related children of the householder under 18 years
16 C17010_016     Female householder, no husband present
17 C17010_017       With related children of the householder under 18 years

As you work with tidycensus using these variable names will be confusing. Many ACS table are much more elaborate than this one and the array of variable names can be overwhelming. C17010 has 17 lines in the table, but B17010 (the more expanded version) has 41 lines. And there are nine variations of this table by race, indicated by a letter appended to the table name:

  • C17010A White alone
  • C17010B Black or African American alone
  • C17010C American Indian or Alaskan Native alone
  • C17010D Asian alone
  • C17010E Native Hawaiian or Other Pacific Islander alone
  • C17010F Some other race alone
  • C17010G Two or more races
  • C17010H White alone and not Hispanic or Latino
  • C17010I Hispanic or Latino

That adds up to 17 x 10 = 170 variables for all of the variations of C17010. And for B17010 and its variations there are a total of 410 variables. You can see how you can get lost amidst all of these variables.

After writing this section, I realized that C17010 is only available in the one-year version of ACS, and is not available in the five-year version which combines data for 2013 through 2017. I’m leaving the description of C17010 because it is more compact, but in the code below I want to work with the 5-year data (to get better estimates for small geographic units) so I’ll have to stick with B17010. Of course one can always get the same numbers in C17010 by adding the right variables from B17010. I should have paid attention to the “notes” column which has 1 and a 5 when the table is available for both ACS versions and a 1 when it is available only for the single year version.

Note that when you are working with the ACS table shells, you should pay attention to what universe is specified for each table. Some example: families, households, total population, housing units. There are many more specific universes: population 15 years and over in the United States, civilian non-institutionalized population, and so on. The “universe” is the set of things that are being sampled to produce the results in the table.

Get ACS Data Via tidycensus

Once you have zeroed in on which ACS variables you want to look at, the next step is to use tidycensus to retrieve the data via the Census Bureau API. I am going to retrieve data on families below the poverty line for the towns in the New Haven area, my own part of Connecticut.

We will use get_acs() to retrieve the data.

This post was created with RMarkdown and all of the code to fetch the ACS data and to make the plots is included here and is visible.

A preliminary detail. To access US Census items via the Census Bureau API, you will need an API key. You can obtain a key via this link. After they email you your key, you use the census_api_key function to save it for use by tiycensus and tigris functions. It will look like `census_api_key(“YOUR KEY GOES HERE”). The key is stored in your R environment.

# setup R libraries used in this post
library(tidyverse, quietly = TRUE)
library(tidycensus, quietly = TRUE)
library(sf, quietly = TRUE)
library(tigris, quietly = TRUE)
library(viridis, quietly = TRUE)
library(knitr, quietly = TRUE)
library(scales, quietly = TRUE)
library(kableExtra, quietly = TRUE)
options(tigris_use_cache = TRUE)
options(tigris_class = "sf")

You can manually create a list of variables to pass to get_acs(), but it may be easier to work with the long data frame of variable names that one can fetch via load_variables(). Some additional parsing of the returned data can make working with variables much easier. In the code block here we will pull out the table identifiers and also separate out the table variants for breakdowns by race.

vars <- load_variables(2017, "acs5", cache = TRUE) %>%
mutate(table_id = str_sub(name, 1, 6),
# Race generally is in parentheses after the concept name.
# But for a few cases, something else is in parentheses first. So I
# am going to blank out that stuff and then assume whatever I find inside
# of parentheses is race.
concept = str_replace_all(concept,
c("\\(IN 2017 INFLATION-ADJUSTED DOLLARS\\)" = "",
"\\(EXCLUDING HOUSEHOLDERS, SPOUSES, AND UNMARRIED PARTNERS\\)" = "",
"\\(SSI\\)" = "",
"\\(INCLUDING LIVING ALONE\\)" = "",
"\\(IN MINUTES\\)" = "",
"\\(DOLLARS\\)" = "",
"\\(CT, ME, MA, MI, MN, NH, NJ, NY, PA, RI, VT, WI\\)" = "--CT, ME, MA, MI, MN, NH, NJ, NY, PA, RI, VT, WI--",
"\\(CAR, TRUCK, OR VAN\\)" = "--CAR, TRUCK, OR VAN--",
"\\(\\)" = ""
)),
race = str_extract(concept, "\\(.+\\)"),
race = str_replace(race, "\\(", ""),
race = str_replace(race, "\\)", ""))
# I should have been able to do this in one line, but it doesn't seem to work:
# race = str_extract(concept, "\\((.*?)\\)"))
B17010_variables <- vars %>%
filter(table_id == "B17010", is.na(race)) %>%
pluck("name")
poverty_acs <- get_acs(geography = "county subdivision", # for CT, that means towns
state = "CT",
county = "New Haven",
geometry = "FALSE", # no map at this time
year = 2017,
survey = "acs5",
variables = B17010_variables[2],
summary_var = B17010_variables[1]) %>%
filter(estimate > 0) %>%
mutate(TOWN = str_replace(NAME, " town, New Haven County, Connecticut", ""),
pct_poverty = estimate / summary_est,
pct_moe = moe_prop(estimate, summary_est, moe, summary_moe)) 

Via the parameters of the function, we told the get_acs function that we want data by town (geography = “county subdivision”), that we wanted the five-year ACS data ending in year 2017 (the latest available as of April, 2019), that we want variable B17010_002 (the estimate of the total number of families below the poverty line). We have also asked that B17010_001 be added as a summary variable, in this case the estimate of the total number of families in the geographic unit. The percentage of the number of families below the poverty line can thus be calculated as estimate divided by summary_est. The tidycensus package also supplies some functions for estimating the margin of error when you combine or summarize estimates. See the tidycensus help for moe_sum and moe_prop for notes on how to estimate margin of error when doing arithmetic operatons on ACS estimates such as addition or using a ratio to calculate a percentage.

pick_towns <- c("Woodbridge", "West Haven", "New Haven", "East Haven",
"Bethany", "Orange", "Milford", "Branford", "Guilford",
"North Haven", "Madison", "Hamden", "North Branford",
"Wallingford")
branford <- poverty_acs %>% filter(TOWN == "Branford") # for example
poverty_acs <- poverty_acs %>%
filter(TOWN %in% pick_towns) %>%
arrange(desc(pct_poverty))
nh_pct_poverty <- percent(poverty_acs$estimate[[1]] / sum(poverty_acs$estimate), accuracy = 1)
nh_pct_families <- percent(poverty_acs$summary_est[[1]] / sum(poverty_acs$summary_est), accuracy = 1)
poverty_formatted <- poverty_acs %>%
arrange(desc(pct_poverty)) %>%
filter(TOWN %in% pick_towns) %>%
mutate(pct_poverty = percent(pct_poverty, accuracy = 1),
pct_moe = paste0("±", percent(pct_moe, accuracy = .1)),
moe = paste0("±", moe),
summary_moe = paste0("±", summary_moe)) %>%
select(GEOID, Town = TOWN, `Below Poverty` = estimate, MOE = moe, `Total # of Families` = summary_est, `MOE of Families` = summary_moe, `% Poverty` = pct_poverty, `MOE of %` = pct_moe)
kable(poverty_formatted, format = "markdown", format.args = list(big.mark = ","),
caption = "Families Below Poverty Line (from Table B17010)") %>%
kableExtra::kable_styling() %>%
kableExtra::footnote(general = "Source: US Census American Community Survey 2013-2017 (variable B17010_002)\ntidycensus R package")
GEOID Town Below Poverty MOE Total # of Families MOE of Families % Poverty MOE of %
0900952070 New Haven 5,038 ±468 24,699 ±584 20% ±1.8%
0900982870 West Haven 1,055 ±267 11,563 ±460 9% ±2.3%
0900922910 East Haven 438 ±194 6,802 ±296 6% ±2.8%
0900935650 Hamden 534 ±187 13,974 ±410 4% ±1.3%
0900947535 Milford 507 ±163 13,841 ±340 4% ±1.2%
0900978740 Wallingford 421 ±148 11,879 ±350 4% ±1.2%
0900907310 Branford 227 ±115 7,325 ±307 3% ±1.6%
0900904580 Bethany 45 ±42 1,611 ±107 3% ±2.6%
0900957600 Orange 97 ±65 3,836 ±117 3% ±1.7%
0900954870 North Haven 137 ±61 6,261 ±216 2% ±1.0%
0900934950 Guilford 122 ±62 6,343 ±193 2% ±1.0%
0900987700 Woodbridge 38 ±32 2,249 ±156 2% ±1.4%
0900944560 Madison 76 ±48 5,066 ±158 2% ±0.9%
0900953890 North Branford 55 ±53 4,075 ±208 1% ±1.3%

The table displays the basic data returned by the call to get_acs. Note that there is a margin of error (MOE) for each count. ACS is based on a sample and is subject to sampling error. Plus or minus the MOE is the 90% confidence interval for the estimate. As one looks at small geographic units, MOE can be a big issue. For example, in this data there are an estimated 7,325 families in Branford, plus or minus 307. That’s a range of about 7%. But for Branford the estimate for the number of families below the poverty line is both small and unreliable: 227 with an MOE of 115. If I had used the one-year ACS rather than the five-year, the MOE would be even worse. In fact, the API will not even return the one-year results for most towns. If I run the get_acs code but with ACS1 rather than ACS5 I only get data returned for New Haven and Waterbury, the two largest towns in the county. The MOE for total number of families in New Haven goes from ±584 for ACS5 to ±1,358 for ACS1. And for the number of families below the poverty line, the MOE goes from ±468 to ±925. The one-year ACS has rules for what data will not be reported because the sample is too small. The ACS5 will give you the results, but that doesn’t mean you should be oblivious to whether a large MOE indicates a comparison you want to make is not valid.

The table confirms that poverty is concentrated in the city. While 21% of families of the 14 towns are in New Haven, 57% of families below the poverty line are in New Haven. And some other tracts with elevated levels of poverty are in East Haven and West Haven adjacent to the city.

Basic Map

At this point we have a simple example of using the Census API to get data from the ACS. Next we are going to do some basic mapping and then show how to use those technique to map data from the ACS.

To begin we will create a basic map for Connecticut towns surrounding the City of New Haven. The Census Bureau provides extensive resources for geographic information and maps via its TIGER (Topologically Integrated Geographic Encoding and Referencing) products. The tigris package provides a simple way to get basic maps (represented as shapefiles) that are most relevant to census tabulations. There are many types of geographic units available. In our case we will focus on county, towns (in the case of New England, county subdivisions in TIGER terminology), and census tracts. Towns are especially relevant in Connecticut (and in New England in general). The rest of the country relies on census tracts within counties.

To create the maps I will use the relatively new sf (simple features) package. The sf package is designed to be tidyverse compatible. One can use the usual dplyr verbs to manipulate sf objects and use the geom_sf function to add maps to ggplot2. Before sf was created, I had to rely on the sp package, which was much less convenient. (There are links to six vignettes describing sf at the CRAN sf package page.)

There are 28 towns, boroughs, or cities in New Haven County. The county goes as far north as Waterbury. I decided to focus on the 14 towns that I think of as being the City of New Haven and its suburbs:

Milford, Orange, Woodbridge, Bethany,
Hamden, West Haven, Wallingford, North Haven,
East Haven, Branford, North Branford, Guilford,
Madison, and New Haven. 

It turns out this list is very similar to the South Central Regional Council of Governments. The one difference is that SCRCOG also includes Meriden, which I excluded as being too far north and therefore outside of the immediate New Haven orbit.

The code in the next section uses tigris to retrieve shapefiles for the census tracts and towns (county subdivisions) in New Haven County. There may be some direct way via tigris to relate towns to census tracts. I had to do it indirectly. I use the st_centroid function to find the center of each census tract and then use the sf_join function with the st_intersects join to join towns to tracts to show which goes with which. Each geographic object in TIGER has a GEOID identifier. The tract_town object has a column for TOWN.GEOID and another for TRACT.GEOID. I turn that into a regular data frame that I can join to other sf objects so that I can associate towns with tracts via tract GEOID.

pick_towns <- c("Woodbridge", "West Haven", "New Haven", "East Haven",
"Bethany", "Orange", "Milford", "Branford", "Guilford",
"North Haven", "Madison", "Hamden", "North Branford",
"Wallingford")
# I'm saving these to make it easier to re-use same code for a different area.
pick_county = "New Haven"
pick_state = "CT"
# set cb = TRUE to keep boundaries tied to the coastline
town_geometry <- county_subdivisions(state = pick_state, county = pick_county, cb = TRUE)
tract_geometry <- tracts(state = pick_state, county = pick_county, cb = TRUE)
# let's find which tract is in which town
tract_centroid = st_centroid(tract_geometry)
# tract town has the geometry of town, not tract
tract_town <-st_join(town_geometry, tract_centroid,
join = st_intersects,
suffix = c(".TOWN", ".TRACT"))
tract_town_df <- tract_town %>%
as_tibble() %>%
select(TOWN = NAME.TOWN, GEOID = GEOID.TRACT) %>%
mutate(near_nh = (TOWN %in% pick_towns))
area_tracts <- tract_geometry %>%
left_join(tract_town_df, by = "GEOID") %>%
filter(TOWN %in% pick_towns)
area_towns <- town_geometry %>% filter(NAME %in% pick_towns)
area_town_centroid <- st_centroid(area_towns) # use to place town labels

Create a Map of New Haven Area

Next we will use ggplot2 and the geom_sf geom to create a map from the sf objects that contain the shapefiles for tracts and towns around New Haven.

ggplot() +
geom_sf(data = area_tracts,
fill = "gray", colour = "darkgray", show.legend = FALSE) +
geom_sf(data = area_towns, colour = "yellow", fill = NA) +
geom_sf_text(data = area_town_centroid, aes(label = NAME), color = "yellow") +
coord_sf(datum = NA, label_axes = "----") +
xlab("") + ylab("") + theme_minimal() +
labs(title = "New Haven Area Towns",
subtitle = "with census tract boundaries",
caption = "Source: US Census, tidycensus package") 

We Have a Map; Let’s Display Some Census Data

I am going to retrieve the poverty data at the level of census tract and display the result. Keep in mind that the margin of error for a particular tract will be quite large. Comparing similar census tracts may be misleading, although it still should be clear that the poor areas in the cities are much different than the suburbs. Let’s try it and see what happens and afterward we will explore the issue of margin of error a bit more.

I have redone the get_acs query we did by town, only this time the geography unit will be tracts rather than county subdivisions.

poverty_tracts <- get_acs(geography = "tract",
state = "CT",
county = "New Haven",
geometry = "TRUE", # yes, get tract shapefiles
year = 2017,
survey = "acs5",
variables = B17010_variables[2],
summary_var = B17010_variables[1]) %>%
filter(estimate > 0) %>%
mutate(pct_poverty = estimate / summary_est,
pct_moe = moe_prop(estimate, summary_est, moe, summary_moe)) %>%
left_join(tract_town_df, by = "GEOID") %>%
filter(TOWN %in% pick_towns)
# poverty_tracts is similar to poverty_acs, but includes geometry and limits to area towns
ggplot() +
geom_sf(data = poverty_tracts,
aes(fill = pct_poverty), colour = "lightgray") +
scale_fill_viridis_c(option = "plasma", direction = -1, name = "Pct Poverty", begin = 0.1) +
geom_sf(data = area_towns, colour = "darkgray", fill = NA) +
geom_sf_text(data = area_town_centroid, aes(label = NAME), color = "darkgray") +
coord_sf(datum = NA, label_axes = "----") +
xlab("") + ylab("") + theme_minimal() +
labs(title = "Percentage of Families Below the Poverty Line",
subtitle = "by census tract (margin of error by tract may be large)",
caption = "Source: US Census American Community Survey 2013-2017 (variable B17010_002)\ntidycensus R package") 

In the map that shows poverty by census tract, there are six tracts that are missing from the data. They show up as white areas on the map. Presumably data has been suppressed. One of the missing tracts is where I used to live in New Haven. Families below the poverty line may have been hard to find there. I don’t know how the suppression policies work for the five-year ACS so I don’t know why these tracts are missing.

# INCOME IN THE PAST 12 MONTHS B07411_0.5
income_tracts <- get_acs(geography = "tract",
state = "CT",
county = "New Haven",
geometry = "TRUE",
year = 2017,
survey = "acs5",
variables = "B19013_001") %>%
filter(estimate > 0) %>%
left_join(tract_town_df, by = "GEOID") %>%
filter(TOWN %in% pick_towns)
ggplot() +
geom_sf(data = income_tracts,
aes(fill = estimate), colour = "gray") +
scale_fill_viridis_c(option = "plasma", direction = -1, name = "Income", begin = 0.1,
trans = "log", breaks = c(20000, 30000, 50000, 100000, 150000)) +
geom_sf(data = area_towns, colour = "white", fill = NA, size = 0.5) +
geom_sf_text(data = area_town_centroid, aes(label = NAME), color = "darkgray") +
coord_sf(datum = NA, label_axes = "----") +
xlab("") + ylab("") + theme_minimal() +
labs(title = "Median Household Income (Log Scale)",
subtitle = "by census tract (margin of error by tract may be large)",
caption = "Source: US Census American Community Survey 2013-2017 (variable B19013_001)\ntidycensus R package") 

Because the ACS is based on a sample, one cannot ignore the margin of error. The next plot has shows the median income for each census tract within each town drawn with a line that represents minus and plus the margin error. In effect this shows a 90% confidence interval. Where the lines overlap you should not draw the conclusion that one census tract is higher than another. On the other hand, one see cases where it is clear there is a difference. Track 1571 in Orange is unambiguously lower than the other tracts in Orange.

area_towns_order <- c("Bethany", "Woodbridge", "Wallingford", "North Branford",
"Orange", "Hamden", "North Haven", "Guilford",
"West Haven", "New Haven", "Branford", "Madison",
"Milford", "East Haven")
income_tracts <- income_tracts %>%
mutate(tract_name = str_sub(NAME, 8, (str_locate(NAME, ",")[, 1] - 1)),
TOWN = factor(TOWN, levels = area_towns_order))
ggplot(data = income_tracts, aes(x = estimate/1000, y = fct_reorder(tract_name, estimate))) +
geom_point() +
geom_errorbarh(mapping = aes(xmin = (estimate - moe)/1000, xmax = (estimate + moe)/1000), height = 0) +
facet_wrap(~ TOWN, scales = "free_y") +
ylab(NULL) +
labs(title = "90% Confidence Interval for Median Household Income",
subtitle = "(Income is Displayed on a Log Scale)",
caption = "Source: US Census American Community Survey 2013-2017 (variable B19013_001)\ntidycensus R package") +
scale_x_log10(name = "Income (000's)", breaks = c(20, 30, 50, 100, 150)) +
theme(axis.text.x = element_text(size=6), axis.text.y = element_text(size=5))

Now for Something Completely Different

The next ACS query will focus on mode of transportation to work. This is an example of the breadth of the survey. In addition to the material shown here, there’s also a lot on the time it takes to get to work.

vars2 <- vars %>% filter(table_id == "B08006") %>%
separate(label, into = paste0("t", seq(1, 8)), remove = FALSE, sep = "!!") %>%
select(-t1)
# to supplement picking out variables manually, in this step we will parse the
# label and use that info to help select variables.
vars2 <- vars2 %>% filter(is.na(t4)) %>%
mutate(commute_mode = case_when(
is.na(t3) ~ "All workers",
str_detect(t3, "^Car") ~ "Vehicle",
str_detect(t3, "Public") ~ "Public Transport",
str_detect(t3, "Bicycle") ~ "Bicycle",
str_detect(t3, "Walked") ~ "Walked",
str_detect(t3, "Taxi") ~ "Taxi, motorcycle",
str_detect(t3, "^Worked") ~ "At home",
TRUE ~ "Other")
)
for_summary <- vars2 %>% filter(commute_mode == "All workers") %>% pluck("name")
vars2 <- vars2 %>% filter(commute_mode != "Other", commute_mode != "All workers")
# At this point, variables are in vars2$name
# SEX OF WORKERS BY MEANS OF TRANSPORTATION TO WORK B08006
commuters <- get_acs(geography = "county subdivision", # for CT, that means towns
state = "CT",
county = "New Haven",
geometry = "FALSE",
year = 2017,
survey = "acs5",
variables = vars2$name,
summary_var = for_summary) %>%
filter(estimate > 0) %>%
left_join(town_geometry %>% select(-NAME), by = "GEOID") %>%
mutate(TOWN = str_replace(NAME, " town, New Haven County, Connecticut", ""),
pct = estimate / summary_est,
pct_moe = moe_prop(estimate, summary_est, moe, summary_moe)) %>%
filter(TOWN %in% pick_towns) %>%
left_join(vars2 %>% select(variable = name, commute_mode), by = "variable")
commuters_table <- commuters %>%
select(commute_mode, pct, TOWN) %>%
spread(key = commute_mode, value = pct) %>%
arrange(Vehicle) %>%
mutate_if(is.numeric, percent)

Transportation to Work

The ACS reports the mode of transportation to work. For workers, the table below shows the percentage who rely primarily on each form of transportation (plus a column for workers who work at home). The table is sorted by the percentage who travel by car.

kable(commuters_table, caption = "Mode of Transportation to Work") %>%
kableExtra::kable_styling() %>%
kableExtra::footnote(general = "Source: US Census American Community Survey 2013-2017 (variable B08006)\ntidycensus R package")
(#tab:commuter_table)Mode of Transportation to Work
TOWN At home Bicycle Public Transport Taxi, motorcycle Vehicle Walked
New Haven 4.18% 3.06% 12.4% 0.99% 66.8% 12.6%
Madison 9.88% 0.71% 3.6% 0.96% 84.3% 0.5%
West Haven 3.37% 0.10% 6.2% 0.22% 86.3% 3.8%
Guilford 7.88% NA% 2.3% 0.37% 87.7% 1.7%
Hamden 3.08% 0.63% 4.2% 0.43% 88.2% 3.5%
Milford 4.73% 0.02% 5.0% 0.54% 88.4% 1.4%
Woodbridge 7.70% 0.45% 2.8% 0.22% 88.4% 0.4%
Orange 5.91% 0.22% 3.0% 0.49% 89.7% 0.6%
Bethany 6.79% NA% 1.2% 0.71% 90.6% 0.7%
East Haven 3.72% 0.08% 2.7% 0.54% 90.7% 2.3%
Branford 4.18% 0.15% 2.1% 0.71% 90.8% 2.1%
Wallingford 4.90% 0.00% 0.7% 0.52% 91.9% 2.0%
North Haven 4.10% NA% 2.3% 1.22% 92.1% 0.3%
North Branford 2.60% NA% 0.3% 0.32% 96.2% 0.6%
Note:
Source: US Census American Community Survey 2013-2017 (variable B08006)
tidycensus R package
white_not_hispanic <- "B01001H_001"
white_alone <- "B02008_001"
black <- "B02009_001"
asian <- "B02011_001"
hispanic <- "B01001I_001"
not_us <- "B05001_006"
# let's get the variables into a data frame where they will be easier to work with
rcodes <- tribble(
~code, ~variable,
"white_not_hispanic" , "B01001H_001",
"white_alone" , "B02008_001",
"black" , "B02009_001",
"asian" , "B02011_001",
"hispanic" , "B01001I_001",
"not_us" , "B05001_006")
race_town2 <- get_acs(geography = "county subdivision", # for CT, that means towns
state = "CT",
county = "New Haven",
geometry = "FALSE",
year = 2017,
survey = "acs5",
variables = c(white_not_hispanic, white_alone, black, hispanic, asian, not_us),
summary_var = "B01001_001") %>%
left_join(rcodes, by = "variable") %>%
mutate(race = factor(code,
levels = c("white_alone", "white_not_hispanic", "hispanic", "black", "asian", "not_us"),
labels = c("White alone", "White not Hispanic", "Hispanic", "Black", "Asian", "Not US citizen")),
pct = estimate / summary_est,
pct_moe = moe_prop(estimate, summary_est, moe, summary_moe))
race_town <- town_geometry %>% left_join(race_town2 %>% select(-NAME), by = "GEOID") %>%
filter(NAME %in% pick_towns)
race_table <- as_tibble(race_town %>%
select(Town = NAME, code, summary_est, pct, pct_moe, race))
race_table <- race_table %>%
mutate(pct = percent(pct, accuracy = 1.0)) %>%
select(Town, race, Percent = pct) %>%
spread(key = race, value = Percent) %>%
arrange(`White not Hispanic`) 

Race by Town

The following table shows race by town. The ACS has a lot of detail on race, more than is presented in this table. The Census Bureau (and other agencies) now use what is referred to as the “two question” format for asking about race. The first question is whether or not the individual is Hispanic and then a second question asks about race/ethnicity. On the second question, the individual can select multiple categories. In many tables the non-Hispanic categories are presented as “alone”, meaning the individual chose only that category. But because Hispanic is a second question, “White alone” (or other alone categories) can also self-identify as Hispanic. In the table below, “White not Hispanic” means the person selected “White” (or some other race category) and did not select Hispanic. “White alone” means the person selected only “White” among the race categories regardless of the response to the Hispanic question. Notice that for New Haven, “White not Hispanic” is 30% of the population while “White alone” is 46%. A substantial number of Hispanics also said White.

The citizenship category is completely separate from race and simply indicates whether a respondent was not a US citizen.

Note that the categories as I have defined them here can add up to more than the total population. This is a tricky area of the ACS. There are some other tables that get into the issue of how the multiple categories are used.

race_table %>% kable(caption = "Percent Race by Town (Categories are not mutually exclusive)") %>%
kableExtra::kable_styling() %>%
kableExtra::footnote(general = "Source: US Census American Community Survey 2013-2017\nvariables B01001H_001, B02008_001, B02009_001, B02011_001, B01001I_001, B05001_006\ntidycensus R package")
(#tab:race_table)Percent Race by Town (Categories are not mutually exclusive)
Town White alone White not Hispanic Hispanic Black Asian Not US citizen
New Haven 46% 30% 30% 35% 5% 12%
West Haven 67% 51% 21% 24% 4% 9%
Hamden 65% 58% 11% 25% 6% 7%
East Haven 86% 77% 15% 4% 5% 3%
Woodbridge 82% 77% 5% 2% 16% 6%
North Haven 88% 83% 5% 4% 8% 3%
Milford 90% 84% 7% 4% 6% 5%
Wallingford 92% 85% 8% 2% 5% 2%
Bethany 94% 86% 6% 1% 6% 3%
Orange 91% 87% 2% 2% 8% 4%
Branford 93% 88% 5% 2% 5% 3%
North Branford 94% 89% 5% 4% 2% 1%
Guilford 96% 91% 4% 1% 4% 2%
Madison 96% 93% 1% 1% 4% 3%
Note:
Source: US Census American Community Survey 2013-2017
variables B01001H_001, B02008_001, B02009_001, B02011_001, B01001I_001, B05001_006
tidycensus R package

Next we will see multiple plots that plot each of these categories on the map of towns around New Haven. We can see that the Black population is primarily in New Haven, West Haven, and Hamden. I was surprised to see that there is a notable Asian population in Woodbridge. As we saw in the table, most of the outer suburban towns are very, very white.

ggplot() +
geom_sf(data = race_town,
aes(fill = pct), colour = "lightgray") +
# using the begin parameter of viridis prevents the extremem values from looking almost black
scale_fill_viridis_c(option = "plasma", direction = -1, name = "Pct", begin = 0.2) +
geom_sf(data = area_towns, colour = "darkgray", fill = NA, size = 0.05) +
geom_sf_text(data = area_town_centroid, aes(label = NAME), color = "darkgray", size = 2) +
coord_sf(datum = NA, label_axes = "----") +
xlab("") + ylab("") + theme_minimal() +
facet_wrap(~ race) +
labs(title = "Percentage of Population for a Set of Sub-Groups",
# subtitle = "by census tract (margin of error by tract may be large)",
caption = "Source: US Census American Community Survey 2013-2017 (variables B01001H_001, B02008_001, B02009_001, B02011_001, B01001I_001, B05001_006)\ntidycensus R package") 

As an experiment, I redid the race plot to show detail by census tract not just by town. Of course the overall impression is similar to the town level plots. As I have said before, as one gets down to the level of individual tracts the fact that we are dealing with a sample with a margin of error is an important reason not to over-interpret small vaiations.

race_tracts <- get_acs(geography = "tract",
state = "CT",
county = "New Haven",
geometry = "TRUE",
year = 2017,
survey = "acs5",
variables = c(white_not_hispanic, white_alone, black, hispanic, asian, not_us),
summary_var = "B01001_001") %>%
filter(estimate > 0) %>%
left_join(tract_town_df, by = "GEOID") %>%
filter(TOWN %in% pick_towns) %>%
mutate(pct = estimate / summary_est,
pct_moe = moe_prop(estimate, summary_est, moe, summary_moe)) %>%
left_join(rcodes, by = "variable") %>%
mutate(race = factor(code,
levels = c("white_alone", "white_not_hispanic", "hispanic", "black", "asian", "not_us"),
labels = c("White alone", "White not Hispanic", "Hispanic", "Black", "Asian", "Not US citizen")))
ggplot() +
geom_sf(data = race_tracts,
aes(fill = pct), colour = NA) +
# using the begin parameter of viridis prevents the extremem values from looking almost black
scale_fill_viridis_c(option = "plasma", direction = -1, name = "Pct", begin = 0.2) +
geom_sf(data = area_towns, colour = "darkgray", fill = NA, size = 0.05) +
geom_sf_text(data = area_town_centroid, aes(label = NAME), color = "darkgray", size = 2) +
coord_sf(datum = NA, label_axes = "----") +
xlab("") + ylab("") + theme_minimal() +
facet_wrap(~ race) +
labs(title = "Percentage of Population for a Set of Sub-Groups",
subtitle = "by census tract (margin of error by tract may be large)",
caption = "Source: US Census American Community Survey 2013-2017\n(variables B01001H_001, B02008_001, B02009_001, B02011_001, B01001I_001, B05001_006)\ntidycensus R package") 

As a final step I focused only on the City of New Haven. By focusing on one town only, we can see more detail by census tract. Within the city, some tracts are very white and others are very non-white. The east side of the city tends to be more Hispanic (including the Fair Haven area) and the west side tends to be more Black (the Hill). The census tract where I lived for over 25 years near East Rock is almost as white as the outer suburbs. Two tracts to the west is the blackest tract in New Haven (which I assume is Newhallville).

As before, it is important to remember that at the level of individual census tracts the ACS sample results may not be exactly precise. Small differences between tracts may be within the margin of error.

# I explored possibilit of adding roads, but didn't like it.
# nh_roads <- roads("CT", "New Haven") %>%
# filter(RTTYP %in% c("I", "S", "U"))
# nh_roads <- st_intersection(nh_roads, area_towns %>% filter(NAME == "New Haven"))
nh_focus <- race_tracts %>% filter(TOWN == "New Haven")
ggplot() +
geom_sf(data = nh_focus,
aes(fill = pct), colour = NA) +
# using the begin parameter of viridis prevents the extremem values from looking almost black
scale_fill_viridis_c(option = "plasma", direction = -1, name = "Pct", begin = 0.2) +
geom_sf(data = area_towns %>% filter(NAME == "New Haven"),
colour = "darkgray", fill = NA, size = 0.05) +
geom_sf_text(data = area_town_centroid %>% filter(NAME == "New Haven"),
aes(label = NAME), color = "darkgray", size = 4) +
coord_sf(datum = NA, label_axes = "----") +
xlab("") + ylab("") + theme_minimal() +
facet_wrap(~ race) +
labs(title = "Focus on New Haven: Percentage of Population for a Set of Sub-Groups",
subtitle = "by census tract (margin of error by tract may be large)",
caption = "Source: US Census American Community Survey 2013-2017\n(variables B01001H_001, B02008_001, B02009_001, B02011_001, B01001I_001, B05001_006)\ntidycensus R package") 

Conclusion

The point of all this was to try out the R-based techniques for both extracting and displaying ACS data. My primary aim wasn’t to tell a story about life around New Haven. But the plots do emphasize that where people live is highly correlated with important variables such as income and race.

That comes as no surprise, but perhaps it was even more true than I assumed. I live with these patterns everyday and take them for granted. The plots make the existence of these patterns more salient.

Sources

In the text I have tried to give some tips on how to find one’s way among the abundance of topics and variables in the ACS. Of course there is much, much more info at census.gov.

All of this relies on the tidycensus and tigris packages by Kyle Walker. Several years ago I did a wee bit of work on making maps with census data without tidycensus or the sf package. It was much more difficult! The tools that Kyle has provided make it all much easier. I took his census course at DataCamp and found it to be very helpful.

In addition to tidycensus there is also a censusapi package that provides access to Census Bureau API’s other than the ACS or the decennial census.


  1. I created this table in RMarkdown by pasting a table copied from Excel into this page which converted it automatically into markdown table format. Very handy! I had to add a HTML spaces to get it to indent properly.

To leave a comment for the author, please follow the link and comment on their blog: R on Can I Blog Too.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)