Exploring San Francisco with choroplethrZip

April 7, 2015

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Ari Lamstein


Today I will walk through an analysis of San Francisco Zip Code Demographics using my new R package choroplethrZip. This package creates choropleth maps of US Zip Codes and connects to the US Census Bureau. A choropleth is a map that shows boundaries of regions (such as zip codes) and colors those regions according to some metric (such as population).

Zip codes are a common geographic unit for businesses to work with, but rendering them is difficult. Official zip codes are maintained by the US Postal Service, but they exist solely to facilitate mail delivery. The USPS does not release a map of them, they change frequently and, in some cases, are not even polygons. The most authoritative map I could find of US Zip codes was the Census Bureau’s Map of Zip Code Tabulated Areas (ZCTAs). Despite shipping with only a simplified version of this map (60MB instead of 500MB), choroplethrZip is still too large for CRAN. It is instead hosted on github, and you can install it from an R console like this:

# install.github(“devtools”)
install_github('arilamstein/[email protected]')

The package vignettes (1, 2) explain basic usage. In this article I’d like to demonstrate a more in depth example: showing racial and financial characteristics of each zip code in San Francisco. The data I use comes from the 2013 American Community Survey (ACS) which is run by the US Census Bureau. If you are new to the ACS you might want to view my vignette on Mapping US Census Data.

Example: Race and Ethnicity

One table that deals with race and ethnicity is B03002 – Hipanic or Latino Origin by Race. Many people will be surprised by the large number of categories. This is because the US Census Bureau has a complex framework for categorizing race and ethnicity. Since my purpose here is to demonstrate technology, I will simplify the data by only dealing with only a handful of the values: Total Hispanic or Latino, White (not Hispanic), Black (not Hispanic) and Asian (not Hispanic).

The R code for getting this data into a data.frame can be viewed here, and the code for generating the graphs in this post can be viewed here. Here is a boxplot of the ethnic breakdown of the 27 ZCTAs in San Francisco.


This boxplot shows that there is a wide variation in the racial and ethnic breakdown of San Francisco ZCTAs. For example, the percentage of White people in each ZCTA ranges from 7% to 80%. The percentage of black and hispanics seem to have a tighter range, but also contains outliers. Also, while Asian Americans only make up the 5% of the total US population, the median ZCTA in SF is 30% Asian.

Viewing this data with choropleth maps allows us to associate locations with these values.  


Example: Per Capita Income

When discussing demographics people often ask about Per Capita Income. Here is a boxplot of the Per capita Income in San Francisco ZCTAs.


The range of this dataset – from $15,960 to $144,400 – is striking. Equally striking is the outlier at the top. We can use a continuous scale choropleth to highlight the outlier.  We can also use a four color choropleth to show the locations of the quartiles.



The outlier for income is zip 94105, which is where a large number of tech companies are located. The zips in the southern part of the city tend to have a low income.

Further Exploration: Interactive Analysis with Shiny

After viewing this analysis readers might wish to do a similar analysis for the city where they live. To facilitate this I have created an interactive web application. The app begins by showing a choropleth map of a random statistic of the Zips in a random Metropolitan Statistical Area (MSA). You can choose another statistic, zoom in, or select another MSA.

In the event that the application does not load (for example, if I reach my monthly quota at my hosting company) then you can run the app from source, which is available here.


I hope that you have enjoyed this exploration of Zip code level demographics with choroplethrZip. I also hope that it encourages more people to use R for demographic statistics.

Ari Lamstein is a software engineer and data analyst in San Francisco. He blogs at Just an R Blog.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)