[This article was first published on data science ish, and kindly contributed to R-bloggers].
I saw this tweet making the rounds this past week.
Interesting! I saw people using this map to make the argument that the Electoral College was super important, or a terrible idea, or any of a number of other sociopolitical thoughts. This map certainly caught my attention and made me want to know more about this kind of population density distribution.
Census Population Data
I use Census data from the American Community Survey a lot for my work, so let’s get the ACS population estimates for all the counties in the United States. I’m going to use the most recent 5-year estimates, and let’s do some munging so we have FIPS codes for the mapping. (If you haven’t used the acs package before, you will need to get an API key and run api.key.install() one time to install your key on your system.)
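As a rough sketch of the FIPS munging step only: a county FIPS code is the two-digit state code followed by the three-digit county code, both zero-padded. The data frame `acs_toy` and its column names are hypothetical stand-ins, and the population values are made up; this does not call the acs package or the Census API.

```r
# Hypothetical stand-in for a table of county estimates.
# The populations here are invented, not ACS numbers.
acs_toy <- data.frame(
  state  = c(6, 48, 1),
  county = c(37, 201, 1),
  total  = c(1000, 800, 50)
)

# Build a 5-character FIPS string: 2-digit state + 3-digit county,
# zero-padded so leading zeros are not lost.
acs_toy$fips <- sprintf("%02d%03d",
                        as.integer(acs_toy$state),
                        as.integer(acs_toy$county))

acs_toy$fips
```

Keeping the code as a character string matters because map data typically stores FIPS codes as text, and a numeric 6037 would not match "06037" in a join.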
There! Now we have the population in each county.
Making Some Maps
For the mapping, I’m going to use Bob Rudis’ albersusa package. It has some really nice map projections for the United States and was great to work with. It turns out that Bob did package up some population numbers with the maps, but we’ll use our ACS data here instead.
Here we already see how unevenly the U.S. population is distributed. There are almost 10 million people in Los Angeles County, while other large cities like Dallas, Houston, Chicago, and New York are just barely visible with this linear color mapping.
Now we want to find the most populous counties where the top half of the U.S. population lives. Let’s make a copy of the data frame that we’ve used for the mapping, find the total population (and check it, just for sanity’s sake), sort the data frame by population, and then calculate a cumulative sum for the population.
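The steps just described can be sketched in base R on a tiny made-up table. The name `counties_toy`, its columns, and the numbers are all assumptions for illustration, and the ascending sort order is a choice made here rather than something the text specifies.

```r
# Five hypothetical counties with invented populations.
counties_toy <- data.frame(
  name  = c("A", "B", "C", "D", "E"),
  total = c(50, 12, 25, 5, 8)
)

# Total population, worth checking against a known figure for sanity.
total_pop <- sum(counties_toy$total)

# Sort from least to most populous, then accumulate.
counties_toy <- counties_toy[order(counties_toy$total), ]
counties_toy$cumsum <- cumsum(counties_toy$total)

counties_toy
```

The last value of the cumulative sum should equal the total population, which makes for a cheap consistency check.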
Those are some SMALL counties. WOW. Where is the halfway point, i.e., the point where the cumulative sum goes from being less than half of the total population to more than half?
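One way to locate that crossing point, sketched in base R with invented numbers: take the first position where the running total exceeds half of the overall total. Assigning the county that straddles the halfway mark to the high population group is an assumption of this sketch.

```r
# Invented county populations, already sorted from smallest to largest.
total   <- c(5, 8, 12, 30, 45)
running <- cumsum(total)

# Half of the overall population.
half <- sum(total) / 2

# Index of the first county whose running total is more than half.
halfway <- which(running > half)[1]

# Everything from that index onward counts as "high population" here.
high_pop <- seq_along(total) >= halfway

halfway
high_pop
```

With roughly 3,000 real counties the straddling county barely changes the split, but with a toy table this small it moves the balance noticeably.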
Now let’s map it.
First off, we have exactly reproduced the map in the tweet. This is maybe not entirely surprising because I am pretty sure that the people who made the map in the tweet also used ACS population estimates.
I’ve lived in half a dozen places over the course of my life, and with the exception of the four years I spent as an undergrad in a pretty small college town, all of them have been in the darker, high population counties. Probably a lot of you have too! That’s what makes them more populous, I suppose.
What really motivated me to work on this is that I wanted to learn a bit more about how this picture of the population distribution changes as the cutoff moves. I made a Shiny app where the user chooses (via a slider) what percentage of the population to use as a break between high and low population counties.
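The logic behind such a slider can be sketched as a plain function, without any Shiny code. The function name `flag_high_pop` and the meaning of `pct` (the percentage of the total population assigned to the low population group) are assumptions made for this illustration, not the app's actual implementation.

```r
# Flag counties as high population for a given break percentage.
# `total` is a vector of county populations in any order;
# `pct` is the share (0-100) of people placed in the low population group.
flag_high_pop <- function(total, pct) {
  stopifnot(pct > 0, pct < 100)
  ord     <- order(total)              # smallest county first
  running <- cumsum(total[ord])        # running total in sorted order
  cutoff  <- sum(total) * pct / 100
  high <- logical(length(total))
  high[ord] <- running > cutoff        # map flags back to input order
  high
}

# Invented populations for five hypothetical counties.
pops <- c(50, 12, 25, 5, 8)
flag_high_pop(pops, 50)
flag_high_pop(pops, 20)
```

In an app, the slider value would be passed in as `pct` and the resulting logical vector used to pick the fill color for each county.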
The app is most interesting with the slider in the range of about 30% to 70%, I think; the United States is remarkably urban, at least to me. I would never have argued with someone who told me that the population is concentrated in cities, of course, but the population is more unevenly distributed than I would have predicted. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!