Jay Ulfelder, PhD, serves as Program Manager for the Nonviolent Action Lab, part of the Carr Center for Human Rights Policy at the Harvard Kennedy School. He has used R to work at the intersection of social science and data science for nearly two decades.
Where are people in the United States protesting in 2020, and what are they protesting about? How large have those crowds been? How many protesters have been arrested or injured? And how does this year’s groundswell of protest activity compare to the past several years, which had already produced some of the largest single-day gatherings in U.S. history?
These are the kinds of questions the Crowd Counting Consortium (CCC) Crowd Dataset helps answer. Begun after the 2017 Women’s March by Professors Erica Chenoweth (Harvard University) and Jeremy Pressman (University of Connecticut), the CCC’s database on political crowds has grown into one of the most comprehensive open sources of near-real-time information on protests, marches, demonstrations, strikes, and similar political gatherings in the contemporary United States. At the start of August 2020, the database included nearly 50,000 events. These data have been used in numerous academic and press pieces, including a recent New York Times story on the historic scale of this year’s Black Lives Matter uprising.
As rich as the data are, they have been a challenge to use. The CCC shares its data on political crowds via a stack of monthly Google Sheets whose formats vary from sheet to sheet in small but confounding ways. Column names don’t always match, and certain columns have been added or dropped over time. Some sheets include separate tabs for specific macro-events or campaigns (e.g., a coordinated climate strike), while others group everything in a single tab. And, of course, typos happen.
To make this tremendous resource more accessible to researchers, activists, journalists, and data scientists, the Nonviolent Action Lab at Harvard’s Carr Center for Human Rights Policy—a new venture started by CCC co-founder Chenoweth—has created a GitHub repository to host a compiled, cleaned, and augmented version of the CCC’s database.
In addition to all the information contained in the monthly sheets, the compiled version adds two big feature sets.
Geolocation. After compiling the data, the Lab uses the googleway package to run the cities and towns in which the events took place through the Google Maps Geocoding API and extracts geocoordinates from the results, along with clean versions of the locality, county, and state names associated with them.
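For readers curious about the mechanics, a geocoding pass along these lines can be sketched with googleway. The data frame `ccc`, its `resolved_locality` column, and the `MAPS_KEY` environment variable are illustrative assumptions for this sketch, not a description of the Lab’s actual pipeline:

```r
library(googleway)  # R client for Google Maps web services
library(dplyr)
library(purrr)

# Geocode one "City, State" string and return its coordinates,
# assuming a valid Maps API key; failed lookups come back as NA
geocode_locality <- function(locality, key) {
  res <- google_geocode(address = locality, key = key)
  if (res$status != "OK") {
    return(tibble(lat = NA_real_, lon = NA_real_))
  }
  loc <- res$results$geometry$location[1, ]  # take the top match
  tibble(lat = loc$lat, lon = loc$lng)
}

# Map over the distinct localities in a compiled data frame `ccc`
# (hypothetical), then join the coordinates back onto the events:
# coords <- ccc %>%
#   distinct(resolved_locality) %>%
#   mutate(geocode_locality(resolved_locality, key = Sys.getenv("MAPS_KEY")))
```

Geocoding only the distinct localities, rather than every event row, keeps API usage (and cost) down, since many events share a city.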
Issue Tags. In the original data, protesters’ claims—in other words, what the protest is about—are recorded by human coders in a loosely structured way. To allow researchers to group or filter the data by theme, the Nonviolent Action Lab maintains a dictionary of a few dozen major political issues in the U.S. (e.g., “guns”, “migration”, “racism”) and keyword- and keyphrase-based regular expressions associated with them. By mapping this dictionary over the claim strings, we generate a set of binary issue tags that is added to the compiled database as a column of semicolon-separated strings.
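The tagging idea can be illustrated in a few lines of base R. The miniature `issue_dict` below and its patterns are invented for illustration; the Lab’s actual dictionary covers a few dozen issues and is considerably more refined:

```r
# Toy issue dictionary: names are issue tags, values are regex patterns
issue_dict <- c(
  guns      = "gun control|firearm|second amendment",
  migration = "immigra|migrant|border wall|deportation",
  racism    = "black lives matter|racial justice|racis|police brutality"
)

# Return the matching issue tags for one claim string,
# collapsed into a single semicolon-separated string
tag_issues <- function(claim, dict) {
  hits <- names(dict)[vapply(dict, grepl, logical(1), x = tolower(claim))]
  paste(hits, collapse = "; ")
}

tag_issues("march for racial justice and against police brutality", issue_dict)
# "racism"
tag_issues("rally against gun control and for a border wall", issue_dict)
# "guns; migration"
```

Because each pattern is a plain regular expression, coverage can be improved over time by editing the dictionary without touching the tagging code.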
To make the CCC data more accessible to a wider audience, the Lab has also built a Shiny dashboard that lets users filter events in various ways and then map and plot the results. Users can filter by date range, year, or campaign, as well as by issue and political valence (pro-Trump, anti-Trump, or neither).
The dashboard has two main tabs. The first uses the leaflet package to map the events with markers that contain summary details and links to the source(s) the human coders used to research them.
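A stripped-down version of that mapping step might look like the following leaflet sketch, using made-up events and popup text rather than the dashboard’s real inputs:

```r
library(leaflet)

# Hypothetical mini data frame standing in for the filtered CCC events
events <- data.frame(
  locality = c("Boston, MA", "Hartford, CT"),
  lat      = c(42.3601, 41.7658),
  lon      = c(-71.0589, -72.6734),
  popup    = c("Example event<br/><a href='https://example.com'>source</a>",
               "Example event<br/><a href='https://example.com'>source</a>")
)

# Draw one clickable marker per event; the popup holds the summary
# details and source links described above
leaflet(events) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lon, lat = ~lat, popup = ~popup)
```

The `~` formula notation tells leaflet to pull each aesthetic from the corresponding column of the supplied data frame.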
The second tab uses the plotly and streamgraph htmlwidget packages to render interactive plots of trends over time in the occurrence of the selected events, the number of participants in them, and the political issues associated with them.
The point of the Nonviolent Action Lab’s repository and dashboard is to make the Crowd Counting Consortium’s data more accessible and more useful to as wide an audience as possible. If you use either of these resources and find bugs or errors or have suggestions on how to improve them, please let us know.