Hourly Subway Station Flows

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers].

Pie charts are bad, as any fule kno. We’re not as good at judging relative differences between angles and areas as we are at judging relative differences in lengths on a common baseline. This is especially true when we have more than two things to compare at the same time. So, as a rule, you shouldn’t use them. You should figure out some other way of viewing your data instead. On the other hand, I just made 424 animated pie charts because if you’re going to break a rule you should break it good and hard.

A view of the New York City Subway System (excluding the SIR). We’ll animate this in just a minute.

The New York City Subway system is very large and carries a lot of passengers every day. The MTA makes quite a bit of data available about the subway, including data on hourly flow through the system. Now, the MTA can’t track individual pathways people take through the subway. If you use an OMNY card (or before that, a MetroCard) to enter the system, this signals the start of a trip from some specific station or station complex. But unlike some systems, you don’t need to “tag out” of the subway; you just exit through a turnstile. So the system doesn’t know where you leave it. In addition, while many stations are just on a single line, some (like 34 St/Penn Station, or Fulton Street) are station complexes that serve many lines and allow transfers between them.

However, the MTA does publish hourly Origin-Destination estimates for all pairs of stations. These are their best guess about the flow of traffic from any particular station to any other. Because there are so many combinations, visualizing that sort of data is quite tricky. Even then, you don’t get information about routes through the system, just start and end points. Transit analysts and planners can go further by introducing additional assumptions about subway users. For example, we might assume that commuters take the most efficient route between any given pair of entry and exit stations, and build from there to a picture of flow through the system.

I do something rather simpler here. I use the MTA’s hourly origin-destination estimates and aggregate them on a station-by-station basis to calculate in-and-out flows across 424 subway stations or station complexes. These specific numbers are averaged over all Mondays in 2025. For each hour of the day we calculate the total passenger volume at the station, and the shares of that volume that are estimated arrivals and estimated departures. Then we draw a pie chart for each station, coloring it yellow for departures, purple for arrivals. The circle size reflects total volume and the pie slice proportions show the flow balance.
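Here is a minimal sketch of that station-by-station aggregation, using dplyr on a tiny made-up table. The column names match the MTA data; the toy numbers and the `station_id` name are invented for illustration. It also assumes that trips starting at a station count as its departures and trips ending there count as its arrivals, and it leaves out the averaging over Mondays.

```r
library(dplyr)

# Toy origin-destination estimates: three stations, one hour (made-up values).
od <- tibble(
  hour_of_day = c(8L, 8L, 8L, 8L),
  origin_station_complex_id = c(1L, 1L, 2L, 3L),
  destination_station_complex_id = c(2L, 3L, 3L, 1L),
  estimated_average_ridership = c(10, 30, 20, 40)
)

# Assumption: riders entering at a station are "departures" from it.
departures <- od |>
  group_by(hour_of_day, station_id = origin_station_complex_id) |>
  summarize(departures = sum(estimated_average_ridership), .groups = "drop")

# Assumption: riders whose estimated destination is a station are "arrivals".
arrivals <- od |>
  group_by(hour_of_day, station_id = destination_station_complex_id) |>
  summarize(arrivals = sum(estimated_average_ridership), .groups = "drop")

# One row per station-hour: total volume sizes the circle,
# share_departing splits the pie.
flows <- full_join(departures, arrivals, by = c("hour_of_day", "station_id")) |>
  mutate(
    departures = coalesce(departures, 0),
    arrivals = coalesce(arrivals, 0),
    total = departures + arrivals,
    share_departing = departures / total
  ) |>
  arrange(hour_of_day, station_id)

flows
```

The `full_join()` plus `coalesce()` keeps stations that only appear as an origin or only as a destination in a given hour, which would otherwise drop out of the picture.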

The flow data is pretty bulky. The original dataset has about 121 million rows. But working with it is pretty straightforward, thanks to the magic of parquet files, duckdb, and duckplyr. Having patiently downloaded the data via its API, I put it in a parquet file. The CSV is about 17GB but the parquet file boils it down to 1.5GB. Then I made a small R package that bundled that data with a few convenience functions. This lets me use the data without copying it into any single project. So I can write, e.g.,

nycsubwayodr::nyc_subway_odr()
#> # A duckplyr data frame: 15 variables
#>     year month day_of_week hour_of_day timestamp           day_of_month origin_station_complex_id
#>    <int> <int> <chr>             <int> <dttm>                     <int>                     <int>
#>  1  2025     1 Monday                1 2025-01-06 01:00:00            6                       189
#>  2  2025     1 Monday                1 2025-01-06 01:00:00            6                       313
#>  3  2025     1 Monday                1 2025-01-06 01:00:00            6                       611
#>  4  2025     1 Monday                1 2025-01-06 01:00:00            6                       125
#>  5  2025     1 Monday                1 2025-01-06 01:00:00            6                       313
#>  6  2025     1 Monday                1 2025-01-06 01:00:00            6                       154
#>  7  2025     1 Monday                1 2025-01-06 01:00:00            6                       167
#>  8  2025     1 Monday                1 2025-01-06 01:00:00            6                       612
#>  9  2025     1 Monday                1 2025-01-06 01:00:00            6                       272
#> 10  2025     1 Monday                1 2025-01-06 01:00:00            6                       167
#> # ℹ more rows
#> # ℹ 8 more variables: origin_station_complex_name <chr>, origin_latitude <dbl>, origin_longitude <dbl>,
#> #   destination_station_complex_id <int>, destination_station_complex_name <chr>,
#> #   destination_latitude <dbl>, destination_longitude <dbl>, estimated_average_ridership <dbl>
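As for getting from a bulky CSV to a compact parquet file in the first place, one way to do it is to let duckdb handle the conversion. This is a sketch under assumptions, not necessarily how the file behind the package was made: the paths and the two-row toy CSV are made up, and the real download step is not shown.

```r
library(DBI)
library(duckdb)

# Hypothetical paths; a tiny CSV stands in for the real 17GB download.
csv_path <- tempfile(fileext = ".csv")
parquet_path <- tempfile(fileext = ".parquet")
write.csv(
  data.frame(hour_of_day = c(7L, 8L),
             estimated_average_ridership = c(1.5, 2.5)),
  csv_path,
  row.names = FALSE
)

con <- dbConnect(duckdb())

# duckdb streams the CSV straight into a parquet file,
# so the data never has to fit in R's memory.
dbExecute(con, sprintf(
  "COPY (SELECT * FROM read_csv_auto('%s')) TO '%s' (FORMAT PARQUET)",
  csv_path, parquet_path
))

# Check the result by counting rows directly from the parquet file.
n_rows <- dbGetQuery(con, sprintf(
  "SELECT count(*) AS n FROM read_parquet('%s')", parquet_path
))$n

dbDisconnect(con, shutdown = TRUE)
```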

From there, we query the data lazily and duckdb does the calculations. The whole table is never loaded into your R session, and duckdb is very fast. We then take our hourly flow summaries, join them to a tibble of station and line data, and export the result to some JSON files that D3.js animates for us.
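To make the lazy-query idea concrete, here is a small sketch that uses DBI, duckdb, and dplyr (via dbplyr) against a toy parquet file, then turns the collected summary into JSON with jsonlite. The toy data, the file path, and the choice of dbplyr and jsonlite are assumptions for illustration; the post itself works through duckplyr and the package's own convenience functions.

```r
library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Write a made-up parquet file so the example is self-contained.
toy <- data.frame(
  day_of_week = c("Monday", "Monday", "Tuesday", "Monday"),
  hour_of_day = c(8L, 8L, 8L, 9L),
  origin_station_complex_id = c(1L, 1L, 1L, 2L),
  estimated_average_ridership = c(10, 5, 100, 7)
)
path <- tempfile(fileext = ".parquet")
dbWriteTable(con, "toy", toy)
dbExecute(con, sprintf("COPY (SELECT * FROM toy) TO '%s' (FORMAT PARQUET)", path))

# A lazy table: nothing is read into R until collect() is called.
od <- tbl(con, sql(sprintf("SELECT * FROM read_parquet('%s')", path)))

hourly <- od |>
  filter(day_of_week == "Monday") |>
  group_by(hour_of_day, origin_station_complex_id) |>
  summarize(
    departures = sum(estimated_average_ridership, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(hour_of_day, origin_station_complex_id) |>
  collect()  # duckdb runs the query; only the small summary comes back

# One JSON object per row, a shape that is easy to consume from JavaScript.
json <- jsonlite::toJSON(hourly, dataframe = "rows")

dbDisconnect(con, shutdown = TRUE)
```

Only the summarized rows ever reach R, which is what makes this workable with a table of around 121 million rows.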

Here’s the result. There are three views. Initially, you see just the schematic subway map. If you click the “Map” button in the top left, it will switch to the ticking pie-chart view, which puts a pie on every station complex, with each tick being an hour of the day. The pies pile up on one another in the geographic view (in a not wholly uninformative way), but click again to have them expand to a somewhat more abstracted, force-directed network view of the system. Then click again to go back to the map. You can hover over or tap on a node to get information about the data it’s currently showing.

Now, you might reasonably say, Kieran, that’s a lot of data to show that people go to work in the morning and come home in the evening. I’m not saying there’s nothing to that criticism. But there are quite a few interesting details in there as the data pick up traffic to different parts of town. The big interchanges naturally dominate the view, but even here there are things of interest about the balance of flow. For example, Penn Station has people coming in on New Jersey Transit during the morning rush hour and then entering the subway, which does a lot to balance its net flow during rush hour and even tip it towards net departures. But more importantly, who doesn’t want to sit back and contemplate more than 400 pie charts, each one pulsing with life as another hour ticks by?
