nycOpenData: A unified R interface to NYC Open Data APIs

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers.]

Guest post by Christian Martinez, developer of the nycOpenData package in R.

I am pleased to announce the release of nycOpenData, an R package providing convenient, tidy access to dozens of datasets from the New York City Open Data platform.

The package is designed as part of an open-science and reproducible-research effort, with the goal of lowering the friction between public data and statistical analysis—especially for teaching, exploratory research, and applied civic work.

Why nycOpenData?

NYC Open Data hosts hundreds of datasets covering topics such as public safety, housing, transportation, education, health, and city services. While these datasets are publicly accessible through the Socrata API, working with them directly often requires:

  • knowing dataset identifiers,
  • manually constructing API queries,
  • handling pagination, timeouts, and rate limits,
  • and performing repetitive data-cleaning steps.

These barriers can slow down exploratory analysis and make public data less accessible to students, researchers, and practitioners who primarily work in R.
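
To make the manual route concrete, here is a minimal base R sketch of assembling a Socrata (SODA) query URL by hand. The helper build_soda_url() is hypothetical and is not part of nycOpenData. The dataset identifier erm2-nwe9 (311 service requests) and the data.cityofnewyork.us endpoint are assumptions that should be confirmed on the NYC Open Data portal before use. The sketch only builds the URL; pagination, timeouts, and cleaning would still be left to the analyst.

```r
# Hypothetical helper: build a Socrata (SODA) query URL by hand.
# Not part of nycOpenData; shown only to illustrate the manual steps.
build_soda_url <- function(dataset_id, limit = 1000, filters = list(), order = NULL) {
  base <- sprintf("https://data.cityofnewyork.us/resource/%s.json", dataset_id)

  # SoQL parameters are prefixed with "$"; simple equality filters are
  # passed as plain column = value pairs.
  params <- c(list(`$limit` = as.integer(limit)), filters)
  if (!is.null(order)) params[["$order"]] <- order

  # Percent-encode each value so spaces and symbols are URL-safe.
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

# Assumed dataset ID for 311 service requests; confirm on the portal.
url <- build_soda_url(
  "erm2-nwe9",
  limit   = 1000,
  filters = list(borough = "BROOKLYN"),
  order   = "created_date DESC"
)
print(url)
```

Every dataset needs its own identifier and column names, which is the kind of bookkeeping a wrapper package can hide.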

nycOpenData was built to remove these obstacles by providing a consistent, user-friendly interface that returns clean tibbles ready for analysis—without requiring users to interact directly with the API.

What does the package do?

The package provides a growing collection of wrapper functions, each corresponding to a specific NYC Open Data dataset or dataset family. All functions follow a shared design pattern and support:

  • row limits,
  • optional filtering via named lists,
  • sorting,
  • and graceful handling of API errors and timeouts.

Examples of currently supported domains include:

  • 311 service requests
  • Transportation and for-hire vehicles
  • Motor vehicle collisions
  • Department of Buildings permits and complaints
  • Education and school reporting
  • Juvenile justice and public safety
  • Street trees and environmental data
  • Permitted events (historical)

A typical call looks like this:

library(nycOpenData)

nyc_311(
  limit = 1000,
  filters = list(borough = "BROOKLYN")
)
## # A tibble: 1,000 × 40
##    unique_key created_date          agency agency_name complaint_type descriptor
##    <chr>      <chr>                 <chr>  <chr>       <chr>          <chr>     
##  1 67613985   2026-01-26T02:06:05.… NYPD   New York C… Noise - Resid… Banging/P…
##  2 67609553   2026-01-26T02:02:09.… NYPD   New York C… Noise - Resid… Banging/P…
##  3 67610990   2026-01-26T01:58:58.… NYPD   New York C… Illegal Parki… Blocked H…
##  4 67615428   2026-01-26T01:56:49.… NYPD   New York C… Noise - Resid… Banging/P…
##  5 67609568   2026-01-26T01:48:16.… NYPD   New York C… Noise - Resid… Loud Musi…
##  6 67612476   2026-01-26T01:47:10.… NYPD   New York C… Noise - Resid… Loud Musi…
##  7 67614152   2026-01-26T01:46:26.… DSNY   Department… Snow or Ice    Snow Trac…
##  8 67614054   2026-01-26T01:44:50.… DSNY   Department… Dirty Conditi… Trash     
##  9 67606570   2026-01-26T01:41:32.… NYPD   New York C… Noise - Resid… Banging/P…
## 10 67610091   2026-01-26T01:35:51.… NYPD   New York C… Noise - Vehic… Car/Truck…
## # ℹ 990 more rows
## # ℹ 34 more variables: location_type <chr>, incident_zip <chr>,
## #   incident_address <chr>, street_name <chr>, cross_street_1 <chr>,
## #   cross_street_2 <chr>, intersection_street_1 <chr>,
## #   intersection_street_2 <chr>, address_type <chr>, city <chr>,
## #   landmark <chr>, status <chr>, community_board <chr>,
## #   council_district <chr>, police_precinct <chr>, bbl <chr>, borough <chr>, …

The result is a tidy tibble of the 1,000 most recent 311 requests in Brooklyn, immediately compatible with the tidyverse ecosystem for visualization, modeling, and reporting.
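
In the output above, created_date arrives as a character column, so time-based analysis needs one conversion step first. The sketch below uses a small hypothetical data frame rather than a live API call. The timestamp format (three fractional-second digits) and the New York time zone are assumptions, because the printed values are truncated.

```r
# Hypothetical sample mimicking the character timestamps shown above.
# The exact fractional-second format is an assumption.
requests <- data.frame(
  unique_key   = c("1", "2", "3"),
  created_date = c("2026-01-01T08:15:30.000",
                   "2026-01-01T21:45:00.000",
                   "2026-01-02T00:05:10.000"),
  stringsAsFactors = FALSE
)

# Parse ISO 8601-style strings into date-times (New York local time assumed).
requests$created_at <- as.POSIXct(
  requests$created_date,
  format = "%Y-%m-%dT%H:%M:%OS",
  tz     = "America/New_York"
)

# Derive calendar date and hour of day for grouping or plotting.
requests$created_day  <- as.Date(requests$created_at, tz = "America/New_York")
requests$created_hour <- as.integer(format(requests$created_at, "%H"))

str(requests)
```

The same pattern should apply to other date columns once their format has been checked against the real data.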

Mini analysis

A key strength of nyc_311() is that it can filter on several columns at once. Let’s put everything together and pull the 1,000 most recent 311 requests handled by the New York Police Department (NYPD) in Brooklyn.

# Creating the dataset
brooklyn_nypd <- nyc_311(limit = 1000, filters = list(agency = "NYPD", borough = "BROOKLYN"))

# Calling head of our new dataset
head(brooklyn_nypd)
## # A tibble: 6 × 39
##   unique_key created_date           agency agency_name complaint_type descriptor
##   <chr>      <chr>                  <chr>  <chr>       <chr>          <chr>     
## 1 67613985   2026-01-26T02:06:05.0… NYPD   New York C… Noise - Resid… Banging/P…
## 2 67609553   2026-01-26T02:02:09.0… NYPD   New York C… Noise - Resid… Banging/P…
## 3 67610990   2026-01-26T01:58:58.0… NYPD   New York C… Illegal Parki… Blocked H…
## 4 67615428   2026-01-26T01:56:49.0… NYPD   New York C… Noise - Resid… Banging/P…
## 5 67609568   2026-01-26T01:48:16.0… NYPD   New York C… Noise - Resid… Loud Musi…
## 6 67612476   2026-01-26T01:47:10.0… NYPD   New York C… Noise - Resid… Loud Musi…
## # ℹ 33 more variables: location_type <chr>, incident_zip <chr>,
## #   incident_address <chr>, street_name <chr>, cross_street_1 <chr>,
## #   cross_street_2 <chr>, intersection_street_1 <chr>,
## #   intersection_street_2 <chr>, address_type <chr>, city <chr>,
## #   landmark <chr>, status <chr>, community_board <chr>,
## #   council_district <chr>, police_precinct <chr>, bbl <chr>, borough <chr>,
## #   x_coordinate_state_plane <chr>, y_coordinate_state_plane <chr>, …
# Quick check to make sure our filtering worked
nrow(brooklyn_nypd)
## [1] 1000
unique(brooklyn_nypd$agency)
## [1] "NYPD"
unique(brooklyn_nypd$borough)
## [1] "BROOKLYN"

The checks confirm that the dataset contains the 1,000 most recent requests handled by the NYPD in the borough of Brooklyn.

Now that the data are in R, let’s find out what Brooklyn residents are reporting to the NYPD.

To do this, we will create a bar graph of the complaint types.

# Visualizing the distribution, ordered by frequency
library(ggplot2)

ggplot(brooklyn_nypd, aes(y = reorder(complaint_type, complaint_type, length))) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  labs(
    title = "Most Recent NYPD 311 Complaints (Brooklyn)",
    subtitle = "1,000 most recent service requests",
    x = "Number of Complaints",
    y = "Type of Complaint"
  )

Figure 1: Bar chart showing the frequency of NYPD-related 311 complaint types in Brooklyn from the 1,000 most recent service requests.

The graph shows both which complaint types were reported and how often each one occurred.
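
When exact numbers are needed rather than bar lengths, the same distribution can be tabulated. The sketch below uses a short hypothetical vector in place of brooklyn_nypd$complaint_type so that it runs without an API call; the values and counts are illustrative, not real results.

```r
# Hypothetical stand-in for brooklyn_nypd$complaint_type.
complaint_type <- c(
  "Noise - Residential", "Illegal Parking", "Noise - Residential",
  "Noise - Vehicle", "Noise - Residential", "Illegal Parking"
)

# Count each complaint type and sort from most to least frequent.
counts <- sort(table(complaint_type), decreasing = TRUE)

# Convert to a data frame with counts and percentage shares.
summary_df <- data.frame(
  complaint_type = names(counts),
  n              = as.integer(counts),
  share_pct      = round(100 * as.integer(counts) / sum(counts), 1),
  stringsAsFactors = FALSE
)
print(summary_df)
```

With the real data, replacing the sample vector with brooklyn_nypd$complaint_type should give the table behind the bar chart.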

Designed for reproducible workflows

A core design principle of nycOpenData is reproducibility. Rather than relying on manually downloaded CSV files, which can go stale or be accidentally modified, analyses can explicitly document:

  • which dataset was used,
  • how many rows were requested,
  • which filters were applied,
  • and when the data were accessed.

This makes the package particularly useful for:

  • reproducible research projects,
  • classroom assignments,
  • data journalism,
  • and exploratory civic analysis.
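
One way to record the four items a reproducible analysis should document (dataset, row limit, filters, access time) is to store them next to the data. The helper with_provenance() below is hypothetical and not part of nycOpenData; it is a minimal sketch that attaches the query details as an attribute of the returned data frame.

```r
# Hypothetical helper: attach query provenance to a result as an attribute.
# `dataset` is a free-text label chosen by the analyst.
with_provenance <- function(data, dataset, limit, filters = list()) {
  attr(data, "provenance") <- list(
    dataset     = dataset,
    limit       = limit,
    filters     = filters,
    accessed_at = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  )
  data
}

# Stand-in data frame; in practice this would be the result of a call such as
# nyc_311(limit = 1000, filters = list(borough = "BROOKLYN")).
result <- data.frame(unique_key = c("1", "2"), stringsAsFactors = FALSE)

result <- with_provenance(
  result,
  dataset = "nyc_311",
  limit   = 1000,
  filters = list(borough = "BROOKLYN")
)

str(attr(result, "provenance"))
```

Attributes can be dropped by some data-manipulation steps, so for longer pipelines it may be safer to save the provenance list to its own file alongside the analysis script.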

The package is also designed to be API-polite, with configurable timeouts and safeguards that help prevent common failure modes when querying large public datasets.
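
For scripts that run unattended, transient failures can also be handled on the caller's side. The sketch below is not how nycOpenData works internally; with_retry() is a hypothetical helper that retries any function with exponential backoff, demonstrated here on a simulated request so that it runs offline.

```r
# Hypothetical helper: retry a fetch with exponential backoff between attempts.
with_retry <- function(fetch, max_tries = 3, base_delay = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_error <- result
    # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
    if (attempt < max_tries) Sys.sleep(base_delay * 2^(attempt - 1))
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(last_error))
}

# Simulated flaky request: fails twice, then succeeds.
attempts <- 0
flaky_fetch <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("simulated timeout")
  "ok"
}

outcome <- with_retry(flaky_fetch, max_tries = 3, base_delay = 0)

# With the real package, the call might be wrapped like this (needs network):
# brooklyn <- with_retry(function() nyc_311(limit = 1000))
```

Spacing out retries keeps a failing script from hammering a public API, which is in the same spirit as the package's own safeguards.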

Who is it for?

nycOpenData is intended for a broad audience, including:

  • students learning statistics or data science using real-world data,
  • instructors teaching reproducible research or applied data analysis,
  • researchers conducting exploratory or descriptive analyses,
  • data journalists and civic technologists,
  • and anyone interested in working with NYC public data in R.

The goal is not to abstract away the data itself, but to make access predictable, transparent, and easy to integrate into standard R workflows.

Availability

The package is available on CRAN and can be installed using:

install.packages("nycOpenData")

Development continues on GitHub, where new datasets and improvements are added regularly.

Acknowledgements

This package was developed alongside teaching and applied research projects in reproducible data science, with inspiration from open-source contributors across the R community and the NYC Open Data program.

As always, feedback, bug reports, and dataset requests are very welcome.

Thanks for reading!
