readr::problems() returns tidy data!

[This article was first published on jacobsimmering.com, and kindly contributed to R-bloggers].

A handy little trick I picked up today when using readr.

Some background: I needed a mapping between ZIP Code Tabulation Areas (ZCTAs) and counties (to link to some urban/rural data). The Census Bureau provides a CSV-style table that includes information about each ZCTA (e.g., size, population, area by land/water type) along with the FIPS codes for the state and county.

However, when I load that data using the readr package:

library(tidyverse)
zcta_to_county_mapping <- read_csv("http://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt") %>%
  select(ZCTA5, STATE, COUNTY) %>%
  mutate(STATE = as.numeric(STATE),
         COUNTY = as.numeric(COUNTY))
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   ZCTA5 = col_character(),
##   COUNTY = col_character(),
##   COAREA = col_double(),
##   COAREALAND = col_double(),
##   ZPOPPCT = col_double(),
##   ZHUPCT = col_double(),
##   ZAREAPCT = col_double(),
##   ZAREALANDPCT = col_double(),
##   COPOPPCT = col_double(),
##   COHUPCT = col_double(),
##   COAREAPCT = col_double(),
##   COAREALANDPCT = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 1592 parsing failures.
##  row        col   expected     actual
## 1303 ZAREA      an integer 3298386447
## 1303 ZAREALAND  an integer 3032137295
## 1304 AREAPT     an integer 2429735568
## 1304 AREALANDPT an integer 2262437812
## 1304 ZAREA      an integer 3298386447
## .... .......... .......... ..........
## See problems(...) for more details.

It produces a warning. Looking at the few rows it returned, the failures most likely come from overflow: read_csv() guessed that these columns were of type integer (4 bytes, with a maximum value of \(2^{31} - 1\), or 2,147,483,647), but some of the values are much larger than that. I looked up a few of them and saw that they were all occurring in large, unpopulated areas. One of them (ZIP code 04462) is described by UnitedStatesZipCodes.org as covering “an extremely large land area compared to other ZIP codes in the United States.”
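As a quick sanity check on the overflow explanation, here is a minimal base R sketch. It takes one of the values from the warning above and compares it with the largest integer R can store; the variable names are only for illustration.

```r
# Largest value an R integer can hold (32-bit signed): 2^31 - 1.
max_int <- .Machine$integer.max

# One of the values reported in the warning, kept as text as it appears in the file.
too_big <- "3298386447"

# Coercing to integer gives NA (R warns about the integer range),
# which is the same kind of failure read_csv() reported.
as_int <- suppressWarnings(as.integer(too_big))

# A double holds the same value without trouble.
as_dbl <- as.numeric(too_big)

is.na(as_int)      # TRUE
as_dbl > max_int   # TRUE
```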

So that seems to be the source of the issue, but there were 1,592 failures! I wanted to make sure those failures never affected the variables that I’m interested in. I noticed the warning says to use problems() to see more details. I did as I was told, expecting something about as useful as the results of warnings(), but was pleased to get back a tbl_df!
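Because problems() hands back a data frame, ordinary dplyr verbs work on it. The sketch below uses a tiny stand-in file rather than the real Census download: the column names come from the real file, but the rows are made up for illustration (only the oversized ZAREA value is taken from the warning). ZAREA is forced to col_integer() to reproduce the failure, on the assumption that newer readr releases may guess a double for such a column and so not fail on their own.

```r
library(readr)
library(dplyr)

# Toy stand-in for the Census file (illustrative rows, not real data).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "ZCTA5,STATE,COUNTY,ZAREA",
  "00601,72,001,173777444",
  "04462,23,019,3298386447"
), tmp)

# Force ZAREA to integer so the second row fails to parse.
raw <- read_csv(
  tmp,
  col_types = cols(
    ZCTA5  = col_character(),
    STATE  = col_character(),
    COUNTY = col_character(),
    ZAREA  = col_integer()
  )
)

# The parsing failures come back as a data frame ...
probs <- problems(raw)

# ... so they can be summarised like any other table.
probs_by_col <- probs %>%
  count(col, expected)
```

On the full file, a summary like probs_by_col would show at a glance which columns account for the failures.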

Checking to make sure the errors didn’t affect my variables of interest (ZCTA5, STATE and COUNTY) was as easy as

problems(zcta_to_county_mapping) %>%
  filter(col %in% c("ZCTA5", "STATE", "COUNTY"))
## # A tibble: 0 × 4
## # ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>
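One possible way to avoid the failures altogether is to declare the column types up front, so the large area values are read as doubles. The sketch below is not the approach used above, just an alternative under stated assumptions: it again uses a toy stand-in file with made-up rows, and it calls problems() on the freshly read object before any select() or mutate(), as a precaution in case the attribute that stores the problems does not survive every dplyr verb in every version.

```r
library(readr)
library(dplyr)

# Toy stand-in for the Census file (illustrative rows, not real data).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "ZCTA5,STATE,COUNTY,ZAREA",
  "00601,72,001,173777444",
  "04462,23,019,3298386447"
), tmp)

# Declare the types explicitly; the area column is a double, not an integer.
raw <- read_csv(
  tmp,
  col_types = cols(
    ZCTA5  = col_character(),
    STATE  = col_character(),
    COUNTY = col_character(),
    ZAREA  = col_double()
  )
)

# Check for parsing failures on the raw result, before transforming it.
n_problems <- nrow(problems(raw))

# Then build the mapping as before.
mapping <- raw %>%
  select(ZCTA5, STATE, COUNTY) %>%
  mutate(STATE = as.integer(STATE),
         COUNTY = as.integer(COUNTY))
```

With the real file, the remaining columns would need types as well, for example through a .default entry in cols().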

I love when tools make life easier! Even the error handling returns tidy data!
