My first R package: zipcode

Posted on January 5, 2011 by Jeffrey Breen in R bloggers, Uncategorized | 0 Comments

[This article was first published on Things I tend to forget » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

You may know that I am a fan of the CivicSpace US ZIP Code Database compiled by Schuyler Erle of Mapping Hacks fame. It contains nearly 10,000 more records than the ZIP Code Tabulation Areas file from the U.S. Census Bureau upon which it is based, so a lot of work has gone into it.

I have been using the database a lot recently to correlate with survey respondents, so I have saved it as an R data.frame. Since others may find it useful, too, I have packaged it into the ‘zipcode’ package now available on CRAN.

One you load the package, the database is available in the ‘zipcode’ data.frame:

> library(zipcode)
> data(zipcode)

> nrow(zipcode)
[1] 43191

> head(zipcode)
    zip       city state latitude longitude timezone  dst
1 00210 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
2 00211 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
3 00212 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
4 00213 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
5 00214 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
6 00215 Portsmouth    NH 43.00590  -71.0132       -5 TRUE

Note that the ‘zip’ column is a string, not an integer, in order to preserve leading zeroes — a sensitive topic for those of us in the Northeast…

The package also includes a clean.zipcodes() function to help clean up zip codes in your data. It strips off “ZIP+4″ suffixes, attempts to restore missing leading zeroes, and replaces anything with non-digits (like non-U.S. postal codes) with NAs:

> library(zipcode)
> data(zipcode)

> somedata = data.frame(postal = c(2061, "02142", 2043, "20210", "2061-2203", "SW1P 3JX", "210", '02199-1880'))
> somedata
      postal
1       2061
2      02142
3       2043
4      20210
5  2061-2203
6   SW1P 3JX
7        210
8 02199-1880

> somedata$zip = clean.zipcodes(somedata$postal)
> somedata
      postal   zip
1       2061 02061
2      02142 02142
3       2043 02043
4      20210 20210
5  2061-2203 02061
6   SW1P 3JX  <NA>
7        210 00210
8 02199-1880 02199

> data(zipcode)
> somedata = merge(somedata, zipcode, by.x='zip', by.y='zip')
> somedata
    zip     postal       city state latitude longitude timezone  dst
1 00210        210 Portsmouth    NH 43.00590 -71.01320       -5 TRUE
2 02043       2043    Hingham    MA 42.22571 -70.88764       -5 TRUE
3 02061       2061    Norwell    MA 42.15243 -70.82050       -5 TRUE
4 02061  2061-2203    Norwell    MA 42.15243 -70.82050       -5 TRUE
5 02142      02142  Cambridge    MA 42.36230 -71.08412       -5 TRUE
6 02199 02199-1880     Boston    MA 42.34713 -71.08234       -5 TRUE
7 20210      20210 Washington    DC 38.89331 -77.01465       -5 TRUE

Now we wouldn’t be R users if we didn’t try to do something with data, even if it’s just a lookup table of zip codes. So let’s take a look at how they’re distributed by first digit:

library(zipcode)
library(ggplot2)

data(zipcode)
zipcode$region = substr(zipcode$zip, 1, 1)

g = ggplot(data=zipcode) + geom_point(aes(x=longitude, y=latitude, colour=region))

# simplify display and limit to the "lower 48"
g = g + theme_bw() + scale_x_continuous(limits = c(-125,-66), breaks = NA)
g = g + scale_y_continuous(limits = c(25,50), breaks = NA)

# don't need axis labels
g = g + labs(x=NULL, y=NULL)