
Statistics New Zealand experimental API initiative

[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers.]

Exciting experimental API to access New Zealand official statistics

Statistics New Zealand have released an exciting experiment in accessing data in JSON format over the web via an application programming interface (API). It looks to be the time series data that is usually provided over the solid but dated Infoshare interface, which has only a clunky ability to provide data machine to machine. (At my work, we have a moderately complex data warehousing application that deals with getting data from the current dissemination tools.)

This is a great initiative by Statistics New Zealand that should be supported, so try accessing the data and giving them feedback.

Important: note that the data in this experiment is not going to be updated, so its currency will degrade over time. Treat this as an experiment in a means of accessing data, not a way to get the definitive source.

Access is easiest via R

Within days of Statistics New Zealand announcing the experimental API, an R package to engage with it had been released on GitHub by Jonathan Marshall. It builds on Hadley Wickham’s httr package, designed specifically to make it easy for developers like Jonathan to build downstream packages for R to interact with this sort of API. The net impact is that a few simple R functions give full access to the data Statistics New Zealand have provided.
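
To give a feel for what a wrapper of this kind does under the hood, here is a minimal sketch. It is not the statsNZ package's actual code: the function names fetch_series and parse_series, the JSON layout and the numbers in the payload are all made up for illustration, with field names borrowed from the columns used later in this post. It assumes the jsonlite package is installed (and httr too, if you call fetch_series against a real endpoint).

```r
library(jsonlite)

# Hypothetical helper: turn a JSON array of observations into a data frame.
parse_series <- function(json_text) {
  df <- jsonlite::fromJSON(json_text)   # array of objects -> data frame
  df$Period <- as.Date(df$Period)       # dates arrive as strings
  df
}

# Hypothetical helper: request a URL, fail loudly on an HTTP error, then parse.
# Defined here but not called, because no real endpoint is assumed.
fetch_series <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)
  parse_series(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Made-up payload, standing in for whatever the API really returns:
payload <- '[
  {"SeriesReference": "MFGQ.XXX1KA", "Period": "2015-09-30", "DataValues": 100.5, "status": "F"},
  {"SeriesReference": "MFGQ.XXX1KA", "Period": "2015-12-31", "DataValues": 101.2, "status": "F"}
]'

toy <- parse_series(payload)
str(toy)
```

Keeping the parsing separate from the HTTP request means the parsing step can be tried on a literal string without touching the network.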

Note that as far as I am aware @jmarshallnz has no affiliation with Statistics New Zealand and his work on this package is his own. This means two important things:

Usage

Here’s a graphic I made with just a few lines of code, and no manual downloads of data, using the statsNZ R package and Statistics New Zealand’s experimental API (which I didn’t have to learn at all, as the work is done under the hood by Jonathan’s statsNZ package):

Note that the vertical scales differ in each facet, and that the industry groupings have been sorted in order of recent size (smallest at the bottom right). Also note that the industry groupings aren’t mutually exclusive.

It shows an interesting picture of changes in the New Zealand manufacturing sector. For example:

The overall picture is one of growth, with recovery in recent years to greater than pre-GFC levels. Manufacturing sales are around $26 billion per quarter, in September 2010 prices.

Here’s the code that grabbed the data and drew the graphic:

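# install the statsNZ package from GitHub (needs devtools):
devtools::install_github("jmarshallnz/statsNZ")
library(statsNZ)
library(dplyr)
library(ggplot2) # ggplot(), facet_wrap() etc
library(scales)  # dollar() for the axis labels
library(ggseas)  # stat_stl()
library(stringr) # str_wrap()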

available_stats()

get_groups("ESM")

esm <- get_stats("ESM", "Industry by variable - Subannual Financial Collection")

unique(esm$SeriesTitle1) # 15 levels
unique(esm$SeriesReference) # 225 levels and the code contains information
unique(esm$status) # 3 levels

# To understand the metadata we check out the original Hot Off the Press release
# http://www.stats.govt.nz/~/media/Statistics/Browse%20for%20stats/EconomicSurveyofManufacturing/HOTPDec15qtr/esm-dec-2015-tables.xlsx

# SeriesReference MFGQ.XXX1KA (ie last three characters 1KA) means
# sales in volume terms (ie adjusted for price changes, expressed in
# September 2010 quarter prices) for industry XXX


# taking a punt on status "F" meaning final (the alternatives are C and R.  There don't seem
# to be any values of R in this subset, and C seems identical to F.)
the_data <- esm %>%
   filter(status == "F") %>%
   filter(grepl("1KA$", SeriesReference)) %>%
   mutate(SeriesTitle1 = str_wrap(SeriesTitle1, 30))

# what's the order largest to smallest, so we can sort the facets:
sorted <- the_data %>%
   filter(Period == max(Period)) %>%
   arrange(desc(DataValues))

the_data %>%
   mutate(SeriesTitle1 = factor(SeriesTitle1, levels = sorted$SeriesTitle1)) %>%
   ggplot(aes(x= Period, y = DataValues)) +
   facet_wrap(~SeriesTitle1, scales = "free_y", ncol = 3) +
   # draw original data:
   geom_line(colour = "grey70") +
   # draw seasonally adjusted version (note this is our seasonal adjustment on the
   # fly, not the seasonally adjusted data published by Statistics New Zealand):
   stat_stl(s.window = 7, frequency = 4, colour = "steelblue", size = 0.9) +
   scale_y_continuous("Sales per quarter, September 2010 prices, millions of dollars\n", 
                      labels = dollar) +
   labs(x = "", caption = "Source: Statistics New Zealand experimental API\nhttp://innovation.stats.govt.nz/initiatives/time-series-api-prototype/") +
   ggtitle("Economic Survey of Manufacturing, New Zealand",
           subtitle = "Different fortunes in industries' manufacturing trends")
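
The on-the-fly seasonal adjustment in the plot can be approximated in base R, which may help if you want the adjusted numbers rather than just a line on a chart. The sketch below uses a simulated quarterly series (not Statistics New Zealand data), fits stl() with the same s.window = 7, and subtracts the estimated seasonal component. My understanding is that stat_stl does something along these lines, but treat that as an assumption rather than a description of its internals.

```r
set.seed(42)

# Simulated quarterly series: trend + fixed quarterly pattern + noise.
# All numbers here are invented for illustration.
n_quarters <- 40
quarter_effect <- rep(c(-10, 5, 10, -5), length.out = n_quarters)
toy_sales <- ts(100 + 0.5 * seq_len(n_quarters) + quarter_effect +
                  rnorm(n_quarters, sd = 2),
                start = c(2006, 1), frequency = 4)

# Decompose into seasonal, trend and remainder components:
fit <- stl(toy_sales, s.window = 7)

# Seasonally adjusted series = original minus the seasonal component:
seasonally_adjusted <- toy_sales - fit$time.series[, "seasonal"]

plot(toy_sales, col = "grey70", ylab = "Toy sales")
lines(seasonally_adjusted, col = "steelblue", lwd = 2)
```

A larger s.window lets the seasonal pattern change more slowly over time; s.window = "periodic" forces it to be identical every year.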

A few points:

Ideally that sort of metadata would also be available for download (and hence joining) as part of the API, and fields should be tidied so they don’t contain multiple variables of information (ie that “1KA” snippet of information should be in its own field, not packed into an 11 character string). In fact it’s quite possible the metadata is already somewhere in the API I haven’t explored so I’ll hold off on giving that sort of detailed feedback yet. And of course, even if what we’ve got here is all, this is far better access than anything currently available. It’s not that this data isn’t out there, but it’s just fiddly to get at.
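
In the meantime, unpacking SeriesReference into separate fields is easy enough to do ourselves. Here is a minimal sketch in base R, assuming the 11 character layout seen in this dataset (a four character prefix, a dot, a three character industry code, then a three character measure code). The industry codes AAA and BBB, the measure code 9ZZ and the one-row lookup table are hypothetical; only the meaning of 1KA comes from the spreadsheet mentioned in the code comments earlier.

```r
# Hypothetical series references following the MFGQ.XXX1KA pattern:
refs <- c("MFGQ.AAA1KA", "MFGQ.BBB1KA", "MFGQ.AAA9ZZ")

# One variable per column, split by character position:
tidy_refs <- data.frame(
  SeriesReference = refs,
  prefix   = substr(refs, 1, 4),
  industry = substr(refs, 6, 8),
  measure  = substr(refs, 9, 11),
  stringsAsFactors = FALSE
)

# Stand-in for metadata that could in principle be served by the API:
measure_lookup <- data.frame(
  measure       = "1KA",
  measure_label = "Sales volume, September 2010 quarter prices",
  stringsAsFactors = FALSE
)

# Left join, so codes without metadata are kept with an NA label:
joined <- merge(tidy_refs, measure_lookup, by = "measure", all.x = TRUE)
print(joined)
```

With the measure code in its own column, the grepl("1KA$", SeriesReference) filter used earlier becomes a plain measure == "1KA".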

So great work, Statistics New Zealand! And everyone else, support this initiative.
