Statistics New Zealand experimental API initiative

[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Exciting experimental API to access New Zealand official statistics

Statistics New Zealand have released an exciting experiment in accessing data in JSON format over the web via an application programming interface (API). It looks to be time series data that is usually provided over the solid but dated Infoshare interface, which has only clunky ability to provide data machine to machine (at my work, we have a moderately complex data warehousing application that deals with getting data from the current dissemination tools).

This is a great initiative by Statistics New Zealand that should be supported, so try accessing the data and giving them feedback.

Important: note that the data in this experiment is not going to be updated, so its currency will degrade over time. Treat this as an experiment in a means of accessing data, not a way to get the definitive source.

Access is easiest via R

Within days of Statistics New Zealand announcing the experimental API, an R package to engage with it had been released on GitHub by Jonathan Marshall. It builds on Hadley Wickham’s httr package, designed specifically to make it easy for developers like Jonathan to build downstream packages for R to interact with this sort of API. The net impact is that a few simple R functions give full access to the data Statistics New Zealand have provided.

Note that as far as I am aware @jmarshallnz has no affiliation with Statistics New Zealand and his work on this package is his own. This means two important things:

  • encourage him to keep up the good work!
  • be careful when you give feedback to work out which issues are for Statistics New Zealand to improve the API, and which are for Jonathan to improve his R package.

Usage

Here’s a graphic I made with just a few lines of code, and no manual downloads of data, using the StatsNZ R package and Statistics New Zealand’s experimental API (which I didn’t have to learn at all, as the work is done under the hood by Jonathan’s statsNZ package):

manufacturing

Note that the vertical scales differ in each facet, and that the industry groupings have been sorted in order of recent size (smallest at the bottom right). Also note that the industry groupings aren’t mutually exclusive.

It shows an interesting picture of changes in the New Zealand manufacturing sector. For example:

  • big and sustained (ie not yet recovered) drops in metal product manufacturing and seafood processing at the time of the global financial crisis (GFC)
  • ongoing decline over a longer period in textiles, printing, and “furniture and other” (although there’s an interesting trend up in the past few quarters in the latter)
  • post-GFC recoveries in chemicals, non-metallic minerals, other food manufacturing (to a degree), and transport
  • sustained growth in meat and dairy and beverage and tobacco
  • a sharp boom in the last few years in petroleum and coal product manufacturing
  • different intensity of seasonality in different industries (eg particularly strong in beverage and tobacco, which includes wine).

The overall picture is one of growth, with recovery in recent years to greater than pre-GFC levels. Manufacturing sales are around $26 billion per quarter, in September 2010 prices.

Here’s the code that grabbed the data and drew the graphic:

devtools::install_github("jmarshallnz/statsNZ")
library(statsNZ)
library(dplyr)
library(ggseas)
library(stringr)

available_stats()

get_groups("ESM")

esm <- get_stats("ESM", "Industry by variable - Subannual Financial Collection")

unique(esm$SeriesTitle1) # 15 levels
unique(esm$SeriesReference) # 225 levels and the code contains information
unique(esm$status) # 3 levels

# To understand the metadata we check out the original Hot Off the Press release
# http://www.stats.govt.nz/~/media/Statistics/Browse%20for%20stats/EconomicSurveyofManufacturing/HOTPDec15qtr/esm-dec-2015-tables.xlsx

# SeriesReference MFGQ.XXX1KA (ie last three letters 1K1) means
# Sales in volume terms ie adjusted for price changes, 
# the September 2010 quarter prices of industry XXX


# taking a punt on status "F" meaning final (the alternatives are C and R.  There don't seem
# to be any values of R in this subset, and C seems identical to F.)
the_data <- esm %>%
   filter(status == "F") %>%
   filter(grepl("1KA$", SeriesReference)) %>%
   mutate(SeriesTitle1 = str_wrap(SeriesTitle1, 30))

# what's the order biggest to largest, so we can sort the facets:
sorted <- the_data %>%
   filter(Period == max(Period)) %>%
   arrange(desc(DataValues))

the_data %>%
   mutate(SeriesTitle1 = factor(SeriesTitle1, levels = sorted$SeriesTitle1)) %>%
   ggplot(aes(x= Period, y = DataValues)) +
   facet_wrap(~SeriesTitle1, scales = "free_y", ncol = 3) +
   # draw original data:
   geom_line(colour = "grey70") +
   # draw seasonally adjusted version (note this is our seasonal adjustment on the
   # fly, not the seasonally adjusted data published by Statistics New Zealand):
   stat_stl(s.window = 7, frequency = 4, colour = "steelblue", size = 0.9) +
   scale_y_continuous("Sales per quarter, September 2010 prices, millions of dollars\n", 
                      label = dollar) +
   labs(x = "", caption = "Source: Statistics New Zealand experimental API\nhttp://innovation.stats.govt.nz/initiatives/time-series-api-prototype/") +
   ggtitle("Economic Survey of Manufacturing, New Zealand",
           subtitle = "Different fortunes in industries' manufacturing trends")

A few points:

  • I had to go to the Statistics New Zealand website and look at the Excel version of the “Hot Off The Press” release for this survey to understand the meta data, and in particular what the 225 levels of “SeriesReference” equated to.
  • That field contains a string that provides multiple bits of information and needs to be unpacked. In the example above I use a regular expression to get all the series that end with “1KA” because I learned from looking at the Excel version that those fields mean “sales in volume terms”.

Ideally that sort of metadata would also be available for download (and hence joining) as part of the API, and fields should be tidied so they don’t contain multiple variables of information (ie that “1KA” snippet of information should be in its own field, not packed into an 11 character string). In fact it’s quite possible the metadata is already somewhere in the API I haven’t explored so I’ll hold off on giving that sort of detailed feedback yet. And of course, even if what we’ve got here is all, this is far better access than anything currently available. It’s not that this data isn’t out there, but it’s just fiddly to get at.

So great work, Statistics New Zealand! And everyone else, support this initiative.

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)