Reading OECD.Stat into R

[This article was first published on TC » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

OECD.Stat is a commonly used statistics portal in the research world but there are no easy ways (that I know of) to query it straight from R. There are two main benefits of querying OECD.Stat straight from R:

1. Create reproducible analysis (something that is easily lost if you have to download excel files)
2. Make tweaks to analysis easily

There are three main ways I could see to collect data from OECD.Stat

  1. Find the unique name for each dataset and scrape the html from the dataset’s landing page (e.g. the unique URL for pension operating expenses is http://stats.oecd.org/Index.aspx?DataSetCode=PNNI_NEW). This probably would have been the easiest way to scrape the data but it doesn’t offer the flexibility that the other two options do.
  2. Use the OECD Open Data API . This was the avenue I explored initially but it doesn’t seem that the API functionality is fully built yet.
  3. Use the SDMX query provided under the export tab on the OECD.Stat site. This URL query can be easily edited to change the selected countries, observation range and even datasets.

I went with option 3, using the SDMX query.

To get the query that you need to use in R, navigate to your dataset and click
export -> SDMX (XML) (as per picture below)

sdmx2

Then, copy everything in the ‘SDMX DATA URL’ box

Query

In the example below I am using the trade union density dataset.

Getting the SDMX URL as described above for the trade union dataset gives us a very long URL as it contains a lot countries. I cut it down in this example for clarity to:

http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/UN_DEN/AUS+CAN+FRA+DEU+NZL+GBR+USA+OECD/OECD?startTime=1960&endTime=2012

The important parts of the URL above are

  1. UN_DEN – This is the Trade Union Density dataset code
  2. AUS+CAN+FRA+DEU+NZL+GBR+USA+OECD – Unsurprisingly, this is the list of countries we are querying, you can delete countries or if you know the ISO country codes you can add to it.
  3. startTime=1960&endTime=2012 – Change the date range as you please.

Note that many datasets have a lot more options on offer so there is usually a bunch more junk after the dataset code in the URL.

The following code R code creates a melted data frame, ready for use in analysis or ggplot or rCharts. I make use of Carson Sievert’s XML2R package. All you need to do is paste your own SDMX URL into the relevant spot.

library(XML2R)

file <- "http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/UN_DEN/AUS+CAN+FRA+DEU+NZL+GBR+USA+OECD/OECD?startTime=1960&endTime=2012"

obs <- XML2Obs(file)
tables <- collapse_obs(obs)

# The data we care about is stored in the following three nodes
# We only care about the country variable in the keys node
keys <- tables[["MessageGroup//DataSet//Series//SeriesKey//Value"]]
dates <- tables[["MessageGroup//DataSet//Series//Obs//Time"]]
values <- tables[["MessageGroup//DataSet//Series//Obs//ObsValue"]]

# Extract the country part of the keys table
# Have to use both COU and COUNTRY as OECD don't use a standard name
country_list <- keys[keys[,1]== "COU" | keys[,1]== "COUNTRY"]
# The country names are stored in the middle third of the above list
country_list <- country_list[(length(country_list)*1/3+1):(length(country_list)*2/3)]

# Bind the existing date and value vectors
dat <- cbind.data.frame(as.numeric(dates[,1]),as.numeric(values[,1]))
colnames(dat) <- c('date', 'value')

# Add the country variable
# This code maps a new country each time the diff(dat$date)<=0 ...
# ...as there are a different number of readings for each country
# This is not particularly robust
dat$country <- c(country_list[1], country_list[cumsum(diff(dat$date) <= 0) + 1])
#created this as too many sig figs make the rChart ugly
dat$value2 <- signif(dat$value,2)

head(dat)

This should create a data frame for nearly all annualised OECD.Stat data. You will need to do some work with the dates if you want to use more frequently reported data (e.g. quarterly, monthly data).

Once the data is set up like this it is very easy to visualise.

library(ggplot2)
ggplot(dat) + geom_line(aes(date,value,colour=country))

Rplot

Or just as easy but a bit cooler, use rCharts to make it interactive

# library(devtools)
# install_github('rCharts', 'ramnathv', ref = 'dev')
library(rCharts)

n1 <- nPlot(value2 ~ date, group = "country", data = dat, type = "lineChart")
n1$chart(forceY = c(0))

#To publish to a gist on github use this code, it will produce the url for viewing
#(you need a github account)
n1$publish('Union Density over time', host = 'gist')

Link to the interactive is here (or click on the image below)

Union Density

To leave a comment for the author, please follow the link and comment on their blog: TC » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)