Scraping data from STATEC’s public tables

[This article was first published on rdata.lu Blog | Data science with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A lot of open data is available in Luxembourg’s open data portal, but sometimes, it is not very easy to download. In the video below, I give you an example of such data and show how you can use rvest to get the data easily.

After watching the video, take a look at the code below. This code does two things; first it scrapes the data, and then it puts the data in a tidy format fur further processing.

So to summarize the idea of the video; instead of clicking the buttons to download each year’s data (which you would have to do 15 times), it is easier to simple turn off javascript and then scrape the html version of the table. It would be possible, albeit with much more effort, to scrape the tables with javascript enabled, by using a tool such as phantomjs. But since we have the possibility to view the table in html, why not take advantage of it?

To scrape the data, you will need first to install the rvest and then load it (and let’s also load the other needed packages)

library(rvest)
library(dplyr)
library(tidyr)
library(purrr)
library(janitor)

Now, using rvest::read_html(), we can download the whole html page:

page_unemp <- read_html("http://www.statistiques.public.lu/stat/TableViewer/tableViewHTML.aspx?ReportId=12950&IF_Language=eng&MainTheme=2&FldrName=3&RFPath=91")

Now, we need to extract the table from the html page, and we do this by using rvest::html_nodes() and by providing this function with the name of the class of the object we’re interested in, namely, the table.

page_unemp %>%
  html_nodes(".b2020-datatable") %>% .[[1]] %>% html_table(fill = TRUE) -> data_raw


head(data_raw)
##                          X1                         X2      X3      X4
## 1                      Year                       Year    2001    2002
## 2             Specification                       Year    2001    2002
## 3 Grand Duchy of Luxembourg  Total employed population 180,084 182,004
## 4 Grand Duchy of Luxembourg     of which: Wage-earners 162,407 164,277
## 5 Grand Duchy of Luxembourg of which: Non-wage-earners  17,677  17,727
## 6 Grand Duchy of Luxembourg                 Unemployed   5,393   6,773
##        X5      X6      X7      X8      X9     X10     X11     X12     X13
## 1    2003    2004    2005    2006    2007    2008    2009    2010    2011
## 2    2003    2004    2005    2006    2007    2008    2009    2010    2011
## 3 183,419 186,325 187,380 192,095 197,486 202,203 204,127 207,923 214,094
## 4 165,509 168,214 169,194 174,045 179,176 183,705 185,369 188,983 194,893
## 5  17,910  18,111  18,186  18,050  18,310  18,498  18,758  18,940  19,201
## 6   8,359   9,426  10,653  10,297   9,670  11,496  14,816  15,567  16,159
##       X14     X15     X16     X17      X18
## 1    2012    2013    2014    2015     2016
## 2    2012    2013    2014    2015 Measures
## 3 219,168 223,407 228,423 233,130  236,100
## 4 199,741 203,535 208,238 212,530  215,430
## 5  19,427  19,872  20,185  20,600   20,670
## 6  16,963  19,287  19,362  18,806   18,185

As you can see, we got the data in quite a nice format, but it still needs to be cleaned a bit. Let’s do this.

First, let’s use the first row as the header of the data set and then remove it:

colnames(data_raw) <- data_raw[2, ]
colnames(data_raw)[1:2] <- c("division", "variable")
data_raw <- data_raw[-c(1,2), ]
head(data_raw)
##                    division                   variable    2001    2002
## 3 Grand Duchy of Luxembourg  Total employed population 180,084 182,004
## 4 Grand Duchy of Luxembourg     of which: Wage-earners 162,407 164,277
## 5 Grand Duchy of Luxembourg of which: Non-wage-earners  17,677  17,727
## 6 Grand Duchy of Luxembourg                 Unemployed   5,393   6,773
## 7 Grand Duchy of Luxembourg          Active population 185,477 188,777
## 8 Grand Duchy of Luxembourg   Unemployment rate (in %)    2.91    3.59
##      2003    2004    2005    2006    2007    2008    2009    2010    2011
## 3 183,419 186,325 187,380 192,095 197,486 202,203 204,127 207,923 214,094
## 4 165,509 168,214 169,194 174,045 179,176 183,705 185,369 188,983 194,893
## 5  17,910  18,111  18,186  18,050  18,310  18,498  18,758  18,940  19,201
## 6   8,359   9,426  10,653  10,297   9,670  11,496  14,816  15,567  16,159
## 7 191,778 195,751 198,033 202,392 207,156 213,699 218,943 223,490 230,253
## 8    4.36    4.82    5.38    5.09    4.67    5.38    6.77    6.97    7.02
##      2012    2013    2014    2015 Measures
## 3 219,168 223,407 228,423 233,130  236,100
## 4 199,741 203,535 208,238 212,530  215,430
## 5  19,427  19,872  20,185  20,600   20,670
## 6  16,963  19,287  19,362  18,806   18,185
## 7 236,131 242,694 247,785 251,936  254,285
## 8    7.18    7.95    7.81    7.46     7.15

This is starting to look nice, but we need to replace the “,” with “.” and then convert the columns to numeric.

data_raw %>%
  map_df(function(x)(gsub(",", ".", x = x))) %>%
  mutate_at(vars(matches("\\d{4}")), as.numeric
            ) -> clean_unemp

head(clean_unemp)
## # A tibble: 6 x 18
##   division    variable    `2001` `2002` `2003` `2004` `2005` `2006` `2007`
##   <chr>       <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Grand Duch… Total empl… 180    182    183    186    187    192    197   
## 2 Grand Duch… of which: … 162    164    166    168    169    174    179   
## 3 Grand Duch… of which: …  17.7   17.7   17.9   18.1   18.2   18.0   18.3 
## 4 Grand Duch… Unemployed    5.39   6.77   8.36   9.43  10.7   10.3    9.67
## 5 Grand Duch… Active pop… 185    189    192    196    198    202    207   
## 6 Grand Duch… Unemployme…   2.91   3.59   4.36   4.82   5.38   5.09   4.67
## # ... with 9 more variables: `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## #   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## #   Measures <chr>

This line: map_df(function(x)(gsub(",", ".", x = x))) calls purrr::map_df(), which maps a function to each column of a data frame. The function in question is function(x)(gsub(",", ".", x = x)), which is an anonymous function (meaning it does not have a name) wrapped around gsub. This function looks for the string “,” and replaces it with “.” in a single column of the data frame. But because we’re mapping this function to all the columns of the data frame with purrr::map_df(), this substitution happens in each column. We’ not done yet, because these columns are still holding characters. We need to convert each column to a numeric vector and this is what happens in the next line, mutate_at(vars(matches("\\d{4}")), as.numeric). Each column that contains exactly for digits (hence the "\\d{4}") is converted to numeric with dplyr::mutate_at().

Now, one last step to really have the data in a nice format:

clean_unemp %>% 
    gather(key=year, value, -division, -variable) %>%
    spread(variable, value) %>%
    clean_names(
           ) -> clean_unemp

head(clean_unemp)
## # A tibble: 6 x 8
##   division year  active_population of_which_non_wage_e… of_which_wage_ear…
##   <chr>    <chr> <chr>             <chr>                <chr>             
## 1 Beaufort 2001  688               85                   568               
## 2 Beaufort 2002  742               85                   631               
## 3 Beaufort 2003  773               85                   648               
## 4 Beaufort 2004  828               80                   706               
## 5 Beaufort 2005  866               96                   719               
## 6 Beaufort 2006  893               87                   746               
## # ... with 3 more variables: total_employed_population <chr>,
## #   unemployed <chr>, unemployment_rate_in_percent <chr>

By using tidyr::gather() and then tidyr::spread() we get a nice data set where each column is a variable and each row is an observation. I advise you run the above code line by line and try to understand what each function does. We finish by cleaning the names of the variables with janitor::clean_names() and that’s it.

Don’t hesitate to follow us on twitter @rdata_lu and to subscribe to our youtube channel.
You can also contact us if you have any comments or suggestions. See you for the next post!

To leave a comment for the author, please follow the link and comment on their blog: rdata.lu Blog | Data science with R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)