Scraping data from STATEC’s public tables
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A lot of open data is available in Luxembourg’s open data portal, but sometimes, it is not very easy to download. In the video below, I give you an example of such data and show how you can use rvest
to get the data easily.
After watching the video, take a look at the code below. This code does two things; first it scrapes the data, and then it puts the data in a tidy format fur further processing.
So to summarize the idea of the video; instead of clicking the buttons to download each year’s data (which you would have to do 15 times), it is easier to simple turn off javascript and then scrape the html version of the table. It would be possible, albeit with much more effort, to scrape the tables with javascript enabled, by using a tool such as phantomjs. But since we have the possibility to view the table in html, why not take advantage of it?
To scrape the data, you will need first to install the rvest
and then load it (and let’s also load the other needed packages)
library(rvest) library(dplyr) library(tidyr) library(purrr) library(janitor)
Now, using rvest::read_html()
, we can download the whole html page:
page_unemp <- read_html("http://www.statistiques.public.lu/stat/TableViewer/tableViewHTML.aspx?ReportId=12950&IF_Language=eng&MainTheme=2&FldrName=3&RFPath=91")
Now, we need to extract the table from the html page, and we do this by using rvest::html_nodes()
and by providing this function with the name of the class of the object we’re interested in, namely, the table.
page_unemp %>% html_nodes(".b2020-datatable") %>% .[[1]] %>% html_table(fill = TRUE) -> data_raw head(data_raw) ## X1 X2 X3 X4 ## 1 Year Year 2001 2002 ## 2 Specification Year 2001 2002 ## 3 Grand Duchy of Luxembourg Total employed population 180,084 182,004 ## 4 Grand Duchy of Luxembourg of which: Wage-earners 162,407 164,277 ## 5 Grand Duchy of Luxembourg of which: Non-wage-earners 17,677 17,727 ## 6 Grand Duchy of Luxembourg Unemployed 5,393 6,773 ## X5 X6 X7 X8 X9 X10 X11 X12 X13 ## 1 2003 2004 2005 2006 2007 2008 2009 2010 2011 ## 2 2003 2004 2005 2006 2007 2008 2009 2010 2011 ## 3 183,419 186,325 187,380 192,095 197,486 202,203 204,127 207,923 214,094 ## 4 165,509 168,214 169,194 174,045 179,176 183,705 185,369 188,983 194,893 ## 5 17,910 18,111 18,186 18,050 18,310 18,498 18,758 18,940 19,201 ## 6 8,359 9,426 10,653 10,297 9,670 11,496 14,816 15,567 16,159 ## X14 X15 X16 X17 X18 ## 1 2012 2013 2014 2015 2016 ## 2 2012 2013 2014 2015 Measures ## 3 219,168 223,407 228,423 233,130 236,100 ## 4 199,741 203,535 208,238 212,530 215,430 ## 5 19,427 19,872 20,185 20,600 20,670 ## 6 16,963 19,287 19,362 18,806 18,185
As you can see, we got the data in quite a nice format, but it still needs to be cleaned a bit. Let’s do this.
First, let’s use the first row as the header of the data set and then remove it:
colnames(data_raw) <- data_raw[2, ] colnames(data_raw)[1:2] <- c("division", "variable") data_raw <- data_raw[-c(1,2), ] head(data_raw) ## division variable 2001 2002 ## 3 Grand Duchy of Luxembourg Total employed population 180,084 182,004 ## 4 Grand Duchy of Luxembourg of which: Wage-earners 162,407 164,277 ## 5 Grand Duchy of Luxembourg of which: Non-wage-earners 17,677 17,727 ## 6 Grand Duchy of Luxembourg Unemployed 5,393 6,773 ## 7 Grand Duchy of Luxembourg Active population 185,477 188,777 ## 8 Grand Duchy of Luxembourg Unemployment rate (in %) 2.91 3.59 ## 2003 2004 2005 2006 2007 2008 2009 2010 2011 ## 3 183,419 186,325 187,380 192,095 197,486 202,203 204,127 207,923 214,094 ## 4 165,509 168,214 169,194 174,045 179,176 183,705 185,369 188,983 194,893 ## 5 17,910 18,111 18,186 18,050 18,310 18,498 18,758 18,940 19,201 ## 6 8,359 9,426 10,653 10,297 9,670 11,496 14,816 15,567 16,159 ## 7 191,778 195,751 198,033 202,392 207,156 213,699 218,943 223,490 230,253 ## 8 4.36 4.82 5.38 5.09 4.67 5.38 6.77 6.97 7.02 ## 2012 2013 2014 2015 Measures ## 3 219,168 223,407 228,423 233,130 236,100 ## 4 199,741 203,535 208,238 212,530 215,430 ## 5 19,427 19,872 20,185 20,600 20,670 ## 6 16,963 19,287 19,362 18,806 18,185 ## 7 236,131 242,694 247,785 251,936 254,285 ## 8 7.18 7.95 7.81 7.46 7.15
This is starting to look nice, but we need to replace the “,” with “.” and then convert the columns to numeric.
data_raw %>% map_df(function(x)(gsub(",", ".", x = x))) %>% mutate_at(vars(matches("\\d{4}")), as.numeric ) -> clean_unemp head(clean_unemp) ## # A tibble: 6 x 18 ## division variable `2001` `2002` `2003` `2004` `2005` `2006` `2007` ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Grand Duch… Total empl… 180 182 183 186 187 192 197 ## 2 Grand Duch… of which: … 162 164 166 168 169 174 179 ## 3 Grand Duch… of which: … 17.7 17.7 17.9 18.1 18.2 18.0 18.3 ## 4 Grand Duch… Unemployed 5.39 6.77 8.36 9.43 10.7 10.3 9.67 ## 5 Grand Duch… Active pop… 185 189 192 196 198 202 207 ## 6 Grand Duch… Unemployme… 2.91 3.59 4.36 4.82 5.38 5.09 4.67 ## # ... with 9 more variables: `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, ## # `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, ## # Measures <chr>
This line: map_df(function(x)(gsub(",", ".", x = x)))
calls purrr::map_df()
, which maps a function to each column of a data frame. The function in question is function(x)(gsub(",", ".", x = x))
, which is an anonymous function (meaning it does not have a name) wrapped around gsub
. This function looks for the string “,” and replaces it with “.” in a single column of the data frame. But because we’re mapping this function to all the columns of the data frame with purrr::map_df()
, this substitution happens in each column. We’ not done yet, because these columns are still holding characters. We need to convert each column to a numeric vector and this is what happens in the next line, mutate_at(vars(matches("\\d{4}")), as.numeric)
. Each column that contains exactly for digits (hence the "\\d{4}"
) is converted to numeric with dplyr::mutate_at()
.
Now, one last step to really have the data in a nice format:
clean_unemp %>% gather(key=year, value, -division, -variable) %>% spread(variable, value) %>% clean_names( ) -> clean_unemp head(clean_unemp) ## # A tibble: 6 x 8 ## division year active_population of_which_non_wage_e… of_which_wage_ear… ## <chr> <chr> <chr> <chr> <chr> ## 1 Beaufort 2001 688 85 568 ## 2 Beaufort 2002 742 85 631 ## 3 Beaufort 2003 773 85 648 ## 4 Beaufort 2004 828 80 706 ## 5 Beaufort 2005 866 96 719 ## 6 Beaufort 2006 893 87 746 ## # ... with 3 more variables: total_employed_population <chr>, ## # unemployed <chr>, unemployment_rate_in_percent <chr>
By using tidyr::gather()
and then tidyr::spread()
we get a nice data set where each column is a variable and each row is an observation. I advise you run the above code line by line and try to understand what each function does. We finish by cleaning the names of the variables with janitor::clean_names()
and that’s it.
Don’t hesitate to follow us on twitter @rdata_lu and to subscribe to our youtube channel.
You can also contact us if you have any comments or suggestions. See you for the next post!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.