Scraping GDPR Fines

Posted on April 7, 2020 by Roel M. Hogervorst in R bloggers | 0 Comments

[This article was first published on Category R on Roel's R-tefacts, and kindly contributed to R-bloggers].

The website Privacy Affairs keeps a list of fines related to GDPR. I heard * that this might be an interesting dataset for TidyTuesday. At this moment the dataset contains 250 fines given out for GDPR violations, and it was last updated (according to the website) on 31 March 2020.

All data is from official government sources, such as official reports of national Data Protection Authorities.

The largest fine is €50,000,000, imposed on Google Inc. in France on January 21, 2019. The smallest is actually 0 euros, although the website says 90.

Scraping

I use the {rvest} package to scrape the website.

Before you start

I first checked the robots.txt of this website, and it did not disallow scraping the page.
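A check like this can also be scripted. The sketch below is a minimal illustration in base R, not a full robots.txt parser: the helper `is_disallowed()` and the sample rules are hypothetical, it only looks at the `User-agent: *` group, and it ignores `Allow` lines, wildcards and grouped user agents. For real use, a dedicated package such as {robotstxt} is probably the better choice.

```r
# Hypothetical helper: is `path` disallowed for all user agents ("*")?
# `lines` holds the content of a robots.txt file, one element per line.
is_disallowed <- function(lines, path) {
  lines <- trimws(sub("#.*$", "", lines))   # drop comments and whitespace
  lines <- lines[nzchar(lines)]             # drop empty lines
  in_star_group <- FALSE
  for (line in lines) {
    field <- tolower(trimws(sub(":.*$", "", line)))
    value <- trimws(sub("^[^:]*:", "", line))
    if (field == "user-agent") {
      in_star_group <- identical(value, "*")
    } else if (field == "disallow" && in_star_group && nzchar(value)) {
      # A Disallow rule applies to every path that starts with its value
      if (startsWith(path, value)) return(TRUE)
    }
  }
  FALSE
}

# Made-up example rules, NOT the real robots.txt of the website
example_rules <- c(
  "User-agent: *",
  "Disallow: /wp-admin/",
  "Allow: /wp-admin/admin-ajax.php"
)

is_disallowed(example_rules, "/gdpr-fines/")
is_disallowed(example_rules, "/wp-admin/options.php")
```

To run this against a live site you would pass in the lines of its robots.txt, for example via `readLines()` on the file's URL. The usual location is `/robots.txt` at the site root, but that is an assumption to verify for each site.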

The scraping

I thought this would be easy and done in a minute, but there were some snafus. It works for now, but if the website changes even a little, this scraping routine will probably break. It extracts the script part of the page and then pulls out the data between ‘[’ and ‘]’. If anyone has ideas on making this more robust, be sure to let me know on Twitter.

Details about the scraping part

First I noticed that the website doesn’t show you all of the fines, but when we look at the source of the page they all seem to be there. It should be relatively simple to retrieve the data: it sits in the JavaScript part of the page (see picture).

(Image: source code of the website)

But extracting that data takes quite a bit more work:

First find the <script> tags on the website
Find the node that contains the data
Realize that there are actually two data sources in there

library(rvest)
## Loading required package: xml2
link <- "https://www.privacyaffairs.com/gdpr-fines/"
page <- read_html(link)
# The data sits in the 9th <script> node; this position may change when the site is updated
temp <- page %>%
  html_nodes("script") %>%
  .[9] %>%
  rvest::html_text()
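Picking the ninth script node by position breaks as soon as the page gains or loses a script tag. One possible alternative, sketched here on a made-up HTML snippet rather than the real page, is to keep the script nodes whose text contains a marker string. The snippet, the marker `"allItems"` and the variable names are all assumptions for illustration; the real page may use different names.

```r
library(rvest)

# Made-up stand-in for the real page
toy_html <- '<html><head>
  <script>var tracking = true;</script>
  <script>var allItems = [{"id": 1, "price": 90}];</script>
</head><body></body></html>'

page <- read_html(toy_html)
scripts <- html_text(html_nodes(page, "script"))

# Keep the script(s) that mention the marker instead of relying on position
marker <- "allItems"   # hypothetical marker string
temp <- scripts[grepl(marker, scripts, fixed = TRUE)]
length(temp)
```

If the marker matches more than one script, or none, that is a useful signal that the page layout has changed and the scraper needs another look.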

Cry (joking, don’t give up! The #rstats community will help you!)
Do some advanced string manipulation to extract the two JSON structures
Read the JSON data into R

library(stringr)
# Positions of every "]" and "[" in the script text
ends <- str_locate_all(temp, "\\]")
starts <- str_locate_all(temp, "\\[")
# First JSON array: from the first "[" up to the first "]"
table1 <- temp %>%
  stringi::stri_sub(from = starts[[1]][1, 2], to = ends[[1]][1, 1]) %>%
  str_remove_all("\n") %>%
  str_remove_all("\r") %>%
  jsonlite::fromJSON()
# Second JSON array: from the second "[" up to the second "]"
table2 <- temp %>%
  stringi::stri_sub(from = starts[[1]][2, 2], to = ends[[1]][2, 1]) %>%
  str_remove_all("\n") %>%
  str_remove_all("\r") %>%
  jsonlite::fromJSON()

Profit
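Since the two extraction steps differ only in which bracket pair they use, they could be folded into one helper. The following is a sketch under assumptions, not the code from the repo: `extract_json_array()` is a hypothetical name, the toy string stands in for the real script text, and like the approach above it assumes the arrays contain no nested ‘[’ or ‘]’.

```r
# Hypothetical helper: return the n-th "[...]" block of `txt` as one string.
# Assumes there are no nested brackets inside the arrays.
extract_json_array <- function(txt, n) {
  starts <- gregexpr("[", txt, fixed = TRUE)[[1]]
  ends   <- gregexpr("]", txt, fixed = TRUE)[[1]]
  # Fail early when the text has no brackets or fewer than n pairs
  stopifnot(starts[1] != -1, ends[1] != -1,
            n <= length(starts), n <= length(ends))
  block <- substr(txt, starts[n], ends[n])
  gsub("[\r\n]", "", block)   # strip line breaks, as in the original code
}

# Toy stand-in for the script text, with made-up values
toy <- 'var a = [{"id": 1,\n "price": 90}];\nvar b = [{"id": 2, "price": 50000000}];'

first  <- extract_json_array(toy, 1)
second <- extract_json_array(toy, 2)

# Parse only when {jsonlite} is available
if (requireNamespace("jsonlite", quietly = TRUE)) {
  table1 <- jsonlite::fromJSON(first)
  table2 <- jsonlite::fromJSON(second)
}
```

The `stopifnot()` call makes the helper fail loudly when the page no longer contains the expected number of arrays, which is easier to debug than a silently wrong table.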

I also tried it on the pure text before I gave up and returned to HTML parsing. You can see that in the repo.

(*) I was tricked into this through Twitter, via #rstats and #tidytuesday

https://twitter.com/hrbrmstr/status/1247476867621421061

Links

rvest documentation https://rvest.tidyverse.org/articles/harvesting-the-web.html#css-selectors
The source website for the data set https://www.privacyaffairs.com/gdpr-fines/
Tidy Tuesday website https://github.com/rfordatascience/tidytuesday
Source code for the scraper https://github.com/RMHogervorst/scrape_gdpr_fines
Picture credit: Photo by Paulius Dragunas on Unsplash https://unsplash.com/photos/uw_NWjC1mBE

State of the machine

At the moment of creation (when I knitted this document) this was the state of my machine:

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.3 (2020-02-29)
## os macOS Mojave 10.14.6
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/Amsterdam
## date 2020-04-08
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## blogdown 0.18 2020-03-04 [1] CRAN (R 3.6.1)
## bookdown 0.18 2020-03-05 [1] CRAN (R 3.6.1)
## cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.0)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
## glue 1.3.2 2020-03-12 [1] CRAN (R 3.6.0)
## htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
## httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.0)
## knitr 1.28 2020-02-06 [1] CRAN (R 3.6.0)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
## Rcpp 1.0.4 2020-03-17 [1] CRAN (R 3.6.1)
## rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.0)
## rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0)
## rvest * 0.3.5 2019-11-08 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.0)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
## xfun 0.12 2020-01-13 [1] CRAN (R 3.6.0)
## xml2 * 1.2.2 2019-08-09 [1] CRAN (R 3.6.0)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
##
## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
