Pirating Pirate Data for Pirate Day


This past Tuesday was Talk Like A Pirate Day, the unofficial holiday of R (aRRR!) users worldwide. In recognition of the day, Bob Rudis used R to create this map of worldwide piracy incidents from 2013 to 2017:

[Map: worldwide piracy incidents, 2013–2017]

The post provides a useful and practical example of extracting data from a website without an API, otherwise known as “scraping” data. In this case, the website was that of the International Chamber of Commerce, and the provided R code demonstrates several useful R packages for scraping:

  • rvest, for extracting data from a web page's HTML source
  • purrr, and specifically the safely function, for streamlining the process of iterating over pages that may return errors
  • purrrlyr, for iterating over the rows of the scraped data
  • splashr, for automating the process of taking a screenshot of a webpage
  • and robotstxt, for checking whether automated downloads of the website's content are allowed (a sketch of how a few of these packages fit together follows this list) 
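To give a sense of how these pieces work together, here is a minimal sketch (not Bob's actual code) that checks robots.txt, scrapes a table from each page with rvest, wraps the scraper in purrr's safely(), and pauses between requests. The domain, paths, and CSS selector here are placeholders:

    library(robotstxt)
    library(rvest)
    library(purrr)

    # Hypothetical report pages to scrape (placeholder URLs)
    report_urls <- sprintf("https://example.com/piracy-reports/page/%d", 1:5)

    # Check that the site's robots.txt permits automated access to these paths
    stopifnot(paths_allowed("/piracy-reports/", domain = "example.com"))

    # Scrape a single page: pull the first HTML table into a data frame
    scrape_page <- function(url) {
      read_html(url) %>%
        html_node("table") %>%
        html_table()
    }

    # safely() wraps the scraper so one failing page doesn't abort the whole run
    safe_scrape <- safely(scrape_page)

    # Iterate politely: pause between requests so we don't hammer the server
    results <- map(report_urls, function(url) {
      Sys.sleep(5)
      safe_scrape(url)
    })

    # Keep only the pages that scraped successfully
    tables <- compact(map(results, "result"))

The final step keeps only the pages that returned cleanly, which is exactly the failure-tolerant iteration pattern safely() is designed to support.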

That last package is an important one, because while it's almost always technically possible to automate the process of extracting data from a website, it's not always allowed or even legal. Inspecting the robots.txt file is one check you should definitely make, and you should also review the website's terms of service, which may prohibit the practice. Even then, you should be respectful and take care not to overload the server with frequent and/or repeated calls, as Bob demonstrates by spacing his requests five seconds apart. Finally — and most importantly! — even if scraping isn't expressly forbidden, using scraped data may not be ethical, especially when the data is about people who are unable to give their individual consent to your use of the data. This Forbes article about analyzing data scraped from a dating website offers an instructive tale in that regard.
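For instance, a quick way to inspect a site's robots.txt rules before writing any scraping code might look like the following sketch (again with a placeholder domain and a hypothetical path):

    library(robotstxt)

    # Fetch and parse the site's robots.txt (placeholder domain)
    rt <- robotstxt(domain = "example.com")

    # Rules declared for each user agent, including any disallowed paths
    rt$permissions

    # Any Crawl-delay the site requests between successive calls
    rt$crawl_delay

    # Ask directly whether a specific (hypothetical) path may be crawled
    rt$check(paths = "/piracy-reports/", bot = "*")

If the site declares a Crawl-delay, that value is a sensible lower bound for how long to sleep between requests.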

This piracy data example, however, provides a case study of using a website and the data it provides in the right way. Follow the link below for all the details.

rud.is: Pirating Web Content Responsibly With R
