Pirating Pirate Data for Pirate Day

September 21, 2017

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

This past Tuesday was Talk Like A Pirate Day, the unofficial holiday of R (aRRR!) users worldwide. In recognition of the day, Bob Rudis used R to create this map of worldwide piracy incidents from 2013 to 2017:

[Figure: map of worldwide piracy incidents, 2013 to 2017]

The post provides a useful and practical example of extracting data from a website without an API, otherwise known as "scraping" data. In this case the website was that of the International Chamber of Commerce, and the provided R code demonstrates several useful R packages for scraping:

  • rvest, for extracting data from a web page's HTML source
  • purrr, and specifically the safely function, for streamlining the process of iterating over pages that may return errors
  • purrrlyr, for iterating over the rows of scraped data
  • splashr, for automating the process of taking a screenshot of a webpage
  • and robotstxt, for checking whether automated downloads of the website content are allowed 
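To give a flavor of how rvest and purrr's safely function fit together, here is a minimal sketch (not Bob's actual code). It assumes rvest 1.0 or later for html_element, and it uses a hypothetical fetch_page function with made-up inline HTML in place of real network requests, so that one "page" can fail on purpose:

```r
library(rvest)
library(purrr)

# Hypothetical page contents, inlined so the sketch runs without a network.
pages <- list(
  "1" = "<table><tr><th>id</th><th>type</th></tr><tr><td>1</td><td>Boarded</td></tr></table>",
  "2" = "<table><tr><th>id</th><th>type</th></tr><tr><td>2</td><td>Attempted</td></tr></table>"
)

# Hypothetical stand-in for a real download: errors on unknown page ids.
fetch_page <- function(id) {
  if (!id %in% names(pages)) stop("page not found: ", id)
  read_html(pages[[id]])
}

# safely() wraps a function so it always returns list(result, error)
# instead of stopping the whole loop on the first failure.
safe_fetch <- safely(fetch_page)

results <- map(c("1", "404", "2"), safe_fetch)

# Keep only the pages that were fetched without an error.
ok <- keep(results, ~ is.null(.x$error))

# Extract the table from each successful page and stack the rows.
tables <- map(ok, ~ html_table(html_element(.x$result, "table")))
incidents <- do.call(rbind, tables)
```

The failed page ends up with a NULL result and a populated error element, so it can be logged or retried later rather than halting the scrape.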

That last package is an important one: while it's almost always technically possible to automate the process of extracting data from a website, it's not always allowed or even legal. Inspecting the robots.txt file is one check you should definitely make, and you should also check the website's terms of service, which may prohibit the practice as well. Even then, you should be respectful and take care not to overload the server with frequent or repeated calls, as Bob demonstrates by spacing requests 5 seconds apart. Finally, and most importantly, even if scraping isn't expressly forbidden, using scraped data may not be ethical, especially when the data is about people who are unable to give their individual consent to your use of it. This Forbes article about analyzing data scraped from a dating website offers an instructive tale in that regard.
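As a rough illustration of the first two checks (again, a sketch rather than the code from Bob's post), the robotstxt package can evaluate paths against a set of rules, and Sys.sleep can space out the requests. The robots.txt text, the paths, and the polite_map helper below are all hypothetical; the rules are inlined so the example runs offline:

```r
library(robotstxt)

# Hypothetical robots.txt content; a real check would read the site's own file.
rules_text <- paste(
  "User-agent: *",
  "Disallow: /private/",
  sep = "\n"
)
rules <- robotstxt(text = rules_text)

# Hypothetical paths we would like to scrape.
paths <- c("/reports/page1", "/private/data", "/reports/page2")

# check() returns one logical per path: TRUE if the bot may fetch it.
allowed <- as.logical(rules$check(paths = paths, bot = "*"))
allowed_paths <- paths[allowed]

# Hypothetical helper: call `fetch` on each path, pausing `delay` seconds
# between consecutive calls so the server is not hit in rapid succession.
polite_map <- function(paths, fetch, delay = 5) {
  results <- vector("list", length(paths))
  for (i in seq_along(paths)) {
    if (i > 1) Sys.sleep(delay)
    results[[i]] <- fetch(paths[[i]])
  }
  results
}

# Placeholder fetch function; the delay is shortened for the demo only.
fetched <- polite_map(
  allowed_paths,
  fetch = function(p) paste("fetched", p),
  delay = 0.1
)
```

In real use the delay would stay at the default of 5 seconds, and a permissive robots.txt still would not settle the terms-of-service or ethical questions.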

This piracy data example, however, provides a case study in using a website and the data it provides in the right way. Follow the link below for all the details.

rud.is: Pirating Web Content Responsibly With R
