Unlocking Data in PDFs

[This article was first published on Silent Spring Institute Developer Blog, and kindly contributed to R-bloggers.]

Unfortunately, a lot of data is released on the web in the form of PDF files. Scraping data out of PDFs is much harder than scraping from a web page; web pages have structure, in the form of HTML, that you can usually leverage to extract structured data.

It isn’t hopeless; it’s just harder. Here are some of the tools and techniques that we’ve found useful in parsing data from PDFs.

pdftotext

pdftotext is a utility from the Xpdf project that converts PDFs to flat text files. It is easiest to install and use on Unix-based platforms, where it can be found in the poppler-utils package. There is also a Windows port of Xpdf that I’ve used successfully.
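For scripted use, pdftotext can be called from another program. The following is a minimal sketch in Python, assuming the pdftotext binary is installed and on your PATH. The function names (build_command, pdf_to_text) and the file names are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path):
    """Return the argument list for a basic pdftotext run."""
    # Command-line usage is: pdftotext [options] PDF-file [text-file]
    return ["pdftotext", str(pdf_path), str(txt_path)]


def pdf_to_text(pdf_path, txt_path=None):
    """Convert one PDF to a text file and return the text file's path."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    pdf_path = Path(pdf_path)
    # By default, write the text next to the PDF with a .txt extension.
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling pdf_to_text("report.pdf") would then write report.txt alongside the PDF, provided the conversion succeeds.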

pdftotext has several useful flags that affect how it parses its input:

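  • -layout: preserves the original physical layout of the page as closely as possible, which keeps table columns aligned
  • -raw: keeps the text in content stream order, that is, the order in which it is stored in the PDF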
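Text produced with the -layout flag (for example, by running pdftotext -layout report.pdf report.txt) tends to keep table columns separated by runs of spaces. As a rough sketch of one way to turn such text back into rows, the Python below splits each line on two or more consecutive spaces. The sample text and the function names are made up for illustration, and the two-space rule is a heuristic assumption, not something pdftotext guarantees.

```python
import re

# Hypothetical sample of -layout output for a small table; values are made up.
SAMPLE = """\
Chemical        Units     Median
Compound A      ng/g      12.5
Compound B      ng/g      3.1
"""


def split_columns(line):
    """Split one line of -layout output on runs of two or more spaces."""
    return re.split(r" {2,}", line.strip())


def parse_layout(text):
    """Return a list of rows (lists of cell strings), skipping blank lines."""
    return [split_columns(line) for line in text.splitlines() if line.strip()]


rows = parse_layout(SAMPLE)
for row in rows:
    print(row)
```

Single spaces inside a cell, such as "Compound A", survive the split. The heuristic does break down when a cell is empty or when columns are separated by only one space, so the result is worth checking against the original PDF.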

tabula

Recently I’ve switched to Tabula for most of my PDF scraping needs. Tabula is a desktop application for extracting data from PDFs. I’ve found it to be more reliable than pdftotext. The only drawback is that it isn’t a command-line program, so automating the scraping isn’t as easy as it is with pdftotext. On the other hand, you can visually select the parts of the PDFs you’d like to scrape, which is useful for one-off jobs.
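Tabula can export a selection as a CSV file, and a small script can tidy that export before analysis. The sketch below, in Python, trims stray whitespace from cells and drops rows that are entirely empty. The function name clean_rows, the file name in the comment, and the inline sample data are hypothetical stand-ins, not actual Tabula output.

```python
import csv
import io


def clean_rows(csv_file):
    """Yield rows from a CSV export, trimming cells and dropping empty rows."""
    for row in csv.reader(csv_file):
        cells = [cell.strip() for cell in row]
        if any(cells):
            yield cells


# Stand-in for an exported file. In practice you might use
# open("tabula-export.csv", newline="") instead (hypothetical file name).
example = io.StringIO('Chemical,Median\n"Compound A ", 12.5\n,\nCompound B,3.1\n')
table = list(clean_rows(example))
print(table)
```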
