Unlocking Data in PDFs
Unfortunately, a lot of data is released on the web in the form of PDF files. Scraping data out of PDFs is much harder than scraping a web page; web pages have structure, in the form of HTML, that you can usually leverage to extract structured data.
It isn’t hopeless, it’s just harder. Here are some of the tools and techniques that we’ve found useful in parsing data from PDFs.
pdftotext
pdftotext is a utility from the Xpdf project that converts PDFs to flat text files. It is easiest to install and use on Unix-based platforms, where it can be found in the poppler-utils package. There is also a Windows port of Xpdf that I've used successfully.
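Because it is a command-line tool, pdftotext is also easy to drive from a script. Here is a minimal sketch in Python (the file names and the wrapper functions are hypothetical, not part of pdftotext itself); building the command line separately makes it easy to toggle flags:

```python
import shutil
import subprocess

def pdftotext_cmd(pdf_path, txt_path, layout=False, raw=False):
    """Build a pdftotext command line. -layout and -raw change how the
    text is extracted, so enable at most one of them."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    if raw:
        cmd.append("-raw")
    cmd += [pdf_path, txt_path]
    return cmd

def convert(pdf_path, txt_path, **flags):
    """Run the conversion, failing early if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    subprocess.run(pdftotext_cmd(pdf_path, txt_path, **flags), check=True)
```

For example, `pdftotext_cmd("report.pdf", "report.txt", layout=True)` returns `["pdftotext", "-layout", "report.pdf", "report.txt"]`.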
pdftotext has several useful flags that affect how it parses its input:
-layout : maintain the original physical layout of the text, which keeps tabular data aligned in columns
-raw : keep the text in content-stream order, rather than trying to reconstruct the reading order
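Once -layout has produced aligned columns, the flat text is usually easy to split on runs of whitespace. A sketch of that post-processing step, using made-up sample data (real output will need its own header handling and column checks):

```python
import re

# Hypothetical sample of what `pdftotext -layout` output might look like:
layout_text = """\
State         Population    Median Income
Alabama        5,024,279          $52,035
Alaska           733,391          $77,640
"""

def parse_columns(text):
    """Split each data row on runs of 2+ spaces, the column gap that
    -layout typically leaves between fields."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 3:            # keep only well-formed rows
            rows.append(tuple(fields))
    return rows
```

Splitting on two or more spaces (rather than any whitespace) preserves fields that contain single internal spaces, such as "Median Income" style values.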
tabula
Recently I've switched to Tabula for most of my PDF scraping needs. Tabula is a desktop application for extracting data from PDFs, and I've found it to be more reliable than pdftotext. The only drawback is that it isn't a command-line program, so automating the scraping isn't as easy as with pdftotext. On the other hand, you can visually select the parts of the PDF you'd like to scrape, which is useful for one-off jobs.