robust pdf title extraction

March 28, 2012

(This article was first published on idle thoughts » R, and kindly contributed to R-bloggers)

I end up with a lot of PDF documents lying around – at last glance, this amounted to a few thousand files. Unfortunately, most of these documents end up with rather obscure names, making it rather annoying to find what I want, or what is interesting.  For example, these are the documents I’ve recently downloaded:


I previously tried to organize everything using something like Papers, which is a lovely product, but still required effort from me and isn’t very useful now that I no longer have a Mac.

I’ve also tried to rectify this situation via half-hearted attempts at using pdftotext, and grabbing the first 10 words of text, but more often then not I was left with more incomprehensible garbage.

Today, I had some spare time, and far too much interest in this problem, but I managed to come up with an easy and fairly effective solution.  It also resembles a rube-goldberg machine.  After digging around for various pdf conversion utilities, I discovered that pdftohtml not only generated reasonable output, but it could also be set to output to an easily parsed xml format.  From there it was a simple bit of BeautifulSoup to get nice titles for most of my documents:

View the code on Gist.

To leave a comment for the author, please follow the link and comment on their blog: idle thoughts » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)