# robust pdf title extraction

March 28, 2012

(This article was first published on idle thoughts » R, and kindly contributed to R-bloggers)

I accumulate a lot of PDF documents: at last glance, a few thousand files. Unfortunately, most of them have rather obscure names, which makes it annoying to find what I want, or what is interesting. For example, these are the documents I’ve recently downloaded:

wodet3-paper12.pdf
jong_afst.pdf
tut_gpu_2012_03.pdf
lecture1-1.pdf
natella_binary_sfi_edcc_2012.pdf
TR-Farrukh-58.pdf
730959.pdf
NLSEmagic_Paper.pdf
M23584378H1770Q2.pdf
G89T37P10W263075.pdf
journal_online.pdf
manus_Jour-INFORMATION-Camera.pdf
12011.VitekJan.Paper.pdf
R3X8722476T2X278.pdf
1203.0321.pdf

I previously tried to organize everything using something like Papers, which is a lovely product, but it still required effort from me and isn’t very useful now that I no longer have a Mac.

I’ve also tried to rectify this situation with half-hearted attempts at running pdftotext and grabbing the first 10 words of text, but more often than not I was left with more incomprehensible garbage.
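
For reference, that naive approach might look roughly like the sketch below. It is a reconstruction rather than the script I actually used: the function names `first_words` and `naive_title` are made up for illustration, and it assumes the `pdftotext` command-line tool is installed and on the PATH.

```python
import subprocess


def first_words(text, count=10):
    """Return the first `count` whitespace-separated words of `text`."""
    return " ".join(text.split()[:count])


def naive_title(pdf_path, count=10):
    """Guess a title from the first words on page 1 of a PDF.

    Assumption: the `pdftotext` binary is available on PATH.
    """
    # -f 1 -l 1 limits extraction to the first page; "-" sends text to stdout.
    result = subprocess.run(
        ["pdftotext", "-f", "1", "-l", "1", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return first_words(result.stdout, count)
```

Calling `naive_title("1203.0321.pdf")` returns whatever text happens to come first on the page, which may well be a running header, a journal banner, or a preprint stamp rather than the title. That is one plausible reason the results were so poor.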

Today I had some spare time and far too much interest in this problem, and I managed to come up with an easy and fairly effective solution. It also resembles a Rube Goldberg machine. After digging around for various PDF conversion utilities, I discovered that pdftohtml not only generates reasonable output, but can also be set to output an easily parsed XML format. From there it was a simple bit of BeautifulSoup to get nice titles for most of my documents: