I end up with a lot of PDF documents lying around – at last glance, this amounted to a few thousand files. Unfortunately, most of these documents have rather obscure names, which makes it annoying to find what I want, or what is interesting. For example, these are the documents I’ve recently downloaded:
wodet3-paper12.pdf jong_afst.pdf tut_gpu_2012_03.pdf lecture1-1.pdf natella_binary_sfi_edcc_2012.pdf TR-Farrukh-58.pdf 730959.pdf NLSEmagic_Paper.pdf M23584378H1770Q2.pdf G89T37P10W263075.pdf journal_online.pdf manus_Jour-INFORMATION-Camera.pdf 12011.VitekJan.Paper.pdf R3X8722476T2X278.pdf 1203.0321.pdf
I previously tried to organize everything using something like Papers, which is a lovely product, but still required effort from me and isn’t very useful now that I no longer have a Mac.
I’ve also tried to rectify this situation via half-hearted attempts at using pdftotext and grabbing the first 10 words of text, but more often than not I was left with more incomprehensible garbage.
Today I had some spare time and far too much interest in this problem, and I managed to come up with an easy and fairly effective solution. It also resembles a Rube Goldberg machine. After digging around for various PDF conversion utilities, I discovered that pdftohtml not only generates reasonable output, but can also be set to output an easily parsed XML format. From there it was a simple bit of BeautifulSoup to get nice titles for most of my documents:
[gist id=2078056]
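For readers who cannot load the gist, here is a rough sketch of the general idea, not a copy of the script above. It rests on a few assumptions: that the title is the text set in the largest font on the first page (a guess that fits many papers, but not all), that pdftohtml's XML consists of `<page>`, `<fontspec>` and `<text>` elements as recent poppler versions produce, and that the standard library's `xml.etree.ElementTree` is an acceptable stand-in for BeautifulSoup. The names `guess_title` and `pdf_to_xml` are made up for this example.

```python
import subprocess
import xml.etree.ElementTree as ET


def guess_title(xml_data):
    """Guess a title from pdftohtml XML: the largest-font text on page one.

    This is a heuristic (an assumption of this sketch), not a guarantee.
    """
    root = ET.fromstring(xml_data)
    page = root.find(".//page")
    if page is None:
        return None
    # Map each font id to the size declared by its <fontspec> element.
    sizes = {f.get("id"): float(f.get("size", "0")) for f in page.iter("fontspec")}
    # Collect (font size, text) for every non-empty <text> element, in order.
    lines = []
    for t in page.iter("text"):
        content = " ".join("".join(t.itertext()).split())
        if content:
            lines.append((sizes.get(t.get("font"), 0.0), content))
    if not lines:
        return None
    biggest = max(size for size, _ in lines)
    # Join every line set in the largest font; titles often span several lines.
    return " ".join(text for size, text in lines if size == biggest)


def pdf_to_xml(path):
    """Run pdftohtml on the first page only and return its XML as bytes."""
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", "-f", "1", "-l", "1", path],
        capture_output=True,
        check=True,
    )
    return result.stdout
```

Something like `guess_title(pdf_to_xml("1203.0321.pdf"))` would then give a candidate title to rename the file with. Expect misfires on documents whose first page carries a large banner or journal logo, and on PDFs whose XML output is not well-formed, where a more forgiving parser such as BeautifulSoup is the better choice.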
