Parse pdf files with R (on a Mac)

October 4, 2012
(This article was first published on Nicebread » R, and kindly contributed to R-bloggers)

Inspired by this blog post from theBioBucket, I created a script to parse all pdf files in a directory. Because it relies on the Terminal, it is Mac-specific, but adapting it to other systems shouldn’t be too hard (for a Windows starting point, see BioBucket’s script).

First, you have to install the command line tool pdftotext (a binary can be found on Carsten Blüm’s website). Then run the following script within a directory that contains the pdfs:

# helper function: get number of words in a string, separated by space, tab, newline, comma, or period.
nwords <- function(x){
	res <- strsplit(as.character(x), "[ \t\n,\\.]+")
	res <- lapply(res, length)
	unlist(res)
}
 
# sanitize file name for terminal usage (e.g., escape spaces, parentheses, and other special characters)
sanitize <- function(str) {
	gsub('([#$%&~_\\^\\\\{}\\s\\(\\)])', '\\\\\\1', str, perl = TRUE)
}
 
# get a list of all files in the current directory
fi <- list.files()
fi2 <- fi[grepl("\\.pdf$", fi)]	# keep only names that end in ".pdf"
 
 
## Parse files and do something with them ...
res <- data.frame() # keeps records of the calculations
for (f in fi2) {
	print(paste("Parsing", f))
 
	f2 <- sanitize(f)
	system(paste0("pdftotext ", f2), wait = TRUE)
 
	# read content of converted txt file
	filetxt <- sub("\\.pdf$", ".txt", f)
	text <- readLines(filetxt, warn=FALSE)
 
	# adjust encoding of text - you have to know it
	Encoding(text) <- "latin1"
 
	# Do something with the content - here: get word and character count of all pdfs in the current directory
	text2 <- paste(text, collapse="\n")	# collapse lines into one long string
 
	res <- rbind(res, data.frame(filename=f, wc=nwords(text2), cs=nchar(text2), cs.nospace=nchar(gsub("\\s", "", text2)))) 
 
	# remove converted text file
	file.remove(filetxt)
}
 
print(res)

… gives the following result (wc = word count, cs = character count, cs.nospace = character count without spaces):


> print(res)
                                                    filename    wc     cs cs.nospace
1                              Applied_Linear_Regression.pdf 33697 186280     154404
2                                           Baron-rpsych.pdf 22665 128440     105024
3                              bootstrapping regressions.pdf  6309  34042      27694
4                            Ch_multidimensional_scaling.pdf   718   4632       3908
5                                               corrgram.pdf  6645  40726      33965
6                   eRm - Extended Rach Modeling (Paper).pdf 11354  65273      53578
7                                           eRm (Folien).pdf   371   1407        886
8  Faraway 2002 - Practical Regression and ANOVA using R.pdf 68777 380902     310037
9                             Farnsworth-EconometricsInR.pdf 20482 125207     101157
10                                           ggplot_book.pdf 10681  65388      53551
11                                       ggplot2-lattice.pdf 18067 118591      93737
12                               lavaan_usersguide_0.3-1.pdf 12608  64232      52962
13                                  lme4 - Bootstrapping.pdf  2065  11739       9515
14                                                Mclust.pdf 18191  92180      70848
15                                              multcomp.pdf  5852  38769      32344
16                                       OpenMxUserGuide.pdf 37320 233817     197571
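
The loop above escapes file names by hand and writes each .txt file next to its pdf. As a rough alternative sketch (not the original author's code), the same steps can be written with base R's shQuote() and system2(), which quote a whole file name instead of escaping single characters. The helper names count_text and pdf_stats are made up for this illustration, and the latin1 default simply mirrors the script above; it is an assumption about your pdfs, not a rule.

```r
# Sketch only: count_text() and pdf_stats() are hypothetical helper names.

# Word and character counts for a character vector of text lines,
# using the same separators as nwords() in the script above.
count_text <- function(lines) {
  txt <- paste(lines, collapse = "\n")
  words <- strsplit(txt, "[ \t\n,\\.]+")[[1]]
  data.frame(
    wc = length(words),
    cs = nchar(txt),
    cs.nospace = nchar(gsub("\\s", "", txt))
  )
}

# Convert every pdf in `dir` with pdftotext and collect the counts.
# Returns NULL if the directory contains no pdf files.
pdf_stats <- function(dir = ".", encoding = "latin1") {
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext was not found on the PATH")
  }
  pdfs <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  rows <- lapply(pdfs, function(pdf) {
    txt <- tempfile(fileext = ".txt")   # write the text to a temporary file
    on.exit(unlink(txt), add = TRUE)    # and delete it when this call ends
    # shQuote() quotes the whole path, so spaces or parentheses need no manual escaping
    system2("pdftotext", c(shQuote(pdf), shQuote(txt)))
    lines <- readLines(txt, warn = FALSE)
    Encoding(lines) <- encoding         # assumed encoding, as in the original script
    data.frame(filename = basename(pdf), count_text(lines))
  })
  do.call(rbind, rows)
}
```

Calling `print(pdf_stats("."))` in a directory with pdfs should give a table with the same columns as the one above. Because the converted text goes to a temporary file, nothing is written next to the pdfs.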
