Parse pdf files with R (on a Mac)

FelixS

9 years ago

[This article was first published on Nicebread » R, and kindly contributed to R-bloggers].

Inspired by this blog post from theBioBucket, I created a script to parse all pdf files in a directory. Due to its reliance on the Terminal, it’s Mac specific, but modifications for other systems shouldn’t be too hard (as a start for Windows, see BioBucket’s script).

First, you have to install the command line tool pdftotext (a binary can be found on Carsten Blüm’s website). Then run the following script within a directory containing pdfs:

# helper function: get number of words in a string, separated by tab, space, return, comma, or period.
nwords <- function(x){
	res <- strsplit(as.character(x), "[ \t\n,\\.]+")
	res <- lapply(res, length)
	unlist(res)
}
 
# sanitize file name for terminal usage (escape spaces and other shell-special characters with a backslash)
sanitize <- function(str) {
	gsub('([#$%&~_\\^\\\\{}\\s\\(\\)])', '\\\\\\1', str, perl = TRUE)
}
 
# get a list of all files in the current directory
fi <- list.files()
fi2 <- fi[grepl("\\.pdf$", fi)]	# escape the dot and anchor at the end, so only names ending in .pdf match
 
 
## Parse files and do something with them ...
res <- data.frame() # keeps records of the calculations
for (f in fi2) {
	print(paste("Parsing", f))
 
	f2 <- sanitize(f)
	system(paste0("pdftotext ", f2), wait = TRUE)
 
	# read content of converted txt file
	filetxt <- sub("\\.pdf$", ".txt", f)	# replace only the literal .pdf extension at the end of the name
	text <- readLines(filetxt, warn=FALSE)
 
	# declare the encoding of the text - you have to know which encoding pdftotext produced
	Encoding(text) <- "latin1"
 
	# Do something with the content - here: get word and character count of all pdfs in the current directory
	text2 <- paste(text, collapse="\n")	# collapse lines into one long string
 
	res <- rbind(res, data.frame(filename=f, wc=nwords(text2), cs=nchar(text2), cs.nospace=nchar(gsub("\\s", "", text2)))) 
 
	# remove converted text file
	file.remove(filetxt)
}
 
print(res)

… gives the following result (wc = word count, cs = character count, cs.nospace = character count without spaces):

> print(res)
                                                    filename    wc     cs cs.nospace
1                              Applied_Linear_Regression.pdf 33697 186280     154404
2                                           Baron-rpsych.pdf 22665 128440     105024
3                              bootstrapping regressions.pdf  6309  34042      27694
4                            Ch_multidimensional_scaling.pdf   718   4632       3908
5                                               corrgram.pdf  6645  40726      33965
6                   eRm - Extended Rach Modeling (Paper).pdf 11354  65273      53578
7                                           eRm (Folien).pdf   371   1407        886
8  Faraway 2002 - Practical Regression and ANOVA using R.pdf 68777 380902     310037
9                             Farnsworth-EconometricsInR.pdf 20482 125207     101157
10                                           ggplot_book.pdf 10681  65388      53551
11                                       ggplot2-lattice.pdf 18067 118591      93737
12                               lavaan_usersguide_0.3-1.pdf 12608  64232      52962
13                                  lme4 - Bootstrapping.pdf  2065  11739       9515
14                                                Mclust.pdf 18191  92180      70848
15                                              multcomp.pdf  5852  38769      32344
16                                       OpenMxUserGuide.pdf 37320 233817     197571
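
As a starting point for other systems, here is a minimal sketch of a variant that avoids the hand-written escaping. It is not the script above, just one way it could be reworked: shQuote() does the quoting, system2() calls the tool, and pdftotext is asked for UTF-8 output via its -enc option so the encoding does not have to be guessed. The names count_pdf_text and pdf_dir are made up for this illustration, and the sketch assumes that pdftotext is on the PATH.

```r
# Sketch only: count_pdf_text() and pdf_dir are hypothetical names,
# not part of the original script. Assumes pdftotext is on the PATH.

# number of words per string, split on runs of space, tab, newline, comma, or period
nwords <- function(x) {
  lengths(strsplit(as.character(x), "[ \t\n,\\.]+"))
}

count_pdf_text <- function(pdf_dir = ".") {
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext was not found on the PATH")
  }
  # expand "~" and relative paths before the names are quoted for the shell
  pdf_dir <- normalizePath(pdf_dir, mustWork = TRUE)
  pdfs <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)

  rows <- lapply(pdfs, function(f) {
    # write the converted text to a temporary file and clean it up afterwards
    txt_file <- tempfile(fileext = ".txt")
    on.exit(unlink(txt_file), add = TRUE)

    # shQuote() quotes the file names, so spaces and parentheses are safe
    status <- system2("pdftotext",
                      args = c("-enc", "UTF-8", shQuote(f), shQuote(txt_file)))
    if (status != 0) {
      warning("pdftotext failed for ", f)
      return(NULL)
    }

    text <- readLines(txt_file, warn = FALSE, encoding = "UTF-8")
    text2 <- paste(text, collapse = "\n")
    data.frame(filename = basename(f),
               wc = nwords(text2),
               cs = nchar(text2),
               cs.nospace = nchar(gsub("\\s", "", text2)))
  })

  # drop files that could not be converted, then stack the rows
  do.call(rbind, Filter(Negate(is.null), rows))
}
```

Calling res <- count_pdf_text(".") should then give a data frame with the same columns as above. Collecting the rows in a list and binding them once at the end also avoids growing the data frame inside the loop.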
