(This article was first published on Nicebread » R, and kindly contributed to R-bloggers)
Inspired by this blog post from theBioBucket, I created a script to parse all PDF files in a directory. Because it calls a command line tool through the Terminal, it’s Mac-specific, but modifications for other systems shouldn’t be too hard (as a start for Windows, see BioBucket’s script).
First, you have to install the command line tool pdftotext (a binary can be found on Carsten Blüm’s website). Then, run the following script in a directory containing PDFs:
# helper function: get number of words in a string, separated by tab, space, return, comma, or point
nwords <- function(x) {
  res <- strsplit(as.character(x), "[ \t\n,\\.]+")
  res <- lapply(res, length)
  unlist(res)
}

# sanitize file name for terminal usage (i.e., escape spaces and other special characters)
sanitize <- function(str) {
  gsub('([#$%&~_\\^\\\\{}\\s\\(\\)])', '\\\\\\1', str, perl = TRUE)
}

# get a list of all files in the current directory and keep those ending in .pdf
fi <- list.files()
fi2 <- fi[grepl("\\.pdf$", fi)]

## Parse files and do something with it ...
res <- data.frame()  # keeps records of the calculations
for (f in fi2) {
  print(paste("Parsing", f))
  f2 <- sanitize(f)
  system(paste0("pdftotext ", f2), wait = TRUE)

  # read content of converted txt file
  filetxt <- sub("\\.pdf$", ".txt", f)
  text <- readLines(filetxt, warn = FALSE)

  # adjust encoding of text - you have to know it
  Encoding(text) <- "latin1"

  # Do something with the content - here: get word and character count of all pdfs in the current directory
  text2 <- paste(text, collapse = "\n")  # collapse lines into one long string
  res <- rbind(res, data.frame(filename = f, wc = nwords(text2), cs = nchar(text2), cs.nospace = nchar(gsub("\\s", "", text2))))

  # remove converted text file
  file.remove(filetxt)
}
print(res)
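The conversion step depends on pdftotext being installed and on file names being escaped correctly for the shell. As a rough sketch of a more defensive variant, base R's `Sys.which()` can check for the binary and `shQuote()` can quote the file name instead of escaping individual characters. The helper name `pdf_to_txt_cmd` and the sample file name are only illustrations and not part of the original script:

```r
# Hypothetical helper (an assumption, not from the original script):
# build the shell command for one PDF, quoting the file name for a POSIX shell.
pdf_to_txt_cmd <- function(pdf) {
  paste("pdftotext", shQuote(pdf, type = "sh"))
}

# TRUE if a pdftotext binary is found on the PATH, FALSE otherwise
has_pdftotext <- nzchar(unname(Sys.which("pdftotext")))

# File name with spaces and parentheses, as in the listing of this post
pdf <- "eRm (Folien).pdf"
cmd <- pdf_to_txt_cmd(pdf)
print(cmd)

# Only call the tool if it is installed and the file is actually there
if (has_pdftotext && file.exists(pdf)) {
  system(cmd, wait = TRUE)
}
```

Quoting the whole name avoids having to list every special character by hand, at the cost of relying on a POSIX-style shell.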
Running the script gives the following result (wc = word count, cs = character count, cs.nospace = character count without whitespace):
> print(res)
                                                    filename    wc     cs cs.nospace
 1                             Applied_Linear_Regression.pdf 33697 186280     154404
 2                                          Baron-rpsych.pdf 22665 128440     105024
 3                             bootstrapping regressions.pdf  6309  34042      27694
 4                           Ch_multidimensional_scaling.pdf   718   4632       3908
 5                                              corrgram.pdf  6645  40726      33965
 6                  eRm - Extended Rach Modeling (Paper).pdf 11354  65273      53578
 7                                          eRm (Folien).pdf   371   1407        886
 8 Faraway 2002 - Practical Regression and ANOVA using R.pdf 68777 380902     310037
 9                            Farnsworth-EconometricsInR.pdf 20482 125207     101157
10                                           ggplot_book.pdf 10681  65388      53551
11                                       ggplot2-lattice.pdf 18067 118591      93737
12                               lavaan_usersguide_0.3-1.pdf 12608  64232      52962
13                                  lme4 - Bootstrapping.pdf  2065  11739       9515
14                                                Mclust.pdf 18191  92180      70848
15                                              multcomp.pdf  5852  38769      32344
16                                       OpenMxUserGuide.pdf 37320 233817     197571
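To see how the three columns are computed, here is a small worked example on a made-up string (the sample text is an assumption for illustration and is not taken from any of the PDFs). It repeats the `nwords` helper so that it runs on its own:

```r
# Same helper as in the script: split on runs of space, tab, newline, comma, or point
nwords <- function(x) {
  res <- strsplit(as.character(x), "[ \t\n,\\.]+")
  unlist(lapply(res, length))
}

sample_text <- "Hello world. This is R"  # made-up sample string

wc <- nwords(sample_text)                           # word count
cs <- nchar(sample_text)                            # character count
cs.nospace <- nchar(gsub("\\s", "", sample_text))   # characters without whitespace

print(c(wc = wc, cs = cs, cs.nospace = cs.nospace))

# Caveat: strsplit() returns an empty first token when the string starts
# with a separator, so a leading space adds one to the word count.
print(nwords(" leading space"))
```

For the sample string this gives 5 words, 22 characters, and 18 characters without whitespace. The second call returns 3 rather than 2, so the word counts are best read as approximate when the extracted text begins with whitespace.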