Counting the number of words in a LaTeX file with stringi

May 16, 2014
By

(This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers)

In my recent post I promised to present the most interesting features of the stringi package in more detail.

Here's one of such jolly features. Many LaTeX users may find it very useful.

Loading a text file with encoding auto-detection

Here's a LaTeX document consisting of a Polish poem. Probably, most of you wouldn't have been able to guess the file's character encoding if I hadn't left some hints. But it's OK, we have a little challenge.

Let's use some (currently experimental) stringi functions to guess the file's encoding.

First of all, we should read the file as a raw vector (anyway, each text file is a sequence of bytes).

library(stringi)
# experimental function (as per stringi_0.2-5):
download.file("http://www.rexamine.com/manual_upload/powrot_taty_latin2.tex", 
    dest = "powrot_taty_latin2.tex")
file <- stri_read_raw("powrot_taty_latin2.tex")
head(file, 15)
##  [1] 25 25 20 45 4e 43 4f 44 49 4e 47 20 3d 20 49

Let's try to detect the file's character encoding automatically.

stri_enc_detect(file)[[1]]  # experimental function
## $Encoding
## [1] "ISO-8859-2" "ISO-8859-1" "ISO-8859-9"
## 
## $Language
## [1] "pl" "pt" "tr"
## 
## $Confidence
## [1] 0.46 0.19 0.07

Encoding detection is, at best, an imprecise operation using statistics and heuristics. ICU indicates that most probably we deal with Polish text in ISO-8859-2 (a.k.a. latin2) here. What a coincidence: it's true.

Let's re-encode the file. Our target encoding will be UTF-8, as it is a “superset'' of all 8-bit encodings. We really love portable code:

file <- stri_conv(file, stri_enc_detect(file)[[1]]$Encoding[1], "UTF-8")
file <- stri_split_lines1(file)  # split a string into text lines
print(file[22:28])  # text sample
## [1] ",,Pójdźcie, o dziatki, pójdźcie wszystkie razem"
## [2] ""                                               
## [3] "Za miasto, pod słup na wzgórek,"                
## [4] ""                                               
## [5] "Tam przed cudownym klęknijcie obrazem,"         
## [6] ""                                               
## [7] "Pobożnie zmówcie paciórek."

Of course, if we knew a priori that the file is in ISO-8859-2, we'd just call:

file <- stri_conv(readLines("http://www.rexamine.com/manual_upload/powrot_taty_latin2.tex"), 
    "ISO-8859-2", "UTF-8")

So far so good.

Word count

LaTeX word counting is a quite complicated task and there are many possible approaches
to perform it. Most often, they rely on running some external tools (which may be a bit inconvenient for some users). Personally, I've always been most satisfied with the output produced by the Kile LaTeX IDE for KDE desktop environment.

LaTeX document statistics in Kile

As not everyone has Kile installed, I've had decided to grab Kile's algorithm (the power of open source!), made some not-too-invasive stringi-specific tweaks and here we are:

stri_stats_latex(file)
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##          2283           335           576           461            32 
##        Envirs 
##             2

Some other aggregates are also available (they are meaningful in case of any text file):

stri_stats_general(file)
##       Lines LinesNEmpty       Chars CharsNWhite 
##         232         122        3308        2930

Finally, here's the word count for my R programming book (in Polish). Importantly, each chapter is stored in a separate .tex file (there are 30 files), so "clicking out” the answer in Kile would be a bit problematic:

apply(
   sapply(
      list.files(path="~/Publikacje/ProgramowanieR/rozdzialy/",
         pattern=glob2rx("*.tex"), recursive=TRUE, full.names=TRUE),
      function(x)
      stri_stats_latex(readLines(x))
   ), 1, sum)
## CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs
##    718755        458403        281989        120202         37055          6119

Notably, my publisher was satisfied with the above estimate. :-)

Next time we'll take a look at ICU's very powerful transliteration services.

More information

For more information check out the stringi package website and its on-line documentation.

For bug reports and feature requests visit our GitHub profile.

Any comments and suggestions are warmly welcome.

Marek Gagolewski

To leave a comment for the author, please follow the link and comment on his blog: Rexamine » Blog/R-bloggers.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.