Counting the number of words in a LaTeX file with stringi

[This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers.]

In my recent post I promised to present the most interesting features of the stringi package in more detail.

Here's one such feature. Many LaTeX users may find it very useful.

Loading a text file with encoding auto-detection

Here's a LaTeX document consisting of a Polish poem. Most of you probably wouldn't be able to guess the file's character encoding if I hadn't left some hints. That's OK: it makes for a nice little challenge.

Let's use some (currently experimental) stringi functions to guess the file's encoding.

First of all, we should read the file as a raw vector (after all, every text file is just a sequence of bytes).

library(stringi)
# mode = "wb" keeps the downloaded bytes unchanged on every platform
download.file("http://www.rexamine.com/manual_upload/powrot_taty_latin2.tex", 
    destfile = "powrot_taty_latin2.tex", mode = "wb")
# experimental function (as per stringi_0.2-5):
file <- stri_read_raw("powrot_taty_latin2.tex")
head(file, 15)

##  [1] 25 25 20 45 4e 43 4f 44 49 4e 47 20 3d 20 49
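These first bytes already carry the hint mentioned earlier. As a small sketch (the byte values are simply retyped from the output above, and `first_bytes` and `hint` are made-up names), base R's `rawToChar()` turns them back into readable text:

```r
# The 15 bytes printed by head(file, 15), typed in by hand
first_bytes <- as.raw(c(0x25, 0x25, 0x20, 0x45, 0x4e, 0x43, 0x4f, 0x44,
                        0x49, 0x4e, 0x47, 0x20, 0x3d, 0x20, 0x49))

# All of them are plain ASCII, so they decode the same way
# in any ASCII-compatible encoding
hint <- rawToChar(first_bytes)
print(hint)
## [1] "%% ENCODING = I"
```

The header is cut off after 15 bytes, so the encoding name itself is still hidden at this point.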

Let's try to detect the file's character encoding automatically.

stri_enc_detect(file)[[1]]  # experimental function

## $Encoding
## [1] "ISO-8859-2" "ISO-8859-1" "ISO-8859-9"
## 
## $Language
## [1] "pl" "pt" "tr"
## 
## $Confidence
## [1] 0.46 0.19 0.07

Encoding detection is, at best, an imprecise operation based on statistics and heuristics. ICU suggests that we are most probably dealing with Polish text in ISO-8859-2 (a.k.a. latin2). What a coincidence: it's true.
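To see why the choice between the top candidates matters, here is a minimal sketch that decodes the same eight bytes under ISO-8859-2 and ISO-8859-1. The bytes are the word "Pójdźcie" encoded by hand in ISO-8859-2 for illustration (they are not read from the file), and `word_raw`, `as_latin2` and `as_latin1` are made-up names:

```r
library(stringi)

# "Pójdźcie" hand-encoded in ISO-8859-2 (illustrative bytes only)
word_raw <- as.raw(c(0x50, 0xf3, 0x6a, 0x64, 0xbc, 0x63, 0x69, 0x65))

# Decode the same bytes under the two most likely candidates
as_latin2 <- stri_encode(word_raw, from = "ISO-8859-2", to = "UTF-8")
as_latin1 <- stri_encode(word_raw, from = "ISO-8859-1", to = "UTF-8")

print(as_latin2)  # the intended Polish word
print(as_latin1)  # byte 0xbc is read as the "one quarter" sign, not as "ź"
```

Only one byte out of eight differs in meaning here, which gives some feeling for why the detector has to rely on statistics rather than on any single character.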

Let's re-encode the file. Our target encoding will be UTF-8, as it is a "superset" of all 8-bit encodings: the characters they define can all be represented in it. We really love portable code:

file <- stri_conv(file, stri_enc_detect(file)[[1]]$Encoding[1], "UTF-8")
file <- stri_split_lines1(file)  # split a string into text lines
print(file[22:28])  # text sample

## [1] ",,Pójdźcie, o dziatki, pójdźcie wszystkie razem"
## [2] ""                                               
## [3] "Za miasto, pod słup na wzgórek,"                
## [4] ""                                               
## [5] "Tam przed cudownym klęknijcie obrazem,"         
## [6] ""                                               
## [7] "Pobożnie zmówcie paciórek."

Of course, if we knew a priori that the file is in ISO-8859-2, we'd just call:

file <- stri_conv(readLines("http://www.rexamine.com/manual_upload/powrot_taty_latin2.tex"), 
    "ISO-8859-2", "UTF-8")

So far so good.

Word count

Counting words in a LaTeX document is quite a complicated task, and there are many possible approaches to it. Most often, they rely on running some external tools (which may be a bit inconvenient for some users). Personally, I've always been most satisfied with the output produced by Kile, a LaTeX IDE for the KDE desktop environment.

LaTeX document statistics in Kile

As not everyone has Kile installed, I decided to grab Kile's algorithm (the power of open source!) and make some not-too-invasive stringi-specific tweaks. Here we are:

stri_stats_latex(file)

##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##          2283           335           576           461            32 
##        Envirs 
##             2
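If you don't have the poem at hand, the function can be tried on any character vector of text lines. Below is a minimal sketch on a tiny made-up fragment (`snippet` and `stats` are illustrative names); the exact counts are deliberately not shown, as they depend on how the algorithm classifies each character:

```r
library(stringi)

# A tiny made-up LaTeX fragment, one element per line of text
snippet <- c("\\section{Intro}",
             "Lorem \\textbf{ipsum} dolor sit amet.",
             "\\begin{small}Proin nibh augue.\\end{small}")

stats <- stri_stats_latex(snippet)
print(stats)             # a named vector with the six counts shown above
print(stats[["Words"]])  # pick out just the word count
```

As the result is a named vector, a single figure such as the word count can be extracted by name rather than by position.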

Some other aggregates are also available (they are meaningful for any text file, not just LaTeX):

stri_stats_general(file)

##       Lines LinesNEmpty       Chars CharsNWhite 
##         232         122        3308        2930
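A small sketch on three made-up lines may help in reading these numbers (`txt` and `general` are illustrative names). The empty string still counts as a line, but not as a non-empty one:

```r
library(stringi)

# Three made-up lines: ordinary text, an empty line, a line with leading spaces
txt <- c("ab c", "", "  d")

general <- stri_stats_general(txt)
print(general)
# Lines is 3 and LinesNEmpty is 2: the middle element has no non-white characters
```

Note that each element is expected to be a single line, which is why the file was split with `stri_split_lines1()` earlier.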

Finally, here's the word count for my R programming book (in Polish). Importantly, each chapter is stored in a separate .tex file (there are 30 files), so "clicking out" the answer in Kile would be a bit problematic:

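# all chapter files, searched recursively
files <- list.files(path="~/Publikacje/ProgramowanieR/rozdzialy/",
   pattern=glob2rx("*.tex"), recursive=TRUE, full.names=TRUE)
# one column of statistics per file, then sum each statistic across files
rowSums(sapply(files, function(x) stri_stats_latex(readLines(x))))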

##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##        718755        458403        281989        120202         37055 
##        Envirs 
##          6119

Notably, my publisher was satisfied with the above estimate. 🙂
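The book's sources are not public, so here is a reproducible sketch of the same per-file aggregation. It assumes nothing beyond stringi and base R: the two "chapters" are made-up files written to a temporary directory, and `chapter_dir`, `tex_files`, `per_file` and `totals` are illustrative names.

```r
library(stringi)

# Two throwaway "chapters" written to a temporary directory (made-up content)
chapter_dir <- file.path(tempdir(), "chapters")
dir.create(chapter_dir, showWarnings = FALSE)
writeLines(c("\\chapter{One}", "Hello world."),
           file.path(chapter_dir, "ch1.tex"))
writeLines(c("\\chapter{Two}", "Hello again, world."),
           file.path(chapter_dir, "ch2.tex"))

tex_files <- list.files(chapter_dir, pattern = "\\.tex$", full.names = TRUE)

# One column per file, one row per statistic
per_file <- vapply(tex_files,
                   function(f) stri_stats_latex(readLines(f)),
                   numeric(6))

# Sum each statistic across all files
totals <- rowSums(per_file)
print(totals)
```

Keeping the intermediate `per_file` matrix around also makes it easy to see which chapter contributes most to the total.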

Next time we'll take a look at ICU's very powerful transliteration services.

More information

For more information check out the stringi package website and its on-line documentation.

For bug reports and feature requests visit our GitHub profile.

Any comments and suggestions are warmly welcome.

Marek Gagolewski
