Hunspell: Spell Checker and Text Parser for R

[This article was first published on OpenCPU, and kindly contributed to R-bloggers.]

Hunspell is the spell checker library used in LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, and a few more. Base R has some spell checking functionality via the aspell function, which wraps the aspell or hunspell command line program on supported systems. The new hunspell R package, on the other hand, links directly to the hunspell C++ library and works on all platforms without installing additional dependencies.

Basic tools

The hunspell_check function takes a vector of words and checks each individual word for correctness.

library(hunspell)
words <- c("beer", "wiskey", "wine")
hunspell_check(words)
## [1]  TRUE FALSE  TRUE
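
Because hunspell_check returns a logical vector, it can be used directly to subset the input. The following is a minimal sketch, assuming the default English dictionary that ships with the package; the variable names is_correct and misspelled are only illustrative.

```r
library(hunspell)

words <- c("beer", "wiskey", "wine")

# hunspell_check() returns TRUE for words found in the dictionary,
# so negating the result selects the misspelled entries.
is_correct <- hunspell_check(words)
misspelled <- words[!is_correct]
print(misspelled)
```

With the result shown above (TRUE FALSE TRUE), this keeps only "wiskey".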

The hunspell_find function takes a character vector with text (in plain text, LaTeX or man format) and returns a list with the incorrect words for each line.

bad_words <- hunspell_find("spell checkers are not neccessairy for langauge ninja's")
print(bad_words)
## [1] "neccessairy" "langauge"    "ninja's"

Finally, hunspell_suggest is used to suggest correct alternatives for each (incorrect) input word.

hunspell_suggest(bad_words[[1]])
## [[1]]
## [1] "necessary"    "necessarily"  "necessaries"  "recessionary" "accessory"    "incarcerate"
## [[2]]
## [1] "language"  "Langeland" "Lagrange"  "Lange"     "gaugeable" "linkage"   "Langland"
## [[3]]
## [1] "ninjas"   "Janina's" "Nina's"   "ninja"    "Janine's" "meninx"   "nark's"
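
Since hunspell_suggest returns one character vector per input word, a simple lookup table of corrections can be built from it. This is only a sketch: taking the first suggestion is an assumption made for illustration (it is not always the intended word), and the names first_pick and corrections are hypothetical.

```r
library(hunspell)

bad <- c("neccessairy", "langauge")
suggestions <- hunspell_suggest(bad)

# Take the first suggestion per word, or NA when hunspell offers none.
first_pick <- vapply(
  suggestions,
  function(s) if (length(s) > 0) s[1] else NA_character_,
  character(1)
)

corrections <- data.frame(word = bad, suggestion = first_pick,
                          stringsAsFactors = FALSE)
print(corrections)
```

In practice the suggestions are probably best reviewed by hand before applying any of them automatically.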

Parsing text

The first challenge in spell-checking is extracting individual words from formatted text. The hunspell_find function supports three parsers via the format parameter: plain text, LaTeX and man. For example, to check the OpenCPU paper for spelling errors, we use the LaTeX source code:

download.file("", "1406.4806v1.tar.gz", mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
words <- hunspell_find(text, format = "latex")
sort(unique(unlist(words)))
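
The result is a list with one element per input line, so it can also be summarised with base R alone. The sketch below uses a small hand-written list as a hypothetical stand-in for the real result; the flagged words in it are made up for illustration and are not taken from the paper.

```r
# Hypothetical stand-in for the list returned by hunspell_find():
# one character vector of flagged words per input line.
words <- list(
  c("OpenCPU", "API"),
  character(0),
  c("API", "JSON", "API")
)

# How often each flagged word occurs, most frequent first.
counts <- sort(table(unlist(words)), decreasing = TRUE)
print(counts)

# Which input lines contain at least one flagged word.
flagged_lines <- which(lengths(words) > 0)
print(flagged_lines)
```

Frequent entries are often technical terms rather than typos, which makes the frequency table a convenient starting point for a list of words to ignore.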

Base R also has a few filters to extract words from R, Sweave or Rd code; see RdTextFilter and SweaveTeXFilter in the tools package. For example, to check your R package manual for typos (assuming you are in the package source directory):

library(tools)  # provides RdTextFilter
man_files <- list.files("man", full.names = TRUE)
for(file in man_files){
  # print the file name, then the unique flagged words found in it
  cat("\nFile", file, ":\n  ")
  txt <- RdTextFilter(file, keepSpacing = FALSE)
  cat(sQuote(sort(unique(unlist(hunspell_find(txt))))), sep = ", ")
}

Morphological analysis

A cool feature in hunspell is morphological analysis. The hunspell_analyze function shows how a word breaks down into a valid stem plus affix. Hunspell uses a special dictionary format that defines which stems and affixes are valid in a given language.

For example, suppose we take a few variations of the word love. To get the possible stem and affix combinations for each word:

hunspell_analyze(c("love", "loving", "lovingly", "loved", "lover", "lovely", "love"))
## [1] " st:love"
## [1] " st:loving"    " st:love fl:G"
## [1] " st:lovingly"
## [1] " st:loved"     " st:love fl:D"
## [1] " st:lover"     " st:love fl:R"
## [1] " st:lovely"    " st:love fl:Y"
## [1] " st:love"
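
The analysis strings are plain text, so they can be split into their parts with base R regular expressions. The sketch below works on two strings copied from the output for "loving". It assumes that the st: field holds the stem and that the fl: field, when present, holds the affix flag; the names stem, flag and parsed are illustrative.

```r
# Sample analysis strings copied from the output for "loving".
analysis <- c(" st:loving", " st:love fl:G")

# The stem follows "st:"; the flag, when present, follows "fl:".
stem <- sub(".*st:([^ ]+).*", "\\1", analysis)
flag <- ifelse(grepl("fl:", analysis, fixed = TRUE),
               sub(".*fl:([^ ]+).*", "\\1", analysis),
               NA_character_)

parsed <- data.frame(stem = stem, flag = flag, stringsAsFactors = FALSE)
print(parsed)
```

The first row reads "loving" as a stem of its own with no flag; the second reads it as the stem "love" with the flag "G".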

Alternatively, the hunspell_stem function returns only the stems. Not sure how you would use this, but it’s certainly cool.
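
One possible use, offered only as a sketch, is normalizing words before counting them, so that inflected forms are tallied together. Picking the last candidate stem is an assumption made for this example; which candidate is most useful depends on the dictionary and the task. The name normalized is hypothetical.

```r
library(hunspell)

words <- c("love", "loving", "loved", "lover", "wine", "wines")

# hunspell_stem() returns a list with the candidate stems for each word.
stems <- hunspell_stem(words)

# Assumption for this sketch: use the last candidate as the normalized form,
# and fall back to the word itself when no stem is found.
normalized <- mapply(
  function(word, s) if (length(s) > 0) s[length(s)] else word,
  words, stems, USE.NAMES = FALSE
)

# Count occurrences per normalized form.
print(table(normalized))
```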


Thanks to Daniel Falbel for suggesting this package on the rOpenSci forums!
