Hunspell: Spell Checker and Text Parser for R

[This article was first published on OpenCPU, and kindly contributed to R-bloggers.]

Hunspell is the spell checker library used in LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, and a few more. Base R has some spell checking functionality via the aspell function, which wraps the aspell or hunspell command line program on supported systems. The new hunspell R package, on the other hand, links directly to the hunspell C++ library and works on all platforms without installing additional dependencies.

Basic tools

The hunspell_check function takes a vector of words and checks each individual word for correctness.

```r
library(hunspell)
words <- c("beer", "wiskey", "wine")
hunspell_check(words)
## [1]  TRUE FALSE  TRUE
```
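
Because hunspell_check returns one logical value per word, its result can be used to index the input directly. The following is a minimal sketch of that idea; the variable names is_correct and misspelled are only illustrative.

```r
library(hunspell)

# Same example words as in the text
words <- c("beer", "wiskey", "wine")

# One TRUE/FALSE per input word
is_correct <- hunspell_check(words)

# Keep only the words that the dictionary does not recognise
misspelled <- words[!is_correct]
print(misspelled)
```

With the result shown earlier (TRUE FALSE TRUE), this leaves only "wiskey".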

The hunspell_find function takes a character vector with text (in plain, LaTeX or man format) and returns a list with the incorrect words for each line.

```r
bad_words <- hunspell_find("spell checkers are not neccessairy for langauge ninja's")
print(bad_words)
## [1] "neccessairy" "langauge"    "ninja's"
```

Finally, hunspell_suggest suggests correct alternatives for each (incorrect) input word.

```r
hunspell_suggest(bad_words[[1]])
## [[1]]
## [1] "necessary"    "necessarily"  "necessaries"  "recessionary" "accessory"    "incarcerate"
##
## [[2]]
## [1] "language"  "Langeland" "Lagrange"  "Lange"     "gaugeable" "linkage"   "Langland"
##
## [[3]]
## [1] "ninjas"   "Janina's" "Nina's"   "ninja"    "Janine's" "meninx"   "nark's"
```
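
Since the suggestions come back as a list with one character vector per word, a rough "auto-correct" can be sketched by taking the first suggestion for each word. This is only an illustration: the first suggestion is not guaranteed to be the intended word, and the fallback to the original word is an assumption of this sketch, not something the package does for you.

```r
library(hunspell)

# Two of the misspelled words from the example sentence
bad_words <- c("neccessairy", "langauge")

# hunspell_suggest() returns a list with one character vector per input word
suggestions <- hunspell_suggest(bad_words)

# Keep the top-ranked suggestion, or the original word if nothing was suggested
first_choice <- vapply(seq_along(bad_words), function(i) {
  candidates <- suggestions[[i]]
  if (length(candidates) > 0) candidates[1] else bad_words[i]
}, character(1))

# Named vector mapping each misspelling to its proposed replacement
corrections <- setNames(first_choice, bad_words)
print(corrections)
```

In practice you would probably want to review the candidates by hand rather than apply the first one blindly.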

Parsing text

The first challenge in spell-checking is extracting individual words from formatted text. The hunspell_find function supports three parsers via the format parameter: plain text, LaTeX and man. For example, to check the OpenCPU paper for spelling errors, we use the LaTeX source code:

```r
download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz", mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
words <- hunspell_find(text, format = "latex")
sort(unique(unlist(words)))
```
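
Because the result has one element per input line, it can also tell you where the flagged words occur and how often. The sketch below uses a small hand-made list as a hypothetical stand-in for the output of hunspell_find, so it runs with base R alone. The words in it are made up for illustration and are not the actual results for the paper.

```r
# Hypothetical stand-in for the result of hunspell_find(): one character
# vector of flagged words per input line (empty when the line is clean)
words <- list(c("OpenCPU", "stateful"), character(0), c("OpenCPU", "JSON"))

# Line numbers that contain at least one flagged word
flagged_lines <- which(lengths(words) > 0)

# How often each flagged word occurs, most frequent first
counts <- sort(table(unlist(words)), decreasing = TRUE)

print(flagged_lines)
print(counts)
```

Words that are flagged many times are often technical terms or names rather than real typos, so the counts can help decide what to add to an ignore list.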

Base R also has a few filters to extract words from R, Sweave or Rd code; see RdTextFilter and SweaveTeXFilter in the tools package. For example, to check your R package manual for typos (assuming you are in the package source directory):

```r
library(tools)  # provides RdTextFilter()
# Run from the package source directory, which contains the man/ folder
man_files <- list.files("man", full.names = TRUE)
for (file in man_files) {
  cat("\nFile", file, ":\n  ")
  # Strip the Rd markup so that only the prose is spell checked
  txt <- RdTextFilter(file, keepSpacing = FALSE)
  cat(sQuote(sort(unique(unlist(hunspell_find(txt))))), sep = ", ")
}
```

Morphological analysis

A cool feature in hunspell is the morphological analysis. The hunspell_analyze function will show you how a word breaks down into a valid stem plus affix. Hunspell uses a special dictionary format that defines which stems and affixes are valid in a given language.

For example, take a few variations of the word love. To get the possible stem and affix combinations for each word:

```r
hunspell_analyze(c("love", "loving", "lovingly", "loved", "lover", "lovely", "love"))
## [1] " st:love"
## [1] " st:loving"    " st:love fl:G"
## [1] " st:lovingly"
## [1] " st:loved"     " st:love fl:D"
## [1] " st:lover"     " st:love fl:R"
## [1] " st:lovely"    " st:love fl:Y"
## [1] " st:love"
```

Alternatively, the hunspell_stem function returns only the stem. Not sure how you would use this, but it's certainly cool.
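
One possible use is grouping word variants under a common stem, for example before counting words in a text analysis. The following is only a sketch: picking the last candidate stem for each word, and falling back to the word itself when no stem is found, are assumptions made here for illustration.

```r
library(hunspell)

words <- c("love", "loving", "loved", "lover", "lovely")

# hunspell_stem() returns a list with the candidate stems for each word
stems <- hunspell_stem(words)

# Assumption for this sketch: use the last candidate as the canonical stem,
# and fall back to the word itself when no stem is found
canonical <- vapply(seq_along(words), function(i) {
  s <- stems[[i]]
  if (length(s) > 0) s[length(s)] else words[i]
}, character(1))

# Group the original words by their canonical stem
groups <- split(words, canonical)
print(groups)
```

Which candidate makes the best canonical stem depends on the dictionary and on your application, so check the output of hunspell_stem for your own words first.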

Thanks!

Thanks to Daniel Falbel for suggesting this package on the rOpenSci forums!
