Hunspell: Spell Checker and Text Parser for R

[This article was first published on OpenCPU, and kindly contributed to R-bloggers.]

Hunspell is the spell checker library used in LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, and a few more. Base R has some spell checking functionality via the aspell function, which wraps the aspell or hunspell command line program on supported systems. The new hunspell R package, on the other hand, links directly to the hunspell C++ library and works on all platforms without installing additional dependencies.

Basic tools

The hunspell_check function takes a vector of words and checks each individual word for correctness.

library(hunspell)
words <- c("beer", "wiskey", "wine")
hunspell_check(words)
## [1]  TRUE FALSE  TRUE
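
Because hunspell_check returns a logical vector, it can be used directly as an index to pull out the misspelled words. A minimal sketch, assuming the default English dictionary that ships with the package; the variable names ok and misspelled are only for illustration:

```r
library(hunspell)

words <- c("beer", "wiskey", "wine")

# hunspell_check() returns TRUE for correctly spelled words,
# so negating the result selects the misspelled ones
ok <- hunspell_check(words)
misspelled <- words[!ok]
print(misspelled)
```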

The hunspell_find function takes a character vector with text (in plain, LaTeX or man format) and returns a list with the incorrect words for each line.

bad_words <- hunspell_find("spell checkers are not neccessairy for langauge ninja's")
print(bad_words[[1]])
## [1] "neccessairy" "langauge"    "ninja's"

Finally, hunspell_suggest suggests correct alternatives for each (incorrect) input word.

hunspell_suggest(bad_words[[1]])
## [[1]]
## [1] "necessary"    "necessarily"  "necessaries"  "recessionary" "accessory"    "incarcerate" 
##
## [[2]]
## [1] "language"  "Langeland" "Lagrange"  "Lange"     "gaugeable" "linkage"   "Langland" 
##
## [[3]]
## [1] "ninjas"   "Janina's" "Nina's"   "ninja"    "Janine's" "meninx"   "nark's"
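
The result is a list with one character vector of candidates per input word, and the first candidate is often, though not always, the intended word. As a rough sketch (not something the package provides), the top suggestion for each word can be pulled out with vapply. The helper name first_suggestion is made up for this example, and words without any suggestion map to NA:

```r
library(hunspell)

bad <- c("neccessairy", "langauge")
suggestions <- hunspell_suggest(bad)

# Hypothetical helper: first suggestion, or NA when hunspell offers none
first_suggestion <- function(s) if (length(s) > 0) s[1] else NA_character_

top <- vapply(suggestions, first_suggestion, character(1))
names(top) <- bad
print(top)
```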

Parsing text

The first challenge in spell-checking is extracting individual words from formatted text. The hunspell_find function supports three parsers via the format parameter: plain text, LaTeX and man. For example, to check the OpenCPU paper for spelling errors, we use the LaTeX source code:

download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz",  mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
words <- hunspell_find(text, format = "latex")
sort(unique(unlist(words)))

Base R also has a few filters to extract words from R, Sweave or Rd code; see RdTextFilter and SweaveTeXFilter in the tools package. For example, to check your R package manual for typos (assuming you are in the package source directory):

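library(tools)  # provides RdTextFilter
# Loop over every Rd file in the man directory of the package
for (file in list.files("man", full.names = TRUE)) {
  cat("\nFile", file, ":\n  ")
  # Strip the Rd markup so only the prose is spell checked
  txt <- RdTextFilter(file, keepSpacing = FALSE)
  cat(sQuote(sort(unique(unlist(hunspell_find(txt))))), sep = ", ")
}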

Morphological analysis

A cool feature in hunspell is the morphological analysis. The hunspell_analyze function will show you how a word breaks down into a valid stem plus affix. Hunspell uses a special dictionary format that defines which stems and affixes are valid in a given language.

For example, suppose we take a few variations of the word love. To get the possible stem plus affix combinations for each word:

hunspell_analyze(c("love", "loving", "lovingly", "loved", "lover", "lovely", "love"))
## [1] " st:love"
## [1] " st:loving"    " st:love fl:G"
## [1] " st:lovingly"
## [1] " st:loved"     " st:love fl:D"
## [1] " st:lover"     " st:love fl:R"
## [1] " st:lovely"    " st:love fl:Y"
## [1] " st:love"

Alternatively, the hunspell_stem function returns only the stem. Not sure how you would use this, but it’s certainly cool.
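
One possible use, offered only as a sketch: stems can serve as keys for grouping inflected forms of the same word, for example before counting terms in a text. The code below assumes that hunspell_stem returns a list with one character vector of candidate stems per word, and it arbitrarily takes the last candidate as the key. The helper name pick_stem is hypothetical.

```r
library(hunspell)

words <- c("love", "loving", "loved", "lover", "lovely")
stems <- hunspell_stem(words)

# Hypothetical helper: use the last candidate stem as the grouping key,
# falling back to the word itself when no stem is found
pick_stem <- function(s, w) if (length(s) > 0) s[length(s)] else w

keys <- mapply(pick_stem, stems, words, USE.NAMES = FALSE)

# Collect the original words under their chosen stem
groups <- split(words, keys)
print(groups)
```

Which candidate makes the best key depends on the dictionary and the application, so the choice of the last one here is a guess rather than a recommendation.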

Thanks!

Thanks to Daniel Falbel for suggesting this package on the rOpenSci forums!
