Hunspell: Spell Checker and Text Parser for R

March 13, 2016
(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

Hunspell is the spell checker library used in LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, and a few more. Base R has some spell checking functionality via the aspell function, which wraps the aspell or hunspell command line program on supported systems. The new hunspell R package, on the other hand, links directly to the hunspell C++ library and works on all platforms without installing additional dependencies.

Basic tools

The hunspell_check function takes a vector of words and checks each individual word for correctness.

library(hunspell)
words <- c("beer", "wiskey", "wine")
hunspell_check(words)
## [1]  TRUE FALSE  TRUE
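
Because hunspell_check returns one logical value per input word, the result can be used directly as an index to pull out the misspelled words. The snippet below is a small sketch of that idea using the same word vector; the variable names is_correct and misspelled are only illustrative.

```r
library(hunspell)

# Same word vector as in the example above
words <- c("beer", "wiskey", "wine")

# hunspell_check() returns one logical per word; negate it to keep the failures
is_correct <- hunspell_check(words)
misspelled <- words[!is_correct]
print(misspelled)
```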

The hunspell_find function takes a character vector with text (in plain, LaTeX or man format) and returns a list with the incorrect words for each line.

bad_words <- hunspell_find("spell checkers are not neccessairy for langauge ninja's")
print(bad_words[[1]])
## [1] "neccessairy" "langauge"    "ninja's"    

Finally, hunspell_suggest suggests correct alternatives for each (incorrect) input word.

hunspell_suggest(bad_words[[1]])
## [[1]]
## [1] "necessary"    "necessarily"  "necessaries"  "recessionary" "accessory"    "incarcerate" 
##
## [[2]]
## [1] "language"  "Langeland" "Lagrange"  "Lange"     "gaugeable" "linkage"   "Langland" 
##
## [[3]]
## [1] "ninjas"   "Janina's" "Nina's"   "ninja"    "Janine's" "meninx"   "nark's"
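
Since hunspell_suggest returns a list with one character vector per input word, it is often handy to reduce it to a single candidate per word. The following is a minimal sketch of one way to do that; first_suggestion is a hypothetical helper, not part of the hunspell package, and taking the first suggestion is just a simple heuristic that will not always pick the intended word.

```r
library(hunspell)

# Hypothetical helper (not part of the hunspell package): return the first
# suggestion for each word, or NA when hunspell offers no suggestion at all.
first_suggestion <- function(words) {
  suggestions <- hunspell_suggest(words)
  best <- vapply(suggestions, function(s) {
    if (length(s) > 0) s[1] else NA_character_
  }, character(1))
  names(best) <- words
  best
}

fixes <- first_suggestion(c("neccessairy", "langauge"))
print(fixes)
```

The result is a named character vector, so fixes[["langauge"]] looks up the proposed replacement for a single word.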

Parsing text

The first challenge in spell-checking is extracting individual words from formatted text. The hunspell_find function supports three parsers via the format parameter: plain text, LaTeX and man. For example, to check the OpenCPU paper for spelling errors, we use the LaTeX source code:

download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz",  mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
words <- hunspell_find(text, format = "latex")
sort(unique(unlist(words)))
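
For a long document it can also help to see how often each flagged word occurs and on which lines. The sketch below uses only base R and a hand-made list in place of real hunspell_find output (the words in it are made up for illustration), relying only on the fact that the result holds one character vector per input line.

```r
# Stand-in for the list returned by hunspell_find(): one character vector of
# flagged words per input line (these values are made up for illustration).
words <- list(character(0), c("opencpu", "json"), "opencpu", character(0))

# Count how often each flagged word occurs, most frequent first
counts <- sort(table(unlist(words)), decreasing = TRUE)
print(counts)

# Line numbers on which at least one word was flagged
flagged_lines <- which(lengths(words) > 0)
print(flagged_lines)
```

A word that shows up many times is often a name or technical term rather than a typo, so the counts are a quick way to tell the two apart.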

Base R also has a few filters to extract words from R, Sweave or Rd code; see RdTextFilter and SweaveTeXFilter in the tools package. For example, to check your R package manual for typos (assuming you are in the package source directory):

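library(tools)
# Check every Rd file in the man directory of the package source
for(file in list.files("man", full.names = TRUE)){
  cat("\nFile", file, ":\n  ")
  # Strip the Rd markup so that only the prose is spell checked
  txt <- RdTextFilter(file, keepSpacing = FALSE)
  cat(sQuote(sort(unique(unlist(hunspell_find(txt))))), sep = ", ")
}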

Morphological analysis

A cool feature in hunspell is the morphological analysis. The hunspell_analyze function will show you how a word breaks down into a valid stem plus affix. Hunspell uses a special dictionary format that defines which stems and affixes are valid in a given language.

For example, suppose we take a few variations of the word love. To get the possible stem and affix combinations for each word:

hunspell_analyze(c("love", "loving", "lovingly", "loved", "lover", "lovely", "love"))
## [1] " st:love"
## [1] " st:loving"    " st:love fl:G"
## [1] " st:lovingly"
## [1] " st:loved"     " st:love fl:D"
## [1] " st:lover"     " st:love fl:R"
## [1] " st:lovely"    " st:love fl:Y"
## [1] " st:love"

Alternatively, the hunspell_stem function returns only the stem. Not sure how you would use this but it’s certainly cool.
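
As a small illustration of what the function returns, the sketch below calls hunspell_stem on the same variations of love. The names on the result are added here only to make the printed list easier to read; the exact stems depend on the dictionary in use, so no output is shown.

```r
library(hunspell)

variations <- c("love", "loving", "lovingly", "loved", "lover", "lovely")

# hunspell_stem() returns a list with one character vector of candidate
# stems per input word
stems <- hunspell_stem(variations)
names(stems) <- variations
print(stems)
```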

Thanks!

Thanks to Daniel Falbel for suggesting this package on the rOpenSci forums!
