I have often had requests for a spell checker for R character vectors. The utils::aspell function can check spelling, but many Windows users have reported difficulty with it.

I came across an article on spelling in R entitled “Watch Your Spelling!” by Kurt Hornik and Duncan Murdoch. The paper walks us through definitions of spell checking, history, and a suggested spell checker implementation for R. A terrific read. Hornik & Murdoch (2010) end with the following call:

Clearly, more work will be needed: modern statistics needs better lexical resources, and a dictionary based on the most frequent spell check false alarms can only be a start. We hope that this article will foster community interest in contributing to the development of such resources, and that refined domain specific dictionaries can be made available and used for improved text analysis with R in the near future (p. 28).

I answered a question on stackoverflow.com a few months back that led to creating a suite of spell checking functions. The original functions used an agrep approach that was slow and inaccurate. I then discovered Mark van der Loo's terrific stringdist package, which now does the heavy lifting. It calculates string distances very quickly with various methods.

The rest of this blog post is meant as a minimal introduction to qdap's spell checking functions. A video will lead you through most of the process and accompanying scripts are provided.

## Primitive Spell Checking Function

The which_misspelled function is a low-level function that determines whether each word of a single string is in a dictionary. It optionally gives suggested corrections.

    library(qdap)
    x <- "Robots are evl creatres and deserv exterimanitation."
    which_misspelled(x, suggest=FALSE)
    which_misspelled(x, suggest=TRUE)
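
To make the idea concrete, here is a minimal base R sketch of a dictionary lookup with an optional nearest-word suggestion. It is not qdap's implementation: `toy_dictionary` and `toy_which_misspelled` are made-up names for illustration, the dictionary is tiny, and the distance comes from `utils::adist` (Levenshtein) rather than from stringdist.

```r
# Hypothetical toy dictionary; a real word list would be far larger.
toy_dictionary <- c("robots", "are", "evil", "creatures", "and",
                    "deserve", "extermination")

# Report the words of one string that are missing from `dictionary`,
# optionally with the closest dictionary entry by Levenshtein distance.
toy_which_misspelled <- function(x, dictionary, suggest = FALSE) {
  # Lower-case, then keep only letters, apostrophes and spaces
  cleaned <- gsub("[^a-z' ]", "", tolower(x))
  words <- strsplit(cleaned, " +")[[1]]
  words <- words[nzchar(words)]

  missed <- which(!words %in% dictionary)
  out <- data.frame(word.no = missed, not.found = words[missed],
                    stringsAsFactors = FALSE)
  if (suggest && length(missed) > 0) {
    # Rows are misspelled words, columns are dictionary entries
    d <- utils::adist(out$not.found, dictionary)
    out$suggestion <- dictionary[apply(d, 1, which.min)]
  }
  out
}

x <- "Robots are evl creatres and deserv exterimanitation."
toy_which_misspelled(x, toy_dictionary, suggest = TRUE)
```

Computing a full distance matrix against every dictionary entry is fine for a toy list but scales poorly, which is one reason a fast string distance library matters for a real dictionary.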

## Interactive Spell Checking

Typically, a user will want to use the interactive spell checker (check_spelling_interactive), as it is more flexible and accurate.

    dat <- DATA$state
    dat[1] <- "Jasperita I likedd the cokie icekream"
    dat

    ##  [1] "Jasperita I likedd the cokie icekream"
    ##  [2] "No it's not, it's dumb."
    ##  [3] "What should we do?"
    ##  [4] "You liar, it stinks!"
    ##  [5] "I am telling the truth!"
    ##  [6] "How can we be certain?"
    ##  [7] "There is no way."
    ##  [8] "I distrust you."
    ##  [9] "What are you talking about?"
    ## [10] "Shall we move on? Good then."
    ## [11] "I'm hungry. Let's eat. You already?"

    (o <- check_spelling_interactive(dat))
    preprocessed(o)
    fixit <- attributes(o)$correct
    fixit(dat)
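
The `correct` attribute used above is a function that re-applies the choices made during the interactive session. As a rough sketch of that idea only (not qdap's actual code), the base R closure below stores a set of replacements and applies them to any character vector. `make_corrector`, `toy_fixit`, and the `corrections` vector are hypothetical names, and the replacements are hard-coded here rather than chosen interactively.

```r
# Build a function that re-applies a fixed set of corrections.
# `corrections` is a hypothetical named character vector:
# names are the misspellings, values are the chosen replacements.
make_corrector <- function(corrections) {
  function(text) {
    for (wrong in names(corrections)) {
      # \\b anchors restrict each match to whole words
      pattern <- paste0("\\b", wrong, "\\b")
      text <- gsub(pattern, corrections[[wrong]], text, perl = TRUE)
    }
    text
  }
}

toy_fixit <- make_corrector(c(likedd = "liked", cokie = "cookie",
                              icekream = "ice cream"))
toy_fixit("Jasperita I likedd the cokie icekream")
# [1] "Jasperita I liked the cookie ice cream"
```

Returning a function rather than a corrected vector means the same decisions can be replayed on new text without repeating the interactive session.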


## A More Realistic Usage

    m <- check_spelling_interactive(mraja1spl$dialogue[1:75])
    preprocessed(m)
    fixit <- attributes(m)$correct
    fixit(mraja1spl$dialogue[1:75])

## References

Hornik, K., & Murdoch, D. (2010). Watch Your Spelling!. The R Journal, 3(2), 22-28.