Spell Checker for R…qdap::check_spelling

September 4, 2014
By

(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers)

I often have had requests for a spell checker for R character vectors. The utils::aspell function can be used to check spelling but many Windows users have reported difficulty with the function.

I came across an article on spelling in R entitled “Watch Your Spelling!” by Kurt Hornik and Duncan Murdoch. The paper walks us through definitions of spell checking, history, and a suggested spell checker implementation for R. A terrific read. Hornik & Murdoch (2010) end with the following call:

Clearly, more work will be needed: modern statistics needs better lexical resources, and a dictionary based on the most frequent spell check false alarms can only be a start. We hope that this article will foster community interest in contributing to the development of such resources, and that refined domain specific dictionaries can be made available and used for improved text analysis with R in the near future (p. 28).

I answered a question on stackoverflow.com a few months back that lead to creating a suite of spell checking functions. The original functions used an agrep approach that was slow and inaccurate. I discovered Mark van der Loo’s terrific stringdist package to do the heavy lifting. It calculates string distances very quickly with various methods.

The rest of this blog post is meant as a minimal introduction to qdap‘s spell checking functions. A video will lead you through most of the process and accompanying scripts are provided.

Primitive Spell Checking Function

The which_misspelled function is a low level function that basically determines if each word of a single string is in a dictionary. It optionally gives suggested corrections.

library(qdap)
x <- "Robots are evl creatres and deserv exterimanitation."
which_misspelled(x, suggest=FALSE)
which_misspelled(x, suggest=TRUE)

Interactive Spell Checking

Typically a user will want to use the interactive spell checker (spell_checker_interactive) as it is more flexible and accurate.

dat <- DATA$state
dat[1] <- "Jasperita I likedd the cokie icekream"
dat
##  [1] "Jasperita I likedd the cokie icekream"
##  [2] "No it's not, it's dumb."              
##  [3] "What should we do?"                   
##  [4] "You liar, it stinks!"                 
##  [5] "I am telling the truth!"              
##  [6] "How can we be certain?"               
##  [7] "There is no way."                     
##  [8] "I distrust you."                      
##  [9] "What are you talking about?"          
## [10] "Shall we move on?  Good then."        
## [11] "I'm hungry.  Let's eat.  You already?"
(o <- check_spelling_interactive(dat))
preprocessed(o)
fixit <- attributes(o)$correct
fixit(dat)

A More Realistic Usage

m <- check_spelling_interactive(mraja1spl$dialogue[1:75])
preprocessed(m)
fixit <- attributes(m)$correct
fixit(mraja1spl$dialogue[1:75])

References

Hornik, K., & Murdoch, D. (2010). Watch Your Spelling!. The R Journal, 3(2), 22-28.

To leave a comment for the author, please follow the link and comment on their blog: TRinker's R Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)