Peter Norvig’s Spell Checker in Two Lines of Base R

December 16, 2014
By

(This article was first published on Publishable Stuff, and kindly contributed to R-bloggers)

Peter Norvig, the director of research at Google, wrote a nice essay on How to Write a Spelling Corrector a couple of years ago. That essay explains and implements a simple but effective spelling correction function in just 21 lines of Python. Highly recommended reading! I was wondering how many lines it would take to write something similar in base R. Turns out you can do it in (at least) two pretty obfuscated lines:

``````sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))
correct <- function(word) { c(sorted_words[ adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word)[1] }

```
While not working exactly as Norvig’s version it should result in similar spelling corrections:
```correct("piese")

```
```## [1] "piece"

```
```correct("ov")

```
```## [1] "of"

```
```correct("cakke")

```
```## [1] "cake"

```

So let’s deobfuscate the two-liner slightly (however, the code below might not make sense if you don’t read Norvig’s essay first):

```# Read in big.txt, a 6.5 mg collection of different English texts.
raw_text <- paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")
# Make the text lowercase and split it up creating a huge vector of word tokens.
split_text <- strsplit(tolower(raw_text), "[^a-z]+")
# Count the number of different type of words.
word_count <- table(split_text)
# Sort the words and create an ordered vector with the most common type of words first.
sorted_words <- names(sort(word_count, decreasing = TRUE))

correct <- function(word) {
# Calculate the edit distance between the word and all other words in sorted_words.
# Calculate the minimum edit distance to find a word that exists in big.txt
# with a limit of two edits.
min_edit_dist <- min(edit_dist, 2)
# Generate a vector with all words with this minimum edit distance.
# Since sorted_words is ordered from most common to least common, the resulting
# vector will have the most common / probable match first.
proposals_by_prob <- c(sorted_words[ edit_dist <= min(edit_dist, 2)])
# In case proposals_by_prob would be empty we append the word to be corrected...
proposals_by_prob <- c(proposals_by_prob, word)
# ... and return the first / most probable word in the vector.
proposals_by_prob[1]
}

```
Some thoughts:

The main reason for why the R version is so short is because base R includes the `adist` function. (A one line spell checker in R is indeed possible using the `aspell` function 🙂
A second reason for why the R version is so short is that the many vectorized functions in R make it possible to do a lot of work in one line.
Indeed, the horrible line creating the `sorted_words` vector would be a perfect target for some magrittr magic.
The R version does not solve the problem in exactly the same way as Norvig’s code. He maintains the count of each word in the `NWORDS` variable in order to be able to extract the most probable matching word. This is not necessary in the R code, as we already have a sorted vector we know that the first item always will be the most probable. Still, I believe the two approaches result in the same spelling corrections (but prove me wrong :).
There are links to implementations in many other languages at the bottom of Norvig’s essay. Looking at the Java version reminds me of my dark Java past and madness like `HashMap candidates = new HashMap();`.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' };

(function(d, t) {
var s = d.createElement(t); s.type = 'text/javascript'; s.async = true;
var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r);
}(document, 'script'));

Related
ShareTweet

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

```