Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Peter Norvig, the director of research at Google, wrote a nice essay on How to Write a Spelling Corrector a couple of years ago. That essay explains and implements a simple but effective spelling correction function in just 21 lines of Python. Highly recommended reading! I was wondering how many lines it would take to write something similar in base R. Turns out you can do it in (at least) two pretty obfuscated lines:

```sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))
correct <- function(word) { c(sorted_words[ adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word) }
```

While not working exactly as Norvig’s version it should result in similar spelling corrections:

```correct("piese")
```
##  “piece”
correct(“ov”)
##  “of”
correct(“cakke”) ##  “cake” So let’s deobfuscate the two-liner slightly (however, the code below might not make sense if you don’t read Norvig’s essay first):

```# Read in big.txt, a 6.5 mg collection of different English texts.
raw_text <- paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")
# Make the text lowercase and split it up creating a huge vector of word tokens.
split_text <- strsplit(tolower(raw_text), "[^a-z]+")
# Count the number of different type of words.
word_count <- table(split_text)
# Sort the words and create an ordered vector with the most common type of words first.
sorted_words <- names(sort(word_count, decreasing = TRUE))

correct <- function(word) {
# Calculate the edit distance between the word and all other words in sorted_words.
# Calculate the minimum edit distance to find a word that exists in big.txt
# with a limit of two edits.
min_edit_dist <- min(edit_dist, 2)
# Generate a vector with all words with this minimum edit distance.
# Since sorted_words is ordered from most common to least common, the resulting
# vector will have the most common / probable match first.
proposals_by_prob <- c(sorted_words[ edit_dist <= min(edit_dist, 2)])
# In case proposals_by_prob would be empty we append the word to be corrected...
proposals_by_prob <- c(proposals_by_prob, word)
# ... and return the first / most probable word in the vector.
proposals_by_prob
}
```

Some thoughts:

• The main reason for why the R version is so short is because base R includes the `adist` function. (A one line spell checker in R is indeed possible using the `aspell` function 🙂
• A second reason for why the R version is so short is that the many vectorized functions in R make it possible to do a lot of work in one line.
• Indeed, the horrible line creating the `sorted_words` vector would be a perfect target for some magrittr magic.
• The R version does not solve the problem in exactly the same way as Norvig’s code. He maintains the count of each word in the `NWORDS` variable in order to be able to extract the most probable matching word. This is not necessary in the R code, as we already have a sorted vector we know that the first item always will be the most probable. Still, I believe the two approaches result in the same spelling corrections (but prove me wrong :).
• There are links to implementations in many other languages at the bottom of Norvig’s essay. Looking at the Java version reminds me of my dark Java past and madness like `HashMap<Integer, String> candidates = new HashMap<Integer, String>();`.