Peter Norvig’s Spell Checker in Two Lines of Base R

Peter Norvig, the director of research at Google, wrote a nice essay on How to Write a Spelling Corrector a couple of years ago. That essay explains and implements a simple but effective spelling correction function in just 21 lines of Python. Highly recommended reading! I was wondering how many lines it would take to write something similar in base R. Turns out you can do it in (at least) two pretty obfuscated lines:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">sorted_words <span style="color: #666666"><-</span> names(sort(table(strsplit(tolower(paste(readLines(<span style="color: #BA2121">"http://www.norvig.com/big.txt"</span>), collapse <span style="color: #666666">=</span> <span style="color: #BA2121">" "</span>)), <span style="color: #BA2121">"[^a-z]+"</span>)), decreasing <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>))
correct <span style="color: #666666"><-</span> <span style="color: #008000; font-weight: bold">function</span>(word) { c(sorted_words[ adist(word, sorted_words) <span style="color: #666666"><=</span> min(adist(word, sorted_words), <span style="color: #666666">2</span>)], word)[<span style="color: #666666">1</span>] }
</pre></div>

While it doesn't work exactly like Norvig's version, it should produce similar spelling corrections:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">correct(<span style="color: #BA2121">"piese"</span>)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## [1] "piece"</span>
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">correct(<span style="color: #BA2121">"ov"</span>)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## [1] "of"</span>
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">correct(<span style="color: #BA2121">"cakke"</span>)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## [1] "cake"</span>
</pre></div>
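
And, as you would hope, a word that already occurs in big.txt should come back unchanged, since its edit distance to itself is zero (my example, not from the original post):

```r
correct("the")  # should return "the" unchanged
```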

So let’s deobfuscate the two-liner slightly (though the code below might not make sense if you haven’t read Norvig’s essay first):

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic"># Read in big.txt, a 6.5 mg collection of different English texts.</span>
raw_text <span style="color: #666666"><-</span> paste(readLines(<span style="color: #BA2121">"http://www.norvig.com/big.txt"</span>), collapse <span style="color: #666666">=</span> <span style="color: #BA2121">" "</span>)
<span style="color: #408080; font-style: italic"># Make the text lowercase and split it up creating a huge vector of word tokens.</span>
split_text <span style="color: #666666"><-</span> strsplit(tolower(raw_text), <span style="color: #BA2121">"[^a-z]+"</span>)
<span style="color: #408080; font-style: italic"># Count the number of different type of words.</span>
word_count <span style="color: #666666"><-</span> table(split_text)
<span style="color: #408080; font-style: italic"># Sort the words and create an ordered vector with the most common type of words first.</span>
sorted_words <span style="color: #666666"><-</span> names(sort(word_count, decreasing <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>))

correct <span style="color: #666666"><-</span> <span style="color: #008000; font-weight: bold">function</span>(word) {
  <span style="color: #408080; font-style: italic"># Calculate the edit distance between the word and all other words in sorted_words.</span>
  edit_dist <span style="color: #666666"><-</span> adist(word, sorted_words)
  <span style="color: #408080; font-style: italic"># Calculate the minimum edit distance to find a word that exists in big.txt </span>
  <span style="color: #408080; font-style: italic"># with a limit of two edits.</span>
  min_edit_dist <span style="color: #666666"><-</span> min(edit_dist, <span style="color: #666666">2</span>)
  <span style="color: #408080; font-style: italic"># Generate a vector with all words with this minimum edit distance.</span>
  <span style="color: #408080; font-style: italic"># Since sorted_words is ordered from most common to least common, the resulting</span>
  <span style="color: #408080; font-style: italic"># vector will have the most common / probable match first.</span>
  proposals_by_prob <span style="color: #666666"><-</span> c(sorted_words[ edit_dist <span style="color: #666666"><=</span> min(edit_dist, <span style="color: #666666">2</span>)])
  <span style="color: #408080; font-style: italic"># In case proposals_by_prob would be empty we append the word to be corrected...</span>
  proposals_by_prob <span style="color: #666666"><-</span> c(proposals_by_prob, word)
  <span style="color: #408080; font-style: italic"># ... and return the first / most probable word in the vector.</span>
  proposals_by_prob[<span style="color: #666666">1</span>]
}
</pre></div>
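
The workhorse here is adist, which computes the (generalized) Levenshtein distance between each pair of elements of two character vectors and returns a matrix. A quick illustration (my example, not from the original post):

```r
adist("piese", c("piece", "pies", "praise"))
##      [,1] [,2] [,3]
## [1,]    1    1    3
```

Since edit_dist is a 1-by-n matrix, min(edit_dist, 2) simply collapses it, together with the literal 2, into a single number, capping the search at two edits.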

Some thoughts:

  • The main reason the R version is so short is that base R includes the adist function. (A one-line spell checker in R is indeed possible using the aspell function; see the sketch after this list 🙂)
  • A second reason the R version is so short is that R’s many vectorized functions make it possible to do a lot of work in one line.
  • Indeed, the horrible line creating the sorted_words vector would be a perfect target for some magrittr magic (see the pipe version after this list).
  • The R version does not solve the problem in exactly the same way as Norvig’s code. He maintains the count of each word in the NWORDS variable in order to extract the most probable matching word. This is not necessary in the R code: since we already have a sorted vector, we know that the first item will always be the most probable. Still, I believe the two approaches result in the same spelling corrections (but prove me wrong :).
  • There are links to implementations in many other languages at the bottom of Norvig’s essay. Looking at the Java version reminds me of my dark Java past and madness like HashMap candidates = new HashMap();.
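
To make those two asides concrete, here is a rough aspell-based corrector. This is my sketch, not code from the original post; utils::aspell shells out to an external spell checker (aspell, hunspell, or ispell), so it only works if one of those programs is installed.

```r
# Hypothetical helper, assuming an external spell checker program is installed.
correct_aspell <- function(word) {
  tmp <- tempfile()
  writeLines(word, tmp)
  # aspell returns a data frame (class "aspell") with a Suggestions list-column;
  # zero rows means the word was not flagged as misspelled.
  result <- utils::aspell(tmp)
  if (nrow(result) == 0) word else result$Suggestions[[1]][1]
}
```

And here is one way the sorted_words one-liner could look after some magrittr magic (again an untested sketch, assuming the magrittr package is installed):

```r
library(magrittr)

sorted_words <- "http://www.norvig.com/big.txt" %>%
  readLines() %>%
  paste(collapse = " ") %>%
  tolower() %>%
  strsplit("[^a-z]+") %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  names()
```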
