Peter Norvig’s Spell Checker in Two Lines of Base R

Peter Norvig, the director of research at Google, wrote a nice essay on How to Write a Spelling Corrector a couple of years ago. That essay explains and implements a simple but effective spelling correction function in just 21 lines of Python. Highly recommended reading! I was wondering how many lines it would take to write something similar in base R. Turns out you can do it in (at least) two pretty obfuscated lines:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">sorted_words <span style="color: #666666"><-</span> names(sort(table(strsplit(tolower(paste(readLines(<span style="color: #BA2121">"http://www.norvig.com/big.txt"</span>), collapse <span style="color: #666666">=</span> <span style="color: #BA2121">" "</span>)), <span style="color: #BA2121">"[^a-z]+"</span>)), decreasing <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>))
correct <span style="color: #666666"><-</span> <span style="color: #008000; font-weight: bold">function</span>(word) { c(sorted_words[ adist(word, sorted_words) <span style="color: #666666"><=</span> min(adist(word, sorted_words), <span style="color: #666666">2</span>)], word)[<span style="color: #666666">1</span>] }
</pre></div>

While it doesn't work exactly like Norvig's version, it should produce similar spelling corrections:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">correct(<span style="color: #BA2121">"piese"</span>)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## [1] "piece"</span>
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">correct(<span style="color: #BA2121">"ov"</span>)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## [1] "of"</span>
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">correct(<span style="color: #BA2121">"cakke"</span>)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic">## [1] "cake"</span>
</pre></div>
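
And, as you would hope, a word that already occurs in big.txt should come back unchanged, since its edit distance to itself is zero (my example, not from the original post):

```r
correct("the")  # should return "the" unchanged
```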

So let’s deobfuscate the two-liner slightly (though the code below might not make sense if you haven’t read Norvig’s essay first):

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic"># Read in big.txt, a 6.5 mg collection of different English texts.</span>
raw_text <span style="color: #666666"><-</span> paste(readLines(<span style="color: #BA2121">"http://www.norvig.com/big.txt"</span>), collapse <span style="color: #666666">=</span> <span style="color: #BA2121">" "</span>)
<span style="color: #408080; font-style: italic"># Make the text lowercase and split it up creating a huge vector of word tokens.</span>
split_text <span style="color: #666666"><-</span> strsplit(tolower(raw_text), <span style="color: #BA2121">"[^a-z]+"</span>)
<span style="color: #408080; font-style: italic"># Count the number of different type of words.</span>
word_count <span style="color: #666666"><-</span> table(split_text)
<span style="color: #408080; font-style: italic"># Sort the words and create an ordered vector with the most common type of words first.</span>
sorted_words <span style="color: #666666"><-</span> names(sort(word_count, decreasing <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>))

correct <span style="color: #666666"><-</span> <span style="color: #008000; font-weight: bold">function</span>(word) {
  <span style="color: #408080; font-style: italic"># Calculate the edit distance between the word and all other words in sorted_words.</span>
  edit_dist <span style="color: #666666"><-</span> adist(word, sorted_words)
  <span style="color: #408080; font-style: italic"># Calculate the minimum edit distance to find a word that exists in big.txt </span>
  <span style="color: #408080; font-style: italic"># with a limit of two edits.</span>
  min_edit_dist <span style="color: #666666"><-</span> min(edit_dist, <span style="color: #666666">2</span>)
  <span style="color: #408080; font-style: italic"># Generate a vector with all words with this minimum edit distance.</span>
  <span style="color: #408080; font-style: italic"># Since sorted_words is ordered from most common to least common, the resulting</span>
  <span style="color: #408080; font-style: italic"># vector will have the most common / probable match first.</span>
  proposals_by_prob <span style="color: #666666"><-</span> c(sorted_words[ edit_dist <span style="color: #666666"><=</span> min(edit_dist, <span style="color: #666666">2</span>)])
  <span style="color: #408080; font-style: italic"># In case proposals_by_prob would be empty we append the word to be corrected...</span>
  proposals_by_prob <span style="color: #666666"><-</span> c(proposals_by_prob, word)
  <span style="color: #408080; font-style: italic"># ... and return the first / most probable word in the vector.</span>
  proposals_by_prob[<span style="color: #666666">1</span>]
}
</pre></div>
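
The workhorse here is adist, which computes the (generalized) Levenshtein distance between each pair of elements of two character vectors and returns a matrix. A quick illustration (my example, not from the original post):

```r
adist("piese", c("piece", "pies", "praise"))
##      [,1] [,2] [,3]
## [1,]    1    1    3
```

Since edit_dist is a 1-by-n matrix, min(edit_dist, 2) simply collapses it, together with the literal 2, into a single number, capping the search at two edits.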

Some thoughts:

  • The main reason the R version is so short is that base R includes the adist function. (A one-line spell checker in R is indeed possible using the aspell function; see the sketch after this list 🙂)
  • A second reason the R version is so short is that R’s many vectorized functions make it possible to do a lot of work in one line.
  • Indeed, the horrible line creating the sorted_words vector would be a perfect target for some magrittr magic (see the pipe version after this list).
  • The R version does not solve the problem in exactly the same way as Norvig’s code. He maintains the count of each word in the NWORDS variable in order to extract the most probable matching word. This is not necessary in the R code: since we already have a sorted vector, we know that the first item will always be the most probable. Still, I believe the two approaches result in the same spelling corrections (but prove me wrong :).
  • There are links to implementations in many other languages at the bottom of Norvig’s essay. Looking at the Java version reminds me of my dark Java past and madness like HashMap candidates = new HashMap();.
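
To make those two asides concrete, here is a rough aspell-based corrector. This is my sketch, not code from the original post; utils::aspell shells out to an external spell checker (aspell, hunspell, or ispell), so it only works if one of those programs is installed.

```r
# Hypothetical helper, assuming an external spell checker program is installed.
correct_aspell <- function(word) {
  tmp <- tempfile()
  writeLines(word, tmp)
  # aspell returns a data frame (class "aspell") with a Suggestions list-column;
  # zero rows means the word was not flagged as misspelled.
  result <- utils::aspell(tmp)
  if (nrow(result) == 0) word else result$Suggestions[[1]][1]
}
```

And here is one way the sorted_words one-liner could look after some magrittr magic (again an untested sketch, assuming the magrittr package is installed):

```r
library(magrittr)

sorted_words <- "http://www.norvig.com/big.txt" %>%
  readLines() %>%
  paste(collapse = " ") %>%
  tolower() %>%
  strsplit("[^a-z]+") %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  names()
```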
