Stemming and Spell Checking in R

March 20, 2016

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)


Last week we introduced the new hunspell R package. This week a new version was released which adds support for additional languages and text analysis features.

Additional languages

By default hunspell uses the US English dictionary en_US, but the new version also allows checking and analyzing text in other languages. The ?hunspell help page has detailed instructions on how to install additional dictionaries.

> library(hunspell)
> hunspell_info("ru_RU")
$dict
[1] "/Users/jeroen/workspace/hunspell/tests/testdict/ru_RU.dic"

$encoding
[1] "UTF-8"

$wordchars
[1] NA
> hunspell("чёртова карова", dict = "ru_RU")[[1]]
[1] "карова"

It turned out this feature was much more difficult to implement than I expected. Much of the Hunspell library dates from before UTF-8 became popular, and therefore many dictionaries use local 8-bit character encodings such as ISO-8859-1 for English and KOI8-R for Russian. To spell check in these languages, the character encoding of the document text has to match that of the dictionary. However, R only supports latin1 and UTF-8, so we need to convert strings in C with iconv, which opens up a new can of worms. Anyway, it should all work now.
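To make the encoding mismatch concrete, here is a minimal sketch using base R's iconv() function. It only illustrates the idea at the R level and is not how the package does the conversion internally (that happens in C). It assumes your platform's iconv supports the KOI8-R encoding, which is common but not guaranteed.

```r
# A UTF-8 string, written with \u escapes so the script itself stays ASCII
word <- "\u043a\u043e\u0440\u043e\u0432\u0430"  # the Russian word "корова"
Encoding(word)

# Convert to an 8-bit dictionary encoding; toRaw = TRUE returns the raw bytes
# (assumption: KOI8-R is available in this platform's iconv)
koi8 <- iconv(word, from = "UTF-8", to = "KOI8-R", toRaw = TRUE)[[1]]
length(koi8)            # one byte per letter in KOI8-R
nchar(word, "bytes")    # the same word takes more bytes in UTF-8

# Convert the raw bytes back to UTF-8 and check that the round trip is lossless
back <- iconv(list(koi8), from = "KOI8-R", to = "UTF-8")
identical(back, word)
```

The same letters have different byte sequences in the two encodings, which is why text has to be converted to the dictionary's encoding before a lookup can match.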

Text analysis and wordclouds

In last week's post we showed how to parse and spell check a LaTeX file:

# Check an entire latex document
library(hunspell)
setwd(tempdir())
download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz",  mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
bad_words <- hunspell(text, format = "latex")
sort(unique(unlist(bad_words)))

The new version also exposes the parser directly, so you can easily extract words and derive the stems to summarize some text, for example to display in a wordcloud.

# Summarize text by stems (e.g. for wordcloud)
allwords <- hunspell_parse(text, format = "latex")
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- head(sort(table(stems), decreasing = TRUE), 200)
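The same pipeline can be tried on a small inline text without downloading anything. The sketch below is an illustration, not part of the original post: the sample sentences and the word/freq column names are made up, and wordcloud2 is just one plotting package that could be used, on the assumption that it is installed and accepts a data frame with word and freq columns.

```r
library(hunspell)

# Hypothetical sample text, standing in for the parsed LaTeX document
text <- c("The dogs were running while another dog runs.",
          "Running dogs love walks and walking.")

# Split each line into words (plain text, so the default format is used)
allwords <- hunspell_parse(text)

# Look up stems; a word can have zero, one or several stems
stems <- unlist(hunspell_stem(unlist(allwords)))

# Count stems and put them into a word/frequency data frame
freq <- sort(table(stems), decreasing = TRUE)
df <- data.frame(word = names(freq), freq = as.integer(freq),
                 stringsAsFactors = FALSE)
head(df)

# Optional: draw the cloud, only if the (assumed) wordcloud2 package is present
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  wordcloud2::wordcloud2(df)
}
```

Words that the dictionary does not recognize produce no stem and silently drop out of the counts, so the summary reflects only correctly spelled words.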
