Stemming and Spell Checking in R

[This article was first published on OpenCPU, and kindly contributed to R-bloggers.]

Last week we introduced the new hunspell R package. This week a new version was released which adds support for additional languages and text analysis features.

Additional languages

By default, hunspell uses the US English dictionary en_US, but the new version can
check and analyze text in other languages as well. The ?hunspell help page has detailed
instructions on how to install additional dictionaries.

> library(hunspell)
> hunspell_info("ru_RU")
$dict
[1] "/Users/jeroen/workspace/hunspell/tests/testdict/ru_RU.dic"

$encoding
[1] "UTF-8"

$wordchars
[1] NA
> hunspell("чёртова карова", dict = "ru_RU")[[1]]
[1] "карова"

It turned out this feature was much more difficult to implement than I expected. Much of the Hunspell
library dates from before UTF-8 became popular, and therefore many dictionaries use local 8-bit character encodings such as ISO-8859-1 for English and KOI8-R for Russian. To spell check in these languages, the character encoding of the document text has to match that of the dictionary. However, R only supports latin1 and UTF-8,
so we need to convert strings in C with iconv, which opens up a new can of worms. Anyway, it should
all work now.
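
To illustrate why the encodings have to match, here is a minimal sketch using base R's iconv() function. It is not the package's internal code (that conversion happens in C); it only shows how the same Russian word is a different byte sequence in UTF-8 and in KOI8-R. It assumes the platform's iconv supports KOI8-R, which is common but not guaranteed.

```r
# The word "корова" written with Unicode escapes, so the string is UTF-8
x <- "\u043a\u043e\u0440\u043e\u0432\u0430"

# In UTF-8 each of these Cyrillic letters takes two bytes
utf8_bytes <- charToRaw(x)

# Convert to KOI8-R, an 8-bit encoding with one byte per letter
# (assumption: KOI8-R is available in this platform's iconv)
koi8_bytes <- iconv(x, from = "UTF-8", to = "KOI8-R", toRaw = TRUE)[[1]]

# The byte sequences differ, so a KOI8-R dictionary cannot match UTF-8 input
length(utf8_bytes)
length(koi8_bytes)

# Converting back recovers the original string
back <- iconv(list(koi8_bytes), from = "KOI8-R", to = "UTF-8")
```

A dictionary lookup compares bytes, so text has to be converted to the dictionary's encoding before checking and results converted back afterwards.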

Text analysis and wordclouds

In last week's post we showed how to parse and spell
check a LaTeX file:

# Check an entire latex document
library(hunspell)
setwd(tempdir())
download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz",  mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
bad_words <- hunspell(text, format = "latex")
sort(unique(unlist(bad_words)))

The new version also exposes the parser directly, so you can easily extract words and derive the stems to summarize some text, for example to display in a wordcloud.

# Summarize text by stems (e.g. for wordcloud)
allwords <- hunspell_parse(text, format = "latex")
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- head(sort(table(stems), decreasing = TRUE), 200)
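
The counting step can be tried on its own with base R. The sketch below uses a small made-up vector of stems in place of the hunspell_stem() output, and turns the frequency table into a data frame with one row per stem. The column names word and freq are my own choice for illustration; check what your plotting package of choice expects.

```r
# Hypothetical stems standing in for the output of hunspell_stem()
stems <- c("word", "cloud", "word", "stem", "word", "cloud")

# Count how often each stem occurs, most frequent first
counts <- sort(table(stems), decreasing = TRUE)

# Convert the table to a data frame and give the columns descriptive names
df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(df) <- c("word", "freq")

# Keep only the most frequent stems (here the top 2)
top <- head(df, 2)
print(top)
```

A data frame of words and frequencies like this is a typical input format for wordcloud plotting packages.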
