R: Stem (Pre-Processed) Text Blocks

August 24, 2014

(This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers)


I recently needed to stem every word in a block of text, i.e. reduce each word to its root form.


The stemmer I was using would only stem the last word in each block of text, e.g.:


wordStem('walk walks walked walking walker walkers', language = 'en')
# [1] 'walk walks walked walking walker walk'
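A likely explanation, assuming wordStem() treats each element of its character vector as one word (which is how the SnowballC documentation describes its `words` argument): the whole block is a single element, so only its final suffix gets stemmed. A minimal base-R sketch of the difference, where `block` and `words` are illustrative names:

```r
# The block is one string, so an element-wise stemmer sees a single "word"
block <- "walk walks walked walking walker walkers"
length(block)   # 1

# Splitting on whitespace yields one element per word
words <- unlist(strsplit(block, split = "\\s"))
length(words)   # 6
print(words)
```

Stemming `words` rather than `block` gives the stemmer each word separately.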


I wrote a function which splits a block of text into individual words, stems each word, and then recombines the words into a block of text:

library(SnowballC)  # assumed source of wordStem(); the package is not named in the post
library(parallel)   # provides mclapply()

stem_text <- function(text, language = "porter", mc.cores = 1) {
  # stem each word in a block of text
  stem_string <- function(str, language) {
    str <- strsplit(x = str, split = "\\s")
    str <- wordStem(unlist(str), language = language)
    str <- paste(str, collapse = " ")
    return(str)
  }

  # stem each text block in turn
  x <- mclapply(X = text, FUN = stem_string, language, mc.cores = mc.cores)

  # return stemmed text blocks
  return(unlist(x))
}

This works under the assumption that each block contains only words and whitespace (i.e. it has been appropriately pre-processed).
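For text that has not been pre-processed yet, one possible clean-up step is sketched below in base R. The helper `clean_text` is hypothetical and not part of the original function, and lower-casing is my own choice here (the example in this post keeps capital letters). Collapsing runs of whitespace matters because splitting on a single "\\s" would otherwise produce empty strings between consecutive spaces.

```r
# Hypothetical helper (an assumption, not from the original post):
# reduce raw text to lower-case words separated by single spaces
clean_text <- function(text) {
  text <- tolower(text)
  text <- gsub("[^[:alnum:][:space:]]", " ", text)  # punctuation -> space
  text <- gsub("\\s+", " ", text)                   # collapse runs of whitespace
  trimws(text)                                      # drop leading/trailing spaces
}

clean_text("Never ignore coincidence, unless (of course) you are busy!")
# [1] "never ignore coincidence unless of course you are busy"
```

Note that this crude rule also splits contractions at the apostrophe, so it is only a starting point.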

# Blocks of text
sentences <- c('walk walks walked walking walker walkers',
               'Never ignore coincidence unless of course you are busy In which case always ignore coincidence')

# Stem blocks of text
stem_text(sentences, language = 'en', mc.cores = 2)

# [1] 'walk walk walk walk walker walker'
# [2] 'Never ignor coincid unless of cours you are busi In which case alway ignor coincid'
