R: Stem (Pre-Processed) Text Blocks

August 24, 2014

(This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers)

Objective

I recently needed to stem every word in a block of text, i.e. reduce each word to its root form.

Problem

The stemmer I was using, wordStem() from the SnowballC package, would only stem the last word in each block of text, e.g.:

require(SnowballC)

wordStem('walk walks walked walking walker walkers', language = 'en')
# [1] "walk walks walked walking walker walk"
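
One way to read this result is that wordStem() treats each element of its input vector as a single word, so a whole block passed as one string only has its ending stemmed. Here is a minimal sketch, assuming SnowballC is installed, that passes the words as separate elements instead (the variable names are only illustrative):

```r
library(SnowballC)

# One element per word, rather than one string holding the whole block
words <- c("walk", "walks", "walked", "walking", "walker", "walkers")

# Each element is stemmed on its own
stems <- wordStem(words, language = "en")
print(stems)
```

This suggests the fix is a matter of splitting each block into words before stemming and joining the stems afterwards.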

Solution

I wrote a function which splits a block of text into individual words, stems each word, and then recombines the stemmed words into a single block of text:

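require(parallel)  # provides mclapply()

stem_text <- function(text, language = "porter", mc.cores = 1) {
  # stem each word in a single block of text
  stem_string <- function(str, language) {
    # split on runs of whitespace so that each word is a separate element
    str <- strsplit(x = str, split = "\\s+")
    # wordStem() stems each element of the character vector independently
    str <- wordStem(unlist(str), language = language)
    # recombine the stemmed words into one block of text
    str <- paste(str, collapse = " ")
    return(str)
  }

  # stem each text block in turn (mc.cores > 1 is not supported on Windows)
  x <- mclapply(X = text, FUN = stem_string, language = language, mc.cores = mc.cores)

  # return stemmed text blocks
  return(unlist(x))
}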

This works under the assumption that the text contains only words and whitespace (i.e. it has been appropriately pre-processed).
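
What counts as "appropriately pre-processed" will depend on your data. As a rough sketch only, here is one way to reduce raw text to letters and single spaces using base R. The function name clean_text is hypothetical, and lower-casing and dropping digits are my assumptions rather than requirements of the stemmer:

```r
# Hypothetical helper: reduce raw text to lower-case words separated by single spaces
clean_text <- function(text) {
  text <- tolower(text)                              # assumption: case is not needed
  text <- gsub("[^[:alpha:][:space:]]", " ", text)   # replace punctuation and digits with spaces
  text <- gsub("\\s+", " ", text)                    # collapse runs of whitespace
  text <- gsub("^\\s+|\\s+$", "", text)              # trim leading and trailing whitespace
  return(text)
}

cleaned <- clean_text("Hello, World! 42 times.")
print(cleaned)
```

Replacing punctuation with a space rather than deleting it keeps hyphenated words apart, but it also splits contractions such as "it's" into two tokens, so adjust to taste.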

# Blocks of text
sentences <- c('walk walks walked walking walker walkers',
               'Never ignore coincidence unless of course you are busy In which case always ignore coincidence')

# Stem blocks of text
stem_text(sentences, language = 'en', mc.cores = 2)

# [1] "walk walk walk walk walker walker"
# [2] "Never ignor coincid unless of cours you are busi In which case alway ignor coincid"
