Pre-processing text: R/tm vs. python/NLTK

February 16, 2011

(This article was first published on Michael Bommarito » r, and kindly contributed to R-bloggers)

  Let’s say that you want to take a set of documents and apply a computational linguistic technique.  If your method is based on the bag-of-words model, you probably need to pre-process these documents first by segmenting, tokenizing, stripping, stopwording, and stemming each one (phew, that’s a lot of -ing’s).  

  In the past, I’ve relied on NLTK to perform these tasks.  Python is my strongest language and NLTK is mature, fast, and well-documented.  However, I’ve been focusing on performing tasks entirely within R lately, and so I’ve been giving the tm package a chance.  So far, I’ve been disappointed with its speed (at least from a relative sense).  

  Here’s a simple example that hopefully some better R programmer out there can help me with.  I have been tracking tweets on the #25bahman hashtag (97k of them so far).  The total size of the dataset is 18M, so even fairly inefficient code should have no problem with this.  I want to build a corpus of documents where each tweet is a document and each document is a bag of stemmed, stripped tokens.  You can see the code in my repository here, and I’ve embedded two of the three examples below.

Big N.B.: The R code wasn’t done after 15 minutes.  My laptop has an i7 640M and 8G of RAM, so the issue isn’t my machine. The timing below actually excludes the last line of the embedded R example below.

Here are the timing results with 3 runs per example.  As you can see, R/tm is more than twice as slow to build the un-processed corpus as Python/NLTK is to build the processed corpus. The comparison only gets worse when you parallelize the Python code.

  1. Python, unparallelized: ~1:05
  2. Python, parallelized (8 runners): ~0:45
  3. R, unparallelized: ~2:15

  So, dear Internet friends, what can I do? Am I using tm properly? Is this just a testament to the quality of NLTK? Am I cursed to forever write pre-processing code in one language and perform analysis in another?

Email Google Buzz Facebook Twitter Share

To leave a comment for the author, please follow the link and comment on their blog: Michael Bommarito » r. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , ,

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training


CRC R books series

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)