Revisiting text processing with R and Python

[This article was first published on Bommarito Consulting » r, and kindly contributed to R-bloggers.]

Back in 2011, I compared the performance of the most popular text processing libraries in R and Python. In case you can’t guess the answer, Python and NLTK won by a significant margin over R and tm. Text processing with R seemed simple on paper, but performance and flexibility limitations have kept me away since then, except for very small corpora.

In the years since, R has garnered a huge amount of attention from a growing community of enterprise and academic users. In 2011, the only mature text processing package was tm; now, with more and more big-name vendors like Oracle and HP piling marketing dollars into the language as a platform for big data analytics, you’d hope that the state of affairs would have improved.

Sadly, things have not come far enough to make R practical for many tasks. tm is still the most commonly used package, and much of the new work in text processing and natural language processing has been built on tm (see the reverse dependency lists on the tm CRAN page). If all you need are simple tokenization functions, the tau package does provide basic, efficient utilities (although they simply wrap the built-in R regular expression methods).
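For readers wondering what "simple tokenization" built on regular expressions amounts to, here is a minimal sketch in Python. It is not tau's implementation; the token pattern and the function names `tokenize` and `term_counts` are assumptions made up for illustration.

```python
import re
from collections import Counter

# Assumed token pattern: runs of letters, digits, or apostrophes.
# Real tokenizers (tau, NLTK) apply more careful rules than this.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and return its tokens in order of appearance."""
    return TOKEN_RE.findall(text.lower())


def term_counts(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))
```

For example, `tokenize("R and Python, revisited!")` yields the tokens `r`, `and`, `python`, and `revisited`. Anything beyond this (stemming, stop words, n-grams) is where the choice of library starts to matter.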

If you can live with rJava, you have a few more options: RWeka and openNLP provide access to Weka and Apache OpenNLP, respectively, via JNI. However, in my experience, the system constraints regarding rJava ( ambiguity, `env` sandboxing, architecture mismatches) and JNI+R copy performance haven’t made the effort worthwhile.

At the end of the day, those of us working with large text corpora still need dual-language workflows to process text prior to classifying or learning in R. R and its packages make the latter half of this work much easier, but my gut instinct is that scikit-learn, pylab, and pandas make Python a better single-language solution for most problems today.
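To make the dual-language workflow concrete, here is a rough sketch of the Python half: it turns raw documents into a dense document-term count matrix and writes it as CSV for R to pick up. The function names, the token pattern, and the CSV layout are all assumptions for illustration, not a description of any particular pipeline; only the standard library is used.

```python
import csv
import re
from collections import Counter

# Assumed token pattern, as simple as possible for the sketch.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def document_term_rows(docs):
    """Build a dense document-term count matrix.

    docs maps a document id to its raw text. Returns (header, rows),
    where header is ["doc_id", term1, term2, ...] with terms sorted
    alphabetically, and each row is [doc_id, count1, count2, ...].
    """
    counts = {
        doc_id: Counter(TOKEN_RE.findall(text.lower()))
        for doc_id, text in docs.items()
    }
    vocab = sorted(set().union(*counts.values()))
    header = ["doc_id"] + vocab
    rows = [[doc_id] + [c[term] for term in vocab] for doc_id, c in counts.items()]
    return header, rows


def write_dtm_csv(docs, handle):
    """Write the document-term matrix to an open text file handle as CSV."""
    header, rows = document_term_rows(docs)
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)
```

On the R side, something like `read.csv` with `row.names = 1` should load the result as a data frame, though R may rewrite column names that are not valid identifiers. A dense matrix like this is only sensible for small vocabularies; for large corpora a sparse format would be the better handoff.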
