October 24, 2010

This past Tuesday I had the opportunity to present a short talk (which ran a bit long) on text mining at the Los Angeles R Users’ Group. Since I do most of my text mining in Python, I used the occasion to discuss RPy2, an interface to R from Python. My slides are below:

Download/view slides here. Topics include:

  • Using Python with R, with a web mining example.
  • Web mining using pure R rather than Python.

Code for the demonstration is here:

  1. A pure Python script that extracts data from a web forum and dumps it to disk. To actually use it, you will need to register for an account on the forum.
  2. A script that reads the forum data back from disk and calls R from Python to perform some basic analysis.
  3. curljson_demo.R grabs some JSON data from the Twitter Search API using RCurl and converts it to R lists using rjson.
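The first step above (pull posts out of forum HTML and dump them to disk) can be sketched roughly as follows. This is not the script from the talk: it swaps in the standard library's html.parser instead of twill and BeautifulSoup so it runs without extra packages, and the `<div class="post">` markup, the class and function names, and the JSON file format are all assumptions made up for illustration.

```python
import json
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every <div class="post"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []    # finished post texts
        self._depth = 0    # > 0 while we are inside a post <div>
        self._chunks = []  # text pieces of the post being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1                  # nested <div> inside a post
        elif ("class", "post") in attrs:
            self._depth = 1                   # a new post starts here
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Post is complete: join the pieces and normalize whitespace.
                self.posts.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def extract_posts(html):
    """Return the text of each post found in an HTML string."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


def dump_posts(posts, path):
    """Write the list of post texts to disk as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(posts, fh)


def load_posts(path):
    """Read the list of post texts back from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Keeping the scraping and the analysis in separate scripts, with a file on disk in between, means the forum only has to be hit once while the analysis is rerun as often as needed.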


Running the code requires installing a few packages:

  • twill package for Python for web browsing. twill is a wrapper around mechanize, so it requires the mechanize package as well.
  • BeautifulSoup package for Python for HTML parsing.
  • R must be built as a shared library (configure it with --enable-R-shlib); otherwise Python cannot call it.
  • RPy2, the Python interface to R.
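Once R is built as a shared library and RPy2 is installed, calling R from Python can look something like the sketch below. It is a minimal illustration rather than the demo code from the talk: the function names `word_counts` and `summarize_in_r` are hypothetical, and the choice of R's `mean` and `sd` as the "basic analysis" is an assumption.

```python
def word_counts(posts):
    """Count whitespace-separated words in each post (plain Python)."""
    return [len(post.split()) for post in posts]


def summarize_in_r(counts):
    """Pass the counts to R and return R's mean and sd as Python floats.

    Assumes rpy2 is installed and R was built as a shared library.
    """
    # Imported lazily so the pure-Python helper above works without rpy2.
    import rpy2.robjects as robjects

    vec = robjects.IntVector(counts)   # copy the Python list into an R integer vector
    mean = robjects.r["mean"](vec)[0]  # look up R's mean() and call it
    sd = robjects.r["sd"](vec)[0]      # same for sd()
    return mean, sd
```

For example, `summarize_in_r(word_counts(posts))` would do the text handling in Python and leave the statistics to R, which is the division of labor the talk was about.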

To see the main talk of the evening, click here.

