(This article was first published on Byte Mining, and kindly contributed to R-bloggers)
This past Tuesday I had the opportunity to present a short talk (which ran a bit long) on text mining at the Los Angeles R Users’ Group. Since I do most of my text mining in Python, I used the occasion to discuss RPy2, a Python interface to R. My slides are below:
Download/view slides here. Topics include:
- Using Python with R, with a web mining example.
- Web mining in pure R rather than Python.
Code for demonstration is here:
- offtopic_demo.py is a pure Python script that extracts data from a web forum and dumps it to disk. To actually use it, you will need to register for an account.
- RPy2_demo.py reads the forum data back from disk and calls R from Python to perform some basic analysis.
- curljson_demo.R grabs some JSON data from the Twitter Search API using RCurl and converts it to R lists using rjson.
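The hand-off between the first two scripts (scrape and dump to disk, then read back and call R) can be sketched roughly as follows. This is not the code from the talk: the JSON file format, the "body" field, and the function names are all assumptions made for illustration, and the R call simply computes a mean through RPy2's robjects interface.

```python
import json
import tempfile
from pathlib import Path


def dump_posts(posts, path):
    """Write scraped posts (a list of dicts) to disk as JSON.

    The JSON layout is an assumption; the original script may use another format.
    """
    Path(path).write_text(json.dumps(posts), encoding="utf-8")


def load_post_lengths(path):
    """Read the dumped posts back and return the word count of each post body."""
    posts = json.loads(Path(path).read_text(encoding="utf-8"))
    return [len(post["body"].split()) for post in posts]


def summarize_in_r(lengths):
    """Hand the word counts to R and return R's mean() as a Python float.

    rpy2 is imported lazily so the file helpers above run without it.
    """
    import rpy2.robjects as ro  # needs R built as a shared library

    r_vector = ro.IntVector(lengths)  # Python list -> R integer vector
    r_mean = ro.r["mean"]             # look up R's mean() function
    return r_mean(r_vector)[0]        # R returns a length-1 vector


if __name__ == "__main__":
    # Hypothetical sample data standing in for scraped forum posts.
    sample = [{"body": "R and Python together"}, {"body": "text mining"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "posts.json"
        dump_posts(sample, path)
        lengths = load_post_lengths(path)
    print(summarize_in_r(lengths))
```

Only summarize_in_r needs RPy2 and a shared-library build of R; the two file helpers run on a stock Python install.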
Running the code requires installing a few packages:
- twill package for Python, for scripted web browsing. twill is a wrapper around mechanize, so the mechanize package is required as well.
- BeautifulSoup package for Python, for HTML parsing.
- R must be built as a shared library (configure with --enable-R-shlib); otherwise Python cannot call it.
- RPy2, the Python interface to R.
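Before running the demos, a quick check like the one below can report which Python-side dependencies are missing. It is only a sketch: the import names are assumptions (in particular, current BeautifulSoup releases import as bs4, while older releases used the module name BeautifulSoup), and it cannot tell whether R itself was built as a shared library.

```python
from importlib.util import find_spec

# Assumed import names for the Python packages listed above.
REQUIRED_MODULES = ["twill", "mechanize", "bs4", "rpy2"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All Python dependencies found.")
```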
To see the main talk of the evening, click here.
Some Recommended Books
Natural Language Processing
- Foundations of Statistical Natural Language Processing, Manning and Schuetze.
- Speech and Language Processing, Jurafsky and Martin.
- Natural Language Processing and Text Mining, Kao and Poteet.
Text Mining
- Practical Text Mining with Perl, Bilisoly. See my review of this book in the Journal of Statistical Software here, which is also excerpted on Amazon!
- Text Mining: Applications and Theory, Berry and Kogan (NEW).
- The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Feldman and Sanger.
- Mastering Regular Expressions, Friedl.
Data Mining
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Hastie, Tibshirani and Friedman.
- Data Mining: Concepts and Techniques (recommended by @nealrichter). Han, Kamber and Pei.
- Data Mining: Practical Machine Learning Tools and Techniques [the fern book]. Witten and Frank.
- Introduction to Data Mining [the rock book]. Tan, Steinbach, Kumar.
Web Mining
- Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Liu.
- Mining the Web: Discovering Knowledge from Hypertext Data, Chakrabarti.
- Mining Graph Data, Cook and Holder.
- Managing and Mining Graph Data, Aggarwal and Wang.
- Social Network Analysis: Methods and Applications, Wasserman and Faust.