A while back I posted something about scraping a webpage using the BeautifulSoup module in Python. One of the comments to that post was by Larry — a blogger over at IEORTools — suggesting that I take a look at the XML library in R. Given that one of the points of this blog is to become more familiar with some of the R tools, it seemed like a reasonable suggestion — and I went with it.
I decided to replicate my Python work for scraping the MLS data using the XML package for R. OK, I didn’t replicate it exactly because I only scraped five years’ worth of data. I figured that five years would be a sufficient amount of time for comparison purposes. The only major criterion that I enforced was that they both had to export nearly identical .csv files. I say “nearly” because R seemed to want to wrap every string in double quotes and it also exported the row names (1, 2, 3, etc.), both of which are default options in write.table(). Neither of these defaults is an issue, so I didn’t bother changing them. I added a few print statements to both scripts to show where a difference (if any) in timing might exist. The code can be found in my scraping repository on github.
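For anyone who wants the two files to match from the Python side instead, here is a minimal sketch, assuming Python 3 and only the standard csv module. It is not the code from the repository: the function name `write_like_r`, the column names, and the single row are made up for illustration. It mimics what I understand the write.table() defaults to produce with a comma separator: strings in double quotes, a leading row-number column, and a header line that has no entry for that column.

```python
import csv
import io


def write_like_r(header, rows, out):
    """Write rows with quoted strings and a leading row-number column.

    Hypothetical helper: it imitates the look of R's write.table() defaults
    (quote=TRUE, row.names=TRUE) when sep="," is used.
    """
    # QUOTE_NONNUMERIC quotes every string field and leaves numbers bare.
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    # The header has no entry for the row-number column.
    writer.writerow(header)
    for i, row in enumerate(rows, start=1):
        # Row names are written as strings, so they end up quoted too.
        writer.writerow([str(i)] + list(row))


# Placeholder data, not real MLS results.
buf = io.StringIO()
write_like_r(["year", "team", "goals"], [(2005, "Team A", 10)], buf)
csv_text = buf.getvalue()
print(csv_text)
```

Going the other way is probably simpler: write.table() has `quote` and `row.names` arguments that can be set to FALSE.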
To be honest, I don’t really know much about how system.time() works in R. However, I used this function as the basis of my comparison. That means I was timing source() on the R file and timing a system() call from R that ran the Python script, i.e., system("path/to/py_script.py"). The results can be summarized in the following graph.
As you can see in the figure, there is about a 3x speedup in using the XML package relative to using BeautifulSoup! This is not what I was expecting. Further, it appears that the overall “user” speedup is approximately 5x. In fact, the only place where Python seems to beat the R package is in the user.self portion of the time... whatever the hell this means.
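One plausible reading of the user.self number: R's proc.time() documentation splits CPU time into user.self and sys.self (the R process itself) and user.child and sys.child (processes it spawned). A script launched through system() runs as a child process, so its work would be charged to the child columns and barely register in user.self. The sketch below is not the code behind the graph. It is a rough Python analogue built on os.times(), with a hypothetical helper `time_command` and a throwaway one-line child command standing in for the real script, to show the same self/child split.

```python
import os
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd as a child process and split CPU time into self vs. child.

    Hypothetical helper; the key names mirror R's system.time() labels.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # blocks until the child has exited
    elapsed = time.perf_counter() - start
    after = os.times()
    return {
        "user.self": after.user - before.user,
        "sys.self": after.system - before.system,
        # Child CPU time is only added once the child has been waited for.
        # On Windows these two fields are always reported as zero.
        "user.child": after.children_user - before.children_user,
        "sys.child": after.children_system - before.children_system,
        "elapsed": elapsed,
    }


# Stand-in workload: a tiny Python one-liner run in a separate process.
timings = time_command(
    [sys.executable, "-c", "sum(i * i for i in range(200000))"]
)
for name, seconds in timings.items():
    print(f"{name:>10}: {seconds:.3f}")
```

If that reading is right, the fair comparison for the Python run is user.child plus sys.child (or simply elapsed), not user.self.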
As I said before, I decided to print out some system times within each script because scraping this data is iterative. That is, I scrape and process the data for each year within the loop (over years). So I was curious to see if each option was scraping and processing at about the same speed. It turns out that XML beat BeautifulSoup here as well.
## From system call to python:
Sun Aug 29 18:07:57 2010 -- Starting
Sun Aug 29 18:08:00 2010 -- Year: 2005
Sun Aug 29 18:08:02 2010 -- Year: 2006
Sun Aug 29 18:08:04 2010 -- Year: 2007
Sun Aug 29 18:08:06 2010 -- Year: 2008
Sun Aug 29 18:08:08 2010 -- Year: 2009
Sun Aug 29 18:08:08 2010 -- Finished
and in R:
 "2010-08-29 18:10:29 -- Starting"
 "2010-08-29 18:10:29 -- Year: 2005"
 "2010-08-29 18:10:29 -- Year: 2006"
 "2010-08-29 18:10:29 -- Year: 2007"
 "2010-08-29 18:10:30 -- Year: 2008"
 "2010-08-29 18:10:30 -- Year: 2009"
 "2010-08-29 18:10:30 -- Finished "
What do I conclude from this? Well, use R damnit! The XML package is super easy to use and it’s fast. Will I still use Python? Of course! I would bet that Python/BeautifulSoup would be a superior option if I had to scrape and process huge amounts of data, which will happen sooner rather than later.
My computer’s technical specs: 2.66 GHz Intel Core 2 Duo, 8 GB RAM (DDR3), R version 2.11.0, XML v3.1-0, python 2.6.1, and BeautifulSoup v22.214.171.124.
Preview of upcoming post: I am going to compare my two fantasy football drafts with the results of similar drafts that are posted online! Exciting stuff... you know, if you’re a nerd and like sports.