Scraping R-bloggers with Python – Part 2

January 5, 2012
By

(This article was first published on The PolStat R Feed, and kindly contributed to R-bloggers)

In my previous post I showed how to write a small simple python script to download the pages of R-bloggers.com. If you followed that post and ran the script, you should have a folder on your hard drive with 2409 .html files labeled post1.html , post2.html and so forth. The next step is to write a small script that extract the information we want from each page, and store that information in a .csv file that is easily read by R. In this post I will show how to extract the post title, author name and date of a given post and store it in a .csv file with a unique id.

To do this open a document in your favorite python editor (I like to use aquamacs) and name it: extraction.py. As in the previous post we start by importing the modules that we will use for the extraction:

from BeautifulSoup import BeautifulSoup
import os
import re

As in the previous post we will be using the BeautifulSoup module to extract the relevant information from the pages. The os module is used to get a list of file from the directory where we have saved the .html files, and finally the re module allows us to use regular expressions to format the titles that include a comma value or a newline value (\n). We need to remove these as they would mess up the formatting of the .csv file.

After having read in the modules, we need to get a list of files that we can iterate over. First we need to specify the path were the files are saved, and then we use the os module to get all the filenames in the specified directory:

path = "/Users/thomasjensen/Documents/RBloggersScrape/download"
listing = os.listdir(path)

It might be that there are other files in the given directory, hence we apply a filter, in shape of a list comprehension, to weed out any file names that do not match our naming scheme:

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]

Notice that a regular expression was used to determine whether a given name in the list matched our naming scheme. For more on regular expressions have a look at this site.

The final steps in preparing our extraction is to change the working directory to where we have our .html files, and create an empty dictionary:

os.chdir(path)
data = {}

Dictionaries are one of the great features of Python. Essentially a dictionary is a mapping of a key to a specific value, however the fact that dictionaries can be nested within each other, allows us to create data structures similar to R’s data frames.

Now we are ready to begin extracting information from our downloaded pages. Much as in the previous post, we will loop over all the file names, read each file into Python and create a BeautifulSoup object from the file:

for page in listing:
    site = open(page,"rb")
    soup = BeautifulSoup(site)

In order to store the values we extract from a given page, we update the dictionary with a unique key for the page. Since our naming scheme made sure that each file had a unique name, we simply remove the .html part from the page name, and use that as our key:

key = re.sub(".html","",page)
data.update({key:{}})

This will create a mapping between our key and an empty dictionary, nested within the data dictionary. Once this is done we can start extract information and store it in our newly created nested dictionary. The content we want is located in the main column, which has the id tag “leftcontent” in the html code. To get at this we use find() function on soup object created above:

content = soup.find("div", id = "leftcontent")

The first “h1” tag in our content object contains the title, so again we will use the find() function on the content object, to find the first “h1” tag:

title = content.findNext("h1").text

To get the text within the “h1” tag the .text had been added to our search with in the content object.

To find the author name, we are lucky that there is a class of “div” tags called “meta” which contain a link with the author name in it. To get the author name we simply find the meta div class and search for a link. Then we pull out the text of the link tag:

author = content.find("div",{"class":"meta"}).findNext("a").text

Getting the date is a simple matter as it is nested within div tag with the class “date”:

date = content.find("div",{"class":"date"}).text

Once we have the three variables we put them in dictionaries that are nested within the nested dictionary we created with the key:

data[key]["title"] = title
data[key]["author"] = author
data[key]["date"] = date

Once we have run the loop and gone through all posts, we need to write them in the right format to a .csv file. To begin with we open a .csv file names output:

output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

then we create a header that contain the variable names and write it to the output.csv file as the first row:

variables = unicode(",".join(["id","date","author","title"]))
header = variables + "\n"
output.write(header.encode("utf8"))

Next we pull out all the unique keys from our dictionary that represent individual posts:

keys = data.keys()

Now it is a simple matter of looping through all the keys, pull out the information associated with each key, and write that information to the output.csv file:

for key in keys:
    print key
    id = key
    date = re.sub(",","",data[key]["date"])
    author = data[key]["author"]
    title = re.sub(",","",data[key]["title"])
    title = re.sub("\\n","",title)
    linelist = [id,date,author,title]
    linestring = unicode(",".join(linelist))
    linestring = linestring + "\n"
    output.write(linestring.encode("utf-8"))

Notice that we first create four variables that contain the id, date, author and title information. With regards to the title we use two regular expressions to remove any commas and “\n” from the title, as these would create new columns or new line breaks in the output.csv file. Finally we put the variables together in a list, and turn the list into a string with the list items separated by a comma. Then a linebreak is added to the end of the string, and the string is written to the output.csv file. As a last step we close the file connection:

output.close()

And that is it. If you followed the steps you should now have a csv file in your directory with 2409 rows, and four variables – ready to be read into R. Stay tuned for the next post which will show how we can use this data to see how R-bloggers has developed since 2005. The full extraction script is shown below:

To leave a comment for the author, please follow the link and comment on his blog: The PolStat R Feed.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.