R and the web (for beginners), Part III: Scraping MPs’ expenses in detail from the web

Posted on August 23, 2012 by GivenTheData in R bloggers | 0 Comments

[This article was first published on GivenTheData, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this last post of my little series (see my latest post) on R and the web I explain how to extract data of a website (web scraping/screen scraping) with R. If the data you want to analyze are a part of a web page, for example a HTML-table (or hundreds of them) it might be very time-consuming (and boring!) to manually copy/paste all of its content or even typewrite it to a spreadsheet table or data frame. Instead, you can let R do the job for you!

This post is really aimed at beginners. Thus, to keep things simple it only deals with scraping one data table from one web page: a table published by BBC NEWS containing the full range of British Members of Parliament’ expenses in 2007-2008. Quite an interesting data set if you are into political scandals…

Web scraping with R

There are several R packages that might be helpful for web scraping, such as XML, RCurl, and scrapeR. In this example only the XML package is used. As a fist step, you parse the whole HTML-file and extract all HTML-tables in it:

library(XML)

# URL of interest:
mps <- "http://news.bbc.co.uk/2/hi/uk_politics/8044207.stm"

# parse the document for R representation:
mps.doc <- htmlParse(mps)

# get all the tables in mps.doc as data frames

mps.tabs <- readHTMLTable(mps.doc)

mps.tabs is a list containing in each element a HTML-table from the parsed website (mps.doc) as data.frame. The website contains several HTML-tables (some are rather used to structure the website and not to present data). The list mps.tabs actually has seven entries, hence there were seven HTML-tables in the parsed document:

length(mps.tabs)

To proceed you need to check which of these data frames (list entries) contains the table you want (the MPs’ expenses). You can do that “manually” by checking how the data frame starts and ends and compare it with the original table of the website:

head(mps.tabs[[1]]) #and

tail(mps.tabs[[1]]) #for 1 to 7

With only seven entries this is quite fast. But alternatively you could also write a little loop to do the job for you. The loop checks each data frame for certain conditions. In this case: the string of the first row and first column and the string in the last row and column. According to the original table from the website that should be:

first <- "Abbott, Ms Diane"

last <- "157,841"

# … and the loop:

for (i in 1:length(mps.tabs)) {

lastrow <- nrow(mps.tabs[[i]]) # get number of rows

lastcol <- ncol(mps.tabs[[i]])

if (as.character(mps.tabs[[i]][1,1])==first & as.character(mps.tabs[[i]][lastrow,lastcol])==last) {

tabi <- i

}

Check if that is realy what you want and extract the relevant table as data frame.

head(mps.tabs[[tabi]])

tail(mps.tabs[[tabi]])

mps <- mps.tabs[[tabi]]

Before you can properly analyze this data set we have to remove the commas in the columns with expenses and format them as numeric:

money <- sapply(mps[,-1:-3], FUN= function(x) as.numeric(gsub(",", "", as.character(x), fixed = TRUE) ))

mps2 <- cbind(mps[,1:3],money)

Now you are ready to go… For example, you could compare how the total expenses are distributed for each of the five biggest parties:

# which are the five biggest parties by # of mps?

nbig5 <- names(summary(mps2$Party)[order(summary(mps2$Party)*-1)][1:5])

#subset of mps only with the five biggest parties:

big5 <- subset(mps2, mps$Party%in%nbig5)

# load the lattice package for a nice plot

library(lattice)

bwplot(Total ~ Party, data=big5, ylab=”Total expenses per MP (in £)”)

Here is the resulting plot:

And the relevant R code in one piece:

library(XML)

# URL of interest:
mps <- "http://news.bbc.co.uk/2/hi/uk_politics/8044207.stm"

# parse the document for R representation:
mps.doc <- htmlParse(mps)

# get all the tables in mps.doc as data frames

mps.tabs <- readHTMLTable(mps.doc)

# loop to find relevant table:

first <- "Abbott, Ms Diane"

last <- "157,841"

for (i in 1:length(mps.tabs)) {

lastrow <- nrow(mps.tabs[[i]]) # get number of rows

lastcol <- ncol(mps.tabs[[i]])

if (as.character(mps.tabs[[i]][1,1])==first & as.character(mps.tabs[[i]][lastrow,lastcol])==last) {

tabi <- i

}

}

# extract the relevant table and format it:

mps <- mps.tabs[[tabi]]

money <- sapply(mps[,-1:-3], FUN= function(x) as.numeric(gsub(",", "", as.character(x), fixed = TRUE) ))

mps2 <- cbind(mps[,1:3],money)

# which are the five biggest parties by # of mps?

nbig5 <- names(summary(mps2$Party)[order(summary(mps2$Party)*-1)][1:5])

#subset of mps only with the five biggest parties:

big5 <- subset(mps2, mps$Party%in%nbig5)

# load the lattice package for a nice plot

library(lattice)

bwplot(Total ~ Party, data=big5, ylab=”Total expenses per MP (in £)”)

More web scraping examples on r-bloggers.com

If you are interested in more web scraping with R, check out the following links to posts with more advanced/specific examples presented on r-bloggers.com:

http://www.r-bloggers.com/web-scraper-for-google-scholar-updated/

http://www.r-bloggers.com/web-scraping-in-r/

http://www.r-bloggers.com/r-web-scraping-r-bloggers-facebook-page-to-gain-further-information-about-an-authors%E2%80%99-r-blog-posts-e-g-number-of-likes-comments-shares-etc/

http://www.r-bloggers.com/web-scraping-yahoo-search-page-via-xpath/

http://www.r-bloggers.com/how-to-buy-a-used-car-with-r-part-1/

To leave a comment for the author, please follow the link and comment on their blog: GivenTheData.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

R and the web (for beginners), Part III: Scraping MPs’ expenses in detail from the web

Web scraping with R

More web scraping examples on r-bloggers.com

Related

Web scraping with R

More web scraping examples on r-bloggers.com

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)