A Little Web Scraping Exercise with XML-Package

April 5, 2012

(This article was first published on theBioBucket*, and kindly contributed to R-bloggers)

Some months ago I posted an example of how to get the links of the contributing blogs on the R-Blogger site. I used readLines() and did some string processing using regular expressions.

With package XML this can be drastically shortened –
see this:

# get blogger urls with XML:
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")

With only a few lines of code this gives the same result as in the original post! Here I will also process the urls for retrieving links to each blog’s start page:

# get ids for those with only 2 slashes (no 3rd in the end):
id <- which(nchar(gsub("[^/]", "", urls )) == 2)
slash_2 <- urls[id]

# find position of 3rd slash occurrence in strings:
slash_stop <- unlist(lapply(str_locate_all(urls, "/"),"[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)

# final result, replace the ones with 2 slashes,
# which are lacking in slash_3:
blogs <- slash_3; blogs[id] <- slash_2

p.s.: Thanks to Vincent Zoonekynd for helping out with the XML syntax.

To leave a comment for the author, please follow the link and comment on their blog: theBioBucket*.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)