(This article was first published on theBioBucket*, and kindly contributed to R-bloggers)
Some months ago I posted an example of how to get the links of the contributing blogs on the R-bloggers site. I used readLines() and did some string processing using regular expressions. With package XML this can be drastically shortened, and only a few lines of code give the same result as in the original post! Here I will also process the urls to retrieve links to each blog's start page:
# get blogger urls with XML:
library(RCurl)
library(XML)
library(stringr)  # provides str_locate_all(), used further down
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")
# get ids for those with only 2 slashes (no 3rd in the end):
id <- which(nchar(gsub("[^/]", "", urls)) == 2)
slash_2 <- urls[id]
# find position of 3rd slash occurrence in strings
# (with at least 3 slashes, element 3 of each location matrix
# is the start of the 3rd match):
slash_stop <- unlist(lapply(str_locate_all(urls, "/"), "[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)
# final result, replace the ones with 2 slashes,
# which are lacking in slash_3:
blogs <- slash_3
blogs[id] <- slash_2

p.s.: Thanks to Vincent Zoonekynd for helping out with the XML syntax.
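As an aside, the trimming to each blog's start page could probably also be done with one regular expression in base R, without locating the slashes first. The following is only a sketch under assumptions: the `urls` vector here is made-up sample data standing in for the scraped links, `blogs_alt` is a hypothetical name, and the pattern assumes every link starts with `http://` or `https://`.

```r
# Made-up sample links (assumption), standing in for the scraped `urls`
urls <- c("http://www.example.org/blog/feed.xml",
          "http://blog.example.com/",
          "https://example.net")

# Keep scheme and host, drop everything from the 3rd slash onwards;
# links with only 2 slashes pass through unchanged
blogs_alt <- sub("^(https?://[^/]+).*$", "\\1", urls)
blogs_alt
```

Like the code above, this returns the start pages without a trailing slash and leaves links that have only two slashes as they are. Links that do not match the pattern at all are returned untouched by sub().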