Web Scraping Yahoo Search Page via XPath

November 10, 2011

(This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers)

Seeing as I’m on a bit of an XPath kick as of late, I figured I’d continue scraping search results, but this time from Yahoo.com.

Rolling my own version of xpathSApply to handle NULL elements seems to have done the trick, and so far the scraping has been relatively easy. I’ve created an R function which scrapes information from a Yahoo Search page (with the user supplying the Yahoo Search URL) and extracts as much information as it can whilst maintaining the data frame structure (full source code at end of post). For example:

# load packages
library(RCurl)
library(XML)

# user provides url and the function extracts relevant information into a data frame as follows
u <- "http://uk.search.yahoo.com/search;_ylt=A7x9QV6rWrxOYTsAHNFLBQx.?fr2=time&rd=r1&fr=yfp-t-702&p=Wil%20Wheaton&btf=w"
df <- get_yahoo_search_df(u)
t(df[1, ])

#             1
# title       "Wil Wheaton - Google+"
# url         "https://plus.google.com/108176814619778619437"
# description "Wil Wheaton - Google+6 days ago"
# cached      ",65b6306b&icp=1&.intl=uk&sig=6lwcOA8_4oGClQam_5I0cA--"
# recorded    "6 days ago"

I’ve only tested these on web results. The idea of these posts is to get basic functionality and then if I feel it might be fun, to expand the functionality in the future.

It’s nice having an online blog where I can keep these functions I’ve come up with during coding exercises. Maybe if I make enough of these Web Search Engine scrapers I can go ahead and make my first R package. Though the downside of web scraping is that if the structure/entities of the HTML code change then the scrapers may stop working. That could make the package difficult to maintain. I can’t really think of how the package itself might be useful to anyone apart from teaching me personally how to build a package.

Maybe that’ll be worth it in and of itself. Ha, version 2.0 could be just a collection of the self-contained functions, version 3.0 could have the functions converted to S3 (which I really want to learn), version 4.0 could have them converted to S4 (again, something I’d like to learn) and version 5.0 could have reference classes (I still don’t know what those things are). Just thinking out loud; it could be a good way to learn more R. I doubt I’ll do it, though, but we’ll see. I have to find time to start learning Python, so I might have to put R on the back burner soon!

Full source code here (function is self-contained, just copy and paste):

# load packages
library(RCurl)
library(XML)

get_yahoo_search_df <- function(u) {
  # I hacked my own version of xpathSApply to deal with cases that return NULL which were causing me problems
  xpathSNullApply <- function(doc, path.base, path, FUN, FUN2 = NULL) {
    nodes.len <- length(xpathSApply(doc, path.base))
    paths <- sapply(1:nodes.len, function(i) gsub(path.base, paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
    xx <- lapply(paths, function(xpath) xpathSApply(doc, xpath, FUN))
    if(!is.null(FUN2)) xx <- FUN2(xx)
    xx[sapply(xx, length) < 1] <- NA
    xx <- as.vector(unlist(xx))
    return(xx)
  }

  # download html and parse into tree structure
  html <- getURL(u, followlocation = TRUE)
  doc <- htmlParse(html)

  # path to nodes of interest
  path.base <- "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li"

  # construct data frame
  df <- data.frame(
    title = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/h3/a", xmlValue),
    url = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/h3/a[@href]", xmlAttrs, FUN2 = function(xx) sapply(xx, function(x) x[2])),
    description = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div", xmlValue),
    cached = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/a[@href]", xmlAttrs, FUN2 = function(xx) sapply(xx, function(x) x[1])),
    recorded = xpathSNullApply(doc, path.base, "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/span[@id='resultTime']", xmlValue),
    stringsAsFactors = FALSE)

  # free doc from memory
  free(doc)

  # return data frame
  return(df)
}

u <- "http://uk.search.yahoo.com/search;_ylt=A7x9QV6rWrxOYTsAHNFLBQx.?fr2=time&rd=r1&fr=yfp-t-702&p=Wil%20Wheaton&btf=w"
df <- get_yahoo_search_df(u)
t(df[1:5, ])

#             1
# title       "Wil Wheaton - Google+"
# url         "https://plus.google.com/108176814619778619437"
# description "Wil Wheaton - Google+6 days ago"
# cached      ",65b6306b&icp=1&.intl=uk&sig=6lwcOA8_4oGClQam_5I0cA--"
# recorded    "6 days ago"
#             2
# title       "WIL WHEATON DOT NET"
# url         "http://www.wilwheaton.net/coollinks.php"
# description "Wil Wheaton - Don't be a dick! - Writer and Actor - Your Mom - I'm Wil Wheaton. I'm an author (that's why I'm wilwheatonbooks), an actor, and a lifelong geek."
# cached      ",4a4e7c54&icp=1&.intl=uk&sig=VC7eV8GUMXVuu9apHagYNg--"
# recorded    "2 days ago"
#             3
# title       "this is one hell of a geeky weekend - WWdN: In Exile"
# url         "http://wilwheaton.typepad.com/wwdnbackup/2008/05/this-is-one-hel.html"
# description "WIL WHEATON DOT NET2 days ago"
# cached      ",34d4424b&icp=1&.intl=uk&sig=ZN.UpexVV4pm3yn7XiEURw--"
# recorded    "2 days ago"
#             4
# title       "Wil Wheaton - Google+ - I realized today that when someone ..."
# url         "https://plus.google.com/108176814619778619437/posts/ENTkBMZKeGY"
# description ">Cool Sites. Okay, I'm talking to the guys here: do you ever get \"the sigh\"? You know what I'm talking about...you're really into some cool website, and your ..."
# cached      ",dba19826&icp=1&.intl=uk&sig=jGaKkuIFOINEBBfBwarrgg--"
# recorded    "6 days ago"
#             5
# title       "The Hot List: Dwight Slade, Back Fence PDX, Wil Wheaton vs ..."
# url         "http://www.oregonlive.com/movies/index.ssf/2011/11/the_hot_list_dwight_slade_back.html"
# description "this is one hell of a geeky weekend - WWdN: In Exile2 days ago"
# cached      ",e585aa21&icp=1&.intl=uk&sig=KufdBZ_Thr1Mm8.SnjpMUQ--"
# recorded    "4 hours ago"
