Web Scraping Google+ via XPath

November 11, 2011


Google+ just opened up to allow brands, groups, and organizations to create their very own public Pages on the site. This didn’t bother me too much, but I’ve been hearing a lot about Google+ lately, so I figured it might be fun to set up an XPath scraper to extract information from each post of a status update page. I was originally going to do one for Facebook, but this just seemed more interesting, so maybe I’ll leave that for next week if I get time. Anyway, here’s how it works (full code link at end of post):

u <- "https://plus.google.com/110286587261352351537/posts"
df <- scrape_google_plus_page(u, is.URL = TRUE)
t(df[2, ])
# posted.by               "Felicia Day"
# ID                      "110286587261352351537"
# message                 "Commentary vid for DA: R Ep 5!"
# message.share.link.name "Dragon Age: Redemption Ep5 Annotation"
# message.share.link      "http://www.youtube.com/watch?v=GZ4NGa0qeaM"
# post.date               "2011-11-09"
# comments                "67"
# comments.by             "Christopher H, Jesse McGlothlin, Ronel Villeno, Sealavindo Marine, Alexander Pinckard and 1 more"
# sample.comments         "Christopher H  -  Watching your commentary videos are like watching your video blog in The Guild! :)    "
# shares                  "20"
# shares.by               "Amy Mayer, Bonnie Zabytko, Brad Chasenore, Dark Matter fanzine, Donald Coleman and 15 more"
# pluses                  "270"
# type                    "Public"
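
Under the hood the idea is simple enough: parse the page into an HTML tree, then pull each field out of every post with an XPath expression. Here’s a minimal sketch of that approach. Note that the XPath expressions below are illustrative placeholders only (Google+’s real class names are machine-generated), so see the full code linked at the end for what the function actually does:

# A minimal sketch of the XPath approach. The class names in the XPath
# expressions are hypothetical stand-ins, not what Google+ really uses.
library(RCurl)
library(XML)

scrape_posts_sketch <- function(u) {
  # fetch the raw page source; the page is served over https, so go via
  # RCurl rather than handing the URL to htmlParse directly
  html <- getURL(u, ssl.verifypeer = FALSE)

  # parse the source into a queryable HTML document tree
  doc <- htmlParse(html, asText = TRUE)

  # one XPath expression per field of interest; xpathSApply returns the
  # text value of every matching node, i.e. one element per post
  posted.by <- xpathSApply(doc, "//div[@class='post']//span[@class='author']", xmlValue)
  message   <- xpathSApply(doc, "//div[@class='post']//div[@class='message']", xmlValue)

  # free the underlying C-level document to avoid leaking memory
  free(doc)

  data.frame(posted.by = posted.by, message = message, stringsAsFactors = FALSE)
}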

You simply supply the function with a Google+ posts page URL and it scrapes whatever information it can off each post on the page. It doesn’t load more posts beyond the initial set because I don’t really understand how to do that. The HTML element which refers to loading more posts is:

<span role="button" title="Load more posts" tabindex="0" style="">More</span>

but how one would use that is beyond me. I think it’s probably something to do with JavaScript, but as far as I know R has no way of accessing that. Plus, I don’t know JavaScript. This makes the function of limited usability. One way around this limitation, however (and it’s something I’m doing with my Facebook wall post page scraper), is to supply the HTML file of the Google+ posts page instead: press the ‘more’ button as many times as you like, save the page as an .html file on your disk, and then give that file path directly to the scrape_google_plus_page function, which will automatically do the rest.

u <- file.choose()
df <- scrape_google_plus_page(u, is.URL = FALSE)
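
Incidentally, the is.URL switch presumably just changes how the document gets parsed before the same XPath extraction runs. Something along these lines would do it (a hypothetical helper for illustration, not the actual implementation; see the full code below):

# get_html_doc is a hypothetical illustration of the is.URL switch
get_html_doc <- function(u, is.URL = TRUE) {
  if (is.URL) {
    # live page: fetch the source over https first, then parse the text
    html <- RCurl::getURL(u, ssl.verifypeer = FALSE)
    XML::htmlParse(html, asText = TRUE)
  } else {
    # saved page: htmlParse reads a local .html file path directly
    XML::htmlParse(u)
  }
}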

The full code can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/scrape_google_plus_page.R

P.S. I’m new to GitHub in terms of being someone who uploads code, but it does seem very useful. And cool. Bow-tie cool. Yeah.

