Web Scraping Google+ via XPath

Posted on November 11, 2011 by Tony Breyal in R bloggers | 0 Comments

[This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers.]

Google+ just opened up to allow brands, groups, and organizations to create their very own public Pages on the site. This didn’t bother me too much, but I’ve been hearing a lot about Google+ lately, so I figured it might be fun to set up an XPath scraper to extract information from each post on a Google+ posts page. I was originally going to do one for Facebook but this just seemed more interesting, so maybe I’ll leave that for next week if I get time. Anyway, here’s how it works (full code link at end of post):

u <- "https://plus.google.com/110286587261352351537/posts"
df <- scrape_google_plus_page(u, is.URL = TRUE)
t(df[2, ])
# posted.by               "Felicia Day"
# ID                      "110286587261352351537"
# message                 "Commentary vid for DA: R Ep 5!"
# message.share.link.name "Dragon Age: Redemption Ep5 Annotation"
# message.share.link      "https://www.youtube.com/watch?v=GZ4NGa0qeaM"
# post.date               "2011-11-09"
# comments                "67"
# comments.by             "Christopher H, Jesse McGlothlin, Ronel Villeno, Sealavindo Marine, Alexander Pinckard and 1 more"
# sample.comments         "Christopher H  -  Watching your commentary videos are like watching your video blog in The Guild! :)    "
# shares                  "20"
# shares.by               "Amy Mayer, Bonnie Zabytko, Brad Chasenore, Dark Matter fanzine, Donald Coleman and 15 more"
# pluses                  "270"
# type                    "Public"
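
As a rough illustration of the XPath approach behind a call like the one above, here is a minimal sketch using the XML package. It is not the code from scrape_google_plus_page: the HTML snippet, its class names, and the helper first_value are all made up for the example, and the real Google+ markup will differ.

```r
library(XML)  # provides htmlParse(), getNodeSet(), xpathSApply()

# Made-up markup standing in for a posts page. The class names are
# hypothetical and do not reflect Google+'s real HTML.
html <- '
<html><body>
  <div class="post">
    <a class="author" href="/110286587261352351537">Felicia Day</a>
    <div class="message">Commentary vid for DA: R Ep 5!</div>
    <span class="pluses">270</span>
  </div>
  <div class="post">
    <a class="author" href="/110286587261352351537">Felicia Day</a>
    <div class="message">A placeholder post with no pluses</div>
  </div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: text of the first node matching `path` below
# `node`, or NA when nothing matches, so each post yields one value.
first_value <- function(node, path) {
  vals <- xpathSApply(node, path, xmlValue)
  if (length(vals) == 0) NA_character_ else vals[[1]]
}

# One node per post, then relative XPath queries inside each post.
posts <- getNodeSet(doc, "//div[@class='post']")

df <- data.frame(
  posted.by = sapply(posts, first_value, path = ".//a[@class='author']"),
  message   = sapply(posts, first_value, path = ".//div[@class='message']"),
  pluses    = sapply(posts, first_value, path = ".//span[@class='pluses']"),
  stringsAsFactors = FALSE
)

t(df[1, ])
```

Querying each post node separately, rather than running one XPath over the whole page per field, keeps the columns aligned when a post lacks a field: the missing value becomes NA instead of shifting later rows up.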

You simply supply the function with the URL of a Google+ posts page and it scrapes whatever information it can from each post on the page. It doesn’t load more posts after the initial set because I don’t really understand how to do that. The HTML element which refers to loading more posts is:

<span role="button" title="Load more posts" tabindex="0" style="">More</span>

but how one would use that is beyond me. I think it probably has something to do with JavaScript, but as far as I know R has no way of accessing it. Plus, I don’t know JavaScript. This limits the function’s usefulness. One way around this limitation, however (and it’s something I’m doing with my Facebook wall post page scraper), is to save the Google+ posts page to disk as an .html file after pressing the ‘More’ button as many times as you desire, and then give that file path directly to the scrape_google_plus_page function, which will automatically do the rest.

u <- file.choose()
df <- scrape_google_plus_page(u, is.URL = FALSE)
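
The saved-file route can be sketched with the XML package as follows. This is only an illustration under assumptions, not part of scrape_google_plus_page: the temporary file stands in for a page saved from the browser, only the <span> line is taken from the element quoted earlier, and I have not verified that the button stays in the saved markup exactly while more posts remain to be loaded.

```r
library(XML)  # provides htmlParse(), getNodeSet(), xmlValue()

# Write a tiny stand-in for a saved posts page to a temporary file.
# Everything except the <span> line is filler for the example.
saved_page <- tempfile(fileext = ".html")
writeLines(c(
  "<html><body>",
  "<div>first post</div>",
  "<span role=\"button\" title=\"Load more posts\" tabindex=\"0\" style=\"\">More</span>",
  "</body></html>"
), saved_page)

# htmlParse() accepts a file path as well as a URL or a string.
doc <- htmlParse(saved_page)

# Look for the 'More' button by its attributes. XPath can find the
# element in static HTML, but it cannot click it.
more_button <- getNodeSet(
  doc, "//span[@role='button' and @title='Load more posts']"
)
has_more <- length(more_button) > 0

if (has_more) {
  message("The saved page still contains a 'Load more posts' button.")
}
```

A check like this could serve as a reminder that the saved file may not hold every post, which is the limitation described above.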

The full code can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/scrape_google_plus_page.R

P.S. I’m new to GitHub in terms of being someone who uploads code, but it does seem very useful. And cool. Bow-tie cool. Yeah.
