Google+ just opened up to allow brands, groups, and organizations to create their very own public Pages on the site. This didn’t bother me too much, but I’ve been hearing a lot about Google+ lately, so I figured it might be fun to set up an XPath scraper to extract information from each post on a status update page. I was originally going to do one for Facebook but this just seemed more interesting, so maybe I’ll leave that for next week if I get time. Anyway, here’s how it works (full code link at end of post):
u <- "https://plus.google.com/110286587261352351537/posts"
df <- scrape_google_plus_page(u, is.URL = TRUE)
t(df[2, ])
# posted.by               "Felicia Day"
# ID                      "110286587261352351537"
# message                 "Commentary vid for DA: R Ep 5!"
# message.share.link.name "Dragon Age: Redemption Ep5 Annotation"
# message.share.link      "http://www.youtube.com/watch?v=GZ4NGa0qeaM"
# post.date               "2011-11-09"
# comments                "67"
# comments.by             "Christopher H, Jesse McGlothlin, Ronel Villeno, Sealavindo Marine, Alexander Pinckard and 1 more"
# sample.comments         "Christopher H - Watching your commentary videos are like watching your video blog in The Guild! 🙂 "
# shares                  "20"
# shares.by               "Amy Mayer, Bonnie Zabytko, Brad Chasenore, Dark Matter fanzine, Donald Coleman and 15 more"
# pluses                  "270"
# type                    "Public"
You simply supply the function with a Google+ post page URL and it scrapes whatever information it can from each post on the page. It only sees the initial set of posts; it doesn’t load more data after that because I don’t really understand how to do it. The HTML element which refers to loading more posts is:
<span role="button" title="Load more posts" tabindex="0" style="">More</span>
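Finding that button with XPath is the easy part; it’s the clicking that a static scraper can’t do. Here’s a minimal sketch of the finding part, assuming the XML package is installed. The page fragment is made up for illustration (only the span is copied from above), and this isn’t taken from the function itself:

```r
library(XML)

# Hypothetical page fragment; only the <span> is copied from the post above
html <- '<html><body>
  <div>some post</div>
  <span role="button" title="Load more posts" tabindex="0" style="">More</span>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Select the span by its role and title attributes
more_button <- getNodeSet(doc, "//span[@role='button'][@title='Load more posts']")

# TRUE when the page advertises further posts that were not loaded
has_more <- length(more_button) > 0
button_label <- if (has_more) xmlValue(more_button[[1]]) else NA_character_
```

So a scraper could at least report that it stopped short, even if it can’t fetch the rest.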
The function also works on a page saved to disk; set is.URL = FALSE and pass it the file path:
u <- file.choose()
df <- scrape_google_plus_page(u, is.URL = FALSE)
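To give a flavour of what an XPath scraper like this does for each post, here’s a minimal sketch, again assuming the XML package. The markup and class names (post, author, message, pluses) and the helper first_value are all made up for illustration; Google+’s real markup is different, and this is not the code in the linked file:

```r
library(XML)

# Hypothetical markup: two posts, the second one has no +1 count
html <- '<html><body>
  <div class="post"><a class="author">Felicia Day</a><div class="message">Commentary vid for DA: R Ep 5!</div><span class="pluses">270</span></div>
  <div class="post"><a class="author">Felicia Day</a><div class="message">Another update</div></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# One node per post; each field is then looked up relative to its post
posts <- getNodeSet(doc, "//div[@class='post']")

# Text of the first match under a node, or NA when the field is absent
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Build one data frame row per post and stack them
df <- do.call(rbind, lapply(posts, function(p) {
  data.frame(
    posted.by = first_value(p, ".//a[@class='author']"),
    message   = first_value(p, ".//div[@class='message']"),
    pluses    = first_value(p, ".//span[@class='pluses']"),
    stringsAsFactors = FALSE
  )
}))
```

Looking up each field relative to its post node (the leading "." in the path) keeps values from different posts from getting mixed up, and returning NA for a missing field keeps every row the same shape.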
The full code can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/scrape_google_plus_page.R
P.S. I’m new to GitHub as someone who uploads code, but it does seem very useful. And cool. Bow-tie cool. Yeah.