Rescuing Twapperkeeper Archives Before They Vanish, Redux

Posted on December 11, 2011 by Tony Hirst in R bloggers | 0 Comments

[This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers.]

In Rescuing Twapperkeeper Archives Before They Vanish, I described a routine for grabbing Twapperkeeper archives, parsing them, and saving them to a local desktop file using the R programming language (downloading RStudio is the easiest way I know of getting R…).

Following a post from @briankelly (Responding to the Forthcoming Demise of TwapperKeeper), in which Brian described how to look up all the archives saved by a person on Twapperkeeper and use that as the basis for an archive rescue strategy, I thought I’d tweak my code to grab all the hashtag archives for a particular user. (Other archives are also available, such as search term archives; I don’t grab the list of these… If you fancy generalising the code, please post a link to it in the comments;-)

What should have been a trivial task didn’t work, of course: the R XML parser seemed to choke on some of the archive files, complaining that they weren’t in the UTF-8 encoding they claimed to be. Character encodings are still something that I don’t understand at all (and more than a few times have caused me to give up on a hack), but on the off chance, I tried using a more resilient file loader (curl, via the RCurl package, if that means anything to you…;-) rather than the XML package loader, and it seems to do the trick (warnings are still raised, but that’s an improvement on errors, which tend to cause everything to stop).
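
If the parser still complains, one further thing to try is stripping out any bytes that aren't valid UTF-8 before parsing. What follows is a minimal sketch in base R, on the assumption that the problem is the odd stray byte rather than a wholly different encoding; the function name sanitiseUTF8 and the sample string are made up for illustration and are not part of the routine in this post.

```r
# Hypothetical helper (not part of the original routine): drop any bytes that
# are not valid UTF-8 from a character vector, using base R's iconv().
sanitiseUTF8 <- function(x) {
  # sub = "" replaces each non-convertible byte with nothing
  iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
}

# A made-up example: "caf", a stray Latin-1 byte (0xE9), then " ok"
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x20, 0x6f, 0x6b)))
clean <- sanitiseUTF8(bad)

validUTF8(bad)    # is the raw text valid UTF-8?
validUTF8(clean)  # and after cleaning?
```

In the revised routine, the place to try it would be on the text returned by getURL(), just before the parse step. Note that the offending characters are silently lost, which may or may not matter for an archive.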

Anyway, here’s the revised code, along with an additional routine for grabbing all the hashtag archives saved on Twapperkeeper by a named individual. If I get a chance (i.e. when I learn how to do it!), I’ll add in a line or two that will grab all the archives from a list of named individuals…

require(XML)
require(stringr)
require(RCurl)

#Helper functions to remove the @ symbol from user names and the # symbol from hashtags...
trim <- function (x) sub('@','',x)
tagtrim <- function (x) sub('#','',x)

twapperkeeperRescue=function(hashtag,num=10000){
    #startCrib: http://libreas.wordpress.com/2011/12/09/twapperkeeper/
    #tweak - reduce to a grab of 10000 archived tweets
    url <- paste("http://twapperkeeper.com/rss.php?type=hashtag&name=",hashtag,"&l=",num, sep="")
    print(url)
    #This is a hackfix I tried on spec - use the RCurl library to load in the file...
    lurl=getURL(url)
    #...then parse it, rather than loading it in directly using the XML parser...
    doc <- xmlTreeParse(lurl,asText=TRUE,useInternalNodes=TRUE,encoding = "UTF-8")
    tweet <- xpathSApply(doc, "//item//title", xmlValue)  
    pubDate <- xpathSApply(doc, "//item//pubDate", xmlValue)
    #endCrib
    #keep the columns as character strings rather than factors
    df=data.frame(tweet,pubDate,stringsAsFactors=FALSE)
    print('...extracting from...')
    df$from=sapply(df$tweet,function(tweet) str_extract(tweet,"^([[:alnum:]_]*)"))
    print('...extracting id...')
    #the id is the run of digits (plus any stray whitespace) at the very end of the title
    df$id=sapply(df$tweet,function(tweet) str_trim(str_extract(tweet,"[[:digit:][:space:]]*$")))
    print('...extracting txt...')
    df$txt=sapply(df$tweet,function(tweet) str_trim(str_replace(str_sub(str_replace(tweet,'- tweet id [[:digit:][:space:]]*$',''),end=-35),"^([[:alnum:]_]*:)",'')))
    print('...extracting to...')
    df$to=sapply(df$txt,function(tweet) trim(str_extract(tweet,"^(@[[:alnum:]_]*)")))
    print('...extracting rt...')
    df$rt=sapply(df$txt,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
    return(df)
}

#usage: 
#tag='ukdiscovery'
#twarchive.df=twapperkeeperRescue(tag)

#if you want to save the parsed archive:
twapperkeeperSave=function(hashtag,num=10000,path='./'){
    tweet.df=twapperkeeperRescue(hashtag,num)
    #sep="" stops paste() putting spaces into the filename
    fn <- paste(path,"twArchive_",hashtag,".csv",sep="")
    write.csv(tweet.df,fn)
}
#usage:
#twapperkeeperSave(tag)


#The following function grabs a list of hashtag archives saved by a given user
# and then rescues each archive in turn...
twapperkeeperUserRescue=function(uname='psychemedia',num=10000){
	#This routine only grabs hashtag archives;
	#Search archives and other archives can also be identified and downloaded if you feel like generalising this bit of code...;-)
	url=paste('http://twapperkeeper.com/allnotebooks.php?type=hashtag&name=&description=&tag=&created_by=',uname,sep='')
	archives=readHTMLTable(url,which=2,header=T)
	archives$Name=sapply(archives$Name,function(tag) tagtrim(tag))
	mapply(twapperkeeperSave,archives$Name,num)
}
#usage:
#user='psychemedia'
#twapperkeeperUserRescue(user)
#twapperkeeperUserRescue(user,1000)
#The numerical argument is the number of archived tweets you want to save (max 50000)
#Note to self: need to trap this maxval...
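
The two loose ends mentioned above (trapping the maximum archive size, and rescuing archives for a list of named individuals) might be handled along the following lines. This is a sketch only, not tested against Twapperkeeper itself: capNum and rescueUsers are made-up names, the 50000 limit is simply taken from the comment above, and the rescue function is passed in as an argument so that the loop can be tried out with a stand-in.

```r
# Hypothetical helper: clamp the requested number of tweets to the range 1..maxval
capNum <- function(num, maxval = 50000) {
  as.integer(max(1, min(num, maxval)))
}

# Hypothetical wrapper: call a per-user rescue function for each user in turn.
# rescueFn is assumed to take (uname, num), as twapperkeeperUserRescue does.
# A failure for one user is reported and skipped so the remaining users still run.
rescueUsers <- function(users, rescueFn, num = 10000) {
  num <- capNum(num)
  results <- lapply(users, function(uname) {
    tryCatch(
      rescueFn(uname, num),
      error = function(e) {
        message("Rescue failed for ", uname, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(results) <- users
  results
}

# Stand-in for twapperkeeperUserRescue, so the loop can be tried offline
fakeRescue <- function(uname, num) {
  if (uname == "nosuchuser") stop("no archives found")
  list(uname = uname, num = num)
}
res <- rescueUsers(c("psychemedia", "nosuchuser"), fakeRescue, num = 99999)
```

With the real functions loaded, the call to try would be something like rescueUsers(c('psychemedia','anotheruser'), twapperkeeperUserRescue).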

Now… do I build some archive analytics and visualisation on top of this, or do I have a play with building an archive rescuer in Scraperwiki?!