An automatic code to extract tweets (and to produce the “Somewhere else” review)

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers.]

A few weeks ago, I asked in a post the (simple) question “dear reader, who are you?”, just to know more about the readers of my blog. I found the answers extremely interesting (even if, to be honest, I was expecting more of them, to start a more serious sociological study of my readership). One interesting point was that a lot of readers come for the “somewhere else” posts, which are reviews of interesting posts and articles found on the internet. The links I share there actually come from my tweets. I keep a backup of my tweets on my blog, and that is usually where I go when I want to find some article, graph, or map that I have in mind and have seen somewhere (but usually can’t remember where). But most of the time, I find writing those posts boring, because there is nothing new: it is simply a copy and paste from my tweets.

And this afternoon, @tomroud asked how those posts were written: was there an automatic procedure, or was I doing it manually? Until tonight, I was doing it manually. But because it looked like some kind of stupid challenge, I tried to write some code that generates a simple list of my tweets, which I can then use to produce a post.

Nevertheless, there are still two problems I cannot fix with code:

  • in my “somewhere else” posts, there was a language distinction, with posts and articles in English first, and then those in French. Unfortunately, I could not find a function that detects the language of a tweet. I remember that @3wen and I tried to write such code, but I could not find it… I guess @3wen had a first draft, so if we can find it, I will upload it on my blog (or he will upload it on his)
  • in my posts, I include the picture, if any. This part will still be done manually, because it is much more difficult (but I guess it is possible…)
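
For the language problem in the first point, a crude heuristic is possible with base R alone: count a few very common words of each language and pick the larger score. The sketch below is only an illustration under strong assumptions. The helper name `guess_language` and both word lists are made up here, it is not the draft written with @3wen, and short or mixed-language tweets will often come back undecided.

```r
# Hypothetical helper: guess whether a tweet is English or French by
# counting very common words of each language (illustrative lists only).
guess_language <- function(text) {
  stop_en <- c("the", "and", "of", "is", "in", "to", "with", "for", "on", "this")
  stop_fr <- c("le", "la", "les", "et", "des", "est", "un", "une", "du", "pour")
  # Lower-case, then split on whitespace and punctuation
  words <- strsplit(tolower(text), "[[:space:][:punct:]]+")[[1]]
  score_en <- sum(words %in% stop_en)
  score_fr <- sum(words %in% stop_fr)
  if (score_en == score_fr) return(NA_character_)  # undecided
  if (score_en > score_fr) "en" else "fr"
}

# Made-up tweets, one per language
tweets <- c("The distribution of wealth in the US is a puzzle",
            "Une carte des revenus pour la France et les villes")
langs <- vapply(tweets, guess_language, character(1), USE.NAMES = FALSE)
```

Tweets flagged `NA` would still have to be sorted by hand.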

Now, before starting, we will need functions from an old post, to convert Twitter’s shortened URLs to real ones,

library(RCurl)  # provides getURL()

# Return the first capture group of `motif` found in `entree`, or NA
extraire <- function(entree,motif){
res <- regexec(motif,entree)
if(length(res[[1]])==2){
 debut <- (res[[1]])[2]
 fin <- debut+(attr(res[[1]],"match.length"))[2]-1
return(substr(entree,debut,fin))
}else return(NA)}

# Request only the headers of a short URL and read its "location" target
unshorten <- function(url){
uri <- getURL(url, header=TRUE, nobody=TRUE, followlocation=FALSE, 
cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
res <- try(extraire(uri,"\r\nlocation: (.*?)\r\nserver"))
return(res)}
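
To see what the header parsing does without touching the network, the same idea can be tried on a hand-written response. The header block and the URL below are invented for illustration, and the pattern is a slightly looser variant (case-insensitive, stopping at the end of the line rather than requiring a `server` header right after), so treat it as a sketch rather than a drop-in replacement.

```r
# Hypothetical raw headers, shaped like a redirect answer from a URL
# shortener (the status line and URLs are made up for illustration).
headers <- paste0("HTTP/1.1 301 Moved Permanently\r\n",
                  "Location: http://example.com/some/long/article\r\n",
                  "Server: tfe\r\n\r\n")

# Capture everything after "location: " up to the end of that header line
m <- regmatches(headers,
                regexec("location: ([^\r\n]*)", headers, ignore.case = TRUE))[[1]]

# m[1] is the full match, m[2] the capture group; NA when nothing matched
target <- if (length(m) == 2) m[2] else NA_character_
```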

Now, let us consider the following code. The first step, of course, is to run some lines that will allow me to use Twitter's API,

require(twitteR)
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
apiKey <- "yourAPIkey"
apiSecret <- "yourAPIsecret"

twitCred <- OAuthFactory$new(consumerKey=apiKey,consumerSecret=apiSecret,requestURL=reqURL,accessURL=accessURL,authURL=authURL)

twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

registerTwitterOAuth(twitCred)

Then, I need to be cautious because some of my tweets are in French, and some weird symbols might appear,

Sys.setlocale("LC_CTYPE","fr_FR.UTF-8")

Now I can write my function,

somewhere_else <- function(){

tweets_freak <- searchTwitter("from:@freakonometrics", n = 500)

save(tweets_freak, file="somewhere_else.RData")

tweets_freak_df <- do.call("rbind", lapply(tweets_freak, as.data.frame))

text_tweets_freak <- tweets_freak_df$text

tweets_freak_message <- text_tweets_freak[which(substr(text_tweets_freak,1,1)!="@")]

SE <- which(substr(tweets_freak_message,1,15)=="“Somewhere else")
first_SE <- SE[1]

tweets_freak <- tweets_freak_message[1:(first_SE-1)]

substitute_id <- function(x){
split_x <- strsplit(x,"@")[[1]]
x_id <- paste(split_x,collapse="http://twitter.com/",sep="")
split_x_id <- strsplit(x_id,"http")
n <- length(split_x_id[[1]])
tweet_x <- strsplit(split_x_id[[1]]," ")

if(n==1) rt <- x_id
if(n>1){
for(i in 2:n){
url <- tweet_x[[i]][1]
split=FALSE
if(substr(url,nchar(url),nchar(url))%in%c(":",",",";",")","(")) split <- TRUE
if(split==FALSE) unshort_url <- unshorten(paste("http",url,sep=""))
if(split==TRUE) unshort_url <- unshorten(paste("http",substr(url,1,nchar(url)-1),sep=""))
tweet=FALSE
if(substr(url,4,10)=="twitter") tweet=TRUE
if((split==FALSE)&(tweet==FALSE)) tweet_x_2 <- c("<a href=\"",unshort_url,"\">",unshort_url,"</a>")
if((split==TRUE)&(tweet==FALSE)) tweet_x_2 <- c("<a href=\"",unshort_url,"\">",unshort_url,"</a>",substr(url,nchar(url),nchar(url)))
if((split==FALSE)&(tweet==TRUE)) tweet_x_2 <- c("<a href=\"",unshort_url,"\">@",substr(unshort_url,21,nchar(unshort_url)),"</a>")
if((split==TRUE)&(tweet==TRUE)) tweet_x_2 <- c("<a href=\"",unshort_url,"\">@",substr(unshort_url,21,nchar(unshort_url))
,"</a>",substr(url,nchar(url),nchar(url)))
tweet_x[[i]] <- c(tweet_x_2,tweet_x[[i]][-1])
}
rt <- paste("<li>",paste(unlist(tweet_x),collapse=" "),"</li>",sep="")
}
return(rt)
}

tweets_freak_sub <- lapply(tweets_freak, substitute_id)
write.table(unlist(tweets_freak_sub),file="tweets_somewhere_else.txt",quote=FALSE,row.names=FALSE)

cat("Number of tweets.....",length(tweets_freak_sub),"\n")
cat("File.................",paste(getwd(),"tweets_somewhere_else.txt",sep="/"),"\n")
cat("Done\n")
}
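
The selection step at the top of the function (drop replies, then keep only what was tweeted since the last “Somewhere else” announcement) can be tried in isolation on a toy timeline. All the tweets below are invented, and the variable names are only for this sketch.

```r
# Toy timeline, most recent first (made-up tweets for illustration)
timeline <- c("Interesting map of commuting times http://t.co/abc",
              "@someone thanks for the link!",
              "RT @otheruser: a nice graph http://t.co/def",
              "\u201cSomewhere else, part 99\u201d http://t.co/ghi",
              "An older tweet, already covered by the previous review")

# Drop replies, i.e. tweets whose first character is "@"
messages <- timeline[substr(timeline, 1, 1) != "@"]

# Locate the most recent review announcement and keep what is newer
marker <- "\u201cSomewhere else"
hits <- which(substr(messages, 1, nchar(marker)) == marker)
fresh <- if (length(hits) > 0) messages[seq_len(hits[1] - 1)] else messages
```

Using `seq_len()` instead of `1:(k-1)` is a small design choice: when the announcement is the very first tweet, it returns an empty vector rather than counting backwards from 1 to 0.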

The first tricky part was to recognize names mentioned in my tweets (since some of them are retweets). The second one was to create an HTML link each time there is a link (I did not take hashtags into account here). If I run it, I get

> somewhere_else()
Number of tweets..... 72 
File................. /home/arthur/tweets_somewhere_else.txt 
Done
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit,  :
  500 tweets were requested but the API can only return 191

If I make a copy and paste from the text file, I have

which makes sense, because those are indeed my most recent posts,

etc. I will have to spend some time to include pictures, graphs, maps, videos, etc, but that function should save me some time!
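
As a side note, the two tricky steps (names and links) can also be approximated with regular expressions. The sketch below is a simplified alternative, not the function used above: the name `linkify` is hypothetical, short URLs are left as they are instead of being expanded, and punctuation glued to the end of a URL would end up inside the link.

```r
# Hypothetical helper: wrap the URLs and @mentions of one tweet in HTML anchors
linkify <- function(tweet) {
  # URLs first, so that the twitter.com links added below are not re-wrapped
  out <- gsub("(https?://[^[:space:]]+)", "<a href=\"\\1\">\\1</a>", tweet)
  # Then @mentions: letters, digits and underscores after the "@"
  out <- gsub("@([A-Za-z0-9_]+)",
              "<a href=\"http://twitter.com/\\1\">@\\1</a>", out)
  # One list item per tweet, ready to paste into a post
  paste0("<li>", out, "</li>")
}

# Made-up retweet, for illustration
item <- linkify("RT @tomroud: a nice read http://example.com/post")
cat(item, "\n")
```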
