R, Twitter and URLs

August 26, 2013
Yesterday evening, I wanted to play with Twitter, and see which websites I was using as references in my tweets, to get a Top 4 list.
The first problem I got was because installing twitteR on Ubuntu is not that simple ! You have to install properly RCurl… But you before install the package in R, it is necessary to run the following line in a terminal
$ sudo apt-get install 
then, launch R
$ R
and then you can run the standard
> install.packages("RCurl")
and install finally the package of interest,
> install.packages("twitteR")
Then, the second problem I had was that twitteR has been updated recently because of Twitter’s new API. Now, you should register on Twitter’s developers webpage, get an Id and a password, then use it in the following function (I did change both of them, below, so if you try to run the following code, you will – probably – get an error message),
> library(twitteR)
> cred <- getTwitterOAuth("ikzCtYif9Rwoood45w","rsCCifp99kw5sJfKfOUhhwyVmPl9A")
> registerTwitterOAuth(cred)
[1] TRUE
> T <- userTimeline('freakonometrics',n=5000)
you should also go on some webpage and enter a PIN that you find online.
To enable the connection, please direct your web browser to:


When complete, record the PIN given to you and provide it here:
It is a pain in ass, trust me. Anyway, I have be able to run it. I can now have the list with all my (recent) tweets
> T <- userTimeline('freakonometrics',n=5000)

Now, my (third) problem was to extract from my tweets the url of references. The second tweet of the list was

But when you look at the text, you see

> T[[2]]
[1] "freakonometrics: [textmining] \"How a Computer Program Helped Reveal J. K. 
Rowling as Author of A Cuckoos Calling\" http://t.co/wdmBGL8cmj by @garethideas"
So what I get is not the url used in my tweet, but a shortcut to the urls, from http://t.co/. Hopefully, @3wen (as always) has been able to help me with the following functions,
> extraire <- function(entree,motif){
+	res <- regexec(motif,entree)
+	if(length(res[[1]])==2){
+		debut <- (res[[1]])[2]
+		fin <- debut+(attr(res[[1]],"match.length"))[2]-1
+		return(substr(entree,debut,fin))
+	}else return(NA)}
> unshorten <- function(url){
+	uri <- getURL(url, header=TRUE, nobody=TRUE, followlocation=FALSE, 
+       cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
+	res <- try(extraire(uri,"\r\nlocation: (.*?)\r\nserver"))
+	return(res)}

Now, if we use those functions, we can get the true url,

> url <- "http://t.co/wdmBGL8cmj"
> unshorten(url)
[1] http://www.scientificamerican.com/article.cfm?id=how-a-computer-program-helped-show..
Now I can play with my list, to extract urls, and the address of the website,
> exturl <- function(i){
+ text_tw <- T_text[i]
+ locunshort2 <- NULL
+ indtext <- which(substr(unlist(strsplit(text_tw, " ")),1,4)=="http")
+ if(length(indtext)>0){
+ loc <- unlist(strsplit(text_tw, " "))[indtext]
+ locunshort=unshorten(loc)
+ if(is.na(locunshort)==FALSE){
+ locunshort2 <- unlist(strsplit(locunshort, "/"))[3]}}
+ return(locunshort2)}
Using apply with this function, and my list, and counting using a simple table() function, I can see that my top four (over more than 900 tweets) of reference websites is the following:
             www.nytimes.com         www.guardian.co.uk 
                          19                         21 
      www.washingtonpost.com             www.lemonde.fr 
                          21                         22
Nice, isn’t it ?

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.