R, Twitter and URLs

August 26, 2013

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Yesterday evening, I wanted to play with Twitter and see which websites I was using as references in my tweets, to get a Top 4 list.
The first problem I ran into was that installing twitteR on Ubuntu is not that simple! You have to install RCurl properly first. But before you install that package in R, it is necessary to run the following line in a terminal
$ sudo apt-get install libcurl4-gnutls-dev
then, launch R
$ R
and then you can run the standard
> install.packages("RCurl")
and install finally the package of interest,
> install.packages("twitteR")
Then, the second problem I had was that twitteR had recently been updated because of Twitter’s new API. Now, you have to register on Twitter’s developers webpage, get an Id and a password, then use them in the following function (I changed both of them below, so if you try to run the following code, you will probably get an error message),
> library(twitteR)
> cred <- getTwitterOAuth("ikzCtYif9Rwoood45w","rsCCifp99kw5sJfKfOUhhwyVmPl9A")
> registerTwitterOAuth(cred)
[1] TRUE
> T <- userTimeline('freakonometrics',n=5000)
You will also be asked to open a webpage in your browser and to enter the PIN that it gives you:
To enable the connection, please direct your web browser to:

http://api.twitter.com/oauth/authorize?oauth_token=cQaDmxGe...

When complete, record the PIN given to you and provide it here:
It is a pain, trust me. Anyway, I was able to run it. I now have the list of all my (recent) tweets
> T <- userTimeline('freakonometrics',n=5000)

Now, my (third) problem was to extract the URLs of the references from my tweets. When you look at the text of the second tweet of the list, you see

> T[[2]]
[1] "freakonometrics: [textmining] \"How a Computer Program Helped Reveal J. K. Rowling as Author of A Cuckoos Calling\" http://t.co/wdmBGL8cmj by @garethideas"
So what I get is not the URL used in my tweet, but a shortened version of it, from http://t.co/. Fortunately, @3wen (as always) was able to help me with the following functions,
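> library(RCurl)  # provides getURL()
> # Return the first captured group of the pattern 'motif' in 'entree', or NA
> extraire <- function(entree,motif){
+	res <- regexec(motif,entree)
+	if(length(res[[1]])==2){
+		debut <- (res[[1]])[2]
+		fin <- debut+(attr(res[[1]],"match.length"))[2]-1
+		return(substr(entree,debut,fin))
+	}else return(NA)}
> # Request only the headers of the short URL, without following the redirect,
> # and read the target from the "location" header
> unshorten <- function(url){
+	uri <- getURL(url, header=TRUE, nobody=TRUE, followlocation=FALSE, 
+       cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
+	res <- try(extraire(uri,"\r\nlocation: (.*?)\r\nserver"))
+	return(res)}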

Now, if we use those functions, we can get the true url,

> url <- "http://t.co/wdmBGL8cmj"
> unshorten(url)
[1] http://www.scientificamerican.com/article.cfm?id=how-a-computer-program-helped-show..
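To see what the regular expression is doing without calling Twitter, here is a minimal offline sketch. The header text below is made up (the target URL and the other header lines are placeholders), and first_group() is a hypothetical helper, not one of the functions above. It matches the location line on its own, so it does not depend on which header happens to follow it.

```r
# Hypothetical raw header text, shaped like what a header-only request
# returns for a redirect; the target URL is a placeholder.
sample_header <- paste0(
  "HTTP/1.1 301 Moved Permanently\r\n",
  "date: Mon, 26 Aug 2013 08:00:00 GMT\r\n",
  "location: http://www.example.com/some/article\r\n",
  "server: tfe\r\n\r\n"
)

# Return the first capture group of `pattern` in `text`,
# or NA when the pattern does not match.
first_group <- function(text, pattern) {
  m <- regmatches(text, regexec(pattern, text, ignore.case = TRUE))[[1]]
  if (length(m) == 2) m[2] else NA_character_
}

# Capture everything after "location: " up to the end of that line.
target <- first_group(sample_header, "\r\nlocation: ([^\r\n]*)")
print(target)
```

When the header has no location line (no redirect), first_group() returns NA, which is the same convention as extraire.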
Now I can play with my list, to extract urls, and the address of the website,
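> # Assumption: T_text is not defined in the post; it is taken here to be the vector of tweet texts
> T_text <- sapply(T, function(x) x$getText())
> exturl <- function(i){
+ text_tw <- T_text[i]
+ locunshort2 <- NULL
+ # positions of the words that start with "http"
+ indtext <- which(substr(unlist(strsplit(text_tw, " ")),1,4)=="http")
+ if(length(indtext)>0){
+ # keep the first link only, so that the test below is on a single value
+ loc <- unlist(strsplit(text_tw, " "))[indtext[1]]
+ locunshort=unshorten(loc)
+ if(is.na(locunshort)==FALSE){
+ # "http://host/path" split on "/" gives "http:", "", "host", ...
+ locunshort2 <- unlist(strsplit(locunshort, "/"))[3]}}
+ return(locunshort2)}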
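The two string manipulations in that function can be tried on their own, without any network access. This is only a sketch: the tweet text and the resolved URL below are invented for the illustration.

```r
# Hypothetical tweet text and already-resolved URL, for illustration only.
tweet_text <- "[textmining] an interesting read http://t.co/abc123 by @someone"
resolved   <- "http://www.example.com/article.cfm?id=42"

# Step 1: split the tweet on spaces and keep the words starting with "http".
tokens <- unlist(strsplit(tweet_text, " "))
links  <- tokens[substr(tokens, 1, 4) == "http"]

# Step 2: splitting "http://host/path" on "/" gives "http:", "", "host", ...
# so the third piece is the host name.
host <- unlist(strsplit(resolved, "/"))[3]

print(links)
print(host)
```

Splitting on "/" is crude, but it is enough here since only the host name is counted afterwards.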
Applying this function to each tweet of my list (with sapply, for instance), and counting with a simple table() function, I can see that my top four reference websites (over more than 900 tweets) are the following:
             www.nytimes.com         www.guardian.co.uk 
                          19                         21 
      www.washingtonpost.com             www.lemonde.fr 
                          21                         22
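The counting step can be sketched as follows, assuming the host names have already been collected in a character vector. The vector below is invented for the illustration; on the real data it would be the unlisted output of the function above, and the number of websites kept would be four instead of two.

```r
# Hypothetical host names, standing in for the hosts extracted from the tweets.
hosts <- c("www.example.org", "www.example.com", "www.example.net",
           "www.example.com", "www.example.org", "www.example.com",
           "www.example.edu")

# Count how many times each host appears.
counts <- table(hosts)

# Sort by decreasing count and keep the first two entries.
top <- sort(counts, decreasing = TRUE)[seq_len(2)]
print(top)
```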
Nice, isn’t it?
