# Do you still have time to sleep?

June 11, 2012

Last week, @3wen (Ewen) helped me write some nice R functions to extract tweets and build datasets containing a lot of information. I had tried a couple of times on my own: once on tweet contents (it was not convincing), and once on Twitter activity following an event (e.g. the death of someone famous). I have to admit that I am not a big fan of the databases generated by the standard functions used to study tweets. For instance, we can only extract tweets, not re-tweets (which are also an important indicator of Twitter activity). @3wen suggested using
require("RJSONIO")
The first step is to extract some information from a tweet, and store it in a dataset (details can be found on https://dev.twitter.com/)
obtenir_ligne <- function(unTweet){
# tweet-level information: date, id, text, and some user-level counts
date_courante=unTweet$created_at
id_courant=unTweet$id_str
text=unTweet$text
nb_followers=unTweet$user$followers_count
nb_amis=unTweet$user$friends_count
utc_offset=unTweet$user$utc_offset
# accounts mentioned in the tweet
listeMentions=unTweet$entities$user_mentions
return(c(list(c(id_courant,date_courante,text,
nb_followers,nb_amis,utc_offset)),
list(do.call("rbind",lapply(listeMentions,
function(x,id_courant) c(id_courant,
x$screen_name),unTweet$id_str)))))
}

Now that we have the code to extract information from one tweet, let us get several tweets, from one user, say my account,

nom="Freakonometrics"

The (small) problem here is that there is a limitation: we can only get 100 tweets per call of the function,

n=100
tweets_courants=scan(paste(
"http://api.twitter.com/1/statuses/user_timeline.json?",
"include_entities=true&include_rts=true&screen_name=",
nom,"&count=",n,sep=""),what = "character",
encoding="latin1")
tweets_courants=paste(tweets_courants[
1:length(tweets_courants)],collapse=" ")
tweets_courants=fromJSON(tweets_courants,
method = "C")

Then, we use our function to build a dataset with 100 rows,

extracTweets <- lapply(tweets_courants,
obtenir_ligne)
mentions=do.call("rbind",lapply(extracTweets,
function(x) x[[2]]))
colnames(mentions)=list("id","screen_name")
res=t(sapply(extracTweets,function(x) x[[1]]))
colnames(res) <- list("id","date","text",
"nb_followers","nb_amis","utc_offset")

The idea then is simply to loop, using the id of the last tweet observed,

dernier_id=tweets_courants[[length(
tweets_courants)]]$id_str
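Before launching the full loop, a quick sanity check of obtenir_ligne() on a hand-built toy tweet (completely made-up values, just to see the shape of the output: a vector of tweet-level information, and a matrix of mentions),

faux=list(created_at="Mon Jun 11 08:30:12 +0000 2012",
id_str="123456",text="bonjour @3wen",
user=list(followers_count=100,friends_count=50,
utc_offset=-18000),
entities=list(user_mentions=list(
list(screen_name="3wen"))))
obtenir_ligne(faux)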
So, here we go,
compteurLimite=100

while(compteurLimite<4100){
tweets_courants=scan(paste(
"http://api.twitter.com/1/statuses/user_timeline.json?",
"include_entities=true&include_rts=true&screen_name=",
nom,"&count=",n,"&max_id=",dernier_id,sep=""),
what = "character", encoding="latin1")
tweets_courants=paste(tweets_courants[
1:length(tweets_courants)],collapse=" ")
tweets_courants=fromJSON(tweets_courants,
method = "C")
# max_id is inclusive, so the first tweet of each
# batch is the one we already have: drop it
extracTweets <- lapply(tweets_courants[
2:length(tweets_courants)],obtenir_ligne)
mentions=rbind(mentions,do.call("rbind",
lapply(extracTweets,function(x) x[[2]])))
res=rbind(res,t(sapply(extracTweets,function(x) x[[1]])))
dernier_id=tweets_courants[[length(
tweets_courants)]]$id_str
compteurLimite=compteurLimite+100
}
resFreakonometrics=res=
data.frame(res,stringsAsFactors=FALSE)

All the information about my own tweets (and re-tweets) is now stored in a nice dataset. Actually, we have even more, since we have also extracted the names of the people mentioned in the tweets,

mentionsFreakonometrics=
data.frame(mentions)

We can look at the people I mention in my tweets,

gazouillis=sapply(split(mentionsFreakonometrics,
mentions$screen_name),nrow)
gazouillis=gazouillis[order(gazouillis,
decreasing=TRUE)]

plot(gazouillis)
plot(gazouillis,log="xy")
> gazouillis[1:20]
J_P_Boucher         embruns      SkyZeLimit        coulmont
        155              84              77              56
 Fabrice_BM            3wen          obouba          msotod
         42              39              35              31
     StatFr     nholzschuch        renaudjf        squintar
         31              30              29              27
    Vicnent        pareto35        romainqc        valatini
         26              25              23              23
         23              22              22              22
If we plot those frequencies, we can clearly observe a standard Pareto distribution.
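One informal way to check that Pareto claim: on the log-log scale, frequency against rank should be roughly linear. A quick sketch, reusing the gazouillis vector built above,

freq=as.numeric(gazouillis)
rang=seq_along(freq)
reg=lm(log(freq)~log(rang))
coef(reg) # the slope estimates (minus) the tail exponent
plot(rang,freq,log="xy")
lines(rang,exp(predict(reg)),lwd=2,col="red")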
Now, let us spend some time with the dates and times of the tweets (that was the initial goal of this post)... One more time, there is a (small) technical problem to deal with: language. We need a function to convert the dates from English (Twitter's output) to French (since I run a French version of R),
changer_date_anglais <- function(date_courante){
mois <- c("Jan","Fév", "Mar", "Avr", "Mai",
"Jui", "Jul", "Aoû", "Sep", "Oct", "Nov", "Déc")
months <- c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
jours <- c("Lun","Mar","Mer","Jeu",
"Ven","Sam","Dim")
days <- c("Mon","Tue","Wed","Thu",
"Fri","Sat","Sun")
leJour <- substr(date_courante,1,3)
leMois <- substr(date_courante,5,7)
return(paste(jours[match(leJour,days)]," ",
mois[match(leMois,months)],substr(
date_courante,8,nchar(date_courante)),sep=""))
}
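For instance, on a (made-up) Twitter-style date string,

changer_date_anglais("Mon Jun 11 08:30:12 +0000 2012")
# [1] "Lun Jui 11 08:30:12 +0000 2012"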
So now, it is possible to plot the times when I am online, tweeting,
DATE=Vectorize(changer_date_anglais)(res$date)
# equivalently: DATE=sapply(res$date,changer_date_anglais,simplify=TRUE)

DATE2=strptime(as.character(DATE),
"%a %b %d %H:%M:%S %z %Y")
lt= as.POSIXlt(DATE2, origin="1970-01-01")
heure=lt$hour+lt$min/60
plot(DATE2,heure)

On this graph, we can see clearly that there are almost 6 hours a day when I am not online (or at least not on Twitter). It is possible to visualize more precisely the periods of the day when I might be on Twitter,

hist(heure,breaks=0:24,col="light green",proba=TRUE)
# replicate the sample on [-24,0) and [24,48) so that the kernel
# density wraps around midnight; the tripled sample integrates
# to 1 over three days, hence the rescaling by 3
X=c(heure-24,heure,heure+24)
d=density(X,n = 512, from=0, to=24,bw=1)
lines(d$x,d$y*3,lwd=3,col="red")

or, if we want, we can illustrate it with some kind of heat plot.
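A minimal substitute for such a heat plot, counting tweets per hour of the day and per day of the week, and passing the table to image() (a sketch reusing lt and heure computed above, not the code behind the original figure),

jour=lt$wday # 0 = Sunday, ..., 6 = Saturday
tab=table(factor(floor(heure),levels=0:23),
factor(jour,levels=0:6))
image(0:23,0:6,unclass(tab),col=heat.colors(12),
xlab="hour of the day",ylab="day of the week (0 = Sunday)")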

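The calls below rely on a function extractR(), which is not spelled out in this post; presumably it just wraps the extraction loop above into a function of the screen name. A sketch, under that assumption (not necessarily the exact code used here),

extractR=function(nom,n=100,maxTweets=4100){
tweets_courants=scan(paste(
"http://api.twitter.com/1/statuses/user_timeline.json?",
"include_entities=true&include_rts=true&screen_name=",
nom,"&count=",n,sep=""),what="character",
encoding="latin1")
tweets_courants=fromJSON(paste(tweets_courants,
collapse=" "),method="C")
extracTweets=lapply(tweets_courants,obtenir_ligne)
res=t(sapply(extracTweets,function(x) x[[1]]))
colnames(res)=list("id","date","text",
"nb_followers","nb_amis","utc_offset")
dernier_id=tweets_courants[[length(
tweets_courants)]]$id_str
compteurLimite=n
while(compteurLimite<maxTweets){
tweets_courants=scan(paste(
"http://api.twitter.com/1/statuses/user_timeline.json?",
"include_entities=true&include_rts=true&screen_name=",
nom,"&count=",n,"&max_id=",dernier_id,sep=""),
what="character",encoding="latin1")
tweets_courants=fromJSON(paste(tweets_courants,
collapse=" "),method="C")
extracTweets=lapply(tweets_courants[
2:length(tweets_courants)],obtenir_ligne)
res=rbind(res,t(sapply(extracTweets,
function(x) x[[1]])))
dernier_id=tweets_courants[[length(
tweets_courants)]]$id_str
compteurLimite=compteurLimite+n
}
data.frame(res,stringsAsFactors=FALSE)
}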
Note that we did this for my own Twitter account, but we can also run the code on (almost) anyone on Twitter. Consider e.g. @adelaigue. Since Alexandre tweets from France, we have to play with time zones,
res=extractR("adelaigue")DATE=Vectorize(changer_date_anglais)(res\$date)
DATE2=strptime(as.character(DATE),
"%a %b %d %H:%M:%S %z %Y",tz = "GMT")+2*60*60

Or I can look at @skythelimit, who usually tweets from Singapore (while I am in Montréal). We can see clearly when we might overlap,

res=extractR("skythelimit")

Nice, isn't it? But it is possible to do much better... For instance, for those who did not specifically ask not to be geolocated, we can see where they tweet during the day, and during the night... I am quite sure a dozen posts could be written with those functions...
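As a possible starting point for that, tweets in the v1 API carry a geo field when the user allowed geolocation, a point with latitude and longitude; a sketch (the field names are assumptions from the API documentation, not tested here),

# return the coordinates of a geotagged tweet, NA otherwise
obtenir_geo=function(unTweet){
if(!is.null(unTweet$geo)) return(unTweet$geo$coordinates)
return(c(NA,NA))
}
coords=do.call("rbind",lapply(tweets_courants,obtenir_geo))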
