Who interacts on Twitter during a conference (#JDSLille)

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers.]

Disclaimer: This is a joint post with Avner Bar-Hen, a.k.a. @a_bh, Benjamin Guedj, a.k.a. @bguedj, and Nathalie Villa, a.k.a. @Natty_V2

Organised annually since 1970 by the French Society of Statistics (SFdS), the Journées de Statistique (JdS) are the most important scientific event of the French statistical community. More than 400 researchers, teachers and practitioners meet at each edition. In 2015, the JdS took place in Lille, France.

SFdS regularly tweets (with the account @Statfr) and, for the first year, a live-tweet was organized during the JdS. The hashtag was #JDSLille. The aim of this post is a (brief) statistical analysis of the live-tweet.

Before starting, let us load some appropriate packages (and clean our R session)

> rm(list=ls())
> graphics.off()
> library(twitteR)
> library(tm)
> library(ggplot2)
> library(wordcloud)
> library(igraph)
> library(stringr)
> library(sna)

Either you know how to scrape Twitter data (see https://rhandbook.wordpress.com/tag/registertwitteroauth/ for example) and you can use the following code (after entering all the standard security parameters)

> setup_twitter_oauth(.............)
> tweet =searchTwitter("#JDSLille",n=218)
> df <- do.call("rbind", lapply(tweet, as.data.frame))

or you can download the tweets and use the data frame directly

> load(url("http://freakonometrics.free.fr/JDSLille.RData"))

Now we have the data stored. Including the retweets (RT), we found 219 tweets, from 48 different users who each posted at least one tweet with the hashtag #JDSLille.

> length(tweet)
[1] 219

The most active Twitter account was @bguedj, who initiated the live-tweet. The second one is @Statfr, the official account of SFdS.

> counts=table(df$screenName)
> subset(counts,counts>9)
 
       bguedj  LaurenceBroze    Lionning13 
           46             12            24 
   melinaGALL          nc233        StatFr 
           12             15            26
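To make the counting step concrete, here is a minimal sketch on a toy data frame. The screen names below are made up for illustration and are not taken from the dataset; only the `table()` logic mirrors what is done above.

```r
# Toy data: one row per tweet (hypothetical screen names)
toy <- data.frame(screenName = c("alice", "bob", "alice", "carol", "alice", "bob"),
                  stringsAsFactors = FALSE)

# Number of tweets per account, most active first
toy_counts <- sort(table(toy$screenName), decreasing = TRUE)
print(toy_counts)

# Accounts with at least 2 tweets (same idea as counts>9 above)
active <- names(toy_counts)[as.vector(toy_counts) >= 2]
print(active)
```

Sorting the table first makes it easy to read off the ranking, which `subset()` alone does not give.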

Let us first have a look at the RTs. The graph below counts, for each Twitter account, how many times its tweets were retweeted

> df$text <- sapply(df$text,
+ function(x) iconv(x,to='UTF-8'))
> trim <- function (x) sub('@','',x)
> df$rt=sapply(df$text,function(tweet)
+ trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
> ggplot()+geom_bar(aes(x=na.omit(df$rt)))+
+ theme(axis.text.x=element_text(angle=-90,size=10))+xlab(NULL)
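The extraction of the retweeted account can be checked on a few hand-written tweets. The sketch below is a base-R variant (it uses `sub()` and `grepl()` rather than `str_match()`), and the example tweets and handles are hypothetical.

```r
# Hypothetical tweets, for illustration only
tweets <- c("RT @alice: Great talk this morning #JDSLille",
            "Looking forward to the gala #JDSLille",
            "RT @bob_42: slides are online")

# Capture the handle that follows a leading "RT @";
# tweets that are not retweets get NA
pattern <- "^RT @([[:alnum:]_]+).*$"
rt <- ifelse(grepl(pattern, tweets), sub(pattern, "\\1", tweets), NA_character_)
print(rt)
```

As in the code above, only tweets that start with "RT @" are treated as retweets, so a mention in the middle of a tweet is not counted.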

It is also possible to see when the tweets were posted. Colors indicate RTs (one color per retweeted account) and grey indicates new posts.

> ggplot()+geom_bar(aes(x=df$created,
+ fill=df$rt ))

Let’s have a look at the most common words within tweets.

> myCorpus <- Corpus(VectorSource(df$text))
> myCorpus <- tm_map(myCorpus, stripWhitespace)
> myCorpus <- tm_map(myCorpus, removePunctuation)
> myCorpus <- tm_map(myCorpus, removeNumbers)
> myStopwords <- c(stopwords("fr"), "RT", names(counts))
> myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
> corpus.tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3,20)))
> findFreqTerms(corpus.tdm, lowfreq=10)
 [1] "aléatoires"  "amphi"       "bigdata"    
 [4] "conférence"  "exposé"      "gala"       
 [7] "graphes"     "jdslille"    "journées"   
[10] "les"         "lionning"    "matin"      
[13] "présente"    "session"     "soirée"     
[16] "statistique"
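For readers who want to see what happens behind the term-document matrix, here is a rough base-R sketch of the same idea on three made-up tweets. The stop list is a tiny hand-made one (an assumption for the example, not the list returned by `stopwords("fr")`), and the threshold is lowered to 2 because the toy corpus is small.

```r
# Hypothetical tweets, for illustration only
tweets <- c("Belle session sur les graphes",
            "La session du matin: graphes et bigdata",
            "Gala ce soir, juste avant la session 12")

# Lower-case, replace punctuation and digits by spaces, split on whitespace
clean <- gsub("[[:punct:][:digit:]]", " ", tolower(tweets))
words <- unlist(strsplit(clean, "[[:space:]]+"))

# Keep words of 3 to 20 characters that are not in the stop list
stop_list <- c("les", "sur", "avant")
keep <- nchar(words) >= 3 & nchar(words) <= 20 & !(words %in% stop_list)
words <- words[keep]

# Terms appearing at least twice overall, like findFreqTerms(..., lowfreq=2)
freq <- table(words)
frequent <- names(freq)[as.vector(freq) >= 2]
print(frequent)
```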

Let us visualize the network

> aretes = cbind(df$rt,df$screenName)
> aretes = aretes[apply(aretes,1,function(x)
+ sum(is.na(x))==0),]
> reseau = graph.data.frame(aretes,directed=T)
> plot(reseau)

As usual, there is one big component (with 37 Twitter accounts) and a few small ones (6 small clusters).

Let’s go further.

> graphe = matrix(0,nrow=length(unique(c(df$rt,
+ df$screenName)))-1,ncol=length(unique(c(df$rt,df$screenName)))-1 )
> nom = unique(c(df$rt,df$screenName))
> rownames(graphe) = colnames(graphe) = nom[!is.na(nom)]
> for (i in 1:nrow(aretes)) graphe[aretes[i,1],
+ aretes[i,2]]=1
> net=graph.adjacency(adjmatrix=graphe,
+ mode="undirected",weighted=TRUE,diag=FALSE)
> net.components <- clusters(net)
> net.components$no
[1] 7
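To see what the adjacency matrix and the number of components mean, here is a small base-R sketch on a hypothetical edge list. It labels the components with a simple flooding loop written for this example; it is not a description of how `clusters()` is implemented in igraph.

```r
# Hypothetical retweet edges: column 1 = retweeted account, column 2 = retweeter
edges <- rbind(c("alice", "bob"),
               c("alice", "carol"),
               c("dave",  "erin"))

# Symmetric 0/1 adjacency matrix indexed by account name
nodes <- unique(as.vector(edges))
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Label connected components by flooding from each unlabelled node
membership <- setNames(rep(NA_integer_, length(nodes)), nodes)
n_comp <- 0L
for (start in nodes) {
  if (!is.na(membership[start])) next
  n_comp <- n_comp + 1L
  frontier <- start
  while (length(frontier) > 0) {
    membership[frontier] <- n_comp
    neighbours <- nodes[colSums(adj[frontier, , drop = FALSE]) > 0]
    frontier <- neighbours[is.na(membership[neighbours])]
  }
}
print(n_comp)
print(membership)
```

Here the toy graph has two components: alice, bob and carol on one side, dave and erin on the other.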

Let’s visualize it

> couleur = findInterval(apply(graphe,1,sum),
+ c(0,1,5,30))
> gplot(graphe,usearrows =FALSE,
+ displayisolates = TRUE, vertex.col=couleur,
+ label=colnames(graphe))

or if we zoom in

Black means that none of the account's posts was retweeted (most of the time these accounts only retweeted other posts). Red means that the account was retweeted by 1 to 4 other accounts, and green by 5 or more.
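The color classes come from `findInterval()`. The following sketch applies the same breakpoints to a few hypothetical row sums, to show where the boundaries fall.

```r
# Hypothetical row sums of the adjacency matrix
n_rt <- c(0, 1, 4, 5, 29, 30)

# Same breakpoints as above:
# class 1 = [0,1), class 2 = [1,5), class 3 = [5,30), class 4 = 30 and more
classes <- findInterval(n_rt, c(0, 1, 5, 30))
print(classes)
```

The class number is then used as a color index, which is why the first three classes show up as black, red and green with R's default palette.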

Let’s look at the largest component. Its density is around 8.1% and its transitivity is about 22.3%. Therefore we have a strong preferential attachment.

> net.lcc <- induced.subgraph(net,
+ net.components$membership==
+ which.max(net.components$csize))
> graph.density(net.lcc)
[1] 0.08108108
> transitivity(net.lcc)
[1] 0.2234043
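As a reminder of what these two numbers measure, here is a base-R sketch that computes them by hand on a small hypothetical graph (a triangle with one extra edge). The density is the share of possible pairs that are linked, and the global transitivity is the share of paths of length two whose endpoints are also linked.

```r
# Hypothetical undirected graph on 4 nodes: triangle (1,2,3) plus the edge 3-4
A <- matrix(0, 4, 4)
A[rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))] <- 1
A <- A + t(A)

n <- nrow(A)
m <- sum(A) / 2                       # number of edges
dens <- m / (n * (n - 1) / 2)         # share of possible pairs that are linked

# Global transitivity: closed paths of length 2 over all paths of length 2
A2 <- A %*% A
paths2  <- sum(A2) - sum(diag(A2))    # ordered paths i-k-j with i != j
closed2 <- sum(diag(A2 %*% A))        # those whose endpoints are also linked
trans <- closed2 / paths2

print(c(density = dens, transitivity = trans))
```

On this toy graph there are 4 edges out of 6 possible pairs, and 3 of the 5 connected triples are closed.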

We can also look for communities within the largest component, using the spin-glass algorithm

> net.clusters <- spinglass.community(net.lcc)
> table(net.clusters$membership)
 
1 2 3 4 5 6 7 
5 6 5 3 6 7 5 
> V(net.lcc)$community <- net.clusters$membership
> plot(net.lcc, main="Communities", 
+      vertex.frame.color=rainbow(max(
+      net.clusters$membership))[net.clusters$membership],
+      vertex.color=rainbow(max(
+      net.clusters$membership))[net.clusters$membership],
+      vertex.label=net.clusters$names,
+      edge.color="grey")

or if we zoom in

Of course, many other results can be obtained from the dataset. Please feel free to use it. All comments and feedback are welcome!

The interesting point with this code is that it is possible to adapt it for any conference! Use for instance the function

conference = function(hash="#JDSLille",file=url("http://freakonometrics.free.fr/JDSLille.RData")){
# file=NA: search Twitter for the hashtag; otherwise load the saved tweets
if(is.na(file)){
tweet =searchTwitter(hash)
df <- do.call("rbind", lapply(tweet, as.data.frame))
}
if(!is.na(file)){
  load(file)
}
# most active accounts (top 10%)
counts=table(df$screenName)
  top_tw=rev(sort(subset(counts,counts>quantile(counts,.9))))
  cat("Top 10% Twitter Accounts -------------\n")
  for(i in seq_along(top_tw)) cat("    @",names(top_tw)[i],": ",top_tw[i]," tweets\n",sep="")
# retweeted accounts; print() is needed to display a ggplot from inside a function
df$text <- sapply(df$text,function(x) iconv(x,to='UTF-8'))
trim <- function (x) sub('@','',x)
df$rt=sapply(df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
print(ggplot()+geom_bar(aes(x=na.omit(df$rt)))+theme(axis.text.x=element_text(angle=-90,size=10))+xlab(NULL))
print(ggplot()+geom_bar(aes(x=df$created,fill=df$rt )))
# most common words
myCorpus <- Corpus(VectorSource(df$text))
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myStopwords <- c(stopwords("fr"), "RT", names(counts))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
corpus.tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3,20)))
  top_terms=findFreqTerms(corpus.tdm, lowfreq=10)
  cat("Top Words -------------\n")
  for(i in seq_along(top_terms)) cat("    ",top_terms[i],"\n",sep="")
# retweet network (edges go from the retweeted account to the retweeter)
aretes = cbind(df$rt,df$screenName)
aretes = aretes[apply(aretes,1,function(x)sum(is.na(x))==0),]
reseau = graph.data.frame(aretes,directed=TRUE)
plot(reseau)
graphe = matrix(0,nrow=length(unique(c(df$rt,df$screenName)))-1,ncol=length(unique(c(df$rt,df$screenName)))-1 )
nom = unique(c(df$rt,df$screenName))
rownames(graphe) = colnames(graphe) = nom[!is.na(nom)]
for (i in 1:nrow(aretes)) graphe[aretes[i,1],aretes[i,2]]=1
net=graph.adjacency(adjmatrix=graphe,mode="undirected",weighted=TRUE,diag=FALSE)
net.components <- clusters(net)
print(net.components$no)
couleur = findInterval(apply(graphe,1,sum),c(0,1,5,30))
gplot(graphe,usearrows =FALSE,displayisolates = TRUE,vertex.col=couleur,label=colnames(graphe))
# largest component: density, transitivity and communities
net.lcc <- induced.subgraph(net,net.components$membership==
which.max(net.components$csize))
print(graph.density(net.lcc))
print(transitivity(net.lcc))
net.clusters <- spinglass.community(net.lcc)
print(table(net.clusters$membership))
V(net.lcc)$community <- net.clusters$membership
plot(net.lcc, main="Communities", vertex.frame.color=rainbow(max(net.clusters$membership))[net.clusters$membership],
vertex.color=rainbow(max(net.clusters$membership))[net.clusters$membership], 
     vertex.label=net.clusters$names,
     edge.color="grey")
}

Note that the very last tweet related to the hashtag #JDSLille (during those five days) was

Indeed. That is a good idea! See you next year for an updated version of the statistical analysis of the live-tweet, on #JDSMontpellier.
