The echo of a tragedy in Social Media – “The Making of”

[This article was first published on Big Data Doctor » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post I’m going to describe how I pull together the visualization of The echo of a tragedy in Social Media.

I used 3 different technologies to pull it together: PHP for data gathering, R for data analysis and data output generation and D3.js for data visualization.

The day after the incidents in the Charlie Hebdo took place, I set up a Twitter monitor using the Search API based on the hashtag #jesuischarlie for 5 different languages: German, Spanish, English, Italian and of course French. I made sure for each language I had at least 10K Tweets gathered. I made available the data as csv (jesuischarlie.csv). The PHP part is easy so I’m not going to focus on it :)

The analysis part is no rocket science and is accomplished in following steps:

  • Counting the occurrences of all available hashtags per language
  • Taking the top 50 hashtags per language
  • Computing the occurrences of hashtags pairs within the selected hashtags
  • Generating the JSON according to the directed forced D3.js layout

The code taking care of the visualization is written in D3.js, embedded and readable in the visualization page.
A few comments on that:

  • There are 2 functions to make both text and opacity dependent on the number of occurrences
  • The nodes accept a click to highlight the connections to other nodes
  • The color of the highlighted connections change according to a round robin approach to allow for multiple highlighting
  • The values for gravity, distance and charge have been manually adjusted… worth spending some time playing around with them
  • The size of the images for the nodes is determined by a function to allow bigger sizes for the flags

Below you can find the R code… Again, it might not be 100% optimized, but does the job… Give it a go!

library(RMySQL)
require(data.table)
library(stringr)
 
con <- dbConnect(MySQL(), user="XXXXXX", password="XXXXXX",dbname="XXXXXXX", host="XXXXXXXX")
res=dbSendQuery(conn=con,"SELECT * FROM domaintweets where tag='#JeSuisCharlie'")
assets<-fetch(res,-1)
dbDisconnect(con)
# I've done it with a data based... you can do it with the data set I provided above with a read.csv statement
# assets<-read.csv("~/jesuischarlie/jesuischarlie.csv")
 
# languages we extracted tweets for
languages<-c("es","de","en","fr","it")
 
countPerLanguage<-function (assets, max)
{
  assets$ntags<-str_count(assets$tags, ",")
  #filtering out tweets with way too many hashtags
  assets<-assets[assets$ntags<5,]
  assets<-head(assets,max)
  tags<-assets$tags
 
  alltags<-paste(tags, collapse = ',')
  alltags<-gsub(x=alltags, pattern = ',,',replacement = ',')
  all.tags<-strsplit(x = alltags,split = ',')
  df.all.tags<- as.data.frame(all.tags)
  colnames(df.all.tags)<-c("tag")
  df.all.tags$tag<-as.character(df.all.tags$tag)
  df.all.tags$tag<-tolower(df.all.tags$tag)
  df.all.tags$count<-1
  agg.df.all.tags<-aggregate(count~tag, df.all.tags, length)
  agg.df.all.tags<-agg.df.all.tags[order(agg.df.all.tags$count, decreasing = T),]
  return (agg.df.all.tags)
}
 
# taking the top50 hashtags per languages
agg.languages<-NULL
for (i in 1:length(languages))
{
  df.lang<-subset(assets, lang==languages[i])
  top50 <- countPerLanguage(df.lang, 10000)
  lang<-rep(x = languages[i],times = nrow(top50))
  top50$lang<-lang
  top50<-head(top50,50)
  agg.languages<-rbind(agg.languages,top50)
}
 
# hashtag 2 hashtag connection
unique.tags<-unique(agg.languages$tag)
pairs.df<-NULL
for (i in 2:length(unique.tags))
{
  tag1 <- unique.tags[i]
  assets.tag <- subset(assets, grepl(pattern = paste0("^",tag1,",", "|",",", tag1,",") ,x = assets$tags) )
 
  for (j in i:length(unique.tags))
  {
    if (i!=j)
    {
      tag2<-unique.tags[j]
      print(paste(tag1, "-", tag2))
      pairs <- subset(assets.tag, grepl(pattern = paste0("^",tag2,",", "|",",", tag2,",") ,x = assets.tag$tags) )
      count.pairs<-nrow(pairs)
      if (count.pairs!= 0) {
        l<-list(unique.tags[i],unique.tags[j],count.pairs)
        df<-as.data.frame(l)
        colnames(df)<-c('tag1','tag2','Freq')
        pairs.df<-rbind(pairs.df,df)       
      }
    }
  }
}
pairs.df$tag1<-as.character(pairs.df$tag1)
pairs.df$tag2<-as.character(pairs.df$tag2)
 
# Generating JSON compatible with the force-directed graph http://bl.ocks.org/mbostock/4062045
 
languages<-c("de","en","it","fr","es")
languages.img<-c("https://cdn3.iconfinder.com/data/icons/finalflags/32/Germany-Flag.png",
                 "https://cdn3.iconfinder.com/data/icons/finalflags/32/United-Kingdom-flag.png",
                 "https://cdn3.iconfinder.com/data/icons/finalflags/32/Italy-Flag.png",
                 "https://cdn3.iconfinder.com/data/icons/finalflags/32/France-Flag.png",
                 "https://cdn3.iconfinder.com/data/icons/finalflags/32/Spain-Flag.png")
 
img.url<-"https://cdn4.iconfinder.com/data/icons/miu/22/editor_pencil_pen_edit_write_-16.png"
listNodes<-NULL
nodeNr<-0
# Genereting the Nodes
strJson<-'{ "nodes":['
 
for(j in 1:length(languages))
{
  lang<-agg.languages[agg.languages$lang==languages[j],]
  df<-data.frame(tag=languages[j],noderNr=nodeNr)
  listNodes<-rbind(df,listNodes)
  # entry for the language
  dt<-paste0('{"name":"', languages[j],'","group":',j,', "image":"',languages.img[j]  ,'", "count":', sum(lang$count) ,', "number":', nodeNr ,' },' )
  nodeNr<-nodeNr+1
 
  strJson<-paste0(strJson,dt)
  for(i in 1:nrow(lang))
  {   
    if (nrow(listNodes[listNodes$tag==lang[i,]$tag,])==0 &&  !(lang[i,]$tag %in% languages))
    {
      dt<-paste0('{"name":"', lang[i,]$tag,'","group":',j,', "image":"',img.url  ,'", "count":', lang[i,]$count ,', "number":', nodeNr ,' },' )
      strJson<-paste0(strJson,dt)
      df<-data.frame(tag=lang[i,]$tag,noderNr=nodeNr)
      listNodes<-rbind(df,listNodes)
      nodeNr<-nodeNr+1
    }
 
  }
}
strJson<-substr(strJson,1,nchar(strJson)-1)
 
# Genereting the Links
strJson<-paste0(strJson,'], "links": [')
for (l in 1:nrow(agg.languages))
{
  matches1<-subset(listNodes, tag==agg.languages[l,]$lang)
  matches2<-subset(listNodes, tag==agg.languages[l,]$tag)
  for (k in 1:nrow(matches1))
  {
    for (m in 1:nrow(matches2))
    {
      dt<-paste0('{"source":',matches1[k,]$noderNr,',"target":',matches2[m,]$noderNr,',"value": 0},')
      strJson<-paste0(strJson,dt) 
    }
  }
}
for (l in 1:nrow(pairs.df))
{
  matches1<-subset(listNodes, tag==pairs.df[l,]$tag1)
  matches2<-subset(listNodes, tag==pairs.df[l,]$tag2)
  for (k in 1:nrow(matches1))
  {
    for (m in 1:nrow(matches2))
    {
      dt<-paste0('{"source":',matches1[k,]$noderNr,',"target":',matches2[m,]$noderNr,',"value": 0},')
      strJson<-paste0(strJson,dt) 
    }
  }
}
 
strJson<-substr(strJson,1,nchar(strJson)-1)
strJson<-paste0(strJson,']}')
# Writing it to a file
write(x = strJson, file = '~/jesuischarlie/jesuischarlie2.json')

To leave a comment for the author, please follow the link and comment on their blog: Big Data Doctor » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)