Twitter analysis using R (Semantic analysis of French elections)

[This article was first published on Enhance Data Science, and kindly contributed to R-bloggers].

Last month, the post French elections viewed through Twitter: a semantic analysis showed how the two contenders were perceived on Twitter during three key events of the campaign (the Macron leaks, the presidential debate and election day). The goal of this post is to show how to perform that Twitter analysis using R.

Collecting tweets in real time with streamR (Twitter streaming API)

To perform the analysis, I needed a large number of tweets and I wanted to use as many of the tweets concerning the election as possible. The Twitter search API is limited since it only gives access to a sample of tweets, whereas the streaming API collects data in real time and captures almost all of them. Hence, I used the streamR package.

So I collected tweets in 60-second batches and saved each batch to its own .json file. Using batches instead of one large file keeps RAM consumption down: instead of reading and then subsetting one huge file, you can subset each batch and then merge the results. Here is the code to collect the data with streamR.

###Loading my Twitter credentials (the my_oauth object, created beforehand)
load("oauth.Rdata")
##Collecting data in 60-second batches, one .json file per batch
require('streamR')
i=1
while(TRUE)
{
 i=i+1
 filterStream(file=paste0("tweet_macronleaks/tweets_rstats",i,".json"),
              track=c("#MacronLeaks"), timeout=60, oauth=my_oauth, language='fr')
}

The code runs an infinite loop (stopped manually); on each iteration, filterStream listens to the Twitter stream for 60 seconds and writes the matching tweets to a new file. Here, we only keep tweets containing #MacronLeaks that are written in French.
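The my_oauth object loaded from oauth.Rdata has to be created once beforehand. Here is a minimal sketch of that step with the ROAuth package, following the approach described in the streamR documentation; the consumer key and secret are placeholders you get from your own Twitter application.

##One-off creation of the OAuth credentials used by filterStream
require('ROAuth')
my_oauth <- OAuthFactory$new(consumerKey = "YOUR_CONSUMER_KEY",
 consumerSecret = "YOUR_CONSUMER_SECRET",
 requestURL = "https://api.twitter.com/oauth/request_token",
 accessURL = "https://api.twitter.com/oauth/access_token",
 authURL = "https://api.twitter.com/oauth/authorize")
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
save(my_oauth, file = "oauth.Rdata")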

Tweet cleaning and pre-processing

Now that the tweets are collected, they need to be cleaned and pre-processed. A raw tweet contains links, tabulations, @ mentions, #, double spaces, … that would influence the analysis. It also contains stop words (very frequent words in the language such as ‘and’, ‘or’, ‘with’, …).
In addition, some tweets are retweeted (sometimes a lot), which distorts the word and text distribution. Enough retweets are kept to show that some tweets are more popular than others, but most of them are removed so they do not stand out too much from the crowd.

First, the saved tweets need to be read and merged:

require(data.table)
##Read every batch and stack them into one data.table
##(the loop simply stops with an error once it runs out of files)
data.tweet=NULL
i=1
while(TRUE)
{
 i=i+1
 print(i)
 print(paste0("tweet_macronleaks/tweets_rstats",i,".json"))
 if (is.null(data.tweet))
  data.tweet=data.table(parseTweets(paste0("tweet_macronleaks/tweets_rstats",i,".json")))
 else
  data.tweet=rbind(data.tweet,data.table(parseTweets(paste0("tweet_macronleaks/tweets_rstats",i,".json"))))
}
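As an aside, a slightly more idiomatic way to do the same merge (a sketch, assuming the same folder layout) is to list the batch files first and stack them with rbindlist, which avoids ending the loop on an error:

##Alternative: list the batch files, parse each one and stack them in one go
files=list.files("tweet_macronleaks", pattern="\\.json$", full.names=TRUE)
data.tweet=rbindlist(lapply(files, function(f) data.table(parseTweets(f))))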

Then we only keep some of the retweets. For a given tweet text, retweet_count indexes the successive retweets (running from min_RT to max_RT), so the filter below keeps roughly log(1 + n) of the n retweets of each tweet. For instance, a tweet retweeted 1,000 times is kept about 7 times, since log(1001) ≈ 6.9.

##Smallest and largest retweet_count observed for each tweet text
data.tweet[,min_RT:=min(retweet_count),by=text]
data.tweet[,max_RT:=max(retweet_count),by=text]
##Keep French tweets only
data.tweet=data.tweet[lang=='fr',]
##Keep about log(1+n) of the n retweets of each tweet
data.tweet=data.tweet[retweet_count<=min_RT+log(max_RT-min_RT+1),]

Then, the text can be cleaned using functions from the tm package:

###Lower-case and clean the text (accents are removed beforehand with iconv)
Unaccent <- function(x) {
 x = tolower(x)
 x = gsub("@\\w+", "", x)            #remove @mentions
 x = gsub("[[:punct:]]", " ", x)     #replace punctuation by spaces
 x = gsub("[ |\t]{2,}", " ", x)      #collapse repeated spaces and tabs
 x = gsub("^ +", "", x)              #trim leading spaces
 x = gsub("http\\w+", " ", x)        #drop the 'http(s)' scheme of links
 x = gsub('_', ' ', x, fixed = TRUE) #replace underscores by spaces
 x
}
require(tm)
###Remove accents, then apply the cleaning function
data.tweet$text=Unaccent(iconv(data.tweet$text,from="UTF-8",to="ASCII//TRANSLIT"))
##Remove French stop words and leftover tokens ('rt', link fragments such as 'co', ...)
data.tweet$text=removeWords(data.tweet$text,c('rt','a',stopwords('fr'),'e','co','pr'))
##Remove double whitespaces
data.tweet$text=stripWhitespace(data.tweet$text)

Tokenization and creation of the vocabulary

Now that the tweets have been cleaned, they can be tokenized. During this step, each tweet is split into tokens; here, each word corresponds to one token. The tokenization and embedding steps below rely on the text2vec package.

require(text2vec)
# Create iterator over tokens
tokens <- space_tokenizer(data.tweet$text)
it = itoken(tokens, progressbar = FALSE)

Now a vocabulary can be created from the corpus (it is a “summary” of the word distribution). The vocabulary is then pruned: very common and very rare words are removed.

vocab = create_vocabulary(it)
##Prune: keep words appearing at least 5 times, in less than 40% of the tweets
##and in at least 0.05% of them
vocab = prune_vocabulary(vocab,
 term_count_min = 5,
 doc_proportion_max = 0.4,
 doc_proportion_min = 0.0005)
##Vectorizer used to build the term co-occurrence matrix (5-word window)
vectorizer = vocab_vectorizer(vocab,
 grow_dtm = FALSE,
 skip_grams_window = 5L)

tcm = create_tcm(it, vectorizer)

Now we can create the word embedding. In this example, I used a GloVe embedding to learn vector representations of the words; the new vector space has 200 dimensions.

glove = GlobalVectors$new(word_vectors_size = 200, vocabulary = vocab, x_max = 100)
glove$fit(tcm, n_iter = 200)
word_vectors <- glove$get_word_vectors()
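As a quick sanity check (not part of the original post), you can look at the nearest neighbours of a word in the embedding with text2vec’s sim2 function; the word ‘macron’ is assumed to be present in the pruned vocabulary.

##Cosine similarity between 'macron' and every word of the vocabulary
sims = sim2(word_vectors, word_vectors["macron", , drop = FALSE], method = "cosine")
##Ten most similar words (the word itself comes first with similarity 1)
head(sort(sims[, 1], decreasing = TRUE), 10)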

How to finish our Twitter analysis with t-SNE

Now that the words are vectors, we would like to plot them in two dimensions to show the meaning of the words in an appealing (and understandable) way. The number of dimensions therefore needs to be reduced to two; to do so, we will use t-SNE. t-SNE is a non-parametric dimensionality reduction algorithm that tends to perform well on word embeddings. R has a package (actually two) to perform t-SNE; we will use the more recent one, Rtsne.
To avoid overcrowding the plot and to reduce computing time, only words with more than 50 appearances will be used.

require('Rtsne')
set.seed(123)
##Keep the vectors of words appearing in more than 50 tweets (excluding stop words)
word_vectors_sne=word_vectors[which(vocab$vocab$doc_counts>50&!rownames(word_vectors)%in%stopwords('fr')),]
##Project the 200-dimensional vectors onto 2 dimensions
tsne_out=Rtsne(word_vectors_sne,perplexity =2,initial_dims = 200,dims = 2)
DF_proj=data.frame(x=tsne_out$Y[,1],y=tsne_out$Y[,2],word=rownames(word_vectors_sne))

Now that the projection in two dimensions has been done, we’d like to know which contender each word is assigned to, in order to color the plot. To do so, a dictionary is created with the names and pseudos of each of the contenders, and the distance from every word to each of these pseudos is computed.
For instance, to assign a candidate to the word ‘democracy’, the minimum distance between ‘democracy’ and ‘mlp’, ‘marine’, ‘fn’, … is computed. The same thing is done between ‘democracy’ and ‘macron’, ‘em’, ‘enmarche’, … If the first distance is the smaller one, ‘democracy’ is assigned to Marine Le Pen; otherwise, it is assigned to Emmanuel Macron.

require(ggplot2)
require(ggrepel)
DF_proj=data.table(DF_proj)
##Attach the document counts of the words kept for the t-SNE projection
##(same filter as above, so the counts line up with DF_proj)
DF_proj$count=vocab$vocab$doc_counts[which(vocab$vocab$doc_counts>50&!(rownames(word_vectors)%in%stopwords('fr')))]
DF_proj=DF_proj[word!='NA']

##Highest cosine similarity between a word and a candidate's list of pseudos
distance_to_candidat=function(word_vectors,words_list,word_in)
{
 max(sim2(word_vectors[words_list,,drop=F],word_vectors[word_in,,drop=F]))
}
##Assign each word to the candidate whose pseudos are the most similar
closest_candidat=function(word_vectors,mot_in)
{
 mot_le_pen=c('marine','pen','lepen','fn','mlp')
 mot_macron=c('macron','emmanuel','em','enmarche','emmanuelmacron')
 dist_le_pen=distance_to_candidat(word_vectors,mot_le_pen,mot_in)
 dist_macron=distance_to_candidat(word_vectors,mot_macron,mot_in)
 if (dist_le_pen>dist_macron)
  'Le Pen'
 else
  'Macron'
}
DF_proj[,word:=as.character(word)]
DF_proj=DF_proj[word!=""]
DF_proj[,Candidat:=closest_candidat(word_vectors,word),by=word]

require(plotly)
gg=ggplot(DF_proj,aes(x,y,label=word,color=Candidat))+geom_text(aes(size=sqrt(count+1)))
ggplotly(gg)
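ggplotly turns the static ggplot2 chart into an interactive plot. If you want to keep it, one option (not shown in the original post) is to save it as a standalone HTML file with htmlwidgets; the file name below is just a placeholder.

##Save the interactive plot as a self-contained HTML file
htmlwidgets::saveWidget(ggplotly(gg), "macronleaks_tsne.html")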
