Doing a Twitter Analysis with R
Recently I took part in Coding Durer, a five-day international and interdisciplinary hackathon for art history and information science. The goal of this hackathon is to bring art historians and information scientists together to work on data. It is kind of an extension of the cultural hackathon CodingDaVinci, where I participated in the past; there is also a blog post about CDV. I will write another blog post about the results of Coding Durer another day, but this article is going to be a Twitter analysis of the hashtag #codingdurer. This article was a very good starting point for me for doing the analysis.
First we want to get the tweets, and we are going to use the awesome twitteR package. If you want to know how to get the API key and related credentials, I recommend visiting this page here. Once you have everything set up, we are good to go. The code down below does the authentication with Twitter and loads our packages. I assume you know how to install an R package or can at least find a solution on the web.
# get packages
require(twitteR)
library(dplyr)
library(ggplot2)
library(tidytext)

# do auth
consumer_key <- "my_key"
consumer_secret <- "my_secret"
access_token <- "my_token"
access_secret <- "my_access_secret"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
We are now going to search for all the tweets containing the hashtag #codingdurer using the searchTwitter function from the twitteR package. After converting the result to an easy-to-work-with data frame, we remove all the retweets from our results because we do not want any duplicated tweets. I also removed the links from the tweet text, as we do not need them.
# get tweets
cd_twitter <- searchTwitter("#CodingDurer", n = 2000)
cd_twitter_df <- twListToDF(cd_twitter)

# remove retweets
cd_twitter_unique <- cd_twitter_df %>% filter(!isRetweet)

# remove links
cd_twitter_nolink <- cd_twitter_unique %>%
  mutate(text = gsub("https?://[\\w\\./]+", "", text, perl = TRUE))
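To see what the link-stripping step does, here is a minimal sketch with a made-up tweet (the text is invented for illustration; the regular expression is the one used above):

```r
# Hypothetical tweet text to demonstrate the gsub() call above
tweet <- "Great session at #CodingDurer https://t.co/abc123 see you tomorrow"
cleaned <- gsub("https?://[\\w\\./]+", "", tweet, perl = TRUE)
# the URL is removed, leaving only the plain text of the tweet
```

The pattern matches `http://` or `https://` followed by word characters, dots and slashes, which covers Twitter's shortened t.co links.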
With the code down below we extract the twenty most active Twitter accounts during Coding Durer. I used a simple ggplot bar chart and saved it to a variable called people.
# who is tweeting
people <- cd_twitter_nolink %>%
  count(screenName, sort = TRUE) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(screenName, n, function(n) -n), y = n)) +
  geom_bar(stat = "identity") +
  ylab("Number of Tweets") +
  xlab("") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Most active twitter users")
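The counting step itself does not depend on dplyr; the same top-N tally can be sketched in base R, here with made-up screen names instead of real tweet data:

```r
# Hypothetical screen names standing in for cd_twitter_nolink$screenName
screen_names <- c("alice", "bob", "alice", "carol", "alice", "bob")

# table() counts occurrences, sort() orders them, head() keeps the top 20
counts <- sort(table(screen_names), decreasing = TRUE)
top <- head(counts, 20)
# "alice" comes first with 3 tweets, then "bob" with 2, then "carol" with 1
```

This is equivalent to the `count(screenName, sort = TRUE) %>% slice(1:20)` pipeline above, just without the data-frame conveniences.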
Now we want to know the twenty most used words from the tweets. This is going to be a bit trickier. First we extract all the words being said. Then we remove all the stop words (and some special words like codingdurer, https …), as they are uninteresting for us. We also remove any Twitter account name from the tweets. Now we are almost good to go: we just do some singularization, and then we can save the top twenty words as a ggplot graphic in a variable called word.
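At its core, the stop-word removal is just set filtering; a minimal base-R sketch with invented words shows the idea behind the anti-join used below:

```r
# Hypothetical token list and stop-word list for illustration
words <- c("the", "art", "data", "and", "project", "codingdurer")
stops <- c("the", "and", "codingdurer", "https", "t.co", "amp")

# keep only words that are NOT in the stop-word list
interesting <- words[!words %in% stops]
# leaves "art", "data", "project"
```

`anti_join()` from dplyr does the same thing on data frames, matching on the shared `word` column.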
# what is being said
tweet_words <- cd_twitter_nolink %>%
  select(id, text) %>%
  unnest_tokens(word, text)

# remove stop words (plus some Twitter-specific ones)
my_stop_words <- stop_words %>%
  select(-lexicon) %>%
  bind_rows(data.frame(word = c("codingdurer", "https", "t.co", "amp")))
tweet_words_interesting <- tweet_words %>% anti_join(my_stop_words)

# remove names of tweeters
cd_twitter_df$screenName <- tolower(cd_twitter_df$screenName)
tweet_words_interesting <- filter(tweet_words_interesting,
                                  !(word %in% unique(cd_twitter_df$screenName)))

# singularize words (singularize() comes from the pluralize package)
tweet_words_interesting$word2 <- singularize(unlist(tokenize(tweet_words_interesting$word)))
tweet_words_interesting$word2[tweet_words_interesting$word2 == "datum"] <- "data"
tweet_words_interesting$word2[tweet_words_interesting$word == "people"] <- "people"

word <- tweet_words_interesting %>%
  count(word2, sort = TRUE) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(word2, n, function(n) -n), y = n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Word Occurrence") +
  xlab("") +
  ggtitle("Most used words in tweets")

# plot both together (grid.arrange() comes from the gridExtra package)
grid.arrange(people, word, nrow = 2, top = "Twitter Analysis of #codingdurer")
The grid.arrange function lets us plot both of our graphics at once. Now we can see who the most active Twitter users were and what the most used words were. It is good to see words like art, data and project at the top.
Make sure you check out my GitHub for other data-driven projects.