How to Use R to Scrape Tweets: Super Tuesday 2016
Super Tuesday 2016 has come and gone, and most of the election results are in, but what was the American public saying on Twitter?
The twitteR package for R lets you pull tweets from Twitter’s API and run sentiment analysis on them. The Plotly chart below shows what the Twitter-verse was saying about the candidates as last night’s poll results came in.
A basic tutorial follows, but if you want to see my full code, including the charts, you can check it out on my GitHub page. For the chart below, I scraped 20,000 tweets mentioning each candidate and ran them through a dictionary of positive and negative words to calculate a sentiment score for each tweet, excluding neutral ones.
Basic Method for Using twitteR
First, you’ve got to have a Twitter account with your phone number attached. After you’ve got that, just go to Twitter’s apps page and create an application. Add the name and description of your app along with a website; a placeholder site is fine. There’s also a field for a callback URL, but that’s optional.
Sentiment Analysis with R
Grab your API keys and access tokens from Twitter; you’ll need them for the R script.
library(twitteR)
library(ROAuth)
library(httr)

# Set API keys
api_key <- "xxxxxxxxxxxxxxxxxxxx"
api_secret <- "xxxxxxxxxxxxxxxxxxxx"
access_token <- "xxxxxxxxxxxxxxxxxxxx"
access_token_secret <- "xxxxxxxxxxxxxxxxxxxx"

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
By now you should be in. Now it’s time to grab some data.
# Grab latest tweets
tweets_sanders <- searchTwitter('@BernieSanders', n = 1500)

# Loop over tweets and extract text
library(plyr)
feed_sanders = laply(tweets_sanders, function(t) t$getText())
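Retweets and exact duplicates can skew scores like a single loud voice. The original post doesn’t include this step, but a quick sketch of one way to thin them out before scoring (using toy data in place of the feed_sanders vector built above):

```r
# Sketch: drop classic retweets and duplicate texts before scoring
# (feed stands in for a character vector like feed_sanders; toy data here)
feed <- c("RT @user: Feel the Bern", "Feel the Bern",
          "I voted today", "I voted today")

feed <- feed[!grepl("^RT ", feed)]  # drop classic "RT ..." retweets
feed <- unique(feed)                # drop exact duplicates
feed
```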
Now you’ve got a bunch of text data for Bernie Sanders, but how do we decide what makes a tweet “good” or “bad”? This is where I turned to the Hu and Liu Opinion Lexicon, a list of around 6,800 positive and negative words compiled by Minqing Hu and Bing Liu of the University of Illinois at Chicago.
Unpack the Opinion Lexicon into your working directory and you should be ready to roll.
# Read in dictionary of positive and negative words
yay = scan('opinion-lexicon-English/positive-words.txt',
           what = 'character', comment.char = ';')
boo = scan('opinion-lexicon-English/negative-words.txt',
           what = 'character', comment.char = ';')

# Add a few Twitter-specific phrases
bad_text = c(boo, 'wtf', 'epicfail', 'douchebag')
good_text = c(yay, 'upgrade', ':)', '#iVoted', 'voted')
Now you’ve got your list of tweets and your list of opinionated words. The next step is to score each tweet by counting how many of the “good” and “bad” words show up in its text.
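Before the full function, here is the core idea in miniature, with toy word lists standing in for the lexicon:

```r
# Minimal sketch of the scoring idea on one sentence
# (toy word lists, not the full Hu & Liu lexicon)
good_text <- c("great", "win", "love")
bad_text  <- c("sad", "fail", "angry")

sentence <- "Great rally tonight, sad to see the media fail"

# strip punctuation, lowercase, split on whitespace
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+"))

# positives minus negatives: 1 ("great") - 2 ("sad", "fail")
score <- sum(words %in% good_text) - sum(words %in% bad_text)
score
```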
For this we’ll need a fairly long R function built around gsub() and match(). Thanks to Jeff Breen for the function on which this was based.
score.sentiment = function(sentences, good_text, bad_text, .progress='none')
{
  require(plyr)
  require(stringr)

  # we got a vector of sentences. plyr will handle a list
  # or a vector as an "l" for us
  # we want a simple array of scores back, so we use
  # "l" + "a" + "ply" = "laply":
  scores = laply(sentences, function(sentence, good_text, bad_text) {

    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)

    # to remove emojis and other non-ASCII characters
    # (sub = '' keeps the rest of the string instead of returning NA)
    sentence <- iconv(sentence, 'UTF-8', 'ASCII', sub = '')
    sentence = tolower(sentence)

    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)

    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, good_text)
    neg.matches = match(words, bad_text)

    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)

    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)

    return(score)
  }, good_text, bad_text, .progress=.progress)

  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}
The good news about this obnoxiously long function is that it spits out a tidy data frame that can be manipulated very easily.
# Call the function and return a data frame
feelthabern <- score.sentiment(feed_sanders, good_text, bad_text,
                               .progress = 'text')

# Label the scores and cut the text column, it just gets in the way
feelthabern$name <- "Sanders"
plotdat <- feelthabern[c("name", "score")]

# Remove neutral values of 0
plotdat <- plotdat[plotdat$score != 0, ]

# Nice little quick plot
library(ggplot2)
qplot(factor(score), data = plotdat, geom = "bar",
      fill = factor(name), xlab = "Sentiment Score")
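To compare candidates in one chart, the same scoring is repeated per candidate and the results are stacked into a single data frame. A sketch of that combination step, using made-up scores and the same name/score columns as the snippet above:

```r
# Toy example: stack scored data frames for two candidates and
# compare average sentiment (scores are invented for illustration)
sanders <- data.frame(name = "Sanders", score = c(2, 1, -1, 0, 3))
trump   <- data.frame(name = "Trump",   score = c(-2, 1, 0, 0, 1))

plotdat <- rbind(sanders, trump)
plotdat <- plotdat[plotdat$score != 0, ]  # drop neutral tweets

# average score per candidate
res <- tapply(plotdat$score, plotdat$name, mean)
res
```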