How to Use R to Scrape Tweets: Super Tuesday 2016

Super Tuesday 2016 has come and gone, and while we have most of the election results, what was the American public saying on Twitter?

The twitteR package for R allows you to scrape tweets from Twitter’s API and use them for sentiment analysis. The Plotly chart below shows what the Twitter-verse was saying about the candidates as last night’s results rolled in.

A basic tutorial is below, but if you want to see my full code including the charts, you can check it out on my GitHub page. For the chart below, I scraped 20,000 tweets that mentioned each candidate and then ran them through a dictionary of positive and negative words to calculate the sentiment score, excluding neutral comments.

Basic Method for Using twitteR

First, you’ve got to have a Twitter account (with your phone number attached). After you’ve got that, just go to Twitter’s apps page and create an application. Add the name and description of your app along with a website name. The website can be a test website. There’s also a field for a callback URL, but that’s optional.

Sentiment Analysis with R

Grab your API keys and access tokens from Twitter; you’ll need them for the R script.

library(twitteR)
library(ROAuth)
library(httr)

# Set API Keys
api_key <- "xxxxxxxxxxxxxxxxxxxx"
api_secret <- "xxxxxxxxxxxxxxxxxxxx"
access_token <- "xxxxxxxxxxxxxxxxxxxx"
access_token_secret <- "xxxxxxxxxxxxxxxxxxxx"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

By now you should be in. Now it’s time to grab some data.

# Grab latest tweets
tweets_sanders <- searchTwitter('@BernieSanders', n=1500)

# Loop over tweets and extract text
library(plyr)
feed_sanders = laply(tweets_sanders, function(t) t$getText())

Now you’ve got a bunch of text data for Bernie Sanders, so how do we decide what’s a “good” tweet and a “bad” tweet? This is where I turned to the Hu and Liu Opinion Lexicon, a list of 6800 positive and negative words compiled by Bing Liu and Minqing Hu of the University of Illinois at Chicago.

Unpack the Opinion Lexicon into your working directory and you should be ready to roll.

# Read in dictionary of positive and negative words
yay = scan('opinion-lexicon-English/positive-words.txt',
                  what='character', comment.char=';')
boo = scan('opinion-lexicon-English/negative-words.txt',
                  what='character', comment.char=';')
# Add a few twitter-specific negative phrases
bad_text = c(boo, 'wtf', 'epicfail', 'douchebag')
good_text = c(yay, 'upgrade', ':)', '#iVoted', 'voted')

Now you’ve got your list of tweets and your list of opinionated words. The next step is to score each tweet by counting how many of the “good” and “bad” words show up in its text.

For this we’ll need a giant R function built around gsub() and match(). Thanks to Jeff Breen for the function on which this was based.

score.sentiment = function(sentences, good_text, bad_text, .progress='none')
{
    require(plyr)
    require(stringr)
    # we got a vector of sentences. plyr will handle a list
    # or a vector as an "l" for us
    # we want a simple array of scores back, so we use
    # "l" + "a" + "ply" = "laply":
    scores = laply(sentences, function(sentence, good_text, bad_text) {
        
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # to remove emojis: sub='' drops non-ASCII characters
        # instead of turning the whole tweet into NA
        sentence <- iconv(sentence, 'UTF-8', 'ASCII', sub='')
        sentence = tolower(sentence)        
        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)
        
        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, good_text)
        neg.matches = match(words, bad_text)
        
        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)
        
        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)
        
        return(score)
    }, good_text, bad_text, .progress=.progress )
    
    scores.df = data.frame(score=scores, text=sentences)
    return(scores.df)
}

The good news about this obnoxiously long function is that it spits out a nice data frame that can be manipulated very easily.
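
Before pointing it at the Sanders feed, it’s worth a quick sanity check on a couple of made-up sentences. All four opinion words below appear in the Hu and Liu lexicon, so the scores should come back as 2 and -2:

# Quick sanity check on hand-written test sentences
test <- c("I love this great candidate", "What a terrible awful debate")
score.sentiment(test, good_text, bad_text)$score
# should print:  2 -2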

# Call the function and return a data frame
feelthabern <- score.sentiment(feed_sanders, good_text, bad_text, .progress='text')
# Tag the scores with the candidate's name for plotting
feelthabern$name <- "Sanders"
# Cut the text, it just gets in the way
plotdat <- feelthabern[c("name", "score")]
# Remove neutral values of 0
plotdat <- plotdat[plotdat$score != 0, ]
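
The qplot() call below fills the bars by candidate name, so to reproduce the multi-candidate chart you’d score a feed for each candidate and stack the results before trimming. A minimal sketch, assuming a second feed pulled the same way (the handle and label here are just examples):

# Score a second candidate's feed the same way (illustrative handle)
tweets_trump <- searchTwitter('@realDonaldTrump', n=1500)
feed_trump <- laply(tweets_trump, function(t) t$getText())
scores_trump <- score.sentiment(feed_trump, good_text, bad_text, .progress='text')
scores_trump$name <- "Trump"

# Stack the candidates, then trim and drop neutral scores as above
plotdat <- rbind(feelthabern, scores_trump)[c("name", "score")]
plotdat <- plotdat[plotdat$score != 0, ]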

# Nice little quick plot (qplot lives in ggplot2)
library(ggplot2)
qplot(factor(score), data=plotdat, geom="bar",
      fill=factor(name),
      xlab = "Sentiment Score")
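
If you want a single headline number per candidate instead of the full distribution, the mean score is one quick summary (a minimal sketch using the plotdat frame built above):

# Average sentiment score per candidate
aggregate(score ~ name, data = plotdat, FUN = mean)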
