Twitter is a “microblogging” service where users can, usually publicly, share links, pictures, or short comments (up to 140 characters) onto a timeline. The public timeline consists of all public tweets, but people can build their own private timelines to narrow content to just what they want to see. (They do this by “following” users.) Over the years, many companies, news organizations, and users have considered the social media site essential for sharing news and other information. (Or cat memes.) Twitter has some organizational tools such as replies/conversation threads, mentions (i.e. naming other users using the @ notation), and hashtags (naming a topic using # notation). Twitter has encouraged the use of these organizational tools by automatically making mentions and hashtags clickable links.
These organizational tools can make for some interesting analysis. For instance, a game show may encourage viewers to vote on a winner using hashtags. On its end, the show creates a filter for a particular hashtag (e.g. #votemyplayer) and counts the votes. This also makes Twitter data ripe for text mining (which Twitter itself uses to identify trending topics).
Obtaining the Twitter data
Twitter makes it possible for software to obtain Twitter comments without having to resort to “web-scraping” techniques (i.e. downloading the data as a web page and then parsing the HTML). Instead, you can go through an Application Programming Interface (API) to obtain the comments directly. If you’re interested, Twitter has a whole subdomain related to accessing their data, including documentation. There are a lot of technical details, but for the casual user probably the only ones of interest are API keys and rate limits. This post won’t fuss with rate limits, but more serious work may require some further understanding of these issues. However, you will need to create an API key. Follow these instructions, which are tailored for R users. The process essentially consists of creating a token at Twitter’s app web site and running an R function with the token. I set the credential variables (including access_secret) in an R block, just copying and pasting from the Twitter apps site; the block is not echoed in this blog post for obvious reasons.
Fortunately, the twitteR package makes obtaining data from Twitter easy. It’s on CRAN, so grab it using install.packages (it will also install dependencies, including httr, if you don’t have them already) before moving on.
We authenticate our R program to Twitter and then start with searching the public timeline for “Greenville”. Note due to the changing nature of Twitter, your results will probably be different:
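A minimal sketch of what the authentication and search step might look like with twitteR. It is an illustration rather than the exact code behind this post: it assumes the four credentials are stored in environment variables whose names are hypothetical placeholders, and it only contacts Twitter when those variables are set and twitteR is installed.

```r
# Sketch only: needs the twitteR package and valid Twitter API credentials.
# The environment variable names below are hypothetical placeholders.
creds <- Sys.getenv(c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
                      "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET"))
have_credentials <- all(nzchar(creds))

if (have_credentials && requireNamespace("twitteR", quietly = TRUE)) {
  # Authenticate this R session with Twitter
  twitteR::setup_twitter_oauth(creds[[1]], creds[[2]], creds[[3]], creds[[4]])

  # Search the public timeline; n sets how many tweets are requested
  gvl_list <- twitteR::searchTwitter("Greenville", n = 25)

  # Convert the list of status objects into a data frame
  gvl_df <- twitteR::twListToDF(gvl_list)
  print(head(gvl_df$text))
}
```

Keeping the keys in environment variables rather than in the script is one way to avoid accidentally publishing them.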
searchTwitter returns data as a list, which may or may not be desirable. As a default, it returns the last 25 items matching the query you pass (this can be changed by using the n= option to the function). I used twListToDF (part of the twitteR package) to convert the list to a data frame.
Analyzing the data
The first thing to notice is that many of these tweets may be “retweets”, where a user posts the exact same tweet as a previous user to create a larger audience for the tweet. This data point may be interesting in its own right, but for now, because we are just analyzing the text, we will filter out retweets:
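A minimal sketch of the filtering step, using a small made-up data frame in place of real search results. It assumes the data frame carries the text, screenName, and isRetweet columns that twListToDF produces; the tweets and handles are invented for illustration.

```r
library(dplyr)

# Toy stand-in for twListToDF() output; real results have many more columns
tweets <- tibble(
  text = c("Greenville farmers market today",
           "RT @someone: Greenville farmers market today",
           "Car fire on Greenville Road"),
  screenName = c("user_a", "user_b", "user_c"),
  isRetweet = c(FALSE, TRUE, FALSE)
)

# Keep only the original (non-retweet) tweets
originals <- tweets %>%
  filter(!isRetweet)

nrow(originals)
```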
The thing to notice here is that there are several different Greenvilles, which makes analysis of the local area pretty hard. Many of the tweets can be about Greenville, NC or SC. In this particular dataset, there was even a Greenville Road in California (where there was a car fire). Rather than play a filtering game, it may be better to apply some knowledge specific to the area. For instance, local tweets will often be tagged with #yeahthatgreenville. So we will search again for the #yeahthatgreenville hashtag (and add a few more tweets as well). This time, we’ll keep retweets:
Here I do two separate queries and add them together using the bind_rows function from dplyr.
Who is tweeting
The first thing we can do is get a list of users who tweet under this hashtag as well as their number of tweets:
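A minimal sketch of the count and the bar chart, on a toy data frame standing in for the real search results (the screen names are made up). The plot is built but not printed here; printing p would draw it.

```r
library(dplyr)
library(ggplot2)

# Toy data: one row per tweet, with hypothetical screen names
tweets <- tibble(
  screenName = c("user_a", "user_b", "user_a", "user_c", "user_a", "user_b")
)

# Number of tweets per user, most prolific first
user_counts <- tweets %>%
  count(screenName, sort = TRUE)

# Bar chart of tweets per user; reorder() sorts bars by descending count
p <- ggplot(tweets,
            aes(x = reorder(screenName, screenName, function(x) -length(x)))) +
  geom_bar() +
  labs(x = "Screen name", y = "Number of tweets") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```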
So I snuck a trick into the above graph. In bar charts presenting counts, I usually prefer to order the bars by descending length. That way I can identify the most and least common screen names quickly. I accomplish this by using x=reorder(screenName, screenName, function(x) -length(x)) in the aes() function above. Now we can see that @GiovanniDodd was the most prolific tweeter in the last 200 tweets I accessed. Some of the prolific tweeters appear to be businesses, such as @CourtyardGreenville, or perhaps tourism accounts.
What users are saying
To analyze what users are saying about “#yeahthatgreenville”, we use the tidytext package. There are a number of packages that can be used to analyze text, and tm used to be a favorite, but tidytext fits within the context of tidy data. We prefer the tidy data framework because it works with data in a specific format and has a number of powerful tools that each have a specific focus but interoperate well, much like the UNIX ideal. Here, tidytext will allow us to use dplyr and similar tools with the pipe operator, which makes the code easier to read and follow.
I used the select function from dplyr to keep only the fields of interest, including text. The unnest_tokens() function creates a long dataset with one row per word, where the single word replaces the text. All the other fields remain unchanged. We can now easily create a bar chart of the words used the most:
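A minimal sketch of the tokenizing step and the word counts such a chart would display. The two tweets and the id column are made up for illustration; unnest_tokens() lowercases the text and strips punctuation by default.

```r
library(dplyr)
library(tidytext)

# Toy tweets standing in for the real search results
tweets <- tibble(
  id = c("1", "2"),
  text = c("Great day downtown, great food", "Downtown traffic is terrible")
)

# One row per word; the id column is carried along unchanged
tweet_words <- tweets %>%
  select(id, text) %>%
  unnest_tokens(word, text)

# Word frequencies, most common first
word_counts <- tweet_words %>%
  count(word, sort = TRUE)

word_counts
```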
This plot is very busy, so we plot, say, the top 20 words:
Unfortunately, this is terribly unexciting. Of course “a”, “to”, “for”, and similar words are going to be at the top. In text mining, we create a list of “stop words”, including these, which are so common they are usually not worth including in an analysis. The tidytext package includes a stop_words data frame to assist us:
We augment stop_words slightly to be useful to us. This involves adding a column to help us filter in the next step and adding some common, uninteresting words: “https”, “t.co”, “yeahthatgreenville”, and “amp”. We filter these out for various reasons, e.g. “https” and “t.co” are used in URLs, “amp” is left over from tokenizing some HTML code, and we searched on “yeahthatgreenville”. Augmenting stop words is a bit of an iterative process, which I’m not showing here, but I went back and forth a few times to get this list.
Now, we can determine which of the words above are stop words and thus not worth analyzing:
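A minimal sketch of augmenting the stop word list and dropping stop words with anti_join. The tweet_words data frame and the "custom" lexicon label are assumptions made for this illustration, and the extra flag column mentioned above is omitted to keep it short.

```r
library(dplyr)
library(tidytext)

# Built-in stop words plus the four extra terms discussed above
my_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("https", "t.co", "yeahthatgreenville", "amp"),
         lexicon = "custom")
)

# Toy tokenized tweets (one word per row)
tweet_words <- tibble(
  word = c("the", "https", "downtown", "amp", "festival", "to")
)

# Keep only the words that do NOT appear in my_stop_words
interesting_words <- tweet_words %>%
  anti_join(my_stop_words, by = "word")

interesting_words
```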
The anti_join function is probably not familiar to most data scientists or statisticians. It is the opposite of a merge in a sense. Basically, the command above matches the tokenized words against the my_stop_words data frame and then removes every row that found a match, leaving only the rows whose key (word) does not match anything in my_stop_words. This is desirable because our my_stop_words dataset contains words we do not want to analyze.
Now we can analyze the more interesting words:
Sentiment analysis is, in short, the quantitative study of the emotional content of text. The most sophisticated analysis, of course, is very difficult, but we can make a start using a simple procedure. Many of the ideas here can be found in a vignette for the package written by Julia Silge and David Robinson.
As a start, we use the Bing lexicon, which labels each word it contains as either positive or negative.
Sentiment analysis is then an exercise in an inner join:
Once you get to this point, sentiment analysis can start fairly easily:
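A minimal sketch of the join and the overall tally. To stay self-contained it uses a tiny hand-made lexicon with the same two columns as the Bing lexicon (word and sentiment); with tidytext loaded, get_sentiments("bing") would supply the real one. The words and ids are made up.

```r
library(dplyr)

# Tiny stand-in for the Bing lexicon (same column names, invented contents)
toy_lexicon <- tibble(
  word = c("love", "great", "terrible", "bad"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Toy tokenized tweets: one row per word, id identifies the tweet
tweet_words <- tibble(
  id = c("1", "1", "1", "2", "2", "3"),
  word = c("love", "great", "downtown", "terrible", "traffic", "great")
)

# The inner join keeps only words that appear in the lexicon
word_sentiments <- tweet_words %>%
  inner_join(toy_lexicon, by = "word")

# Overall number of positive and negative words
sentiment_totals <- word_sentiments %>%
  count(sentiment)

sentiment_totals
```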
There are many more positive words than negative words, so the mood tilts positive in our crude analysis. We can also group by tweet, and see whether there are more positive or negative tweets:
On average, there are 1.3387097 positive words per tweet and 1.0555556 negative words per tweet, if you accept the assumptions of the above analysis.
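A minimal sketch of a per-tweet summary, on made-up data. Labeling a tweet by the sign of its net score (positive words minus negative words) is an assumption for this illustration, not necessarily the rule used to produce the figures above.

```r
library(dplyr)

# Toy output of the lexicon join: one row per matched word
word_sentiments <- tibble(
  id = c("1", "1", "2", "3", "3", "3"),
  sentiment = c("positive", "positive", "negative",
                "positive", "negative", "negative")
)

# Count positive and negative words within each tweet
tweet_sentiment <- word_sentiments %>%
  group_by(id) %>%
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative")
  ) %>%
  mutate(net = positive - negative)

# Label each tweet by the sign of its net score, then tally the labels
tweet_labels <- tweet_sentiment %>%
  mutate(label = case_when(
    net > 0 ~ "positive",
    net < 0 ~ "negative",
    TRUE ~ "neutral"
  )) %>%
  count(label)

# Average number of positive and negative words per tweet
avg_positive <- mean(tweet_sentiment$positive)
avg_negative <- mean(tweet_sentiment$negative)
```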
There is, of course, a lot more that can be done, but this will get you started. For some more sophisticated ideas you can check Julia Silge’s analysis of Reddit data, for instance. Another kind of analysis looking at sentiment and emotional content can be found here (with the caveat that it uses the predecessor to dplyr and thus runs somewhat less efficiently). Finally, it would probably be useful to supplement the above sentiment data frames with situation-specific sentiment terms, such as making goallllllll in the above a positive word.
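A minimal sketch of supplementing a lexicon with a situation-specific term. The base lexicon here is a tiny invented stand-in with the same word and sentiment columns as the Bing lexicon, and the name my_lexicon is hypothetical.

```r
library(dplyr)

# Tiny stand-in for a general-purpose lexicon
base_lexicon <- tibble(
  word = c("great", "terrible"),
  sentiment = c("positive", "negative")
)

# Situation-specific additions
custom_terms <- tibble(word = "goallllllll", sentiment = "positive")

# Combine, keeping the first entry if a word appears twice
my_lexicon <- bind_rows(base_lexicon, custom_terms) %>%
  distinct(word, .keep_all = TRUE)

my_lexicon
```

The combined lexicon can then be used in the inner join in place of the general-purpose one.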
The R packages used here, including twitteR and tidytext, make analyzing content from Twitter easy. This is helpful if you want to analyze, for instance, real-time reactions to events. Above we pulled content from Twitter, split it into words, and analyzed words by frequency while eliminating “uninteresting” words. Then we analyzed whether tweets were on the whole positive or negative using pre-made lexicons that map words to positive or negative.