Greenville on Twitter


In this blog post, we use R to analyze Twitter data for topics of interest to Greenville, SC. We will describe obtaining, manipulating, and summarizing the data.

Twitter is a “microblogging” service where users can, usually publicly, share links, pictures, or short comments (up to 140 characters) onto a timeline. The public timeline consists of all public tweets, but people can build their own private timelines to narrow content to just what they want to see. (They do this by “following” users.) Over the years, many companies, news organizations, and users have considered the social media site essential for sharing news and other information. (Or cat memes.) Twitter has some organizational tools such as replies/conversation threads, mentions (i.e. naming other users using the @ notation), and hashtags (naming a topic using # notation). Twitter has encouraged the use of these organizational tools by automatically making mentions and hashtags clickable links.

These organizational tools can make for some interesting analysis. For instance, a game show may encourage viewers to vote on a winner using hashtags. On their end, they create a filter for a particular hashtag (e.g. #votemyplayer) and count votes. This also makes Twitter data ripe for text mining (which Twitter itself uses to identify trending topics).
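As a toy illustration (entirely made up, with a hypothetical #votemyplayer hashtag and candidate tags), counting such votes in R could look something like this:

# Hypothetical example, not from this post: count "votes" cast via a hashtag
votes <- data.frame(
    text = c("Go team! #votemyplayer #alice",
             "#votemyplayer #bob all the way",
             "dinner was great tonight",
             "#votemyplayer #alice again!"),
    stringsAsFactors = FALSE
)

# keep only tweets using the voting hashtag, then tally by (made-up) candidate tag
voting <- votes[grepl("#votemyplayer", votes$text, fixed = TRUE), ]
table(ifelse(grepl("#alice", voting$text, fixed = TRUE), "alice", "bob"))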

Obtaining the Twitter data

Twitter makes it possible for software to obtain tweets without resorting to “web-scraping” techniques (i.e. downloading the data as a web page and then parsing the HTML). Instead, you can go through an Application Programming Interface (API) to obtain the tweets directly. If you’re interested, Twitter has a whole subdomain devoted to accessing their data, including documentation. There are a lot of technical details, but for the casual user probably the only ones of interest are the API key and rate limits. This post won’t fuss with rate limits, but more serious work may require a further understanding of these issues. However, you will need to create an API key. Follow these instructions, which are tailored for R users; the process essentially consists of creating a token at Twitter’s app web site and running an R function with the token. I set the variables consumer_secret, consumer_key, access_token, and access_secret in an R block by copying and pasting from the Twitter apps site; that block is not echoed in this blog post for obvious reasons.
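For reference, a placeholder version of that block might look like the following; the values here are dummies, and your real keys should never be published or committed to version control:

# Placeholder credentials -- paste your own values from Twitter's apps site.
# Never echo or commit the real keys.
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"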

Fortunately, the twitteR package makes obtaining data from Twitter easy. It’s on CRAN, so grab it using install.packages (it will also install dependencies such as the bit64 and httr packages if you don’t have them already) before moving on.
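If you don’t have it yet, installing from CRAN is a one-liner:

# installs twitteR along with dependencies such as bit64 and httr
install.packages("twitteR")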

We authenticate our R program to Twitter and then start with searching the public timeline for “Greenville”. Note due to the changing nature of Twitter, your results will probably be different:

origop <- options("httr_oauth_cache")
options(httr_oauth_cache = TRUE)
library(twitteR)
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
[1] "Using direct authentication"
options(httr_oauth_cache = origop)


gvl_twitter <- searchTwitter("Greenville")
gvl_twitter_df <- twListToDF(gvl_twitter)

head(gvl_twitter_df)
                                                                                                                                                                   text
1                                  Can you recommend anyone for this #job? Delivery Driver - https://t.co/DRHtAzXQYO #Transportation #Greenville, NC #Hiring #CareerArc
2                                                        @marcelllamariee @ashlynmariieee shoulda went to Greenville I would have went and we would have fucked shit up
3                                                                                          I guess I'll go to Greenville today <ed><U+00A0><U+00BD><ed><U+00B8><U+0098>
4                          RT @sportsguymarv: In a loss to Cook Co., Brittany Davis from Greenville High scrambled a Triple-Double. 52 12 10. || #A1Skills #GHSA #High<U+0085>
  favorited favoriteCount       replyToSN             created truncated
1     FALSE             0            <NA> 2016-12-23 17:55:48     FALSE
2     FALSE             0 marcelllamariee 2016-12-23 17:54:59     FALSE
3     FALSE             0            <NA> 2016-12-23 17:54:24     FALSE
4     FALSE             0            <NA> 2016-12-23 17:53:34     FALSE
          replyToSID                 id         replyToUID
1               <NA> 812356044567375872               <NA>
2 812292443509026816 812355837901553664 777670063751135233
3               <NA> 812355693948837888               <NA>
4               <NA> 812355480425263104               <NA>
                                                                          statusSource
1                  <a href="http://www.tweetmyjobs.com" rel="nofollow">TweetMyJOBS</a>
2   <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
3   <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
4 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
     screenName retweetCount isRetweet retweeted   longitude   latitude
1 tmj_NC_transp            0     FALSE     FALSE -77.3936674 35.6096532
2  s_danielss16            0     FALSE     FALSE        <NA>       <NA>
3 __GorgeousNiq            0     FALSE     FALSE        <NA>       <NA>
4    QuayBrizzy           14      TRUE     FALSE        <NA>       <NA>
 [ reached getOption("max.print") -- omitted 2 rows ]

searchTwitter returns data as a list, which may or may not be desirable. As a default, it returns the last 25 items matching the query you pass (this can be changed by using the n= option to the function). I used twListToDF (part of the twitteR package) to convert to a data frame. The data frame contains a lot of useful information, such as the tweet, information about whether it’s a reply and the tweet to which it’s a reply, screen name, and date stamp. Thus, Twitter provides a rich data source to provide information on topics, interactions, and reactions.
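For example, searchTwitter accepts additional arguments such as n, since, and until (see ?searchTwitter for the full list). A sketch, with purely illustrative dates:

# Pull up to 500 tweets from a restricted date range (dates are illustrative)
gvl_bigger <- searchTwitter("Greenville", n = 500,
                            since = "2016-12-01", until = "2016-12-23")
gvl_bigger_df <- twListToDF(gvl_bigger)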

Analyzing the data

Retweets

The first thing to notice is that many of these tweets may be “retweets”, where a user posts the exact same tweet as a previous user to create a larger audience for the tweet. This data point may be interesting in its own right, but for now, because we are just analyzing the text, we will filter out retweets:

library(dplyr)
gvl_twitter_unique <- gvl_twitter_df %>% filter(!isRetweet)

print(gvl_twitter_unique %>% select(text))
                                                                                                                                                                    text
1                                   Can you recommend anyone for this #job? Delivery Driver - https://t.co/DRHtAzXQYO #Transportation #Greenville, NC #Hiring #CareerArc
2                                                         @marcelllamariee @ashlynmariieee shoulda went to Greenville I would have went and we would have fucked shit up
3                                                                                           I guess I'll go to Greenville today <ed><U+00A0><U+00BD><ed><U+00B8><U+0098>
4  @Uber_Support We need one of these in Greenville SC!!!  Where are you guys?\nWhen are you opening a Greenlight station here? <ed><U+00A0><U+00BD><ed><U+00B8><U+008A>
5                                                                                                                Greenville Police Beat 12-23-16 https://t.co/Lt4rbams5z
6                                                                                                                        #market risk solutions greenville cancer center
7                                                      Can you recommend anyone for this #job in #Greenville, SC? https://t.co/Xt7hznlGgA #Purchasing #Hiring #CareerArc
8                                                            Check out my #listing in #Liberty #SC  #realestate #realtor https://t.co/CuzFTjp5ge https://t.co/sD1VypKlTu
9                                     Join the Aerotek team! See our latest #job opening here: https://t.co/gZ6jEIqOyN #Manufacturing #Greenville, SC #Hiring #CareerArc
10                                                        @Sie_SoSweet I'm driving to greenville and I don't feel like stopping <ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
11                            Here's a little behind the scenes peek of 2017's 1st Off the Grid Greenville, A Visual Guide to Local Favorites... https://t.co/M33IWBDAZe
12                                                                                       #greenville electric company best western plus muskoka inn huntsville on canada
13                                                 Registered Nurse - $5,000 Sign On Bonus - Vidant Home Hospice - 925571 in Greenville, NC https://t.co/xONI21XSDq #job
14                                                              Apply to this job: Population Health Analyst Job - 926645 in Greenville, NC https://t.co/6J5IeqNak8 #job
15                                                   Greenville may be the TU tonight . <ed><U+00A0><U+00BE><ed><U+00B4><U+0094><ed><U+00A0><U+00BE><ed><U+00B4><U+0094>
16                                              I had the privilege of meeting Lottie Gibson back when James Akers Jr ran for Greenville County<U+0085> https://t.co/UzsRoWMNdr
17                                             Done been to Greenville 3 times this week<ed><U+00A0><U+00BD><ed><U+00B8><U+00BC><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
18                                                                                                                   Greenville <ed><U+00A0><U+00BD><ed><U+00B3><U+008D>
19                                                                                       What nail place in Greenville is the best bc every one I go to effs up my nails
20                                                                                                 I'm at @Walmart Supercenter in Greenville, TX https://t.co/uYSIZPA4VE
21    Goodbye Greenville, take off in 20, New York in an hour! <ed><U+00A0><U+00BD><ed><U+00BB><U+00A9> #TakeOff #NewYorkCity #BigAppleChristmas https://t.co/upN8kaGONK

The thing to notice here is that there are several different Greenvilles, so this makes analysis of the local area pretty hard. Many of the tweets can be about Greenville, NC or SC. In this particular dataset, there was even a Greenville Road in California (where there was a car fire). Rather than play a filtering game, it may be better to apply some knowledge specific to the area. For instance, local tweets will often be tagged with #yeahThatgreenville. So we will search again for the #yeahthatgreenville hashtag (and add a few more tweets as well). This time, we’ll keep retweets:

origop <- options("httr_oauth_cache")
options(httr_oauth_cache = TRUE)
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)  # needed to knit the Rmd file, may not be necessary for you to reauthenticate in 1 session
[1] "Using direct authentication"
options(httr_oauth_cache = origop)
gvl_twitter_unique <- searchTwitter("#yeahthatgreenville", n = 200) %>% twListToDF()

gvl_twitter_nolink <- gvl_twitter_unique %>% mutate(text = gsub("https?://[\\w\\./]+", 
    "", text, perl = TRUE))

If you want more tweets than one query provides, you can run separate queries and stack the results using the bind_rows function from dplyr, as sketched below.
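A sketch of that pattern, with the second search string chosen purely for illustration:

# Combine two searches and drop tweets returned by both
# (the second query string is only an example)
q1 <- searchTwitter("#yeahthatgreenville", n = 200) %>% twListToDF()
q2 <- searchTwitter("Greenville, SC", n = 100) %>% twListToDF()
gvl_combined <- bind_rows(q1, q2) %>% distinct(id, .keep_all = TRUE)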

Who is tweeting

The first thing we can do is get a list of users who tweet under this hashtag, along with their number of tweets:

library(ggplot2)

gvl_twitter_nolink %>% ggplot(aes(x = reorder(screenName, screenName, function(x) -length(x)))) + 
    geom_bar() + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
    xlab("")

(Plot: number of tweets per screen name, bars in descending order)

So I snuck a trick into the above graph. In bar charts presenting counts, I usually prefer to order the bars by descending length; that way I can identify the most and least common screen names quickly. I accomplish this with x = reorder(screenName, screenName, function(x) -length(x)) in the aes() call above. Now we can see that @GiovanniDodd was the most prolific tweeter in the last 200 tweets I accessed. Some of the prolific tweeters appear to be businesses, such as @CourtyardGreenville, or perhaps tourism accounts such as @Greenville_SC.
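An equivalent way to get the same descending order, which some may find more readable, is fct_infreq() from the forcats package (not used in the original code):

# Alternative ordering with forcats::fct_infreq(), which sorts factor
# levels by frequency, most common first
library(forcats)
gvl_twitter_nolink %>% ggplot(aes(x = fct_infreq(screenName))) + 
    geom_bar() + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
    xlab("")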

What users are saying

To analyze what users are saying about “#yeahthatgreenville”, we use the tidytext package. There are a number of packages that can be used to analyze text, and tm has long been a favorite, but tidytext fits within the tidy data framework. We prefer that framework because it standardizes the data format and offers a number of focused tools that interoperate well, much like the UNIX ideal. Here, tidytext lets us use dplyr and similar tools with the pipe operator, which makes the code easier to read and follow.

library(tidytext)

tweet_words <- gvl_twitter_nolink %>% select(id, text) %>% unnest_tokens(word, 
    text)

head(tweet_words)
                    id      word
1   812352595159355392       the
1.1 812352595159355392      wall
1.2 812352595159355392      gods
1.3 812352595159355392 mountains
1.4 812352595159355392       amp
1.5 812352595159355392    cities

I used the select function from dplyr to keep only the id and text fields. The unnest_tokens() function creates a long dataset with a single word per row in place of the text; all the other fields remain unchanged.
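To see what unnest_tokens() does on its own, here is a tiny made-up example (using the packages already loaded above); by default it also lowercases the text and strips punctuation:

# Toy example, not from the Greenville data: one row of text becomes
# one row per word, lowercased, with punctuation dropped
toy <- data.frame(id = 1, text = "Greenville is GREAT, y'all!",
                  stringsAsFactors = FALSE)
toy %>% unnest_tokens(word, text)

We can now easily create a bar chart of the words used the most: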

tweet_words %>% ggplot(aes(x = reorder(word, word, function(x) -length(x)))) + 
    geom_bar() + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
    xlab("")

(Plot: counts of every word used in the tweets)

This plot is very busy, so we plot, say, the top 20 words:

tweet_words %>% count(word, sort = TRUE) %>% slice(1:20) %>% ggplot(aes(x = reorder(word, 
    n, function(n) -n), y = n)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 60, 
    hjust = 1)) + xlab("")

(Plot: the 20 most frequent words)

Unfortunately, this is terribly unexciting. Of course “a”, “to”, “for”, and similar words are going to be at the top. In text mining, we keep a list of “stop words”: words like these that are so common they are usually not worth including in an analysis. The tidytext package includes a stop_words data frame to assist us:

head(stop_words)
# A tibble: 6 × 2
       word lexicon
      <chr>   <chr>
1         a   SMART
2       a's   SMART
3      able   SMART
4     about   SMART
5     above   SMART
6 according   SMART

We’ll change stop_words slightly to be useful to us. This involves dropping the lexicon column (so we can bind new rows to it cleanly) and adding some common, uninteresting words: “https”, “t.co”, “yeahthatgreenville”, “amp”, and “gvl”. We filter these out for various reasons, e.g. “https” and “t.co” come from URLs, “amp” is left over from tokenizing some HTML code, and we searched on “yeahthatgreenville” in the first place. Augmenting stop words is a bit of an iterative process, which I’m not showing here, but I went back and forth a few times to get this list.

my_stop_words <- stop_words %>% select(-lexicon) %>% bind_rows(data.frame(word = c("https", 
    "t.co", "yeahthatgreenville", "amp", "gvl")))

Now, we can determine which of the words above are stop words and thus not worth analyzing:

tweet_words_interesting <- tweet_words %>% anti_join(my_stop_words)

head(tweet_words_interesting)
                  id word
1 812352595159355392 wall
2 812352159916367873 wall
3 812351889681633280 wall
4 812352595159355392 gods
5 812352159916367873 gods
6 812351889681633280 gods

The anti_join function is probably not familiar to most data scientists or statisticians. It is, in a sense, the opposite of a merge: the command above keeps only the rows of tweet_words (the id and word) whose word does not match anything in my_stop_words, and drops the rest. This is exactly what we want, because my_stop_words contains the words we do not want to analyze.
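A minimal made-up example may make the behavior concrete:

# Toy illustration of anti_join (data invented for clarity): keep the rows
# of x whose word has no match in y
x <- data.frame(id = 1:3, word = c("greenville", "the", "coffee"),
                stringsAsFactors = FALSE)
y <- data.frame(word = c("the", "a"), stringsAsFactors = FALSE)
anti_join(x, y, by = "word")  # returns the "greenville" and "coffee" rows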

Now we can analyze the more interesting words:

tweet_words_interesting %>% count(word, sort = TRUE) %>% slice(1:20) %>% ggplot(aes(x = reorder(word, 
    n, function(n) -n), y = n)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 60, 
    hjust = 1)) + xlab("")

(Plot: the 20 most frequent words after removing stop words)

Sentiment analysis

Sentiment analysis is, in short, the quantitative study of the emotional content of text. The most sophisticated analysis, of course, is very difficult, but we can make a start using a simple procedure. Many of the ideas here can be found in a vignette for the tidytext package, written by Julia Silge and David Robinson.

As a start, we use the Bing lexicon, which maps a word to positive/negative according to whether its sentiment content is positive or negative.

bing_lex <- get_sentiments("bing")

head(bing_lex)
# A tibble: 6 × 2
        word sentiment
       <chr>     <chr>
1    2-faced  negative
2    2-faces  negative
3         a+  positive
4   abnormal  negative
5    abolish  negative
6 abominable  negative

Sentiment analysis is then essentially an exercise in joining; here we use a left join, so words with no match in the lexicon are kept with an NA sentiment:

gvl_sentiment <- tweet_words_interesting %>% left_join(bing_lex)

head(gvl_sentiment)
                  id word sentiment
1 812352595159355392 wall      <NA>
2 812352159916367873 wall      <NA>
3 812351889681633280 wall      <NA>
4 812352595159355392 gods      <NA>
5 812352159916367873 gods      <NA>
6 812351889681633280 gods      <NA>

Once you get to this point, sentiment analysis can start fairly easily:

gvl_sentiment %>% filter(!is.na(sentiment)) %>% group_by(sentiment) %>% summarise(n = n())
# A tibble: 2 × 2
  sentiment     n
      <chr> <int>
1  negative    19
2  positive    83

There are many more positive words than negative words, so the mood tilts positive in our crude analysis. We can also group by tweet and look at the average number of positive and negative words per tweet:

gvl_sent_anly2 <- gvl_sentiment %>% group_by(sentiment, id) %>% summarise(n = n()) %>% 
    ungroup() %>% group_by(sentiment) %>% summarise(n = mean(n, na.rm = TRUE))

gvl_sent_anly2
# A tibble: 3 × 2
  sentiment        n
      <chr>    <dbl>
1  negative 1.055556
2  positive 1.338710
3      <NA> 6.040404

On average, there are about 1.34 positive words and 1.06 negative words per tweet (among tweets containing at least one word of that sentiment), if you accept the assumptions of the above analysis.

There is, of course, a lot more that can be done, but this will get you started. For some more sophisticated ideas you can check Julia Silge’s analysis of Reddit data, for instance. Another kind of analysis looking at sentiment and emotional content can be found here (with the caveat that it uses the predecessor to dplyr and thus runs somewhat less efficiently). Finally, it would probably be useful to supplement the above sentiment data frames with situation-specific terms, for example counting a word like “goallllllll” as positive.
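As a rough sketch of that last idea, one could append custom entries to the Bing lexicon before joining (the added word below is just the example from the text; it is not part of the Bing lexicon):

# Sketch: add a situation-specific positive word to the sentiment lexicon
custom_lex <- bind_rows(bing_lex,
                        data.frame(word = "goallllllll", sentiment = "positive",
                                   stringsAsFactors = FALSE))
gvl_sentiment_custom <- tweet_words_interesting %>% left_join(custom_lex)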

Conclusions

The R packages twitteR and tidytext make analyzing content from Twitter easy. This is helpful if you want to analyze, for instance, real time reactions to events. Above we pulled content from Twitter, split it into words, and analyzed words by frequency while eliminating “uninteresting” words. Then we analyzed whether tweets were on the whole positive or negative using pre-made lexicons mapping words to positive or negative.
