How are you feeling..? – Election 2015

[This article was first published on Mango Solutions, and kindly contributed to R-bloggers.]

The way we communicate is changing. The social media revolution can literally change governments. Twitter is one of the leading media through which we, the people, pour forth our informed opinion or raging vitriol, our messages of peace or diatribes of hate. For better or worse, our voices have never been so loud. And when better to listen than on election day? Twitter gives us here at Mango a fantastic opportunity to quantify the mood of a nation. Today I’m going to show you the impact of last-minute campaigning on the way Twitter users may vote.

So how did we go about this? First, we collected tweets that contained general election hashtags such as #GE2015, as well as the party-specific hashtags. We then extracted these into a plain text file that looked a little like the data below, taken from an initial run last week:

> head(dat)
                            date               hashtags                 id lang
1 Wed Apr 29 13:25:39 +0000 2015                        593405787155881987   en
2 Wed Apr 29 13:25:39 +0000 2015                 GE2015 593405789487964161   en
3 Wed Apr 29 13:25:40 +0000 2015                 Greens 593405790523936768   en
4 Wed Apr 29 13:25:40 +0000 2015           GE15 voteSNP 593405791589269504   en
5 Wed Apr 29 13:25:42 +0000 2015             votelabour 593405798736338944  und
6 Wed Apr 29 13:25:42 +0000 2015 UKIP conspiracy GE2015 593405802699980800   en
      screen_name
1    IngreyLouise
2      leocullen4
3 STynesideGreens
4    srahmanburgh
5    bryanellis01
6      LDNCalling
                                                                                                                                          text
1 Vote Green in Leicester Castle Ward! :) https://t.co/gAxbWeFnQo
2 RT @WillBlackWriter: David Cameron says “No income tax, no VAT….and this time in 2020 you’ll be millionaires.”\n\n#GE2015 http://t.co/f61SK…
3 RT @martinbrampton: Top economist attacks Tory austerity – and Labour’s limp response http://t.co/hEJydiIOuo Only #Greens offer real change…
4 RT @NicolaSturgeon: Forget polls – only votes win elections. The more seats @theSNP win, the stronger Scotland will be. Let’s keep working …
5 https://t.co/RJIeGGUv2n #votelabour
6 RT @PeterMannionMP: “If the…polls are off by 15 (fifteen) % #UKIP win around 100 seats.” Yeah, good luck with that. #conspiracy #GE2015 h…
      timestamp                    urls         user_mentions            user_name
1 1430313939261 https://t.co/gAxbWeFnQo                               Louise Young
2 1430313939817                               WillBlackWriter          leon Cullen
3 1430313940064  http://t.co/hEJydiIOuo        martinbrampton South Tyneside Green
4 1430313940318                         NicolaSturgeon theSNP         selma rahman
5 1430313942022 https://t.co/RJIeGGUv2n                                      bryan
6 1430313942967                                PeterMannionMP          Simon Mason

That is a lot of detail in every tweet! The sheer volume of tweets (some 300,000 records from the last 36 hours) and the amount of detail in each meant that our analysis had to be automated, and what better tool to use than R?! I used the polarity function from the package qdap to generate a numeric opinion from each tweet. The function generates an approximate positive or negative sentiment (or polarity).

> polarity(c("happy", "smile", "pleased"))

  all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
1 all               3           3            1           0                Inf

> polarity(c("sad", "cross", "angry"))
  all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
1 all               3           3       -0.667       0.577             -1.155

> polarity("oh so plain")
  all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
1 all               1           3            0          NA                 NA

The numeric value is generated from a data table of word-score pairs. The default dictionary is key.pol from qdapDictionaries.

> qdapDictionaries::key.pol

               x  y
   1:     a plus  1
   2:   abnormal -1
   3:    abolish -1
   4: abominable -1
   5: abominably -1
  ---
6775:  zealously -1
6776:     zenith  1
6777:       zest  1
6778:      zippy  1
6779:     zombie -1
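To make the word-score idea concrete, here is a minimal sketch in base R. It assumes a tiny made-up dictionary (`toy_dict`) and a hypothetical helper (`toy_polarity`) that sums the scores of known words and divides by the square root of the word count. It deliberately ignores the negators and amplifiers that qdap also takes into account, so it is an illustration of the lookup, not a replacement for `polarity`.

```r
# Toy word-score table; these four entries are illustrative, not key.pol itself
toy_dict <- c(happy = 1, pleased = 1, sad = -1, angry = -1)

# Simplified polarity: sum the scores of known words, divide by sqrt(word count)
toy_polarity <- function(txt, dict = toy_dict) {
    words <- strsplit(tolower(txt), "[^a-z']+")
    vapply(words, function(w) {
        w <- w[nzchar(w)]                 # drop empty tokens
        if (length(w) == 0L) return(NA_real_)
        scores <- dict[w]                 # NA for words not in the table
        sum(scores, na.rm = TRUE) / sqrt(length(w))
    }, numeric(1))
}

toy_polarity(c("happy", "sad and angry", "oh so plain"))
```

Words missing from the table contribute nothing to the sum but still count towards the length, which is why a neutral phrase such as "oh so plain" scores zero.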

The first step is to decide to which party the tweet is referring. Looking at representative records within the dataset, some instances mentioned just one party in each tweet:

“Vote Green in Leicester Castle Ward! :)”

If we search for the string “Green” and assume that, given the provenance of the data, it refers to the Green Party, we can classify that tweet as expressing an opinion about the Green Party.

However, consider the following tweet:

“Top economist attacks Tory austerity – and Labour’s limp response http://t.co/hEJydiIOuo Only #Greens offer real change”

In this case, we could match the strings “Tory”, “Labour”, and “Green”, but automatically working out which sentiment attaches to which party is a lot more challenging. As this was outside the scope of this particular piece of work, I decided that any tweet mentioning more than one party would be ignored. My function therefore searches for party names using regular expressions, but does not classify a tweet if more than one party is matched.

#' Classify to which party a tweet is referring
#'
#' Provide a pattern for each party and return a vector of labels.
#' Currently a simple search to find tweets that contain patterns matching
#' each party. Tweets mentioning multiple parties are not currently analysed.
#'
#' @param txt character vector
#' @param parties character vector of party names
#' @param patterns character vector of length parties (optional)
#' @param asis single logical if FALSE drop to lower case
#' @return character vector of length txt with value parties, or ""
#' @examples
#' findParty(c("Conservatives", "Greens", "cons", "tories"))
#' findParty(c("Conserv", "Greens", "Conserv cons",
#'     "Conserv tories", "Conserv snp", "snp", NA))

findParty <- function(txt, parties = c("Conservative", "Labour"),
    patterns = NULL, asis = FALSE) {

    if (is.null(patterns)) { patterns <- parties }
    if (!asis) {
        txt <- casefold(x = txt, upper = FALSE)
        patterns <- casefold(x = patterns, upper = FALSE)
    }
    out <- character(length = length(txt))
    # one row per tweet, one column per party
    findMat <- matrix(FALSE, nrow = length(txt), ncol = length(parties))
    for (party in seq_along(parties)) {
        findMat[, party] <- grepl(pattern = patterns[party], x = txt)
    }
    # only label tweets that match exactly one party
    justOne <- apply(X = findMat, MARGIN = 1L, FUN = sum, na.rm = TRUE) == 1L
    for (party in seq_along(parties)) {
        out[justOne & findMat[, party, drop = TRUE]] <- parties[party]
    }
    return(out)
}

Even when not writing an R package I always use roxygen2 headers now, to remind me to make sure that there’s sufficient information for others to understand my work. It really isn’t much extra effort, and you’ll thank yourself later. As described above, this function creates a matrix of results to allow each pattern to be matched in turn, then classifies where exactly one match is made.
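The matrix-of-matches idea can be seen on a handful of tweets. The sketch below is a stand-alone illustration rather than a call to findParty itself: the three lower-cased strings are adapted from the sample data, and the `patterns` vector (including the "tory|tories" alternatives) is an assumption made for the example.

```r
tweets <- c("vote green in leicester castle ward! :)",
            "top economist attacks tory austerity and labour's limp response only #greens offer real change",
            "#votelabour")

# Hypothetical patterns: one regular expression per party
patterns <- c(Conservative = "conservative|tory|tories",
              Labour = "labour",
              Green = "green")

# One row per tweet, one column per party
hits <- vapply(patterns, function(p) grepl(p, tweets), logical(length(tweets)))

# Label only the tweets that match exactly one party
one <- rowSums(hits) == 1L
label <- character(length(tweets))
for (p in colnames(hits)) {
    label[one & hits[, p]] <- p
}
label
```

The second tweet matches all three patterns, so it stays unlabelled, which is the behaviour described above.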

I then made another function that uses the date to create date group bins, performs the polarity calculation, and returns the result. The dictionary lookup methods can be a little slow for mid-sized datasets like this, so I added parallelization for this loop.
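As a small illustration of the binning step, the sketch below parses timestamps in the format shown in the data and groups them by minute. The first two timestamps come from the sample above and the third is invented to give a second bin. It assumes an English locale, since `%a` and `%b` are matched against locale-specific day and month names.

```r
stamps <- c("Wed Apr 29 13:25:39 +0000 2015",
            "Wed Apr 29 13:25:42 +0000 2015",
            "Wed Apr 29 13:26:03 +0000 2015")   # third value is illustrative

# "+0000" is matched literally, so the time zone is set explicitly to UTC
when <- as.POSIXct(stamps, format = "%a %b %d %H:%M:%S +0000 %Y", tz = "UTC")

# Dropping the seconds gives one bin per minute
bins <- format(when, "%Y-%m-%d %H:%M")
table(bins)
```

Changing the output format string changes the bin width, for example "%Y-%m-%d %H" would group by hour.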

#' Get Polarity of Groups
#'
#' Classify text of tweets in a data file and then use qdap
#' sentiment polarity analysis to guess opinion of tweet.
#'
#' @param data data.frame with columns \enumerate{
#'     \item  date character date with specified format
#'     \item  text character message posted by user_name at date
#' }
#' The following columns are expected but not currently used \enumerate{
#'     \item  hashtags character
#'     \item  id numeric
#'     \item  lang label, typically "en", also "und", "fr", "cy", etc.
#'     \item  screen_name
#'     \item  timestamp numeric
#'     \item  urls
#'     \item  user_mentions
#'     \item  user_name
#' }
#' @param fmt single character specifying format of date column (see ?strptime)
#' @param summaryfmt single character specifying
#' output format of time grouping column (see ?strptime)
#' @param parties character vector of groups to assign
#' @param patterns character vector of length parties; matched as-is against
#' lower-case text, so supply lower-case patterns
#' @param ncores single integer max number of cores across which to split group search
#' (default 2)
#' @param onlyclassified single logical should only classified records be returned?
#' (default TRUE)
#' @return data frame with columns Time, Party and Score
#' @import qdap foreach doSNOW
#' @examples
#' littledat <- structure(list(date = c(
#'         "Wed Apr 29 13:25:39 +0000 2015", "Wed Apr 29 13:25:39 +0000 2015",
#'         "Wed Apr 29 13:25:40 +0000 2015", "Wed Apr 29 13:25:40 +0000 2015",
#'         "Wed Apr 29 13:25:42 +0000 2015"), text = c("Vote Green in Leicester Castle Ward! :) https://t.co/gAxbWeFnQo",
#'         "RT @WillBlackWriter: David Cameron says \"No income tax, no VAT....and this time in 2020 you'll be millionaires.\"\n\n#GE2015 http://t.co/f61SK…",
#'         "RT @martinbrampton: Top economist attacks Tory austerity – and Labour's limp response http://t.co/hEJydiIOuo Only #Greens offer real change…",
#'         "RT @NicolaSturgeon: Forget polls - only votes win elections. The more seats @theSNP win, the stronger Scotland will be. Let's keep working …",
#'         "https://t.co/RJIeGGUv2n #votelabour")),
#'     .Names = c("date", "text"), class = "data.frame", row.names = c(NA, 5L))
#' littleres <- getGroups(data = littledat)
#' \dontrun{
#' system.time(res <- getGroups(data = dat))
#' }
getGroups <- function(data, fmt = "%a %b %d %H:%M:%S +0000 %Y",
    summaryfmt = "%Y-%m-%d %H:%M",
    parties = c("Conservative", "Labour"),
    patterns = NULL,
    ncores = 2L, onlyclassified = TRUE) {

    if (missing(data)) { stop("data is missing") }
    if (!all(c("date", "text") %in% colnames(data))) {
        stop("columns 'date' and 'text' must be present") }
    if (is.null(patterns)) {
        patterns <- parties
        asis <- FALSE
    } else {
        if (length(patterns) != length(parties)) {
            stop("there must be one pattern for each party") }
        asis <- TRUE
    }
    # get date, then sort by it so that each time group is contiguous and the
    # scores combined at the end line up with the rows of the output
    data$date <- as.POSIXct(x = data$date, format = fmt)
    data <- data[order(data$date), ]
    # group by time period
    tGroups <- format.POSIXct(x = data$date, format = summaryfmt)
    uGroups <- unique(tGroups)
    nGroups <- length(uGroups)

    # remove websites
    txt <- gsub(pattern = "http(s){0,1}://t.co/[A-Za-z0-9]{2,10}",
        replacement = "", x = data$text)
    txt <- casefold(x = txt, upper = FALSE)
    # find party
    party <- findParty(txt = txt, parties = parties,
        patterns = patterns, asis = asis)
    # remove party names (txt is already lower case, so lower-case the names too)
    for (rem in seq_along(parties)) {
        txt <- gsub(pattern = paste0(casefold(x = parties[rem], upper = FALSE), "[a-z]{0,9} "),
            replacement = "", x = txt)
    }

    # set up cluster on local machine
    cl <- makeCluster(ncores)
    registerDoSNOW(cl)
    # a foreach loop using local cluster, one iteration per time group
    res <- foreach(i = seq_len(nGroups),
        .packages = "qdap") %dopar% {
        # skip unclassified records if onlyclassified
        if (onlyclassified) {
            useRecords <- party != "" & tGroups == uGroups[i]
        } else {
            useRecords <- tGroups == uGroups[i]
        }
        # get polarity of each record in this group
        pol <- rep(NA, times = sum(useRecords))
        if (sum(useRecords) > 0L) {
            pol <- polarity(txt[useRecords], constrain = TRUE)$all[, "polarity"]
        }
        return(pol)
    }
    # tear down cluster
    stopCluster(cl)

    dataGrouped <- data.frame("Time" = tGroups, "Party" = party)
    if (onlyclassified) { dataGrouped <- dataGrouped[party != "", ] }
    dataGrouped$"Score" <- do.call("c", res)
    return(dataGrouped)
}

The example for this function shows how the function can be used to get some quantitative measure from data of this structure.

We can then visualize these results:

library(ggplot2)
theme_set(theme_bw(base_size = 14))
theme_update(axis.text.x = element_text(angle = 90, vjust = 1))

# Create basic plot with smoother
partyPlot <- ggplot(aes(x = Time, y = Score), data = res) +
    geom_point(aes(colour = Party)) +
    geom_smooth(colour = "black", size = 2) +
    facet_wrap( ~ Party)

# Party colours
partyPlot <- partyPlot +
    scale_colour_manual(values = c("#14427311", "#6BAE2011", "#FA122C11", "#FF8C3C11",
        "#41852D11", "#FCBA4011", "#8C227E11"))
partyPlot


The plot above shows interesting patterns in the tweets sent during the last hours of the 2015 election. The Conservatives have the largest volume of tweets and also the largest variation in sentiment. Plaid Cymru received the smallest number of tweets, but they are largely positive. The Liberal Democrats show an upward trend, while the SNP show a downward one. More analysis would be required to determine the statistical significance of these results.
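A first step towards that further analysis might be a per-party summary of the scores. The sketch below is only an illustration of the calculation: the `toy` data frame holds made-up scores in the same Party/Score layout that getGroups returns, and none of the numbers come from the election data.

```r
# Made-up scores, only to show the shape of the calculation
toy <- data.frame(
    Party = c("Labour", "Labour", "Labour", "Green", "Green", "Green"),
    Score = c(0.5, -0.5, 0, 0.5, 1, 0)
)

# Count, mean and standard error of the score for each party
n   <- tapply(toy$Score, toy$Party, length)
avg <- tapply(toy$Score, toy$Party, mean)
se  <- tapply(toy$Score, toy$Party, function(s) sd(s) / sqrt(length(s)))

summ <- data.frame(Party = names(n), n = as.vector(n),
                   mean = as.vector(avg), se = as.vector(se))
summ
```

Comparing each mean with its standard error gives a rough sense of whether a party's average sentiment is distinguishable from zero, although tweets are unlikely to be independent observations.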

As mentioned above, we could certainly improve the sophistication with which we examine tweets, but this serves as a demonstration that the voice of public opinion can be quantitatively captured and analysed. So for our next post, we’re going to show you in more depth how we used the Twitter API to capture a portion of the tweets posted in the 2015 General Election.
