More Republican debate analysis with R

Posted on January 6, 2016 by En El Margen - R-English in R bloggers

[This article was first published on En El Margen - R-English, and kindly contributed to R-bloggers].

A few weeks late, here is a follow-up analysis in R of the transcript of the latest Republican primary debate, held in Las Vegas, Nevada.

As in the previous post, it should be interesting to see some word-clouds and some trends from the front-runners (and of course, Donald Trump).

Getting and cleaning the data

As in the last post, we’re going to import the data and clean it with a function that was nicely improved by Alan Jordan:

# some packages for scraping and cleaning the data
library(rvest)
library(plyr)
library(dplyr)
library(stringi)
library(magrittr)

# function to partially separate and clean into a data.frame a debate from the presidency project
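MakeDebateDF <- function(df){
  newdf <- data.frame(
    # speaker tag: capitals, apostrophes or hyphens right before ": "
    person = apply(df, 
                   MARGIN = 1, 
                   function(x){
                     stri_extract_first_regex(x, 
                                              "[A-Z'-]+(?=(:\\s))")
                   }),
    # the same line with the leading "SPEAKER: " tag removed
    message = apply(df, 
                    MARGIN = 1, 
                    function(x){
                      stri_replace_first_regex(x,
                                               "[A-Z'-]+:\\s+", 
                                               "")
                    }),
    stringsAsFactors = FALSE
  )
  # lines without a speaker tag belong to whoever spoke last
  for (j in seq_len(nrow(newdf))[-1]) { 
    if (is.na(newdf[j, 'person'])) {
      newdf[j, 'person'] <- newdf[(j - 1), 'person']
    }
  }

  return(newdf)
}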
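
To see what the cleaning step does, here is a minimal sketch of the same idea on a handful of made-up transcript lines. It uses base R regex functions rather than stringi so it runs without extra packages; the toy_* names and the sample lines are assumptions for illustration only, not part of the original data.

```r
# Toy transcript lines (invented for illustration, not from the real debate)
toy_lines <- c("TRUMP: Thank you.",
               "It is great to be here.",
               "BLITZER: Senator Cruz?",
               "CRUZ: Thank you, Wolf.")

# Speaker tag: capitals, apostrophes or hyphens followed by ": "
m <- regexpr("[A-Z'-]+(?=:\\s)", toy_lines, perl = TRUE)
toy_person <- rep(NA_character_, length(toy_lines))
toy_person[m > 0] <- regmatches(toy_lines, m)

# Message: the line with the leading "SPEAKER: " tag stripped
toy_message <- sub("[A-Z'-]+:\\s+", "", toy_lines, perl = TRUE)

# Lines with no tag are attributed to the previous speaker
for (j in seq_along(toy_person)[-1]) {
  if (is.na(toy_person[j])) toy_person[j] <- toy_person[j - 1]
}

toy_df <- data.frame(person = toy_person,
                     message = toy_message,
                     stringsAsFactors = FALSE)
print(toy_df)
```

The second line has no speaker tag, so it inherits TRUMP from the line above, which is the behaviour the for loop in MakeDebateDF is there to provide.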

This time I’m only downloading one debate, and joining it with the last four I had parsed…

# Importing debates --- 
# url for all debates
url <- "http://www.presidency.ucsb.edu/ws/index.php?pid="

### -------- debate in Las Vegas, Nevada (fifth debate)
lasvegas <- "111177"

debate_v <- read_html(paste0(url, lasvegas)) %>% 
  html_nodes("p") %>%
  html_text()

debate_v <- ldply(debate_v, rbind)
debate_v <- MakeDebateDF(debate_v)

Analyzing

Let’s join this data with the previous debates and see some stats and wordclouds…

# the last 4 debates were stored in "all_debates" object...
all_debates <- rbind(all_debates, 
                     debate_v)

Because he’s the most interesting to watch, let’s see what Trump says overall and in this debate…

library(ggplot2)
# this is for order_axis and theme_eem
# it can be downloaded using 
# devtools::install_github("eflores89/eem")
library(eem)
# all debates
trump_words <- apply(subset(all_debates, person == "TRUMP")['message'],
                    1,
                    paste)
# cloud
# function taken from: 
# http://www.sthda.com/english/wiki/word-cloud-generator-in-r-one-killer-function-to-do-everything-you-need
trump_cloud <- rquery.wordcloud(trump_words, 
    "text", 
    max.words = 300,
    excludeWords = c("going","and",
                    "applause","get",
                    "got","let"))

trump_freq <- trump_cloud$freqTable

# debate in Las Vegas
trump_words_l <- apply(subset(debate_v, person == "TRUMP")['message'],
                    1,
                    paste)
trump_cloud_l <- rquery.wordcloud(trump_words_l, 
    "text", 
    max.words = 300,
    excludeWords = c("going","and",
                    "applause","get",
                    "got","let"))

trump_freq_l <- trump_cloud_l$freqTable

Overall word-cloud

Las Vegas

Shifts in speech

Of course, over these five debates the topics have shifted tremendously, both for the other contenders and for Trump.

For example, let’s see what the most spoken words were by debate…

# using previous data for each debate....
# just the frequency table is kept for each one...
# a bit lazy to do myself!
debate_words_h <- rquery.wordcloud(x = debate_h$message) # ohio, 1st
debate_words_h <- debate_words_h$freqTable %>% mutate("Debate" = "Ohio")
debate_words_c <- rquery.wordcloud(x = debate_c$message) # cali, 2nd
debate_words_c <- debate_words_c$freqTable %>% mutate("Debate" = "California")
debate_words_b <- rquery.wordcloud(x = debate_b$message) # boulder, 3rd
debate_words_b <- debate_words_b$freqTable %>% mutate("Debate" = "Boulder")
debate_words_w <- rquery.wordcloud(x = debate_w$message) # wisc, 4th
debate_words_w <- debate_words_w$freqTable %>% mutate("Debate" = "Wisconsin")
debate_words_v <- rquery.wordcloud(x = debate_v$message) # vegas, 5th
debate_words_v <- debate_words_v$freqTable %>% mutate("Debate" = "LasVegas")

# join all
all_debate_words <- rbind.data.frame(debate_words_h, debate_words_c) 
all_debate_words <- rbind.data.frame(all_debate_words, debate_words_b) 
all_debate_words <- rbind.data.frame(all_debate_words, debate_words_w) 
all_debate_words <- rbind.data.frame(all_debate_words, debate_words_v) 

# graph with some interesting words...
interesting_words <- subset(all_debate_words, word %in% c("government",
                                              "isis","president","senator",
                                              "money", "jobs", "tax", "obama",
                                              "clinton", "america"))

interesting_words$Debate <- factor(interesting_words$Debate, 
                          levels = c("Ohio","California",
                                     "Boulder","Wisconsin",
                                     "LasVegas"))

ggplot(data = interesting_words, 
        aes(x = Debate, 
            y = freq, 
            group = word)) + 
        geom_line(aes(colour = word)) +
        theme_eem() +
        scale_colour_eem(20) + 
        labs(x = "Debate", 
             y = "Frequency", 
             title = "Shifts in speech")

Apparently, “tax” is out: it wasn’t even mentioned in this past debate, in contrast with the increasingly present “isis”. “Clinton” and “obama” are a constant.
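
As a rough sketch of what sits behind that chart, the snippet below counts words per debate using base R only and stacks the tables with a Debate column. It is a simplified stand-in for rquery.wordcloud (no stop-word removal or stemming), and both the count_words helper and the two toy transcripts are made up for illustration.

```r
# Hypothetical helper: word frequencies for one debate, tagged with its name
count_words <- function(text, debate) {
  words <- tolower(unlist(strsplit(text, "[^A-Za-z']+")))
  words <- words[nzchar(words)]          # drop empty strings left by the split
  tab <- table(words)
  data.frame(word = names(tab),
             freq = as.integer(tab),
             Debate = debate,
             stringsAsFactors = FALSE)
}

# Toy transcripts (invented, not the real debate text)
toy_ohio  <- c("We must cut the tax rate.", "Tax reform means jobs.")
toy_vegas <- c("ISIS is the threat.", "We will defeat ISIS.")

# One long table, one row per word and debate
toy_words <- rbind(count_words(toy_ohio, "Ohio"),
                   count_words(toy_vegas, "LasVegas"))

# Keep only the words we want to follow across debates
print(subset(toy_words, word %in% c("tax", "isis")))
```

A word that is never spoken in a debate simply has no row for that debate, which is why a line in the chart can stop instead of dropping to zero.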

Aggregate stats

Now let’s see some aggregate stats by contender.

This function is a bit confusing and/or unnecessary; I’ll probably find a better way to do this in the future…

UnlistAndExtractInfo <- function(candidate){
# this function is not general - it only applies to these particular debates...
# all the debates must be named the same in the parent env.
# for example: debate_h ...

allwords_1 <- tolower(unlist(
              stri_extract_all_words(
              apply(
              subset(debate_h, person == candidate)['message'],
                    1,
                    paste))))
allwords_2 <- tolower(unlist(
              stri_extract_all_words(
              apply(
              subset(debate_c, person == candidate)['message'],
                    1,
                    paste))))
allwords_3 <- tolower(unlist(
              stri_extract_all_words(
              apply(
              subset(debate_b, person == candidate)['message'],
                    1,
                    paste))))
allwords_4 <- tolower(unlist(
              stri_extract_all_words(
              apply(
              subset(debate_w, person == candidate)['message'],
                    1,
                    paste))))
allwords_5 <- tolower(unlist(
              stri_extract_all_words(
              apply(
              subset(debate_v, person == candidate)['message'],
                    1,
                    paste))))
df_insights <- data.frame(
debate = c("Ohio", "California", "Colorado", "Wisconsin","Vegas"),
average_intervention = c(mean(stri_count_words(
                        apply(
                          subset(debate_h, person == candidate)['message'],
                                  1,
                        paste))),
                        mean(stri_count_words(
                        apply(
                          subset(debate_c, person == candidate)['message'],
                                  1,
                        paste))),
                        mean(stri_count_words(
                        apply(
                          subset(debate_b, person == candidate)['message'],
                                  1,
                        paste))),
                        mean(stri_count_words(
                        apply(
                          subset(debate_w, person == candidate)['message'],
                                  1,
                        paste))),
                        mean(stri_count_words(
                        apply(
                          subset(debate_v, person == candidate)['message'],
                                  1,
                        paste)))
                        ),
words_total = c(length(allwords_1),
                length(allwords_2),
                length(allwords_3),
                length(allwords_4),
                length(allwords_5)),
words_unique = c(length(unique(allwords_1)),
                 length(unique(allwords_2)),
                 length(unique(allwords_3)),
                 length(unique(allwords_4)),
                 length(unique(allwords_5))
                 ),
words_repeated_fromfirst = c(0, sum(allwords_2 %in% allwords_1), 
                            sum(allwords_3 %in% allwords_1),
                            sum(allwords_4 %in% allwords_1),
                            sum(allwords_5 %in% allwords_1)),
unique_words_repeated_fromfirst = c(0,
                            length(unique(allwords_2[allwords_2 %in% allwords_1])),
                            length(unique(allwords_3[allwords_3 %in% allwords_1])),
                            length(unique(allwords_4[allwords_4 %in% allwords_1])),
                            length(unique(allwords_5[allwords_5 %in% allwords_1]))
                            ),
words_repeated_fromsecond = c(0, 0, 
                            sum(allwords_3 %in% allwords_2),
                            sum(allwords_4 %in% allwords_2),
                            sum(allwords_5 %in% allwords_2)),
unique_words_repeated_fromsecond = c(0, 0,
                            length(unique(allwords_3[allwords_3 %in% allwords_2])),
                            length(unique(allwords_4[allwords_4 %in% allwords_2])),
                            length(unique(allwords_5[allwords_5 %in% allwords_2]))
                            ),
words_repeated_fromthird = c(0, 0, 0,
                            sum(allwords_4 %in% allwords_3),
                            sum(allwords_5 %in% allwords_3)),
unique_words_repeated_fromthird = c(0, 0, 0,
                            length(unique(allwords_4[allwords_4 %in% allwords_3])),
                            length(unique(allwords_5[allwords_5 %in% allwords_3]))
                            )
, stringsAsFactors = FALSE)
return(df_insights)
}

# going to create a data frame with all the counts from the top candidates...
candidates <- c("TRUMP","CARSON","RUBIO",
                "KASICH","CRUZ","BUSH",
                "FIORINA","PAUL","CHRISTIE")
info <- NULL
info_all <- NULL
for(i in seq_along(candidates)){
  info <- UnlistAndExtractInfo(candidates[i])
  info$CANDIDATE <- candidates[i]
  info_all <- rbind(info_all, info)
}

# i'm going to add a few more columns...
info_all %<>% mutate(carry_over_p1 = unique_words_repeated_fromfirst/words_unique,
                     word_repeat = words_total/words_unique)
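
To make these two derived columns concrete, here is a small worked sketch on made-up word vectors. The toy_* names and the words themselves are assumptions for illustration; the arithmetic follows the same definitions as carry_over_p1 and word_repeat.

```r
# Toy word vectors (invented): one candidate in the first and in a later debate
toy_first <- c("we", "will", "win", "win", "big")
toy_later <- c("we", "will", "build", "and", "win", "win", "win")

toy_total  <- length(toy_later)            # 7 words spoken
toy_unique <- length(unique(toy_later))    # 5 distinct words

# distinct later-debate words already used in the first debate: we, will, win
toy_unique_repeated <- length(unique(toy_later[toy_later %in% toy_first]))

# share of this debate's vocabulary carried over from the first debate
toy_carry_over_p1 <- toy_unique_repeated / toy_unique   # 3 / 5
# average number of times each distinct word is said
toy_word_repeat <- toy_total / toy_unique               # 7 / 5

print(c(carry_over = toy_carry_over_p1, word_repeat = toy_word_repeat))
```

Note that the denominator of the carry-over share is the vocabulary of the later debate, not of the first one.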

Using this information to graph…

# graph of the share of unique words carried over from the first debate
ggplot(order_axis(
  subset(info_all, debate != "Ohio" & CANDIDATE != "CHRISTIE"), # christie didn't go to wisconsin
    CANDIDATE, carry_over_p1), 
       aes(x = CANDIDATE_o, 
           y = carry_over_p1)) + 
  geom_bar(stat = "identity", 
           aes(fill = CANDIDATE_o)) + 
  facet_grid(debate ~.) + 
  theme_eem() +
  scale_fill_eem(20) + 
  labs(title = "Repetition of words by candidate", 
       x = "Candidate", 
       y = "% of unique words repeated from first debate")

As the graph shows, Trump continues to lead in repetitiveness. In the latest debate, 44.8% of the unique words the Donald used had already appeared in his first debate, followed by 38% for Kasich and 36% for Bush.

This is a key metric Trump has been consistently winning…

Again, plotting unique words against total words shows how often each candidate repeats individual words. Mr. Trump sits consistently below the trend: he uses fewer unique words for the same total, so he says each word more often than the average candidate.

On the other hand, Carson and Fiorina tend to have a larger vocabulary.

ggplot(subset(info_all,CANDIDATE != "CHRISTIE"), 
       aes(x = words_total, 
           y = words_unique)) + 
    geom_point(aes(colour = CANDIDATE), size = 3, shape = 2) +
    stat_smooth()+
    theme_eem()+ # uses "eflores89/eem"
    scale_colour_eem(20) + # uses "eflores89/eem"
    labs(title = "Words per Debate",
         x = "Total Words", 
         y = "Unique Words")

Aggregating over the whole gives us a sense of this difference much more clearly:

# average times unique word is repeated...

ggplot(info_all, 
       aes(x = factor(CANDIDATE), 
           y = word_repeat, fill = eem_colors[1])) +
  geom_boxplot() +
  theme_eem()+
  labs(title = "Average repetition of unique words",
       x = "Candidate", 
       y = "Repetitions") + theme(legend.position = "none")

Speed of Intervention

This last debate also widened the gap between Trump and his opponents in how short his interventions are. Every time he talks, he says fewer words than the others, and this was even more apparent in Las Vegas…

# order the debates...
info_all$debate <- factor(info_all$debate, 
                          levels = c("Ohio","California",
                                     "Colorado","Wisconsin",
                                     "Vegas"))

# average length of interventions
ggplot(info_all, 
       aes(x = debate, 
           y = average_intervention, 
           group = CANDIDATE)) + 
  geom_path(aes(colour = CANDIDATE)) + 
  theme_eem() + 
  scale_colour_eem(20) + 
  labs(x = "Debate", 
       y = "Words", 
       title = "Average words per intervention")

This can also be an indication of how popular he is or how many “hits” he’s taking. When you need to counter an argument, sometimes a few words are enough. If you do this consistently more often than the others, your average is bound to go down.
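
A quick sketch of that effect, on invented interventions: the speaker labels and sentences below are made up, and words are counted by splitting on whitespace, which only approximates what stri_count_words does.

```r
# Toy interventions (invented): one speaker gives speeches, the other rebuts
toy_interventions <- list(
  LONG_WINDED = c("Let me explain my plan for the economy in full detail.",
                  "There are three things we need to do right away."),
  QUICK_REPLY = c("Wrong.",
                  "Excuse me, I am talking.",
                  "We will make America great again.")
)

# Words per intervention, approximated by whitespace-separated tokens
count_tokens <- function(x) lengths(strsplit(x, "\\s+"))

# Average words per intervention for each speaker
toy_avg <- sapply(toy_interventions, function(x) mean(count_tokens(x)))
print(toy_avg)
```

A single one-word rebuttal is enough to pull the second speaker's average well below the first speaker's, even though one of the replies is a full sentence.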

The Data

As Alan Jordan suggested, I’ve left this data openly available via GitHub, so anyone can play around with it and find a few more insights. Here is the link.

The all_debates data.frame contains two columns: person and message. It holds all of the debates.
debate_h is the Ohio debate.
debate_c is the California debate.
debate_b is the Boulder debate.
debate_w is the Wisconsin debate.
debate_v is the Las Vegas debate.
The info_all data.frame is the aggregate stats of contenders by debate. It contains word counts, unique word counts, etc.
